

NCP-AIO NVIDIA AI Operations Questions and Answers

Question # 4

You are setting up a Kubernetes cluster on NVIDIA DGX systems using BCM, and you need to initialize the control-plane nodes.

What is the most important step to take before initializing these nodes?

A.

Set up a load balancer before initializing any control-plane node.

B.

Disable swap on all control-plane nodes before initializing them.

C.

Ensure that Docker is installed and running on all control-plane nodes.

D.

Configure each control-plane node with its own external IP address.
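Background for this scenario: the kubelet refuses to start (and kubeadm's preflight checks fail) while swap is enabled, so disabling swap is a standard preparation step on every node. A minimal sketch, assuming a systemd-based distribution and a placeholder load-balancer endpoint:

```shell
# Turn off swap immediately (lasts until the next reboot).
sudo swapoff -a

# Comment out swap entries in /etc/fstab so the change persists.
sudo sed -i '/\sswap\s/ s/^/#/' /etc/fstab

# The control-plane initialization's preflight swap check now passes.
sudo kubeadm init --control-plane-endpoint "<load-balancer-dns>:6443"
```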

Question # 5

A system administrator is troubleshooting a Docker container that is repeatedly failing to start. They want to gather more detailed information about the issue by generating debugging logs.

Why would generating debugging logs be an important step in resolving this issue?

A.

Debugging logs disable other logging mechanisms, reducing noise in the output.

B.

Debugging logs provide detailed insights into the Docker daemon's internal operations.

C.

Debugging logs prevent the container from being removed after it stops, allowing for easier inspection.

D.

Debugging logs fix issues related to container performance and resource allocation.
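For context, Docker daemon debug logging is typically enabled through /etc/docker/daemon.json; a minimal sketch using the daemon's documented settings:

```shell
# Enable debug-level logging for the Docker daemon.
cat <<'EOF' | sudo tee /etc/docker/daemon.json
{
  "debug": true,
  "log-level": "debug"
}
EOF

# Restart the daemon, then watch its logs while reproducing the failure.
sudo systemctl restart docker
sudo journalctl -u docker.service -f
```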

Question # 6

You are deploying AI applications at the edge and want to ensure they continue running even if one of the servers at an edge location fails.

How can you configure NVIDIA Fleet Command to achieve this?

A.

Use Secure NFS support for data redundancy.

B.

Set up over-the-air updates to automatically restart failed applications.

C.

Enable high availability for edge clusters.

D.

Configure Fleet Command's multi-instance GPU (MIG) to handle failover.

Question # 7

When troubleshooting Slurm job scheduling issues, a common source of problems is jobs getting stuck in a pending state indefinitely.

Which Slurm command can be used to view detailed information about all pending jobs and identify the cause of the delay?

A.

scontrol

B.

sacct

C.

sinfo
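For reference, diagnosing pending jobs in Slurm usually starts from the queue listing with its reason column, then drills into one job; a sketch (job ID 12345 is a placeholder):

```shell
# List pending jobs; the last column (%R) shows the scheduler's reason.
squeue --states=PENDING --format="%.10i %.9P %.20j %.8u %.10T %R"

# Show full details, including the Reason= field, for a single job.
scontrol show job 12345

# Check for down or drained nodes, a frequent cause of delays.
sinfo -R
```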

Question # 8

A data scientist is training a deep learning model and notices slower-than-expected training times. The data scientist alerts a system administrator, who suspects that disk I/O is the issue.

What command should be used?

A.

tcpdump

B.

iostat

C.

nvidia-smi

D.

htop
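As background, disk I/O pressure is commonly inspected with iostat from the sysstat package; a short sketch:

```shell
# Extended per-device statistics every 2 seconds, 5 samples.
# High %util or long await times point at a disk bottleneck.
iostat -x 2 5
```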

Question # 9

A DGX H100 system in a cluster is showing performance issues when running jobs.

Which command should be run to generate system logs related to the health report?

A.

nvsm show logs --save

B.

nvsm get logs

C.

nvsm dump health

D.

nvsm health --dump-log
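For context, the NVSM CLI on DGX systems can collect a health report together with system logs into a support archive; a sketch (requires root on the DGX node):

```shell
# Collect a health report archive, including system logs.
sudo nvsm dump health

# Optional: inspect overall health status interactively first.
sudo nvsm show health
```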

Question # 10

You are managing a Kubernetes cluster running AI training jobs using TensorFlow. The jobs require access to multiple GPUs across different nodes, but inter-node communication seems slow, impacting performance.

What is a potential networking configuration you would implement to optimize inter-node communication for distributed training?

A.

Increase the number of replicas for each job to reduce the load on individual nodes.

B.

Use standard Ethernet networking with jumbo frames enabled to reduce packet overhead during communication.

C.

Configure a dedicated storage network to handle data transfer between nodes during training.

D.

Use InfiniBand networking between nodes to reduce latency and increase throughput for distributed training jobs.
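As a companion to the scenario above, multi-node training jobs typically verify that NCCL is using the InfiniBand transport via NCCL's documented environment variables; a sketch (the launch command is a placeholder):

```shell
# Print NCCL's transport selection at startup; look for "NET/IB" lines.
export NCCL_DEBUG=INFO

# Ensure the InfiniBand/RoCE transport is not disabled (0 = enabled).
export NCCL_IB_DISABLE=0

# Then launch the distributed training job as usual, e.g.:
# mpirun -np 16 python train.py
```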

Question # 11

You are managing multiple edge AI deployments using NVIDIA Fleet Command. You need to ensure that each AI application running on the same GPU is isolated from others to prevent interference.

Which feature of Fleet Command should you use to achieve this?

A.

Remote Console

B.

Secure NFS support

C.

Multi-Instance GPU (MIG) support

D.

Over-the-air updates

Question # 12

A Slurm user frequently encounters an issue where a job gets stuck in the “PENDING” state and never progresses to the “RUNNING” state.

Which Slurm command can help the user identify the reason for the job’s pending status?

A.

sinfo -R

B.

scontrol show job

C.

sacct -j

D.

squeue -u

Question # 13

What is the primary purpose of assigning a provisioning role to a node in NVIDIA Base Command Manager (BCM)?

A.

To configure the node as a container orchestration manager

B.

To enable the node to monitor GPU utilization across the cluster

C.

To allow the node to manage software images and provision other nodes

D.

To assign the node as a storage manager for certified storage

Question # 14

You are deploying an AI workload on a Kubernetes cluster that requires access to GPUs for training deep learning models. However, the pods are not able to detect the GPUs on the nodes.

What would be the first step to troubleshoot this issue?

A.

Verify that the NVIDIA GPU Operator is installed and running on the cluster.

B.

Ensure that all pods are using the latest version of TensorFlow or PyTorch.

C.

Check if the nodes have sufficient memory allocated for AI workloads.

D.

Increase the number of CPU cores allocated to each pod to ensure better resource utilization.
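For reference, the usual first check is whether the GPU Operator components are healthy and the nodes advertise GPUs as an allocatable resource; a sketch (the gpu-operator namespace is the Helm chart default, and the node name is a placeholder):

```shell
# Confirm the GPU Operator components are running.
kubectl get pods -n gpu-operator

# Verify the node exposes GPUs as an allocatable resource.
kubectl describe node <node-name> | grep -i nvidia.com/gpu
```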

Question # 15

You are using BCM to configure an active-passive high-availability (HA) cluster for a firewall system. To ensure seamless failover, what is one best practice related to session synchronization between the active and passive nodes?

A.

Configure both nodes with different zone names to avoid conflicts during failover.

B.

Use a heartbeat network for session synchronization between the active and passive nodes.

C.

Ensure that both nodes use different firewall models for redundancy.

D.

Set up manual synchronization procedures to transfer session data when needed.

Question # 16

You have noticed that users can access all GPUs on a node even when they request only one GPU in their job script using --gres=gpu:1. This is causing resource contention and inefficient GPU usage.

What configuration change would you make to restrict users’ access to only their allocated GPUs?

A.

Increase the memory allocation per job to limit access to other resources on the node.

B.

Enable cgroup enforcement in cgroup.conf by setting ConstrainDevices=yes.

C.

Set a higher priority for jobs requesting fewer GPUs, so they finish faster and free up resources sooner.

D.

Modify the job script to include additional resource requests for CPU cores alongside GPUs.
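As background for the scenario above, Slurm's device confinement is controlled from cgroup.conf; a minimal fragment using the documented parameter (other settings omitted for brevity):

```
# /etc/slurm/cgroup.conf
# With ConstrainDevices=yes, a job requesting --gres=gpu:1 can only
# open the GPU device file(s) Slurm allocated to it.
ConstrainDevices=yes
```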

Question # 17

A new researcher needs access to GPU resources but should not have permission to modify cluster settings or manage other users.

What role should you assign them in Run:ai?

A.

L1 Researcher

B.

Department Administrator

C.

Application Administrator

D.

Research Manager

Question # 18

What should an administrator check if GPU-to-GPU communication is slow in a distributed system using Magnum IO?

A.

Limit the number of GPUs used in the system to reduce congestion.

B.

Increase the system's RAM capacity to improve communication speed.

C.

Disable InfiniBand to reduce network complexity.

D.

Verify the configuration of NCCL or NVSHMEM.

Question # 19

A cloud engineer is looking to deploy a digital fingerprinting pipeline using NVIDIA Morpheus and the NVIDIA AI Enterprise Virtual Machine Image (VMI).

Where would the cloud engineer find the VMI?

A.

GitHub and Docker Hub

B.

Azure, Google, and Amazon Marketplaces

C.

NVIDIA NGC

D.

Developer Forums
