NCP-AII NVIDIA AI Infrastructure Questions and Answers

Question # 4

ClusterKit’s NCCL bandwidth test shows 350 GB/s on a 400G InfiniBand fabric. How should this result be interpreted?

A.

Critical failure; expected is greater than 390 GB/s for HDR InfiniBand.

B.

Suboptimal performance; requires FEC tuning to reach 380+ GB/s.

C.

Optimal performance, indicating healthy fabric and GPUDirect RDMA.

D.

Inconclusive; rerun with --stress=cpu to validate.
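
For reference when weighing the options, the unit conversion behind these figures can be sketched as follows. The sketch assumes that "400G" means 400 Gb/s per port and that the 350 GB/s figure is an aggregate over eight such ports per node. The port count is an assumption for illustration; the question does not state it.

```python
# Convert a per-port line rate from Gb/s to GB/s and compare a measured
# aggregate against the theoretical ceiling.
# ASSUMPTION: 8 ports per node (hypothetical; not stated in the question).


def line_rate_gbytes(gbits_per_s: float) -> float:
    """Convert Gb/s to GB/s (8 bits per byte)."""
    return gbits_per_s / 8.0


def efficiency(measured_gbytes: float, ports: int, port_gbits: float) -> float:
    """Return the measured aggregate as a fraction of the theoretical one."""
    ceiling = ports * line_rate_gbytes(port_gbits)
    return measured_gbytes / ceiling


if __name__ == "__main__":
    per_port = line_rate_gbytes(400)
    ratio = efficiency(350, ports=8, port_gbits=400)
    print(f"per-port ceiling: {per_port:.0f} GB/s")
    print(f"measured / theoretical: {ratio:.1%}")
```

Under those assumptions a single port tops out at 50 GB/s, so the ratio is only meaningful once the real number of ports behind the measurement is known.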

Question # 5

You are expanding a DGX-based deep learning cluster to train on large, high-resolution images that cannot fit into local cache. Multiple nodes will access this data concurrently and require high performance. Which storage and networking solution best meets these requirements?

A.

Increase the SSD RAID-0 local cache size in each node so it can absorb most training data, making network storage type and speed less important for performance.

B.

Implement a standard NFS server on a 10GbE network because the cluster can access the export and job performance will not be impacted.

C.

Deploy a high-performance parallel file system across InfiniBand or 40/100GbE, ensuring at least 3 GB/s per node and scalable aggregate bandwidth for all cluster workloads.

D.

Recommend general-purpose object storage for all training data because it is optimized for deep learning workloads and distributed data access at any scale.

Question # 6

A financial services firm is deploying an AI model for fraud detection that requires rapid inference and data retrieval across multiple sites. Which feature should their storage system prioritize?

A.

Multi-protocol data access with low latency.

B.

Tape backup systems.

C.

Low-cost HDD solutions.

D.

High capacity with moderate speed.

Question # 7

A system administrator needs to install a GPU/DPU in a server. The server has a free PCI-e slot, there are enough free PCI-e lanes, and there is enough room for the card. Which procedure should be followed?

A.

Ensure the server has enough power. Verify compatibility of cables with the server's platform. Make sure the server is down to remove cables safely. Do not wear an ESD bracelet.

B.

Ensure the server has enough power. Make sure the server is down to remove cables safely. Wear an ESD bracelet.

C.

Ensure the server has enough power. Make sure the server is up and running with attached cables. Wear an ESD bracelet.

D.

Ensure the server has enough power. Verify compatibility of cables with the server's platform. Make sure the server is down to remove cables safely. Wear an ESD bracelet.

Question # 8

During multi-node HPL burn-in, GPUs show uneven utilization. Which configuration ensures balanced workload distribution?

A.

Enable HPL_USE_NVSHMEM=1 for shared memory acceleration

B.

Set HPL_RUN_GEMM_TESTS to skip validation

C.

Set --gpu-affinity and --cpu-affinity to align GPU and NUMA nodes

D.

Set HPL_OOC_TILE_M to 8192 for larger blocks

Question # 9

After Spectrum-X fabric deployment, NCCL tests show intermittent latency spikes. Which network condition most severely impacts East-West bandwidth?

A.

Multiple transceiver firmware mismatches.

B.

400G port utilization at 70% on several nodes during tests.

C.

Jitter below 5 ps with consistent latency.

D.

Packet loss greater than 0.001% causing NCCL pipeline stalls.

Question # 10

After configuring HA, the administrator runs cmsh status and notices the secondary head node reports mysql [FAIL]. What is the most likely cause?

A.

The BCM license expired after HA configuration.

B.

Network connectivity issues between the primary and secondary head nodes.

C.

The secondary head node lacks NVIDIA GPU drivers.

D.

The cluster nodes are powered on during the HA configuration.

Question # 11

After updating BlueField-3 DPU BMC firmware via Redfish, the engineer observes “TaskState: Running” but no progress after 15 minutes. How should they track the update’s completion status?

A.

Check /var/log/messages on the DPU operating system for update logs.

B.

Query the DPU BMC with the Task ID of the installation process.

C.

Power cycle the DPU immediately to force a rollback.

D.

Run bfrec --status on the DPU to view flash progress.
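
Where a Redfish Task ID is available, polling the task resource might look like the sketch below. It assumes the BMC exposes the standard Redfish path `/redfish/v1/TaskService/Tasks/<id>` and reports `TaskState` (and possibly `PercentComplete`). The helper names, polling interval, and poll limit are hypothetical, and authentication and TLS handling are left out.

```python
import json
import time
import urllib.request
from typing import Callable, Dict

# Values of the Redfish "TaskState" property after which a task no longer changes.
TERMINAL_STATES = {"Completed", "Exception", "Killed", "Cancelled"}


def fetch_task(base_url: str, task_id: str) -> Dict:
    """GET the task resource (authentication and TLS setup omitted)."""
    url = f"{base_url}/redfish/v1/TaskService/Tasks/{task_id}"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)


def wait_for_task(
    get_task: Callable[[], Dict],
    interval_s: float = 10.0,  # hypothetical polling interval
    max_polls: int = 60,       # hypothetical upper bound on polls
) -> Dict:
    """Poll until the task reaches a terminal state, then return it."""
    for _ in range(max_polls):
        task = get_task()
        state = task.get("TaskState")
        print(f"TaskState={state} PercentComplete={task.get('PercentComplete')}")
        if state in TERMINAL_STATES:
            return task
        time.sleep(interval_s)
    raise TimeoutError("task did not reach a terminal state")
```

A call such as `wait_for_task(lambda: fetch_task("https://<bmc-address>", "<task-id>"))` would then print one status line per poll; both placeholders must be filled in from the actual update response.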

Question # 12

An AI training cluster with NVIDIA GPUs experiences prolonged data loading times during checkpoint reloading, causing GPUs to idle frequently. CPU utilization during data transfers remains high. Which solution most effectively optimizes storage-to-GPU throughput while reducing CPU overhead?

A.

Increase batch sizes to reduce the frequency of storage access.

B.

Migrate datasets to SATA SSDs with RAID 0 for higher sequential read speeds.

C.

Add more GPUs to the cluster to parallelize data loading tasks.

D.

Implement GPUDirect Storage to enable direct data transfers.

Question # 13

Which statement best explains why maintaining high cable signal quality is essential in modern high-speed data centers?

A.

High cable signal quality ensures that cable length and connector type do not play as big a role in deploying new infrastructure in the data center.

B.

High cable signal quality minimizes bit error rates and supports reliable, high-throughput communication, reducing retransmissions and congestion across the network.

C.

High cable signal quality reduces electromagnetic interference (EMI) and crosstalk, helping prevent unexpected packet drops during sustained workloads.

D.

High cable signal quality enables effective use of Forward Error Correction (FEC), which is required for reliable operation at high data rates such as 200GbE and above.

Question # 14

If two ports must be connected, but one is SFP and one is QSFP, for example, to connect a 25 GbE Host Channel Adapter to a QSFP port capable of both 100 GbE and 25 GbE, which solution would best meet this requirement?

A.

QSA adapter.

B.

SFP connectors.

C.

SFP-to-1G BASE-T RJ45 adapter.

D.

Standard QSFP-to-QSFP DAC cable.

Question # 15

If two ports must be connected, but one is SFP and one is QSFP, for example, to connect a 25 GbE Host Channel Adapter to a QSFP port capable of both 100 GbE and 25 GbE, which of the following solutions would best meet this requirement?

A.

SFP Connectors

B.

SFP to 1G BASE-T (RJ45) adapter

C.

QSA Adapter

Question # 16

A system administrator receives an alert about a potential hardware fault on an NVIDIA DGX A100. The GPU performance seems degraded, and the system fans are operating loudly. What step should be recommended to identify and troubleshoot the hardware fault?

A.

Run a deep learning workload to stress test the GPUs and check whether the issue persists.

B.

Check the NVIDIA System Management Interface (nvidia-smi) for GPU status and temperatures.

C.

Power drain then restart the DGX and check if the performance degradation resolves.

D.

Increase the fan speed to maximum and check whether the performance improves.

Question # 17

After NCCL burn-in reports "transport retry count exceeded," which corrective action addresses the underlying fabric issue?

A.

Switch from Ring to Tree algorithms via NCCL_ALGO=TREE

B.

Reduce message size to decrease network utilization

C.

Increase NCCL_IB_TIMEOUT to tolerate longer latencies

D.

Inspect InfiniBand link quality metrics (BER, symbol errors) and replace faulty cables

Question # 18

An InfiniBand server stops working, and a system administrator runs the "ibstat" command, which provides the following output:

CA 'mlx5_1'
    CA type: MT4115
    Number of ports: 2
    Firmware version: 10.20.1010
    Hardware version: 0
    Node GUID: 0x0002c90300002f78
    System image GUID: 0x0002c90300002f7b
    Port 1:
        State: Initializing
        Physical state: Linkup
        Rate: 100
        Base lid: 0
        LMC: 0
        SM lid: 0
        Capability mask: 0x0251086a
        Port GUID: 0x0002c90300002f79
        Link layer: InfiniBand

What is the cause of the issue?

A.

The HCA port is faulty.

B.

There is no running SM in the fabric.

C.

The neighboring switch port is faulty.

D.

The cable is disconnected.
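
When reading output like this, it can help to pull out the handful of fields that describe the port's physical and logical status. The following is a minimal sketch that parses an abbreviated copy of the output above; the function names are hypothetical, and the parser handles only a single port (a second port's fields would overwrite the first).

```python
import re
from typing import Dict

# Abbreviated copy of the ibstat output shown in the question.
SAMPLE = """\
CA 'mlx5_1'
    Port 1:
        State: Initializing
        Physical state: Linkup
        Rate: 100
        Base lid: 0
        SM lid: 0
        Link layer: InfiniBand
"""


def parse_port_fields(text: str) -> Dict[str, str]:
    """Collect 'key: value' pairs; lines without a value are skipped."""
    fields: Dict[str, str] = {}
    for line in text.splitlines():
        match = re.match(r"\s*([^:]+):\s*(.+)$", line)
        if match:
            fields[match.group(1).strip()] = match.group(2).strip()
    return fields


def summarize(fields: Dict[str, str]) -> Dict[str, bool]:
    """Report physical link, logical state, and LID assignment separately."""
    return {
        "physical_link_up": fields.get("Physical state", "").lower() == "linkup",
        "logically_active": fields.get("State") == "Active",
        "lid_assigned": fields.get("Base lid", "0") != "0",
        "sm_lid_known": fields.get("SM lid", "0") != "0",
    }


if __name__ == "__main__":
    for name, value in summarize(parse_port_fields(SAMPLE)).items():
        print(f"{name}: {value}")
```

Keeping the physical and logical checks separate makes it easier to tell a cabling or hardware problem apart from a fabric-management problem.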

Question # 19

During a multi-day NeMo burn-in, intermittent "GPU fell off bus" errors occur. Which diagnostic approach isolates hardware faults?

A.

Enable HPL_USE_NVSHMEM for alternative memory sharing.

B.

Run DCGM diagnostics alongside burn-in to monitor GPU health metrics.

C.

Switch from BERT to GPT models for simpler computations.

D.

Reduce blocksize to 500MB to lower memory pressure.

Question # 20

During server maintenance, a system administrator wants to ensure that the NVIDIA DGX server has sufficient disk space for operational activities. The administrator is scripting an alert system that will notify the team if disk space falls below a threshold. Which command could be included in the maintenance script to check the available disk space on the server?

A.

nvidia-smi --query-disk-space

B.

du -sh /home/*

C.

df -h | grep '/var'

D.

lsof +L1
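
The threshold check that such an alert script performs can be sketched in Python using only the standard library. This is an illustration of the alerting logic, not a replacement for any of the commands listed; the path and the 10% threshold are hypothetical values.

```python
import shutil


def free_fraction(path: str = "/") -> float:
    """Return the free space on the filesystem holding `path` as a fraction."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total


def below_threshold(path: str, min_free_fraction: float) -> bool:
    """True when free space has dropped under the configured fraction."""
    return free_fraction(path) < min_free_fraction


if __name__ == "__main__":
    path, threshold = "/", 0.10  # hypothetical values for illustration
    if below_threshold(path, threshold):
        print(f"ALERT: {path} has less than {threshold:.0%} free space")
    else:
        print(f"OK: {path} has {free_fraction(path):.0%} free space")
```

A scheduler such as cron could run the script periodically and forward the ALERT line to whatever notification channel the team uses.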

Question # 21

A cluster administrator is preparing to update the firmware on a DGX H100 system, including the GPU tray (baseboard). What is the correct sequence of steps to perform a safe and successful firmware upgrade?

A.

Update the BMC and skip the GPU tray and motherboard tray updates if the system appears healthy.

B.

Perform a cold reset, stop all GPU activity, update and reboot the BMC, update motherboard and tray components, and verify completion.

C.

Update the GPU tray first, then the motherboard tray, and reboot the BMC after all updates are complete.

D.

Stop all GPU activity, update and reboot the BMC, update motherboard and tray components, perform a cold reset, and verify completion.

Question # 22

After a firmware upgrade on a DGX H100, the administrator notices that one GPU is not detected by the system. Which troubleshooting step should be performed first to identify the root cause?

A.

Review firmware update logs and run nvsm show health to check for hardware or firmware errors on the affected GPU.

B.

Remove the GPU from the system and replace it with a new one before any diagnostics.

C.

Ignore the issue and proceed with production workloads if the other GPUs are operational.

D.

Immediately re-run the firmware upgrade on all system components.

Question # 23

You are leading a project to enhance the energy efficiency of a data center that heavily relies on AI workloads. NVIDIA suggests moving beyond traditional metrics like Power Usage Effectiveness (PUE) to better capture the efficiency of modern data centers. Which strategy should you prioritize?

A.

Use Power Usage Effectiveness as the primary metric while supplementing it with additional measures of useful work done per unit of energy.

B.

Use watts used as the primary measure of efficiency, as it accurately reflects the power input at any given time.

C.

Develop benchmarks tailored to specific workloads, such as MLPerf for AI applications, to better understand energy use in real-world scenarios.

D.

Focus on integrating kilowatt-hours into existing metrics to better reflect the actual energy used for productive work.

Question # 24

A team is installing the NVIDIA Run:ai control plane on a Kubernetes cluster. Which two (2) options are most critical to validate before proceeding? (Pick the 2 correct responses below)

A.

Helm is installed on the installer machine.

B.

Ensure Kubernetes is running on the cluster.

C.

All cluster nodes have NVIDIA GPUs installed.

D.

NTP is disabled to simplify time synchronization.

Question # 25

What command is needed to measure BER (Bit Error Rate)?

A.

mlxconfig -d <device> q

B.

ethtool -S <device>

C.

mlxlink -d <device> -c -e

D.

mstflint -d <device> q full
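
As background for this question, BER is the number of erroneous bits divided by the total number of bits transferred. The sketch below works through that arithmetic; the error count, link rate, and duration are hypothetical numbers, not output from any of the tools listed.

```python
def bit_error_rate(bit_errors: int, rate_gbps: float, seconds: float) -> float:
    """Return errors divided by total bits sent at `rate_gbps` for `seconds`."""
    total_bits = rate_gbps * 1e9 * seconds
    if total_bits <= 0:
        raise ValueError("rate and duration must be positive")
    return bit_errors / total_bits


if __name__ == "__main__":
    # Hypothetical example: 4 bit errors in 10 s on a 100 Gb/s link.
    ber = bit_error_rate(bit_errors=4, rate_gbps=100, seconds=10)
    print(f"BER = {ber:.1e}")
```

Because the denominator grows with both link speed and observation time, a longer measurement window is needed to resolve very low error rates.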

Question # 26

During BCM cluster setup, an engineer must configure bonded network interfaces on DGX nodes for high availability. Which cmsh command sequence properly configures a bond0 interface with two physical NICs?

A.

device use dgx001 ; interfaces add vlan vlan100 ; set parent bond0 ; set mode 1 ; set network internalnet

B.

device use dgx001 ; interfaces add bond bond0 ; append interfaces enp225s0f1np1 enp97s0f1np1 ; set mode 1 ; set network internalnet

C.

device use dgx001 ; interfaces set enp225s0f1np1 network internalnet ; interfaces set enp97s0f1np1 network internalnet

D.

device use dgx001 ; interfaces delete enp225s0f1np1 ; interfaces delete enp97s0f1np1

Question # 27

ClusterKit's NCCL bandwidth test shows 350 GB/s on a 400G InfiniBand fabric. How should this result be interpreted?

A.

Optimal performance, indicating healthy fabric and GPUDirect RDMA.

B.

Suboptimal performance; requires FEC tuning to reach 380+ GB/s.

C.

Critical failure; expected is > 390 GB/s for HDR InfiniBand.

D.

Inconclusive; rerun with --stress=cpu to validate.

Question # 28

A user encounters "permission denied" errors when running GPU-accelerated containers on a Secure Boot-enabled system. What resolves this?

A.

Enroll the MOK and sign NVIDIA kernel modules.

B.

Reinstall Docker without the NVIDIA runtime.

C.

Disable SELinux to relax unnecessary security policies.

D.

Run Docker with sudo for elevated privileges.

Question # 29

After ClusterKit reports "GPU-Host latency exceeds threshold," which NVIDIA diagnostic tool should be used to isolate hardware faults?

A.

Re-run ClusterKit with --stress=gpu -Y 60 to extend test duration

B.

nvidia-smi topo -m to inspect GPU topology connections

C.

DCGM diagnostics: dcgmi diag -r 2

D.

ib_write_bw to measure InfiniBand bandwidth between nodes

Question # 30

What is the purpose of using NCCL in verifying East-West fabric in an NVIDIA AI Factory?

Pick the 2 correct responses below.

A.

To measure the storage network performance.

B.

To measure the latency between GPUs.

C.

To measure the power consumption of GPUs.

D.

To measure bandwidth between GPUs.

Question # 31

A System Administrator needs to change the scheduling behavior of a single GPU to use a fixed share scheduler. What command achieves this?

A.

esxcli system module parameters set -m nvidia -p

B.

esxcli -i 0 -mig 18

C.

nvidia-smi -i 0 -mig 1

D.

mlxconfig -d /dev/mst/mt4123_pciconf0 set LINK_TYPE_P1=2

Question # 32

A network engineer is tasked with configuring the management, storage, and compute networks for a new DGX BasePOD deployment. Which statement best describes the network segmentation required for optimal operation?

A.

A single VLAN for all types of network traffic.

B.

Two networks: one for management and one for compute.

C.

Four networks: compute, storage, out-of-band, and management.

Question # 33

A single-node stress test fails during the PCIe bandwidth validation phase. Which troubleshooting step is recommended first?

A.

Reduce PCIe Gen4 speed to Gen3 speed in BIOS settings.

B.

Reseat the GPU, then rerun the test.

C.

Disable NVLink in BIOS to isolate PCIe performance.

D.

Reinstall NVIDIA drivers using apt-get install nvidia-driver-550.

Question # 34

After installing NGC CLI on RHEL, a user runs ngc registry image list but sees no results. The API key and organization are correctly configured. What resolves this?

A.

Disable SELinux to eliminate unnecessary security restrictions.

B.

Run ngc config set --team team-name to specify a team.

C.

Reinstall the CLI using the yum command instead of manual installation.

D.

Ensure the user's NGC account has REGISTRY_READ permissions for the organization.

Question # 35

A team is validating a DGX BasePOD deployment. Using cmsh, they run a command to check GPU health across all nodes. What indicates that the system is ready for AI workloads?

A.

The command output is ignored if the system powers on without errors.

B.

At least half of the GPUs report Status_Health = OK.

C.

All GPUs report Status_Health = OK and Health = OK for each device.

D.

Only the head node's GPUs need to be healthy.

Question # 36

To validate bisectional bandwidth across two racks in a Spectrum-X Ethernet fabric, which NCCL test configuration isolates East-West traffic?

A.

NCCL_TESTS_SPLIT="OR 0x7" ./all_reduce_perf -g 8

B.

Run without splits and analyze per-rack averages.

C.

NCCL_TESTS_SPLIT="MOD 2" ./all_reduce_perf -g 8

D.

NCCL_TESTS_SPLIT="DIV 8" ./all_reduce_perf -g 1
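
To see how the split expressions in these options would partition ranks, the grouping can be modeled in a few lines. This sketch assumes, based on a reading of the nccl-tests documentation, that the value has the form "<operation> <value>" and is applied to each global rank to compute a group "color", with ranks of the same color running one operation together. The 16-rank layout is hypothetical, and how ranks map onto racks depends on the launcher's placement.

```python
from collections import defaultdict
from typing import Dict, List


def split_color(rank: int, spec: str) -> int:
    """Apply an '<operation> <value>' split expression to one rank."""
    op, value = spec.split()
    operand = int(value, 0)  # base 0 accepts decimal and 0x-prefixed hex
    if op == "AND":
        return rank & operand
    if op == "OR":
        return rank | operand
    if op == "MOD":
        return rank % operand
    if op == "DIV":
        return rank // operand
    raise ValueError(f"unsupported operation: {op}")


def group_ranks(num_ranks: int, spec: str) -> Dict[int, List[int]]:
    """Group ranks 0..num_ranks-1 by the color the expression assigns."""
    groups: Dict[int, List[int]] = defaultdict(list)
    for rank in range(num_ranks):
        groups[split_color(rank, spec)].append(rank)
    return dict(groups)


if __name__ == "__main__":
    # Hypothetical layout: 16 ranks, e.g. two nodes with 8 GPUs each.
    for spec in ("OR 0x7", "MOD 2", "DIV 8"):
        print(spec, group_ranks(16, spec))
```

Printing the groups for a planned rank layout shows at a glance whether a given expression keeps traffic inside a node or forces it across the fabric.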
