NCP-AII NVIDIA AI Infrastructure Questions and Answers

Question # 4

You are tasked with updating both NVIDIA GPU drivers and DOCA drivers on a set of servers used for AI workloads. The environment previously had an older driver stack and custom kernel modules. What is the most important step to successfully upgrade the drivers without causing conflicts?

A.

Update the GPU driver leaving the DOCA and OFED drivers unchanged as long as they are detecting the hardware properly.

B.

Validate the driver version post-install since the fresh install will overwrite the legacy drivers.

C.

Keep the older driver running alongside the new version in case you need to roll back the upgrade.

D.

Uninstall all existing GPU and DOCA-related drivers and associated kernel modules before the new install.

Question # 5

During cluster deployment, the UFM Cable Validation Tool reports "Wrong-neighbor" errors on multiple InfiniBand links. What is the most efficient way to resolve this issue?

A.

Reboot all leaf switches to force LLDP rediscovery.

B.

Replace all affected cables with higher-grade OM5 fiber optics.

C.

Verify LLDP data against topology files and remediate.

D.

Disable FEC on all switches to bypass neighbor validation.
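For context, a "Wrong-neighbor" finding means that a port is cabled to a different peer than the planned topology specifies. The sketch below illustrates that comparison only; the dictionary layout, switch names, and port names are hypothetical and do not reflect the UFM tool's actual file formats.

```python
def find_wrong_neighbors(expected, observed):
    """Return (local_port, planned_peer, actual_peer) for every mismatched link."""
    mismatches = []
    for local_port, planned_peer in sorted(expected.items()):
        actual_peer = observed.get(local_port)
        # A missing observation is a different fault (link down), so skip it here.
        if actual_peer is not None and actual_peer != planned_peer:
            mismatches.append((local_port, planned_peer, actual_peer))
    return mismatches


# Hypothetical planned topology: (switch, port) -> (peer switch, peer port).
expected = {
    ("leaf01", "p1"): ("spine01", "p1"),
    ("leaf01", "p2"): ("spine02", "p1"),
    ("leaf02", "p1"): ("spine01", "p2"),
}

# Hypothetical discovered neighbors: two cables on leaf01 are swapped.
observed = {
    ("leaf01", "p1"): ("spine02", "p1"),
    ("leaf01", "p2"): ("spine01", "p1"),
    ("leaf02", "p1"): ("spine01", "p2"),
}

for local, planned, actual in find_wrong_neighbors(expected, observed):
    print(f"{local}: planned {planned}, found {actual}")
```

A report of this kind points at the specific cables to re-seat, which is why no switch reboot or cable replacement is involved in the comparison itself.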

Question # 6

A financial services firm is deploying an AI model for fraud detection that requires rapid inference and data retrieval across multiple sites. Which feature should their storage system prioritize?

A.

Multi-protocol data access with low latency.

B.

High capacity with moderate speed.

C.

Tape backup systems.

D.

Low-cost HDD solutions.

Question # 7

A cluster administrator is preparing to update the firmware on a DGX H100 system, including the GPU tray (baseboard). What is the correct sequence of steps to perform a safe and successful firmware upgrade?

A.

Update the BMC and skip the GPU tray and motherboard tray updates if the system appears healthy.

B.

Perform a cold reset, stop all GPU activity, update and reboot the BMC, update motherboard and tray components, and verify completion.

C.

Update the GPU tray first, then the motherboard tray, and reboot the BMC after all updates are complete.

D.

Stop all GPU activity, update and reboot the BMC, update motherboard and tray components, perform a cold reset, and verify completion.

Question # 8

A network engineer is tasked with configuring the management, storage, and compute networks for a new DGX BasePOD deployment. Which statement best describes the network segmentation required for optimal operation?

A.

A single VLAN for all types of network traffic.

B.

Two networks: one for management and one for compute.

C.

Four networks: compute, storage, out-of-band, and management.

Question # 9

You are evaluating the integration of NVIDIA BlueField DPUs into your data center's storage architecture to optimize AI workloads. The storage solution chosen has incorporated BlueField DPUs to enhance performance and efficiency. Which of the following benefits directly results from this integration?

A.

Unlimited scalability by adding more DPUs without architectural changes.

B.

Elimination of latency issues in data processing tasks.

C.

Reduced CPU load by offloading data processing tasks to DPUs.

D.

Enhanced I/O performance with NVMe storage access speeds.

Question # 10

During multi-node HPL burn-in, GPUs show uneven utilization. Which configuration ensures balanced workload distribution?

A.

Enable HPL_USE_NVSHMEM=1 for shared memory acceleration

B.

Set HPL_RUN_GEMM_TESTS to skip validation

C.

Set --gpu-affinity and --cpu-affinity to align GPU and NUMA nodes

D.

Set HPL_OOC_TILE_M to 8192 for larger blocks
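As background on affinity, aligning GPUs with NUMA nodes means each process is pinned to CPU cores on the same NUMA node as the GPU it drives. The sketch below computes such a mapping from a made-up topology; the function name, the dictionaries, and the core counts are assumptions for illustration, and on a real system the topology would have to come from a tool such as nvidia-smi topo -m.

```python
def build_rank_bindings(gpu_to_numa, numa_to_cpus):
    """Give each GPU an equal, non-overlapping slice of its own NUMA node's cores."""
    # Count how many GPUs share each NUMA node.
    gpus_per_node = {}
    for node in gpu_to_numa.values():
        gpus_per_node[node] = gpus_per_node.get(node, 0) + 1

    next_slot = {node: 0 for node in numa_to_cpus}
    bindings = []
    for rank, gpu in enumerate(sorted(gpu_to_numa)):
        node = gpu_to_numa[gpu]
        cpus = numa_to_cpus[node]
        share = len(cpus) // gpus_per_node[node]
        start = next_slot[node] * share
        bindings.append(
            {"rank": rank, "gpu": gpu, "numa": node, "cpus": cpus[start:start + share]}
        )
        next_slot[node] += 1
    return bindings


# Hypothetical 4-GPU, 2-NUMA-node host with 8 cores per node.
gpu_to_numa = {0: 0, 1: 0, 2: 1, 3: 1}
numa_to_cpus = {0: list(range(0, 8)), 1: list(range(8, 16))}

bindings = build_rank_bindings(gpu_to_numa, numa_to_cpus)
for b in bindings:
    print(b)
```

When every rank gets the same number of local cores, no GPU waits on a process that is competing for cores or reaching across NUMA nodes, which is one common cause of uneven utilization.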

Question # 11

You are a network administrator responsible for configuring an East-West (E/W) Spectrum-X fabric using SuperNIC. The BlueField-3 devices in your network should be set to NIC mode with RoCE enabled to optimize data flow between servers. You have access to the Spectrum-X management tools and the necessary documentation. You need to use specific configuration commands to achieve this setup. Which of the following steps and commands are necessary to configure the BlueField-3 devices in NIC mode for the E/W Spectrum-X fabric using SuperNIC? (Pick the 2 correct responses below)

A.

Use the command sudo mlxconfig -d /dev/mst/ set LINK_TYPE_P1=2 to enable Ethernet on the BlueField-3 devices.

B.

Use the command sudo mlxconfig -d /dev/mst/ set DISABLE_SPECTRUM_X=1 to reduce overhead.

C.

Use the command sudo mlxconfig -d /dev/mst/ set INTERNAL_CPU_OFFLOAD_ENGINE=1 to configure the SuperNIC to operate in NIC mode.

D.

Use the command sudo mlxconfig -d /dev/mst/ set DPU_MODE=1 to set up the BlueField-3 devices in DPU mode.

Question # 12

You are following the official steps to install the NVIDIA Container Toolkit using a package manager on Ubuntu. After importing the NVIDIA package repository and GPG key, what is the next action?

A.

Reboot the host system to apply the repository changes and proceed.

B.

Install the nvidia-container-toolkit package using your package manager.

C.

Format the disk to clear any existing NVIDIA-related dependencies first.

D.

Download the CUDA toolkit installer from NVIDIA's official website.

Question # 13

When updating the firmware on an NVLink switch transceiver, how can an engineer apply new firmware without interrupting the network?

A.

mlxfwreset -d -lid 27 reset --yes to reset the transceiver

B.

Physically disconnect and reconnect the transceiver.

C.

flint -d -lid 27 --linkx --linkx_auto_update --activate

D.

nv action reboot system to force immediate activation.

Question # 14

After configuring NGC CLI with ngc config set, a user receives "Authentication failed" errors when pulling containers. What step was most likely omitted?

A.

Installing the CLI with apt-get instead of manual extraction.

B.

Entering the API key during ngc config set or storing it in ~/.ngc/config.

C.

Setting --format_type=json to enable API interactions.

D.

Running sudo systemctl restart docker after configuration.

Question # 15

A system administrator needs to configure a BlueField DPU and enable RShim on the baseboard management controller (BMC). Which command should be executed?

A.

ipmitool raw 0x32 0x6a 1

B.

systemctl restart rshim

C.

systemctl enable bmc-rshim.service

D.

scp root@:/dev/rshim0/boot

Question # 16

After ClusterKit reports "GPU-Host latency exceeds threshold," which NVIDIA diagnostic tool should be used to isolate hardware faults?

A.

Re-run ClusterKit with --stress=gpu -Y 60 to extend test duration

B.

nvidia-smi topo -m to inspect GPU topology connections

C.

DCGM diagnostics: dcgmi diag -r 2

D.

ib_write_bw to measure InfiniBand bandwidth between nodes

Question # 17

During a 48-hour NeMo question-answering model burn-in test, GPU memory errors occur when processing large datasets. Which configuration strategy prevents Out-of-Memory (OOM) errors while maintaining processing efficiency?

A.

Set blocksize="1GB" for data loading and enable RMM asynchronous allocation.

B.

Switch from FP16 to FP32 precision for numerical stability.

C.

Disable add_filename for Parquet files to reduce metadata.

D.

Increase files_per_partition to 1000 for larger batch processing.

Question # 18

An InfiniBand administrator needs to run performance benchmarks on new devices added to the fabric. What tool should be used to check the latency?

A.

tcpdump

B.

ib_write_lat

C.

ibdiagnet

D.

perfmon

Question # 19

A system administrator noticed a failure on a DGX H100 server. After a reboot, only the BMC is available. What could be the reason for this behavior?

A.

The network card has no link / connection.

B.

A boot disk has failed.

C.

Multiple GPUs have failed.

D.

There are more than two failed power supplies.

Question # 20

A company has a registered NGC account and their server has NGC CLI installed. What step should be taken first to gain access to NGC?

A.

ngc config get

B.

ngc init

C.

ngc config set

D.

ngc config update

Question # 21

Why is it important to provide a large and high-performance local cache (using SSDs configured as RAID-0) for deep learning workloads on DGX systems?

A.

Local SSD cache allows users to increase the number of NFS threads on the server without impacting storage reliability.

B.

Using local SSD cache in RAID-0 enables direct GPU access to files without host CPU involvement, further boosting performance.

C.

Local SSD cache in RAID-0 is necessary to provide redundancy in case one of the drives fails during long training runs.

D.

A local SSD cache in RAID-0 ensures that most training data is read only once from the network, significantly reducing NFS traffic.
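To make the caching idea concrete, the sketch below models a read-through cache: a file is copied from the "remote" store on first access and served from local storage afterwards. It is a toy illustration only. The class name, the temporary directories standing in for an NFS share and a local SSD volume, and the shard file names are all hypothetical, and it does not represent how DGX software implements its cache.

```python
import shutil
import tempfile
from pathlib import Path


class ReadThroughCache:
    """Serve files from a local directory, fetching from remote only on a miss."""

    def __init__(self, remote_dir, cache_dir):
        self.remote_dir = Path(remote_dir)
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)
        self.remote_reads = 0  # counts transfers from the remote store

    def read(self, name):
        cached = self.cache_dir / name
        if not cached.exists():
            # Cache miss: copy the file over once, then reuse the local copy.
            shutil.copyfile(self.remote_dir / name, cached)
            self.remote_reads += 1
        return cached.read_bytes()


with tempfile.TemporaryDirectory() as remote, tempfile.TemporaryDirectory() as local:
    # Three small stand-in dataset shards on the "remote" side.
    for i in range(3):
        (Path(remote) / f"shard{i}.bin").write_bytes(bytes([i]) * 16)

    cache = ReadThroughCache(remote, local)
    for epoch in range(5):
        for i in range(3):
            cache.read(f"shard{i}.bin")
    remote_reads_after_five_epochs = cache.remote_reads

print(remote_reads_after_five_epochs)  # prints 3: each shard crossed the "network" once
```

Because training revisits the same files every epoch, the remote read count stays at the number of files rather than growing with the number of epochs. Note that RAID-0 stripes for speed and capacity and provides no redundancy, which is acceptable for a cache whose contents can be fetched again.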
