ClusterKit's NCCL bandwidth test shows 350 GB/s on a 400G InfiniBand fabric. How should this result be interpreted?
You are expanding a DGX-based deep learning cluster to train on large, high-resolution images that cannot fit into local cache. Multiple nodes will access this data concurrently and require high performance. Which storage and networking solution best meets these requirements?
A financial services firm is deploying an AI model for fraud detection that requires rapid inference and data retrieval across multiple sites. Which feature should their storage system prioritize?
A system administrator needs to install a GPU/DPU in a server. The server has a free PCIe slot, enough free PCIe lanes, and enough room for the card. Which procedure should be followed?
During multi-node HPL burn-in, GPUs show uneven utilization. Which configuration ensures balanced workload distribution?
After Spectrum-X fabric deployment, NCCL tests show intermittent latency spikes. Which network condition most severely impacts East-West bandwidth?
After configuring HA, the administrator runs cmsh status and notices the secondary head node reports mysql [FAIL]. What is the most likely cause?
After updating BlueField-3 DPU BMC firmware via Redfish, the engineer observes "TaskState: Running" but no progress after 15 minutes. How should they track the update's completion status?
An AI training cluster with NVIDIA GPUs experiences prolonged data loading times during checkpoint reloading, causing GPUs to idle frequently. CPU utilization during data transfers remains high. Which solution most effectively optimizes storage-to-GPU throughput while reducing CPU overhead?
Which statement best explains why maintaining high cable signal quality is essential in modern high-speed data centers?
If two ports must be connected, but one is SFP and one is QSFP, for example, to connect a 25 GbE Host Channel Adapter to a QSFP port capable of both 100 GbE and 25 GbE, which solution would best meet this requirement?
A system administrator receives an alert about a potential hardware fault on an NVIDIA DGX A100. The GPU performance seems degraded, and the system fans are operating loudly. What step should be recommended to identify and troubleshoot the hardware fault?
After NCCL burn-in reports "transport retry count exceeded," which corrective action addresses the underlying fabric issue?
An InfiniBand server stops working, and a system administrator runs the "ibstat" command, which provides the following output:
CA 'mlx5_1'
    CA type: MT4115
    Number of ports: 2
    Firmware version: 10.20.1010
    Hardware version: 0
    Node GUID: 0x0002c90300002f78
    System image GUID: 0x0002c90300002f7b
    Port 1:
        State: Initializing
        Physical state: LinkUp
        Rate: 100
        Base lid: 0
        LMC: 0
        SM lid: 0
        Capability mask: 0x0251086a
        Port GUID: 0x0002c90300002f79
        Link layer: InfiniBand
What is the cause of the issue?
During a multi-day NeMo burn-in, intermittent "GPU fell off bus" errors occur. Which diagnostic approach isolates hardware faults?
During server maintenance, a system administrator wants to ensure that the NVIDIA DGX server has sufficient disk space for operational activities. The administrator is scripting an alert system that will notify the team if disk space falls below a threshold. Which command could be included in the maintenance script to check the available disk space on the server?
A cluster administrator is preparing to update the firmware on a DGX H100 system, including the GPU tray (baseboard). What is the correct sequence of steps to perform a safe and successful firmware upgrade?
After a firmware upgrade on a DGX H100, the administrator notices that one GPU is not detected by the system. Which troubleshooting step should be performed first to identify the root cause?
You are leading a project to enhance the energy efficiency of a data center that heavily relies on AI workloads. NVIDIA suggests moving beyond traditional metrics like Power Usage Effectiveness (PUE) to better capture the efficiency of modern data centers. Which strategy should you prioritize?
A team is installing the NVIDIA Run:ai control plane on a Kubernetes cluster. Which two (2) options are most critical to validate before proceeding? (Pick the 2 correct responses below)
During BCM cluster setup, an engineer must configure bonded network interfaces on DGX nodes for high availability. Which cmsh command sequence properly configures a bond0 interface with two physical NICs?
A user encounters "permission denied" errors when running GPU-accelerated containers on a Secure Boot-enabled system. What resolves this?
After ClusterKit reports "GPU-Host latency exceeds threshold," which NVIDIA diagnostic tool should be used to isolate hardware faults?
What is the purpose of using NCCL in verifying East-West fabric in an NVIDIA AI Factory? (Pick the 2 correct responses below)
A System Administrator needs to change the scheduling behavior of a single GPU to use a fixed share scheduler. What command achieves this?
A network engineer is tasked with configuring the management, storage, and compute networks for a new DGX BasePOD deployment. Which statement best describes the network segmentation required for optimal operation?
A single-node stress test fails during the PCIe bandwidth validation phase. Which troubleshooting step is recommended first?
After installing NGC CLI on RHEL, a user runs ngc registry image list but sees no results. The API key and organization are correctly configured. What resolves this?
A team is validating a DGX BasePOD deployment. Using cmsh, they run a command to check GPU health across all nodes. What indicates that the system is ready for AI workloads?
To validate bisection bandwidth across two racks in a Spectrum-X Ethernet fabric, which NCCL test configuration isolates East-West traffic?