You are setting up a Kubernetes cluster on NVIDIA DGX systems using NVIDIA Base Command Manager (BCM), and you need to initialize the control-plane nodes.
What is the most important step to take before initializing these nodes?
A system administrator is troubleshooting a Docker container that is repeatedly failing to start. They want to gather more detailed information about the issue by generating debugging logs.
Why would generating debugging logs be an important step in resolving this issue?
You are deploying AI applications at the edge and want to ensure they continue running even if one of the servers at an edge location fails.
How can you configure NVIDIA Fleet Command to achieve this?
When troubleshooting Slurm job scheduling issues, a common source of problems is jobs getting stuck in a pending state indefinitely.
Which Slurm command can be used to view detailed information about all pending jobs and identify the cause of the delay?
A data scientist is training a deep learning model and notices slower-than-expected training times. The data scientist asks a system administrator to investigate, and the administrator suspects that disk I/O is the bottleneck.
Which command should the administrator use to check disk I/O?
A DGX H100 system in a cluster is showing performance issues when running jobs.
Which command should be run to generate system logs related to the health report?
You are managing a Kubernetes cluster running AI training jobs using TensorFlow. The jobs require access to multiple GPUs across different nodes, but inter-node communication is slow, which degrades performance.
Which networking configuration could you implement to optimize inter-node communication for distributed training?
You are managing multiple edge AI deployments using NVIDIA Fleet Command. You need to ensure that each AI application running on the same GPU is isolated from others to prevent interference.
Which feature of Fleet Command should you use to achieve this?
A Slurm user is experiencing a frequent issue where a Slurm job is getting stuck in the “PENDING” state and unable to progress to the “RUNNING” state.
Which Slurm command can help the user identify the reason for the job’s pending status?
What is the primary purpose of assigning a provisioning role to a node in NVIDIA Base Command Manager (BCM)?
You are deploying an AI workload on a Kubernetes cluster that requires access to GPUs for training deep learning models. However, the pods are not able to detect the GPUs on the nodes.
What would be the first step to troubleshoot this issue?
You are using BCM to configure an active-passive high availability (HA) cluster for a firewall system.
To ensure seamless failover, what is one best practice for session synchronization between the active and passive nodes?
You have noticed that users can access all GPUs on a node even when they request only one GPU in their job script using --gres=gpu:1. This is causing resource contention and inefficient GPU usage.
What configuration change would you make to restrict users’ access to only their allocated GPUs?
A new researcher needs access to GPU resources but should not have permission to modify cluster settings or manage other users.
What role should you assign them in Run:ai?
What should an administrator check if GPU-to-GPU communication is slow in a distributed system using Magnum IO?
A cloud engineer is looking to deploy a digital fingerprinting pipeline using NVIDIA Morpheus and the NVIDIA AI Enterprise Virtual Machine Image (VMI).
Where would the cloud engineer find the VMI?