A system administrator needs to configure and manage multiple installations of NVIDIA hardware, ranging from a single DGX BasePOD to a DGX SuperPOD.
Which software stack should be used?
Which two (2) platforms should be used with Fabric Manager? (Choose two.)
A system administrator is experiencing issues with Docker containers failing to start due to volume mounting problems. They suspect the issue is related to incorrect file permissions on shared volumes between the host and containers.
How should the administrator troubleshoot this issue?
An administrator wants to check if the BlueMan service can access the DPU.
How can this be done?
Which of the following correctly identifies the key components of a Kubernetes cluster and their roles?
You are configuring networking for a new AI cluster in your data center. The cluster will handle large-scale distributed training jobs that require fast communication between servers.
What type of networking architecture can maximize performance for these AI workloads?
A system administrator needs to optimize the delivery of their AI applications to the edge.
What NVIDIA platform should be used?
Your Kubernetes cluster is running a mixture of AI training and inference workloads. You want to ensure that inference services have higher priority than training jobs during peak resource usage times.
How would you configure Kubernetes to prioritize inference workloads?
A GPU administrator needs to virtualize AI/ML training in an HGX environment.
How can the NVIDIA Fabric Manager be used to meet this demand?
A system administrator wants to run these two commands in Base Command Manager.
main showprofile
device status apc01
What command should the system administrator use from the management node system shell?
If a Magnum IO-enabled application experiences delays during the ETL phase, what troubleshooting step should be taken?
What must be done before installing new versions of DOCA drivers on a BlueField DPU?
What steps should an administrator take if they encounter errors related to RDMA (Remote Direct Memory Access) when using Magnum IO?
You are monitoring the resource utilization of a DGX SuperPOD cluster using NVIDIA Base Command Manager (BCM). The system is experiencing slow performance, and you need to identify the cause.
What is the most effective way to monitor GPU usage across nodes?
A system administrator is troubleshooting a Docker container that is repeatedly failing to start. They want to gather more detailed information about the issue by generating debugging logs.
Why would generating debugging logs be an important step in resolving this issue?
A system administrator notices that jobs are failing intermittently on Base Command Manager due to incorrect GPU configurations in Slurm. The administrator needs to ensure that jobs utilize GPUs correctly.
How should they troubleshoot this issue?
An administrator is troubleshooting issues with NVIDIA GPUDirect Storage and must ensure optimal data transfer performance.
What step should be taken first?
A system administrator is troubleshooting a Docker container that crashes unexpectedly due to a segmentation fault. They want to generate and analyze core dumps to identify the root cause of the crash.
Why would generating core dumps be a critical step in troubleshooting this issue?
What is the primary purpose of assigning a provisioning role to a node in NVIDIA Base Command Manager (BCM)?