
Understanding the Host-Device Paradigm in Multi-GPU AI Computing

This article is part of a series on distributed AI across multiple GPUs. It introduces the core concepts of how a CPU (the host) and a GPU (the device) work together, focusing on NVIDIA GPUs, which are widely used in AI workloads. Integrated GPUs, such as those in Apple Silicon, are not covered here.

The Key Idea: Host and Device

Your program begins on the CPU. When you need the GPU to perform a task, such as a matrix multiplication, the CPU sends the data and instructions to the GPU. This interaction is central to GPU computing.

How the CPU Talks to the GPU

The CPU communicates with the GPU through a queuing system. When your code runs a command like tensor.to('cuda'), the CPU doesn't wait. Instead, it adds the command to a queue called a CUDA Stream, which lets the CPU continue executing the next line of code immediately. This is asynchronous execution: while the GPU works on one task, the CPU can prepare the next batch of data, improving overall efficiency.

CUDA Streams: Managing GPU Work

A CUDA Stream is an ordered queue of GPU operations. Operations within a single stream run one after another, but multiple streams can run concurrently, enabling the GPU to handle several tasks at once.

By default, PyTorch issues all operations on the default stream. This ensures operations run in sequence and is simple to use, but it limits performance because each operation waits for the previous one to finish. Using multiple streams allows you to overlap computation and data transfer. For example, while the GPU processes one batch, the CPU can copy the next batch from system memory to GPU memory.
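The asynchronous hand-off described above can be observed directly: a kernel launch returns to the CPU almost immediately, while torch.cuda.synchronize() blocks until the queued work is done. A minimal timing sketch, assuming a CUDA device is available (the tensor sizes are illustrative):

```python
import time
import torch

# Sketch of asynchronous kernel launch (assumes a CUDA device).
# The matmul call returns almost immediately because the CPU only
# enqueues the work on the default stream; synchronize() then blocks
# until the GPU has actually finished.
if torch.cuda.is_available():
    a = torch.randn(2048, 2048, device='cuda')
    b = torch.randn(2048, 2048, device='cuda')
    torch.cuda.synchronize()  # finish setup work before timing

    start = time.perf_counter()
    c = a @ b                           # enqueued; the CPU does not wait
    launch = time.perf_counter() - start

    torch.cuda.synchronize()            # wait for the GPU to finish
    total = time.perf_counter() - start
    print(f"launch: {launch * 1e3:.3f} ms, launch + compute: {total * 1e3:.3f} ms")
else:
    print("CUDA not available; CPU tensor operations run synchronously.")
```

On typical hardware the launch time is a small fraction of the total, confirming that control returns to the CPU well before the GPU is done.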
In PyTorch, you can create and manage streams using context managers:

```python
compute_stream = torch.cuda.Stream()
transfer_stream = torch.cuda.Stream()

with torch.cuda.stream(transfer_stream):
    next_batch = next_batch_cpu.to('cuda', non_blocking=True)

with torch.cuda.stream(compute_stream):
    output = model(current_batch)
```

The non_blocking=True flag is crucial: it lets the CPU continue without waiting for the data transfer to finish.

Synchronizing Streams

Because streams are independent, you must explicitly manage dependencies between them. The most basic method is torch.cuda.synchronize(), which waits for all streams to complete, but this blocks the CPU and is inefficient. A better approach uses CUDA Events: an event marks a point in one stream, and another stream can wait on it without halting the CPU.

```python
event = torch.cuda.Event()

with torch.cuda.stream(transfer_stream):
    next_batch = next_batch_cpu.to('cuda', non_blocking=True)
    event.record()

with torch.cuda.stream(compute_stream):
    compute_stream.wait_event(event)
    output = model(next_batch)
```

This way, only the compute stream waits on the transfer, and the CPU stays free to queue more work.

Understanding Streams Helps You Debug

While most PyTorch training code doesn't require manual stream management, features like DataLoader(pin_memory=True) and prefetching rely on this mechanism. Knowing how streams work helps you spot and fix performance issues.

PyTorch Tensors: Host vs. Device

A PyTorch tensor consists of metadata (shape, dtype) and data. The metadata lives in the CPU's RAM, while the data of a GPU tensor is stored in the GPU's VRAM. When you run print(t.shape), the CPU can access the shape immediately. But print(t) requires the tensor's data, which may not yet be ready on the GPU. This forces a host-device synchronization, a major performance bottleneck: the CPU must wait for the GPU to finish and copy the data back to RAM before printing. This blocking behavior slows down execution.

Efficiency Tip: torch.randn(100, 100, device=device) creates the tensor directly on the GPU.
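As a rough illustration of this tip, the two creation paths can be timed side by side. This is a sketch, assuming a CUDA device; the sizes are illustrative, and on a CPU-only machine both paths behave identically:

```python
import time
import torch

# Sketch comparing direct-on-device creation with create-then-transfer.
# Assumes a CUDA device; on CPU-only builds both tensors stay on the CPU.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

start = time.perf_counter()
a = torch.randn(1000, 1000, device=device)  # allocated directly on the device
if device.type == 'cuda':
    torch.cuda.synchronize()
direct = time.perf_counter() - start

start = time.perf_counter()
b = torch.randn(1000, 1000).to(device)      # CPU allocation, then a copy
if device.type == 'cuda':
    torch.cuda.synchronize()
via_host = time.perf_counter() - start

print(f"direct: {direct * 1e3:.2f} ms, via host: {via_host * 1e3:.2f} ms")
```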
torch.randn(100, 100).to(device) creates it on the CPU first and then transfers it, which is less efficient. Minimizing synchronization keeps both host and device busy.

Scaling to Multiple GPUs: Ranks

Training large models like LLMs often requires multiple GPUs, which is where distributed computing comes in. Each GPU runs its own process, known as a Rank. On a single machine with multiple GPUs, each Rank runs independently and shares no memory with the others: Rank 0 uses cuda:0, Rank 1 uses cuda:1. All ranks run the same code but can be assigned different tasks, such as processing different chunks of the data, and the rank ID is used to coordinate the work. This setup enables data parallelism and is the foundation for scaling AI training across multiple devices.

In the next post, we'll explore the point-to-point and collective operations that allow multiple GPUs to work together efficiently when training large models. You now understand the core host-device model, streams, synchronization, tensor placement, and the role of ranks in distributed AI. This mental model is essential for building fast, scalable AI systems.
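To make the rank setup concrete, here is a minimal sketch of per-process initialization, assuming the processes are launched with torchrun, which sets the RANK, LOCAL_RANK, and WORLD_SIZE environment variables for each process (the setup function name is illustrative):

```python
import os
import torch
import torch.distributed as dist

def setup():
    # torchrun sets these variables for each spawned process (rank);
    # the defaults let this sketch also run as a single process.
    rank = int(os.environ.get("RANK", 0))
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    world_size = int(os.environ.get("WORLD_SIZE", 1))

    if world_size > 1:
        # Each rank binds to its own GPU: rank 0 -> cuda:0, rank 1 -> cuda:1.
        torch.cuda.set_device(local_rank)
        dist.init_process_group(backend="nccl")

    device = torch.device(f"cuda:{local_rank}" if torch.cuda.is_available() else "cpu")
    return rank, world_size, device

rank, world_size, device = setup()
print(f"rank {rank} of {world_size} using {device}")
```

Each process would then move its model and its shard of the data to device, using rank to decide which chunk of the data it is responsible for.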
