To run the PyTorch DDP scripts, at least one NVIDIA GPU is required; two or more are needed to see an actual multi-GPU speedup.
For renting GPUs, RunPod is a recommended provider; it offers the RTX 3090 at $0.22 per hour.
PyTorch 2.4.0 is the recommended version for running the DDP scripts.
Initializing the distributed process group via torch.distributed (init_process_group) and wrapping the model in a DistributedDataParallel container are the essential steps.
On Linux, the nccl backend is used; it is also important to understand terms such as rank (the index of the current process/GPU) and world size (the total number of processes) when reading DDP scripts.
One process is started per GPU, and torch.cuda.set_device(dist.get_rank()) binds each process to its own GPU.
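A minimal setup sketch, assuming one process per GPU on a single machine; the ddp_setup name and the MASTER_ADDR/MASTER_PORT values are illustrative, not part of the original scripts:

```python
import os
import torch
import torch.distributed as dist

def ddp_setup(rank, world_size):
    # Address and port of the rank-0 process; localhost works on a single machine
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "12355"
    # nccl is the backend used for NVIDIA GPUs on Linux
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    # Bind this process to its own GPU
    torch.cuda.set_device(rank)
```

Each worker would call something like ddp_setup(rank, world_size) first, for example when launched via torch.multiprocessing.spawn(main, args=(world_size,), nprocs=world_size).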
The DistributedDataParallel container keeps the model replicas synchronized during training by averaging gradients across ranks in the backward pass.
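Wrapping a model might look like this sketch; NeuralNetwork is a placeholder for any torch.nn.Module, and rank is assumed to come from the setup step above:

```python
from torch.nn.parallel import DistributedDataParallel as DDP

model = NeuralNetwork()              # placeholder model class
model = model.to(rank)               # move the replica to this process's GPU
model = DDP(model, device_ids=[rank])
# During loss.backward(), DDP all-reduces the gradients so every rank
# holds identical parameters after optimizer.step()
```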
A DistributedSampler is used to split the training data evenly across all ranks (GPUs), so each process sees a distinct, non-overlapping shard.
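A sketch of a data loader using DistributedSampler; train_ds is a placeholder Dataset and the batch size is arbitrary:

```python
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

train_loader = DataLoader(
    dataset=train_ds,                      # placeholder Dataset
    batch_size=32,
    shuffle=False,                         # shuffling is delegated to the sampler
    pin_memory=True,
    drop_last=True,
    sampler=DistributedSampler(train_ds),  # each rank gets a distinct shard
)
```

Calling train_loader.sampler.set_epoch(epoch) at the start of each epoch reshuffles the shards so the ordering differs between epochs.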
Loss values are local to each rank; torch.distributed.all_gather_object() can be used to collect the loss from all ranks, for example to compute a global average.
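One way this gathering can be done, as a sketch (local_loss and the averaging step are illustrative):

```python
import torch.distributed as dist

# Each rank holds only its own local loss value
local_loss = loss.item()

# Collect one entry per rank into a list of Python objects
world_size = dist.get_world_size()
gathered = [None] * world_size
dist.all_gather_object(gathered, local_loss)

# Every rank now holds the same list and can average it
mean_loss = sum(gathered) / world_size
```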
Training on multiple GPUs with DDP is effectively the same as increasing the batch size, and gradient accumulation makes it possible to simulate larger batch sizes within the memory limits of the hardware.
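A rough sketch of gradient accumulation on top of the loop pieces above; accumulation_steps, loss_fn, and optimizer are assumed names, not taken from the original scripts:

```python
accumulation_steps = 4  # effective batch size = batch_size * accumulation_steps

for step, (features, labels) in enumerate(train_loader):
    features, labels = features.to(rank), labels.to(rank)
    outputs = model(features)
    loss = loss_fn(outputs, labels)

    # Scale the loss so gradients average over the accumulated mini-batches
    (loss / accumulation_steps).backward()

    # Only update the weights every accumulation_steps mini-batches
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```

Note that DDP still all-reduces gradients on every backward call; the model.no_sync() context manager can be used to skip that communication on the intermediate accumulation steps.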