Original post link

So I never really found anyone posting conclusive evidence of the speedup that can be gained from using NVLink on RTX 3090 GPUs. The general consensus is that it mostly helps when training a model that spans two GPUs with methods such as DeepSpeed ZeRO or FSDP, but no one really posted the actual gains they measured with and without NVLink. So I am here to show what I found on this subject.

My training rig consists of 2x MSI RTX 3090 Ti Suprim X 24GB NVLinked together on an Asus Rampage V Edition 10 with a Xeon 2679 v4 and 256GB of RAM. The important thing about the platform is that the RAM runs at DDR4-2424 (101MHz BCLK) with extremely fine-tuned subtimings; memory bandwidth ends up at about 75GB/s with 68ns latency in AIDA64.

My Ultimate Dual RTX 3090 Ti LLM Dev PC:

This means that even without NVLink and without P2P communication between the GPUs over PCIe, the system memory is fast enough not to bottleneck GPU-to-GPU transfers done via DMA through the PCIe 3.0 x16 slots. Having PCIe 3.0 x16 to both GPUs also means this platform gives each GPU the same bandwidth as modern platforms running each GPU at PCIe 4.0 x8.
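As a quick sanity check on that claim (back-of-the-envelope arithmetic, ignoring protocol overhead beyond the 128b/130b encoding), the theoretical per-direction link bandwidth works out to about 15.75GB/s in both cases:

# GB/s = GT/s per lane * 128/130 encoding * lanes / 8 bits per byte
echo "scale=2; 8 * 128/130 * 16 / 8" | bc     # PCIe 3.0 x16 -> ~15.75 GB/s per direction
echo "scale=2; 16 * 128/130 * 8 / 8" | bc     # PCIe 4.0 x8  -> ~15.75 GB/s per direction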

However, there is also the modded Nvidia Linux driver that theoretically enables P2P communication, as seen in this repo: tinygrad/open-gpu-kernel-modules: NVIDIA Linux open GPU with P2P support (github.com)

I couldn’t get this to yield any kind of improvement on my setup though. Not sure what’s wrong, since my GPUs support ReBAR and my motherboard has Above 4G Decoding enabled plus a ReBAR-modded BIOS, which I can confirm works because both GPUs show 32GB of addressable BAR space.
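If you want to reproduce the P2P check on your own setup, this is roughly what I mean (a minimal sketch; the two cuda-samples binaries assume you have built the samples repo, and the PyTorch one-liner is just another quick way to query the same thing):

# show the GPU-to-GPU topology the driver reports
nvidia-smi topo -m
# quick P2P capability check from PyTorch (prints True/False)
python3 -c "import torch; print(torch.cuda.can_device_access_peer(0, 1))"
# the cuda-samples P2P tests report whether P2P is actually enabled
./simpleP2P
./p2pBandwidthLatencyTest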

I tested this by running the all_reduce_perf benchmark from NCCL Tests.
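For anyone wanting to run the same benchmark, it comes from NVIDIA's nccl-tests repo and builds roughly like this (default paths assumed; pass NCCL_HOME/CUDA_HOME to make if your libraries live elsewhere):

git clone https://github.com/NVIDIA/nccl-tests.git
cd nccl-tests
make                                   # e.g. make NCCL_HOME=/usr/local/nccl if NCCL is not in a default location
./build/all_reduce_perf -b 8 -e 128M -f 2 -g 2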

P2P Disabled No NVLink Official Nvidia-Driver-550:

./all_reduce_perf -b 8 -e 128M -f 2 -g 2 part
# nThread 1 nGpus 2 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid   3156 on owen-train-pc device  0 [0x01] NVIDIA GeForce RTX 3090 Ti
#  Rank  1 Group  0 Pid   3156 on owen-train-pc device  1 [0x02] NVIDIA GeForce RTX 3090 Ti
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
           8             2     float     sum      -1     9.64    0.00    0.00      0     9.29    0.00    0.00      0
          16             4     float     sum      -1    10.21    0.00    0.00      0     9.13    0.00    0.00      0
          32             8     float     sum      -1    10.28    0.00    0.00      0     9.27    0.00    0.00      0
          64            16     float     sum      -1    10.25    0.01    0.01      0     9.56    0.01    0.01      0
         128            32     float     sum      -1    10.19    0.01    0.01      0     9.24    0.01    0.01      0
         256            64     float     sum      -1    10.24    0.02    0.02      0     9.22    0.03    0.03      0
         512           128     float     sum      -1    10.24    0.05    0.05      0     9.24    0.06    0.06      0
        1024           256     float     sum      -1    10.81    0.09    0.09      0     9.47    0.11    0.11      0
        2048           512     float     sum      -1     9.45    0.22    0.22      0     9.44    0.22    0.22      0
        4096          1024     float     sum      -1     9.52    0.43    0.43      0    17.09    0.24    0.24      0
        8192          2048     float     sum      -1    10.19    0.80    0.80      0     9.57    0.86    0.86      0
       16384          4096     float     sum      -1    10.91    1.50    1.50      0    10.84    1.51    1.51      0
       32768          8192     float     sum      -1    14.85    2.21    2.21      0    14.77    2.22    2.22      0
       65536         16384     float     sum      -1    22.70    2.89    2.89      0    22.18    2.95    2.95      0
      131072         32768     float     sum      -1    41.96    3.12    3.12      0    42.03    3.12    3.12      0
      262144         65536     float     sum      -1    58.08    4.51    4.51      0    57.29    4.58    4.58      0
      524288        131072     float     sum      -1    90.93    5.77    5.77      0    90.12    5.82    5.82      0
     1048576        262144     float     sum      -1    158.5    6.61    6.61      0    157.5    6.66    6.66      0
     2097152        524288     float     sum      -1    306.7    6.84    6.84      0    293.8    7.14    7.14      0
     4194304       1048576     float     sum      -1    622.6    6.74    6.74      0    558.8    7.51    7.51      0
     8388608       2097152     float     sum      -1   1139.7    7.36    7.36      0   1102.9    7.61    7.61      0
    16777216       4194304     float     sum      -1   2276.6    7.37    7.37      0   2173.2    7.72    7.72      0
    33554432       8388608     float     sum      -1   4430.2    7.57    7.57      0   4321.7    7.76    7.76      0
    67108864      16777216     float     sum      -1   8737.3    7.68    7.68      0   8632.1    7.77    7.77      0
   134217728      33554432     float     sum      -1    17165    7.82    7.82      0    17101    7.85    7.85      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 3.2276 

P2P Modded Driver No NVLink:

./all_reduce_perf -b 8 -e 128M -f 2 -g 2 part
# nThread 1 nGpus 2 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid   2444 on owen-train-pc device  0 [0x01] NVIDIA GeForce RTX 3090 Ti
#  Rank  1 Group  0 Pid   2444 on owen-train-pc device  1 [0x02] NVIDIA GeForce RTX 3090 Ti
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
           8             2     float     sum      -1     9.43    0.00    0.00      0     9.35    0.00    0.00      0
          16             4     float     sum      -1    10.31    0.00    0.00      0     9.46    0.00    0.00      0
          32             8     float     sum      -1    10.28    0.00    0.00      0     9.23    0.00    0.00      0
          64            16     float     sum      -1    10.22    0.01    0.01      0     9.26    0.01    0.01      0
         128            32     float     sum      -1     9.48    0.01    0.01      0     9.28    0.01    0.01      0
         256            64     float     sum      -1     9.44    0.03    0.03      0    10.41    0.02    0.02      0
         512           128     float     sum      -1    10.24    0.05    0.05      0     9.27    0.06    0.06      0
        1024           256     float     sum      -1    10.47    0.10    0.10      0     9.46    0.11    0.11      0
        2048           512     float     sum      -1     9.37    0.22    0.22      0     9.24    0.22    0.22      0
        4096          1024     float     sum      -1     9.52    0.43    0.43      0     9.47    0.43    0.43      0
        8192          2048     float     sum      -1    16.91    0.48    0.48      0    10.18    0.80    0.80      0
       16384          4096     float     sum      -1    11.03    1.48    1.48      0    10.94    1.50    1.50      0
       32768          8192     float     sum      -1    14.79    2.21    2.21      0    14.77    2.22    2.22      0
       65536         16384     float     sum      -1    22.97    2.85    2.85      0    22.46    2.92    2.92      0
      131072         32768     float     sum      -1    42.12    3.11    3.11      0    41.93    3.13    3.13      0
      262144         65536     float     sum      -1    58.25    4.50    4.50      0    58.33    4.49    4.49      0
      524288        131072     float     sum      -1    93.68    5.60    5.60      0    92.54    5.67    5.67      0
     1048576        262144     float     sum      -1    160.7    6.52    6.52      0    160.7    6.52    6.52      0
     2097152        524288     float     sum      -1    293.2    7.15    7.15      0    345.4    6.07    6.07      0
     4194304       1048576     float     sum      -1    581.1    7.22    7.22      0    570.5    7.35    7.35      0
     8388608       2097152     float     sum      -1   1147.2    7.31    7.31      0   1120.8    7.48    7.48      0
    16777216       4194304     float     sum      -1   2312.3    7.26    7.26      0   2202.6    7.62    7.62      0
    33554432       8388608     float     sum      -1   4481.7    7.49    7.49      0   4366.8    7.68    7.68      0
    67108864      16777216     float     sum      -1   8814.9    7.61    7.61      0   8729.6    7.69    7.69      0
   134217728      33554432     float     sum      -1    17439    7.70    7.70      0    17367    7.73    7.73      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 3.18197 

NVLink Enabled Official Nvidia-Driver-550:

./all_reduce_perf -b 8 -e 128M -f 2 -g 2 part
# nThread 1 nGpus 2 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid   7975 on owen-train-pc device  0 [0x01] NVIDIA GeForce RTX 3090 Ti
#  Rank  1 Group  0 Pid   7975 on owen-train-pc device  1 [0x02] NVIDIA GeForce RTX 3090 Ti
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
           8             2     float     sum      -1    20.80    0.00    0.00      0    20.65    0.00    0.00      0
          16             4     float     sum      -1    20.59    0.00    0.00      0    19.27    0.00    0.00      0
          32             8     float     sum      -1    19.34    0.00    0.00      0    19.19    0.00    0.00      0
          64            16     float     sum      -1    19.82    0.00    0.00      0    17.99    0.00    0.00      0
         128            32     float     sum      -1    17.99    0.01    0.01      0    18.03    0.01    0.01      0
         256            64     float     sum      -1    18.00    0.01    0.01      0    17.97    0.01    0.01      0
         512           128     float     sum      -1    18.00    0.03    0.03      0    17.94    0.03    0.03      0
        1024           256     float     sum      -1    16.92    0.06    0.06      0    16.88    0.06    0.06      0
        2048           512     float     sum      -1    16.92    0.12    0.12      0    17.45    0.12    0.12      0
        4096          1024     float     sum      -1    17.57    0.23    0.23      0    16.72    0.24    0.24      0
        8192          2048     float     sum      -1    16.10    0.51    0.51      0    16.05    0.51    0.51      0
       16384          4096     float     sum      -1    17.02    0.96    0.96      0    15.42    1.06    1.06      0
       32768          8192     float     sum      -1    16.13    2.03    2.03      0    15.44    2.12    2.12      0
       65536         16384     float     sum      -1    15.40    4.26    4.26      0    15.29    4.29    4.29      0
      131072         32768     float     sum      -1    13.95    9.39    9.39      0    12.90   10.16   10.16      0
      262144         65536     float     sum      -1    17.90   14.65   14.65      0    17.79   14.73   14.73      0
      524288        131072     float     sum      -1    35.99   14.57   14.57      0    36.09   14.53   14.53      0
     1048576        262144     float     sum      -1    46.56   22.52   22.52      0    46.48   22.56   22.56      0
     2097152        524288     float     sum      -1    68.79   30.49   30.49      0    67.78   30.94   30.94      0
     4194304       1048576     float     sum      -1    125.2   33.51   33.51      0    114.4   36.66   36.66      0
     8388608       2097152     float     sum      -1    207.3   40.47   40.47      0    205.1   40.90   40.90      0
    16777216       4194304     float     sum      -1    407.4   41.18   41.18      0    399.0   42.05   42.05      0
    33554432       8388608     float     sum      -1    769.9   43.58   43.58      0    752.9   44.56   44.56      0
    67108864      16777216     float     sum      -1   1505.6   44.57   44.57      0   1502.3   44.67   44.67      0
   134217728      33554432     float     sum      -1   3072.1   43.69   43.69      0   2945.3   45.57   45.57      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 14.0534 

As you can see, using the official Nvidia driver or the modded P2P driver made no difference, and the P2P tests in cuda-samples report that P2P stays disabled, so maybe the driver only works for the RTX 4090s that tinygrad use in their machines.

Using NVLink, on the other hand, significantly improved the bandwidth and, I think most importantly, the time required to complete the tests, probably because P2P communication over NVLink significantly improves the latency between the GPUs.
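If you want to confirm the bridge is actually being used before trusting the numbers, you can check the link state with nvidia-smi and have NCCL log which transport it picked (a quick sketch, not part of the logs above):

# link state and per-link speed of the NVLink bridge
nvidia-smi nvlink --status
# with debug logging on, NCCL prints which transport it selected for each channel
NCCL_DEBUG=INFO ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 2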

So what does this mean for actual training performance? Quite a huge difference, actually. I tested with Axolotl, training Llama 3.1 8B Instruct on a small dataset using LoRA and FSDP at 8192 context so that it requires more than 24GB of VRAM and the model gets sharded across the two RTX 3090 Ti.

Axolotl config:

base_model: /home/user/models/Meta-Llama-3.1-8B-Instruct
model_type: LlamaForCausalLM
tokenizer_type: AutoTokenizer
  
train_on_inputs: false
group_by_length: false
load_in_8bit: false
load_in_4bit: false
strict: false
sequence_len: 4096
bf16: auto
fp16: 
tf32: false
flash_attention: true

shuffle_merged_datasets: false

# Data
datasets:
  - path: ./jakartaresearch_indoqa_sharegpt_test.jsonl
    type: sharegpt
    conversation: llama-3

warmup_steps: 10
dataset_prepared_path: ./lora_last_run_prepared

# Iterations
num_epochs: 1
saves_per_epoch: 1

# Evaluation
val_set_size: 0.0025
eval_max_new_tokens: 128
eval_sample_packing: false
evals_per_epoch: 0

# LoRA
output_dir: ./lora_out
adapter: lora
lora_r: 8
lora_alpha: 16
lora_dropout: 0.05
lora_target_linear: true
save_safetensors: true

# Sampling
sample_packing: false
pad_to_sequence_len: true

# Batching
gradient_accumulation_steps: 16
micro_batch_size: 1
gradient_checkpointing: true

# Optimizer
optimizer: adamw_torch
lr_scheduler: cosine
learning_rate: 0.0002

# Misc
auto_resume_from_checkpoints: true
logging_steps: 1
weight_decay: 0.1
special_tokens:
   pad_token: <|end_of_text|>
   
fsdp:
  - full_shard
  - auto_wrap
fsdp_config:
  fsdp_limit_all_gathers: true
  fsdp_sync_module_states: true
  fsdp_offload_params: false
  fsdp_use_orig_params: false
  fsdp_cpu_ram_efficient_loading: true
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_sharding_strategy: FULL_SHARD
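
A config like this is launched the standard Axolotl way through accelerate (the filename below is just a placeholder for wherever you save the YAML above; the accelerate config determines that both GPUs are used):

# preprocess the dataset, then launch the FSDP training run across both GPUs
python -m axolotl.cli.preprocess lora-fsdp.yml
accelerate launch -m axolotl.cli.train lora-fsdp.yml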

NVLink Disabled:

[2024-08-09 00:01:49,148] [INFO] [wandb.__setitem__:151] [PID:5370] config set model/num_parameters = 3500277760 - None
[2024-08-09 00:01:49,169] [INFO] [axolotl.callbacks.on_train_begin:785] [PID:5370] [RANK:0] The Axolotl config has been saved to the WandB run under files.
  0%|                                                                                              | 0/9 [00:00<?, ?it/s]You're using a PreTrainedTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
{'loss': 0.649, 'grad_norm': 3.750765323638916, 'learning_rate': 2e-05, 'epoch': 0.11}                                   
 11%|█████████▍                                                                           | 1/9 [01:49<14:37, 109.74s/it][2024-08-09 00:05:28,168] [INFO] [axolotl.callbacks.on_step_end:128] [PID:5370] [RANK:0] GPU memory usage while training: 7.612GB (+12.988GB cache, +0.877GB misc)
 22%|██████████████████▉                                                                  | 2/9 [03:38<12:46, 109.46s/it][2024-08-09 00:05:28,172] [INFO] [axolotl.callbacks.on_step_end:128] [PID:5371] [RANK:1] GPU memory usage while training: 7.612GB (+12.988GB cache, +0.761GB misc)
{'loss': 0.6425, 'grad_norm': 4.116180419921875, 'learning_rate': 4e-05, 'epoch': 0.21}                                  
{'loss': 0.6107, 'grad_norm': 3.7736430168151855, 'learning_rate': 6e-05, 'epoch': 0.32}                                 
{'loss': 0.3526, 'grad_norm': 3.506711006164551, 'learning_rate': 8e-05, 'epoch': 0.43}                                  
{'loss': 0.255, 'grad_norm': 2.3486344814300537, 'learning_rate': 0.0001, 'epoch': 0.53}                                 
{'loss': 0.2153, 'grad_norm': 1.1310781240463257, 'learning_rate': 0.00012, 'epoch': 0.64}                               
{'loss': 0.2319, 'grad_norm': 1.7600951194763184, 'learning_rate': 0.00014, 'epoch': 0.75}                               
{'loss': 0.2309, 'grad_norm': 1.3958746194839478, 'learning_rate': 0.00016, 'epoch': 0.85}                               
{'loss': 0.2094, 'grad_norm': 1.0824881792068481, 'learning_rate': 0.00018, 'epoch': 0.96}                               
100%|█████████████████████████████████████████████████████████████████████████████████████| 9/9 [16:23<00:00, 109.29s/it][2024-08-09 00:18:53,793] [INFO] [accelerate.utils.fsdp_utils.save_fsdp_model:71] [PID:5370] Saving model to ./lora_out/checkpoint-9/pytorch_model_fsdp.bin
[2024-08-09 00:18:53,891] [INFO] [accelerate.utils.fsdp_utils.save_fsdp_model:73] [PID:5370] Model saved to ./lora_out/checkpoint-9/pytorch_model_fsdp.bin
[2024-08-09 00:18:54,492] [INFO] [accelerate.utils.fsdp_utils.save_fsdp_optimizer:175] [PID:5370] Saving Optimizer state to ./lora_out/checkpoint-9/optimizer.bin
[2024-08-09 00:18:54,720] [INFO] [accelerate.utils.fsdp_utils.save_fsdp_optimizer:177] [PID:5370] Optimizer state saved in ./lora_out/checkpoint-9/optimizer.bin
{'eval_loss': 0.15709075331687927, 'eval_runtime': 2.423, 'eval_samples_per_second': 0.413, 'eval_steps_per_second': 0.413, 'epoch': 0.96}                                                                                                        
100%|█████████████████████████████████████████████████████████████████████████████████████| 9/9 [17:07<00:00, 109.29s/it[2024-08-09 00:19:37,114] [INFO] [accelerate.utils.fsdp_utils.save_fsdp_model:71] [PID:5370] Saving model to ./lora_out/checkpoint-9/pytorch_model_fsdp.bin
[2024-08-09 00:19:37,249] [INFO] [accelerate.utils.fsdp_utils.save_fsdp_model:73] [PID:5370] Model saved to ./lora_out/checkpoint-9/pytorch_model_fsdp.bin
[2024-08-09 00:19:37,854] [INFO] [accelerate.utils.fsdp_utils.save_fsdp_optimizer:175] [PID:5370] Saving Optimizer state to ./lora_out/checkpoint-9/optimizer.bin
[2024-08-09 00:19:38,156] [INFO] [accelerate.utils.fsdp_utils.save_fsdp_optimizer:177] [PID:5370] Optimizer state saved in ./lora_out/checkpoint-9/optimizer.bin
{'train_runtime': 1069.9897, 'train_samples_per_second': 0.279, 'train_steps_per_second': 0.008, 'train_loss': 0.37749431199497646, 'epoch': 0.96}
100%|█████████████████████████████████████████████████████████████████████████████████████| 9/9 [17:49<00:00, 118.78s/it]
[2024-08-09 00:19:38,176] [INFO] [axolotl.train.train:190] [PID:5370] [RANK:0] Training Completed!!! Saving pre-trained model to ./lora_out
[2024-08-09 00:19:38,185] [INFO] [axolotl.train.train:199] [PID:5370] [RANK:0] Set FSDP state dict type to FULL_STATE_DICT for saving.

NVLink Enabled:

[2024-08-09 01:23:35,937] [INFO] [wandb.__setitem__:151] [PID:2578] config set model/num_parameters = 3500277760 - None
[2024-08-09 01:23:35,979] [INFO] [axolotl.callbacks.on_train_begin:785] [PID:2578] [RANK:0] The Axolotl config has been saved to the WandB run under files.
  0%|                                                                                              | 0/9 [00:00<?, ?it/s]You're using a PreTrainedTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
{'loss': 0.649, 'grad_norm': 3.9961297512054443, 'learning_rate': 2e-05, 'epoch': 0.11}                                  
 11%|█████████▌                                                                            | 1/9 [01:04<08:36, 64.60s/it][2024-08-09 01:25:44,944] [INFO] [axolotl.callbacks.on_step_end:128] [PID:2578] [RANK:0] GPU memory usage while training: 7.612GB (+12.988GB cache, +1.037GB misc)
 22%|███████████████████                                                                   | 2/9 [02:08<07:31, 64.46s/it][2024-08-09 01:25:44,946] [INFO] [axolotl.callbacks.on_step_end:128] [PID:2579] [RANK:1] GPU memory usage while training: 7.612GB (+12.988GB cache, +0.836GB misc)
{'loss': 0.6425, 'grad_norm': 4.386759281158447, 'learning_rate': 4e-05, 'epoch': 0.21}                                  
{'loss': 0.6108, 'grad_norm': 3.9862568378448486, 'learning_rate': 6e-05, 'epoch': 0.32}                                 
{'loss': 0.3464, 'grad_norm': 3.628135919570923, 'learning_rate': 8e-05, 'epoch': 0.43}                                  
{'loss': 0.2468, 'grad_norm': 2.3137495517730713, 'learning_rate': 0.0001, 'epoch': 0.53}                                
{'loss': 0.2128, 'grad_norm': 1.144849181175232, 'learning_rate': 0.00012, 'epoch': 0.64}                                
{'loss': 0.2318, 'grad_norm': 1.719062328338623, 'learning_rate': 0.00014, 'epoch': 0.75}                                
{'loss': 0.2271, 'grad_norm': 1.3542813062667847, 'learning_rate': 0.00016, 'epoch': 0.85}                               
{'loss': 0.2019, 'grad_norm': 1.0137834548950195, 'learning_rate': 0.00018, 'epoch': 0.96}                               
100%|██████████████████████████████████████████████████████████████████████████████████████| 9/9 [09:41<00:00, 64.67s/it][2024-08-09 01:33:56,499] [INFO] [accelerate.utils.fsdp_utils.save_fsdp_model:71] [PID:2578] Saving model to ./lora_out/checkpoint-9/pytorch_model_fsdp.bin
[2024-08-09 01:33:56,596] [INFO] [accelerate.utils.fsdp_utils.save_fsdp_model:73] [PID:2578] Model saved to ./lora_out/checkpoint-9/pytorch_model_fsdp.bin
[2024-08-09 01:33:57,202] [INFO] [accelerate.utils.fsdp_utils.save_fsdp_optimizer:175] [PID:2578] Saving Optimizer state to ./lora_out/checkpoint-9/optimizer.bin
[2024-08-09 01:33:57,429] [INFO] [accelerate.utils.fsdp_utils.save_fsdp_optimizer:177] [PID:2578] Optimizer state saved in ./lora_out/checkpoint-9/optimizer.bin
{'eval_loss': 0.16556888818740845, 'eval_runtime': 1.7681, 'eval_samples_per_second': 0.566, 'eval_steps_per_second': 0.566, 'epoch': 0.96}                                                                                                       
100%|██████████████████████████████████████████████████████████████████████████████████████| 9/9 [10:23<00:00, 64.67s/it[2024-08-09 01:34:37,507] [INFO] [accelerate.utils.fsdp_utils.save_fsdp_model:71] [PID:2578] Saving model to ./lora_out/checkpoint-9/pytorch_model_fsdp.bin
[2024-08-09 01:34:37,641] [INFO] [accelerate.utils.fsdp_utils.save_fsdp_model:73] [PID:2578] Model saved to ./lora_out/checkpoint-9/pytorch_model_fsdp.bin
[2024-08-09 01:34:38,250] [INFO] [accelerate.utils.fsdp_utils.save_fsdp_optimizer:175] [PID:2578] Saving Optimizer state to ./lora_out/checkpoint-9/optimizer.bin
[2024-08-09 01:34:38,551] [INFO] [accelerate.utils.fsdp_utils.save_fsdp_optimizer:177] [PID:2578] Optimizer state saved in ./lora_out/checkpoint-9/optimizer.bin
{'train_runtime': 663.2972, 'train_samples_per_second': 0.451, 'train_steps_per_second': 0.014, 'train_loss': 0.37435382604599, 'epoch': 0.96}
100%|██████████████████████████████████████████████████████████████████████████████████████| 9/9 [11:02<00:00, 73.62s/it]
[2024-08-09 01:34:38,571] [INFO] [axolotl.train.train:190] [PID:2578] [RANK:0] Training Completed!!! Saving pre-trained model to ./lora_out
[2024-08-09 01:34:38,580] [INFO] [axolotl.train.train:199] [PID:2578] [RANK:0] Set FSDP state dict type to FULL_STATE_DICT for saving.

The result is about a 40% time saving (16:23 vs 9:41) with NVLink enabled versus without NVLink. That is an insanely large saving for such a short training run; a 10-day training job would become roughly a 6-day job just by enabling NVLink.
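Spelling out where that number comes from, using the wall-clock times above and the train_runtime values from the logs:

# wall-clock: 16:23 = 983 s without NVLink, 9:41 = 581 s with NVLink
echo "scale=3; (983 - 581) / 983" | bc              # ≈ .409 -> ~41% less time
# train_runtime from the logs: 1069.99 s vs 663.30 s
echo "scale=3; (1069.99 - 663.30) / 1069.99" | bc   # ≈ .380 -> ~38% less time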

So my conclusion is that for anyone looking to build a 48GB VRAM dual RTX 3090 (Ti) rig for playing around with LLMs, definitely try to get a motherboard with 4-slot spacing so that you can fit an NVLink bridge. The performance gains when training with FSDP are massive.

Which also makes it unfortunate that the new RTX 4090 has neither official P2P support nor an NVLink connector. With the 4090 being much faster than the RTX 3090, I can’t imagine it does well without a fast connection between two GPUs. On my RTX 3090 Ti, GPU power consumption during training hovers around 430W with NVLink, while without NVLink it drops to around 300W, which indicates the GPU is waiting for data and not being fully utilized. I haven’t personally tested P2P on the RTX 4090 since I only have a single RTX 4090, so if anyone has a dual RTX 4090 setup, let me know whether P2P with the modded driver actually works for you.
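If you want to track the power draw yourself during a run, nvidia-smi can sample it once per second (a simple way to log it while training):

# log power draw and GPU utilization for both cards once per second
nvidia-smi --query-gpu=index,power.draw,utilization.gpu --format=csv -l 1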

To get 48GB of VRAM for training you can of course also buy an Nvidia RTX A6000 or RTX 6000 Ada (who tf comes up with these names), which have 48GB in a single GPU. But then you’re probably also training slower than dual RTX 3090 (Ti) GPUs, since FSDP performance scales almost linearly with the number of GPUs, and even the AD102 GPU in the RTX 4090 and RTX 6000 Ada isn’t really 2x the performance of the GA102 in the RTX 3090.

Not to mention the insane cost of the workstation GPUs, where you can get 4x RTX 3090s for the price of a single RTX A6000 lol. In that case, even with a 40% performance hit from running 4 GPUs without NVLink, you’re probably still much faster overall and have 96GB of VRAM to boot. I also haven’t tested the benefit of pairing NVLink across two GPUs in a 4x 3090 setup, but I will do that testing soon on my 4x3090 machine.

So really my conclusion is that dual RTX 3090 or RTX 3090 Ti with NVLink is the ultimate at-home AI/machine learning/LLM development setup. Hopefully you guys don’t raise the price of RTX 3090s because I’m gonna buy some more brb.

TLDR: NVLink improves FSDP training by 40% and modded P2P driver does not work for RTX 3090. So try and use NVLink if you can.

Discussion Summary

Through hands-on testing, the post's author demonstrates the significant performance gains NVLink provides on RTX 3090 GPUs, especially when training with FSDP (Fully Sharded Data Parallel). The results show that enabling NVLink cuts training time by roughly 40%, a major efficiency improvement. The author also examines how P2P (peer-to-peer) communication behaves under different driver versions and concludes that NVLink is crucial for training performance. The post further covers hardware configuration, driver choice, and recommendations for future GPU setups, providing a valuable reference for developers and researchers in AI/machine learning.

Main Points

  1. 👍 NVLink significantly improves training performance on RTX 3090 GPUs

    • Supporting argument: In real-world testing, enabling NVLink reduced training time by about 40%, a substantial efficiency gain.
    • Dissenting voices: None of note; most comments back the measured NVLink gains.
  2. 🔥 The NVLink gains are especially pronounced with training methods like FSDP

    • Pro: FSDP is widely used in high-performance training, and enabling NVLink further optimizes data-parallel communication.
    • Con: No clear opposing view; most comments acknowledge the performance advantage of combining FSDP with NVLink.
  3. 💡 P2P communication brought no performance gain under certain drivers

    • Explanation: Even with the modded P2P driver, no clear improvement was observed; NVLink remains the key factor.
  4. 🚀 For LLM (Large Language Model) development, dual RTX 3090s or RTX 3090 Tis with NVLink are the best choice

    • Explanation: A dual-GPU setup with NVLink delivers higher efficiency and performance when working with large language models.
  5. 🌟 The RTX 4090 lacks official P2P support and an NVLink connector, which may hurt its performance in dual-GPU setups

    • Explanation: Without P2P or NVLink support, the RTX 4090's performance in multi-GPU configurations may be limited.

Notable Quotes and Interesting Comments

  1. “😂 NVLink improves FSDP training by 40% and modded P2P driver does not work for RTX 3090. So try and use NVLink if you can.”

    • Highlight: A concise summary of NVLink's advantage and the ineffectiveness of the P2P driver.
  2. “🤔 I mean the 5t/s difference you reported is a huge 50% boost even on inference. I wouldn’t say don’t bother.”

    • Highlight: Emphasizes that NVLink also gives a notable boost to inference performance and encourages users to adopt it.
  3. “👀 Great info, and that you tested for real in Axolotl. Would love to learn the speedup with 2 NVLinks in your 4x3090 when you test that.”

    • Highlight: Appreciation for the real-world testing, with anticipation of more results on multi-NVLink configurations.

Sentiment Analysis

The discussion is generally positive and exploratory in tone. Most commenters acknowledge NVLink's performance gains and praise the author's detailed testing and analysis. The main points of contention are the effectiveness of the P2P driver and the RTX 4090's configuration limits, but they do not dampen the overall positive atmosphere.

Trends and Predictions

  • Emerging topic: performance gains from using multiple NVLink bridges in multi-GPU configurations.
  • Potential impact: further optimization and wider adoption of NVLink could advance the AI/machine learning field, especially for large-scale data and models.

Details:

Title: NVLink's significant boost to RTX 3090 training performance

On Reddit, a post titled "NVLink boosts training performance by A LOT" sparked a lively discussion. Through detailed tests and experiments, the author takes a deep look at NVLink's role in training on RTX 3090 GPUs. The post drew a lot of attention and many comments, mainly around NVLink's impact on training performance and comparisons with related technologies.

The discussion centers on the performance difference with and without NVLink. The author notes that the official Nvidia driver and the modded P2P driver showed no meaningful difference in testing, and that P2P remained disabled on the test setup. Enabling NVLink, however, significantly improved bandwidth and, most notably, the time needed to complete the tests; when training Llama 3.1 8B Instruct with Axolotl, NVLink cut training time by about 40%.

Users such as [u/Inkbot_dev] said they greatly appreciated the information shared and hoped more people would post real test results. [u/nero10578] said more tests are coming, including inference and 4x3090 performance. [u/LostGoatOnHill] is looking forward to the inference and 4x3090 results. [u/ResidentPositive4122] summed it up simply as "nvlink good".

On whether a 3090 can be NVLinked to a 3090 Ti, the author [u/nero10578] said it is not possible either physically or in software. Regarding P2P, [u/DeltaSqueezer] believes most consumer motherboards do not support it and suggests using a PCIe switch.

In short, the discussion shows that NVLink delivers a significant training-performance boost on the RTX 3090, though some technical and hardware limitations and challenges remain in practice.