
Train GPT-2 with 2 nodes, 2 GPUs, tp=2: IndexError at megatron/core/tensor_parallel/cross_entropy.py #457

Open
yuanpeng-zhu opened this issue Jan 13, 2025 · 0 comments

yuanpeng-zhu commented Jan 13, 2025

Hardware:
2 nodes
1 × 4090 GPU per node

The master node's launch script:

#!/bin/bash

# Runs the "345M" parameter model

export CUDA_DEVICE_MAX_CONNECTIONS=1

GPUS_PER_NODE=1
# Change for multinode config
MASTER_ADDR=10.0.0.7
MASTER_PORT=6000
NNODES=2
NODE_RANK=0
WORLD_SIZE=$(($GPUS_PER_NODE*$NNODES))

CHECKPOINT_PATH=checkpoints/gpt2
VOCAB_FILE=../../data/gpt2-vocab.json 
MERGE_FILE=../../data/gpt2-merges.txt
DATA_PATH=../../data/meg-gpt2_text_document

DISTRIBUTED_ARGS="
    --nproc_per_node $GPUS_PER_NODE \
    --nnodes $NNODES \
    --node_rank $NODE_RANK \
    --master_addr $MASTER_ADDR \
    --master_port $MASTER_PORT
"

GPT_ARGS="
    --tensor-model-parallel-size 2 \
    --pipeline-model-parallel-size 1 \
    --sequence-parallel \
    --num-layers 24 \
    --hidden-size 1024 \
    --num-attention-heads 16 \
    --seq-length 1024 \
    --max-position-embeddings 1024 \
    --micro-batch-size 4 \
    --global-batch-size 16 \
    --lr 0.00015 \
    --train-iters 500000 \
    --lr-decay-iters 320000 \
    --lr-decay-style cosine \
    --min-lr 1.0e-5 \
    --weight-decay 1e-2 \
    --lr-warmup-fraction .01 \
    --clip-grad 1.0 \
    --fp16
"

DATA_ARGS="
    --data-path $DATA_PATH \
    --vocab-file $VOCAB_FILE \
    --merge-file $MERGE_FILE \
    --data-impl mmap \
    --split 949,50,1
"

OUTPUT_ARGS="
    --log-interval 100 \
    --save-interval 10000 \
    --eval-interval 1000 \
    --eval-iters 10
"

torchrun $DISTRIBUTED_ARGS pretrain_gpt.py \
    $GPT_ARGS \
    $DATA_ARGS \
    $OUTPUT_ARGS \
    --distributed-backend nccl \
    --save $CHECKPOINT_PATH \
    --load $CHECKPOINT_PATH

The worker node uses the same script, with only NODE_RANK changed (to 1).

After starting training, the following error occurred:

 > loading shuffle-idx mapping from ../../data/index-cache/91eec3f91588e8e2e456859caa3dcd56_shuffle_idx.npy
    loaded indexed file in 0.000 seconds
    total number of samples: 369
    total number of epochs: 1
> finished creating GPT datasets ...
[after dataloaders are built] datetime: 2025-01-13 03:27:29 
done with setup ...
training ...
[before the start of training step] datetime: 2025-01-13 03:27:29 
Traceback (most recent call last):
  File "/workspace/Megatron-DeepSpeed_new/Megatron-DeepSpeed/pretrain_gpt.py", line 360, in <module>
    pretrain(train_valid_test_datasets_provider,
  File "/workspace/Megatron-DeepSpeed_new/Megatron-DeepSpeed/megatron/training.py", line 235, in pretrain
    iteration = train(forward_step_func,
  File "/workspace/Megatron-DeepSpeed_new/Megatron-DeepSpeed/megatron/training.py", line 1269, in train
    train_step(forward_step_func,
  File "/workspace/Megatron-DeepSpeed_new/Megatron-DeepSpeed/megatron/training.py", line 709, in train_step
    losses_reduced = forward_backward_func(
  File "/workspace/Megatron-DeepSpeed_new/Megatron-DeepSpeed/megatron/core/pipeline_parallel/schedules.py", line 349, in forward_backward_no_pipelining
    output_tensor = forward_step(forward_step_func, data_iterator, model, num_microbatches,
  File "/workspace/Megatron-DeepSpeed_new/Megatron-DeepSpeed/megatron/core/pipeline_parallel/schedules.py", line 199, in forward_step
    output_tensor, loss_func = forward_step_func(data_iterator, model)
  File "/workspace/Megatron-DeepSpeed_new/Megatron-DeepSpeed/pretrain_gpt.py", line 286, in forward_step
    output_tensor, other_losses = model(tokens, position_ids, attention_mask,
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/workspace/Megatron-DeepSpeed_new/Megatron-DeepSpeed/megatron/model/distributed.py", line 58, in forward
    return self.module(*inputs, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/workspace/Megatron-DeepSpeed_new/Megatron-DeepSpeed/megatron/model/module.py", line 191, in forward
    outputs = self.module(*inputs, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/workspace/Megatron-DeepSpeed_new/Megatron-DeepSpeed/megatron/model/gpt_model.py", line 258, in forward
    lm_output = post_language_model_processing(
  File "/workspace/Megatron-DeepSpeed_new/Megatron-DeepSpeed/megatron/model/gpt_model.py", line 68, in post_language_model_processing
    loss = cross_entropy(output.float(), labels)
  File "/workspace/Megatron-DeepSpeed_new/Megatron-DeepSpeed/megatron/core/tensor_parallel/cross_entropy.py", line 142, in vocab_parallel_cross_entropy
    return _VocabParallelCrossEntropy.apply(vocab_parallel_logits, target, label_smoothing)
  File "/usr/local/lib/python3.10/dist-packages/torch/autograd/function.py", line 539, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/workspace/Megatron-DeepSpeed_new/Megatron-DeepSpeed/megatron/core/tensor_parallel/cross_entropy.py", line 47, in forward
    predicted_logits_1d = logits_2d[arange_1d, masked_target_1d]
IndexError: shape mismatch: indexing tensors could not be broadcast together with shapes [2048], [4096]
[2025-01-13 03:27:34,178] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 20947) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.1.0a0+32f93b1', 'console_scripts', 'torchrun')())
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 806, in main
    run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 

Output of some variables, printed just before the failing line in _VocabParallelCrossEntropy.forward (megatron/core/tensor_parallel/cross_entropy.py):

print(f"target shape: {target.shape}, vocab_parallel_logits shape: {vocab_parallel_logits.shape}")
print(f"logits_2d shape: {logits_2d.shape}")
print(f"arange_1d shape: {arange_1d.shape}")
print(f"masked_target_1d shape: {masked_target_1d.shape}")

Output:
target shape: torch.Size([1024, 4]), vocab_parallel_logits shape: torch.Size([512, 4, 25216])
logits_2d shape: torch.Size([2048, 25216])
arange_1d shape: torch.Size([2048])
masked_target_1d shape: torch.Size([4096])
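
From these shapes: logits_2d has 512 × 4 = 2048 rows while masked_target_1d has 1024 × 4 = 4096 entries, so the logits' sequence dimension (512) is seq-length / tensor-parallel-size, while the labels still span the full 1024-token sequence. A minimal standalone sketch (not Megatron code, just the reported shapes) that reproduces the IndexError:

# Minimal repro of the IndexError using the shapes printed above (not Megatron code).
import torch

seq_len, micro_batch, partition_vocab_size = 1024, 4, 25216
tp_size = 2

# Logits arrive with the sequence dimension still split across the 2 TP ranks: [s/tp, b, vocab/tp]
vocab_parallel_logits = torch.randn(seq_len // tp_size, micro_batch, partition_vocab_size)
# Labels cover the full sequence: [s, b]
target = torch.randint(0, partition_vocab_size, (seq_len, micro_batch))

logits_2d = vocab_parallel_logits.view(-1, partition_vocab_size)   # [2048, 25216]
masked_target_1d = target.view(-1)                                  # [4096]
arange_1d = torch.arange(logits_2d.size(0))                         # [2048]

try:
    predicted_logits_1d = logits_2d[arange_1d, masked_target_1d]
except IndexError as e:
    # shape mismatch: indexing tensors could not be broadcast together with shapes [2048], [4096]
    print(e)

The two index tensors disagree by exactly the tensor-parallel factor of 2, which looks consistent with the logits reaching vocab_parallel_cross_entropy while still split along the sequence dimension when --sequence-parallel is enabled; without sequence parallelism the two shapes should agree.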