Update README.md #3
opened by astachowicz
Lazy mode is now deprecated for RoBERTa Large; the recommended way is to use torch.compile.
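For context, a minimal sketch of what the torch.compile flags used below correspond to at the PyTorch level. This is not the optimum-habana implementation; it assumes habana_frameworks.torch is installed on a Gaudi machine and that an HPU device is available:

```python
# Minimal sketch, not the exact optimum-habana code path.
# Assumes habana_frameworks.torch is installed and an HPU is available.
import os
os.environ["PT_HPU_LAZY_MODE"] = "0"  # disable lazy mode before importing the Habana bridge

import torch
import habana_frameworks.torch.core  # noqa: F401  # registers the "hpu" device / hpu_backend

from transformers import AutoModelForQuestionAnswering

model = AutoModelForQuestionAnswering.from_pretrained("roberta-large").to("hpu")
# Same backend as passed via --torch_compile_backend hpu_backend
model = torch.compile(model, backend="hpu_backend")
```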
@astachowicz When running:
PT_HPU_LAZY_MODE=0 python run_qa.py \
--model_name_or_path roberta-large \
--gaudi_config_name Habana/roberta-large \
--dataset_name squad \
--do_train \
--do_eval \
--per_device_train_batch_size 12 \
--per_device_eval_batch_size 8 \
--learning_rate 3e-5 \
--num_train_epochs 2 \
--max_seq_length 384 \
--output_dir /tmp/squad/ \
--use_habana \
--torch_compile_backend hpu_backend \
--torch_compile \
--use_lazy_mode false \
--throughput_warmup_steps 3 \
--bf16
I get the following error during training:
Traceback (most recent call last):
File "/root/workspace/optimum-habana/examples/question-answering/run_qa.py", line 732, in <module>
main()
File "/root/workspace/optimum-habana/examples/question-answering/run_qa.py", line 678, in main
train_result = trainer.train(resume_from_checkpoint=checkpoint)
File "/usr/local/lib/python3.10/dist-packages/optimum/habana/transformers/trainer.py", line 545, in train
return inner_training_loop(
File "/usr/local/lib/python3.10/dist-packages/optimum/habana/transformers/trainer.py", line 910, in _inner_training_loop
for step, inputs in enumerate(epoch_iterator):
File "/usr/local/lib/python3.10/dist-packages/accelerate/data_loader.py", line 461, in __iter__
current_batch = send_to_device(current_batch, self.device)
File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/operations.py", line 167, in send_to_device
{
File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/operations.py", line 168, in <dictcomp>
k: t if k in skip_keys else send_to_device(t, device, non_blocking=non_blocking, skip_keys=skip_keys)
File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/operations.py", line 186, in send_to_device
return tensor.to(device, non_blocking=non_blocking)
RuntimeError: [Rank:0] FATAL ERROR :: MODULE:PT_BRIDGE Exception in Lowering thread...
synNodeCreateWithId failed for node: identity with synStatus 1 [Invalid argument]. .
[Rank:0] Habana exception raised from add_node at graph.cpp:481
This was on Gaudi2 with SynapseAI v1.16.0.
I get the same error if I leave /tmp/squad in place with a checkpoint from a different model. If I remove the /tmp/squad directory, the error goes away.
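In case it helps others hitting this, a small hypothetical helper mirroring the workaround above: clear the stale --output_dir left over from a run with a different model before starting a new run.

```python
# Hypothetical cleanup matching the workaround: remove a stale output dir
# (old checkpoints in it triggered the lowering error for me).
import shutil
from pathlib import Path

output_dir = Path("/tmp/squad")
if output_dir.exists():
    shutil.rmtree(output_dir)
```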
LGTM!
regisss changed pull request status to merged