
CheckpointingException | nvidia/Llama3-70B-SteerLM-RM NOT a distributed checkpoint of Megatron

#4 opened by DeeLearning

Hi Nvidia SteerLM team,
I've recently been trying to annotate the Daring Anteater dataset with this model.
I started the nvcr.io/nvidia/nemo:24.01.framework image with nvidia/Llama3-70B-SteerLM-RM mounted inside it, then followed the instructions on this page. However, when I start serve_reward_model.py, I get this error:

File "/opt/megatron-lm/megatron/core/dist_checkpointing/serialization.py", line 125, in _verify_checkpoint_and_load_strategy
raise CheckpointingException(f'{checkpoint_dir} is not a distributed checkpoint')
megatron.core.dist_checkpointing.core.CheckpointingException: Llama3-70B-SteerLM-RM/model_weights is not a distributed checkpoint
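
Out of curiosity I tried to reproduce the check from the traceback directly. This is only a sketch: I'm assuming the container's /opt/megatron-lm is importable and that check_is_distributed_checkpoint is the public name of the helper behind that error, which I haven't verified.

```bash
# Re-run the failing check on the same path (helper name is my guess from the
# module shown in the traceback; adjust the path to where you mounted the model):
python -c "from megatron.core.dist_checkpointing import check_is_distributed_checkpoint; print(check_is_distributed_checkpoint('Llama3-70B-SteerLM-RM/model_weights'))"
# For me this prints False, consistent with the exception above.
```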

It seems the reward model I downloaded here isn't in the format serve_reward_model.py expects.
Did I miss a step or do something wrong? Please enlighten me.
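
For reference, the command I'm running is essentially the serving example from the model card, with the checkpoint path and parallelism settings adapted to my setup, so treat the exact flags and paths below as approximate rather than authoritative:

```bash
# Roughly what I run inside nvcr.io/nvidia/nemo:24.01.framework;
# RM_CHECKPOINT is wherever I mounted the downloaded model, and the script
# path assumes the usual NeMo-Aligner layout under /opt.
RM_CHECKPOINT=/workspace/Llama3-70B-SteerLM-RM

python /opt/NeMo-Aligner/examples/nlp/gpt/serve_reward_model.py \
    rm_model_file=${RM_CHECKPOINT} \
    trainer.num_nodes=1 \
    trainer.devices=8 \
    ++model.tensor_model_parallel_size=8 \
    ++model.pipeline_model_parallel_size=1 \
    inference.port=1424
```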

I think the issue is that you're pointing to the "model_weights" directory when you should point to its parent, the one where the config file is saved. I didn't serve exactly this model (I used the 3.1 version); it was a painful nightmare to get working, but I did, so if you can't, let me know.
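
To make that concrete, here is a sketch of what I mean; the layout is what the extracted checkpoint looked like on my side, so adjust the names and paths to yours:

```bash
# Expected layout after downloading/extracting the checkpoint (names from my copy):
#   Llama3-70B-SteerLM-RM/
#   ├── model_config.yaml   <- config file the loader reads
#   └── model_weights/      <- sharded weights the error complains about
#
# Point rm_model_file at the top-level directory, not at model_weights;
# keep the remaining trainer/model/inference flags as in your original command.
python /opt/NeMo-Aligner/examples/nlp/gpt/serve_reward_model.py \
    rm_model_file=/workspace/Llama3-70B-SteerLM-RM
```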
