save_pretrained showing larger files than the ones in the repo
Hi, I ran the steps below.

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("databricks/dolly-v2-3b", padding_side="left")
base_model = AutoModelForCausalLM.from_pretrained("databricks/dolly-v2-3b", device_map="auto")

# I saved the model after this.
base_model.save_pretrained("/home/ec2-user/SageMaker/models/dolly-v2-3b", from_pt=True)
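A quick sanity check on what ended up in the save directory, as a sketch using the same path as above:

import os

save_dir = "/home/ec2-user/SageMaker/models/dolly-v2-3b"
files = os.listdir(save_dir)
# Total up everything save_pretrained wrote (config, shard index, weight shards).
total_gb = sum(os.path.getsize(os.path.join(save_dir, f)) for f in files) / 1e9
print(f"{len(files)} files, {total_gb:.2f} GB total")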
When I look at the saved files, they are different from the ones you see in the repo.
For instance, the repo has a single 5.68 GB .bin file, but save_pretrained wrote two .bin files: one is 10.1 GB and the other is 1.15 GB. This does not match the files in this repo.
Any idea why this is happening? What are the implications of the larger model size?
Here is what I get after saving the pretrained model.
It's because you did not load in 16-bit, I'd imagine. from_pretrained defaults to float32 unless you pass torch_dtype, so you're saving the weights at twice the precision and twice the storage.
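If you want the saved copy to match the size of the repo checkpoint, a minimal sketch is to load in half precision before saving (torch_dtype is the standard from_pretrained argument for this):

import torch
from transformers import AutoModelForCausalLM

# Load the checkpoint in half precision instead of the float32 default,
# so save_pretrained writes roughly the same size as the repo file.
base_model = AutoModelForCausalLM.from_pretrained(
    "databricks/dolly-v2-3b",
    torch_dtype=torch.float16,
    device_map="auto",
)
base_model.save_pretrained("/home/ec2-user/SageMaker/models/dolly-v2-3b")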
I see. When I ran inference with the model loaded from the saved files, the latency was 3-4 times higher than with the model loaded directly via from_pretrained. Shouldn't the latency be the same in both cases?
No, because you are doing more than twice the work in 32-bit math. I don't see why you are doing it this way, though.
Based on what you say, I am loading it from Hugging Face in 32-bit originally as well. Is that right? But the latency is much lower on that one. How is that happening?
Ah, OK, I mistook the setup: you're benchmarking the model loaded this way too, without saving. Yeah, it should be the same thing. Check the torch_dtype in both cases to confirm. Otherwise I'm not sure why, or maybe I'm wrong about precision being the issue.
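One way to run that check without reloading both full models, as a sketch against the paths used above:

from transformers import AutoConfig

# The config saved alongside the weights records the dtype they were serialized in
# (it may be None on older checkpoints).
print(AutoConfig.from_pretrained("databricks/dolly-v2-3b").torch_dtype)
print(AutoConfig.from_pretrained("/home/ec2-user/SageMaker/models/dolly-v2-3b").torch_dtype)

# For what is actually used at inference time, check the loaded model directly:
# print(next(base_model.parameters()).dtype)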
Are you sure you are unloading the first model before loading the second? Otherwise device_map="auto" might only fit part of the second model on the GPU and offload the rest, which would explain the extra latency.
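For example, something along these lines (a sketch; it assumes base_model from the earlier snippet is the first model still in memory):

import gc
import torch
from transformers import AutoModelForCausalLM

# Free the first model before loading the second, otherwise device_map="auto"
# may only place part of the second model on the GPU.
del base_model
gc.collect()
torch.cuda.empty_cache()

reloaded = AutoModelForCausalLM.from_pretrained(
    "/home/ec2-user/SageMaker/models/dolly-v2-3b",
    device_map="auto",
)
# Any "cpu" or "disk" entries here would explain the extra latency.
print(reloaded.hf_device_map)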