Discrepancy between k_proj in .safetensors and .pt?
Hi,
I'm getting different weights for k_proj when opening the .pt file vs. the .safetensors file.
import torch
from safetensors import safe_open

f = safe_open("model-00001-of-00019.safetensors", framework="pt")
weights = torch.load("consolidated.00.pt")
#### row 0 --- same
print(f.get_tensor("model.layers.0.self_attn.k_proj.weight")[0])
>>> tensor([ 0.0003, 0.0061, -0.0005, ..., -0.0029, -0.0003, -0.0003],
dtype=torch.bfloat16)
print(weights["layers.0.attention.wk.weight"][0])
>>> tensor([ 0.0003, 0.0061, -0.0005, ..., -0.0029, -0.0003, -0.0003],
dtype=torch.bfloat16)
#### row 1 --- different!
print(f.get_tensor("model.layers.0.self_attn.k_proj.weight")[1])
>>> tensor([-0.0001, -0.0060, 0.0006, ..., -0.0012, -0.0002, 0.0001],
dtype=torch.bfloat16)
print(weights["layers.0.attention.wk.weight"][1])
>>> tensor([-0.0004, 0.0073, 0.0002, ..., 0.0437, 0.0005, -0.0003],
dtype=torch.bfloat16)
From the second row on, the k_proj weights differ between the two files. It seems to only affect k_proj.
I checked the file checksums and they're fine.
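To rule out eyeballing errors, here's a quick sketch (the comparison is mine, reusing the tensors loaded above) that checks every row at once:
# find the rows of k_proj that don't match between the two files
a = f.get_tensor("model.layers.0.self_attn.k_proj.weight").float()
b = weights["layers.0.attention.wk.weight"].float()
diff_rows = (~torch.isclose(a, b)).any(dim=1)
print(diff_rows.nonzero().flatten())  # for me: every row except row 0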
Update: other weights look off as well, e.g. the MoE experts. What's going on?
# sweights = the safetensors state dict, tweights = the .pt state dict from above
sweights = {k: f.get_tensor(k) for k in f.keys()}
tweights = weights
bb = sweights["model.layers.0.block_sparse_moe.experts.0.w2.weight"]
aa = tweights["layers.0.block_sparse_moe.w2"].reshape((8, -1, 14336))[0]
(aa - bb).sum()
>>> 0 (the same)
bb = sweights["model.layers.0.block_sparse_moe.experts.1.w2.weight"]
aa = tweights["layers.0.block_sparse_moe.w2"].reshape((8, -1, 14336))[1]
(aa - bb).sum()
>>> -0.1250 (difference!)
For expert 0 it's the same.
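In case anyone wants to reproduce this, here's a loop over all eight experts (a sketch, reusing the sweights/tweights dicts from above):
# per-expert sum of differences for layer 0's w2
w2_ref = tweights["layers.0.block_sparse_moe.w2"].reshape((8, -1, 14336))
for e in range(8):
    w2_hf = sweights[f"model.layers.0.block_sparse_moe.experts.{e}.w2.weight"]
    print(e, (w2_ref[e] - w2_hf).sum())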
Was the model converted into safetensors properly? Or is it a different version?
My implementation breaks when using the new weights, but works fine with the .pt ones. Not sure if I'm missing something or what :/
Also, where can I find info on changes like this? E.g. in Mistral the FFN weights used to be named w1/w2/w3 (that's how the reference implementation has it), but now it's gate/down/up. I had to dig through the HF implementation to make sure I was getting the new names right.
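For reference, this is the FFN rename I pieced together from reading the HF code (my own mapping, so treat it as a guess):
# how I think the old reference names map to the new HF names
ffn_name_map = {
    "w1": "gate_proj",  # projection that goes through the activation
    "w2": "down_proj",  # output projection back to hidden size
    "w3": "up_proj",    # projection multiplied with the activated w1 output
}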
(aside from all that, thanks for releasing the model!)