missing nlpaueb/legal-bert-base/resolve/main/tokenizer_config.json ?
doing what I generally do to load a model:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("nlpaueb/legal-bert-base")
model = AutoModelForSeq2SeqLM.from_pretrained("nlpaueb/legal-bert-base")
generates this error:
Traceback (most recent call last):
File "/Users/rik/data/pkg/miniconda3/envs/ai4law/lib/python3.11/site-packages/huggingface_hub/utils/_errors.py", line 270, in hf_raise_for_status
response.raise_for_status()
File "/Users/rik/data/pkg/miniconda3/envs/ai4law/lib/python3.11/site-packages/requests/models.py", line 1021, in raise_for_status
raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://huggingface.co/nlpaueb/legal-bert-base/resolve/main/tokenizer_config.json
But the file seems to be there?
https://huggingface.co/nlpaueb/legal-bert-base-uncased/blob/main/tokenizer_config.json
Hi @rkbelew, it seems you were trying to load nlpaueb/legal-bert-base instead of nlpaueb/legal-bert-base-uncased?
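If it helps for next time, you can list what a repo on the Hub actually contains before loading it (list_repo_files ships with huggingface_hub, which transformers already depends on):

from huggingface_hub import list_repo_files

# the full repo id resolves; the truncated one is what produced the 404
print(list_repo_files("nlpaueb/legal-bert-base-uncased"))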
duh! thanks very much @fendiprime for noticing my bug.
and so i'm now able to load the model, but bumping into:
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("mps")
outputs = model.generate(input_ids)
TypeError: The current model class (BertModel) is not compatible with .generate(), as it doesn't have a language model head. Please use one of the following classes instead: {'BertLMHeadModel'}
this despite the fact that dir(model) includes generate as one of its attributes? am i just being thick again?
Happy to help @rkbelew, I think you're correct that the generate method can't be used despite appearing on the Bert model class.
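For what it's worth, generate is inherited via the GenerationMixin that the base PreTrainedModel class pulls in, which is why it shows up in dir(model) for every model. On reasonably recent transformers versions you can ask the model directly whether generation is actually supported:

from transformers import AutoModel

model = AutoModel.from_pretrained("nlpaueb/legal-bert-base-uncased")
print(type(model).__name__)  # BertModel
print(model.can_generate())  # False: the method exists, but there's no LM head to drive it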
That being said, it is possible to use the model for masking tasks. I tried that successfully:
from transformers import BertForMaskedLM, AutoTokenizer
from torch import no_grad

model_name = "nlpaueb/legal-bert-base-uncased"
model = BertForMaskedLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

inputs = tokenizer("This [MASK] Agreement is between General Motors and John Murray.", return_tensors="pt")
with no_grad():
    logits = model(**inputs).logits

# find the position of the [MASK] token and take the highest-scoring vocab id there
mask_token_index = (inputs.input_ids == tokenizer.mask_token_id)[0].nonzero(as_tuple=True)[0]
predicted_token_id = logits[0, mask_token_index].argmax(axis=-1)
tokenizer.decode(predicted_token_id)
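If all you need is the fill-in-the-blank behaviour, the fill-mask pipeline wraps those same steps (tokenize, forward pass, decode) in one call:

from transformers import pipeline

fill = pipeline("fill-mask", model="nlpaueb/legal-bert-base-uncased")
for pred in fill("This [MASK] Agreement is between General Motors and John Murray."):
    print(pred["token_str"], pred["score"])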
I also managed to use the BertLMHeadModel class mentioned in the error for the masking task, but making that work was even more hacky than the approach above. I'll be happy to share if you're interested though.
Cheers
i can't even tell what makes your example hacky, so I'm sure I'd be interested in your other experiment with BertLMHeadModel. still finding my way in this LLM ecosystem. thanks for your help.
Alright then, here's how you could go about generating outputs from the model:
from torch import topk
from transformers import AutoTokenizer, BertLMHeadModel

def decode_predictions(tokenizer, model_output, num_labels=1):
    """Decode the top-k most likely tokens from the given model output."""
    top_k = topk(model_output.logits, k=num_labels)  # top-k over the vocabulary dimension
    top_k_indices = top_k.indices
    decoded_tokens = []
    for idx in top_k_indices:
        # idx has shape (sequence_length, num_labels); flatten to a plain list of ids
        decoded_tokens.append(tokenizer.convert_ids_to_tokens(idx.squeeze(-1).tolist()))
    return ' '.join(decoded_tokens[0])

model_name = "nlpaueb/legal-bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = BertLMHeadModel.from_pretrained(model_name, is_decoder=False)

input_text = "This [MASK] Agreement is between General Motors and John Murray ."
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model(**inputs, labels=inputs["input_ids"])

predicted_tokens = decode_predictions(tokenizer, outputs)
print(predicted_tokens)
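And to close the loop on the original TypeError: .generate() does run if you load BertLMHeadModel with is_decoder=True, but since this model was pretrained on masked-token prediction rather than left-to-right prediction, expect the generations to be mostly noise; this only demonstrates the API:

from transformers import AutoTokenizer, BertLMHeadModel

model_name = "nlpaueb/legal-bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = BertLMHeadModel.from_pretrained(model_name, is_decoder=True)  # decoder mode enables generate

input_ids = tokenizer("This Agreement is between", return_tensors="pt").input_ids
outputs = model.generate(input_ids, max_new_tokens=10)
print(tokenizer.decode(outputs[0]))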
I hope this helps in your explorations @rkbelew
I am getting the following error. Even though tokenizer_config.json is present in the repo, tokenizer.json is not.
Error: Could not download model artifacts
Caused by:
0: request error: HTTP status client error (404 Not Found) for url (https://huggingface.co/nlpaueb/legal-bert-base-uncased/resolve/main/tokenizer.json)
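In case it helps: that repo only ships the slow-tokenizer files (vocab.txt and tokenizer_config.json), so tools that fetch tokenizer.json directly will 404. One possible workaround, assuming your tool can point at a local directory, is to let transformers build the fast tokenizer and save it, which writes tokenizer.json:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("nlpaueb/legal-bert-base-uncased")  # converts vocab.txt to a fast tokenizer
tok.save_pretrained("./legal-bert-base-uncased")  # the saved files include tokenizer.json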