Max output tokens for Llama 3.1

I do not see any literature relating to the number of maximum output tokens supported by these models. Does anyone have any additional information?

Probably 8192. This is the base value from which they scaled the context.

In config.json:
"max_position_embeddings": 131072
So it's 128K.
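
You can also read that value straight from the Hub. A quick sketch (the repo id is an assumption here, and the meta-llama repos are gated, so you need access):

```python
# Sketch: reading max_position_embeddings programmatically.
# The repo id is illustrative; swap in the flavor you're using.
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("meta-llama/Meta-Llama-3.1-405B-Instruct")
print(cfg.max_position_embeddings)  # 131072 across the Llama 3.1 family
```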

That's the context window, not necessarily the max output tokens it's trained to produce as a response.

@abhirup-sainapse it's 4096. I couldn't find it documented anywhere, so I did a binary search to figure out where it raises an exception. I don't know why it's so hard to find the max output tokens and/or parameters for most models.
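
Roughly what I did, as a sketch. `try_generation` is a placeholder for whatever inference API you're hitting; the only assumption is that it fails when `max_tokens` is set above the server's limit:

```python
# Sketch of binary-searching the largest accepted max_tokens value.
# try_generation() is a placeholder -- wire it up to your own endpoint.

def try_generation(max_tokens: int) -> bool:
    """Return True if a request with this max_tokens value is accepted.

    Placeholder: replace the body with a real call (e.g. to an
    OpenAI-compatible endpoint serving Llama 3.1) and return False
    when it rejects the request for max_tokens being too large.
    """
    raise NotImplementedError("plug in your inference endpoint here")


def find_max_output_tokens(low: int = 1, high: int = 131072) -> int:
    """Binary-search the largest max_tokens the endpoint will accept."""
    best = low
    while low <= high:
        mid = (low + high) // 2
        if try_generation(mid):
            best = mid      # accepted: try a larger value
            low = mid + 1
        else:
            high = mid - 1  # rejected: try a smaller value
    return best
```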

@chrischain, my understanding of how LLMs work is that they iteratively predict the next token. This would mean the model is not trained to produce multiple tokens at once; rather, each pass through the model generates one token within the context length, given all previous tokens. Wouldn't this mean that max output tokens is always [total context] - [input tokens]? (See the sketch below.)

Or are you saying that the post-training dataset only includes examples where the assistant responds with at most X tokens, so the model is more likely to output the eos_token at or before that point?
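
To make the budget arithmetic in my first paragraph concrete, here's a quick sketch. The repo id is an assumption (the meta-llama repos are gated, so you need access), and only the tokenizer is needed:

```python
# Sketch of the "output budget = context window - input tokens" view.
from transformers import AutoTokenizer

MODEL_ID = "meta-llama/Meta-Llama-3.1-8B-Instruct"  # assumed repo id, for illustration
CONTEXT_WINDOW = 131072  # max_position_embeddings from config.json

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

prompt = "Summarize the following document: ..."
input_tokens = len(tokenizer(prompt).input_ids)

# Purely by the context-window math, the remaining generation budget would be:
max_new_tokens = CONTEXT_WINDOW - input_tokens
print(f"input tokens: {input_tokens}, theoretical output budget: {max_new_tokens}")
```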

@icahill, with an instruction-tuned model (like this one) the training data is typically structured in a multi-turn conversation format. In that case, the max output tokens would be the maximum response size it saw during training. Since the context was scaled up from 8192 (via RoPE), we can safely assume the maximum prompt it saw was 4096 tokens and the maximum output it saw was 4096 tokens.
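
In practice that means capping `max_new_tokens` at 4096 when calling the model. A minimal sketch (the 8B Instruct repo id is just for illustration, and this assumes a recent transformers version that accepts chat-format messages in the text-generation pipeline):

```python
# Sketch: capping the response at the presumed 4096-token trained limit.
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",  # assumed repo id; swap in your flavor
)

messages = [{"role": "user", "content": "Explain RoPE scaling in two sentences."}]
out = pipe(messages, max_new_tokens=4096)  # cap at the presumed max trained response length

# With chat-format input, generated_text holds the full conversation;
# the last message is the assistant's reply.
print(out[0]["generated_text"][-1]["content"])
```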

Thanks @chrischain, that makes sense.

Is the 4096 figure true for all the Llama 3.1 model flavors, or only for the 405B Instruct?
