
How to run inference with this model?

#3
by frankgu3528 - opened

Can we directly use vLLM to do inference for this model?

Hi!

Our model uses exactly the same architecture as Llama-3, so technically you should be able to use vLLM just like with Llama-3 (though we haven't tested it and are not sure whether vLLM will affect precision in long-context applications).
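For reference, a minimal vLLM sketch; the repo id is a placeholder, and the context-length and memory settings are assumptions you may need to tune (as noted above, this path is untested):

```python
# Minimal vLLM inference sketch. MODEL_ID is a placeholder for this repo's id,
# and max_model_len / gpu_memory_utilization are assumptions, not tested settings.
from vllm import LLM, SamplingParams

MODEL_ID = "princeton-nlp/<this-model>"  # replace with the actual repo id

llm = LLM(
    model=MODEL_ID,
    max_model_len=131_072,        # raise toward 512k only if your GPU memory allows it
    gpu_memory_utilization=0.95,
)
params = SamplingParams(temperature=0.0, max_tokens=256)
outputs = llm.generate(["Summarize the following document: ..."], params)
print(outputs[0].outputs[0].text)
```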

@princeton-nlp How do you reckon quantization will affect the model's performance on long-context tasks? When you compare against proprietary models, you're likely comparing against optimized quants served at inference time. Do you plan to run, or have you already run, any tests of how data-aware quants affect your results on HELMET?

A little bit of context for the question:

I did a small study into "how many unique vocab tokens from Llama 3 8B are likely to be used in 99.9% of English-language conversations spanning 512k context".

I used a study from 2014 [focusing on text from the 2000s] to estimate how Heaps' law and Zipf's law would affect the number of unique tokens used in 99.9% of English-language conversations (this of course has limitations due to typos, the use of symbols in code, etc.).
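To make the extrapolation step concrete, here is a toy sketch of the Heaps'-law part; the coefficient K and exponent beta are assumed "typical English" values, not the numbers fitted in the 2014 study:

```python
# Toy Heaps' law extrapolation: V(N) = K * N**beta estimates the number of
# unique word types appearing in a corpus of N tokens. K and BETA below are
# assumptions (typical published ranges for English), not fitted values.
K = 40.0      # assumed coefficient
BETA = 0.5    # assumed exponent (English is usually quoted around 0.4-0.6)

N = 512_000   # conversation length in tokens

unique_types = K * N ** BETA
print(f"Estimated unique word types in a {N:,}-token conversation: {unique_types:,.0f}")
```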

After filtering for alphanumerics and common symbols, the Llama 3 vocab drops to 50,483 tokens.

I ran it against a combination of English-language dictionaries (corpora/words & corpora/wordnet via Python's nltk) and used spaCy for part-of-speech (POS) tagging to improve robustness. This is the result I got (after filtering down to the 50,483 tokens that contain only alphanumerics and common symbols, i.e. `re.fullmatch(r'[A-Za-z0-9!"#$%&\'()*+,\-./:;<=>?@\[\\\]^_{|}~]+', token)`).

[image: results of the dictionary + POS + regex filtering]

So only 15.4% of Llama 3 tokens (taking the dictionary check, POS tagging, and regex filter together) are part of regular language.
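For anyone who wants to reproduce the idea, here is a rough sketch of the filtering pipeline; the tokenizer repo, the handling of the byte-level "Ġ" space marker, and the omission of the spaCy POS step are my simplifications here, not the exact script I used:

```python
# Rough sketch of the vocab-filtering pipeline: regex filter first, then a
# dictionary check via nltk. The spaCy POS-tagging step is omitted, and the
# "Ġ" space-marker handling is a simplifying assumption.
import re

import nltk
from nltk.corpus import words, wordnet
from transformers import AutoTokenizer

nltk.download("words")
nltk.download("wordnet")

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")  # gated repo

# Step 1: keep only tokens made of alphanumerics and common symbols.
pattern = re.compile(r'[A-Za-z0-9!"#$%&\'()*+,\-./:;<=>?@\[\\\]^_{|}~]+')
filtered = [
    tok for tok in tokenizer.get_vocab()
    if pattern.fullmatch(tok.lstrip("Ġ"))     # drop the leading space marker first
]

# Step 2: check the survivors against English dictionaries.
english = {w.lower() for w in words.words()}

def in_dictionary(token: str) -> bool:
    t = token.lstrip("Ġ").lower()
    return t in english or bool(wordnet.synsets(t))

dictionary_hits = [tok for tok in filtered if in_dictionary(tok)]
print(f"{len(filtered)} tokens pass the regex filter; "
      f"{len(dictionary_hits)} of those look like regular English words.")
```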

Assuming an 8-bit quantized KV cache, this analysis led to a worst-case estimate of 12 GB of VRAM being needed (4 GB for the KV cache and 8 GB for 4-bit quantized weights), with a best case of 10.14 GB of VRAM if the token scaling is close to the lexicon scaling from the study above.

This could lead to some interesting results (if you're following current research such as GIVE and Think on KG): your fantastic model could be used as part of a larger system that keeps a long conversation context going. However, to make such a system available on last-generation hardware (A100, 2x L40S), quantization would be necessary: the low VRAM usage matters, as does the compute speed-up from static data-aware quantization (bnb-style on-the-fly quants would start choking on that hardware's compute limits at long context, even with vLLM's prompt caching, paged attention, chunked prefill, etc.). A rough sketch of the serving setup I have in mind follows.
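This is a hedged vLLM sketch of that setup with 4-bit AWQ weights plus an 8-bit (FP8) KV cache; the quantized checkpoint id is hypothetical, the settings are assumptions for a single A100, and whether accuracy holds at 512k is exactly the open question:

```python
# Hypothetical serving config: 4-bit AWQ weights + FP8 KV cache in vLLM.
# The checkpoint id is a placeholder; all settings are illustrative assumptions.
from vllm import LLM

llm = LLM(
    model="your-org/<this-model>-awq",   # hypothetical pre-quantized AWQ checkpoint
    quantization="awq",
    kv_cache_dtype="fp8",                # the 8-bit KV cache assumed in the estimate above
    max_model_len=524_288,
    enable_chunked_prefill=True,
    gpu_memory_utilization=0.95,
)
```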

The estimate above of course rests on quite a few assumptions (but I tried to build in worst-case scenarios).

Hence it would be fantastic to know the performance deterioration when using, for example, this type of (static) data-aware quant.

Would it necessitate running the quantization with a close-to-512k max sequence length when sampling from the calibration dataset? (A sketch of what I mean is below.)
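Here is a sketch of packing streamed text into near-512k-token calibration samples before handing them to a data-aware quantizer; the dataset, sample count, and target length are assumptions, and the quantizer API itself varies by library:

```python
# Sketch: pack streamed text into a handful of ~512k-token calibration samples.
# Dataset choice, N_SAMPLES, and TARGET_LEN are assumptions for illustration only.
from datasets import load_dataset
from transformers import AutoTokenizer

TARGET_LEN = 524_288   # close-to-512k tokens per calibration sample
N_SAMPLES = 8          # data-aware quants usually calibrate on a handful of samples

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
stream = load_dataset("wikitext", "wikitext-103-raw-v1", split="train", streaming=True)

samples, buffer = [], []
for row in stream:
    buffer.extend(tokenizer.encode(row["text"], add_special_tokens=False))
    while len(buffer) >= TARGET_LEN and len(samples) < N_SAMPLES:
        samples.append(buffer[:TARGET_LEN])   # one close-to-512k calibration sequence
        buffer = buffer[TARGET_LEN:]
    if len(samples) == N_SAMPLES:
        break

# `samples` would then be fed to the quantizer's calibration step
# (AWQ/GPTQ-style tooling); the exact API differs between libraries.
```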

Anyway, to sum it up: I'd be interested to hear your intuitions and experience on this, and I'm happy to share my code and methodology. We're an early-stage start-up aiming to provide open and explainable AI systems that aren't held behind guarded walls, using a combination of symbolic/graph methods and LLMs. The AI R&D team currently consists of only two people, so my time is unfortunately quite limited (I'll try to find time to run some tests next weekend or so, but I'd like your opinion on where to start).

Thanks for your time and congratulations on the fantastic work!
