Complete how-to-use
README.md CHANGED
````diff
@@ -17,7 +17,8 @@ Faro-Yi-9B-200K is an improved [Yi-9B-200K](https://huggingface.co/01-ai/Yi-9B-2
 
 ## How to Use
 
-Faro-Yi-9B-200K uses chatml template. I recommend
+Faro-Yi-9B-200K uses the chatml template and performs well in both short and long contexts. For longer inputs, I recommend using vLLM, which handles a max prompt of 32K under 24GB of VRAM. Setting `kv_cache_dtype="fp8_e5m2"` allows for a 48K input length. Adding 4-bit AWQ quantization on top of that can boost the input length to 160K, albeit with some performance impact. Adjust the `max_model_len` arg in vLLM or `config.json` to avoid OOM.
+
 
 ```python
 import io
@@ -25,7 +26,7 @@ import requests
 from PyPDF2 import PdfReader
 from vllm import LLM, SamplingParams
 
-llm = LLM(model="wenbopan/Faro-Yi-9B-200K")
+llm = LLM(model="wenbopan/Faro-Yi-9B-200K", kv_cache_dtype="fp8_e5m2", max_model_len=100000)
 
 pdf_data = io.BytesIO(requests.get("https://arxiv.org/pdf/2303.08774.pdf").content)
 document = "".join(page.extract_text() for page in PdfReader(pdf_data).pages) # 100 pages
@@ -39,6 +40,7 @@ print(output[0].outputs[0].text)
 # Faro-Yi-9B-200K: GPT-4 does not have a publicly disclosed parameter count due to the competitive landscape and safety implications of large-scale models like GPT-4. ...
 ```
 
+
 <details> <summary>Or With Transformers</summary>
 
 ```python
````
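The chatml template referenced in the updated paragraph wraps each conversation turn in `<|im_start|>`/`<|im_end|>` markers. A minimal sketch of that prompt format (my own illustration for context, not code from this commit — in practice a tokenizer's built-in chat template would produce this string):

```python
def format_chatml(messages):
    """Render a list of {"role": ..., "content": ...} dicts as a chatml prompt."""
    # Each turn becomes: <|im_start|>{role}\n{content}<|im_end|>
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>" for m in messages]
    # Trailing assistant header cues the model to generate its reply.
    parts.append("<|im_start|>assistant\n")
    return "\n".join(parts)

prompt = format_chatml([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "How many parameters does GPT-4 have?"},
])
print(prompt)
```

The string built this way is what the vLLM example above ultimately feeds the model after applying the chat template to the question about the PDF.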