Model Info (Internal):

- Size: 7B

- Dataset: The Pile v2
  - `contaminated(P3) + lower_code(5%) + wiki(fixed) + books3(fixed & broken)`

- Batch size (in tokens): 8M (see the sketch after this list)

- Checkpoint path (AWS East): `/fsx/ckpts/7b_tok=neox_data=pilev2-recontam_lower-code_bs=8m_tp=4_pp=1_init=wang-small-init/global_step69000_hf`

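A quick sanity check of what an 8M-token global batch means in sequences. The sequence length below is an assumption for illustration only; it is not stated in this card:

```python
# Back-of-the-envelope only; seq_len is an assumed value, not from this card.
batch_size_tokens = 8 * 1024 * 1024  # reading "8M" as 8 * 2**20
seq_len = 4096                       # assumed training sequence length
print(batch_size_tokens // seq_len)  # -> 2048 sequences per global batch
```
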
Notes:

- Trained for 36k steps on an incorrectly tokenized Books3 dataset (GPT-2 tokenizer instead of the NeoX tokenizer); see the sketch after this list

- tp=2 (not tp=4 as the checkpoint path suggests)

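To make the Books3 note concrete, a minimal sketch of the mismatch, assuming the Hub checkpoints `gpt2` and `EleutherAI/gpt-neox-20b` as stand-ins for the two tokenizers (the exact tokenizer checkpoints used in training are not specified here):

```python
from transformers import AutoTokenizer

# Assumed stand-ins; the training run's exact tokenizers are not given here.
gpt2_tok = AutoTokenizer.from_pretrained("gpt2")
neox_tok = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")

text = "The dog sat on a man's lap and barked 3 times."
gpt2_ids = gpt2_tok(text)["input_ids"]
neox_ids = neox_tok(text)["input_ids"]

# The two vocabularies assign different ids, so text encoded with GPT-2
# but trained against the NeoX vocabulary maps to the wrong tokens.
print(gpt2_ids == neox_ids)  # False in general
```
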
W&B Report: https://stability.wandb.io/stability-llm/stable-lm/reports/StableLM-7B-alpha---Vmlldzo2MjA

Usage:

```python
import transformers

# Load the model and tokenizer from the Hugging Face Hub.
model = transformers.AutoModelForCausalLM.from_pretrained("CarperAI/7b-alpha")
tokenizer = transformers.AutoTokenizer.from_pretrained("CarperAI/7b-alpha")

# Pad on the left so generation continues from the end of each prompt.
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"

prompts = [
    "User1: The dog sat on a man's lap and barked 3 times.\nUser2: How many times did the dog bark?",
    "Curious Person Question: A group of genetically identical individuals is called what?\nSmart Person Answer: a clone\n\nCurious Person Question: Who proposed the theory of evolution by natural selection?\nSmart Person Answer:",
]
batch_encoding = tokenizer(prompts, return_tensors="pt", padding=True)

print(f"Generating {len(prompts)} prompts...")
samples = model.generate(
    **batch_encoding,
    max_new_tokens=64,
    do_sample=False,  # greedy decoding; temperature only applies when sampling
    pad_token_id=tokenizer.pad_token_id,  # silences the missing-pad-token warning
)
samples = tokenizer.batch_decode(samples, skip_special_tokens=True)
for prompt, sample in zip(prompts, samples):
    print(f"Prompt: {prompt}\nSample: {sample}\n")
```
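
For non-greedy completions, the same call can sample instead; the temperature and top-p values below are illustrative defaults, not settings from this card:

```python
# Sampled generation (illustrative hyperparameters, not from this card).
samples = model.generate(
    **batch_encoding,
    max_new_tokens=64,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    pad_token_id=tokenizer.pad_token_id,
)
print(tokenizer.batch_decode(samples, skip_special_tokens=True))
```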