Update README.md
README.md CHANGED
@@ -96,6 +96,28 @@ print(generated_text)
# wooden deck. The deck's planks, which are a mix of light and dark brown with ...
```

To make inference more efficient, run with autocast:

```python
# `model`, `processor`, and `inputs` come from the quick-start above.
with torch.autocast(device_type="cuda", enabled=True, dtype=torch.bfloat16):
    output = model.generate_from_batch(
        inputs,
        GenerationConfig(max_new_tokens=200, stop_strings="<|endoftext|>"),
        tokenizer=processor.tokenizer,
    )
```

We did most of our evaluations in this setting (autocast on, but float32 weights).
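
These snippets pick up where the quick-start earlier in the README leaves off. For reference, a minimal self-contained setup might look like the sketch below; the checkpoint name, image path, and prompt are illustrative assumptions, not part of this change:

```python
# Hypothetical setup sketch mirroring the quick-start earlier in the README.
# The checkpoint name, image, and prompt here are illustrative assumptions.
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig

repo = "allenai/Molmo-7B-D-0924"  # assumed checkpoint
processor = AutoProcessor.from_pretrained(
    repo, trust_remote_code=True, torch_dtype="auto", device_map="auto"
)
model = AutoModelForCausalLM.from_pretrained(
    repo, trust_remote_code=True, torch_dtype="auto", device_map="auto"
)

# Process one image + prompt into the batch dict that generate_from_batch expects.
inputs = processor.process(images=[Image.open("photo.jpg")], text="Describe this image.")
inputs = {k: v.to(model.device).unsqueeze(0) for k, v in inputs.items()}
```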

To further reduce memory requirements, the model can be run with bfloat16 weights:

```python
# Cast the weights and the image inputs to bfloat16 before generating.
model.to(dtype=torch.bfloat16)
inputs["images"] = inputs["images"].to(torch.bfloat16)
output = model.generate_from_batch(
    inputs,
    GenerationConfig(max_new_tokens=200, stop_strings="<|endoftext|>"),
    tokenizer=processor.tokenizer,
)
```

Note that this can sometimes change the output of the model compared to running with float32 weights.
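
A quick way to see whether the casts changed anything is to decode both generations and compare them. This is a minimal sketch, not from this commit; `output_fp32` and `output_bf16` are assumed names for the `output` tensors produced by the two snippets above:

```python
# Compare the float32/autocast generation against the bfloat16-weight generation.
# Generated tokens follow the prompt, as in the README's earlier decoding example.
prompt_len = inputs["input_ids"].size(1)

text_fp32 = processor.tokenizer.decode(output_fp32[0, prompt_len:], skip_special_tokens=True)
text_bf16 = processor.tokenizer.decode(output_bf16[0, prompt_len:], skip_special_tokens=True)

print("outputs identical:", text_fp32 == text_bf16)
```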

## Evaluations

| Model | Average Score on 11 Academic Benchmarks | Human Preference Elo Rating |
|