Update README.md
README.md CHANGED
@@ -92,6 +92,27 @@ print(generated_text)
 # The puppy is positioned in the center of the frame, looking up at the camera...
 ```
 
+To make inference more efficient, run with autocast:
+
+with torch.autocast(device_type="cuda", enabled=True, dtype=torch.bfloat16):
+    output = model.generate_from_batch(
+        inputs,
+        GenerationConfig(max_new_tokens=200, stop_strings="<|endoftext|>"),
+        tokenizer=processor.tokenizer
+    )
+We did most of our evaluation in this setting (autocast on, but float32 weights).
+
+To further reduce the memory requirements, the model can be run with bfloat16 weights:
+
+model.to(dtype=torch.bfloat16)
+inputs["images"] = inputs["images"].to(torch.bfloat16)
+output = model.generate_from_batch(
+    inputs,
+    GenerationConfig(max_new_tokens=200, stop_strings="<|endoftext|>"),
+    tokenizer=processor.tokenizer
+)
+Note that we have observed that this can change the output of the model compared to running with float32 weights.
+
 ## Evaluations
 
 | Model | Average Score on 11 Academic Benchmarks | Human Preference Elo Rating |
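
For reference, here is a self-contained sketch that runs the two settings added above (autocast over float32 weights vs. bfloat16 weights) on a single prompt and compares the outputs. It is an illustration, not part of the diff: the model id `allenai/Molmo-7B-D-0924`, the image URL, and the prompt are assumptions, and the loading/preprocessing lines mirror the quick-start that presumably appears earlier in the README. Only the `generate_from_batch`, `GenerationConfig`, and `processor.tokenizer` usage is taken from the change above.

```python
# Hedged sketch: compare autocast-with-float32-weights against bfloat16 weights.
# ASSUMPTIONS: the model id, image URL, and prompt are illustrative only; the
# loading/preprocessing mirrors the quick-start presumably shown earlier in the README.
import requests
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig

MODEL_ID = "allenai/Molmo-7B-D-0924"  # assumption: the diff does not name the model

processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, trust_remote_code=True, device_map="auto"  # float32 weights by default
)

# Build a single-example batch (image + prompt) on the model's device.
image = Image.open(requests.get("https://picsum.photos/id/237/536/354", stream=True).raw)
inputs = processor.process(images=[image], text="Describe this image.")
inputs = {k: v.to(model.device).unsqueeze(0) for k, v in inputs.items()}

config = GenerationConfig(max_new_tokens=200, stop_strings="<|endoftext|>")

def decode(output):
    # Drop the prompt tokens and decode only the newly generated text.
    generated = output[0, inputs["input_ids"].size(1):]
    return processor.tokenizer.decode(generated, skip_special_tokens=True)

# 1) float32 weights under autocast (the setting used for most of the evaluation).
with torch.autocast(device_type="cuda", enabled=True, dtype=torch.bfloat16):
    text_autocast = decode(
        model.generate_from_batch(inputs, config, tokenizer=processor.tokenizer)
    )

# 2) bfloat16 weights (lower memory; output may differ from the float32 run).
model.to(dtype=torch.bfloat16)
inputs["images"] = inputs["images"].to(torch.bfloat16)
text_bf16 = decode(model.generate_from_batch(inputs, config, tokenizer=processor.tokenizer))

print("autocast/float32:", text_autocast)
print("bfloat16 weights:", text_bf16)
print("identical output:", text_autocast == text_bf16)
```

Because bfloat16 keeps far fewer mantissa bits than float32, small numeric differences can flip individual token choices during generation, which is why the two decoded strings may not match exactly.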