Add idefics instruct comparison table
README.md
@@ -257,20 +257,6 @@ We note that since IDEFICS was trained on PMD (which contains COCO), the evaluat
 | | 16 | 57.0 | 48.4 | 27.9 | 42.6 | 67.4 | 99.7 | 89.4 | 64.5 | - | 50.9 | - | 67.8 | - |
 | | 32 | 57.9 | 49.6 | 28.3 | 43.7 | 68.1 | 98.0 | 90.5 | 64.4 | - | 49.8 | - | 67.0 | - |
 
-Idefics Instruct Evaluations:
-| Model | Shots | <nobr>VQAv2<br>OE VQA acc.</nobr> | <nobr>OKVQA<br>OE VQA acc.</nobr> | <nobr>TextVQA<br>OE VQA acc.</nobr> | <nobr>VizWiz<br>OE VQA acc.</nobr> | <nobr>TextCaps<br>CIDEr</nobr> | <nobr>Coco<br>CIDEr</nobr> | <nobr>NoCaps<br>CIDEr</nobr> | <nobr>Flickr<br>CIDEr</nobr> | <nobr>VisDial<br>NDCG</nobr> | <nobr>HatefulMemes<br>ROC AUC</nobr> | <nobr>ScienceQA<br>acc.</nobr> | <nobr>RenderedSST2<br>acc.</nobr> | <nobr>Winoground<br>group (text/image)</nobr> |
-|:---------------------|--------:|---------------------:|---------------------:|-----------------------:|----------------------:|-------------------:|---------------:|-----------------:|-----------------:|-----------------:|-------------------------:|-----------------------:|--------------------------:|----------------------------------:|
-| 80B IDEFICS Instruct | 0 | 37.4 | 36.9 | 32.9 | 26.2 | 76.5 | 117.2 | 104.5 | 65.3 | 49.3 | 58.9 | 69.5 | 67.3 | 9.2 (20.0/25.0) |
-| | 4 | 67.5 | 54.0 | 37.8 | 39.8 | 71.7 | 116.9 | 104.0 | 67.1 | 48.9 | 57.5 | 60.5 | 65.5 | - |
-| | 8 | 68.1 | 56.9 | 38.2 | 44.8 | 72.7 | 116.8 | 104.8 | 70.7 | 48.2 | 58.0 | - | 68.6 | - |
-| | 16 | 68.6 | 58.2 | 39.1 | 48.7 | 77.0 | 120.5 | 107.4 | 76.0 | - | 56.4 | - | 70.1 | - |
-| | 32 | 68.8 | 59.5 | 39.3 | 51.2 | 79.7 | 123.2 | 108.4 | 78.4 | - | 54.9 | - | 70.5 | - |
-| 9B IDEFICS Instruct | 0 | 65.8 | 46.1 | 29.2 | 41.2 | 67.1 | 129.1 | 101.1 | 71.9 | 49.2 | 53.5 | 60.6 | 62.8 | 5.8 (20.0/18.0) |
-| | 4 | 66.2 | 48.7 | 31.0 | 39.0 | 68.2 | 128.2 | 100.9 | 74.8 | 48.9 | 51.8 | 53.8 | 60.6 | - |
-| | 8 | 66.5 | 50.8 | 31.0 | 41.9 | 70.0 | 128.8 | 101.5 | 75.5 | 48.2 | 51.7 | - | 61.3 | - |
-| | 16 | 66.8 | 51.7 | 31.6 | 44.8 | 70.2 | 128.8 | 101.5 | 75.8 | - | 51.7 | - | 63.3 | - |
-| | 32 | 66.9 | 52.3 | 32.0 | 46.0 | 71.7 | 127.8 | 101.0 | 76.3 | - | 50.8 | - | 60.9 | - |
-
 For ImageNet-1k, we also report results where the priming samples are selected to be similar (i.e. close in a vector space) to the queried instance. This is the Retrieval-based In-Context Example Selection (RICES in short) approach introduced by [Yang et al. (2021)](https://arxiv.org/abs/2109.05014).
 
 | Model | Shots | Support set size | Shots selection | ImageNet-1k<br>Top-1 acc. |
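The RICES selection described in the unchanged ImageNet-1k paragraph of this hunk (priming samples chosen because they are close to the query in a vector space) can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the evaluation code used for IDEFICS: the embeddings are assumed to be precomputed by some encoder (the README does not say which), cosine similarity is assumed as the closeness measure, and `cosine_similarity` / `rices_select` are hypothetical helper names.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rices_select(
    query_emb: Sequence[float],
    support_embs: Sequence[Sequence[float]],
    k: int,
) -> List[int]:
    """Hypothetical helper: indices of the k support examples most similar to the query.

    Assumes embeddings were precomputed by an (unspecified) encoder.
    """
    ranked = sorted(
        range(len(support_embs)),
        key=lambda i: cosine_similarity(query_emb, support_embs[i]),
        reverse=True,
    )
    return ranked[:k]


# Toy 2-D "embeddings" for a support set of four examples.
support = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
print(rices_select([1.0, 0.0], support, k=2))  # [0, 2]
```

By contrast, random shot selection would draw the k indices uniformly from the support set regardless of the query. How the retrieved shots are ordered in the prompt is a separate design choice that the README does not specify, so the sketch simply returns them from most to least similar.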
@@ -300,7 +286,37 @@ Fairness Evaluations:
 
 Similarly to the base IDEFICS models, we performed checkpoint selection to stop the training. Given that M3IT contains in the training set a handful of the benchmarks we were evaluating on, we used [MMBench](https://huggingface.co/papers/2307.06281) as a held-out validation benchmark to perform checkpoint selection. We select the checkpoint at step 3'000 for IDEFICS-80b-instruct and at step 8'000 for IDEFICS-9b-instruct.
 
-
+Idefics Instruct Evaluations:
+| Model | Shots | <nobr>VQAv2<br>OE VQA acc.</nobr> | <nobr>OKVQA<br>OE VQA acc.</nobr> | <nobr>TextVQA<br>OE VQA acc.</nobr> | <nobr>VizWiz<br>OE VQA acc.</nobr> | <nobr>TextCaps<br>CIDEr</nobr> | <nobr>Coco<br>CIDEr</nobr> | <nobr>NoCaps<br>CIDEr</nobr> | <nobr>Flickr<br>CIDEr</nobr> | <nobr>VisDial<br>NDCG</nobr> | <nobr>HatefulMemes<br>ROC AUC</nobr> | <nobr>ScienceQA<br>acc.</nobr> | <nobr>RenderedSST2<br>acc.</nobr> | <nobr>Winoground<br>group (text/image)</nobr> |
+|:---------------------|--------:|---------------------:|---------------------:|-----------------------:|----------------------:|-------------------:|---------------:|-----------------:|-----------------:|-----------------:|-------------------------:|-----------------------:|--------------------------:|----------------------------------:|
+| 80B IDEFICS Instruct | 0 | 37.4 | 36.9 | 32.9 | 26.2 | 76.5 | 117.2 | 104.5 | 65.3 | 49.3 | 58.9 | 69.5 | 67.3 | 9.2 (20.0/25.0) |
+| | 4 | 67.5 | 54.0 | 37.8 | 39.8 | 71.7 | 116.9 | 104.0 | 67.1 | 48.9 | 57.5 | 60.5 | 65.5 | - |
+| | 8 | 68.1 | 56.9 | 38.2 | 44.8 | 72.7 | 116.8 | 104.8 | 70.7 | 48.2 | 58.0 | - | 68.6 | - |
+| | 16 | 68.6 | 58.2 | 39.1 | 48.7 | 77.0 | 120.5 | 107.4 | 76.0 | - | 56.4 | - | 70.1 | - |
+| | 32 | 68.8 | 59.5 | 39.3 | 51.2 | 79.7 | 123.2 | 108.4 | 78.4 | - | 54.9 | - | 70.5 | - |
+<br>
+| 9B IDEFICS Instruct | 0 | 65.8 | 46.1 | 29.2 | 41.2 | 67.1 | 129.1 | 101.1 | 71.9 | 49.2 | 53.5 | 60.6 | 62.8 | 5.8 (20.0/18.0) |
+| | 4 | 66.2 | 48.7 | 31.0 | 39.0 | 68.2 | 128.2 | 100.9 | 74.8 | 48.9 | 51.8 | 53.8 | 60.6 | - |
+| | 8 | 66.5 | 50.8 | 31.0 | 41.9 | 70.0 | 128.8 | 101.5 | 75.5 | 48.2 | 51.7 | - | 61.3 | - |
+| | 16 | 66.8 | 51.7 | 31.6 | 44.8 | 70.2 | 128.8 | 101.5 | 75.8 | - | 51.7 | - | 63.3 | - |
+| | 32 | 66.9 | 52.3 | 32.0 | 46.0 | 71.7 | 127.8 | 101.0 | 76.3 | - | 50.8 | - | 60.9 | - |
+
+IDEFICS vs IDEFICS-instruct.
+| Model | Shots | <nobr>VQAv2<br>OE VQA acc.</nobr> | <nobr>OKVQA<br>OE VQA acc.</nobr> | <nobr>TextVQA<br>OE VQA acc.</nobr> | <nobr>VizWiz<br>OE VQA acc.</nobr> | <nobr>TextCaps<br>CIDEr</nobr> | <nobr>Coco<br>CIDEr</nobr> | <nobr>NoCaps<br>CIDEr</nobr> | <nobr>Flickr<br>CIDEr</nobr> | <nobr>VisDial<br>NDCG</nobr> | <nobr>HatefulMemes<br>ROC AUC</nobr> | <nobr>ScienceQA<br>acc.</nobr> | <nobr>RenderedSST2<br>acc.</nobr> | <nobr>Winoground<br>group (text/image)</nobr> |
+|:----------------------------------------|:--------|---------------------:|---------------------:|-----------------------:|----------------------:|-------------------:|---------------:|-----------------:|-----------------:|-----------------:|-------------------------:|-----------------------:|--------------------------:|----------------------------------:|
+| Difference IDEFICS 80B Base vs Instruct | 0 | -22.7 | -8.2 | 1.9 | -9.8 | 19.7 | 25.4 | 39.5 | 11.7 | 0.4 | -1.7 | 0.5 | 6.8 | 1.2 |
+| | 4 | 4.0 | 1.7 | 3.5 | -0.7 | - | 6.6 | 4.4 | -6.6 | 0.5 | -0.3 | 1.6 | -1.1 | - |
+| | 8 | 3.4 | 1.8 | 2.5 | -1.3 | -4.9 | 2.5 | -0.9 | -5.9 | 0.3 | -0.2 | - | 0.8 | - |
+| | 16 | 3.2 | 1.4 | 2.8 | 0.4 | -4.5 | 4.0 | 0.4 | -4.1 | - | 0.7 | - | 2.4 | - |
+| | 32 | 2.9 | 1.8 | 2.6 | 1.2 | -3.0 | 6.5 | 1.0 | -2.7 | - | 2.4 | - | 3.2 | - |
+| Average Difference 80B | | -1.8 | -0.3 | 2.6 | -2.0 | 1.3 | 9.0 | 8.9 | -1.5 | 0.4 | 0.2 | 1.1 | 2.4 | 1.2 |
+<br>
+| Difference IDEFICS 9B Base vs Instruct | 0 | 15.0 | 7.6 | 3.3 | 5.6 | 41.7 | 83.0 | 64.3 | 44.6 | 0.5 | 1.8 | 16.4 | 1.0 | 0.8 |
+| | 4 | 10.8 | 3.3 | 3.4 | 2.1 | 8.2 | 35.1 | 19.6 | 15.0 | 1.0 | 1.1 | 16.4 | -1.8 | - |
+| | 8 | 10.2 | 3.1 | 3.5 | 1.6 | 6.7 | 31.8 | 14.8 | 13.6 | 0.6 | 0.6 | - | -4.9 | - |
+| | 16 | 9.8 | 3.3 | 3.7 | 2.3 | 2.7 | 29.1 | 12.2 | 11.4 | - | 0.7 | - | -4.6 | - |
+| | 32 | 9.0 | 2.7 | 3.7 | 2.2 | 3.6 | 29.8 | 10.5 | 11.9 | - | 1.0 | - | -6.1 | - |
+| Average Difference 9B | | 10.9 | 4.0 | 3.5 | 2.8 | 12.6 | 41.8 | 24.3 | 19.3 | 0.7 | 1.0 | 16.4 | -3.3 | 0.8 |
 
 # Technical Specifications
 
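The "Average Difference" rows in the IDEFICS vs IDEFICS-instruct table appear to be the arithmetic mean of the per-shot differences, taken only over the shot settings that report a value ("-" cells skipped). This reading is inferred from the numbers, not stated in the commit: it reproduces, for example, the 80B COCO (9.0), VisDial (0.4) and Flickr (-1.5) averages, but not every cell (the 80B TextCaps average is one exception), so treat the sketch as an approximation. `average_difference` is a hypothetical helper, and `None` stands in for a "-" cell.

```python
from statistics import mean
from typing import Optional, Sequence


def average_difference(per_shot: Sequence[Optional[float]]) -> float:
    """Mean of per-shot differences, skipping missing ("-") entries, rounded to 1 decimal."""
    reported = [d for d in per_shot if d is not None]
    return round(mean(reported), 1)


# Per-shot values (0, 4, 8, 16, 32 shots) copied from the 80B block of the
# comparison table; None marks a "-" cell.
coco_80b = [25.4, 6.6, 2.5, 4.0, 6.5]
visdial_80b = [0.4, 0.5, 0.3, None, None]
flickr_80b = [11.7, -6.6, -5.9, -4.1, -2.7]

print(average_difference(coco_80b))     # 9.0
print(average_difference(visdial_80b))  # 0.4
print(average_difference(flickr_80b))   # -1.5
```

Skipping missing cells rather than counting them as zero matters for columns such as VisDial and ScienceQA, which are only reported for the lower shot counts.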