Leyo commited on
Commit
7d92337
1 Parent(s): 27edbef

fix legend + display

Browse files
Files changed (1) hide show
  1. README.md +5 -4
README.md CHANGED
@@ -243,15 +243,15 @@ As opposed to Flamingo, we did not train IDEFICS on video-text pairs datasets, a
243
 
244
  We note that since IDEFICS was trained on PMD (which contains COCO), the evaluation numbers on COCO are not directly comparable with Flamingo and OpenFlamingo since they did not explicitely have this dataset in the training mixture. Additionally, Flamingo is trained with images of resolution 320 x 320 while IDEFICS and OpenFlamingo were trained with images of 224 x 224 resolution.
245
 
246
- | Model | Shots | <nobr>VQAv2<br>OE VQA acc.</nobr> | <nobr>OKVQA<br>OE VQA acc.</nobr> | <nobr>TextVQA<br>OE VQA acc.</nobr> | <nobr>VizWiz<br>OE VQA acc.</nobr> | <nobr>TextCaps<br>CIDEr</nobr> | <nobr>Coco<br>CIDEr</nobr> | <nobr>NoCaps<br>CIDEr</nobr> | <nobr>Flickr<br>CIDEr</nobr> | <nobr>VisDial<br>NDCG</nobr> | <nobr>HatefulMemes<br>ROC AUC</nobr> | <nobr>ScienceQA<br>acc.</nobr> | <nobr>RenderedSST2<br>acc.</nobr> | <nobr>Winoground<br>group (text/image)</nobr> |
247
  |:------------|--------:|---------------------:|---------------------:|-----------------------:|----------------------:|-------------------:|---------------:|-----------------:|-----------------:|-----------------:|-------------------------:|-----------------------:|--------------------------:|----------------------------------:|
248
- | IDEFICS 80B | 0 | 60.0 | 45.2 | 30.9 | 36.0 | 56.8 | 91.8 | 65.0 | 53.7 | 48.8 | 60.6 | 68.9 | 60.5 | 8.0 (18.75/22.5)|
249
  | | 4 | 63.6 | 52.4 | 34.4 | 40.4 | 72.7 | 110.3 | 99.6 | 73.7 | 48.4 | 57.8 | 58.9 | 66.6 | - |
250
  | | 8 | 64.8 | 55.1 | 35.7 | 46.1 | 77.6 | 114.3 | 105.7 | 76.6 | 47.9 | 58.2 | - | 67.8 | - |
251
  | | 16 | 65.4 | 56.8 | 36.3 | 48.3 | 81.4 | 116.6 | 107.0 | 80.1 | - | 55.8 | - | 67.7 | - |
252
  | | 32 | 65.9 | 57.8 | 36.7 | 50.0 | 82.7 | 116.6 | 107.5 | 81.1 | - | 52.5 | - | 67.3 | - |
253
  <br>
254
- | IDEFICS 9B | 0 | 50.9 | 38.4 | 25.9 | 35.5 | 25.4 | 46.0 | 36.8 | 27.3 | 48.7 | 51.7 | 44.2 | 61.8 | 5.0 (16.8/20.8) |
255
  | | 4 | 55.4 | 45.5 | 27.6 | 36.9 | 60.0 | 93.0 | 81.3 | 59.7 | 47.9 | 50.7 | 37.4 | 62.3 | - |
256
  | | 8 | 56.4 | 47.7 | 27.5 | 40.4 | 63.2 | 97.0 | 86.8 | 61.9 | 47.6 | 51.0 | - | 66.3 | - |
257
  | | 16 | 57.0 | 48.4 | 27.9 | 42.6 | 67.4 | 99.7 | 89.4 | 64.5 | - | 50.9 | - | 67.8 | - |
@@ -271,7 +271,7 @@ For ImageNet-1k, we also report results where the priming samples are selected t
271
 
272
  Similarly to the base IDEFICS models, we performed checkpoint selection to stop the training. Given that M3IT contains in the training set a handful of the benchmarks we were evaluating on, we used [MMBench](https://huggingface.co/papers/2307.06281) as a held-out validation benchmark to perform checkpoint selection. We select the checkpoint at step 3'000 for IDEFICS-80b-instruct and at step 8'000 for IDEFICS-9b-instruct.
273
 
274
- | Model | Shots | <nobr>VQAv2 <br>OE VQA acc.</nobr> | <nobr>OKVQA <br>OE VQA acc.</nobr> | <nobr>TextVQA <br>OE VQA acc.</nobr> | <nobr>VizWiz<br>OE VQA acc.</nobr> | <nobr>TextCaps <br>CIDEr</nobr> | <nobr>Coco <br>CIDEr</nobr> | <nobr>NoCaps<br>CIDEr</nobr> | <nobr>Flickr<br>CIDEr</nobr> | <nobr>VisDial <br>NDCG</nobr> | <nobr>HatefulMemes<br>ROC AUC</nobr> | <nobr>ScienceQA <br>acc.</nobr> | <nobr>RenderedSST2<br>acc.</nobr> | <nobr>Winoground<br>group (text/image)</nobr> |
275
  | :--------------------- | --------: | ---------------------: | ---------------------: | -----------------------: | ----------------------: | -------------------: | ---------------: | -----------------: | -----------------: | -----------------: | -------------------------: | -----------------------: | --------------------------: | ----------------------------------: |
276
  | Finetuning data **does not** contain the evaluation dataset | - | &#10006; | &#10006; | &#10006; | &#10004; | &#10006; | &#10006; | &#10006; | &#10004; | &#10006; | &#10004; | &#10006; | &#10004; | &#10006; |
277
  | <nobr>IDEFICS 80B Instruct<br> | 0 | 37.4 (-22.7) | 36.9 (-8.2) | 32.9 (1.9) | 26.2 (-9.8) | 76.5 (19.7) | 117.2 (25.4) | 104.5 (39.5) | 65.3 (11.7) | 49.3 (0.4) | 58.9 (-1.7) | 69.5 (0.5) | 67.3 (6.8) | 9.2/20.0/25.0 (1.2/1.2/2.5) |
@@ -286,6 +286,7 @@ Similarly to the base IDEFICS models, we performed checkpoint selection to stop
286
  | | 16 | 66.8 (9.8) | 51.7 (3.3) | 31.6 (3.7) | 44.8 (2.3) | 70.2 (2.7) | 128.8 (29.1) | 101.5 (12.2) | 75.8 (11.4) | - | 51.7 (0.7) | - | 63.3 (-4.6) | - |
287
  | | 32 | 66.9 (9.0) | 52.3 (2.7) | 32.0 (3.7) | 46.0 (2.2) | 71.7 (3.6) | 127.8 (29.8) | 101.0 (10.5) | 76.3 (11.9) | - | 50.8 (1.0) | - | 60.9 (-6.1) | - |
288
 
 
289
 
290
  # Technical Specifications
291
 
 
243
 
244
  We note that since IDEFICS was trained on PMD (which contains COCO), the evaluation numbers on COCO are not directly comparable with Flamingo and OpenFlamingo since they did not explicitely have this dataset in the training mixture. Additionally, Flamingo is trained with images of resolution 320 x 320 while IDEFICS and OpenFlamingo were trained with images of 224 x 224 resolution.
245
 
246
+ | Model | Shots | <nobr>VQAv2<br>OE VQA acc.</nobr> | <nobr>OKVQA<br>OE VQA acc.</nobr> | <nobr>TextVQA<br>OE VQA acc.</nobr> | <nobr>VizWiz<br>OE VQA acc.</nobr> | <nobr>TextCaps<br>CIDEr</nobr> | <nobr>Coco<br>CIDEr</nobr> | <nobr>NoCaps<br>CIDEr</nobr> | <nobr>Flickr<br>CIDEr</nobr> | <nobr>VisDial<br>NDCG</nobr> | <nobr>HatefulMemes<br>ROC AUC</nobr> | <nobr>ScienceQA<br>acc.</nobr> | <nobr>RenderedSST2<br>acc.</nobr> | <nobr>Winoground<br>group/text/image</nobr> |
247
  |:------------|--------:|---------------------:|---------------------:|-----------------------:|----------------------:|-------------------:|---------------:|-----------------:|-----------------:|-----------------:|-------------------------:|-----------------------:|--------------------------:|----------------------------------:|
248
+ | IDEFICS 80B | 0 | 60.0 | 45.2 | 30.9 | 36.0 | 56.8 | 91.8 | 65.0 | 53.7 | 48.8 | 60.6 | 68.9 | 60.5 | 8.0/18.75/22.5|
249
  | | 4 | 63.6 | 52.4 | 34.4 | 40.4 | 72.7 | 110.3 | 99.6 | 73.7 | 48.4 | 57.8 | 58.9 | 66.6 | - |
250
  | | 8 | 64.8 | 55.1 | 35.7 | 46.1 | 77.6 | 114.3 | 105.7 | 76.6 | 47.9 | 58.2 | - | 67.8 | - |
251
  | | 16 | 65.4 | 56.8 | 36.3 | 48.3 | 81.4 | 116.6 | 107.0 | 80.1 | - | 55.8 | - | 67.7 | - |
252
  | | 32 | 65.9 | 57.8 | 36.7 | 50.0 | 82.7 | 116.6 | 107.5 | 81.1 | - | 52.5 | - | 67.3 | - |
253
  <br>
254
+ | IDEFICS 9B | 0 | 50.9 | 38.4 | 25.9 | 35.5 | 25.4 | 46.0 | 36.8 | 27.3 | 48.7 | 51.7 | 44.2 | 61.8 | 5.0 /16.8/20.8 |
255
  | | 4 | 55.4 | 45.5 | 27.6 | 36.9 | 60.0 | 93.0 | 81.3 | 59.7 | 47.9 | 50.7 | 37.4 | 62.3 | - |
256
  | | 8 | 56.4 | 47.7 | 27.5 | 40.4 | 63.2 | 97.0 | 86.8 | 61.9 | 47.6 | 51.0 | - | 66.3 | - |
257
  | | 16 | 57.0 | 48.4 | 27.9 | 42.6 | 67.4 | 99.7 | 89.4 | 64.5 | - | 50.9 | - | 67.8 | - |
 
271
 
272
  Similarly to the base IDEFICS models, we performed checkpoint selection to stop the training. Given that M3IT contains in the training set a handful of the benchmarks we were evaluating on, we used [MMBench](https://huggingface.co/papers/2307.06281) as a held-out validation benchmark to perform checkpoint selection. We select the checkpoint at step 3'000 for IDEFICS-80b-instruct and at step 8'000 for IDEFICS-9b-instruct.
273
 
274
+ | Model | Shots | <nobr>VQAv2 <br>OE VQA acc.</nobr> | <nobr>OKVQA <br>OE VQA acc.</nobr> | <nobr>TextVQA <br>OE VQA acc.</nobr> | <nobr>VizWiz<br>OE VQA acc.</nobr> | <nobr>TextCaps <br>CIDEr</nobr> | <nobr>Coco <br>CIDEr</nobr> | <nobr>NoCaps<br>CIDEr</nobr> | <nobr>Flickr<br>CIDEr</nobr> | <nobr>VisDial <br>NDCG</nobr> | <nobr>HatefulMemes<br>ROC AUC</nobr> | <nobr>ScienceQA <br>acc.</nobr> | <nobr>RenderedSST2<br>acc.</nobr> | <nobr>Winoground<br>group/text/image</nobr> |
275
  | :--------------------- | --------: | ---------------------: | ---------------------: | -----------------------: | ----------------------: | -------------------: | ---------------: | -----------------: | -----------------: | -----------------: | -------------------------: | -----------------------: | --------------------------: | ----------------------------------: |
276
  | Finetuning data **does not** contain the evaluation dataset | - | &#10006; | &#10006; | &#10006; | &#10004; | &#10006; | &#10006; | &#10006; | &#10004; | &#10006; | &#10004; | &#10006; | &#10004; | &#10006; |
277
  | <nobr>IDEFICS 80B Instruct<br> | 0 | 37.4 (-22.7) | 36.9 (-8.2) | 32.9 (1.9) | 26.2 (-9.8) | 76.5 (19.7) | 117.2 (25.4) | 104.5 (39.5) | 65.3 (11.7) | 49.3 (0.4) | 58.9 (-1.7) | 69.5 (0.5) | 67.3 (6.8) | 9.2/20.0/25.0 (1.2/1.2/2.5) |
 
286
  | | 16 | 66.8 (9.8) | 51.7 (3.3) | 31.6 (3.7) | 44.8 (2.3) | 70.2 (2.7) | 128.8 (29.1) | 101.5 (12.2) | 75.8 (11.4) | - | 51.7 (0.7) | - | 63.3 (-4.6) | - |
287
  | | 32 | 66.9 (9.0) | 52.3 (2.7) | 32.0 (3.7) | 46.0 (2.2) | 71.7 (3.6) | 127.8 (29.8) | 101.0 (10.5) | 76.3 (11.9) | - | 50.8 (1.0) | - | 60.9 (-6.1) | - |
288
 
289
+ *() Improvement over non-instruct version.
290
 
291
  # Technical Specifications
292