Commit 9ef6114
Parent(s): 8324361
Update README (#14)
- Update README (0c461edc7386276fd5044f0dd692b85b9b9f9aef)
Co-authored-by: Pedro Cuenca <[email protected]>
README.md
CHANGED
@@ -212,10 +212,10 @@ The Llama 3.2-Vision collection of multimodal large language models (LLMs) is a

**Model Architecture:** Llama 3.2-Vision is built on top of Llama 3.1 text-only model, which is an auto-regressive language model that uses an optimized transformer architecture. The tuned versions use supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) to align with human preferences for helpfulness and safety. To support image recognition tasks, the Llama 3.2-Vision model uses a separately trained vision adapter that integrates with the pre-trained Llama 3.1 language model. The adapter consists of a series of cross-attention layers that feed image encoder representations into the core LLM.

-| | Training Data | Params | Input modalities | Output modalities | Context length | GQA |
+| | Training Data | Params | Input modalities | Output modalities | Context length | GQA | Data volume | Knowledge cutoff |
| :---- | :---- | :---- | :---- | :---- | :---- | :---- | :---- | :---- |
| Llama 3.2-Vision | (Image, text) pairs | 11B (10.6) | Text \+ Image | Text | 128k | Yes | 6B (image, text) pairs | December 2023 |
-
+| Llama 3.2-Vision | (Image, text) pairs | 90B (88.8) | Text \+ Image | Text | 128k | Yes | 6B (image, text) pairs | December 2023 |

**Supported Languages:** For text only tasks, English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai are officially supported. Llama 3.2 has been trained on a broader collection of languages than these 8 supported languages. Note for image+text applications, English is the only language supported.

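The **Model Architecture** paragraph in the hunk above describes a vision adapter made of cross-attention layers that feed image-encoder representations into the core LLM. Purely as a rough illustration of that general idea, and not Meta's actual implementation, a single gated cross-attention block might look like the PyTorch sketch below; the class name, gating scheme, and toy dimensions are assumptions made for the example.

```python
import torch
import torch.nn as nn

class CrossAttentionAdapterLayer(nn.Module):
    """Illustrative gated cross-attention block: text hidden states attend to image features.

    Hypothetical sketch of the technique described in the model card, not the
    actual Llama 3.2-Vision adapter.
    """

    def __init__(self, hidden_size: int, num_heads: int, vision_dim: int):
        super().__init__()
        # Project image-encoder features into the language model's hidden size.
        self.vision_proj = nn.Linear(vision_dim, hidden_size)
        self.norm = nn.LayerNorm(hidden_size)
        self.cross_attn = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)
        # Zero-initialized gate: the block starts as a no-op, so it can be trained
        # separately without disturbing the pre-trained language model at initialization.
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, text_hidden: torch.Tensor, image_feats: torch.Tensor) -> torch.Tensor:
        # text_hidden: (batch, seq_len, hidden_size) from the language model
        # image_feats: (batch, num_patches, vision_dim) from the image encoder
        kv = self.vision_proj(image_feats)
        attn_out, _ = self.cross_attn(self.norm(text_hidden), kv, kv)
        # Residual connection scaled by the learnable gate.
        return text_hidden + torch.tanh(self.gate) * attn_out


# Toy shapes only; the real checkpoints use much larger dimensions.
layer = CrossAttentionAdapterLayer(hidden_size=512, num_heads=8, vision_dim=256)
text = torch.randn(1, 16, 512)
image = torch.randn(1, 64, 256)
print(layer(text, image).shape)  # torch.Size([1, 16, 512])
```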
@@ -329,31 +329,31 @@ In this section, we report the results for Llama 3.2-Vision models on standard a

| Category | Benchmark | \# Shots | Metric | Llama 3.2 11B | Llama 3.2 90B |
| ----- | ----- | ----- | ----- | ----- | ----- |
-| Image Understanding | VQAv2 (
-| | Text VQA (val) | 0 |
-| | DocVQA (val, unseen) | 0 |
-| Visual Reasoning | MMMU (val, 0-shot) | 0 |
-| | ChartQA (test) | 0 |
-| | InfographicsQA (val, unseen) | 0 |
-| | AI2 Diagram (test) | 0 |
+| Image Understanding | VQAv2 (val) | 0 | Accuracy | 66.8 | 73.6 |
+| | Text VQA (val) | 0 | Relaxed accuracy | 73.1 | 73.5 |
+| | DocVQA (val, unseen) | 0 | ANLS | 62.3 | 70.7 |
+| Visual Reasoning | MMMU (val, 0-shot) | 0 | Micro average accuracy | 41.7 | 49.3 |
+| | ChartQA (test) | 0 | Accuracy | 39.4 | 54.2 |
+| | InfographicsQA (val, unseen) | 0 | ANLS | 43.2 | 56.8 |
+| | AI2 Diagram (test) | 0 | Accuracy | 62.4 | 75.3 |

### Instruction Tuned Models

| Modality | Capability | Benchmark | \# Shots | Metric | Llama 3.2 11B | Llama 3.2 90B |
| ----- | :---: | ----- | :---: | :---: | ----- | ----- |
-| Image | College-level Problems and Mathematical Reasoning | MMMU (val, CoT) | 0 |
-| | | MMMU-Pro, Standard (10 opts, test) | 0 |
-| | | MMMU-Pro, Vision (test) | 0 |
-| | | MathVista (testmini) | 0 |
-| | Charts and Diagram Understanding | ChartQA (test, CoT) | 0 |
-| | | AI2 Diagram (test) | 0 |
+| Image | College-level Problems and Mathematical Reasoning | MMMU (val, CoT) | 0 | Micro average accuracy | 50.7 | 60.3 |
+| | | MMMU-Pro, Standard (10 opts, test) | 0 | Accuracy | 33.0 | 45.2 |
+| | | MMMU-Pro, Vision (test) | 0 | Accuracy | 23.7 | 33.8 |
+| | | MathVista (testmini) | 0 | Accuracy | 51.5 | 57.3 |
+| | Charts and Diagram Understanding | ChartQA (test, CoT) | 0 | Relaxed accuracy | 83.4 | 85.5 |
+| | | AI2 Diagram (test) | 0 | Accuracy | 91.1 | 92.3 |
| | | DocVQA (test) | 0 | ANLS | 88.4 | 90.1 |
-| | General Visual Question Answering | VQAv2 (test) | 0 |
+| | General Visual Question Answering | VQAv2 (test) | 0 | Accuracy | 75.2 | 78.1 |
| | | | | | | |
-| Text | General | MMLU | 0 |
-| | Math | MATH (CoT) | 0 |
-| | Reasoning | GPQA | 0 |
-| | Multilingual | MGSM (CoT) | 0 | em |
+| Text | General | MMLU (CoT) | 0 | Macro\_avg/acc | 73.0 | 86.0 |
+| | Math | MATH (CoT) | 0 | Final\_em | 51.9 | 68.0 |
+| | Reasoning | GPQA | 0 | Accuracy | 32.8 | 46.7 |
+| | Multilingual | MGSM (CoT) | 0 | em | 68.9 | 86.9 |

## Responsibility & Safety

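The metric columns added in the hunk above use names such as "Micro average accuracy", "Macro\_avg/acc", and "em" (exact match). As a generic illustration of the usual micro versus macro distinction, and not the evaluation harness behind these particular numbers, the helper below sketches how the two averages differ; the function name and toy data are invented for the example.

```python
from collections import defaultdict

def micro_macro_accuracy(examples):
    """examples: list of (category, is_correct) pairs.

    Micro average weights every example equally; macro average first computes
    per-category accuracy, then averages across categories.
    """
    micro = sum(ok for _, ok in examples) / len(examples)
    per_cat = defaultdict(list)
    for cat, ok in examples:
        per_cat[cat].append(ok)
    macro = sum(sum(v) / len(v) for v in per_cat.values()) / len(per_cat)
    return micro, macro

# Toy data: two subjects of different sizes.
data = [("math", True), ("math", False), ("math", True), ("law", True)]
print(micro_macro_accuracy(data))  # (0.75, 0.833...)
```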
@@ -399,8 +399,8 @@ In addition to our safety work above, we took extra care on measuring and/or mit

**2\. Child Safety:** Child Safety risk assessments were conducted using a team of experts, to assess the model’s capability to produce outputs that could result in Child Safety risks and inform on any necessary and appropriate risk mitigations via fine tuning. We leveraged those expert red teaming sessions to expand the coverage of our evaluation benchmarks through Llama 3 model development. For Llama 3, we conducted new in-depth sessions using objective based methodologies to assess the model risks along multiple attack vectors including the additional languages Llama 3 is trained on. We also partnered with content specialists to perform red teaming exercises assessing potentially violating content while taking account of market specific nuances or experiences.

-**3\. Cyber Attacks:**
-Our attack automation study focused on evaluating the capabilities of LLMs when used as autonomous agents in cyber offensive operations, specifically in the context of ransomware attacks. This evaluation was distinct from previous studies that considered LLMs as interactive assistants. The primary objective was to assess whether these models could effectively function as independent agents in executing complex cyber-attacks without human intervention.
+**3\. Cyber Attacks:** For Llama 3.1 405B, our cyber attack uplift study investigated whether LLMs can enhance human capabilities in hacking tasks, both in terms of skill level and speed.
+Our attack automation study focused on evaluating the capabilities of LLMs when used as autonomous agents in cyber offensive operations, specifically in the context of ransomware attacks. This evaluation was distinct from previous studies that considered LLMs as interactive assistants. The primary objective was to assess whether these models could effectively function as independent agents in executing complex cyber-attacks without human intervention. Because Llama 3.2’s vision capabilities are not generally germane to cyber uplift, we believe that the testing conducted for Llama 3.1 also applies to Llama 3.2.

### Community
