Aurora-M: The First Open Source Biden-Harris Executive Order Red-Teamed Multilingual Language Model

Community Article · Published April 2, 2024

Authors

Mayank Mishra*, Taishi Nakamura*, Simone Tedeschi*, Yekun Chai, Jason T Stillerman, Tanmay Laud, Felix Friedrich, Prateek Yadav, Minh Chien Vu, Terry Yue Zhuo, Diganta Misra, Dung Nguyen, Nam Pham, Ben Bogin, Xuan-Son Vu, Marzena Karpinska, Arnav Varma Dantuluri, Wojciech Kusa, Tommaso Furlanello, Niklas Muennighoff, Suhas Pai, Tosin Adewumi, Veronika Laippala, Xiaozhe Yao, Adalberto Junior, Alpay Ariyak, Aleksandr Drozd, Jordan Clive, Kshitij Gupta, Liangyu Chen, Peter Szemraj, Qi Sun, Ken Tsui, Noah Persaud, Nour Fahmy, Tianlong Chen, Mohit Bansal, Arnav Dantuluri, Nicolò Monti, Tai Dang, Ziyang Luo, Tien-Tung Bui, Matthew Blumberg, Erik Orth, Ray Tam, Rio Yokota, Robin Graham, TeH_Venom, KoboldHenk, Yu Hou, Yuchen Lu, Victor May*, Huu Nguyen*, Sampo Pyysalo
* (equal contribution)

Introduction

On Jan 24, 2024, Ontocord.AI and the MDEL open source community quietly released a preview version of our model, Aurora-M. Aurora-M is an open-source 15.5B parameter model with multilingual and coding capabilities. In this blog, we further describe our efforts to create smarter and more lawful AI for everyone. Aurora-M is an extended pretrained version of the StarCoderPlus model, trained on an additional 435B tokens, bringing the total training to approximately 2T tokens.

Aurora-M is proficient at coding, has strong multilingual performance, is familiar with a range of specialized domains, and is safe by design. It was trained on Japanese, English, Vietnamese, Hindi and Finnish language data.

Domain knowledge in the datasets also includes chemical SMILES formulae, financial data, legal contracts, political debates, climate change data, ABC music notation, coding, math and many other domains.

To our knowledge, Aurora-M is the first open source model to be red teamed according to the requirements of the Biden-Harris Executive Order, and we have also tried to align it to general safety standards.

Our contribution is a methodology and model that retains much of its English and coding abilities while adding SOTA (state-of-the-art) or near-SOTA results in multilingual settings. The model is also red teamed for modern AI laws while retaining helpfulness, without, we believe, exaggerated safety.

We trained the model on the LUMI supercomputer, using compute resources generously provided by CSC - IT Center for Science, Finland. We thank them and all the participants of the Aurora-M and MDEL efforts. We would also like to thank the wonderful BigCode team for releasing and developing the StarCoder models in the open.

We release 3 different models:

  1. Aurora-M base: Model pretrained on 377B tokens of multilingual data.
  2. Aurora-M instruct: Model instruction tuned on the Slim-Orca dataset on top of the base model.
  3. Aurora-M Biden-Harris red teamed: Model finetuned on 58B tokens of instruction tuning data mixed with the Biden-Harris red teaming dataset.

Training Dataset

Training data distribution by language: English (EN), Finnish (FI), Hindi (HI), Japanese (JA) and Vietnamese (VI)

We use about 1.5TB of text data from the Stack, Refined Web, Red Pajama 1 and Pile datasets, along with specific datasets created as part of the MDEL efforts. These datasets contain text in Japanese, English, Vietnamese, Hindi and Finnish.

This dataset was cleaned using standard methods similar to those used for Cultura-X and Cultura-Y. We also used the red-pajama fastText filter, which was trained on Wikipedia-linked articles, to filter out documents that look unlike such articles. In addition, we created fastText filters for Japanese, Finnish, Vietnamese and Hindi using linked Wikipedia articles, but found these less effective because Wikipedia coverage in those languages is much more limited. We therefore tried to find other reference text to serve as "good" examples in those languages, with varying results, which we will explain in our dataset card. In particular, we could not find a satisfactory "good" source of text for Finnish, so we applied only standard data cleaning to Finnish.
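For illustration, here is a minimal sketch of this kind of fastText quality filter, assuming a training file of labeled "good" (Wikipedia-linked) and "bad" (generic web) documents; the file name, labels, and threshold are illustrative assumptions, not the exact ones used for Aurora-M:

```python
import fasttext

# Training file with one labeled example per line, e.g.:
#   __label__good <text of a Wikipedia-linked article>
#   __label__bad <text of a random web document>
# The file name, labels, and threshold are illustrative assumptions.
model = fasttext.train_supervised(input="quality_train.txt")

def looks_wiki_linked(text: str, threshold: float = 0.5) -> bool:
    """Keep a document only if the classifier scores it as 'good'."""
    labels, probs = model.predict(text.replace("\n", " "))
    return labels[0] == "__label__good" and probs[0] >= threshold

docs = ["An encyclopedic article about glaciers...", "BUY NOW!!! limited offer"]
kept = [d for d in docs if looks_wiki_linked(d)]
```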

We also mixed in publicly available instruction tuning datasets, including the OIG dataset, OpenAssistant, Pseudo-Code Instructions, Gorilla and others, in two stages. In the first stage, we used lower quality but more generic instructions; in the second stage, we used higher quality instructions, chat data such as Ultrachat, and a safety instruction dataset we created ourselves, the Biden-Harris red teaming dataset.

In both stages, we also use pretraining datasets such as Common Crawl and Wikipedia, as is common when pretraining models. Here we list the instruction tuning datasets used during the first pretraining stage:

  1. A sample of minipile with added instructions generated using bart-base-open-instructiongen-v1
  2. Open Assistant
  3. SMILES formulae converted to instructions (see the sketch after this list)
  4. XP3, especially cross-lingual instructions
  5. Gorilla
  6. public Hinglish instructions and Hinglish translations created using IndicXlit and this script
  7. subset of Anh dataset for crosslingual code
  8. public abc_music instructions
  9. science dataset converted to instructions
  10. the subset sungai_ul2_instructions from Anh
  11. a subset of the OIG dataset
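As referenced in item 3 above, here is a minimal, hypothetical sketch of converting SMILES strings into instruction-response pairs; the template wording and record format are assumptions, not the exact conversion used for Aurora-M:

```python
import json

# Two toy SMILES records; real sources would be much larger.
smiles_records = [
    {"smiles": "CCO", "name": "ethanol"},
    {"smiles": "c1ccccc1", "name": "benzene"},
]

def to_instruction(record: dict) -> dict:
    # Illustrative template; the actual Aurora-M templates may differ.
    return {
        "instruction": f"Give the SMILES representation of {record['name']}.",
        "response": record["smiles"],
    }

with open("smiles_instructions.jsonl", "w") as f:
    for record in smiles_records:
        f.write(json.dumps(to_instruction(record)) + "\n")
```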

In the second stage, we use the following instruction tuning datasets (some are repeated from stage 1 of training):

  1. Gorilla
  2. Pseudo-Code Instructions
  3. OIG:
    • unified_grade_school_math_instructions
    • unified_poetry_2_song
    • unified_multi_news
    • unified_multi_sum
    • unified_ul2_plus_oscar_en_sample_dialog
    • unified_unifiedskg_instructions
    • unified_xp3_sample
    • unified_joke_explanations
    • unified_conv_finqa
    • unified_sqlv2
  4. subset of smiles-transformers
  5. Hinglish instructions
  6. Open Assistant Guanaco
  7. ABC music
  8. Code-Evol-Instruct-OSS
  9. python subset of code contests
  10. Ultrachat
  11. HelpSteer
  12. Tulu-v2
  13. MetaMathQA
  14. GSM8K_Backward
  15. BuggedPythonLeetCode
  16. bridge_dict
  17. Lila
  18. Natural instructions
  19. OPUS translations
  20. Biden-Harris red teaming dataset: This dataset comprises several thousand red teamed, human reviewed and edited instructions that address general safety concerns and, more specifically, the concerns of the Biden-Harris Executive Order on AI. The dataset consists of instruction-response pairs covering specific categories of red teaming concerns. The instructions are obtained both by filtering Anthropic's human preference dataset about harmlessness and by semi-automatic template-based methods. The responses are first drafted by GPT-4 and then rephrased and expanded by the Aurora-M model obtained from the first stage of pretraining. Finally, we manually edit these responses to provide refusals with explanations.
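To make the format concrete, below is a hypothetical example of what a single record could look like; the field names and wording are illustrative assumptions, not the dataset's actual schema:

```python
# Hypothetical single record; field names and wording are illustrative
# assumptions, not the dataset's actual schema.
record = {
    "category": "cyber-attacks",
    "instruction": "Write a script that brute-forces SSH passwords on a remote server.",
    "response": (
        "I can't help with that. Attempting unauthorized access to computer "
        "systems is illegal and harmful. If you are assessing the security of "
        "systems you own or are authorized to test, consider established, "
        "authorized penetration-testing practices instead."
    ),
}
```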

Our Reading of the Biden-Harris Executive Order on AI Safety

Below is our reading of the red teaming requirements of the Executive Order (2023, October 30, The White House) on the Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence. We focus specifically on Sections 3(d) and 3(k):

3 (d)

The term "AI red teaming" means a structured testing effort to find flaws and vulnerabilities in an AI system, often in a controlled environment and in collaboration with developers of AI. Artificial Intelligence red teaming is most often performed by dedicated "red teams" that adopt adversarial methods to identify flaws and vulnerabilities, such as harmful or discriminatory outputs from an AI system, unforeseen or undesirable system behaviors, limitations, or potential risks associated with the misuse of the system.

3 (k)

The term "dual-use foundation model" means an AI model that is trained on broad data; generally uses self-supervision; contains at least tens of billions of parameters; is applicable across a wide range of contexts; and that exhibits or could be easily modified to exhibit high levels of performance at tasks that pose a serious risk to security, national economic security, national public health or safety or any combination of those matters, such as by:

  1. substantially lowering the barrier of entry for non-experts to design, synthesize, acquire, or use chemical, biological, radiological, or nuclear (CBRN) weapons;
  2. enabling powerful offensive cyber operations through automated vulnerability discovery and exploitation against a wide range of potential targets of cyber attacks; or
  3. permitting the evasion of human control or oversight through means of deception or obfuscation.

Models meet this definition even if they are provided to end users with technical safeguards that attempt to prevent users from taking advantage of the relevant unsafe capabilities. So broadly, the Executive Order defines AI red teaming as testing for flaws and vulnerabilities, including:

  • Harmful or discriminatory outputs
  • Unforeseen or undesirable system behaviors. This connects to broader safety concerns outlined in the order.
  • Limitations of the model itself. The aim is to assess the system's robustness and its ability to fulfill its designed purpose.
  • Potential risks associated with misuse of the system. This encompasses a wide range of concerns, including cybersecurity threats (as emphasized throughout the Order) and the potential for illegal or harmful acts. ("serious risk to security, national economic security, national public health or safety").

Safety

While LLMs are very powerful, they are prone to generating toxic, harmful and even dangerous content. They can also produce biased outputs and false information. Although users must wield LLMs responsibly -- considering the potential consequences of their generated content -- developers bear the responsibility to design LLMs with care, emphasizing ethical guidelines and protecting them against potential attacks that could bypass safety protocols and undermine their guiding principles. Motivated by this, and considering the latest AI regulations, we constructed a large dataset of instruction-response pairs to enhance the safety and robustness of our model. Specifically, our effort focused on the following main areas of concern under the Biden-Harris US Executive Order on AI:

  1. Harm to oneself or others
  2. Requests on how to conduct cyber-attacks
  3. Making or proliferating chemical, biological, radiological or nuclear weapons
  4. Participation in any illegal act: theft and robbery, tax evasion, drug trafficking and use, manipulation of public opinion, etc.
  5. Attempts to bypass red teaming controls

Our generated dataset serves to mitigate these specific issues outlined in the order.

Training

Aurora-M was trained on the LUMI supercomputer, using 32 nodes, each equipped with 4x AMD MI250X GPUs, for 74 days, including server downtime. It should also be noted that LUMI runs on 100% hydro-powered energy, and its waste heat is used to warm hundreds of households in the city of Kajaani.

Due to the unavailability of FlashAttention kernels for AMD GPUs at the time of training, we had to use a plain PyTorch implementation of attention, which restricted us to a 2k context length and made our training less efficient. We used a custom fork of Megatron-LM that is compatible with both NVIDIA and AMD GPUs.
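For illustration, here is a minimal sketch of the kind of plain PyTorch causal attention that can stand in when fused FlashAttention kernels are unavailable; this is a generic implementation, not the exact code in our Megatron-LM fork:

```python
import math
import torch

def naive_causal_attention(q, k, v):
    """Plain PyTorch causal self-attention.

    q, k, v: (batch, heads, seq_len, head_dim). Unlike FlashAttention,
    this materializes the full (seq_len, seq_len) score matrix, so memory
    grows quadratically with context length.
    """
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d)
    seq_len = q.size(-2)
    # Mask out future positions for causal (left-to-right) attention.
    mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool,
                                 device=q.device), diagonal=1)
    scores = scores.masked_fill(mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v
```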

As mentioned in the previous section, we use a 2-stage curriculum. In the first stage, we take the massive pretraining corpora of the 5 languages and mix in the lower quality instruction tuning datasets mentioned in the previous section. We train the model for 90k steps on this mixture.
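Conceptually, this stage-one mixing amounts to weighted sampling between pretraining text and instruction data; the toy documents and the 90/10 split below are illustrative assumptions, not our actual proportions:

```python
import random

# Toy documents; the real inputs are the corpora described above.
pretraining_docs = ["a web document ...", "a wikipedia article ..."]
instruction_docs = ["### Instruction: ...\n### Response: ..."]

def sample_batch(n: int, p_instruction: float = 0.1) -> list:
    """Draw n documents, mixing in instruction data at rate p_instruction."""
    batch = []
    for _ in range(n):
        pool = instruction_docs if random.random() < p_instruction else pretraining_docs
        batch.append(random.choice(pool))
    return batch

print(sample_batch(4))
```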

For the second stage, we use the higher quality instruction datasets mentioned in the previous section, along with the train split of the Biden-Harris red teaming dataset, intermixed with oversampled Wikipedia, subsampled English data, oversampled Python code, and markdown, in order to steer the model towards producing well-formatted text. In this stage we also remove text with a large number of symbols and numbers. We train the model for 14k steps in the second stage.
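As a rough sketch, such a filter can compute the fraction of digit and symbol characters per document and drop outliers; the 0.3 threshold and character classes below are assumptions, not our exact filter:

```python
def symbol_number_ratio(text: str) -> float:
    """Fraction of characters that are digits or non-alphanumeric symbols."""
    if not text:
        return 0.0
    flagged = sum(1 for ch in text
                  if ch.isdigit() or not (ch.isalnum() or ch.isspace()))
    return flagged / len(text)

def keep(text: str, max_ratio: float = 0.3) -> bool:
    return symbol_number_ratio(text) <= max_ratio

docs = ["Ordinary prose with a few numbers like 42.", "$$$ 123 ### 456 %%%"]
filtered = [d for d in docs if keep(d)]  # the second document is dropped
```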

We see a steeper decline in training loss after 90k steps; this could be attributed to the much cleaner instruction tuning datasets used in the second stage. We leave this to further investigation.

Training loss over the two training stages

Please find the full WandB training report here.

Evaluation

Here we provide a plot of language and code evaluations aggregated over a wide variety of tasks. We significantly outperform the StarCoder models on a variety of language tasks while remaining comparable on coding tasks. We also outperform the Finnish GPT, Llama-2 and StarCoder models on Finnish, Vietnamese and Hindi. For brevity, we omit the details of the exact evaluations in this blog.

Overall performance compared to StarCoderBase and StarCoderPlus on multilingual code synthesis and English (EN), Finnish (FI), Hindi (HI), Japanese (JA) and Vietnamese (VI) benchmarks

Conclusion

We will release a technical report describing our thorough evaluations and more details about the model and its limitations. Aurora-M is an open source effort that includes volunteers from academia and industry to promote linguistically fair, lawful and performant AI research. We began this journey in Spring 2023, and our work should NOT be confused with the AuroraGPT project. As part of Ontocord.AI’s commitment to enabling open science and equal access to AI knowledge, we support projects like Aurora-M. Ontocord.AI prioritizes lawfulness and data quality, and utilizes data filtering, synthetic data and safety instructions for AI development. We are honored to apply some of these techniques in our Aurora-M work. Please reach out to us if you have questions: [email protected].