init

Files changed (10)

README.md +279 -0
Readme_en.md +272 -0
config.json +38 -0
generation_config.json +6 -0
model.safetensors.index.json +370 -0
output-00001-of-00002.safetensors +3 -0
output-00002-of-00002.safetensors +3 -0
special_tokens_map.json +34 -0
tokenizer.json +0 -0
tokenizer_config.json +0 -0

README.md ADDED

	@@ -0,0 +1,279 @@

+---
+license: apache-2.0
+datasets:
+- Vikhrmodels/GrandMaster-PRO-MAX
+- Vikhrmodels/Grounded-RAG-RU-v2
+language:
+- en
+- ru
+base_model:
+- mistralai/Mistral-Nemo-Instruct-2407
+---
+[Readme.md in English](Readme_en.md)
+## Vikhr-Nemo-12B-Instruct-R-21-09-24
+### Description
+**Vikhr-Nemo** is our flagship unimodal LLM (Large Language Model), an improved version of [mistralai/Mistral-Nemo-Instruct-2407](https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407) developed by the **VikhrModels** team and adapted primarily for Russian and English. It was trained in several stages, including **SFT** and **SMPO**, our own variation of DPO; see the *"How this model was created"* section for details.
+The model is optimized for a range of use cases, including reasoning, summarization, code, roleplay, and dialogue. Vikhr-Nemo supports multilingual generation and has strong RAG capabilities. It achieves some of the best scores on our instruction and RAG benchmarks, so we believe that in some tasks (for example, RAG) it can be no worse than OpenAI's gpt-4o-mini.
+All training code is available in our [effective_llm_alignment](https://github.com/VikhrModels/effective_llm_alignment/) repository on GitHub, and the main datasets are available on our [HF profile](https://huggingface.co/Vikhrmodels).
+### Features
+1. High-quality generation in Russian and English, as well as some other languages, thanks to the [Grandmaster-PRO-MAX](https://huggingface.co/datasets/Vikhrmodels/GrandMaster-PRO-MAX) dataset and the base model
+2. Support for system prompts to control response style
+3. Support for up to 128k tokens of context thanks to the base model
+4. Grounded RAG mode: the model has a special `documents` role and a special mode for finding the identifiers of documents relevant to the user's question and using them to answer it, inspired by the similar capability of the Command-R model
+### Metrics and quality evaluation
+The model was evaluated on our open-source Russian-language SbS benchmark [ru-arena-general](https://github.com/VikhrModels/ru_llm_arena) (50 topics with 10 questions each), where gpt-4-1106-preview acts as the judge, and on a [benchmark](https://colab.research.google.com/drive/16730rWQ4-yGqWoooLs0Ece_16frmOniP?usp=sharing) for RAG based on the [Grounded-RAG-v2](https://huggingface.co/datasets/Vikhrmodels/Grounded-RAG-RU-v2) test set, where gpt-4o acts as the judge.
+#### Results on Ru-Arena-General
+The reference answers that models are compared against come from gpt-3.5-turbo-0125, which is why it has a win rate of 50%.
+Only part of the leaderboard is shown here; see the benchmark repository for details.
+| Model Name                                       | Winrate  | 95% CI             | Average # Tokens |
+|--------------------------------------------------|--------|--------------------|------------------|
+| gpt-4-1106-preview                               | 90.9   | (-1.3, 1.0)        | 541              |
+| gpt-4o-mini                                      | 83.9   | (-1.8, 1.1)        | 448              |
+| **vikhr-nemo-12b-instruct-r-21-09-24**               | **79.8**   | (-2.2, 1.9)        | **627**              |
+| gemma-2-9b-it-sppo-iter3                         | 73.6   | (-1.6, 2.2)        | 509              |
+| gemma-2-9b-it                                    | 69.2   | (-2.5, 1.9)        | 459              |
+| t-lite-instruct-0.1                              | 64.7   | (-2.1, 1.7)        | 810              |
+| vikhr-llama3.1-8b-instruct-r-21-09-24            | 63.4   | (-2.1, 2.5)        | 618              |
+| suzume-llama-3-8B-multilingual-orpo-borda-half   | 57.1   | (-1.9, 2.2)        | 682              |
+| mistral-nemo-instruct-2407                       | 50.5   | (-2.7, 2.6)        | 403              |
+| gpt-3.5-turbo-0125                               | 50.0   | (0.0, 0.0)         | 220              |
+| c4ai-command-r-v01                               | 49.0   | (-1.7, 2.2)        | 529              |
+| meta-llama-3.1-8b-instruct                       | 43.1   | (-2.8, 2.3)        | 628              |
+#### Results on the RAG benchmark
+The test set contains 200 examples in total: 100 in_domain questions and 100 out_of_domain questions.
+For this evaluation, the judge model gpt-4o was instructed to consider the relevance and factual completeness of answers with respect to the documents and the reference answer from gpt-4-1106-preview.
+For details of the prompts and scores, see the benchmark code in [Colab](https://colab.research.google.com/drive/16730rWQ4-yGqWoooLs0Ece_16frmOniP?usp=sharing)
+in_domain: questions that are related to the content of the provided documents to some degree \
+out_of_domain: questions that are deliberately unrelated to the content of the provided documents
+<table>
+<thead>
+  <tr>
+    <th rowspan="2">question_type</th>
+    <th colspan="3">gpt-4o</th>
+  </tr>
+  <tr>
+    <th>judge_correct_percent</th>
+    <th>avg_answer_match_rougeL</th>
+    <th>avg_abs_indexes_diff</th>
+  </tr>
+</thead>
+<tbody>
+  <tr>
+    <td>in_domain</td>
+    <td>73%</td>
+    <td>0.34</td>
+    <td>NaN</td>
+  </tr>
+  <tr>
+    <td>out_of_domain</td>
+    <td>81%</td>
+    <td>0.20</td>
+    <td>NaN</td>
+  </tr>
+</tbody>
+</table>
+<table>
+<thead>
+  <tr>
+    <th style="visibility: hidden;" rowspan="2">question_type</th>
+    <th colspan="3">Vikhr-Nemo-12B-Instruct-R-21-09-24</th>
+  </tr>
+  <tr>
+    <th style="visibility: hidden;">judge_correct_percent</th>
+    <th style="visibility: hidden;">avg_answer_match_rougeL</th>
+    <th style="visibility: hidden;">avg_abs_indexes_diff</th>
+  </tr>
+</thead>
+<tbody>
+  <tr>
+    <td>in_domain</td>
+    <td>68%</td>
+    <td>0.41</td>
+    <td>0</td>
+  </tr>
+  <tr>
+    <td>out_of_domain</td>
+    <td>92%</td>
+    <td>0.52</td>
+    <td>0</td>
+  </tr>
+</tbody>
+</table>
+<table>
+<thead>
+  <tr>
+    <th style="visibility: hidden;" rowspan="2">question_type</th>
+    <th colspan="3">gpt-4o-mini</th>
+  </tr>
+  <tr>
+    <th style="visibility: hidden;">judge_correct_percent</th>
+    <th style="visibility: hidden;">avg_answer_match_rougeL</th>
+    <th style="visibility: hidden;">avg_abs_indexes_diff</th>
+  </tr>
+</thead>
+<tbody>
+  <tr>
+    <td>in_domain</td>
+    <td>65%</td>
+    <td>0.33</td>
+    <td>NaN</td>
+  </tr>
+  <tr>
+    <td>out_of_domain</td>
+    <td>73%</td>
+    <td>0.18</td>
+    <td>NaN</td>
+  </tr>
+</tbody>
+</table>
+<table>
+<thead>
+  <tr>
+    <th style="visibility: hidden;" rowspan="2">question_type</th>
+    <th colspan="3">gpt-3.5-turbo-0125 </th>
+  </tr>
+  <tr>
+    <th style="visibility: hidden;">judge_correct_percent</th>
+    <th style="visibility: hidden;">avg_answer_match_rougeL</th>
+    <th style="visibility: hidden;">avg_abs_indexes_diff</th>
+  </tr>
+</thead>
+<tbody>
+  <tr>
+    <td>in_domain</td>
+    <td>49%</td>
+    <td>0.28</td>
+    <td>NaN</td>
+  </tr>
+  <tr>
+    <td>out_of_domain</td>
+    <td>76%</td>
+    <td>0.20</td>
+    <td>NaN</td>
+  </tr>
+</tbody>
+</table>
+### How this model was created
+#### Instruction SFT stage
+For the SFT stage we prepared a large (150k instructions) synthetic instruction dataset, [Vikhrmodels/GrandMaster-PRO-MAX](https://huggingface.co/datasets/Vikhrmodels/GrandMaster-PRO-MAX). Its distinguishing feature is built-in CoT (Chain-Of-Thought), which we collected using a modified prompt for gpt-4-turbo; details are in the dataset card.
+In addition, to enable RAG grounding we prepared another synthetic dataset, [Vikhrmodels/Grounded-RAG-RU-v2](https://huggingface.co/datasets/Vikhrmodels/Grounded-RAG-RU-v2) (50k dialogues). Its build pipeline is too complex for a short description; you can read more in its dataset card.
+#### Alignment stage with SMPO
+To further improve response quality, we used the following pipeline:
+1) Trained a custom Reward model (it will not be released publicly for now)
+2) Deduplicated and filtered the original Vikhrmodels/GrandMaster-PRO-MAX dataset using the Reward model, obtaining about 10k of the highest-quality and most diverse dialogues.
+3) Performed Rejection Sampling with the SFT checkpoint using the resulting dataset and the Reward model. (We generated 7 hypotheses and took only the 2 worst as rejected)
+4) Fine-tuned the SFT checkpoint with our SMPO method using the dataset from step 3. SMPO was designed and chosen as a method for improving the stability of preference training under Rejection Sampling and for reaching the desired margin.
+The implementation of SMPO, rejection sampling, etc. can be found in our [effective_llm_alignment](https://github.com/VikhrModels/effective_llm_alignment/) library on GitHub
+The idea of using SMPO rather than another PO method came out of a large number of experiments with classical methods, where better control over convergence was needed. With careful tuning of other methods (for example SimPO) a similar result can be achieved, but we tried to stabilize this process and combine best practices from other methods.
+### How to work with RAG
+The content of the `documents` role is a list of dictionaries describing document content, serialized with `json.dumps(array, ensure_ascii=False)` (see the example below). \
+Document content can be provided in **3** different formats: **Markdown**, **HTML**, **Plain Text**. The content of each document can be a text chunk of up to 4k characters.
+```json
+[
+  {
+    "doc_id": (0..5),
+    "title": "(null or str)",
+    "content": "(html or markdown or plain text)"
+  }
+]
+```
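The schema above can be turned into a chat message with a small helper. This is a minimal sketch, not the authors' code: the function name `build_documents_message` and the constant `MAX_DOC_CHARS` are hypothetical, and the 4000-character cutoff is only an assumed reading of "up to 4k characters". The `0..5` range shown for `doc_id` is not enforced here.

```python
import json

# Assumption: "up to 4k characters" is read as a hard limit of 4000 characters.
MAX_DOC_CHARS = 4000


def build_documents_message(docs):
    """Return a chat message with the 'documents' role for grounded RAG.

    `docs` is a list of dicts with a required "content" key and an
    optional "title" key; doc_id values are assigned by position.
    """
    payload = []
    for doc_id, doc in enumerate(docs):
        content = doc["content"]
        if len(content) > MAX_DOC_CHARS:
            raise ValueError(f"document {doc_id} exceeds {MAX_DOC_CHARS} characters")
        payload.append({"doc_id": doc_id, "title": doc.get("title"), "content": content})
    # ensure_ascii=False keeps non-ASCII text (e.g. Cyrillic) as-is instead of \uXXXX escapes
    return {"role": "documents", "content": json.dumps(payload, ensure_ascii=False)}


message = build_documents_message([
    {"title": "Global Warming: Glaciers", "content": "Glacier volume has decreased by 30%."},
    {"content": "Sea level has risen by 20 cm since 1880."},
])
print(message["content"])
```

A missing title is serialized as JSON `null`, which matches the `(null or str)` placeholder in the schema.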
+#### Example of correct usage with an OpenAI-like API
+Starting the vLLM server: `vllm serve --dtype half --max-model-len 32000 -tp 1 Vikhrmodels/Vikhr-Nemo-12B-Instruct-R-21-09-24 --api-key token-abc123`
+```python
+import json
+
+from openai import OpenAI
+
+# Assumption: the vLLM server started above listens on its default local address.
+llm_client = OpenAI(base_url="http://localhost:8000/v1", api_key="token-abc123")
+llm_model = "Vikhrmodels/Vikhr-Nemo-12B-Instruct-R-21-09-24"
+
+GROUNDED_SYSTEM_PROMPT = "Your task is to answer the user's questions using only the information from the provided documents. Give two answers to each question: one with a list of relevant document identifiers and the second with the answer to the question itself, using documents with these identifiers."
+documents = [
+  {
+    "doc_id": 0,
+    "title": "Global Warming: Glaciers",
+    "content": "Over the past 50 years, the volume of glaciers worldwide has decreased by 30%."
+  },
+  {
+    "doc_id": 1,
+    "title": "Global Warming: Sea Level",
+    "content": "The global sea level has risen by 20 cm since 1880 and continues to rise by 3.3 mm per year."
+  }
+]
+sample_history = [
+    {'role': 'system', 'content': GROUNDED_SYSTEM_PROMPT},
+    {'role': 'documents', 'content': json.dumps(documents, ensure_ascii=False)},
+    {'role': 'user', 'content': 'Global warming'}
+]
+# First call: the model returns the identifiers of the relevant documents as JSON.
+relevant_indexes = llm_client.chat.completions.create(
+    model=llm_model,
+    messages=sample_history,
+    temperature=0.0,
+    max_tokens=2048
+).choices[0].message.content
+print('Using documents: ' + relevant_indexes + '\n----')
+# Second call: the first response is appended as an assistant turn, then the model answers.
+final_answer = llm_client.chat.completions.create(
+    model=llm_model,
+    messages=sample_history + [{'role': 'assistant', 'content': relevant_indexes}],
+    temperature=0.3,
+    max_tokens=2048
+).choices[0].message.content
+print(final_answer)
+```
+After running the code, the output will look roughly like this:
+    Using documents: {"relevant_doc_ids": [0, 1]}
+    ----
+    Global warming is a long-term increase in the average temperature of the Earth's atmosphere and oceans. This phenomenon has many consequences for our planet, and two key aspects can be highlighted based on the data available to me:
+    1. **Reduction of glacier volume**: Over the past 50 years, there has been a significant reduction in glacier volume worldwide. According to the data, glacier volume has decreased by 30%. This may be related to glacier melting due to rising temperatures, which is one of the signs of global warming.
+    2. **Sea level rise**: The global sea level is also rising, which is associated with the melting of glaciers and ice sheets, as well as the expansion of water as it warms. Since 1880, the sea level has risen by 20 centimeters, and this process continues, with an annual increase of 3.3 millimeters.
+    These changes have serious consequences for ecosystems, climate, and human society. Glacier melting leads to sea level rise, which can result in the flooding of coastal areas and islands, as well as changes in water resources and climate patterns.
+The model's first response, `relevant_indexes` (JSON), shows whether the model found the information in the documents. It is trained to return an empty array when nothing relevant is found, and in that case it will answer that it could not find the information in the knowledge base (when generating the second response).
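The check described above can be sketched as a small parser for the first response. The key name `relevant_doc_ids` is taken from the sample output; the function name `parse_relevant_doc_ids` and the choice to return `None` for malformed output are illustrative assumptions.

```python
import json


def parse_relevant_doc_ids(first_response):
    """Return the list of document ids from the model's first response.

    Returns None when the text is not the expected JSON object, so the
    caller can tell "malformed output" apart from "no documents found".
    """
    try:
        data = json.loads(first_response)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    ids = data.get("relevant_doc_ids")
    if not isinstance(ids, list):
        return None
    return ids


ids = parse_relevant_doc_ids('{"relevant_doc_ids": [0, 1]}')
if ids is None:
    print("unexpected response format")
elif not ids:
    print("no relevant documents found")
else:
    print("grounded on documents:", ids)  # prints: grounded on documents: [0, 1]
```

An application could skip the second request when the list is empty and show its own "nothing found" message instead.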
+### Nuances and limitations
+- The model has a **low level of response safety** and is aimed at following instructions correctly and completely; keep this in mind when using it and test it yourself. This can be partly mitigated with system prompts and additional guidance about the importance of safety in the user prompt.
+- System prompts are not intended for describing characters; we recommend using them to specify response style (such as "answer only in json format"). It is also preferable to write them **in English**, since that is how they appeared in the dataset; using English in system prompts does not affect the response language.
+- RAG mode **requires** the system prompt `GROUNDED_SYSTEM_PROMPT` described in the *How to work with RAG* section. The model may also sometimes add general information from its own knowledge to what is in the documents.
+- The model is best used with a low temperature (0.1-0.5) together with top_k (30-50); random generation defects were observed at temperature 1.0.
+### Authors
+- Sergei Bratchikov, [NLP Wanderer](https://t.me/nlpwanderer), Vikhr Team
+- Konstantin Korolev, Vikhr Team
+- Aleksandr Nikolich, Vikhr Team

Readme_en.md ADDED

	@@ -0,0 +1,272 @@

+## Vikhr-Nemo-12B-Instruct-R-21-09-24
+### Description
+**Vikhr-Nemo** is our flagship unimodal LLM (Large Language Model), representing an improved version of [mistralai/Mistral-Nemo-Instruct-2407](https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407) developed by the **VikhrModels** team, primarily adapted for Russian and English languages. The training involved several stages, including **SFT** and **SMPO** – our custom variant of DPO, details of which are available in the *"How This Model Was Created"* section.
+The model is optimized for a wide range of use cases, including reasoning, summarization, coding, role-playing, and dialogue maintenance. Vikhr-Nemo supports multilingual generation and has strong RAG capabilities. It is among the top-scoring models on our instruction and RAG benchmarks, and we believe that for certain tasks (e.g., RAG), it can rival OpenAI's gpt-4o-mini.
+All training code is available in our [effective_llm_alignment](https://github.com/VikhrModels/effective_llm_alignment/) repository on GitHub, and the main datasets are available on our [HF profile](https://huggingface.co/Vikhrmodels).
+### Features
+1. High-quality generation in Russian, English, and several other languages, thanks to the [Grandmaster-PRO-MAX](https://huggingface.co/datasets/Vikhrmodels/GrandMaster-PRO-MAX) dataset and the base model.
+2. Support for system prompts to regulate response styles.
+3. Up to 128k token context support thanks to the base model.
+4. Grounded RAG mode – the model features a special 'documents' role and a mode for identifying relevant document IDs for user queries and using them for responses, inspired by Command-R’s similar capabilities.
+### Metrics and Quality Evaluation
+The model was evaluated on our open-source Russian-language SbS benchmark [ru-arena-general](https://github.com/VikhrModels/ru_llm_arena) (50 topics with 10 questions each), where gpt-4-1106-preview acted as the judge, and the [RAG benchmark](https://colab.research.google.com/drive/16730rWQ4-yGqWoooLs0Ece_16frmOniP?usp=sharing) based on the [Grounded-RAG-v2](https://huggingface.co/datasets/Vikhrmodels/Grounded-RAG-RU-v2) test set, judged by gpt-4o.
+#### Results on Ru-Arena-General
+The reference answers, to which models are compared, were generated by gpt-3.5-turbo-0125, hence it has a win rate of 50%.
+Only part of the leaderboard is shown here; for more details, check the benchmark repository.
+| Model Name                                       | Winrate  | 95% CI             | Average # Tokens |
+|--------------------------------------------------|--------|--------------------|------------------|
+| gpt-4-1106-preview                               | 90.9   | (-1.3, 1.0)        | 541              |
+| gpt-4o-mini                                      | 83.9   | (-1.8, 1.1)        | 448              |
+| **vikhr-nemo-12b-instruct-r-21-09-24**           | **79.8**   | (-2.2, 1.9)        | **627**          |
+| gemma-2-9b-it-sppo-iter3                         | 73.6   | (-1.6, 2.2)        | 509              |
+| gemma-2-9b-it                                    | 69.2   | (-2.5, 1.9)        | 459              |
+| t-lite-instruct-0.1                              | 64.7   | (-2.1, 1.7)        | 810              |
+| vikhr-llama3.1-8b-instruct-r-21-09-24            | 63.4   | (-2.1, 2.5)        | 618              |
+| suzume-llama-3-8B-multilingual-orpo-borda-half   | 57.1   | (-1.9, 2.2)        | 682              |
+| mistral-nemo-instruct-2407                       | 50.5   | (-2.7, 2.6)        | 403              |
+| gpt-3.5-turbo-0125                               | 50.0   | (0.0, 0.0)         | 220              |
+| c4ai-command-r-v01                               | 49.0   | (-1.7, 2.2)        | 529              |
+| meta-llama-3.1-8b-instruct                       | 43.1   | (-2.8, 2.3)        | 628              |
+#### RAG Benchmark Results
+The test set comprises 200 examples: 100 in-domain questions and 100 out-of-domain questions.
+For evaluation, the judge model gpt-4o was instructed to consider relevance and factual completeness based on documents and the reference answer from gpt-4-1106-preview.
+For prompt details and evaluations, refer to the [Colab notebook](https://colab.research.google.com/drive/16730rWQ4-yGqWoooLs0Ece_16frmOniP?usp=sharing).
+**In-Domain**: Questions related to the provided documents.
+**Out-of-Domain**: Questions deliberately unrelated to the provided documents.
+<table>
+<thead>
+  <tr>
+    <th rowspan="2">question_type</th>
+    <th colspan="3">gpt-4o</th>
+  </tr>
+  <tr>
+    <th>judge_correct_percent</th>
+    <th>avg_answer_match_rougeL</th>
+    <th>avg_abs_indexes_diff</th>
+  </tr>
+</thead>
+<tbody>
+  <tr>
+    <td>in_domain</td>
+    <td>73%</td>
+    <td>0.34</td>
+    <td>NaN</td>
+  </tr>
+  <tr>
+    <td>out_of_domain</td>
+    <td>81%</td>
+    <td>0.20</td>
+    <td>NaN</td>
+  </tr>
+</tbody>
+</table>
+<table>
+<thead>
+  <tr>
+    <th style="visibility: hidden;" rowspan="2">question_type</th>
+    <th colspan="3">Vikhr-Nemo-12B-Instruct-R-21-09-24</th>
+  </tr>
+  <tr>
+    <th style="visibility: hidden;">judge_correct_percent</th>
+    <th style="visibility: hidden;">avg_answer_match_rougeL</th>
+    <th style="visibility: hidden;">avg_abs_indexes_diff</th>
+  </tr>
+</thead>
+<tbody>
+  <tr>
+    <td>in_domain</td>
+    <td>68%</td>
+    <td>0.41</td>
+    <td>0</td>
+  </tr>
+  <tr>
+    <td>out_of_domain</td>
+    <td>92%</td>
+    <td>0.52</td>
+    <td>0</td>
+  </tr>
+</tbody>
+</table>
+<table>
+<thead>
+  <tr>
+    <th style="visibility: hidden;" rowspan="2">question_type</th>
+    <th colspan="3">gpt-4o-mini</th>
+  </tr>
+  <tr>
+    <th style="visibility: hidden;">judge_correct_percent</th>
+    <th style="visibility: hidden;">avg_answer_match_rougeL</th>
+    <th style="visibility: hidden;">avg_abs_indexes_diff</th>
+  </tr>
+</thead>
+<tbody>
+  <tr>
+    <td>in_domain</td>
+    <td>65%</td>
+    <td>0.33</td>
+    <td>NaN</td>
+  </tr>
+  <tr>
+    <td>out_of_domain</td>
+    <td>73%</td>
+    <td>0.18</td>
+    <td>NaN</td>
+  </tr>
+</tbody>
+</table>
+<table>
+<thead>
+  <tr>
+    <th style="visibility: hidden;" rowspan="2">question_type</th>
+    <th colspan="3">gpt-3.5-turbo-0125 </th>
+  </tr>
+  <tr>
+    <th style="visibility: hidden;">judge_correct_percent</th>
+    <th style="visibility: hidden;">avg_answer_match_rougeL</th>
+    <th style="visibility: hidden;">avg_abs_indexes_diff</th>
+  </tr>
+</thead>
+<tbody>
+  <tr>
+    <td>in_domain</td>
+    <td>49%</td>
+    <td>0.28</td>
+    <td>NaN</td>
+  </tr>
+  <tr>
+    <td>out_of_domain</td>
+    <td>76%</td>
+    <td>0.20</td>
+    <td>NaN</td>
+  </tr>
+</tbody>
+</table>
+### How This Model Was Created
+#### Instructional SFT Part
+For the SFT stage of model training, we have prepared a large (150k instructions) synthetic instructional dataset [Vikhrmodels/GrandMaster-PRO-MAX](https://huggingface.co/datasets/Vikhrmodels/GrandMaster-PRO-MAX). Its unique feature is the built-in Chain-Of-Thought (CoT), which we collected using a modified prompt for gpt-4-turbo. For more details, please refer to the dataset card.
+Additionally, to perform RAG Grounding, we have prepared another synthetic dataset - [Vikhrmodels/Grounded-RAG-RU-v2](https://huggingface.co/datasets/Vikhrmodels/Grounded-RAG-RU-v2) (50k dialogues). The pipeline for its construction is quite complex, so you can find more information in the dataset card.
+#### SMPO Alignment Stage
+To further improve the quality of responses, we used the following pipeline:
+1) Trained a custom Reward model (it will not be released publicly for now).
+2) Deduplicated and filtered the original Vikhrmodels/GrandMaster-PRO-MAX dataset using the Reward model, resulting in around 10k of the highest-quality and most diverse dialogues.
+3) Performed Rejection Sampling with the SFT checkpoint using the resulting dataset and Reward model. We generated 7 hypotheses and selected the 2 worst ones as rejected.
+4) Fine-tuned the SFT checkpoint using our SMPO method with the dataset obtained from step 3. SMPO was designed and chosen as the method to enhance the stability of preference training under Rejection Sampling conditions and to achieve the desired margin.
+The implementation of SMPO, rejection sampling, etc., can be found in our [effective_llm_alignment](https://github.com/VikhrModels/effective_llm_alignment/) library on GitHub.
+The idea of using SMPO over other PO methods arose from numerous experiments with classical methods and the need for better convergence control. While other methods (e.g., SimPO) can achieve similar results with careful tuning, we aimed to stabilize the process and combine the best practices from other methods.
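Step 3 of the pipeline can be illustrated with a short sketch that turns reward-scored hypotheses into preference pairs. This is not the authors' implementation (that lives in effective_llm_alignment). The function name `build_preference_pairs` and the pair layout are hypothetical. The README does not say what serves as the chosen response, so using the highest-scored hypothesis as chosen is an assumption made only for this example.

```python
def build_preference_pairs(prompt, scored_hypotheses, num_rejected=2):
    """Build (chosen, rejected) pairs from reward-scored hypotheses.

    scored_hypotheses: list of (text, reward) tuples for one prompt.
    Assumption: the best-scored hypothesis is used as "chosen"; the source
    only states that the 2 worst hypotheses are taken as "rejected".
    """
    if len(scored_hypotheses) <= num_rejected:
        raise ValueError("need more hypotheses than rejected samples")
    ranked = sorted(scored_hypotheses, key=lambda item: item[1], reverse=True)
    chosen = ranked[0][0]
    worst = ranked[-num_rejected:]  # the lowest-reward hypotheses
    return [{"prompt": prompt, "chosen": chosen, "rejected": text} for text, _ in worst]


# Seven hypotheses with made-up reward scores, as in "generated 7 hypotheses".
hypotheses = [(f"hypothesis {i}", reward) for i, reward in enumerate([0.1, 0.9, 0.4, 0.7, 0.2, 0.8, 0.5])]
for pair in build_preference_pairs("example prompt", hypotheses):
    print(pair["chosen"], "vs", pair["rejected"])
```

With these scores the two rejected samples are `hypothesis 4` and `hypothesis 0`, both paired against `hypothesis 1`.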
+### How to Work with RAG
+The content of the `documents` role is a list of dictionaries describing document content, serialized with `json.dumps(array, ensure_ascii=False)` (see example below). \
+The document content can be presented in **3** different formats: **Markdown**, **HTML**, **Plain Text**. The content of each document can be a chunk of text up to 4k characters long.
+```json
+[
+  {
+    "doc_id": (0..5),
+    "title": "(null or str)",
+    "content": "(html or markdown or plain text)"
+  }
+]
+```
+#### Example of Correct Usage with an OpenAI-like API
+Launching the vLLM server: `vllm serve --dtype half --max-model-len 32000 -tp 1 Vikhrmodels/Vikhr-Nemo-12B-Instruct-R-21-09-24 --api-key token-abc123`
+```python
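+import json
+
+from openai import OpenAI
+
+# Assumption: the vLLM server started above listens on its default local address.
+llm_client = OpenAI(base_url="http://localhost:8000/v1", api_key="token-abc123")
+llm_model = "Vikhrmodels/Vikhr-Nemo-12B-Instruct-R-21-09-24"
+
+GROUNDED_SYSTEM_PROMPT = "Your task is to answer the user's questions using only the information from the provided documents. Give two answers to each question: one with a list of relevant document identifiers and the second with the answer to the question itself, using documents with these identifiers."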
+documents = [
+  {
+    "doc_id": 0,
+    "title": "Global Warming: Glaciers",
+    "content": "Over the past 50 years, the volume of glaciers worldwide has decreased by 30%."
+  },
+  {
+    "doc_id": 1,
+    "title": "Global Warming: Sea Level",
+    "content": "The global sea level has risen by 20 cm since 1880 and continues to rise by 3.3 mm per year."
+  }
+]
+sample_history = [
+    {'role': 'system', 'content': GROUNDED_SYSTEM_PROMPT},
+    {'role': 'documents', 'content': json.dumps(documents, ensure_ascii=False)},
+    {'role': 'user', 'content': 'Global warming'}
+]
+relevant_indexes = llm_client.chat.completions.create(
+    model=llm_model,
+    messages=sample_history,
+    temperature=0.0,
+    max_tokens=2048
+).choices[0].message.content
+print('Using documents: ' + relevant_indexes + '\n----')
+final_answer = llm_client.chat.completions.create(
+    model=llm_model,
+    messages=sample_history + [{'role': 'assistant', 'content': relevant_indexes}],
+    temperature=0.3,
+    max_tokens=2048
+).choices[0].message.content
+print(final_answer)
+```
+The response after running the code might look like this:
+    Using documents: {"relevant_doc_ids": [0, 1]}
+    ----
+    Global warming is a long-term increase in the Earth's atmosphere and ocean temperatures. This phenomenon has numerous consequences for our planet, and two key aspects based on the data available to me are:
+    1. **Reduction of glacier volume**: Over the past 50 years, there has been a significant reduction in glacier volume worldwide. According to the data, glacier volume has decreased by 30%. This could be related to glacier melting due to rising temperatures, which is one of the signs of global warming.
+    2. **Sea level rise**: The global sea level is also rising, which is associated with the melting of glaciers and ice sheets, as well as the expansion of water as it warms. Since 1880, the sea level has risen by 20 centimeters, and this process continues, with an annual increase of 3.3 millimeters.
+    These changes have serious consequences for ecosystems, climate, and human society. Glacier melting leads to sea-level rise, which can result in the flooding of coastal areas and islands, as well as changes in water resources and climate patterns.
+Using the model's first response, `relevant_indexes` (JSON), one can determine whether the model found information in the documents. The model is trained to return an empty array if no information is found, in which case it will state that it couldn’t find the information in the knowledge base (when generating the second response).
+### Nuances and Limitations
+- The model has a **low level of response safety** and is focused on correctly and fully executing instructions. Keep this in mind during usage and test it independently. This can be partially corrected with system prompts and additional user prompt guidance about the importance of safety.
+- System prompts are not intended for character descriptions; we recommend using them to specify the response style (e.g., "answer only in JSON format"). Additionally, it’s preferable to write them **in English**, as this was the case in the dataset; using English in system prompts does not affect the response language.
+- The RAG mode **requires the presence** of the system prompt `GROUNDED_SYSTEM_PROMPT` described in the "How to Work with RAG" section. The model may sometimes add general knowledge information to the response along with the information present in the documents.
+- The model works best with low temperatures (0.1-0.5) and top_k (30-50). At a temperature of 1.0, random generation defects were observed.
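The sampling recommendation in the last bullet can be packaged as request arguments. This is a sketch under assumptions: the helper name `sampling_kwargs` is hypothetical, the defaults are simply picked from inside the recommended ranges, and passing `top_k` through `extra_body` assumes a vLLM server, since `top_k` is not a standard OpenAI chat-completions parameter.

```python
def sampling_kwargs(temperature=0.3, top_k=40, max_tokens=2048):
    """Build keyword arguments for chat.completions.create().

    Defaults sit inside the ranges recommended above:
    temperature 0.1-0.5 and top_k 30-50.
    """
    return {
        "temperature": temperature,
        "max_tokens": max_tokens,
        # Assumption: the server is vLLM, which accepts extra sampling
        # parameters such as top_k through the client's extra_body argument.
        "extra_body": {"top_k": top_k},
    }


kwargs = sampling_kwargs()
print(kwargs)
# Usage with an OpenAI-compatible client (not executed here):
# llm_client.chat.completions.create(model=llm_model, messages=messages, **kwargs)
```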
+### Authors
+- Sergei Bratchikov, [NLP Wanderer](https://t.me/nlpwanderer), Vikhr Team
+- Konstantin Korolev, Vikhr Team
+- Aleksandr Nikolich, Vikhr Team

config.json ADDED

	@@ -0,0 +1,38 @@

+{
+    "_name_or_path": "Vikhrmodels/Vikhr-Nemo-12B-Instruct-R-05-09-24",
+    "architectures": [
+        "MistralForCausalLM"
+    ],
+    "attention_dropout": 0.0,
+    "bos_token_id": 1,
+    "eos_token_id": 2,
+    "head_dim": 128,
+    "hidden_act": "silu",
+    "hidden_size": 5120,
+    "initializer_range": 0.02,
+    "intermediate_size": 14336,
+    "max_position_embeddings": 1024000,
+    "model_type": "mistral",
+    "num_attention_heads": 32,
+    "num_hidden_layers": 40,
+    "num_key_value_heads": 8,
+    "rms_norm_eps": 1e-05,
+    "rope_theta": 1000000.0,
+    "sliding_window": null,
+    "tie_word_embeddings": false,
+    "torch_dtype": "bfloat16",
+    "transformers_version": "4.44.2",
+    "use_cache": true,
+    "vocab_size": 131074,
+    "quantization_config": {
+        "quant_method": "exl2",
+        "version": "0.2.2",
+        "bits": 8.0,
+        "head_bits": 8,
+        "calibration": {
+            "rows": 115,
+            "length": 2048,
+            "dataset": "(default)"
+        }
+    }
+}
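The attention fields in this config allow a rough estimate of KV-cache memory, which is relevant when choosing `--max-model-len`. The sketch below uses the standard estimate of two cached tensors (keys and values) per layer and assumes a 16-bit cache; real servers add allocator and framework overhead, so treat the numbers as a lower bound rather than a measurement.

```python
# Values copied from config.json above.
config = {"num_hidden_layers": 40, "num_key_value_heads": 8, "head_dim": 128}


def kv_cache_bytes_per_token(cfg, bytes_per_element=2):
    """Estimate KV-cache bytes needed for one token.

    Assumption: a 16-bit cache (2 bytes per element). The factor 2
    accounts for storing both keys and values in every layer.
    """
    return (2 * cfg["num_hidden_layers"] * cfg["num_key_value_heads"]
            * cfg["head_dim"] * bytes_per_element)


per_token = kv_cache_bytes_per_token(config)
print(per_token)  # 163840 bytes, i.e. 160 KiB per token

# Context length used in the README's vLLM command (--max-model-len 32000).
print(round(per_token * 32000 / 2**30, 2))  # 4.88 GiB for one full-length sequence
```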

generation_config.json ADDED

	@@ -0,0 +1,6 @@

+{
+  "_from_model_config": true,
+  "bos_token_id": 1,
+  "eos_token_id": 2,
+  "transformers_version": "4.44.2"
+}

model.safetensors.index.json ADDED

	@@ -0,0 +1,370 @@

+{
+  "metadata": {
+    "total_size": 24495605760
+  },
+  "weight_map": {
+    "lm_head.weight": "model-00005-of-00005.safetensors",
+    "model.embed_tokens.weight": "model-00001-of-00005.safetensors",
+    "model.layers.0.input_layernorm.weight": "model-00001-of-00005.safetensors",
+    "model.layers.0.mlp.down_proj.weight": "model-00001-of-00005.safetensors",
+    "model.layers.0.mlp.gate_proj.weight": "model-00001-of-00005.safetensors",
+    "model.layers.0.mlp.up_proj.weight": "model-00001-of-00005.safetensors",
+    "model.layers.0.post_attention_layernorm.weight": "model-00001-of-00005.safetensors",
+    "model.layers.0.self_attn.k_proj.weight": "model-00001-of-00005.safetensors",
+    "model.layers.0.self_attn.o_proj.weight": "model-00001-of-00005.safetensors",
+    "model.layers.0.self_attn.q_proj.weight": "model-00001-of-00005.safetensors",
+    "model.layers.0.self_attn.v_proj.weight": "model-00001-of-00005.safetensors",
+    "model.layers.1.input_layernorm.weight": "model-00001-of-00005.safetensors",
+    "model.layers.1.mlp.down_proj.weight": "model-00001-of-00005.safetensors",
+    "model.layers.1.mlp.gate_proj.weight": "model-00001-of-00005.safetensors",
+    "model.layers.1.mlp.up_proj.weight": "model-00001-of-00005.safetensors",
+    "model.layers.1.post_attention_layernorm.weight": "model-00001-of-00005.safetensors",
+    "model.layers.1.self_attn.k_proj.weight": "model-00001-of-00005.safetensors",
+    "model.layers.1.self_attn.o_proj.weight": "model-00001-of-00005.safetensors",
+    "model.layers.1.self_attn.q_proj.weight": "model-00001-of-00005.safetensors",
+    "model.layers.1.self_attn.v_proj.weight": "model-00001-of-00005.safetensors",
+    "model.layers.10.input_layernorm.weight": "model-00002-of-00005.safetensors",
+    "model.layers.10.mlp.down_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.10.mlp.gate_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.10.mlp.up_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.10.post_attention_layernorm.weight": "model-00002-of-00005.safetensors",
+    "model.layers.10.self_attn.k_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.10.self_attn.o_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.10.self_attn.q_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.10.self_attn.v_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.11.input_layernorm.weight": "model-00002-of-00005.safetensors",
+    "model.layers.11.mlp.down_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.11.mlp.gate_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.11.mlp.up_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.11.post_attention_layernorm.weight": "model-00002-of-00005.safetensors",
+    "model.layers.11.self_attn.k_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.11.self_attn.o_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.11.self_attn.q_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.11.self_attn.v_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.12.input_layernorm.weight": "model-00002-of-00005.safetensors",
+    "model.layers.12.mlp.down_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.12.mlp.gate_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.12.mlp.up_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.12.post_attention_layernorm.weight": "model-00002-of-00005.safetensors",
+    "model.layers.12.self_attn.k_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.12.self_attn.o_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.12.self_attn.q_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.12.self_attn.v_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.13.input_layernorm.weight": "model-00002-of-00005.safetensors",
+    "model.layers.13.mlp.down_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.13.mlp.gate_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.13.mlp.up_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.13.post_attention_layernorm.weight": "model-00002-of-00005.safetensors",
+    "model.layers.13.self_attn.k_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.13.self_attn.o_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.13.self_attn.q_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.13.self_attn.v_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.14.input_layernorm.weight": "model-00002-of-00005.safetensors",
+    "model.layers.14.mlp.down_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.14.mlp.gate_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.14.mlp.up_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.14.post_attention_layernorm.weight": "model-00002-of-00005.safetensors",
+    "model.layers.14.self_attn.k_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.14.self_attn.o_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.14.self_attn.q_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.14.self_attn.v_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.15.input_layernorm.weight": "model-00003-of-00005.safetensors",
+    "model.layers.15.mlp.down_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.15.mlp.gate_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.15.mlp.up_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.15.post_attention_layernorm.weight": "model-00003-of-00005.safetensors",
+    "model.layers.15.self_attn.k_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.15.self_attn.o_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.15.self_attn.q_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.15.self_attn.v_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.16.input_layernorm.weight": "model-00003-of-00005.safetensors",
+    "model.layers.16.mlp.down_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.16.mlp.gate_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.16.mlp.up_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.16.post_attention_layernorm.weight": "model-00003-of-00005.safetensors",
+    "model.layers.16.self_attn.k_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.16.self_attn.o_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.16.self_attn.q_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.16.self_attn.v_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.17.input_layernorm.weight": "model-00003-of-00005.safetensors",
+    "model.layers.17.mlp.down_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.17.mlp.gate_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.17.mlp.up_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.17.post_attention_layernorm.weight": "model-00003-of-00005.safetensors",
+    "model.layers.17.self_attn.k_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.17.self_attn.o_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.17.self_attn.q_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.17.self_attn.v_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.18.input_layernorm.weight": "model-00003-of-00005.safetensors",
+    "model.layers.18.mlp.down_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.18.mlp.gate_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.18.mlp.up_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.18.post_attention_layernorm.weight": "model-00003-of-00005.safetensors",
+    "model.layers.18.self_attn.k_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.18.self_attn.o_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.18.self_attn.q_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.18.self_attn.v_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.19.input_layernorm.weight": "model-00003-of-00005.safetensors",
+    "model.layers.19.mlp.down_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.19.mlp.gate_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.19.mlp.up_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.19.post_attention_layernorm.weight": "model-00003-of-00005.safetensors",
+    "model.layers.19.self_attn.k_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.19.self_attn.o_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.19.self_attn.q_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.19.self_attn.v_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.2.input_layernorm.weight": "model-00001-of-00005.safetensors",
+    "model.layers.2.mlp.down_proj.weight": "model-00001-of-00005.safetensors",
+    "model.layers.2.mlp.gate_proj.weight": "model-00001-of-00005.safetensors",
+    "model.layers.2.mlp.up_proj.weight": "model-00001-of-00005.safetensors",
+    "model.layers.2.post_attention_layernorm.weight": "model-00001-of-00005.safetensors",
+    "model.layers.2.self_attn.k_proj.weight": "model-00001-of-00005.safetensors",
+    "model.layers.2.self_attn.o_proj.weight": "model-00001-of-00005.safetensors",
+    "model.layers.2.self_attn.q_proj.weight": "model-00001-of-00005.safetensors",
+    "model.layers.2.self_attn.v_proj.weight": "model-00001-of-00005.safetensors",
+    "model.layers.20.input_layernorm.weight": "model-00003-of-00005.safetensors",
+    "model.layers.20.mlp.down_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.20.mlp.gate_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.20.mlp.up_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.20.post_attention_layernorm.weight": "model-00003-of-00005.safetensors",
+    "model.layers.20.self_attn.k_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.20.self_attn.o_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.20.self_attn.q_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.20.self_attn.v_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.21.input_layernorm.weight": "model-00003-of-00005.safetensors",
+    "model.layers.21.mlp.down_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.21.mlp.gate_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.21.mlp.up_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.21.post_attention_layernorm.weight": "model-00003-of-00005.safetensors",
+    "model.layers.21.self_attn.k_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.21.self_attn.o_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.21.self_attn.q_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.21.self_attn.v_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.22.input_layernorm.weight": "model-00003-of-00005.safetensors",
+    "model.layers.22.mlp.down_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.22.mlp.gate_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.22.mlp.up_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.22.post_attention_layernorm.weight": "model-00003-of-00005.safetensors",
+    "model.layers.22.self_attn.k_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.22.self_attn.o_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.22.self_attn.q_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.22.self_attn.v_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.23.input_layernorm.weight": "model-00003-of-00005.safetensors",
+    "model.layers.23.mlp.down_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.23.mlp.gate_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.23.mlp.up_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.23.post_attention_layernorm.weight": "model-00003-of-00005.safetensors",
+    "model.layers.23.self_attn.k_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.23.self_attn.o_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.23.self_attn.q_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.23.self_attn.v_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.24.input_layernorm.weight": "model-00004-of-00005.safetensors",
+    "model.layers.24.mlp.down_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.24.mlp.gate_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.24.mlp.up_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.24.post_attention_layernorm.weight": "model-00004-of-00005.safetensors",
+    "model.layers.24.self_attn.k_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.24.self_attn.o_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.24.self_attn.q_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.24.self_attn.v_proj.weight": "model-00003-of-00005.safetensors",
+    "model.layers.25.input_layernorm.weight": "model-00004-of-00005.safetensors",
+    "model.layers.25.mlp.down_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.25.mlp.gate_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.25.mlp.up_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.25.post_attention_layernorm.weight": "model-00004-of-00005.safetensors",
+    "model.layers.25.self_attn.k_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.25.self_attn.o_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.25.self_attn.q_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.25.self_attn.v_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.26.input_layernorm.weight": "model-00004-of-00005.safetensors",
+    "model.layers.26.mlp.down_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.26.mlp.gate_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.26.mlp.up_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.26.post_attention_layernorm.weight": "model-00004-of-00005.safetensors",
+    "model.layers.26.self_attn.k_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.26.self_attn.o_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.26.self_attn.q_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.26.self_attn.v_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.27.input_layernorm.weight": "model-00004-of-00005.safetensors",
+    "model.layers.27.mlp.down_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.27.mlp.gate_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.27.mlp.up_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.27.post_attention_layernorm.weight": "model-00004-of-00005.safetensors",
+    "model.layers.27.self_attn.k_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.27.self_attn.o_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.27.self_attn.q_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.27.self_attn.v_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.28.input_layernorm.weight": "model-00004-of-00005.safetensors",
+    "model.layers.28.mlp.down_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.28.mlp.gate_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.28.mlp.up_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.28.post_attention_layernorm.weight": "model-00004-of-00005.safetensors",
+    "model.layers.28.self_attn.k_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.28.self_attn.o_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.28.self_attn.q_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.28.self_attn.v_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.29.input_layernorm.weight": "model-00004-of-00005.safetensors",
+    "model.layers.29.mlp.down_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.29.mlp.gate_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.29.mlp.up_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.29.post_attention_layernorm.weight": "model-00004-of-00005.safetensors",
+    "model.layers.29.self_attn.k_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.29.self_attn.o_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.29.self_attn.q_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.29.self_attn.v_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.3.input_layernorm.weight": "model-00001-of-00005.safetensors",
+    "model.layers.3.mlp.down_proj.weight": "model-00001-of-00005.safetensors",
+    "model.layers.3.mlp.gate_proj.weight": "model-00001-of-00005.safetensors",
+    "model.layers.3.mlp.up_proj.weight": "model-00001-of-00005.safetensors",
+    "model.layers.3.post_attention_layernorm.weight": "model-00001-of-00005.safetensors",
+    "model.layers.3.self_attn.k_proj.weight": "model-00001-of-00005.safetensors",
+    "model.layers.3.self_attn.o_proj.weight": "model-00001-of-00005.safetensors",
+    "model.layers.3.self_attn.q_proj.weight": "model-00001-of-00005.safetensors",
+    "model.layers.3.self_attn.v_proj.weight": "model-00001-of-00005.safetensors",
+    "model.layers.30.input_layernorm.weight": "model-00004-of-00005.safetensors",
+    "model.layers.30.mlp.down_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.30.mlp.gate_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.30.mlp.up_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.30.post_attention_layernorm.weight": "model-00004-of-00005.safetensors",
+    "model.layers.30.self_attn.k_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.30.self_attn.o_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.30.self_attn.q_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.30.self_attn.v_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.31.input_layernorm.weight": "model-00004-of-00005.safetensors",
+    "model.layers.31.mlp.down_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.31.mlp.gate_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.31.mlp.up_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.31.post_attention_layernorm.weight": "model-00004-of-00005.safetensors",
+    "model.layers.31.self_attn.k_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.31.self_attn.o_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.31.self_attn.q_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.31.self_attn.v_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.32.input_layernorm.weight": "model-00004-of-00005.safetensors",
+    "model.layers.32.mlp.down_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.32.mlp.gate_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.32.mlp.up_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.32.post_attention_layernorm.weight": "model-00004-of-00005.safetensors",
+    "model.layers.32.self_attn.k_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.32.self_attn.o_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.32.self_attn.q_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.32.self_attn.v_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.33.input_layernorm.weight": "model-00005-of-00005.safetensors",
+    "model.layers.33.mlp.down_proj.weight": "model-00005-of-00005.safetensors",
+    "model.layers.33.mlp.gate_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.33.mlp.up_proj.weight": "model-00005-of-00005.safetensors",
+    "model.layers.33.post_attention_layernorm.weight": "model-00005-of-00005.safetensors",
+    "model.layers.33.self_attn.k_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.33.self_attn.o_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.33.self_attn.q_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.33.self_attn.v_proj.weight": "model-00004-of-00005.safetensors",
+    "model.layers.34.input_layernorm.weight": "model-00005-of-00005.safetensors",
+    "model.layers.34.mlp.down_proj.weight": "model-00005-of-00005.safetensors",
+    "model.layers.34.mlp.gate_proj.weight": "model-00005-of-00005.safetensors",
+    "model.layers.34.mlp.up_proj.weight": "model-00005-of-00005.safetensors",
+    "model.layers.34.post_attention_layernorm.weight": "model-00005-of-00005.safetensors",
+    "model.layers.34.self_attn.k_proj.weight": "model-00005-of-00005.safetensors",
+    "model.layers.34.self_attn.o_proj.weight": "model-00005-of-00005.safetensors",
+    "model.layers.34.self_attn.q_proj.weight": "model-00005-of-00005.safetensors",
+    "model.layers.34.self_attn.v_proj.weight": "model-00005-of-00005.safetensors",
+    "model.layers.35.input_layernorm.weight": "model-00005-of-00005.safetensors",
+    "model.layers.35.mlp.down_proj.weight": "model-00005-of-00005.safetensors",
+    "model.layers.35.mlp.gate_proj.weight": "model-00005-of-00005.safetensors",
+    "model.layers.35.mlp.up_proj.weight": "model-00005-of-00005.safetensors",
+    "model.layers.35.post_attention_layernorm.weight": "model-00005-of-00005.safetensors",
+    "model.layers.35.self_attn.k_proj.weight": "model-00005-of-00005.safetensors",
+    "model.layers.35.self_attn.o_proj.weight": "model-00005-of-00005.safetensors",
+    "model.layers.35.self_attn.q_proj.weight": "model-00005-of-00005.safetensors",
+    "model.layers.35.self_attn.v_proj.weight": "model-00005-of-00005.safetensors",
+    "model.layers.36.input_layernorm.weight": "model-00005-of-00005.safetensors",
+    "model.layers.36.mlp.down_proj.weight": "model-00005-of-00005.safetensors",
+    "model.layers.36.mlp.gate_proj.weight": "model-00005-of-00005.safetensors",
+    "model.layers.36.mlp.up_proj.weight": "model-00005-of-00005.safetensors",
+    "model.layers.36.post_attention_layernorm.weight": "model-00005-of-00005.safetensors",
+    "model.layers.36.self_attn.k_proj.weight": "model-00005-of-00005.safetensors",
+    "model.layers.36.self_attn.o_proj.weight": "model-00005-of-00005.safetensors",
+    "model.layers.36.self_attn.q_proj.weight": "model-00005-of-00005.safetensors",
+    "model.layers.36.self_attn.v_proj.weight": "model-00005-of-00005.safetensors",
+    "model.layers.37.input_layernorm.weight": "model-00005-of-00005.safetensors",
+    "model.layers.37.mlp.down_proj.weight": "model-00005-of-00005.safetensors",
+    "model.layers.37.mlp.gate_proj.weight": "model-00005-of-00005.safetensors",
+    "model.layers.37.mlp.up_proj.weight": "model-00005-of-00005.safetensors",
+    "model.layers.37.post_attention_layernorm.weight": "model-00005-of-00005.safetensors",
+    "model.layers.37.self_attn.k_proj.weight": "model-00005-of-00005.safetensors",
+    "model.layers.37.self_attn.o_proj.weight": "model-00005-of-00005.safetensors",
+    "model.layers.37.self_attn.q_proj.weight": "model-00005-of-00005.safetensors",
+    "model.layers.37.self_attn.v_proj.weight": "model-00005-of-00005.safetensors",
+    "model.layers.38.input_layernorm.weight": "model-00005-of-00005.safetensors",
+    "model.layers.38.mlp.down_proj.weight": "model-00005-of-00005.safetensors",
+    "model.layers.38.mlp.gate_proj.weight": "model-00005-of-00005.safetensors",
+    "model.layers.38.mlp.up_proj.weight": "model-00005-of-00005.safetensors",
+    "model.layers.38.post_attention_layernorm.weight": "model-00005-of-00005.safetensors",
+    "model.layers.38.self_attn.k_proj.weight": "model-00005-of-00005.safetensors",
+    "model.layers.38.self_attn.o_proj.weight": "model-00005-of-00005.safetensors",
+    "model.layers.38.self_attn.q_proj.weight": "model-00005-of-00005.safetensors",
+    "model.layers.38.self_attn.v_proj.weight": "model-00005-of-00005.safetensors",
+    "model.layers.39.input_layernorm.weight": "model-00005-of-00005.safetensors",
+    "model.layers.39.mlp.down_proj.weight": "model-00005-of-00005.safetensors",
+    "model.layers.39.mlp.gate_proj.weight": "model-00005-of-00005.safetensors",
+    "model.layers.39.mlp.up_proj.weight": "model-00005-of-00005.safetensors",
+    "model.layers.39.post_attention_layernorm.weight": "model-00005-of-00005.safetensors",
+    "model.layers.39.self_attn.k_proj.weight": "model-00005-of-00005.safetensors",
+    "model.layers.39.self_attn.o_proj.weight": "model-00005-of-00005.safetensors",
+    "model.layers.39.self_attn.q_proj.weight": "model-00005-of-00005.safetensors",
+    "model.layers.39.self_attn.v_proj.weight": "model-00005-of-00005.safetensors",
+    "model.layers.4.input_layernorm.weight": "model-00001-of-00005.safetensors",
+    "model.layers.4.mlp.down_proj.weight": "model-00001-of-00005.safetensors",
+    "model.layers.4.mlp.gate_proj.weight": "model-00001-of-00005.safetensors",
+    "model.layers.4.mlp.up_proj.weight": "model-00001-of-00005.safetensors",
+    "model.layers.4.post_attention_layernorm.weight": "model-00001-of-00005.safetensors",
+    "model.layers.4.self_attn.k_proj.weight": "model-00001-of-00005.safetensors",
+    "model.layers.4.self_attn.o_proj.weight": "model-00001-of-00005.safetensors",
+    "model.layers.4.self_attn.q_proj.weight": "model-00001-of-00005.safetensors",
+    "model.layers.4.self_attn.v_proj.weight": "model-00001-of-00005.safetensors",
+    "model.layers.5.input_layernorm.weight": "model-00001-of-00005.safetensors",
+    "model.layers.5.mlp.down_proj.weight": "model-00001-of-00005.safetensors",
+    "model.layers.5.mlp.gate_proj.weight": "model-00001-of-00005.safetensors",
+    "model.layers.5.mlp.up_proj.weight": "model-00001-of-00005.safetensors",
+    "model.layers.5.post_attention_layernorm.weight": "model-00001-of-00005.safetensors",
+    "model.layers.5.self_attn.k_proj.weight": "model-00001-of-00005.safetensors",
+    "model.layers.5.self_attn.o_proj.weight": "model-00001-of-00005.safetensors",
+    "model.layers.5.self_attn.q_proj.weight": "model-00001-of-00005.safetensors",
+    "model.layers.5.self_attn.v_proj.weight": "model-00001-of-00005.safetensors",
+    "model.layers.6.input_layernorm.weight": "model-00002-of-00005.safetensors",
+    "model.layers.6.mlp.down_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.6.mlp.gate_proj.weight": "model-00001-of-00005.safetensors",
+    "model.layers.6.mlp.up_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.6.post_attention_layernorm.weight": "model-00002-of-00005.safetensors",
+    "model.layers.6.self_attn.k_proj.weight": "model-00001-of-00005.safetensors",
+    "model.layers.6.self_attn.o_proj.weight": "model-00001-of-00005.safetensors",
+    "model.layers.6.self_attn.q_proj.weight": "model-00001-of-00005.safetensors",
+    "model.layers.6.self_attn.v_proj.weight": "model-00001-of-00005.safetensors",
+    "model.layers.7.input_layernorm.weight": "model-00002-of-00005.safetensors",
+    "model.layers.7.mlp.down_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.7.mlp.gate_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.7.mlp.up_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.7.post_attention_layernorm.weight": "model-00002-of-00005.safetensors",
+    "model.layers.7.self_attn.k_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.7.self_attn.o_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.7.self_attn.q_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.7.self_attn.v_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.8.input_layernorm.weight": "model-00002-of-00005.safetensors",
+    "model.layers.8.mlp.down_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.8.mlp.gate_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.8.mlp.up_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.8.post_attention_layernorm.weight": "model-00002-of-00005.safetensors",
+    "model.layers.8.self_attn.k_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.8.self_attn.o_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.8.self_attn.q_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.8.self_attn.v_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.9.input_layernorm.weight": "model-00002-of-00005.safetensors",
+    "model.layers.9.mlp.down_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.9.mlp.gate_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.9.mlp.up_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.9.post_attention_layernorm.weight": "model-00002-of-00005.safetensors",
+    "model.layers.9.self_attn.k_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.9.self_attn.o_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.9.self_attn.q_proj.weight": "model-00002-of-00005.safetensors",
+    "model.layers.9.self_attn.v_proj.weight": "model-00002-of-00005.safetensors",
+    "model.norm.weight": "model-00005-of-00005.safetensors"
+  }
+}
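
The `weight_map` in the index above assigns each tensor name to the shard file that stores it, and a single layer can be split across two shards (layer 15, for example). The following is a minimal sketch, not part of the commit, that inverts this mapping so the layout can be inspected per shard. The `index` literal is a four-entry excerpt copied from the entries above, and `tensors_by_shard` is a hypothetical helper name. A full index would normally be read with `json.load()`.

```python
from collections import defaultdict

# Excerpt of the index shown above (assumption: the real file has the same
# top-level "weight_map" key and would be loaded with json.load()).
index = {
    "weight_map": {
        "model.layers.14.self_attn.v_proj.weight": "model-00002-of-00005.safetensors",
        "model.layers.15.mlp.down_proj.weight": "model-00003-of-00005.safetensors",
        "model.layers.15.mlp.gate_proj.weight": "model-00002-of-00005.safetensors",
        "model.norm.weight": "model-00005-of-00005.safetensors",
    }
}


def tensors_by_shard(weight_map):
    """Hypothetical helper: invert tensor -> shard into shard -> sorted tensor names."""
    shards = defaultdict(list)
    for name, shard in weight_map.items():
        shards[shard].append(name)
    # Sort shards and tensor names so the result is deterministic.
    return {shard: sorted(names) for shard, names in sorted(shards.items())}


grouped = tensors_by_shard(index["weight_map"])
for shard, names in grouped.items():
    print(shard, len(names))
```

A grouping like this also makes it straightforward to compare the shard names the index refers to (`model-0000N-of-00005.safetensors`) with the weight files actually added in this commit (`output-0000N-of-00002.safetensors`).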

output-00001-of-00002.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:48aa0433e72ffd47c68f0931de3a0838ed6a9d15c1e81181a7a158100261a746
+size 8577950148

output-00002-of-00002.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:a89172c4108a0a76b9f0a67edcb1be29231277ea1eb95cf60dc54b9b367ab445
+size 3885376648
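
The two `.safetensors` entries above are Git LFS pointer files, not the weights themselves: each holds a spec version, a `sha256` object ID, and the size in bytes of the real file. The following is a minimal sketch, assuming the three-line `key value` layout shown above, that parses such a pointer. `parse_lfs_pointer` is a hypothetical helper name, and the `pointer` text is copied from the second file.

```python
def parse_lfs_pointer(text):
    """Hypothetical helper: parse the 'key value' lines of a Git LFS pointer file."""
    fields = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        key, _, value = line.partition(" ")
        fields[key] = value
    # The oid field has the form "<hash algorithm>:<hex digest>".
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": algo,
        "digest": digest,
        "size": int(fields["size"]),  # size of the real file, in bytes
    }


# Pointer contents copied from output-00002-of-00002.safetensors above.
pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:a89172c4108a0a76b9f0a67edcb1be29231277ea1eb95cf60dc54b9b367ab445
size 3885376648
"""

info = parse_lfs_pointer(pointer)
print(info["hash_algo"], info["size"])
```

Once the actual file has been downloaded, the `digest` and `size` fields can be checked against it to confirm the transfer was complete.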

special_tokens_map.json ADDED

	@@ -0,0 +1,34 @@

+{
+  "additional_special_tokens": [
+    "<|start_header_id|>",
+    "<|end_header_id|>"
+  ],
+  "bos_token": {
+    "content": "<s>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "eos_token": {
+    "content": "</s>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "pad_token": {
+    "content": "<pad>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "unk_token": {
+    "content": "<unk>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  }
+}

tokenizer.json ADDED

The diff for this file is too large to render. See raw diff

tokenizer_config.json ADDED

The diff for this file is too large to render. See raw diff