metadata

language:
  - en
tags:
  - llama
license: other

OpenChat: Advancing Open-source Language Models with Imperfect Data

The OpenChat v2 family is inspired by offline reinforcement learning, including conditional behavior cloning (OpenChat-v2) and weighted behavior cloning (OpenChat-v2-w).

OpenChat-v2-w: ~80k cleaned ShareGPT data with conditioning and weighted loss, based on LLaMA-13B with a context length of 2048.
- Achieves 50.9% win-rate over ChatGPT on MT-bench.
- Achieves 79.4% win-rate over ChatGPT on Vicuna-bench.
- Achieves 87.1% win-rate over text-davinci-003 on AlpacaEval.
OpenChat-v2: ~80k cleaned ShareGPT data with only conditioning, based on LLaMA-13B with a context length of 2048.
- Achieves 48.1% win-rate over ChatGPT on MT-bench.
- Achieves 80.6% win-rate over ChatGPT on Vicuna-bench.
- Achieves 85.0% win-rate over text-davinci-003 on AlpacaEval.

Code and Inference Server

We provide the full source code, including an inference server compatible with the "ChatCompletions" API, in the OpenChat GitHub repository.

Web UI

OpenChat also includes a web UI for a better user experience. See the GitHub repository for instructions.

Conversation Template

The conversation template involves concatenating tokens, and cannot be expressed in plain-text.

Besides base model vocabulary, an end-of-turn token <|end_of_turn|> is added.

Here is an example of single-round conversation template:

def tokenize_single_input(tokenizer, prompt):
    # OpenChat V2
    human_prefix = "User:"
    prefix    = "Assistant GPT4:"
    eot_token = "<|end_of_turn|>"
    bos_token = "<s>"

    def _tokenize(text):
        return tokenizer.convert_tokens_to_ids(tokenizer._tokenize(text))

    def _tokenize_special(special_name):
        return tokenizer.convert_tokens_to_ids(special_name)
    
    return [_tokenize_special(bos_token)] + _tokenize(human_prefix) + _tokenize(prompt) + [_tokenize_special(eot_token)] + \
           _tokenize(prefix)

To explore conditional language models, you can also set prefix = "Assistant GPT3:" to mimic ChatGPT behavior (this may cause performance degradation).

Hint: In BPE, tokenize(A) + tokenize(B) does not always equals to tokenize(A + B)

Limitations

Foundation Model Limitations Despite its advanced capabilities, OpenChat is still bound by the limitations inherent in its foundation models. These limitations may impact the model's performance in areas such as:

Complex reasoning
Mathematical and arithmetic tasks
Programming and coding challenges

Hallucination of Non-existent Information OpenChat may sometimes generate information that does not exist or is not accurate, also known as "hallucination". Users should be aware of this possibility and verify any critical information obtained from the model.

Open LLM Leaderboard Evaluation Results

Detailed results can be found here

Metric	Value
Avg.	47.16
ARC (25-shot)	57.17
HellaSwag (10-shot)	81.14
MMLU (5-shot)	50.58
TruthfulQA (0-shot)	49.54
Winogrande (5-shot)	76.24
GSM8K (5-shot)	9.1
DROP (3-shot)	6.37