leaderboard-pr-bot's picture
Adding Evaluation Results
eaf0126 verified
|
raw
history blame
3.88 kB
metadata
license: llama2
model-index:
  - name: vicuna-mmlu-val-mcq-7b-ep2
    results:
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: AI2 Reasoning Challenge (25-Shot)
          type: ai2_arc
          config: ARC-Challenge
          split: test
          args:
            num_few_shot: 25
        metrics:
          - type: acc_norm
            value: 53.33
            name: normalized accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=luffycodes/vicuna-mmlu-val-mcq-7b-ep2
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: HellaSwag (10-Shot)
          type: hellaswag
          split: validation
          args:
            num_few_shot: 10
        metrics:
          - type: acc_norm
            value: 77.73
            name: normalized accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=luffycodes/vicuna-mmlu-val-mcq-7b-ep2
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: MMLU (5-Shot)
          type: cais/mmlu
          config: all
          split: test
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 46.85
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=luffycodes/vicuna-mmlu-val-mcq-7b-ep2
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: TruthfulQA (0-shot)
          type: truthful_qa
          config: multiple_choice
          split: validation
          args:
            num_few_shot: 0
        metrics:
          - type: mc2
            value: 43.87
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=luffycodes/vicuna-mmlu-val-mcq-7b-ep2
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: Winogrande (5-shot)
          type: winogrande
          config: winogrande_xl
          split: validation
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 71.27
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=luffycodes/vicuna-mmlu-val-mcq-7b-ep2
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: GSM8k (5-shot)
          type: gsm8k
          config: main
          split: test
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 10.24
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=luffycodes/vicuna-mmlu-val-mcq-7b-ep2
          name: Open LLM Leaderboard

If you use this work, please cite: CLASS Meet SPOCK: An Education Tutoring Chatbot based on Learning Science Principles https://arxiv.org/abs/2305.13272

@misc{sonkar2023class,
      title={CLASS Meet SPOCK: An Education Tutoring Chatbot based on Learning Science Principles}, 
      author={Shashank Sonkar and Lucy Liu and Debshila Basu Mallick and Richard G. Baraniuk},
      year={2023},
      eprint={2305.13272},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Open LLM Leaderboard Evaluation Results

Detailed results can be found here

Metric Value
Avg. 50.55
AI2 Reasoning Challenge (25-Shot) 53.33
HellaSwag (10-Shot) 77.73
MMLU (5-Shot) 46.85
TruthfulQA (0-shot) 43.87
Winogrande (5-shot) 71.27
GSM8k (5-shot) 10.24