Corey Morris committed

Commit: 2037152
Parent(s): 9ecc99c

check for URL and full model name

app.py CHANGED
@@ -115,9 +115,6 @@ st.title('Interactive Portal for Analyzing Open Source Large Language Models')
 st.markdown("""***Last updated October 6th***""")
 st.markdown("""**Models that are suspected to have training data contaminated with evaluation data have been removed.**""")
 st.markdown("""
-Hugging Face runs evaluations on open source models and provides results on a
-[publicly available leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard) and [dataset](https://huggingface.co/datasets/open-llm-leaderboard/results).
-The Hugging Face leaderboard currently displays the overall result for Measuring Massive Multitask Language Understanding (MMLU), but not the results for individual tasks.
 This page provides a way to explore the results for individual tasks and compare models across tasks. Data for the benchmarks hellaswag, arc_challenge, and truthfulQA have also been included for comparison.
 There are 57 tasks in the MMLU evaluation that cover a wide variety of subjects including Science, Math, Humanities, Social Science, Applied Science, Logic, and Security.
 [Preliminary analysis of MMLU-by-Task data](https://coreymorrisdata.medium.com/preliminary-analysis-of-mmlu-evaluation-data-insights-from-500-open-source-models-e67885aa364b)
@@ -260,9 +257,13 @@ st.markdown("***The dashed red line indicates random chance accuracy of 0.25 as
 st.markdown("***")
 st.write("As expected, there is a strong positive relationship between the number of parameters and average performance on the MMLU evaluation.")
 
+
 column_list_for_plotting = filtered_data.columns.tolist()
-
-column_list_for_plotting.remove('URL')
+if 'URL' in column_list_for_plotting:
+    column_list_for_plotting.remove('URL')
+if 'full_model_name' in column_list_for_plotting:
+    column_list_for_plotting.remove('full_model_name')
+
 selected_x_column = st.selectbox('Select x-axis', column_list_for_plotting, index=0)
 selected_y_column = st.selectbox('Select y-axis', column_list_for_plotting, index=1)
 
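The point of the guarded `if ... in` pattern in this change is that `list.remove` raises `ValueError` when the item is absent, so an unconditional `remove('URL')` would crash the app on any DataFrame that lacks that column. A minimal sketch of the pattern (the column names here are illustrative, not the app's real data):

```python
# Illustrative column list; the real app derives it from
# filtered_data.columns.tolist().
columns = ["MMLU_average", "URL", "Parameters"]

# Unconditional removal fails when the column is missing.
try:
    list(columns).remove("full_model_name")
except ValueError:
    pass  # 'full_model_name' was not present

# Guarding with `in` makes the removal safe either way,
# as in the committed code.
for hidden in ("URL", "full_model_name"):
    if hidden in columns:
        columns.remove(hidden)

print(columns)  # only the plottable columns remain
```

An alternative would be a list comprehension (`[c for c in columns if c not in hidden_columns]`), which avoids mutating the list in place.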