{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# F1 Scores" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "vscode": { "languageId": "r" } }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Loading required package: ggplot2\n", "\n", "-- \u001b[1mAttaching packages\u001b[22m --------------------------------------- tidyverse 1.3.1 --\n", "\n", "\u001b[32mv\u001b[39m \u001b[34mtibble \u001b[39m 3.1.5 \u001b[32mv\u001b[39m \u001b[34mdplyr \u001b[39m 1.0.7\n", "\u001b[32mv\u001b[39m \u001b[34mtidyr \u001b[39m 1.1.4 \u001b[32mv\u001b[39m \u001b[34mstringr\u001b[39m 1.4.0\n", "\u001b[32mv\u001b[39m \u001b[34mpurrr \u001b[39m 0.3.4 \u001b[32mv\u001b[39m \u001b[34mforcats\u001b[39m 0.5.1\n", "\n", "-- \u001b[1mConflicts\u001b[22m ------------------------------------------ tidyverse_conflicts() --\n", "\u001b[31mx\u001b[39m \u001b[34mdplyr\u001b[39m::\u001b[32mfilter()\u001b[39m masks \u001b[34mstats\u001b[39m::filter()\n", "\u001b[31mx\u001b[39m \u001b[34mdplyr\u001b[39m::\u001b[32mlag()\u001b[39m masks \u001b[34mstats\u001b[39m::lag()\n", "\n", "Loading required package: mvtnorm\n", "\n", "Loading required package: survival\n", "\n", "Loading required package: TH.data\n", "\n", "Loading required package: MASS\n", "\n", "\n", "Attaching package: 'MASS'\n", "\n", "\n", "The following object is masked from 'package:dplyr':\n", "\n", " select\n", "\n", "\n", "\n", "Attaching package: 'TH.data'\n", "\n", "\n", "The following object is masked from 'package:MASS':\n", "\n", " geyser\n", "\n", "\n", "Loading required package: carData\n", "\n", "\n", "Attaching package: 'car'\n", "\n", "\n", "The following object is masked from 'package:dplyr':\n", "\n", " recode\n", "\n", "\n", "The following object is masked from 'package:purrr':\n", "\n", " some\n", "\n", "\n", "\n", "Attaching package: 'rstatix'\n", "\n", "\n", "The following object is masked from 'package:MASS':\n", "\n", " select\n", "\n", "\n", "The following object is masked from 'package:stats':\n", "\n", " filter\n", "\n", "\n" ] } ], "source": [ "library(\"ggpubr\")\n", "library(readr)\n", "library(ggplot2)\n", "library(tidyverse)\n", "library(ARTool)\n", "library(emmeans)\n", "library(multcomp)\n", "library(car)\n", "library(rstatix)" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "vscode": { "languageId": "r" } }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "New names:\n", "* `` -> ...1\n", "\n", "\u001b[1mRows: \u001b[22m\u001b[34m59\u001b[39m \u001b[1mColumns: \u001b[22m\u001b[34m5\u001b[39m\n", "\u001b[36m--\u001b[39m \u001b[1mColumn specification\u001b[22m \u001b[36m--------------------------------------------------------\u001b[39m\n", "\u001b[1mDelimiter:\u001b[22m \",\"\n", "\u001b[32mdbl\u001b[39m (5): ...1, faiss_dpr, faiss_longformer, es_dpr, es_longformer\n", "\n", "\u001b[36mi\u001b[39m Use `spec()` to retrieve the full column specification for this data.\n", "\u001b[36mi\u001b[39m Specify the column types or set `show_col_types = FALSE` to quiet this message.\n" ] }, { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\n", "\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\n", "
\n" ], "text/latex": [ "A tibble: 6 × 4\n", "\\begin{tabular}{llll}\n", " question & retriever & reader & em\\\\\n", " & & & \\\\\n", "\\hline\n", "\t 0 & faiss & dpr & 0\\\\\n", "\t 0 & faiss & longformer & 0\\\\\n", "\t 0 & es & dpr & 0\\\\\n", "\t 0 & es & longformer & 0\\\\\n", "\t 1 & faiss & dpr & 0\\\\\n", "\t 1 & faiss & longformer & 0\\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "\n", "A tibble: 6 × 4\n", "\n", "| question <dbl> | retriever <fct> | reader <fct> | em <dbl> |\n", "|---|---|---|---|\n", "| 0 | faiss | dpr | 0 |\n", "| 0 | faiss | longformer | 0 |\n", "| 0 | es | dpr | 0 |\n", "| 0 | es | longformer | 0 |\n", "| 1 | faiss | dpr | 0 |\n", "| 1 | faiss | longformer | 0 |\n", "\n" ], "text/plain": [ " question retriever reader em\n", "1 0 faiss dpr 0 \n", "2 0 faiss longformer 0 \n", "3 0 es dpr 0 \n", "4 0 es longformer 0 \n", "5 1 faiss dpr 0 \n", "6 1 faiss longformer 0 " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "em_scores <- read_csv(\"em_scores.csv\") %>%\n", " rename(question = `...1`) %>%\n", " pivot_longer(!question, names_to=c(\"retriever\", \"reader\"), names_sep=\"_\", values_to=\"em\")\n", "\n", "em_scores$retriever <- as.factor(em_scores$retriever)\n", "em_scores$reader <- as.factor(em_scores$reader)\n", "\n", "head(em_scores)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To test which tests we can use, we need to check for normality. For this, we use a Shapiro-Wilk test of normality. In this case, results with FAISS as retriever or DPR had reader had zero exact matches, thus making it impossible to compute the Shapiro-Wilk test of normality. Nonetheless, we know that a distribution with all-identical values is not normally distributed. As you can see in the results below, all other $p$-values are lower than 0.001, so we reject the null-hypothesis of normality and now know that none of the f1-scores are normally distributed." ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "vscode": { "languageId": "r" } }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\n", "\n", "\t\n", "\n", "
\n" ], "text/latex": [ "A tibble: 1 × 3\n", "\\begin{tabular}{lll}\n", " retriever & sw.stat & sw.p\\\\\n", " & & \\\\\n", "\\hline\n", "\t es & 0.2503666 & 6.788451e-22\\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "\n", "A tibble: 1 × 3\n", "\n", "| retriever <fct> | sw.stat <dbl> | sw.p <dbl> |\n", "|---|---|---|\n", "| es | 0.2503666 | 6.788451e-22 |\n", "\n" ], "text/plain": [ " retriever sw.stat sw.p \n", "1 es 0.2503666 6.788451e-22" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\n", "\n", "\t\n", "\n", "
\n" ], "text/latex": [ "A tibble: 1 × 3\n", "\\begin{tabular}{lll}\n", " reader & sw.stat & sw.p\\\\\n", " & & \\\\\n", "\\hline\n", "\t longformer & 0.2503666 & 6.788451e-22\\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "\n", "A tibble: 1 × 3\n", "\n", "| reader <fct> | sw.stat <dbl> | sw.p <dbl> |\n", "|---|---|---|\n", "| longformer | 0.2503666 | 6.788451e-22 |\n", "\n" ], "text/plain": [ " reader sw.stat sw.p \n", "1 longformer 0.2503666 6.788451e-22" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "em_scores %>%\n", " select(!question) %>%\n", " group_by(retriever) %>%\n", " filter(sum(em) > 0) %>%\n", " summarise(sw.stat = shapiro.test(em)$statistic,\n", " sw.p = shapiro.test(em)$p)\n", "em_scores %>%\n", " select(!question) %>%\n", " group_by(reader) %>%\n", " filter(sum(em) > 0) %>%\n", " summarise(sw.stat = shapiro.test(em)$statistic,\n", " sw.p = shapiro.test(em)$p)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since our data is not normally distributed, we cannot use an ANOVA to compare our results. Therefore, we use an aligned-rank test, which is a non-parameteric version of a factorial repeated measures ANOVA." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "vscode": { "languageId": "r" } }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\n", "\n", "\t\n", "\t\n", "\t\n", "\n", "
\n" ], "text/latex": [ "A anova.art: 3 × 7\n", "\\begin{tabular}{r|lllllll}\n", " & Term & Df & Df.res & Sum Sq & Sum Sq.res & F value & Pr(>F)\\\\\n", " & & & & & & & \\\\\n", "\\hline\n", "\tretriever & retriever & 1 & 232 & 11564 & 263081 & 10.1978 & 0.001600976\\\\\n", "\treader & reader & 1 & 232 & 11564 & 263081 & 10.1978 & 0.001600976\\\\\n", "\tretriever:reader & retriever:reader & 1 & 232 & 11564 & 263081 & 10.1978 & 0.001600976\\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "\n", "A anova.art: 3 × 7\n", "\n", "| | Term <chr> | Df <dbl> | Df.res <dbl> | Sum Sq <dbl> | Sum Sq.res <dbl> | F value <dbl> | Pr(>F) <dbl> |\n", "|---|---|---|---|---|---|---|---|\n", "| retriever | retriever | 1 | 232 | 11564 | 263081 | 10.1978 | 0.001600976 |\n", "| reader | reader | 1 | 232 | 11564 | 263081 | 10.1978 | 0.001600976 |\n", "| retriever:reader | retriever:reader | 1 | 232 | 11564 | 263081 | 10.1978 | 0.001600976 |\n", "\n" ], "text/plain": [ " Term Df Df.res Sum Sq Sum Sq.res F value\n", "retriever retriever 1 232 11564 263081 10.1978\n", "reader reader 1 232 11564 263081 10.1978\n", "retriever:reader retriever:reader 1 232 11564 263081 10.1978\n", " Pr(>F) \n", "retriever 0.001600976\n", "reader 0.001600976\n", "retriever:reader 0.001600976" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stderr", "output_type": "stream", "text": [ "NOTE: Results may be misleading due to involvement in interactions\n", "\n" ] }, { "data": { "text/plain": [ " contrast estimate SE df t.ratio p.value\n", " es - faiss 14 4.38 232 3.193 0.0016\n", "\n", "Results are averaged over the levels of: reader " ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stderr", "output_type": "stream", "text": [ "NOTE: Results may be misleading due to involvement in interactions\n", "\n" ] }, { "data": { "text/plain": [ " contrast estimate SE df t.ratio p.value\n", " dpr - longformer -14 4.38 232 -3.193 0.0016\n", "\n", "Results are averaged over the levels of: retriever " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "model.acc <- art(f1 ~ retriever * reader, data = em_scores)\n", "anova(model.acc)\n", "art.con(model.acc, ~ retriever)\n", "art.con(model.acc, ~ reader)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "From these results, we can see that both the retriever and the reader have a significant effect on the F1 score ($F = 58.63$ and $F = 16.23$ respectively, $p < 0.0001$ for both). However, there is also an interaction between the retriever and reader ($F = 43.53$, $p < 0.0001$). The post-hoc analysis of contrasts shows that ElasticSearch performs better than FAISS ($p < 0.0001$) and Longformer performs better than DPR ($p = 0.0001$)." ] } ], "metadata": { "kernelspec": { "display_name": "R", "language": "R", "name": "ir" }, "language_info": { "codemirror_mode": "r", "file_extension": ".r", "mimetype": "text/x-r-source", "name": "R", "pygments_lexer": "r", "version": "4.1.2" }, "orig_nbformat": 4 }, "nbformat": 4, "nbformat_minor": 2 }