{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# F1 Scores" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "vscode": { "languageId": "r" } }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Loading required package: ggplot2\n", "\n", "-- \u001b[1mAttaching packages\u001b[22m --------------------------------------- tidyverse 1.3.1 --\n", "\n", "\u001b[32mv\u001b[39m \u001b[34mtibble \u001b[39m 3.1.5 \u001b[32mv\u001b[39m \u001b[34mdplyr \u001b[39m 1.0.7\n", "\u001b[32mv\u001b[39m \u001b[34mtidyr \u001b[39m 1.1.4 \u001b[32mv\u001b[39m \u001b[34mstringr\u001b[39m 1.4.0\n", "\u001b[32mv\u001b[39m \u001b[34mpurrr \u001b[39m 0.3.4 \u001b[32mv\u001b[39m \u001b[34mforcats\u001b[39m 0.5.1\n", "\n", "-- \u001b[1mConflicts\u001b[22m ------------------------------------------ tidyverse_conflicts() --\n", "\u001b[31mx\u001b[39m \u001b[34mdplyr\u001b[39m::\u001b[32mfilter()\u001b[39m masks \u001b[34mstats\u001b[39m::filter()\n", "\u001b[31mx\u001b[39m \u001b[34mdplyr\u001b[39m::\u001b[32mlag()\u001b[39m masks \u001b[34mstats\u001b[39m::lag()\n", "\n", "Loading required package: mvtnorm\n", "\n", "Loading required package: survival\n", "\n", "Loading required package: TH.data\n", "\n", "Loading required package: MASS\n", "\n", "\n", "Attaching package: 'MASS'\n", "\n", "\n", "The following object is masked from 'package:dplyr':\n", "\n", " select\n", "\n", "\n", "\n", "Attaching package: 'TH.data'\n", "\n", "\n", "The following object is masked from 'package:MASS':\n", "\n", " geyser\n", "\n", "\n", "Loading required package: carData\n", "\n", "\n", "Attaching package: 'car'\n", "\n", "\n", "The following object is masked from 'package:dplyr':\n", "\n", " recode\n", "\n", "\n", "The following object is masked from 'package:purrr':\n", "\n", " some\n", "\n", "\n", "\n", "Attaching package: 'rstatix'\n", "\n", "\n", "The following object is masked from 'package:MASS':\n", "\n", " select\n", "\n", "\n", "The following 
object is masked from 'package:stats':\n", "\n", " filter\n", "\n", "\n" ] } ], "source": [ "library(\"ggpubr\")\n", "library(readr)\n", "library(ggplot2)\n", "library(tidyverse)\n", "library(ARTool)\n", "library(emmeans)\n", "library(multcomp)\n", "library(car)\n", "library(rstatix)" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "vscode": { "languageId": "r" } }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "New names:\n", "* `` -> ...1\n", "\n", "\u001b[1mRows: \u001b[22m\u001b[34m59\u001b[39m \u001b[1mColumns: \u001b[22m\u001b[34m5\u001b[39m\n", "\u001b[36m--\u001b[39m \u001b[1mColumn specification\u001b[22m \u001b[36m--------------------------------------------------------\u001b[39m\n", "\u001b[1mDelimiter:\u001b[22m \",\"\n", "\u001b[32mdbl\u001b[39m (5): ...1, faiss_dpr, faiss_longformer, es_dpr, es_longformer\n", "\n", "\u001b[36mi\u001b[39m Use `spec()` to retrieve the full column specification for this data.\n", "\u001b[36mi\u001b[39m Specify the column types or set `show_col_types = FALSE` to quiet this message.\n" ] }, { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\n", "\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\n", "
\n" ], "text/latex": [ "A tibble: 6 × 4\n", "\\begin{tabular}{llll}\n", " question & retriever & reader & em\\\\\n", " & & & \\\\\n", "\\hline\n", "\t 0 & faiss & dpr & 0\\\\\n", "\t 0 & faiss & longformer & 0\\\\\n", "\t 0 & es & dpr & 0\\\\\n", "\t 0 & es & longformer & 0\\\\\n", "\t 1 & faiss & dpr & 0\\\\\n", "\t 1 & faiss & longformer & 0\\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "\n", "A tibble: 6 × 4\n", "\n", "| question <dbl> | retriever <fct> | reader <fct> | em <dbl> |\n", "|---|---|---|---|\n", "| 0 | faiss | dpr | 0 |\n", "| 0 | faiss | longformer | 0 |\n", "| 0 | es | dpr | 0 |\n", "| 0 | es | longformer | 0 |\n", "| 1 | faiss | dpr | 0 |\n", "| 1 | faiss | longformer | 0 |\n", "\n" ], "text/plain": [ " question retriever reader em\n", "1 0 faiss dpr 0 \n", "2 0 faiss longformer 0 \n", "3 0 es dpr 0 \n", "4 0 es longformer 0 \n", "5 1 faiss dpr 0 \n", "6 1 faiss longformer 0 " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "em_scores <- read_csv(\"em_scores.csv\") %>%\n", " rename(question = `...1`) %>%\n", " pivot_longer(!question, names_to=c(\"retriever\", \"reader\"), names_sep=\"_\", values_to=\"em\")\n", "\n", "em_scores$retriever <- as.factor(em_scores$retriever)\n", "em_scores$reader <- as.factor(em_scores$reader)\n", "\n", "head(em_scores)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To decide which statistical tests we can use, we first check for normality with a Shapiro-Wilk test. In this case, results with FAISS as retriever or DPR as reader had zero exact matches, which makes it impossible to compute the Shapiro-Wilk test for those groups. Nonetheless, we know that a distribution with all-identical values is not normally distributed. As the results below show, all remaining $p$-values are lower than 0.001, so we reject the null hypothesis of normality and conclude that none of the exact-match scores are normally distributed." 
] }, { "cell_type": "code", "execution_count": 14, "metadata": { "vscode": { "languageId": "r" } }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\n", "\n", "\t\n", "\n", "
\n" ], "text/latex": [ "A tibble: 1 × 3\n", "\\begin{tabular}{lll}\n", " retriever & sw.stat & sw.p\\\\\n", " & & \\\\\n", "\\hline\n", "\t es & 0.2503666 & 6.788451e-22\\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "\n", "A tibble: 1 × 3\n", "\n", "| retriever <fct> | sw.stat <dbl> | sw.p <dbl> |\n", "|---|---|---|\n", "| es | 0.2503666 | 6.788451e-22 |\n", "\n" ], "text/plain": [ " retriever sw.stat sw.p \n", "1 es 0.2503666 6.788451e-22" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\n", "\n", "\t\n", "\n", "
\n" ], "text/latex": [ "A tibble: 1 × 3\n", "\\begin{tabular}{lll}\n", " reader & sw.stat & sw.p\\\\\n", " & & \\\\\n", "\\hline\n", "\t longformer & 0.2503666 & 6.788451e-22\\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "\n", "A tibble: 1 × 3\n", "\n", "| reader <fct> | sw.stat <dbl> | sw.p <dbl> |\n", "|---|---|---|\n", "| longformer | 0.2503666 | 6.788451e-22 |\n", "\n" ], "text/plain": [ " reader sw.stat sw.p \n", "1 longformer 0.2503666 6.788451e-22" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Shapiro-Wilk test per retriever; all-zero groups are dropped because the test is undefined for constant data\n", "em_scores %>%\n", " select(!question) %>%\n", " group_by(retriever) %>%\n", " filter(sum(em) > 0) %>%\n", " summarise(sw.stat = shapiro.test(em)$statistic,\n", " sw.p = shapiro.test(em)$p.value)\n", "# Same test per reader\n", "em_scores %>%\n", " select(!question) %>%\n", " group_by(reader) %>%\n", " filter(sum(em) > 0) %>%\n", " summarise(sw.stat = shapiro.test(em)$statistic,\n", " sw.p = shapiro.test(em)$p.value)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since our data is not normally distributed, we cannot use an ANOVA to compare our results. Therefore, we use an aligned rank transform (ART) test, which is a non-parametric counterpart to a factorial repeated measures ANOVA." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "vscode": { "languageId": "r" } }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\n", "\n", "\t\n", "\t\n", "\t\n", "\n", "
