Perform SQL operations
Performing SQL operations with DuckDB opens up a world of possibilities for querying datasets efficiently. Letβs dive into some examples showcasing the power of DuckDB functions.
For our demonstration, weβll explore a fascinating dataset. The MMLU dataset is a multitask test containing multiple-choice questions spanning various knowledge domains.
To preview the dataset, letβs select a sample of 3 rows:
FROM 'hf://datasets/cais/mmlu/all/test-*.parquet' USING SAMPLE 3;
ββββββββββββββββββββββββ¬βββββββββββββββββββββββ¬βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ¬βββββββββ
β question β subject β choices β answer β
β varchar β varchar β varchar[] β int64 β
ββββββββββββββββββββββββΌβββββββββββββββββββββββΌβββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββΌβββββββββ€
β The model of lightβ¦ β conceptual_physics β [wave model, particle model, Both of these, Neither of these] β 1 β
β A person who is loβ¦ β professional_psychβ¦ β [his/her life scripts., his/her own feelings, attitudes, and beliefs., the emotional reactions and behaviors of the people he/she is interacting with.β¦ β 1 β
β The thermic effectβ¦ β nutrition β [is substantially higher for carbohydrate than for protein, is accompanied by a slight decrease in body core temperature., is partly related to sympatβ¦ β 2 β
ββββββββββββββββββββββββ΄βββββββββββββββββββββββ΄βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ΄βββββββββ
This command retrieves a random sample of 3 rows from the dataset for us to examine.
Letβs start by examining the schema of our dataset. The following table outlines the structure of our dataset:
DESCRIBE FROM 'hf://datasets/cais/mmlu/all/test-*.parquet' USING SAMPLE 3;
βββββββββββββββ¬ββββββββββββββ¬ββββββββββ¬ββββββββββ¬ββββββββββ¬ββββββββββ
β column_name β column_type β null β key β default β extra β
β varchar β varchar β varchar β varchar β varchar β varchar β
βββββββββββββββΌββββββββββββββΌββββββββββΌββββββββββΌββββββββββΌββββββββββ€
β question β VARCHAR β YES β β β β
β subject β VARCHAR β YES β β β β
β choices β VARCHAR[] β YES β β β β
β answer β BIGINT β YES β β β β
βββββββββββββββ΄ββββββββββββββ΄ββββββββββ΄ββββββββββ΄ββββββββββ΄ββββββββββ
Next, letβs analyze if there are any duplicated records in our dataset:
SELECT *,
COUNT(*) AS counts
FROM 'hf://datasets/cais/mmlu/all/test-*.parquet'
GROUP BY ALL
HAVING counts > 2;
ββββββββββββ¬ββββββββββ¬ββββββββββββ¬βββββββββ¬βββββββββ
β question β subject β choices β answer β counts β
β varchar β varchar β varchar[] β int64 β int64 β
ββββββββββββ΄ββββββββββ΄ββββββββββββ΄βββββββββ΄βββββββββ€
β 0 rows β
ββββββββββββββββββββββββββββββββββββββββββββββββββββ
Fortunately, our dataset doesnβt contain any duplicate records.
Letβs see the proportion of questions based on the subject in a bar representation:
SELECT
subject,
COUNT(*) AS counts,
BAR(COUNT(*), 0, (SELECT COUNT(*) FROM 'hf://datasets/cais/mmlu/all/test-*.parquet')) AS percentage
FROM
'hf://datasets/cais/mmlu/all/test-*.parquet'
GROUP BY
subject
ORDER BY
counts DESC;
ββββββββββββββββββββββββββββββββ¬βββββββββ¬βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β subject β counts β percentage β
β varchar β int64 β varchar β
ββββββββββββββββββββββββββββββββΌβββββββββΌβββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β professional_law β 1534 β βββββββββ β
β moral_scenarios β 895 β βββββ β
β miscellaneous β 783 β βββββ β
β professional_psychology β 612 β ββββ β
β high_school_psychology β 545 β βββ β
β high_school_macroeconomics β 390 β βββ β
β elementary_mathematics β 378 β βββ β
β moral_disputes β 346 β ββ β
ββββββββββββββββββββββββββββββββ΄βββββββββ΄βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β 57 rows (8 shown) 3 columns β
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Now, letβs prepare a subset of the dataset containing questions related to nutrition and create a mapping of questions to correct answers. Notice that we have the column choices from which we can get the correct answer using the answer column as an index.
SELECT *
FROM 'hf://datasets/cais/mmlu/all/test-*.parquet'
WHERE subject = 'nutrition' LIMIT 3;
ββββββββββββββββββββββββ¬ββββββββββββ¬ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ¬βββββββββ
β question β subject β choices β answer β
β varchar β varchar β varchar[] β int64 β
ββββββββββββββββββββββββΌββββββββββββΌββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββΌβββββββββ€
β Which foods tend tβ¦ β nutrition β [Meat, Confectionary, Fruits and vegetables, Potatoes] β 2 β
β In which one of thβ¦ β nutrition β [If the incidence rate of the disease falls., If survival time with the disease increases., If recovery of the disease is faster., If the population in which theβ¦ β 1 β
β Which of the folloβ¦ β nutrition β [The flavonoid class comprises flavonoids and isoflavonoids., The digestibility and bioavailability of isoflavones in soya food products are not changed by proceβ¦ β 0 β
ββββββββββββββββββββββββ΄ββββββββββββ΄ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ΄βββββββββ
SELECT question,
choices[answer] AS correct_answer
FROM 'hf://datasets/cais/mmlu/all/test-*.parquet'
WHERE subject = 'nutrition' LIMIT 3;
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ¬ββββββββββββββββββββββββββββββββββββββββββββββ
β question β correct_answer β
β varchar β varchar β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββΌββββββββββββββββββββββββββββββββββββββββββββββ€
β Which foods tend to be consumed in lower quantities in Wales and Scotland (as of 2020)?\n β Confectionary β
β In which one of the following circumstances will the prevalence of a disease in the population increase, all else being constant?\n β If the incidence rate of the disease falls. β
β Which of the following statements is correct?\n β β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ΄ββββββββββββββββββββββββββββββββββββββββββββββ
To ensure data cleanliness, letβs remove any newline characters at the end of the questions and filter out any empty answers:
SELECT regexp_replace(question, '\n', '') AS question,
choices[answer] AS correct_answer
FROM 'hf://datasets/cais/mmlu/all/test-*.parquet'
WHERE subject = 'nutrition' AND LENGTH(correct_answer) > 0 LIMIT 3;
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ¬ββββββββββββββββββββββββββββββββββββββββββββββ
β question β correct_answer β
β varchar β varchar β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββΌββββββββββββββββββββββββββββββββββββββββββββββ€
β Which foods tend to be consumed in lower quantities in Wales and Scotland (as of 2020)? β Confectionary β
β In which one of the following circumstances will the prevalence of a disease in the population increase, all else being constant? β If the incidence rate of the disease falls. β
β Which vitamin is a major lipid-soluble antioxidant in cell membranes? β Vitamin D β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ΄ββββββββββββββββββββββββββββββββββββββββββββββ
Finally, lets highlight some of the DuckDB functions used in this section:
DESCRIBE
, returns the table schema.USING SAMPLE
, samples are used to randomly select a subset of a dataset.BAR
, draws a band whose width is proportional to (x - min) and equal to width characters when x = max. Width defaults to 80.string[begin:end]
, extracts a string using slice conventions. Missing begin or end arguments are interpreted as the beginning or end of the list respectively. Negative values are accepted.regexp_replace
, if the string contains the regexp pattern, replaces the matching part with replacement.LENGTH
, gets the number of characters in the string.
There are plenty of useful functions available in DuckDBβs SQL functions overview. The best part is that you can use them directly on Hugging Face datasets.