
This model is an instance of RoBERTa-Base fine-tuned to classify course records from student postsecondary administrative transcripts into the National Center for Education Statistics' 2010 College Course Map (CCM).

The College Course Map is a hierarchical taxonomy of course content that roughly aligns with the Classification of Instructional Programs (CIP) codes commonly used in the United States.

The College Course Map was developed for use with longitudinal surveys including the High School Longitudinal Study of 2009 (HSLS 2009), Baccalaureate and Beyond Longitudinal Study of 2008-2012 (B&B 2008), Beginning Postsecondary Students Longitudinal Study of 2004-2009 (BPS 2004), and Beginning Postsecondary Students Longitudinal Study of 2012-2017 (BPS 2012).

Administrative transcripts for all survey participants were collected along with each survey, and each course enrollment in the transcripts was labelled with the appropriate six-digit CCM code by human annotators. More information about the development of the CCM and the annotation process is available here:

      Bryan, M. & Simone, S. (2012). 2010 College Course Map Technical Report. National Center for Education Statistics. https://nces.ed.gov/pubs2012/2012162rev.pdf.

This RoBERTa model is fine-tuned to classify course records into the appropriate six-digit CCM code. Within the CCM hierarchy, for example, the two-digit prefix 45 represents Social Science courses and the prefix 38.01 falls under Philosophy and Religion courses. The model was fine-tuned on 802,190 unique course sections from the four surveys referenced above.
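Because the CCM is hierarchical, a six-digit code can be truncated to its coarser levels. The sketch below assumes CCM codes follow the CIP-style "XX.XXXX" layout, which the "38.01" example above suggests but this card does not state outright. The function name `ccm_levels` is hypothetical, and the code "38.0101" is used only to illustrate the format, not as a verified CCM entry.

```python
import re

# Assumed layout: two digits, a dot, then four digits (CIP-style "XX.XXXX").
_SIX_DIGIT_RE = re.compile(r"(\d{2})\.(\d{2})(\d{2})")


def ccm_levels(code: str) -> dict:
    """Split a six-digit code into its two-, four-, and six-digit levels."""
    cleaned = code.strip()
    match = _SIX_DIGIT_RE.fullmatch(cleaned)
    if match is None:
        raise ValueError(f"expected a code shaped like 'XX.XXXX', got {code!r}")
    series, group, _detail = match.groups()
    return {
        "two_digit": series,
        "four_digit": f"{series}.{group}",
        "six_digit": cleaned,
    }


# Illustrative code only; the trailing digits are made up for the example.
levels = ccm_levels("38.0101")
print(levels)
```

Truncating predictions this way gives one possible route for comparing six-digit output against coarser two-digit or four-digit labels.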

More information about the fine-tuning process is available here:

      Paulson, A., Stange, K., & Flaster, A. (2024). Classifying Courses at Scale: A Text as Data Approach to Characterizing Student Course-Taking Trends with Administrative Transcripts. Working paper.

The model was fine-tuned on inputs formatted as "{SUBJECT CODE} {CATALOG NUMBER} --- {COURSE TITLE}". For example, for a course offered in an economics department with subject code "ECON", course number "101", and course title "Principles of Microeconomics", the model expects the following string: "ECON 101 --- Principles of Microeconomics". This Colab Notebook provides a short vignette applying the model.
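The input format described above can be sketched in Python as follows. The helper names `format_course` and `classify` are hypothetical. The `classify` function assumes the repository loads as a standard sequence-classification checkpoint through the Transformers `pipeline` API and that it is available for download; neither is confirmed by this card, so treat it as a starting point and not as the authors' workflow.

```python
MODEL_ID = "annamp/classifying-courses-at-scale-six-digit-roberta-base"


def format_course(subject_code: str, catalog_number: str, course_title: str) -> str:
    """Build the "{SUBJECT CODE} {CATALOG NUMBER} --- {COURSE TITLE}" input string."""
    return f"{subject_code} {catalog_number} --- {course_title}"


def classify(texts, model_id: str = MODEL_ID):
    """Run the classifier on formatted strings.

    Assumption: the checkpoint works with the generic text-classification
    pipeline. The import is deferred so the formatting helper can be used
    without Transformers installed.
    """
    from transformers import pipeline

    classifier = pipeline("text-classification", model=model_id)
    return classifier(texts)


example = format_course("ECON", "101", "Principles of Microeconomics")
print(example)  # ECON 101 --- Principles of Microeconomics
```

Calling `classify([example])` would then return one predicted label and score per input string, provided the assumptions above hold.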

Six-Digit Prediction Accuracy on Course Sections: 0.65
Six-Digit Prediction Accuracy on Enrollment Weighted Course Sections: 0.75
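The two figures above differ only in how course sections are weighted. This card does not define the weighting, so the sketch below assumes that "enrollment weighted" means each section counts in proportion to its number of enrolled students. The function name `accuracy`, the placeholder labels, and the enrollment counts are all invented for illustration.

```python
def accuracy(y_true, y_pred, weights=None):
    """Share of correct predictions, optionally weighted per section."""
    if weights is None:
        weights = [1] * len(y_true)  # unweighted: every section counts once
    if not (len(y_true) == len(y_pred) == len(weights)):
        raise ValueError("y_true, y_pred, and weights must have equal length")
    total = sum(weights)
    if total == 0:
        raise ValueError("weights must sum to a positive number")
    correct = sum(w for t, p, w in zip(y_true, y_pred, weights) if t == p)
    return correct / total


# Toy data: placeholder labels, not real CCM codes.
true_codes = ["A", "B", "C", "D"]
pred_codes = ["A", "B", "X", "Y"]
enrollments = [300, 100, 50, 50]  # hypothetical students per section

section_acc = accuracy(true_codes, pred_codes)
weighted_acc = accuracy(true_codes, pred_codes, enrollments)
print(section_acc, weighted_acc)  # 0.5 0.8
```

In this toy case the weighted figure is higher because the correctly classified sections happen to enroll the most students, which is one way a weighted accuracy can exceed the per-section accuracy.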
