Sparse Autoencoders for Scientific Paper Embeddings

This repository contains a collection of Sparse Autoencoders (SAEs) trained on embeddings of scientific paper abstracts from two arXiv categories: Machine Learning (cs.LG) and Astrophysics (astro.PH). These SAEs are designed to disentangle semantic concepts in dense embeddings while maintaining semantic fidelity.

Model Description

Overview

The SAEs in this repository are trained on embeddings of scientific paper abstracts from arXiv, specifically from the cs.LG (Computer Science - Machine Learning) and astro.PH (Astrophysics) categories. They are designed to extract interpretable features from dense text embeddings derived from large language models.

Model Architecture

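Each SAE uses a top-k architecture: for each input, only the k largest latent activations are kept and the rest are set to zero. Two hyperparameters vary across the released models: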

k: number of active latents (16, 32, 64, or 128)
n: total number of latents (3072, 4608, 6144, 9216, or 12288)
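
To make the roles of k and n concrete, here is a minimal NumPy sketch of a top-k forward pass. It is an illustration under stated assumptions, not the code used to train these models: the function name, the weight layout, and the choice to subtract the decoder bias before encoding are all hypothetical, and the toy sizes are far smaller than the released configurations.

```python
import numpy as np


def topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k):
    """Encode, keep the k largest pre-activations per row, and decode.

    x: (batch, d) embeddings; W_enc: (d, n); W_dec: (n, d).
    All names and shapes here are assumptions made for this sketch.
    """
    # Pre-activations for all n latents. Centring on the decoder bias is a
    # common convention for SAEs, assumed here rather than taken from the repo.
    pre = (x - b_dec) @ W_enc + b_enc

    # Column indices of the k largest pre-activations in each row.
    top_idx = np.argpartition(pre, -k, axis=1)[:, -k:]

    # Keep only those k entries (passed through a ReLU); all others stay zero.
    latents = np.zeros_like(pre)
    rows = np.arange(pre.shape[0])[:, None]
    latents[rows, top_idx] = np.maximum(pre[rows, top_idx], 0.0)

    # Linear reconstruction of the embedding from the sparse code.
    x_hat = latents @ W_dec + b_dec
    return latents, x_hat


rng = np.random.default_rng(0)
d, n, k = 8, 32, 4  # toy sizes for the demo only
x = rng.normal(size=(5, d))
W_enc = rng.normal(size=(d, n)) / np.sqrt(d)
W_dec = W_enc.T.copy()  # tied weights, purely for the demo
b_enc = np.zeros(n)
b_dec = np.zeros(d)

latents, x_hat = topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k)
print((latents != 0).sum(axis=1))  # number of active latents per input, never more than k
```

Sparsity here comes directly from the top-k selection, so k fixes an upper bound on the number of active latents per input, while n sets the size of the feature dictionary.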

The naming convention for the models is: {domain}_{k}_{n}_{batch_size}.pth

For example, csLG_128_3072_256.pth represents an SAE trained on cs.LG data with k=128, n=3072, and a batch size of 256.
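
When working with many checkpoints, it can help to recover the hyperparameters from the file name. The helper below is a small sketch that assumes every file follows the convention above exactly; `SAEConfig` and `parse_sae_filename` are hypothetical names, not part of this repository.

```python
import re
from typing import NamedTuple


class SAEConfig(NamedTuple):
    """Hyperparameters encoded in a checkpoint file name (hypothetical helper type)."""
    domain: str
    k: int
    n: int
    batch_size: int


# Matches {domain}_{k}_{n}_{batch_size}.pth, e.g. csLG_128_3072_256.pth
_PATTERN = re.compile(
    r"^(?P<domain>[^_]+)_(?P<k>\d+)_(?P<n>\d+)_(?P<batch_size>\d+)\.pth$"
)


def parse_sae_filename(filename: str) -> SAEConfig:
    """Split a checkpoint file name into its domain, k, n, and batch size."""
    match = _PATTERN.match(filename)
    if match is None:
        raise ValueError(f"Unexpected checkpoint name: {filename!r}")
    return SAEConfig(
        domain=match["domain"],
        k=int(match["k"]),
        n=int(match["n"]),
        batch_size=int(match["batch_size"]),
    )


config = parse_sae_filename("csLG_128_3072_256.pth")
print(config)
```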

Intended Uses & Limitations

These SAEs are primarily intended for:

Extracting interpretable features from dense embeddings of scientific texts
Enabling fine-grained control over semantic search in scientific literature
Studying the structure of semantic spaces in specific scientific domains

Limitations:

The models are domain-specific (cs.LG and astro.PH) and may not generalize well to other domains
Performance may vary depending on the quality and domain-specificity of the input embeddings

Training Data

The SAEs were trained on embeddings of abstracts from:

cs.LG: 153,000 papers
astro.PH: 272,000 papers

Training Procedure

The SAEs were trained using a custom loss function combining reconstruction loss, sparsity constraints, and an auxiliary loss. For detailed training procedures, please refer to our paper (link to be added upon publication).

Evaluation Results

Performance metrics for various configurations:

k	n	Domain	MSE	Log FD	Act Mean
16	3072	astro.PH	0.2264	-2.7204	0.1264
16	3072	cs.LG	0.2284	-2.7314	0.1332
64	9216	astro.PH	0.1182	-2.4682	0.0539
64	9216	cs.LG	0.1240	-2.3536	0.0545
128	12288	astro.PH	0.0936	-2.7025	0.0399
128	12288	cs.LG	0.0942	-2.0858	0.0342

MSE: Normalised Mean Squared Error
Log FD: Mean log density of feature activations
Act Mean: Mean activation value across non-zero features
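
The three metrics can be computed along the following lines. This is a sketch of one plausible reading of the definitions above, not the evaluation code behind the table: normalising the MSE by the error of always predicting the mean embedding, and using a base-10 logarithm for feature density, are both assumptions, and the function names are hypothetical.

```python
import numpy as np


def normalised_mse(x, x_hat):
    """MSE divided by the MSE of always predicting the mean embedding (assumed normalisation)."""
    baseline = np.mean((x - x.mean(axis=0)) ** 2)
    return np.mean((x - x_hat) ** 2) / baseline


def mean_log_feature_density(latents, eps=1e-10):
    """Mean over features of log10(fraction of inputs on which the feature is non-zero)."""
    density = (latents != 0).mean(axis=0)
    # eps keeps the log finite for features that never activate.
    return np.mean(np.log10(density + eps))


def mean_active_value(latents):
    """Mean activation over non-zero entries only."""
    active = latents[latents != 0]
    return active.mean() if active.size else 0.0


# Tiny hand-made example: 4 inputs, 2 latents.
x = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]])
latents = np.array([[0.5, 0.0], [0.0, 0.0], [1.5, 0.0], [0.0, 2.0]])

print(normalised_mse(x, x))               # perfect reconstruction gives 0
print(mean_log_feature_density(latents))  # densities are 0.5 and 0.25
print(mean_active_value(latents))         # mean of 0.5, 1.5 and 2.0
```

Under this normalisation, a value of 1.0 means the reconstruction is no better than predicting the mean embedding, and 0.0 means it is exact.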

For full results, please refer to our paper (link to be added upon publication).

Ethical Considerations

While these models are designed to improve interpretability, users should be aware that:

The extracted features may reflect biases present in the scientific literature used for training
Interpretations of the features should be validated carefully, especially when used for decision-making processes

Citation

If you use these models in your research, please cite our paper (citation to be added upon publication).

Additional Information

For more details on the methodology, feature families, and applications in semantic search, please refer to our full paper (link to be added upon publication).
