---
license: mit
datasets:
- ankitagr01/dynamic_topic_modeling_arxiv_abstracts
- knkarthick/topicsum
- nuvocare/MSD_manual_topics_user_base
language:
- en
metrics:
- mse
base_model:
- thesephist/contra-bottleneck-t5-large-wikipedia
pipeline_tag: summarization
tags:
- topic-extraction
- topic-summarization
- dynamic-topic-modeling
---

# Contra-Topic-bottleneck-t5-large: Linear Topic Extraction using Bottleneck T5

A lightweight approach to topic extraction that leverages the Bottleneck T5 autoencoder architecture with learned transformation matrices. This project provides three specialized transformation matrices for mapping content embeddings to topic embeddings across different domains.

[**Check out the blog**](https://amanpriyanshu.github.io/blogs/posts/2024/contra-topic/)

**TL;DR:** Transform content embeddings into topic embeddings using domain-specific 1024×1024 transformation matrices, trained on three distinct datasets. Built on top of the Bottleneck T5 architecture for efficient topic extraction that requires no fine-tuning of the base model.

## Motivation

Large Language Models (LLMs) have become the go-to solution for many NLP tasks, including topic extraction and classification. However, they come with significant overhead:

- High computational requirements
- Large memory footprint
- Considerable inference latency
- Complex deployment needs
- Limited to pre-specified classes

This project offers a lightweight alternative specifically for topic extraction by leveraging the semantic structure of the Bottleneck T5's latent space.
Instead of training a new model or fine-tuning existing ones, we learn a simple linear transformation between content and topic embeddings, providing:

- Fast inference (milliseconds)
- Minimal memory footprint (single 1024×1024 matrix per domain)
- Simple deployment (basic matrix multiplication)
- No need for GPU at inference time
- Generative in nature (topics are decoded as free text from the topic embedding)

## Architecture Overview

### Base Model

- Uses Bottleneck T5 Large ([thesephist/contra-bottleneck-t5-large-wikipedia](https://huggingface.co/thesephist/contra-bottleneck-t5-large-wikipedia))
- Fixed embedding dimension: 1024
- Pre-trained on Wikipedia data
- Autoencoder architecture with attention pooling

### Transformation Layers

- Three domain-specific transformation matrices (1024×1024 each)
- Linear mapping from content to topic space
- Learned using simple Mean Squared Error optimization
- Total additional parameters: ~1M per domain (1024 × 1024 = 1,048,576), ~3M across all three domains

## Datasets and Performance Metrics

### 1. ArXiv Abstracts Dataset ([ankitagr01/dynamic_topic_modeling_arxiv_abstracts](https://huggingface.co/datasets/ankitagr01/dynamic_topic_modeling_arxiv_abstracts))

Scientific paper abstracts paired with their research topics, providing a test bed for academic content classification.

**Performance Metrics:**

- Training MSE: 0.00225 (error on samples used to learn the transformation)
- Testing MSE: 0.00268 (error on the held-out test set)
- Inter-topic MSE: 0.00620 (minimum distance between different topic embeddings)

**Use Cases:**

- Automated paper categorization
- Research trend analysis
- Academic content recommendation

### 2. TopicSUM Dataset ([knkarthick/topicsum](https://huggingface.co/datasets/knkarthick/topicsum))

241,171 dialogue samples with human-annotated topic labels, ideal for conversational content analysis.

**Performance Metrics:**

- Training MSE: 0.00252
- Testing MSE: 0.00255
- Inter-topic MSE: 0.00737

**Use Cases:**

- Meeting summarization
- Customer service dialogue categorization
- Chat log analysis

### 3. MSD Manual Topics ([nuvocare/MSD_manual_topics_user_base](https://huggingface.co/datasets/nuvocare/MSD_manual_topics_user_base))

Medical content from the MSD (Merck) Manual, featuring both professional and patient-oriented content.

**Performance Metrics:**

- Training MSE: 0.00174
- Testing MSE: 0.00197
- Inter-topic MSE: 0.00566

**Use Cases:**

- Medical document classification
- Healthcare content organization
- Patient information routing

## Understanding the Metrics

### Computational Requirements

| Resource | Requirement | Notes |
|----------|-------------|-------|
| Storage | ~9MB per matrix (as reported) | 1024×1024 values; once cast to float32, a matrix occupies 1024 × 1024 × 4 bytes ≈ 4.2 MB in memory |
| Memory | ~27MB total (as reported) | All three domain matrices; ≈ 12.6 MB in memory as float32 |
| Inference Time | ~10ms | On CPU, per text sample (matrix multiplication step) |
| Training Hardware | P100 GPU | Free tier on Kaggle |
| Training Time | ~4 hours total | Mostly embedding generation |
| Base Model | ~770M parameters | Loaded only during embedding creation |

### Performance Metrics Explained

1. **Training MSE (Mean Squared Error)**
   - Measures how well the transformation matrix maps content embeddings to topic embeddings
   - Calculated on the 80% training split
   - Lower values indicate better alignment between transformed content and actual topic embeddings

2. **Testing MSE**
   - Same metric, computed on the 20% held-out test set
   - Indicates generalization capability
   - Similar train and test values suggest good generalization; a testing MSE slightly above the training MSE is expected

3. **Inter-topic MSE**
   - Minimum squared distance between any pair of topic embeddings
   - Higher values indicate better topic separation
   - Critical for preventing topic confusion
   - Example: MSD's 0.00566 is nearly 3× its testing MSE (0.00197), which suggests the typical mapping error stays well below the gap between the two closest medical topics

### Comparative Analysis

- MSD dataset shows the best training performance (0.00174 MSE)
  - Likely due to well-structured medical vocabulary
  - Clear topic boundaries in the medical domain
- TopicSUM has the highest inter-topic MSE (0.00737)
  - Reflects the diverse nature of conversational topics
  - Important for distinguishing between varied dialogue contexts
- ArXiv results fall between the two
  - Scientific content has natural overlap between fields
  - Still maintains good topic separation (0.00620 inter-topic MSE)

## Implementation

**Try it out here:** [Colab notebook](https://colab.research.google.com/drive/1_SuTiL3QS-PUYjSrugqqD5mQlMv8Hbfc?usp=sharing)

### 1. Base Model Wrapper

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

class BottleneckT5Autoencoder:
    def __init__(self, model_path: str, device='cpu'):
        self.device = device
        self.tokenizer = AutoTokenizer.from_pretrained(model_path, model_max_length=512)
        # trust_remote_code is required because the model ships its own modeling code
        self.model = AutoModelForCausalLM.from_pretrained(
            model_path, trust_remote_code=True
        ).to(device)
        self.model.eval()

    @torch.no_grad()
    def embed(self, text: str) -> torch.FloatTensor:
        """Encode text into the model's 1024-dimensional bottleneck embedding."""
        inputs = self.tokenizer(text, return_tensors='pt').to(self.device)
        decoder_inputs = self.tokenizer('', return_tensors='pt').to(self.device)
        return self.model(
            **inputs,
            decoder_input_ids=decoder_inputs['input_ids'],
            encode_only=True,
        )[0]

    @torch.no_grad()
    def generate_from_latent(self, latent: torch.FloatTensor, max_length=512, temperature=1.0) -> str:
        """Decode text from a latent by perturbing the embedding of a dummy input."""
        dummy_text = '.'
        dummy = self.embed(dummy_text)
        perturb_vector = latent - dummy
        self.model.perturb_vector = perturb_vector
        input_ids = self.tokenizer(dummy_text, return_tensors='pt').to(self.device).input_ids
        output = self.model.generate(
            input_ids=input_ids,
            max_length=max_length,
            do_sample=True,
            temperature=temperature,
            top_p=0.9,
            num_return_sequences=1,
        )
        return self.tokenizer.decode(output[0], skip_special_tokens=True)
```

### 2. Topic Mapper

**Transformations Available:**

1. https://huggingface.co/AmanPriyanshu/Contra-Topic-bottleneck-t5-large/resolve/main/transformation_matrix_topicsum.pt
2. https://huggingface.co/AmanPriyanshu/Contra-Topic-bottleneck-t5-large/resolve/main/transformation_matrix_arxiv.pt
3. https://huggingface.co/AmanPriyanshu/Contra-Topic-bottleneck-t5-large/resolve/main/transformation_matrix_msd.pt

```python
import requests
import torch

url = 'https://huggingface.co/AmanPriyanshu/Contra-Topic-bottleneck-t5-large/resolve/main/transformation_matrix_arxiv.pt'
file_path = 'transformation_matrix.pt'

# Download the matrix for the chosen domain (arXiv here) and fail loudly on HTTP errors
response = requests.get(url)
response.raise_for_status()
with open(file_path, 'wb') as f:
    f.write(response.content)

transformation_matrix = torch.load(file_path, weights_only=False).float()  # cast to float32
print(transformation_matrix.shape, type(transformation_matrix))
```

### 3. Final Conversion

```python
model_path = 'thesephist/contra-bottleneck-t5-large-wikipedia'
device = 'cpu'
content = 'Replace this with the text whose topic you want to extract.'  # placeholder input

autoencoder = BottleneckT5Autoencoder(model_path=model_path, device=device)
content_embedding = autoencoder.embed(content)
# Map the content embedding into topic space, then decode it back to text
topic_embedding = content_embedding @ transformation_matrix
topic = autoencoder.generate_from_latent(topic_embedding)
print(topic)
```

## Limitations and Future Work

1. **Representation Quality**
   - The system inherits Bottleneck T5's encoding limitations
   - Performance depends on the input text fitting the model's training distribution

2. **Domain Specificity**
   - Each matrix is domain-optimized
   - Cross-domain performance is not guaranteed
   - Future work: Investigate domain adaptation techniques

3. **Fixed Dimensionality**
   - Locked to Bottleneck T5's 1024D space
   - Potential future work: Dimension reduction studies

4. **Linear Transformation Limitations**
   - Assumes a linear relationship between content and topic spaces
   - Future work: Explore non-linear transformations

## Memory and Computation Requirements

- Transformation Matrix: ~9MB per domain (as reported); in memory as float32, 1024 × 1024 × 4 bytes ≈ 4.2 MB
- Inference Time: ~10ms on CPU (matrix multiplication)
- Total Model Size: ~27MB for all three domains (as reported); ≈ 12.6 MB in memory as float32
- Base Model: ~770M parameters (loaded only during embedding creation)

## Acknowledgments

Special thanks to:

- Linus Lee (@thesephist) for the Bottleneck T5 model
- The T5 team at Google Research
- Dataset providers:
  - @ankitagr01 for the ArXiv abstracts dataset
  - @knkarthick for the TopicSUM dataset
  - @nuvocare for the MSD Manual topics dataset
- Kaggle for providing free P100 GPU resources

## License

MIT

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.
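
## Appendix: Toy Sketch of the MSE Fit

The Transformation Layers section says each matrix is learned with simple Mean Squared Error optimization, and the metrics are reported on an 80/20 train/test split. The sketch below illustrates that idea on synthetic data. It is not the training code behind the published matrices. It assumes full-batch gradient descent (the exact optimizer is not stated above) and a toy 3-dimensional space instead of 1024, and it uses plain Python lists so it runs without dependencies. The names `matmul`, `mse`, `fit_linear_map`, and `DIM` are hypothetical helpers introduced only for this example.

```python
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def mse(pred, target):
    """Mean squared error over every element of two equally shaped matrices."""
    total = sum((p - t) ** 2 for pr, tr in zip(pred, target) for p, t in zip(pr, tr))
    return total / (len(pred) * len(pred[0]))

def fit_linear_map(content, topic, lr=1.0, steps=500):
    """Fit W so that content @ W approximates topic, by gradient descent on MSE."""
    n, d = len(content), len(content[0])
    w = [[0.0] * d for _ in range(d)]
    content_t = [list(col) for col in zip(*content)]  # transpose, shape (d, n)
    scale = 2.0 / (n * d)  # derivative of the mean over n * d squared errors
    for _ in range(steps):
        pred = matmul(content, w)
        residual = [[p - t for p, t in zip(pr, tr)] for pr, tr in zip(pred, topic)]
        grad = matmul(content_t, residual)  # X^T (XW - Y), shape (d, d)
        w = [[wij - lr * scale * gij for wij, gij in zip(wr, gr)]
             for wr, gr in zip(w, grad)]
    return w

random.seed(0)
DIM = 3  # toy stand-in for the 1024-dimensional latent space (assumption)

# Synthetic data: "topic embeddings" are an exact linear function of "content embeddings".
w_true = [[random.uniform(-1, 1) for _ in range(DIM)] for _ in range(DIM)]
content = [[random.uniform(-1, 1) for _ in range(DIM)] for _ in range(100)]
topic = matmul(content, w_true)

# 80/20 split, mirroring the split described in the metrics section.
split = int(0.8 * len(content))
w = fit_linear_map(content[:split], topic[:split])

train_mse = mse(matmul(content[:split], w), topic[:split])
test_mse = mse(matmul(content[split:], w), topic[split:])
print(f"train MSE: {train_mse:.2e}, test MSE: {test_mse:.2e}")
```

Because the synthetic targets are exactly linear in the inputs, both errors shrink towards zero here. On real embeddings the relationship is only approximately linear, which is why the reported training and testing MSE values are small but non-zero. At 1024 dimensions a tensor library, or a closed-form least-squares solve, would be the practical choice over list-based loops.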