Collections including paper arxiv:1706.03762

- Attention Is All You Need
  Paper • 1706.03762 • Published • 44
- LoRA: Low-Rank Adaptation of Large Language Models
  Paper • 2106.09685 • Published • 30
- Direct Preference Optimization: Your Language Model is Secretly a Reward Model
  Paper • 2305.18290 • Published • 48
- Lost in the Middle: How Language Models Use Long Contexts
  Paper • 2307.03172 • Published • 36

- Attention Is All You Need
  Paper • 1706.03762 • Published • 44
- Language Models are Few-Shot Learners
  Paper • 2005.14165 • Published • 11
- Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
  Paper • 2201.11903 • Published • 9
- Orca 2: Teaching Small Language Models How to Reason
  Paper • 2311.11045 • Published • 70

- Attention Is All You Need
  Paper • 1706.03762 • Published • 44
- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
  Paper • 1810.04805 • Published • 14
- Universal Language Model Fine-tuning for Text Classification
  Paper • 1801.06146 • Published • 6
- Language Models are Few-Shot Learners
  Paper • 2005.14165 • Published • 11

- The Impact of Depth and Width on Transformer Language Model Generalization
  Paper • 2310.19956 • Published • 9
- Retentive Network: A Successor to Transformer for Large Language Models
  Paper • 2307.08621 • Published • 170
- RWKV: Reinventing RNNs for the Transformer Era
  Paper • 2305.13048 • Published • 14
- Attention Is All You Need
  Paper • 1706.03762 • Published • 44

- Attention Is All You Need
  Paper • 1706.03762 • Published • 44
- FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning
  Paper • 2307.08691 • Published • 8
- Mixtral of Experts
  Paper • 2401.04088 • Published • 157
- Mistral 7B
  Paper • 2310.06825 • Published • 47

- Detecting Pretraining Data from Large Language Models
  Paper • 2310.16789 • Published • 10
- Let's Synthesize Step by Step: Iterative Dataset Synthesis with Large Language Models by Extrapolating Errors from Small Models
  Paper • 2310.13671 • Published • 18
- AutoMix: Automatically Mixing Language Models
  Paper • 2310.12963 • Published • 14
- An Emulator for Fine-Tuning Large Language Models using Small Language Models
  Paper • 2310.12962 • Published • 14

- Efficient Memory Management for Large Language Model Serving with PagedAttention
  Paper • 2309.06180 • Published • 25
- LM-Infinite: Simple On-the-Fly Length Generalization for Large Language Models
  Paper • 2308.16137 • Published • 39
- Scaling Transformer to 1M tokens and beyond with RMT
  Paper • 2304.11062 • Published • 2
- DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models
  Paper • 2309.14509 • Published • 17

- Attention Is All You Need
  Paper • 1706.03762 • Published • 44
- Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
  Paper • 2005.11401 • Published • 12
- LoRA: Low-Rank Adaptation of Large Language Models
  Paper • 2106.09685 • Published • 30
- FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
  Paper • 2205.14135 • Published • 11

- SIMPL: A Simple and Efficient Multi-agent Motion Prediction Baseline for Autonomous Driving
  Paper • 2402.02519 • Published
- Mixtral of Experts
  Paper • 2401.04088 • Published • 157
- Optimal Transport Aggregation for Visual Place Recognition
  Paper • 2311.15937 • Published
- GOAT: GO to Any Thing
  Paper • 2311.06430 • Published • 14