- XC-Cache: Cross-Attending to Cached Context for Efficient LLM Inference
  Paper • 2404.15420 • Published • 7
- OpenELM: An Efficient Language Model Family with Open-source Training and Inference Framework
  Paper • 2404.14619 • Published • 124
- Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone
  Paper • 2404.14219 • Published • 251
- How Good Are Low-bit Quantized LLaMA3 Models? An Empirical Study
  Paper • 2404.14047 • Published • 44

Collections

Collections including paper arxiv:2310.18313

- QuIP: 2-Bit Quantization of Large Language Models With Guarantees
  Paper • 2307.13304 • Published • 2
- SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression
  Paper • 2306.03078 • Published • 3
- OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models
  Paper • 2308.13137 • Published • 17
- AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
  Paper • 2306.00978 • Published • 8

- Attention Is All You Need
  Paper • 1706.03762 • Published • 44
- FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning
  Paper • 2307.08691 • Published • 8
- Mixtral of Experts
  Paper • 2401.04088 • Published • 157
- Mistral 7B
  Paper • 2310.06825 • Published • 47

- FP8-LM: Training FP8 Large Language Models
  Paper • 2310.18313 • Published • 31
- LLM-FP4: 4-Bit Floating-Point Quantized Transformers
  Paper • 2310.16836 • Published • 13
- TEQ: Trainable Equivalent Transformation for Quantization of LLMs
  Paper • 2310.10944 • Published • 9
- ModuLoRA: Finetuning 3-Bit LLMs on Consumer GPUs by Integrating with Modular Quantizers
  Paper • 2309.16119 • Published • 1

- LoftQ: LoRA-Fine-Tuning-Aware Quantization for Large Language Models
  Paper • 2310.08659 • Published • 22
- QA-LoRA: Quantization-Aware Low-Rank Adaptation of Large Language Models
  Paper • 2309.14717 • Published • 44
- Norm Tweaking: High-performance Low-bit Quantization of Large Language Models
  Paper • 2309.02784 • Published • 1
- ModuLoRA: Finetuning 3-Bit LLMs on Consumer GPUs by Integrating with Modular Quantizers
  Paper • 2309.16119 • Published • 1

- Large Language Models for Compiler Optimization
  Paper • 2309.07062 • Published • 22
- Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time
  Paper • 2310.17157 • Published • 11
- FP8-LM: Training FP8 Large Language Models
  Paper • 2310.18313 • Published • 31
- Atom: Low-bit Quantization for Efficient and Accurate LLM Serving
  Paper • 2310.19102 • Published • 10

- Approximating Two-Layer Feedforward Networks for Efficient Transformers
  Paper • 2310.10837 • Published • 10
- BitNet: Scaling 1-bit Transformers for Large Language Models
  Paper • 2310.11453 • Published • 96
- QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models
  Paper • 2310.16795 • Published • 26
- LLM-FP4: 4-Bit Floating-Point Quantized Transformers
  Paper • 2310.16836 • Published • 13