leonardlin's Collections
QuIP: 2-Bit Quantization of Large Language Models With Guarantees (arXiv:2307.13304)
SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression (arXiv:2306.03078)
OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models (arXiv:2308.13137)
AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration (arXiv:2306.00978)
GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers (arXiv:2210.17323)
ZeroQuant(4+2): Redefining LLMs Quantization with a New FP6-Centric Strategy for Diverse Generative Tasks (arXiv:2312.08583)
QLoRA: Efficient Finetuning of Quantized LLMs (arXiv:2305.14314)
LLM-FP4: 4-Bit Floating-Point Quantized Transformers (arXiv:2310.16836)
FP8-LM: Training FP8 Large Language Models (arXiv:2310.18313)
FP8 Quantization: The Power of the Exponent (arXiv:2208.09225)
Atom: Low-bit Quantization for Efficient and Accurate LLM Serving (arXiv:2310.19102)
QLLM: Accurate and Efficient Low-Bitwidth Quantization for Large Language Models (arXiv:2310.08041)
Towards End-to-end 4-Bit Inference on Generative Large Language Models (arXiv:2310.09259)
Outlier Suppression+: Accurate Quantization of Large Language Models by Equivalent and Optimal Shifting and Scaling (arXiv:2304.09145)
Dual Grained Quantization: Efficient Fine-Grained Quantization for LLM (arXiv:2310.04836)
Rethinking Channel Dimensions to Isolate Outliers for Low-bit Weight Quantization of Large Language Models (arXiv:2309.15531)
LoftQ: LoRA-Fine-Tuning-Aware Quantization for Large Language Models (arXiv:2310.08659)
QA-LoRA: Quantization-Aware Low-Rank Adaptation of Large Language Models (arXiv:2309.14717)
Norm Tweaking: High-performance Low-bit Quantization of Large Language Models (arXiv:2309.02784)
ModuLoRA: Finetuning 3-Bit LLMs on Consumer GPUs by Integrating with Modular Quantizers (arXiv:2309.16119)
Training and Inference of Large Language Models Using 8-bit Floating Point (arXiv:2309.17224)
BitNet: Scaling 1-bit Transformers for Large Language Models (arXiv:2310.11453)
Understanding the Impact of Post-Training Quantization on Large Language Models (arXiv:2309.05210)
PB-LLM: Partially Binarized Large Language Models (arXiv:2310.00034)
TEQ: Trainable Equivalent Transformation for Quantization of LLMs (arXiv:2310.10944)
QFT: Quantized Full-parameter Tuning of LLMs with Affordable Resources (arXiv:2310.07147)
FPTQ: Fine-grained Post-Training Quantization for Large Language Models (arXiv:2308.15987)
SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models (arXiv:2211.10438)
LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale (arXiv:2208.07339)
Optimize Weight Rounding via Signed Gradient Descent for the Quantization of LLMs (arXiv:2309.05516)
BiLLM: Pushing the Limit of Post-Training Quantization for LLMs (arXiv:2402.04291)
Extreme Compression of Large Language Models via Additive Quantization (arXiv:2401.06118)
KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization (arXiv:2401.18079)
GPTVQ: The Blessing of Dimensionality for LLM Quantization (arXiv:2402.15319)
The Case for 4-bit Precision: k-bit Inference Scaling Laws (arXiv:2212.09720)
SqueezeLLM: Dense-and-Sparse Quantization (arXiv:2306.07629)
QuIP#: Even Better LLM Quantization with Hadamard Incoherence and Lattice Codebooks (arXiv:2402.04396)