Collections

RLHF Workflow: From Reward Modeling to Online RLHF

Paper • 2405.07863 • Published May 13 • 67

Chameleon: Mixed-Modal Early-Fusion Foundation Models

Paper • 2405.09818 • Published May 16 • 126

Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models

Paper • 2405.15574 • Published May 24 • 53

An Introduction to Vision-Language Modeling

Paper • 2405.17247 • Published May 27 • 85

EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters

Paper • 2402.04252 • Published Feb 6 • 25

Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models

Paper • 2402.03749 • Published Feb 6 • 12

ScreenAI: A Vision-Language Model for UI and Infographics Understanding

Paper • 2402.04615 • Published Feb 7 • 38

EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss

Paper • 2402.05008 • Published Feb 7 • 19

MoMa: Efficient Early-Fusion Pre-training with Mixture of Modality-Aware Experts

VILA^2: VILA Augmented VILA

The Synergy between Data and Multi-Modal Large Language Models: A Survey from Co-Development Perspective

Vision language models are blind

LLaMA: Open and Efficient Foundation Language Models

Efficient Tool Use with Chain-of-Abstraction Reasoning

iVideoGPT: Interactive VideoGPTs are Scalable World Models

Matryoshka Multimodal Models

Attention Is All You Need

Megalodon: Efficient LLM Pretraining and Inference with Unlimited Context Length

RecurrentGemma: Moving Past Transformers for Efficient Open Language Models

Eagle and Finch: RWKV with Matrix-Valued States and Dynamic Recurrence

Mamba: Linear-Time Sequence Modeling with Selective State Spaces

Rho-1: Not All Tokens Are What You Need

VASA-1: Lifelike Audio-Driven Talking Faces Generated in Real Time

Instruction-tuned Language Models are Better Knowledge Learners

DoRA: Weight-Decomposed Low-Rank Adaptation

Can Large Language Models Understand Context?

OLMo: Accelerating the Science of Language Models

Self-Rewarding Language Models

SemScore: Automated Evaluation of Instruction-Tuned LLMs based on Semantic Textual Similarity

Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

Sparse Networks from Scratch: Faster Training without Losing Performance

ZeRO: Memory Optimizations Toward Training Trillion Parameter Models

A Mixture of h-1 Heads is Better than h Heads