MarDini: Masked Autoregressive Diffusion for Video Generation at Scale Paper • 2410.20280 • Published 11 days ago • 21
NaturalBench: Evaluating Vision-Language Models on Natural Adversarial Samples Paper • 2410.14669 • Published 19 days ago • 35
MLLM can see? Dynamic Correction Decoding for Hallucination Mitigation Paper • 2410.11779 • Published 22 days ago • 24
Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders Paper • 2408.15998 • Published Aug 28 • 83
Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model Paper • 2408.11039 • Published Aug 20 • 56
Seeing Faces in Things: A Model and Dataset for Pareidolia Paper • 2409.16143 • Published Sep 24 • 15
PixWizard: Versatile Image-to-Image Visual Assistant with Open-Language Instructions Paper • 2409.15278 • Published Sep 23 • 22
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution Paper • 2409.12191 • Published Sep 18 • 73
CDM: A Reliable Metric for Fair and Accurate Formula Recognition Evaluation Paper • 2409.03643 • Published Sep 5 • 18
Real-Time Video Generation with Pyramid Attention Broadcast Paper • 2408.12588 • Published Aug 22 • 14