Molmo 2: State-of-the-art video understanding, pointing, and tracking multimodal models

Posted by maxloh 1 day ago


Comments

Comment by maxloh 1 day ago

From Allen AI's Discord:

*Introducing Molmo 2*: State-of-the-art video understanding, pointing, and tracking

Last year, Molmo helped push image understanding forward with pointing—grounded answers you can verify. Now, *Molmo 2* brings those capabilities to video—so the model doesn’t just answer questions, it can show you where & when something is happening.

On major industry benchmarks, Molmo 2 *surpasses most open multimodal models* and even *rivals closed peers* like Gemini 3 Pro and Claude Sonnet 4.5.

Molmo 2 returns pixel coordinates + timestamps over videos and coordinates over images, enabling:

- Video + image QA
- Counting-by-pointing
- Dense captioning
- Artifact detection
- Subtitle-aware analysis

…and more!
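For anyone wondering what "coordinates + timestamps" looks like in practice: Molmo 1 emitted points as XML-like tags inside its text output (e.g. `<point x="61.5" y="40.6">dog</point>`, with x/y as percentages of the frame). Here's a minimal parsing sketch that *assumes* Molmo 2 extends that tag with a timestamp attribute for video; the actual output schema is in the report, so treat the tag format below as a guess:

```python
import re

# ASSUMPTION: Molmo-1-style <point> tags, extended with a t="" attribute
# for video timestamps. Check https://allenai.org/papers/molmo2 for the
# real schema before relying on this.
POINT_TAG = re.compile(
    r'<point\s+x="(?P<x>[\d.]+)"\s+y="(?P<y>[\d.]+)"'
    r'(?:\s+t="(?P<t>[\d.]+)")?[^>]*>(?P<label>[^<]*)</point>'
)

def parse_points(generated_text: str, width: int, height: int):
    """Convert percentage coordinates in model output to pixel coordinates."""
    events = []
    for m in POINT_TAG.finditer(generated_text):
        events.append({
            "label": m["label"],
            "x_px": float(m["x"]) / 100 * width,
            "y_px": float(m["y"]) / 100 * height,
            "t_sec": float(m["t"]) if m["t"] else None,  # None for still images
        })
    return events

# Counting-by-pointing falls out for free: the count is len(events).
sample = ('<point x="48.0" y="72.5" t="1.9">bounce</point>'
          '<point x="51.2" y="74.0" t="3.4">bounce</point>')
hits = parse_points(sample, width=1280, height=720)
print(len(hits), hits)  # 2 bounces, each with pixel coords + a timestamp
```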

Three variants depending on your needs:

- *Molmo 2 (8B)*: Qwen 3 backbone, best overall performance
- *Molmo 2 (4B)*: Qwen 3 backbone, fast + efficient
- *Molmo 2-O (7B)*: Olmo backbone, fully open model
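Molmo 1 shipped with custom modeling code on the Hub (so loading needed `trust_remote_code=True`); assuming Molmo 2 follows the same pattern, picking a variant is just a repo name. The `MODEL_ID` below is a placeholder, not a confirmed name: grab the real IDs from the collection linked at the bottom.

```python
from transformers import AutoModelForCausalLM, AutoProcessor

# Placeholder ID -- pick the actual repo name for the 8B / 4B / 7B-O
# variant from https://huggingface.co/collections/allenai/molmo2.
MODEL_ID = "allenai/Molmo2-8B"  # hypothetical

# trust_remote_code was required for Molmo 1's custom architecture;
# we assume the same loading pattern here.
processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    trust_remote_code=True,
    device_map="auto",
)
```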

Demos:

- *Counting objects & actions* (“How many times does the ball hit the ground?”)—returns the count plus space–time pointers for each event: https://www.youtube.com/watch?v=fvYfPTTTZ_w
- *Ask-it-anything long-video QA* (“Why does the player change strategy here?”)—points to the moments supporting the answer: https://www.youtube.com/watch?v=Ej3Hb3kRiac
- *Object tracking* (“Follow the red race car.”)—tracks it across frames with coordinates over time: https://www.youtube.com/watch?v=uot140v_h08

We’ve also *significantly upgraded the Ai2 Playground*. You can now upload a video or multiple images to try summarization, tracking, and counting—while seeing exactly where the model is looking.

Try it and learn more:

- Playground: https://playground.allenai.org/
- Models: https://huggingface.co/collections/allenai/molmo2
- Blog: https://allenai.org/blog/molmo2
- Report: https://allenai.org/papers/molmo2
- API coming soon