The WebVid-10M Dataset

Lonely beautiful woman sitting on the tent looking outside. wind on the hair and camping on the beach near the colors of water and shore. freedom and alternative tiny house for traveler lady drinking.

Female cop talking on walkietalkie, responding emergency call, crime prevention

Billiards, concentrated young woman playing in club.

Cabeza de toro, punta cana/ dominican republic - feb 20, 2020: 4k drone flight over coral reef with manta

Kherson, ukraine - 20 may 2016: open, free, rock music festival crowd partying at a rock concert. hands up, people, fans cheering clapping applauding in kherson, ukraine - 20 may 2016. band performing

Runners feet in a sneakers close up. realistic three dimensional animation.

What is WebVid-10M?

WebVid-10M is a large-scale dataset of short videos paired with textual descriptions, sourced from the web. The videos are diverse and rich in content.
  • 10.7M video-caption pairs.
  • 52K total video hours.
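
The two figures above imply an average clip length, which gives a rough sense of what "short videos" means here. The sketch below only does that arithmetic; the helper name `mean_clip_seconds` is hypothetical, and the inputs are the rounded totals quoted above, so the result is an approximation rather than an official statistic.

```python
# Rounded totals quoted in the dataset description above.
NUM_PAIRS = 10_700_000   # 10.7M video-caption pairs
TOTAL_HOURS = 52_000     # 52K total video hours


def mean_clip_seconds(total_hours: float, num_clips: int) -> float:
    """Average clip duration in seconds (hypothetical helper for illustration)."""
    return total_hours * 3600 / num_clips


avg = mean_clip_seconds(TOTAL_HOURS, NUM_PAIRS)
print(f"Approximate mean clip length: {avg:.1f} s")  # about 17.5 s
```

Because both inputs are rounded, the estimate should be read as "on the order of 17 to 18 seconds per clip".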


Full 10M (coming soon)
2.5M Subset



M. Bain, A. Nagrani, G. Varol, A. Zisserman.
Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval.
ICCV, 2021.
(hosted on arXiv)



Max Bain

Arsha Nagrani

Gül Varol

Andrew Zisserman
