Zero-Shot Open-Vocabulary Tracking with Large Pre-Trained Models

Wen-Hsuan Chu    Adam W. Harley    Pavel Tokmakov   
Achal Dave    Leonidas J. Guibas    Katerina Fragkiadaki


Tracking-by-detection has long been a dominant paradigm for object tracking of specific object categories, such as people and vehicles. Recently, large-scale pre-trained models have shown promising advances in detecting and segmenting various types of objects and object parts in 2D images in the wild. This raises the question: can we re-purpose these large-scale pre-trained static image detectors and segmenters to advance open-vocabulary video tracking? In this paper, we re-purpose an open-vocabulary detector, segmenter, and dense optical flow estimator into a model that tracks and segments objects of any category in 2D videos. Given a monocular video input, our method predicts object and part tracks with associated language descriptions. Our approach does not introduce any significant innovations: we propagate object boxes from frame to frame using an optical-flow-based motion model, we refine these propagated boxes with the box regression module of an open-vocabulary visual detector, and we prompt an open-world segmenter with the refined box to segment the box interior in a temporally consistent way. We decide the termination of an object track using forward-backward optical flow consistency, and re-identify objects using deep feature matching. We show that our model achieves strong performance on multiple established video object segmentation and tracking benchmarks, despite never being explicitly trained for tracking. We hope that our approach can serve as a simple and extensible framework for future research.
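Two of the components above, flow-based box propagation and forward-backward flow consistency, can be sketched in a few lines. The snippet below is an illustrative NumPy sketch, not the actual implementation: it shifts a box by the median optical flow inside it, and measures how consistent the forward and backward flows are inside a box (a low consistency score would mark a track as a termination candidate). The function names, the median aggregation, and the threshold are our simplifying assumptions.

```python
import numpy as np

def propagate_box(box, flow):
    """Shift a box [x1, y1, x2, y2] by the median optical flow inside it.

    Simplified stand-in for a flow-based motion model; the real method may
    aggregate flow vectors differently.
    """
    x1, y1, x2, y2 = [int(round(v)) for v in box]
    region = flow[y1:y2, x1:x2]            # (h, w, 2) flow vectors inside the box
    dx = float(np.median(region[..., 0]))
    dy = float(np.median(region[..., 1]))
    return [x1 + dx, y1 + dy, x2 + dx, y2 + dy]

def forward_backward_consistent(fwd_flow, bwd_flow, box, thresh=1.0):
    """Fraction of pixels in `box` whose flow survives a round trip.

    A pixel p in frame t maps to p' = p + fwd_flow(p) in frame t+1; if the
    two flows agree, bwd_flow(p') should approximately undo that motion,
    i.e. ||fwd_flow(p) + bwd_flow(p')|| should be near zero.
    """
    h, w = fwd_flow.shape[:2]
    x1, y1, x2, y2 = [int(round(v)) for v in box]
    ys, xs = np.mgrid[y1:y2, x1:x2]
    fx = fwd_flow[ys, xs, 0]
    fy = fwd_flow[ys, xs, 1]
    # Destination coordinates in frame t+1, clipped to the image bounds.
    xs2 = np.clip(np.round(xs + fx).astype(int), 0, w - 1)
    ys2 = np.clip(np.round(ys + fy).astype(int), 0, h - 1)
    bx = bwd_flow[ys2, xs2, 0]
    by = bwd_flow[ys2, xs2, 1]
    err = np.sqrt((fx + bx) ** 2 + (fy + by) ** 2)  # round-trip error per pixel
    return float(np.mean(err < thresh))
```

In a full tracker, the propagated box would next be refined by the detector's box regression head and passed as a prompt to the segmenter; the consistency score would be thresholded to decide track termination.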


For quantitative results, please refer to the paper. Below we show GIFs of examples where users can prompt the tracker to track specific object categories.

No prompts


No prompts

"Coffee Maker"

By swapping out the detector for a referential detector, OVTracktor can easily be extended to referential tracking.

No prompts

"Large orange Parachute"

Since the detector is run on all frames, care must be taken when querying for "position-based" entities. In this example, the "rightmost" person changes as people move in the scene, leading to two final tracks.

No prompts

"Rightmost person"


Wen-Hsuan Chu, Adam W. Harley, Pavel Tokmakov, Achal Dave, Leonidas J. Guibas, and Katerina Fragkiadaki. Zero-Shot Open-Vocabulary Tracking with Large Pre-Trained Models.