SlowFast is a video understanding framework that captures both spatial semantics and temporal dynamics by processing video at two different temporal resolutions. The slow pathway samples frames sparsely to encode semantic context (appearance), while the fast pathway operates on densely sampled frames with a reduced channel budget to capture motion and fine temporal cues. The two pathways complement each other, letting the network model both appearance and motion without excessive computational cost. The architecture is modular, supports tasks such as action recognition, temporal localization, and video segmentation, and performs strongly on benchmarks such as Kinetics and AVA. The repository provides training recipes, pretrained models, and distributed pipelines optimized for large-scale video datasets.
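As a rough illustration of the dual-rate sampling described above, the sketch below splits a single clip into slow- and fast-pathway inputs. This is a minimal NumPy sketch: the helper name `sample_pathways` and the specific stride values are illustrative assumptions, not the repository's actual API.

```python
import numpy as np

def sample_pathways(video, alpha=4, slow_stride=16):
    """Split a video clip of shape (T, H, W, C) into slow/fast pathway inputs.

    The slow pathway takes 1 frame every `slow_stride` frames (sparse,
    semantics-oriented); the fast pathway samples `alpha` times more
    densely (stride = slow_stride // alpha) to capture motion.
    """
    slow = video[::slow_stride]               # sparse frames: semantic context
    fast = video[::slow_stride // alpha]      # dense frames: temporal detail
    return slow, fast

clip = np.zeros((64, 224, 224, 3), dtype=np.float32)  # a 64-frame clip
slow, fast = sample_pathways(clip, alpha=4, slow_stride=16)
print(slow.shape[0], fast.shape[0])  # slow gets 4 frames, fast gets 16
```

In the full model, the fast pathway compensates for its denser frame rate with far fewer channels, which is what keeps the overall cost manageable.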
## Features
- Dual-pathway architecture: slow for semantics, fast for motion cues
- Efficient temporal modeling with complementary frame rates and channel budgets
- State-of-the-art results in action recognition and spatiotemporal understanding
- Training and evaluation pipelines for large-scale video datasets (Kinetics, AVA)
- Pretrained models and tools for fine-tuning and visualization
- Modular design for extending to video segmentation, detection, and multimodal tasks
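The "complementary frame rates and channel budgets" claim can be sanity-checked with a back-of-envelope estimate. Assuming, for illustration only, that a layer's cost scales with the number of frames times the square of the channel count, the SlowFast paper's default ratios (temporal density α = 8, channel ratio β = 1/8) make the fast pathway only a small fraction of total compute:

```python
# Back-of-envelope estimate; assumes per-layer cost ~ frames * channels^2.
alpha = 8        # fast pathway samples 8x more frames than the slow one
beta = 1 / 8     # ...but uses only 1/8 as many channels
slow_cost = 1.0                  # normalize the slow pathway's cost to 1
fast_cost = alpha * beta ** 2    # more frames, quadratically fewer channels
print(fast_cost)                 # 0.125: fast adds ~12.5% on top of slow
```

Under this simplified cost model, the dense pathway adds only about an eighth of the slow pathway's compute, which is consistent with the efficiency claim above.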