Visual intelligence for your home.
Implementation of Vision Transformer, a simple way to achieve SOTA
Build Vision Agents quickly with any model or video provider
Phi-3.5 for Mac: Locally-run Vision and Language Models
Mixture-of-Experts Vision-Language Models for Advanced Multimodal
The repository provides code for running inference with SAM 2
Structure-from-Motion and Multi-View Stereo
Collection of CVPR 2026 Papers and Open Source Projects
Towards Real-World Vision-Language Understanding
Gracefully face hCaptcha challenges with multimodal LLMs
3D Computer Vision Framework
Vision-based AI framework for cross-platform UI automation tasks
Draw wireframe sketches and generate HTML with AI vision models
Advanced AI Explainability for computer vision
Skywork-R1V is an advanced multimodal AI model series
A Pioneering Open-Source Alternative to GPT-4o
ICLR 2024 Spotlight: curation/training code, metadata, distribution
Reference PyTorch implementation and models for DINOv3
The ExDARK dataset is the largest collection of low-light images
Multilingual Document Layout Parsing in a Single Vision-Language Model
RF-DETR is a real-time object detection and segmentation
Provides code for running inference with the Segment Anything Model
"Big Model": train a visual multimodal VLM with 26M parameters
A neural network that transforms a design mock-up into a static website
Large language model and vision-language model based on linear attention