latency free download

Showing 165 open source projects for "latency"

View related business solutions

Artificial Intelligence Clear Filters & Widen Search

Inventory and Order Management Software for Multichannel Sellers
Avoid stockouts, overselling, and losing control as your business grows.

We are the most powerful inventory and order management platform for Amazon, Walmart, and multichannel product sellers. Centralize orders, product information, and fulfillment operations to run more efficiently, sell more products, and stay compliant with marketplace requirements so you can grow profitably.

Learn More
Securden Privileged Account Manager
Unified Privileged Access Management

Discover and manage administrator, service, and web app passwords, keys, and identities. Automate management with approval workflows. Centrally control, audit, monitor, and record all access to critical IT assets.

Learn More
1

DeepSpeed MII

MII makes low-latency and high-throughput inference possible

...While open-sourcing has democratized access to AI capabilities, their application is still restricted by two critical factors: inference latency and cost. DeepSpeed-MII is a new open-source python library from DeepSpeed, aimed towards making low-latency, low-cost inference of powerful models not only feasible but also easily accessible. MII offers access to the highly optimized implementation of thousands of widely used DL models. MII-supported models achieve significantly lower latency and cost compared to their original implementation.

Downloads: 6 This Week

Last Update: 2025-03-25
See Project
2

RealtimeSTT

A robust, efficient, low-latency speech-to-text library

RealtimeSTT is a Python-based realtime speech-to-text engine emphasizing low latency, wake-word detection, voice activity detection, and automatic speech segmentation. It provides asynchronous callbacks, nanosecond-precision timestamps, and CLI tools, suitable for building voice assistants, meeting transcribers, or live caption systems.

Downloads: 6 This Week

Last Update: 2025-07-03
See Project
3

FlashInfer

FlashInfer: Kernel Library for LLM Serving

FlashInfer is a kernel library designed to enhance the serving of Large Language Models (LLMs) by optimizing inference performance. It provides a high-performance framework that integrates seamlessly with existing systems, aiming to reduce latency and improve efficiency in LLM deployments. FlashInfer supports various hardware architectures and is built to scale with the demands of production environments.

Downloads: 35 This Week

Last Update: 1 day ago
See Project
4

LiteRT

LiteRT is the new name for TensorFlow Lite (TFLite)

LiteRT is an experimental, real-time inference runtime built by Google AI Edge to run lightweight ML models on edge devices with ultra-low latency. It focuses on delivering predictable and consistent performance for models used in time-critical applications like robotics, AR/VR, and IoT. LiteRT is designed to be hardware-agnostic, with minimal dependencies and tight control over execution scheduling.

Downloads: 20 This Week

Last Update: 3 days ago
See Project
The full-stack observability platform that protects your dataLayer, tags and conversion data
Stop losing revenue to bad data today. and protect your marketing data with Code-Cube.io.

Code-Cube.io detects issues instantly, alerts you in real time and helps you resolve them fast. No manual QA. No unreliable data. Just data you can trust and act on.

Learn More
5

OpenJarvis

Personal AI, On Personal Devices

...The framework provides shared primitives for building local-first agents, along with evaluation tools that measure performance using metrics such as energy consumption, latency, cost, and accuracy. OpenJarvis integrates with local inference engines like Ollama, vLLM, SGLang, and llama.cpp to run language models directly on personal hardware. It also includes a learning loop that allows models to improve over time using locally generated interaction traces. By prioritizing local execution and efficiency, OpenJarvis aims to provide a foundation for privacy-preserving personal AI assistants.

Downloads: 138 This Week

Last Update: 2026-03-16
See Project
6

AsmJit

Low-latency machine code generation

...It is commonly used in applications such as emulators, compilers, and high-performance computing systems where runtime optimization is essential. asmjit emphasizes low latency and efficiency, ensuring that generated code executes quickly without significant overhead. Its modular design allows developers to integrate it into various systems with minimal friction. Overall, asmjit bridges the gap between high-level programming and low-level execution by enabling efficient runtime code generation.

Downloads: 5 This Week

Last Update: 2026-04-06
See Project
7

optillm

Optimizing inference proxy for LLMs

...It serves as an OpenAI API-compatible proxy, allowing for seamless integration into existing workflows while optimizing inference processes. OptiLLM aims to reduce latency and resource consumption during LLM inference.

Downloads: 7 This Week

Last Update: 2026-03-19
See Project
8

Portkey AI Gateway

A blazing fast AI Gateway with integrated guardrails

...It’s lightweight and optimized for low latency with a small footprint.

Downloads: 7 This Week

Last Update: 2026-01-12
See Project
9

gemma.cpp

lightweight, standalone C++ inference engine for Google's Gemma models

Gemma.cpp is a C++ implementation for running inference with Gemma models efficiently on CPUs and GPUs. Developed by Google, it allows running large language models (LLMs) like Gemma with minimal hardware, focusing on optimized performance and low latency. Gemma.cpp is intended for developers seeking to deploy LLMs in production environments without needing massive computational resources.

Downloads: 6 This Week

Last Update: 2025-03-25
See Project
SoftCo: Enterprise Invoice and P2P Automation Software
For companies that process over 20,000 invoices per year

SoftCo Accounts Payable Automation processes all PO and non-PO supplier invoices electronically from capture and matching through to invoice approval and query management. SoftCoAP delivers unparalleled touchless automation by embedding AI across matching, coding, routing, and exception handling to minimize the number of supplier invoices requiring manual intervention. The result is 89% processing savings, supported by a context-aware AI Assistant that helps users understand exceptions, answer questions, and take the right action faster.

Learn More
10

Faster Whisper

Faster Whisper transcription with CTranslate2

Faster Whisper is an optimized implementation of the Whisper speech recognition model designed to deliver significantly faster inference while maintaining comparable accuracy. It leverages efficient inference engines and optimized computation strategies to reduce latency and resource consumption. The system is particularly useful for real-time or large-scale transcription tasks where performance is critical. It supports multiple model sizes, allowing users to balance speed and accuracy based on their needs. The architecture is designed to run efficiently on both CPUs and GPUs, making it accessible across different environments. ...

Downloads: 15 This Week

Last Update: 2026-04-06
See Project
11

Open Vision Agents by Stream

Build Vision Agents quickly with any model or video provider

...It focuses on combining video understanding models, such as YOLO and Roboflow based detectors, with real time large language models like OpenAI Realtime and Gemini Live to create interactive experiences. The framework uses Stream’s ultra low latency edge network so agents can join sessions quickly and maintain very low audio and video latency while processing frames and generating responses. Developers work with an agent abstraction that connects video edge providers, LLMs, and processors into pipelines, making it easier to orchestrate tasks like object detection, pose estimation, and conversational guidance. ...

Downloads: 6 This Week

Last Update: 2 days ago
See Project
12

Edgee

AI gateway with token compression for Claude Code, Codex, and more

Edgee is an edge-native execution platform designed to run AI-driven logic and data processing directly at the network edge, reducing latency and improving responsiveness for modern applications. It enables developers to deploy functions and workflows closer to users, allowing real-time processing without relying heavily on centralized cloud infrastructure. The platform is built to support event-driven architectures, where actions are triggered by incoming requests, user behavior, or external signals. ...

Downloads: 16 This Week

Last Update: 2 days ago
See Project
13

Cloudflare Agents

Build and deploy AI Agents on Cloudflare

Cloudflare Agents is an open-source framework designed to help developers build, deploy, and manage AI agents that run at the network edge. It provides infrastructure for creating stateful, event-driven agents capable of real-time interaction while maintaining low latency through Cloudflare’s distributed platform. The project includes SDKs, templates, and deployment tooling that simplify the process of connecting agents to external APIs, storage systems, and workflows. Its architecture emphasizes persistent memory, enabling agents to maintain context across sessions and interactions. Developers can orchestrate complex behaviors using workflows and durable objects, making it suitable for production-grade autonomous systems. ...

Downloads: 12 This Week

Last Update: 20 hours ago
See Project
14

ScaleLLM

A high-performance inference system for large language models

ScaleLLM is a high-performance inference system tailored for Large Language Models (LLMs), specifically designed for production environments. It focuses on optimizing inference processes to handle large-scale deployments efficiently, ensuring low latency and high throughput. ScaleLLM supports various LLM architectures and integrates with existing infrastructures, providing a scalable solution for deploying LLMs in real-world applications.

Downloads: 0 This Week

Last Update: 2025-09-13
See Project
15

Orpheus TTS

Towards Human-Sounding Speech

...The project ships both pretrained and finetuned English models, as well as a family of multilingual models released as a research preview, and includes data-processing scripts so users can train or finetune their own variants. Inference is provided through a Python package that uses vLLM under the hood for high-throughput, low-latency generation, including streaming examples that show how to generate audio chunks in real time. The maintainers provide Colab notebooks, a standardized prompting format, and one-click deployment via Baseten for production-grade, FP8/FP16 optimized inference with ~200 ms streaming latency.

Downloads: 3 This Week

Last Update: 2025-12-05
See Project
16

HelixDB

Graph-vector database for building unified AI backends fast

...It combines graph and vector data models, allowing developers to manage relationships and embeddings within the same system without relying on separate services. HelixDB is built from scratch in Rust and uses LMDB as its storage engine, enabling high performance and low-latency query execution. HelixDB also supports additional data formats such as key-value, document, and relational data, making it flexible for a wide range of backend architectures. A central feature of the project is its custom query language, HelixQL, which is fully type-safe and compiled to ensure reliability and correctness in production environments. ...

Downloads: 16 This Week

Last Update: 2026-03-31
See Project
17

Kimi K2.5

Moonshot's most powerful AI model

...Based on a 1T-parameter Mixture-of-Experts (MoE) architecture with 32B activated parameters, it integrates advanced language reasoning with strong visual understanding. K2.5 supports both “Thinking” and “Instant” modes, enabling either deep step-by-step reasoning or low-latency responses depending on the task. Designed for agentic workflows, it features an Agent Swarm mechanism that decomposes complex problems into coordinated sub-agents executing in parallel. With a 256K context length and MoonViT vision encoder, the model excels across reasoning, coding, long-context comprehension, image, and video benchmarks. ...

Downloads: 48 This Week

Last Update: 3 days ago
See Project
18

RTP-LLM

Alibaba's high-performance LLM inference engine for diverse apps

RTP-LLM is an open-source large language model inference acceleration engine developed by Alibaba to provide high-performance serving infrastructure for modern LLM deployments. The system focuses on improving throughput, latency, and resource utilization when running large models in production environments. It achieves this by implementing optimized GPU kernels, batching strategies, and memory management techniques tailored for transformer inference workloads. The framework is designed for large-scale AI services and is already used internally across several Alibaba platforms such as Taobao, Amap, and other business systems that rely on conversational or search-related AI services. ...

Downloads: 12 This Week

Last Update: 2026-03-09
See Project
19

claude-token-efficient

One CLAUDE.md file. Keeps Claude responses terse

...The approach focuses on removing redundant explanations, overly polite language, and repeated context, which are common contributors to excessive token consumption in AI-generated responses. It is particularly useful in high-volume or agent-based workflows where output tokens accumulate quickly and significantly impact cost and latency. The system works best in iterative or automation-heavy environments where consistent formatting and brevity are more valuable than exploratory reasoning.

Downloads: 3 This Week

Last Update: 2026-04-07
See Project
20

hls4ml

Machine learning on FPGAs using HLS

hls4ml is an open-source framework that enables machine learning models to be implemented directly on hardware such as FPGAs and ASICs using high-level synthesis techniques. The system converts trained neural network models from common machine learning frameworks into hardware description code suitable for ultra-low-latency inference. This approach allows machine learning algorithms to run directly on specialized hardware, making them suitable for applications that require extremely fast response times and minimal power consumption. The framework was originally developed for high-energy physics experiments where real-time decision systems must process large volumes of data with strict latency constraints. ...

Downloads: 0 This Week

Last Update: 2026-03-20
See Project
21

Seamless Communication

Foundational Models for State-of-the-Art Speech and Text Translation

...The research prototype includes components for visual grounding (understanding when a user references something in view), gesture recognition and synthesis, and turn-taking mechanisms that mirror human conversational timing. Because latency and synchronization are critical, the codebase invests in asynchronous scheduling, overlap of perception and reasoning, and fast fallback responses.

Downloads: 0 This Week

Last Update: 2025-10-06
See Project
22

Parallax

Parallax is a distributed model serving framework

...A two-stage scheduling architecture determines how model layers are allocated to available hardware and how requests are routed across nodes during execution. This scheduling system optimizes latency, throughput, and hardware utilization even when nodes have different computational capabilities. The platform also supports model sharding and pipeline parallelism, allowing very large models to run across distributed resources.

Downloads: 4 This Week

Last Update: 2026-03-09
See Project
23

DeepSpeed

Deep learning optimization library: makes distributed training easy

...Achieve excellent system throughput and efficiently scale to thousands of GPUs 3. Train/Inference on resource constrained GPU systems 4. Achieve unprecedented low latency and high throughput for inference 5. Achieve extreme compression for an unparalleled inference latency and model size reduction with low costs DeepSpeed offers a confluence of system innovations, that has made large scale DL training effective, and efficient, greatly improved ease of use, and redefined the DL training landscape in terms of scale that is possible. ...

Downloads: 5 This Week

Last Update: 2026-03-30
See Project
24

MiniCPM4

Ultra-Efficient LLMs on End Device

...Unlike traditional large-scale models that require extensive computational resources, MiniCPM4 focuses on delivering competitive reasoning and language capabilities while maintaining significantly lower latency and higher efficiency. It achieves this through optimized architectures, scalable training strategies, and techniques such as long-context pretraining and YaRN-based length extension, allowing it to handle sequences up to 128K tokens effectively. The model demonstrates strong performance across tasks such as long-text comprehension, reasoning, and general language generation, often outperforming similar-sized models in both speed and accuracy. ...

Downloads: 6 This Week

Last Update: 3 days ago
See Project
25

Ultravox

Fast multimodal LLM for real-time voice interaction and AI apps

...Internally, it leverages pretrained language models and speech encoders, with a multimodal adapter that integrates both modalities for inference and training. Ultravox is optimized for low latency, achieving fast response times suitable for interactive voice agents and real-time applications. It supports use cases such as conversational AI agents, speech-to-speech translation, and analysis of spoken audio content. Ultravox also includes tooling and configuration systems for training, evaluation, and dataset integration.

Downloads: 6 This Week

Last Update: 2026-03-18
See Project