Showing 154 open source projects for "latency"

View related business solutions
  • Data management solutions for confident marketing Icon
    Data management solutions for confident marketing

    For companies wanting a complete Data Management solution that is native to Salesforce

    Verify, deduplicate, manipulate, and assign records automatically to keep your CRM data accurate, complete, and ready for business.
    Learn More
  • The AI workplace management platform Icon
    The AI workplace management platform

    Plan smart spaces, connect teams, manage assets, and get insights with the leading AI-powered operating system for the built world.

    By combining AI workflows, predictive intelligence, and automated insights, OfficeSpace gives leaders a complete view of how their spaces are used and how people work. Facilities, IT, HR, and Real Estate teams use OfficeSpace to optimize space utilization, enhance employee experience, and reduce portfolio costs with precision.
    Learn More
  • 1
    DeepSpeed MII

    DeepSpeed MII

    MII makes low-latency and high-throughput inference possible

    ...While open-sourcing has democratized access to AI capabilities, their application is still restricted by two critical factors: inference latency and cost. DeepSpeed-MII is a new open-source python library from DeepSpeed, aimed towards making low-latency, low-cost inference of powerful models not only feasible but also easily accessible. MII offers access to the highly optimized implementation of thousands of widely used DL models. MII-supported models achieve significantly lower latency and cost compared to their original implementation.
    Downloads: 9 This Week
    Last Update:
    See Project
  • 2
    RealtimeSTT

    RealtimeSTT

    A robust, efficient, low-latency speech-to-text library

    RealtimeSTT is a Python-based realtime speech-to-text engine emphasizing low latency, wake-word detection, voice activity detection, and automatic speech segmentation. It provides asynchronous callbacks, nanosecond-precision timestamps, and CLI tools, suitable for building voice assistants, meeting transcribers, or live caption systems.
    Downloads: 6 This Week
    Last Update:
    See Project
  • 3
    FlashInfer

    FlashInfer

    FlashInfer: Kernel Library for LLM Serving

    FlashInfer is a kernel library designed to enhance the serving of Large Language Models (LLMs) by optimizing inference performance. It provides a high-performance framework that integrates seamlessly with existing systems, aiming to reduce latency and improve efficiency in LLM deployments. FlashInfer supports various hardware architectures and is built to scale with the demands of production environments.
    Downloads: 36 This Week
    Last Update:
    See Project
  • 4
    LiteRT

    LiteRT

    LiteRT is the new name for TensorFlow Lite (TFLite)

    LiteRT is an experimental, real-time inference runtime built by Google AI Edge to run lightweight ML models on edge devices with ultra-low latency. It focuses on delivering predictable and consistent performance for models used in time-critical applications like robotics, AR/VR, and IoT. LiteRT is designed to be hardware-agnostic, with minimal dependencies and tight control over execution scheduling.
    Downloads: 21 This Week
    Last Update:
    See Project
  • The AI-powered unified PSA-RMM platform for modern MSPs. Icon
    The AI-powered unified PSA-RMM platform for modern MSPs.

    Trusted PSA-RMM partner of MSPs worldwide

    SuperOps.ai is the only PSA-RMM platform powered by intelligent automation and thoughtfully crafted for the new-age MSP. The platform also helps MSPs manage their projects, clients, and IT documents from a single place.
    Learn More
  • 5
    AsmJit

    AsmJit

    Low-latency machine code generation

    ...It is commonly used in applications such as emulators, compilers, and high-performance computing systems where runtime optimization is essential. asmjit emphasizes low latency and efficiency, ensuring that generated code executes quickly without significant overhead. Its modular design allows developers to integrate it into various systems with minimal friction. Overall, asmjit bridges the gap between high-level programming and low-level execution by enabling efficient runtime code generation.
    Downloads: 5 This Week
    Last Update:
    See Project
  • 6
    OpenJarvis

    OpenJarvis

    Personal AI, On Personal Devices

    ...The framework provides shared primitives for building local-first agents, along with evaluation tools that measure performance using metrics such as energy consumption, latency, cost, and accuracy. OpenJarvis integrates with local inference engines like Ollama, vLLM, SGLang, and llama.cpp to run language models directly on personal hardware. It also includes a learning loop that allows models to improve over time using locally generated interaction traces. By prioritizing local execution and efficiency, OpenJarvis aims to provide a foundation for privacy-preserving personal AI assistants.
    Downloads: 154 This Week
    Last Update:
    See Project
  • 7
    Portkey AI Gateway

    Portkey AI Gateway

    A blazing fast AI Gateway with integrated guardrails

    ...It’s lightweight and optimized for low latency with a small footprint.
    Downloads: 6 This Week
    Last Update:
    See Project
  • 8
    optillm

    optillm

    Optimizing inference proxy for LLMs

    ...It serves as an OpenAI API-compatible proxy, allowing for seamless integration into existing workflows while optimizing inference processes. OptiLLM aims to reduce latency and resource consumption during LLM inference.
    Downloads: 5 This Week
    Last Update:
    See Project
  • 9
    Faster Whisper

    Faster Whisper

    Faster Whisper transcription with CTranslate2

    Faster Whisper is an optimized implementation of the Whisper speech recognition model designed to deliver significantly faster inference while maintaining comparable accuracy. It leverages efficient inference engines and optimized computation strategies to reduce latency and resource consumption. The system is particularly useful for real-time or large-scale transcription tasks where performance is critical. It supports multiple model sizes, allowing users to balance speed and accuracy based on their needs. The architecture is designed to run efficiently on both CPUs and GPUs, making it accessible across different environments. ...
    Downloads: 15 This Week
    Last Update:
    See Project
  • Get full visibility and control over your tasks and projects with Wrike. Icon
    Get full visibility and control over your tasks and projects with Wrike.

    A cloud-based collaboration, work management, and project management software

    Wrike offers world-class features that empower cross-functional, distributed, or growing teams take their projects from the initial request stage all the way to tracking work progress and reporting results.
    Learn More
  • 10
    gemma.cpp

    gemma.cpp

    lightweight, standalone C++ inference engine for Google's Gemma models

    Gemma.cpp is a C++ implementation for running inference with Gemma models efficiently on CPUs and GPUs. Developed by Google, it allows running large language models (LLMs) like Gemma with minimal hardware, focusing on optimized performance and low latency. Gemma.cpp is intended for developers seeking to deploy LLMs in production environments without needing massive computational resources.
    Downloads: 4 This Week
    Last Update:
    See Project
  • 11
    Open Vision Agents by Stream

    Open Vision Agents by Stream

    Build Vision Agents quickly with any model or video provider

    ...It focuses on combining video understanding models, such as YOLO and Roboflow based detectors, with real time large language models like OpenAI Realtime and Gemini Live to create interactive experiences. The framework uses Stream’s ultra low latency edge network so agents can join sessions quickly and maintain very low audio and video latency while processing frames and generating responses. Developers work with an agent abstraction that connects video edge providers, LLMs, and processors into pipelines, making it easier to orchestrate tasks like object detection, pose estimation, and conversational guidance. ...
    Downloads: 6 This Week
    Last Update:
    See Project
  • 12
    Edgee

    Edgee

    AI gateway with token compression for Claude Code, Codex, and more

    Edgee is an edge-native execution platform designed to run AI-driven logic and data processing directly at the network edge, reducing latency and improving responsiveness for modern applications. It enables developers to deploy functions and workflows closer to users, allowing real-time processing without relying heavily on centralized cloud infrastructure. The platform is built to support event-driven architectures, where actions are triggered by incoming requests, user behavior, or external signals. ...
    Downloads: 15 This Week
    Last Update:
    See Project
  • 13
    ScaleLLM

    ScaleLLM

    A high-performance inference system for large language models

    ScaleLLM is a high-performance inference system tailored for Large Language Models (LLMs), specifically designed for production environments. It focuses on optimizing inference processes to handle large-scale deployments efficiently, ensuring low latency and high throughput. ScaleLLM supports various LLM architectures and integrates with existing infrastructures, providing a scalable solution for deploying LLMs in real-world applications.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 14
    Cloudflare Agents

    Cloudflare Agents

    Build and deploy AI Agents on Cloudflare

    Cloudflare Agents is an open-source framework designed to help developers build, deploy, and manage AI agents that run at the network edge. It provides infrastructure for creating stateful, event-driven agents capable of real-time interaction while maintaining low latency through Cloudflare’s distributed platform. The project includes SDKs, templates, and deployment tooling that simplify the process of connecting agents to external APIs, storage systems, and workflows. Its architecture emphasizes persistent memory, enabling agents to maintain context across sessions and interactions. Developers can orchestrate complex behaviors using workflows and durable objects, making it suitable for production-grade autonomous systems. ...
    Downloads: 12 This Week
    Last Update:
    See Project
  • 15
    Kimi K2.5

    Kimi K2.5

    Moonshot's most powerful AI model

    ...Based on a 1T-parameter Mixture-of-Experts (MoE) architecture with 32B activated parameters, it integrates advanced language reasoning with strong visual understanding. K2.5 supports both “Thinking” and “Instant” modes, enabling either deep step-by-step reasoning or low-latency responses depending on the task. Designed for agentic workflows, it features an Agent Swarm mechanism that decomposes complex problems into coordinated sub-agents executing in parallel. With a 256K context length and MoonViT vision encoder, the model excels across reasoning, coding, long-context comprehension, image, and video benchmarks. ...
    Downloads: 52 This Week
    Last Update:
    See Project
  • 16
    HelixDB

    HelixDB

    Graph-vector database for building unified AI backends fast

    ...It combines graph and vector data models, allowing developers to manage relationships and embeddings within the same system without relying on separate services. HelixDB is built from scratch in Rust and uses LMDB as its storage engine, enabling high performance and low-latency query execution. HelixDB also supports additional data formats such as key-value, document, and relational data, making it flexible for a wide range of backend architectures. A central feature of the project is its custom query language, HelixQL, which is fully type-safe and compiled to ensure reliability and correctness in production environments. ...
    Downloads: 14 This Week
    Last Update:
    See Project
  • 17
    Orpheus TTS

    Orpheus TTS

    Towards Human-Sounding Speech

    ...The project ships both pretrained and finetuned English models, as well as a family of multilingual models released as a research preview, and includes data-processing scripts so users can train or finetune their own variants. Inference is provided through a Python package that uses vLLM under the hood for high-throughput, low-latency generation, including streaming examples that show how to generate audio chunks in real time. The maintainers provide Colab notebooks, a standardized prompting format, and one-click deployment via Baseten for production-grade, FP8/FP16 optimized inference with ~200 ms streaming latency.
    Downloads: 2 This Week
    Last Update:
    See Project
  • 18
    hls4ml

    hls4ml

    Machine learning on FPGAs using HLS

    hls4ml is an open-source framework that enables machine learning models to be implemented directly on hardware such as FPGAs and ASICs using high-level synthesis techniques. The system converts trained neural network models from common machine learning frameworks into hardware description code suitable for ultra-low-latency inference. This approach allows machine learning algorithms to run directly on specialized hardware, making them suitable for applications that require extremely fast response times and minimal power consumption. The framework was originally developed for high-energy physics experiments where real-time decision systems must process large volumes of data with strict latency constraints. ...
    Downloads: 0 This Week
    Last Update:
    See Project
  • 19
    Seamless Communication

    Seamless Communication

    Foundational Models for State-of-the-Art Speech and Text Translation

    ...The research prototype includes components for visual grounding (understanding when a user references something in view), gesture recognition and synthesis, and turn-taking mechanisms that mirror human conversational timing. Because latency and synchronization are critical, the codebase invests in asynchronous scheduling, overlap of perception and reasoning, and fast fallback responses.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 20
    RTP-LLM

    RTP-LLM

    Alibaba's high-performance LLM inference engine for diverse apps

    RTP-LLM is an open-source large language model inference acceleration engine developed by Alibaba to provide high-performance serving infrastructure for modern LLM deployments. The system focuses on improving throughput, latency, and resource utilization when running large models in production environments. It achieves this by implementing optimized GPU kernels, batching strategies, and memory management techniques tailored for transformer inference workloads. The framework is designed for large-scale AI services and is already used internally across several Alibaba platforms such as Taobao, Amap, and other business systems that rely on conversational or search-related AI services. ...
    Downloads: 10 This Week
    Last Update:
    See Project
  • 21
    Parallax

    Parallax

    Parallax is a distributed model serving framework

    ...A two-stage scheduling architecture determines how model layers are allocated to available hardware and how requests are routed across nodes during execution. This scheduling system optimizes latency, throughput, and hardware utilization even when nodes have different computational capabilities. The platform also supports model sharding and pipeline parallelism, allowing very large models to run across distributed resources.
    Downloads: 4 This Week
    Last Update:
    See Project
  • 22
    DeepSpeed

    DeepSpeed

    Deep learning optimization library: makes distributed training easy

    ...Achieve excellent system throughput and efficiently scale to thousands of GPUs 3. Train/Inference on resource constrained GPU systems 4. Achieve unprecedented low latency and high throughput for inference 5. Achieve extreme compression for an unparalleled inference latency and model size reduction with low costs DeepSpeed offers a confluence of system innovations, that has made large scale DL training effective, and efficient, greatly improved ease of use, and redefined the DL training landscape in terms of scale that is possible. ...
    Downloads: 5 This Week
    Last Update:
    See Project
  • 23
    MiniCPM4

    MiniCPM4

    Ultra-Efficient LLMs on End Device

    ...Unlike traditional large-scale models that require extensive computational resources, MiniCPM4 focuses on delivering competitive reasoning and language capabilities while maintaining significantly lower latency and higher efficiency. It achieves this through optimized architectures, scalable training strategies, and techniques such as long-context pretraining and YaRN-based length extension, allowing it to handle sequences up to 128K tokens effectively. The model demonstrates strong performance across tasks such as long-text comprehension, reasoning, and general language generation, often outperforming similar-sized models in both speed and accuracy. ...
    Downloads: 6 This Week
    Last Update:
    See Project
  • 24
    TensorRT

    TensorRT

    C++ library for high performance inference on NVIDIA GPUs

    NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference. It includes a deep learning inference optimizer and runtime that delivers low latency and high throughput for deep learning inference applications. TensorRT-based applications perform up to 40X faster than CPU-only platforms during inference. With TensorRT, you can optimize neural network models trained in all major frameworks, calibrate for lower precision with high accuracy, and deploy to hyperscale data centers, embedded, or automotive product platforms. ...
    Downloads: 22 This Week
    Last Update:
    See Project
  • 25
    Infinity

    Infinity

    Low-latency REST API for serving text-embeddings

    Infinity is a high-throughput, low-latency REST API for serving vector embeddings, supporting all sentence-transformer models and frameworks. Infinity is developed under MIT License. Infinity powers inference behind Gradient.ai and other Embedding API providers.
    Downloads: 5 This Week
    Last Update:
    See Project
  • Previous
  • You're on page 1
  • 2
  • 3
  • 4
  • 5
  • Next
MongoDB Logo MongoDB