Showing 43 open source projects for "spark"

  • 1
    Apache Spark

    A unified analytics engine for large-scale data processing

    ...With Spark Streaming (microbatches) and Structured Streaming, it delivers low-latency event processing suitable for real-time analytics. The built-in MLlib library provides scalable machine learning algorithms, while GraphX enables graph computations integrated with data pipelines. Spark supports multiple languages—Scala, Java, Python, R—and connects with many storage systems like HDFS, S3, Cassandra, and streaming platforms like Kafka, making it a versatile choice for big data workloads in analytics, ETL, and data science.
    Downloads: 3 This Week
  • 2
    .NET for Apache Spark

    A free, open-source, and cross-platform big data analytics framework

    .NET for Apache Spark provides high-performance APIs for using Apache Spark from C# and F#. With these .NET APIs, you can access the most popular DataFrame and Spark SQL aspects of Apache Spark for working with structured data, and Spark Structured Streaming for working with streaming data. .NET for Apache Spark is compliant with .NET Standard, a formal specification of .NET APIs that are common across .NET implementations.
    Downloads: 5 This Week
  • 3
    SageMaker Spark Container

    Docker image used to run data processing workloads

    ...The SageMaker Spark Container is a Docker image used to run batch data processing workloads on Amazon SageMaker using the Apache Spark framework. The container images in this repository are used to build the pre-built container images that are used when running Spark jobs on Amazon SageMaker using the SageMaker Python SDK. The pre-built images are available in the Amazon Elastic Container Registry (Amazon ECR), and this repository serves as a reference for those wishing to build their own customized Spark containers for use in Amazon SageMaker.
    Downloads: 3 This Week
  • 4
    Synapse Machine Learning

    Simple and distributed Machine Learning

    ...These tools enable powerful and highly scalable predictive and analytical models for a variety of data sources. SynapseML also brings new networking capabilities to the Spark ecosystem. With the HTTP on Spark project, users can embed any web service into their SparkML models. For production-grade deployment, the Spark Serving project enables high-throughput, sub-millisecond-latency web services, backed by your Spark cluster.
    Downloads: 0 This Week
  • 5
    Deequ

    Deequ is a library built on top of Apache Spark

    Deequ is a library built atop Apache Spark that enables defining “unit tests for data” — that is, formal constraints or checks on datasets to ensure data quality along dimensions such as completeness, uniqueness, value ranges, correlations, etc. It can scale to large datasets (billions of rows) by translating those data checks into Spark jobs. Deequ supports advanced features like a metrics repository for storing computed statistics over time, anomaly detection of data quality metrics, and the suggestion of likely constraints automatically for new datasets. ...
    Downloads: 9 This Week
  • 6
    Alire

    Command-line tool from the Alire project and supporting library

    Alire is a source-based package manager for the Ada and SPARK programming languages. It facilitates the building and sharing of projects within the Ada community, allowing developers to easily manage dependencies and publish their own libraries or programs. Alire aims to streamline the development process for Ada and SPARK by providing a standardized approach to package management.
    Downloads: 10 This Week
  • 7
    SageMaker Spark

    A Spark library for Amazon SageMaker

    SageMaker Spark depends on hadoop-aws-2.8.1. To run Spark applications that depend on SageMaker Spark, you need to build Spark with Hadoop 2.8. However, if you are running Spark applications on EMR, you can use Spark built with Hadoop 2.7.
    Downloads: 0 This Week
  • 8
    Apache Sedona

    Cluster computing framework for processing large-scale geospatial data

    Apache Sedona™ is a cluster computing system for processing large-scale spatial data. Sedona extends existing cluster computing systems, such as Apache Spark and Apache Flink, with a set of out-of-the-box distributed Spatial Datasets and Spatial SQL that efficiently load, process, and analyze large-scale spatial data across machines. According to our benchmark and third-party research papers, Sedona runs 2X - 10X faster than other Spark-based geospatial data systems on computation-intensive query workloads. ...
    Downloads: 1 This Week
  • 9
    XGBoost

    Scalable and Flexible Gradient Boosting

    ...It also offers parallel tree boosting (GBDT, GBRT, or GBM) that can quickly and accurately solve many data science problems. XGBoost can be used from Python, Java, Scala, R, C++, and more. It can run on a single machine, Hadoop, Spark, Dask, Flink, and most other distributed environments, and is capable of solving problems beyond billions of examples.
    Downloads: 14 This Week
  • 10
    Volcano

    A Cloud Native Batch System (Project under CNCF)

    ...It provides a suite of mechanisms that are commonly required by many classes of batch and elastic workloads, including machine learning/deep learning, bioinformatics/genomics, and other "big data" applications. These types of applications typically run on generalized domain frameworks like TensorFlow, Spark, Ray, PyTorch, MPI, etc., which Volcano integrates with. Volcano builds upon a decade and a half of experience running a wide variety of high-performance workloads at scale using several systems and platforms, combined with best-of-breed ideas and practices from the open-source community. As of June 2021, Volcano had been widely used around the world in a variety of industries such as Internet, cloud, finance, manufacturing, and medical. ...
    Downloads: 174 This Week
  • 11
    Laravel Lang

    List of 126 languages for Laravel Framework, Laravel Jetstream, etc.

    List of 126 languages for Laravel Framework, Laravel Jetstream, Laravel Fortify, Laravel Breeze, Laravel Cashier, Laravel Nova, Laravel Spark and Laravel UI. It is recommended to use this particular package as it will allow you to very quickly update all the necessary dependencies that ensure application localization.
    Downloads: 5 This Week
  • 12
    Soot

    Soot - A Java optimization framework

    Soot is a Java optimization framework. It provides four intermediate representations for analyzing and transforming Java bytecode. Baf: a streamlined representation of bytecode which is simple to manipulate. Jimple: a typed 3-address intermediate representation suitable for optimization. Shimple: an SSA variation of Jimple. Grimp: an aggregated version of Jimple suitable for decompilation and code inspection.
    Downloads: 7 This Week
  • 13
    Serverless Java container

    A Java wrapper to run Spring, Spring Boot, Jersey, and other apps

    The AWS Serverless Java Container library is a framework that allows developers to run existing or new Java web applications—built with frameworks such as Spring, Jersey, Spark, Struts—inside AWS Lambda with minimal modifications. It bridges the gap between traditional servlet or web-framework models and serverless functions by mapping HTTP events from API Gateway into requests your framework understands and routing responses back appropriately. This means you can keep much of your familiar Java-based architecture (controllers, filters, dependency injection) and deploy it in a serverless environment without rewriting everything from scratch. ...
    Downloads: 0 This Week
  • 14
    Apache Beam

    Unified programming model for Batch and Streaming

    Apache Beam is an open source, unified programming model to define both batch and streaming data-parallel processing pipelines, as well as certain language-specific SDKs for constructing pipelines and Runners. These pipelines are executed on one of Beam’s supported distributed processing back-ends, which include Apache Apex, Apache Flink, Apache Spark, and Google Cloud Dataflow. Beam is especially useful for embarrassingly parallel data processing tasks, and caters to the different needs and backgrounds of end users, SDK writers, and runner writers.
    Downloads: 1 This Week
  • 15
    Numba

    NumPy aware dynamic Python compiler using LLVM

    ...Special decorators can create universal functions that broadcast over NumPy arrays just like NumPy functions do. Numba also works great with Jupyter notebooks for interactive computing, and with distributed execution frameworks, like Dask and Spark.
    Downloads: 11 This Week
  • 16
    Apache Bigtop

    Bigtop is an Apache Foundation project for Infrastructure Engineers

    Apache Bigtop is a project focused on building and packaging the Hadoop ecosystem and related big data components. It provides a consistent framework for testing, packaging, and deploying Hadoop distributions, including tools like HDFS, YARN, Spark, Hive, HBase, and more. By maintaining cross-platform builds (RPMs, DEBs, Docker images, and Kubernetes support), Bigtop makes it easier for organizations to deploy big data stacks in different environments. It also includes a set of integration tests and smoke tests to ensure compatibility and stability between ecosystem components. ...
    Downloads: 9 This Week
  • 17
    go-chart

    go-chart is a basic charting library in Go

    Package chart is a very simple Go-native charting library that supports time-series and continuous line charts. Master should now be on the v3.x codebase, which overhauls the API significantly. As usual, see the examples for more information. Actual chart configurations and examples can be found in the ./examples/ directory. They are simple CLI programs that write to output.png (they are also updated with go generate). Everything on the chart.Chart object has defaults that can be overridden....
    Downloads: 2 This Week
  • 18
    Kedro

    A Python framework for creating reproducible, maintainable code

    Kedro is an open-source Python framework for creating maintainable and modular data science code. It provides the scaffolding to build more complex data and machine-learning pipelines. It also focuses on reducing the time spent on the tedious "plumbing" required to maintain data science code, which leaves more time to solve new problems. It standardises team workflows; the modular structure of Kedro facilitates a higher level of collaboration when teams solve problems...
    Downloads: 9 This Week
  • 19
    BentoML

    Unified Model Serving Framework

    BentoML simplifies ML model deployment and serves your models at production scale. It natively supports multiple ML frameworks: TensorFlow, PyTorch, XGBoost, Scikit-Learn, and many more. Define custom serving pipelines with pre-processing, post-processing, and ensemble models. Use the standard .bento format to package code, models, and dependencies for easy versioning and deployment. Integrate with any training pipeline or ML experimentation platform. Parallelize compute-intensive model inference...
    Downloads: 1 This Week
  • 20
    SQL Formatter

    A whitespace formatter for different query languages

    ...It supports various SQL dialects: GCP BigQuery, IBM DB2, Apache Hive, MariaDB, MySQL, Couchbase N1QL, Oracle PL/SQL, PostgreSQL, Amazon Redshift, SingleStoreDB, Snowflake, Spark, SQL Server Transact-SQL, Trino/Presto. See language option docs for more details. The CLI tool will be installed under sql-formatter and may be invoked via npx sql-formatter. If you don't use a module bundler, clone the repository, run npm install and grab a file from /dist directory to use inside a script tag. This makes SQL Formatter available as a global variable window.sqlFormatter.
    Downloads: 0 This Week
  • 21
    Smallpond

    A lightweight data processing framework built on DuckDB and 3FS

    ...The idea is to preserve DuckDB’s fast analytics engine but lift it from single-node to multi-node settings, giving you the ability to operate on large datasets (e.g. petabyte scale) without moving to a heavyweight system like Spark. Users write Python-like code (via DataFrame APIs or SQL strings) to express their transformations; behind the scenes, tasks are scheduled (often via Ray) and pushed into DuckDB instances operating on partitioned data. Because the storage layer (3FS) is optimized for random access and high throughput, smallpond can shuffle data, repartition, and manage intermediate results across nodes.
    Downloads: 0 This Week
  • 22
    Horovod

    Distributed training framework for TensorFlow, Keras, PyTorch, etc.

    ...With Horovod, an existing training script can be scaled up to run on hundreds of GPUs in just a few lines of Python code. Horovod can be installed on-premises or run out-of-the-box in cloud platforms, including AWS, Azure, and Databricks. Horovod can additionally run on top of Apache Spark, making it possible to unify data processing and model training into a single pipeline. Once Horovod has been configured, the same infrastructure can be used to train models with any framework, making it easy to switch between TensorFlow, PyTorch, MXNet, and future frameworks as machine learning tech stacks continue to evolve. Start scaling your model training with just a few lines of Python code. ...
    Downloads: 6 This Week
  • 23
    FiloDB

    Distributed Prometheus time series database

    FiloDB is an open-source distributed, real-time, in-memory, massively scalable, multi-schema time series / event / operational database with Prometheus query support and some Spark support as well. The normal configuration for real-time ingestion is deployment as stand-alone processes in a cluster, ingesting directly from Apache Kafka. The processes form a cluster using peer-to-peer Akka Cluster technology. Designed to ingest many millions of entities, sharded across multiple processes, with distributed querying built in. ...
    Downloads: 5 This Week
  • 24
    WTFJS

    A list of funny and tricky JavaScript examples

    ...It’s designed as both a fun read and a serious learning aid, helping developers build an intuition for how JavaScript evaluates expressions. By highlighting common misconceptions, it encourages safer coding patterns and more reliable mental models. Teachers, interviewers, and learners use it to spark discussion and deepen understanding of JavaScript’s semantics.
    Downloads: 8 This Week
  • 25
    osm4scala

    Reading OpenStreetMap PBF files.

    Scala and polyglot Spark library (Scala, PySpark, Spark SQL, ...) focused on reading OpenStreetMap PBF files.
    Downloads: 15 This Week