Showing 148 open source projects for "crawl"

  • 1
    X-Crawl

    Flexible Node.js AI-assisted crawler library

    A high-performance web crawling and scraping framework for Node.js, designed for large-scale data extraction.
    Downloads: 9 This Week
  • 2
    Dungeon Crawl: Stone Soup

    A game of dungeon exploration, combat and magic

    Crawl also sports a number of systems and mechanics that diverge from more traditional roguelikes. For example, there's less emphasis on character classes (or backgrounds in Crawl parlance); your character is defined more by their skills, species and choice of deity than their background. Crawl includes an in-game tutorial and manual in the name of providing something playable without the aid of a guide or wiki.
    Downloads: 24 This Week
  • 3
    Firecrawl

    Turn entire websites into LLM-ready markdown or structured data

    Crawl and convert any website into LLM-ready markdown or structured data. Built by Mendable.ai and the Firecrawl community. Includes powerful scraping, crawling, and data extraction capabilities. Firecrawl is an API service that takes a URL, crawls it, and converts it into clean markdown or structured data. We crawl all accessible subpages and give you clean data for each.
    Downloads: 22 This Week
  • 4
    Spatie Crawler

    An easy to use, powerful crawler implemented in PHP

    Spatie Crawler is a PHP library that allows developers to crawl websites and extract information efficiently. It can be used for web scraping, link checking, or automated testing of web pages. The library is simple to use and supports customizable crawling strategies, including controlling crawl depth and handling redirects. It’s suitable for building crawlers that navigate large or dynamically generated websites.
    Downloads: 6 This Week
  • 5
    Spider

    High-performance Rust web crawler and scraper for large-scale data

    ...Spider also provides mechanisms for subscribing to crawl events so developers can process page data such as URLs, status codes, or HTML content as it is discovered. It supports advanced capabilities such as headless browser rendering, background crawling tasks, and configurable rules that control crawl depth or ignored paths. These capabilities make the project suitable for building search indexers, data extraction pipelines, and SEO analysis tools.
    Downloads: 12 This Week
  • 6
    Nebula libp2p DHT

    A libp2p DHT crawler, monitor, and measurement tool

    ...The crawler can store its results as JSON documents or in a Postgres database; the --dry-run flag prevents it from doing either, and Nebula prints a summary of the crawl at the end instead. A crawl takes ~5-10 min depending on your internet connection. You can also specify the network you want to crawl by appending, e.g., --network FILECOIN, and limit the number of peers to crawl by providing the --limit flag.
    Downloads: 4 This Week
  • 7
    Heritrix

    Internet Archive's open-source, web-scale, web crawler project

    ...Since our crawler seeks to collect and preserve the digital artifacts of our culture for the benefit of future researchers and generations, this name seemed apt. Heritrix is designed to respect the robots.txt exclusion directives and META nofollow tags. Please consider the load your crawl will place on seed sites and set politeness policies accordingly. Also, always identify your crawl with contact information in the User-Agent so sites that may be adversely affected by your crawl can contact you or adapt their server behavior accordingly.
    Downloads: 8 This Week
  • 8
    GPT Crawler

    Crawl a site to generate knowledge files to create your own custom GPT

    GPT Crawler is an open-source tool designed to automatically crawl websites and generate structured knowledge that can be used to build AI assistants and retrieval systems. It focuses on extracting high-quality textual content from web pages and preparing it in formats suitable for embedding, indexing, or fine-tuning workflows. The project is especially useful for teams that want to turn documentation sites or knowledge bases into conversational AI backends without building custom scrapers from scratch. ...
    Downloads: 7 This Week
  • 9
    Laravel Sitemap

    Create and generate sitemaps with ease

    ...This works by crawling your entire site. The generator can execute JavaScript on each page, so links injected into the DOM by JavaScript will be crawled as well. The easiest way is to crawl the given domain and generate a sitemap with all found links. The destination of the sitemap should be specified by $path. If you don't want a crawled link to appear in the sitemap, just don't return it in the callable you pass to hasCrawled. You can also instruct the underlying crawler not to crawl some pages by passing a callable to shouldCrawl. ...
    Downloads: 6 This Week
  • 10
    watercrawl

    AI-ready web crawler that extracts and structures website content

    WaterCrawl is an open source web crawling and data extraction platform designed to transform website content into structured data suitable for machine learning and AI workflows. It enables developers and researchers to crawl web pages, extract meaningful information, and convert it into formats that are easier to process and analyze. It provides a modern crawling system that can automatically navigate links, control crawl depth, and collect content from targeted sections of a website. WaterCrawl supports customizable extraction rules so users can focus only on relevant elements while ignoring unnecessary page components. ...
    Downloads: 8 This Week
  • 11
    AI-Crawler

    Crawl a website starting from a URL, find relevant pages

    ...Unlike traditional web scrapers that rely on static selectors and manual scripting, it uses AI to dynamically identify and prioritize pages based on user intent, making it more flexible and resilient to changes in website structure. Users can define their data requirements in plain English, and the system will interpret those instructions to crawl a domain and extract structured data. The tool supports output formats such as JSON and Markdown, and it can generate or accept schemas to ensure that extracted data is structured according to application needs. It is designed as a low-code solution, reducing the complexity of building and maintaining custom scraping pipelines.
    Downloads: 5 This Week
  • 12
    The Web MCP

    A powerful Model Context Protocol (MCP) server

    Bright Data’s Web MCP server gives AI assistants robust, real-time web capabilities through an MCP interface designed to avoid blocks, rate limits, and CAPTCHAs. It presents search, crawl, navigate, and extraction tools that agents can call directly, replacing brittle scraping prompts with typed operations. The README markets it as a “gateway” to the live web so assistants don’t fall back to stale training data. Bright Data also advertises a getting-started tier with a free monthly allotment, plus options for remote or self-hosted operation depending on governance needs. ...
    Downloads: 7 This Week
  • 13
    Search1API MCP

    A Model Context Protocol (MCP) server

    The Search1API MCP Server is a Model Context Protocol server that provides search and crawl functionality using Search1API. It enables web and news searches, content extraction, and sitemap retrieval, integrating seamlessly with MCP clients.
    Downloads: 10 This Week
  • 14
    Douyin TikTok Download API

    Douyin TikTok Download API

    ...You can deploy or modify this project yourself to add more functions, call scraper.py directly in your project, or install the existing pip package as a parsing library to crawl data easily. It supports entering a Douyin or TikTok user homepage to crawl the author's data (homepage video data with watermark-free links, liked video list (the list must be public), video comment data, background music video list data, etc.).
    Downloads: 2 This Week
  • 15
    crawley

    The unix-way web crawler

    ...Found URLs are streamed to stdout and guaranteed to be unique (with fragments omitted). Scan depth (limited by starting host and path; 0 by default) can be configured. Can crawl rules and sitemaps from robots.txt. Brute mode: scan HTML comments for URLs (this can lead to bogus results). Makes use of HTTP_PROXY / HTTPS_PROXY environment values and handles proxy auth. Directory-only scan mode (aka fast-scan).
    Downloads: 20 This Week
  • 16
    katana

    Fast CLI web crawler for discovering endpoints in modern web apps

    Katana is an open source command-line web crawling and spidering framework developed by ProjectDiscovery. It is designed to efficiently crawl websites and web applications in order to discover endpoints, resources, and other useful information that may not be easily visible through manual browsing. Katana focuses on speed and automation, making it suitable for use in security reconnaissance workflows and automated pipelines. Katana supports both standard HTTP crawling and headless browser crawling, allowing it to navigate modern web applications that rely heavily on JavaScript. ...
    Downloads: 44 This Week
  • 17
    StringZilla

    10x faster string search, split, sort, and shuffle for long strings

    ...The contents of that file would remain immutable, and the mapping can be shared by multiple Python processes simultaneously. A standard dataset pre-processing use case would be to map a sizeable textual dataset like Common Crawl into memory, spawn child processes, and split the job between them.
    Downloads: 17 This Week
  • 18
    Scope Sentry

    Cyberspace asset mapping and vulnerability scanning platform

    ...ScopeSentry combines multiple reconnaissance and vulnerability assessment capabilities such as subdomain enumeration, port scanning, directory scanning, and sensitive information detection. ScopeSentry can automatically identify assets and services, extract URLs, and crawl websites to collect useful security data for further analysis. It also includes vulnerability scanning and subdomain takeover detection to help identify common security weaknesses across web infrastructure. It supports distributed scanning with multiple nodes, allowing large scanning tasks to be performed efficiently across different systems.
    Downloads: 9 This Week
  • 19
    Web-Check

    All-in-one OSINT tool for analysing any website

    ...Get an insight into the inner-workings of a given website: uncover potential attack vectors, analyse server architecture, view security configurations, and learn what technologies a site is using. Currently the dashboard will show: IP info, SSL chain, DNS records, cookies, headers, domain info, search crawl rules, page map, server location, redirect ledger, open ports, traceroute, DNS security extensions, site performance, trackers, associated hostnames, carbon footprint. Stay tuned, as I'll add more soon. The aim is to help you easily understand, optimize and secure your website.
    Downloads: 3 This Week
  • 20
    Scrapy

    A fast, high-level web crawling and web scraping framework

    Scrapy is a fast, open source, high-level framework for crawling websites and extracting structured data from these websites. Portable and written in Python, it can run on Windows, Linux, macOS and BSD. Scrapy is powerful, fast and simple, and also easily extensible. Simply write the rules to extract the data, and add new functionality if you wish without having to touch the core. Scrapy does the rest, and can be used in a number of applications. It can be used for data mining, monitoring...
    Downloads: 28 This Week
  • 21
    SiteOne Crawler

    SiteOne Crawler is a website analyzer and exporter

    SiteOne Crawler is a very useful and easy-to-use tool you'll ♥ as a Dev/DevOps, website owner or consultant. Works on all popular platforms - Windows, macOS, and Linux (x64 and arm64 too). It will crawl your entire website in depth, analyze and report problems, show useful statistics and reports, generate an offline version of the website, generate sitemaps, or send reports via email. Watch a detailed video with a sample report for the astro.build website. This crawler can be used as a command-line tool (see releases and video), or you can use a multi-platform desktop application with a graphical interface (see a video about the app).
    Downloads: 25 This Week
  • 22
    crwlr

    Library for Rapid (Web) Crawler and Scraper Development

    This library provides kind of a framework and a lot of ready-to-use, so-called steps, that you can use as building blocks, to build your own crawlers and scrapers with. Before diving into the library, let's have a look at the terms crawling and scraping. For most real-world use cases, those two things go hand in hand, which is why this library helps with and combines both. A (web) crawler is a program that (down)loads documents and follows the links in it to load them as well. A crawler...
    Downloads: 11 This Week
  • 23
    Deep Research

    Use any LLMs (Large Language Models) for Deep Research

    Deep Research is a local-first research agent that orchestrates multiple LLMs to generate in-depth reports in minutes. It combines “thinking” and “task” model roles with live internet access to plan, search, read, and synthesize findings into structured outputs. The project emphasizes privacy: processing and storage happen locally, avoiding server-side retention of your queries and notes. A simple web UI lets you enter topics and configure models, while the backend streams progress as...
    Downloads: 4 This Week
  • 24
    Roach

    The complete web scraping toolkit for PHP

    Roach is a complete web scraping toolkit for PHP. It is a shameless clone heavily inspired by the popular Scrapy package for Python. Roach allows us to define spiders that crawl and scrape web documents. But wait, there’s more. Roach isn’t just a simple crawler, but includes an entire pipeline to clean, persist and otherwise process extracted data as well. It’s your all-in-one resource for web scraping in PHP. Roach doesn’t depend on a specific framework. Instead, you can use the core package on its own or install one of the framework-specific adapters. ...
    Downloads: 6 This Week
  • 25
    Puppeteer

    Headless Chrome Node.js API

    ...However, it can also be set to run full or non-headless Chrome or Chromium; simply set the headless option when launching a browser. Many of the things you can do manually in the browser, you can also do with Puppeteer, such as generating page screenshots and PDFs, crawling a Single-Page Application, testing Chrome extensions, and more.
    Downloads: 6 This Week