Showing 13 open source projects for "crawl"

View related business solutions
  • Get full visibility and control over your tasks and projects with Wrike. Icon
    Get full visibility and control over your tasks and projects with Wrike.

    A cloud-based collaboration, work management, and project management software

    Wrike offers world-class features that empower cross-functional, distributed, or growing teams take their projects from the initial request stage all the way to tracking work progress and reporting results.
    Learn More
  • The Apple Device Management and Security Platform Icon
    The Apple Device Management and Security Platform

    For IT teams at organizations that run on Apple

    Achieve harmony across your Apple device fleet with Kandji's unmatched management and security capabilities.
    Learn More
  • 1
    watercrawl

    watercrawl

    AI-ready web crawler that extracts and structures website content

    WaterCrawl is an open source web crawling and data extraction platform designed to transform website content into structured data suitable for machine learning and AI workflows. It enables developers and researchers to crawl web pages, extract meaningful information, and convert it into formats that are easier to process and analyze. It provides a modern crawling system that can automatically navigate links, control crawl depth, and collect content from targeted sections of a website. WaterCrawl supports customizable extraction rules so users can focus only on relevant elements while ignoring unnecessary page components. ...
    Downloads: 9 This Week
    Last Update:
    See Project
  • 2
    Scrapy

    Scrapy

    A fast, high-level web crawling and web scraping framework

    Scrapy is a fast, open source, high-level framework for crawling websites and extracting structured data from these websites. Portable and written in Python, it can run on Windows, Linux, macOS and BSD. Scrapy is powerful, fast and simple, and also easily extensible. Simply write the rules to extract the data, and add new functionality if you wish without having to touch the core. Scrapy does the rest, and can be used in a number of applications. It can be used for data mining, monitoring...
    Downloads: 25 This Week
    Last Update:
    See Project
  • 3
    tumblr-crawler

    tumblr-crawler

    Python crawler to download photos and videos from Tumblr blogs

    ...It provides a script that automatically retrieves photos and videos from specified Tumblr sites and saves them locally for offline access. Users can specify one or multiple blogs to crawl by editing a configuration file or by passing parameters through the command line. Once executed, the script fetches media from the Tumblr API and stores the downloaded files in folders named after each blog. tumblr-crawler avoids re-downloading files that have already been saved, making repeated runs safe and useful for recovering missing media. ...
    Downloads: 2 This Week
    Last Update:
    See Project
  • 4
    news-please

    news-please

    Python tool for crawling and extracting structured data from news site

    news-please is an open source news crawler and information extraction tool designed to collect and structure articles from online news websites. It provides an integrated pipeline that crawls news sites, retrieves article pages, and extracts structured information such as headlines, authors, publication dates, and article text. news-please can recursively follow internal links and read RSS feeds to gather both recent and archived articles from a news outlet when given only the root URL of a...
    Downloads: 1 This Week
    Last Update:
    See Project
  • Download the most trusted enterprise browser Icon
    Download the most trusted enterprise browser

    Chrome Enterprise brings enterprise controls and easy integrations to the browser users already know and love.

    Chrome Enterprise is ideal for businesses of all sizes, IT professionals, and organizations looking for a secure, scalable, and easily managed browser solution that supports remote work, data protection, and streamlined enterprise operations.
    Learn More
  • 5
    Gerapy

    Gerapy

    Distributed Crawler Management Framework Based on Scrapy

    ...It has high crawling efficiency and good scalability. It is basically a necessary tool for developing crawlers using Python. If you use Scrapy as a crawler, then of course we can use our own host to crawl when crawling, but when the crawl is very large, we can’t run the crawler on our own machine, a good one. The method is to deploy Scrapy to a remote server for execution. At this time, you might use Scrapyd. With it, we only need to install Scrapyd on the remote server and start the service. We can deploy the Scrapy project we wrote. ...
    Downloads: 0 This Week
    Last Update:
    See Project
  • 6
    grab-site

    grab-site

    Web crawler for archiving and backing up sites into WARC archives

    ...Internally, the crawler uses a fork of the wpull engine to fetch and process web pages efficiently during large-scale crawls. grab-site includes a built-in dashboard that displays real-time crawl activity, including which URLs are currently being processed and how many remain in the queue. Users can dynamically apply ignore patterns during an active crawl, allowing them to skip problematic or unnecessary URLs that could slow down or block the archiving process. grab-site also provides predefined ignore sets for common site structures such as forums and other complex web platforms. ...
    Downloads: 2 This Week
    Last Update:
    See Project
  • 7
    instagram-profilecrawl

    instagram-profilecrawl

    Instagram profile crawler that extracts posts, tags, and stats

    instagram-profilecrawl is a Python-based automation script designed to collect publicly available information from Instagram profiles. It crawls profile data such as follower counts, post information, hashtags, and other engagement-related metadata. It operates by automating a web browser using Selenium and performing requests to gather structured information from the platform. instagram-profilecrawl can analyze multiple usernames in a single run and store the extracted information locally...
    Downloads: 1 This Week
    Last Update:
    See Project
  • 8
    WeChatSogou

    WeChatSogou

    Python library to crawl and retrieve data from WeChat accounts

    WechatSogou is an open source Python library designed to retrieve data from WeChat official accounts by using the Sogou WeChat search service as its data source. It provides developers with a programmatic way to search for public accounts and collect article information without manually browsing the search interface. It functions as a crawler interface that sends requests to the search engine, retrieves results, and converts the returned pages into structured data that can be used in...
    Downloads: 8 This Week
    Last Update:
    See Project
  • 9
    pyspider

    pyspider

    A powerful Spider(Web Crawler) system in Python

    ...Since pyspider has various components, you can just run pyspider to start a standalone and third service free instance. Or using MySQL or MongoDB and RabbitMQ to deploy a distributed crawl cluster. To deploy pyspider in product environment, running component in each process and store data in database service is more reliable and flexible. To deploy pyspider components in each single processes, you need at least one database service. pyspider now supports MySQL, MongoDB and PostgreSQL. You can choose one of them.
    Downloads: 0 This Week
    Last Update:
    See Project
  • Enterprise AI Agents for Every Customer Moment Icon
    Enterprise AI Agents for Every Customer Moment

    For enterprise companies looking for AI Agents

    From chat to voice to SMS, every conversation gets a smart, personalized response powered by your policies, tone, and data.
    Learn More
  • 10
    sqliv

    sqliv

    Massive SQL injection vulnerability scanner for automated web testing

    ...In addition to bulk scanning, SQLiv supports targeted analysis of specific domains or individual URLs, allowing security researchers to focus on particular web applications. When a domain is supplied, the scanner can crawl the site to gather URLs with parameters and evaluate them for potential SQL injection weaknesses. SQLiv also supports reverse domain scanning to locate other websites hosted on the same server, which can then be examined for similar vulnerabilities.
    Downloads: 10 This Week
    Last Update:
    See Project
  • 11
    cybcon89

    cybcon89

    Crawl and ouput WAS configuration

    The project moved to Bitbucket: GIT https://bitbucket.org/Cybcon/websphere-as-configcrawler/src/master/ Please checkout the Bitbucket GIT repo for updates after v0.644.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 12

    Web Crawler Security Tool

    A web crawler oriented to information security.

    Last update on tue mar 26 16:25 UTC 2012 The Web Crawler Security is a python based tool to automatically crawl a web site. It is a web crawler oriented to help in penetration testing tasks. The main task of this tool is to search and list all the links (pages and files) in a web site. The crawler has been completely rewritten in v1.0 bringing a lot of improvements: improved the data visualization, interactive option to download files, increased speed in crawling, exports list of found files into a separated file (useful to crawl a site once, then download files and analyse them with FOCA), generate an output log in Common Log Format (CLF), manage basic authentication and more! ...
    Downloads: 0 This Week
    Last Update:
    See Project
  • 13
    Ruya is a Python-based breadth-first, level-, delayed, event-based-crawler for crawling English, Japanese websites. It is targeted solely towards developers who want crawling functionality in their projects using API, and crawl control.
    Downloads: 0 This Week
    Last Update:
    See Project
  • Previous
  • You're on page 1
  • Next
MongoDB Logo MongoDB