crawl free download - SourceForge

Showing 25 open source projects for "crawl"

1

watercrawl

AI-ready web crawler that extracts and structures website content

WaterCrawl is an open source web crawling and data extraction platform designed to transform website content into structured data suitable for machine learning and AI workflows. It enables developers and researchers to crawl web pages, extract meaningful information, and convert it into formats that are easier to process and analyze. It provides a modern crawling system that can automatically navigate links, control crawl depth, and collect content from targeted sections of a website. WaterCrawl supports customizable extraction rules so users can focus only on relevant elements while ignoring unnecessary page components. ...

Downloads: 8 This Week

Last Update: 2026-03-11
See Project
2

Douyin TikTok Download API

Douyin TikTok Download API

...You can deploy or adapt this project yourself to add more functions, call scraper.py directly in your project, or install the existing pip package as a parsing library to crawl data easily. It supports entering a Douyin or TikTok user homepage to crawl the author's data: homepage video data (watermark-free links), liked video list (permission must be public), video comment data, background music video list data, etc.

Downloads: 2 This Week

Last Update: 2025-03-16
See Project
3

Scrapy

A fast, high-level web crawling and web scraping framework

Scrapy is a fast, open source, high-level framework for crawling websites and extracting structured data from these websites. Portable and written in Python, it can run on Windows, Linux, macOS and BSD. Scrapy is powerful, fast and simple, and also easily extensible. Simply write the rules to extract the data, and add new functionality if you wish without having to touch the core. Scrapy does the rest, and can be used in a number of applications. It can be used for data mining, monitoring...

Downloads: 28 This Week

Last Update: 5 days ago
See Project
4

BettaFish

Public opinion analysis system

BettaFish is an open-source, multi-agent public opinion analysis system built to automate the collection, deep analysis, and reporting of social media data at scale through conversational queries. It uses a modular architecture of specialized agents that collaborate to crawl mainstream platforms, extract multimodal content like text and short video, and synthesize insights through both statistical and large language model techniques. With a design that lets users pose questions in natural language and receive structured reports, charts, and visualizations, the system aims to break information cocoons and provide comprehensive views of trends and public sentiment. ...

Downloads: 1 This Week

Last Update: 2026-02-17
See Project
5

tumblr-crawler

Python crawler to download photos and videos from Tumblr blogs

...It provides a script that automatically retrieves photos and videos from specified Tumblr sites and saves them locally for offline access. Users can specify one or multiple blogs to crawl by editing a configuration file or by passing parameters through the command line. Once executed, the script fetches media from the Tumblr API and stores the downloaded files in folders named after each blog. tumblr-crawler avoids re-downloading files that have already been saved, making repeated runs safe and useful for recovering missing media. ...
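
The skip-if-already-saved behavior described above can be sketched as follows. This is a minimal illustration under stated assumptions, not tumblr-crawler's actual code: `save_media`, the `fetch` callable, and the example blog and file names are hypothetical.

```python
import os
import tempfile


def save_media(root, blog, filename, fetch):
    """Save one media file under root/blog/, skipping files that already exist.

    `fetch` is a hypothetical zero-argument callable that returns the file
    bytes; it is only called when the file is missing.
    Returns True if a download happened, False if the file was skipped.
    """
    folder = os.path.join(root, blog)
    os.makedirs(folder, exist_ok=True)  # one folder per blog
    path = os.path.join(folder, filename)
    if os.path.exists(path):
        return False  # already saved on an earlier run
    with open(path, "wb") as handle:
        handle.write(fetch())
    return True


def demo():
    """Request the same file twice; only the first call downloads it."""
    calls = []

    def fake_fetch():
        calls.append(1)
        return b"image-bytes"

    with tempfile.TemporaryDirectory() as root:
        first = save_media(root, "example-blog", "photo.jpg", fake_fetch)
        second = save_media(root, "example-blog", "photo.jpg", fake_fetch)
    return first, second, len(calls)
```

Because the existence check comes before the fetch, a repeated run costs no network traffic for files that are already on disk.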

Downloads: 2 This Week

Last Update: 6 days ago
See Project
6

news-please

Python tool for crawling and extracting structured data from news sites

news-please is an open source news crawler and information extraction tool designed to collect and structure articles from online news websites. It provides an integrated pipeline that crawls news sites, retrieves article pages, and extracts structured information such as headlines, authors, publication dates, and article text. news-please can recursively follow internal links and read RSS feeds to gather both recent and archived articles from a news outlet when given only the root URL of a...

Downloads: 1 This Week

Last Update: 7 days ago
See Project
7

Python-Spider

Python3 web crawler practice

...As part of the author’s public learning-path repositories, python-spider likely includes examples of HTTP requests, HTML parsing, maybe concurrency or scheduling to crawl multiple pages, and techniques to handle common web-scraping issues. For people wanting to get hands-on with building scrapers, collecting data, or learning how to navigate web programming in Python, this repository acts as a didactic reference or starting point. Because it’s published publicly under an open license, users are free to fork and adapt the code.

Downloads: 1 This Week

Last Update: 2025-12-08
See Project
8

TorBot

Dark Web OSINT Tool

...You will need to give the script the correct permissions using chmod +x install.sh. Now you can run ./install.sh to create the torBot binary. Run ./torBot to execute the program. Feature status: crawl custom domains (completed), check if the link is live (completed), built-in updater (completed), TorBot GUI (in progress), social media integration (not started).

Downloads: 0 This Week

Last Update: 2023-10-12
See Project
9

Gerapy

Distributed Crawler Management Framework Based on Scrapy

...It has high crawling efficiency and good scalability. It is practically an essential tool for developing crawlers in Python. If you use Scrapy as your crawler, you can of course run crawls on your own host, but when the crawl is very large, running it on your own machine is impractical. A good approach is to deploy Scrapy to a remote server for execution. For this you might use Scrapyd: you only need to install Scrapyd on the remote server and start the service, and then you can deploy the Scrapy project you wrote. ...

Downloads: 0 This Week

Last Update: 2023-07-19
See Project
10

Image Downloader

Download images from Google, Bing, Baidu

Crawl and download images using Selenium, with Python 3 and PyQt5. Supported search engines: Google, Bing, Baidu. Keywords can be entered from the keyboard or read from a file of line-separated keywords for batch processing. Images are downloaded using a customizable number of threads. Conditional search (e.g. filetype:, site:) is fully supported. Includes a switch for Google safe mode.

Downloads: 3 This Week

Last Update: 2023-04-03
See Project
11

ProxyPool

An Efficient ProxyPool with Getter, Tester and Server

A simple and efficient proxy pool providing the following functions. It regularly crawls free proxy websites and is easy to extend. It uses Redis to store proxies and sort them by availability. Regular testing and screening eliminates unavailable proxies and keeps available ones. It provides a proxy API that randomly selects an available proxy that has passed the test. The principles behind the proxy pool are analyzed in "How to Build an Efficient Proxy Pool".
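
The store, test, and serve cycle described above can be sketched roughly as follows. This is not ProxyPool's actual code: a plain dict stands in for Redis, and the class name, method names, and score values (10 for a new proxy, 100 after a passed test, minus 1 per failure) are assumptions made for the illustration.

```python
import random


class ScoredProxyPool:
    """In-memory stand-in for a score-sorted proxy store (assumed design)."""

    MAX_SCORE = 100     # assumed score for a proxy that passed its last test
    INITIAL_SCORE = 10  # assumed score for a newly crawled proxy
    MIN_SCORE = 0       # proxies at or below this score are dropped

    def __init__(self):
        self.scores = {}

    def add(self, proxy):
        # Keep the existing score if the proxy is already known.
        self.scores.setdefault(proxy, self.INITIAL_SCORE)

    def record_test(self, proxy, ok):
        """Update the score after a tester run."""
        if proxy not in self.scores:
            return
        if ok:
            self.scores[proxy] = self.MAX_SCORE
        else:
            self.scores[proxy] -= 1
            if self.scores[proxy] <= self.MIN_SCORE:
                del self.scores[proxy]  # screened out as unavailable

    def random(self):
        """Return a random proxy among those with the highest score."""
        if not self.scores:
            raise LookupError("proxy pool is empty")
        best = max(self.scores.values())
        candidates = [p for p, s in self.scores.items() if s == best]
        return random.choice(candidates)
```

Serving only from the top score tier means a freshly crawled, untested proxy is not handed out while any tested one is available.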

Downloads: 0 This Week

Last Update: 2023-03-01
See Project
12

grab-site

Web crawler for archiving and backing up sites into WARC archives

...Internally, the crawler uses a fork of the wpull engine to fetch and process web pages efficiently during large-scale crawls. grab-site includes a built-in dashboard that displays real-time crawl activity, including which URLs are currently being processed and how many remain in the queue. Users can dynamically apply ignore patterns during an active crawl, allowing them to skip problematic or unnecessary URLs that could slow down or block the archiving process. grab-site also provides predefined ignore sets for common site structures such as forums and other complex web platforms. ...
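
The idea of applying ignore patterns while a crawl is running can be sketched as follows. This is an illustration only and does not reflect grab-site's internals: `filter_queue` and the example patterns are hypothetical, and the patterns are treated as ordinary Python regular expressions.

```python
import re
from collections import deque


def filter_queue(queue, ignore_patterns):
    """Drain a URL queue, returning the URLs that no ignore pattern matches.

    `ignore_patterns` is a mutable list that is re-read for every URL, so a
    pattern appended while the queue is being drained applies to all URLs
    that have not been processed yet.
    """
    kept = []
    while queue:
        url = queue.popleft()
        if any(re.search(pattern, url) for pattern in ignore_patterns):
            continue  # skipped: matches an ignore pattern
        kept.append(url)
    return kept


# Hypothetical queue and patterns for demonstration.
example_queue = deque([
    "https://example.com/thread/1",
    "https://example.com/calendar/2031-01",
    "https://example.com/thread/1?sort=desc",
    "https://example.com/about",
])
example_patterns = [r"/calendar/", r"[?&]sort="]
```

Calling `filter_queue(example_queue, example_patterns)` keeps the thread and about pages and drops the calendar and sorted-view URLs.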

Downloads: 4 This Week

Last Update: 3 days ago
See Project
13

instagram-profilecrawl

Instagram profile crawler that extracts posts, tags, and stats

instagram-profilecrawl is a Python-based automation script designed to collect publicly available information from Instagram profiles. It crawls profile data such as follower counts, post information, hashtags, and other engagement-related metadata. It operates by automating a web browser using Selenium and performing requests to gather structured information from the platform. instagram-profilecrawl can analyze multiple usernames in a single run and store the extracted information locally...

Downloads: 1 This Week

Last Update: 6 days ago
See Project
14

hebrew-gpt_neo

Hebrew text generation models based on EleutherAI's gpt-neo

...Each was trained on a TPUv3-8 which was made available to me via the TPU Research Cloud Program. The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

Downloads: 0 This Week

Last Update: 2023-03-23
See Project
15

CC-Net

Tools to download and cleanup Common Crawl data

cc_net provides tools to download, segment, clean, and filter Common Crawl to build large-scale text corpora, including monolingual datasets and the multilingual CC-100 collection introduced in the associated paper. It includes pipelines to fetch snapshots, extract text, de-duplicate, identify language, and apply quality filtering based on heuristics and language models. The outputs are intended for pretraining language models and for creating standardized corpora that can be reproduced or updated with new crawls. ...

Downloads: 0 This Week

Last Update: 2025-10-11
See Project
16

BlackWidow

Python web scanner for OSINT gathering and OWASP vulnerability fuzzing

BlackWidow is a Python-based web application scanning tool designed to crawl target websites and collect open-source intelligence (OSINT) while identifying potential security vulnerabilities. It functions as a web spider that systematically explores a site to gather valuable information such as URLs, dynamic parameters, subdomains, email addresses, and phone numbers associated with the target domain. By automatically extracting this data, BlackWidow helps security professionals and researchers build a clearer understanding of a website’s structure and publicly accessible information. ...

Downloads: 14 This Week

Last Update: 2026-03-07
See Project
17

WeChatSogou

Python library to crawl and retrieve data from WeChat accounts

WechatSogou is an open source Python library designed to retrieve data from WeChat official accounts by using the Sogou WeChat search service as its data source. It provides developers with a programmatic way to search for public accounts and collect article information without manually browsing the search interface. It functions as a crawler interface that sends requests to the search engine, retrieves results, and converts the returned pages into structured data that can be used in...

Downloads: 8 This Week

Last Update: 2026-03-10
See Project
18

pyspider

A powerful Spider(Web Crawler) system in Python

...Since pyspider has various components, you can just run pyspider to start a standalone instance that needs no third-party services. Alternatively, use MySQL or MongoDB together with RabbitMQ to deploy a distributed crawl cluster. In a production environment, running each component in its own process and storing data in a database service is more reliable and flexible. To run pyspider components in separate processes, you need at least one database service. pyspider now supports MySQL, MongoDB and PostgreSQL; you can choose one of them.

Downloads: 0 This Week

Last Update: 2021-03-31
See Project
19

sqliv

Massive SQL injection vulnerability scanner for automated web testing

...In addition to bulk scanning, SQLiv supports targeted analysis of specific domains or individual URLs, allowing security researchers to focus on particular web applications. When a domain is supplied, the scanner can crawl the site to gather URLs with parameters and evaluate them for potential SQL injection weaknesses. SQLiv also supports reverse domain scanning to locate other websites hosted on the same server, which can then be examined for similar vulnerabilities.

Downloads: 9 This Week

Last Update: 12 hours ago
See Project
20

cybcon89

Crawl and output WAS configuration

The project moved to Bitbucket (Git): https://bitbucket.org/Cybcon/websphere-as-configcrawler/src/master/ Please check out the Bitbucket Git repo for updates after v0.644.

Downloads: 0 This Week

Last Update: 2020-10-27
See Project
21

DCSS-Forensics

Morgue file parser & Analyzer for Dungeon Crawl: Stone Soup

The purpose of this project is to develop a generic morgue file reader for the roguelike game Dungeon Crawl: Stone Soup.

Downloads: 0 This Week

Last Update: 2013-05-31
See Project
22

Web Crawler Security Tool

A web crawler oriented to information security.

Last update: Tue Mar 26 16:25 UTC 2012. The Web Crawler Security Tool is a Python-based tool to automatically crawl a web site. It is a web crawler oriented to help in penetration testing tasks. The main task of this tool is to search and list all the links (pages and files) in a web site. The crawler was completely rewritten in v1.0, bringing many improvements: improved data visualization, an interactive option to download files, increased crawling speed, export of the list of found files to a separate file (useful to crawl a site once, then download files and analyse them with FOCA), output log generation in Common Log Format (CLF), basic authentication handling, and more! ...

3 Reviews

Downloads: 0 This Week

Last Update: 2015-10-10
See Project
23

Ruya

Ruya is a Python-based breadth-first, level-based, delayed, event-based crawler for crawling English and Japanese websites. It is targeted solely at developers who want crawling functionality and crawl control in their projects through an API.
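
A breadth-first crawl with a level limit, a delay between pages, and an event callback can be sketched as follows. This is a generic illustration, not Ruya's API: `crawl_bfs`, `get_links`, `on_page`, and the in-memory `EXAMPLE_SITE` link graph are hypothetical names, and the link graph replaces real HTTP fetching so the sketch stays self-contained.

```python
import time
from collections import deque


def crawl_bfs(start, get_links, max_level=2, delay=0.0, on_page=None):
    """Visit pages breadth-first, going at most `max_level` links from start.

    `get_links(url)` returns the outgoing links of a page; `on_page(url,
    level)` is an optional callback fired for each visited page. Returns the
    visit order as a list of (url, level) tuples.
    """
    seen = {start}
    queue = deque([(start, 0)])
    order = []
    while queue:
        url, level = queue.popleft()
        order.append((url, level))
        if on_page is not None:
            on_page(url, level)  # event hook for the caller
        if level == max_level:
            continue  # do not expand pages at the deepest allowed level
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, level + 1))
        if delay:
            time.sleep(delay)  # politeness pause between page expansions
    return order


# Hypothetical site structure standing in for fetched pages.
EXAMPLE_SITE = {
    "/": ["/a", "/b"],
    "/a": ["/c", "/"],
    "/b": ["/c"],
    "/c": ["/d"],
    "/d": [],
}
```

With `max_level=2`, crawling `EXAMPLE_SITE` from `/` visits `/`, `/a`, `/b`, and `/c`, and never reaches `/d`, which is three links away.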

Downloads: 0 This Week

Last Update: 2013-03-27
See Project
24

pythonSlash

PythonSlash (py/ for short) is a multiplatform engine for real-time dungeon crawl games written in Python. The included sample game is intended to be a Diablo-like game, fun and fast-paced.

Downloads: 0 This Week

Last Update: 2014-11-21
See Project
25

LJcrawler

Crawl a LiveJournal-based blog hosting service for friends data. LiveJournal is a blog hosting engine that allows its users to list others as friends. This tool can download most of the data about friendship relations between users for later processing.

Downloads: 0 This Week

Last Update: 2013-04-19
See Project