Search Results for "metadata extraction tool"

Showing 421 open source projects for "metadata extraction tool"

1

Instagram OSINT Tool

Instagram OSINT tool for gathering profile data and public posts

InstagramOSINT is an open source intelligence (OSINT) tool designed to collect publicly accessible information from Instagram profiles. It retrieves details that are not always easily visible when browsing an Instagram account normally, allowing investigators, researchers, and developers to gather structured data about a target profile. It works by scraping publicly available profile information and extracting metadata from Instagram pages using Python.

Downloads: 25 This Week

Last Update: 6 days ago
See Project
2

MinerU

A high-quality tool for converting PDF to Markdown and JSON

MinerU is an open-source, high-quality document extraction toolkit focused on converting PDFs (and other document formats) into structured Markdown and JSON. It leverages OCR and layout analysis to preserve semantic structure and metadata, ideal for research and data science workflows.

Downloads: 7 This Week

Last Update: 19 hours ago
See Project
3

ContextGem

ContextGem: Effortless LLM extraction from documents

ContextGem is an open-source framework designed to simplify the extraction of structured data and insights from documents using large language models (LLMs). It provides a flexible, intuitive API that minimizes boilerplate code, enabling developers to build complex extraction workflows efficiently. ContextGem supports various document formats and integrates with multiple LLM providers, making it a versatile tool for tasks like contract analysis, anomaly detection, and information retrieval.

Downloads: 3 This Week

Last Update: 2026-03-16
See Project
4

Extractous

Fast and efficient unstructured data extraction

Extractous is a Rust-based unstructured data extraction library focused on fast local parsing of documents and other content-heavy files. Its purpose is to extract text and metadata efficiently from formats such as PDF, Word, HTML, email archives, images, and more, without depending on external APIs or separate parsing servers. The project emphasizes performance and low memory usage, and its maintainers describe it as a local-first alternative to heavier extraction stacks. ...

Downloads: 1 This Week

Last Update: 2026-03-06
See Project
5

Dungbeetle

A distributed job server

Dungbeetle is a lightweight distributed job server developed by Zerodha for queuing and executing heavy SQL read jobs asynchronously. Results are written to a separate results database, so applications can fetch finished reports without putting load on the primary databases.

Downloads: 3 This Week

Last Update: 2025-06-11
See Project
6

GROBID

A machine learning software for extracting information

...Reference extraction and parsing from articles in PDF format, with around 0.87 F1-score against an independent PubMed Central set of 1,943 PDFs containing 90,125 references, and around 0.89 on a similar bioRxiv set of 2,000 PDFs (using the Deep Learning citation model). All the usual publication metadata are covered (including DOI, PMID, etc.).

Downloads: 2 This Week

Last Update: 2026-04-07
See Project
7

Amazon EC2 Metadata Mock

A tool to simulate Amazon EC2 instance metadata

Instance metadata is data about your instance that you can use to configure or manage the running instance. Instance metadata is divided into categories, for example, hostname, events, and security groups. You can also use instance metadata to access user data that you specified when launching your instance. For example, you can specify parameters for configuring your instance, or include a simple script. You can build generic AMIs and use user data to modify the configuration files supplied...

Downloads: 3 This Week

Last Update: 2025-02-13
See Project
8

Portable Executable Parser

Lightweight Go package to parse, analyze and extract metadata

Saferwall PE is a lightweight Go package for parsing, analyzing, and extracting metadata from Portable Executable (PE) binaries. Designed with malware analysis in mind, it is robust against malformed PE files and provides detailed insights into executable structures.

Downloads: 4 This Week

Last Update: 2026-04-08
See Project
9

Trafilatura

Python & command-line tool to gather text on the Web

Trafilatura is a Python package and command-line tool designed to gather text on the Web. It includes discovery, extraction and text-processing components. Its main applications are web crawling, downloads, scraping, and extraction of main texts, metadata and comments. It aims at staying handy and modular: no database is required, the output can be converted to various commonly used formats.

Downloads: 0 This Week

Last Update: 2024-12-03
See Project
10

Instaloader

Download pictures (or videos) along with their captions

Instaloader is a mature open-source utility for downloading and archiving Instagram content along with rich metadata. It enables users to retrieve posts, stories, reels, highlights, profile pictures, and associated information such as captions, comments, timestamps, and geotags. The tool supports both public and permitted private content when proper authentication is provided, making it useful for research, digital archiving, and social media analysis.

Downloads: 6 This Week

Last Update: 2026-03-21
See Project
11

PDFPatcher

A versatile toolkit for PDF manipulation

PDFPatcher (aka “PDF补丁丁”) is a versatile toolkit for PDF manipulation—editing document metadata, bookmarks, page layout, content restrictions, rotation, compression, merging/splitting, image extraction, and more, all within an intuitive interface. Merge/split PDFs or images, preserve or add bookmarks, and set page dimensions. Batch style/color/target changes, regex/XPath search/replace, mid‑page positioning. Modify PDF metadata, page numbers, links, initial view mode, and remove open actions.

Downloads: 39 This Week

Last Update: 2025-08-14
See Project
12

nhentai

A library for interacting with the nhentai API

nhentai is a JavaScript and TypeScript library designed to interact with the nhentai API and retrieve doujinshi metadata and content information. It enables developers to programmatically access galleries, titles, tags, covers, and page URLs from the nhentai platform. The library supports both CommonJS and ES6 module imports, making it easy to integrate into different Node.js projects. Developers can use it to fetch specific doujin entries, explore associated metadata, and process gallery...

Downloads: 47 This Week

Last Update: 2026-03-15
See Project
13

Docspell

Assist in organizing your piles of documents

Docspell is a personal document organizer, sometimes called a "Document Management System" (DMS). You'll need a scanner to convert your papers into files. Docspell can then assist in organizing the resulting mess. It can unify your files from scanners, emails, and other sources. It is targeted at home use, i.e. families and households, and also at smaller groups and companies. You can associate tags, set correspondents, and add lots of other predefined and custom metadata. If your documents are...

Downloads: 5 This Week

Last Update: 2025-03-15
See Project
14

pdfly

CLI tool to extract (meta)data from PDF and manipulate PDF files

A Python library designed for manipulating PDF files with functionalities for extraction, transformation, and document generation.

Downloads: 6 This Week

Last Update: 2025-10-13
See Project
15

X-osint

Open source OSINT tool for gathering data on emails, phones, and IPs

X-osint is an open source intelligence framework designed to collect and analyze publicly available information from multiple sources. It focuses on gathering useful and credible data about entities such as phone numbers, email addresses, and IP addresses using a range of automated OSINT techniques. It provides investigators and researchers with a centralized interface for running information-gathering tasks that would normally require multiple separate tools. X-osint can also perform...

Downloads: 47 This Week

Last Update: 6 days ago
See Project
16

Bili23 Downloader

Cross platform GUI tool for downloading videos from Bilibili sites

Bili23-Downloader is an open source desktop application designed for downloading video content from the Bilibili platform. It provides a graphical interface that allows users to download various types of media including user-uploaded videos, series episodes, movies, and other hosted content. It focuses on ease of use with a zero-configuration setup, making it accessible to both beginners and experienced users. It supports high performance downloads through multi-threading and includes resume...

Downloads: 19 This Week

Last Update: 2026-04-07
See Project
17

Copybara

Copybara: A tool for transforming and moving code between repositories

...The tool supports advanced transformations—such as file relocation, content replacement, and metadata adjustments—defined declaratively in configuration files. It operates in a stateless manner, storing synchronization state within commit metadata to ensure reproducibility and collaboration among multiple users. Copybara currently supports Git repositories (with experimental Mercurial support) and can be integrated with CI/CD systems or run manually.

Downloads: 58 This Week

Last Update: 4 days ago
See Project
18

ExtractThinker

ExtractThinker is a Document Intelligence library for LLMs

ExtractThinker is a tool designed to facilitate the extraction and analysis of information from various data sources, aiding in data processing and knowledge discovery.

Downloads: 5 This Week

Last Update: 2025-06-09
See Project
19

KaraKeep

A self-hostable bookmark-everything app

...Automatic fetching of link titles, descriptions, and images streamlines saving content without manual edits, while rule-based management lets users define customized workflows. With support for image OCR and structured data extraction, Karakeep functions as a flexible personal knowledge base for researchers, content creators, and heavy bookmarkers.

Downloads: 0 This Week

Last Update: 2026-02-22
See Project
20

MDCx

Movie metadata scraper and organizer for media libraries and NFO

MDCx is an open source media metadata scraping and organization tool designed to automate the process of collecting detailed information for movie files. It retrieves metadata from multiple online sources and applies it to local media collections, helping users maintain structured and well-organized libraries. MDCx can download information such as titles, cast data, artwork, and other metadata, then generate standardized NFO files compatible with media management systems. ...

Downloads: 10 This Week

Last Update: 2026-03-10
See Project
21

NeMo Retriever Library

Document content and metadata extraction microservice

NeMo Retriever Library is a scalable microservice framework designed for extracting, structuring, and enriching content from documents to support downstream generative AI applications. It processes various document types by splitting them into components such as text, tables, charts, and images, and then applies OCR and contextual analysis to convert them into structured data formats. The system is built on NVIDIA NIM microservices, enabling high-performance parallel processing and efficient...

Downloads: 2 This Week

Last Update: 2026-03-18
See Project
22

Coomer Downloader App

Coomer downloader

CoomerDL is an open-source command-line tool designed to download and archive content from websites supported by the Coomer platform, which aggregates publicly available content from various creators. The tool enables users to automate the retrieval of media such as images, videos, and metadata associated with creator profiles. It is built to handle large volumes of content efficiently, allowing users to download entire collections or selectively retrieve specific posts. ...

Downloads: 123 This Week

Last Update: 2026-04-04
See Project
23

CommunityScrapers

This is a public repository containing scrapers

Stash Community Scrapers is a large open-source collection of metadata extraction tools designed to work with the Stash media management platform, enabling automated scraping of content information from various online sources. The repository contains hundreds of scraper definitions written primarily in YAML and Python, each tailored to extract structured metadata such as titles, performers, tags, and media details from specific websites.

Downloads: 2 This Week

Last Update: 4 days ago
See Project
24

Atuin

Magical shell history

Atuin is a modern shell history replacement tool and CLI utility that records all your shell commands in a SQLite database alongside contextual metadata. It offers encrypted sync across devices, full-text search, usage statistics, and a desktop application to run executable runbooks as native documents.

Downloads: 97 This Week

Last Update: 2 days ago
See Project
25

get_iplayer

A utility for downloading TV and radio programmes from BBC iPlayer

get_iplayer is a command-line tool for downloading and streaming content from BBC iPlayer and BBC Sounds. It provides access to TV programs and radio broadcasts available on the BBC's platforms and allows users to archive content for offline use. The tool includes search, recording, and metadata tagging features and is popular among users looking to maintain access to BBC content globally.

Downloads: 38 This Week

Last Update: 2025-07-25
See Project