Search Results for "metadata extraction tool"

Showing 421 open source projects for "metadata extraction tool"

View related business solutions
  • Get full visibility and control over your tasks and projects with Wrike. Icon
    Get full visibility and control over your tasks and projects with Wrike.

    A cloud-based collaboration, work management, and project management software

    Wrike offers world-class features that empower cross-functional, distributed, or growing teams take their projects from the initial request stage all the way to tracking work progress and reporting results.
    Learn More
  • The All-In-One Google Workspace Management Tool for IT Admins Icon
    The All-In-One Google Workspace Management Tool for IT Admins

    Our advanced administration makes adding, deleting, suspending, and de-provisioning users simple and quick.

    gPanel by Promevo streamlines administration, security, and user management, giving organizations full control over their Google Workspace.
    Learn More
  • 1
    Instagram OSINT Tool

    Instagram OSINT Tool

    Instagram OSINT tool for gathering profile data and public posts

    InstagramOSINT is an open source intelligence (OSINT) tool designed to collect publicly accessible information from Instagram profiles. It retrieves details that are not always easily visible when browsing an Instagram account normally, allowing investigators, researchers, and developers to gather structured data about a target profile. It works by scraping publicly available profile information and extracting metadata from Instagram pages using Python.
    Downloads: 25 This Week
    Last Update:
    See Project
  • 2
    MinerU

    MinerU

    A high-quality tool for convert PDF to Markdown and JSON

    MinerU is an open-source, high-quality document extraction toolkit focused on converting PDFs (and other document formats) into structured Markdown and JSON. It leverages OCR and layout analysis to preserve semantic structure and metadata, ideal for research and data science workflows.
    Downloads: 7 This Week
    Last Update:
    See Project
  • 3
    ContextGem

    ContextGem

    ContextGem: Effortless LLM extraction from documents

    ContextGem is an open-source framework designed to simplify the extraction of structured data and insights from documents using large language models (LLMs). It provides a flexible, intuitive API that minimizes boilerplate code, enabling developers to build complex extraction workflows efficiently. ContextGem supports various document formats and integrates with multiple LLM providers, making it a versatile tool for tasks like contract analysis, anomaly detection, and information retrieval.​
    Downloads: 3 This Week
    Last Update:
    See Project
  • 4
    Extractous

    Extractous

    Fast and efficient unstructured data extraction

    Extractous is a Rust-based unstructured data extraction library focused on fast local parsing of documents and other content-heavy files. Its purpose is to extract text and metadata efficiently from formats such as PDF, Word, HTML, email archives, images, and more, without depending on external APIs or separate parsing servers. The project emphasizes performance and low memory usage, and its maintainers describe it as a local-first alternative to heavier extraction stacks. ...
    Downloads: 1 This Week
    Last Update:
    See Project
  • Planfix: Manage Projects, Team's Tasks and Business Processes Icon
    Planfix: Manage Projects, Team's Tasks and Business Processes

    All-in-One Enterprise-Level Software is Now Available for SMB

    Planfix is like a souped-up business process management system for folks who really know their stuff. It's built to help you dive deeper and gives you more options than your run-of-the-mill project and task management systems. Best part? Even small businesses and non-profits can get in on the action.
    Learn More
  • 5
    Dungbeetle

    Dungbeetle

    A distributed job server

    Dungbeetle is a metadata and data lineage tracking tool developed by Zerodha to map and visualize how data flows across systems. It helps teams maintain data transparency by tracking dependencies between databases, tables, and reports, offering a centralized view of data pipelines. Dungbeetle is designed to enhance observability and trust in analytics ecosystems.
    Downloads: 3 This Week
    Last Update:
    See Project
  • 6
    GROBID

    GROBID

    A machine learning software for extracting information

    ...References extraction and parsing from articles in PDF format, around .87 F1-score against on an independent PubMed Central set of 1943 PDF containing 90,125 references, and around .89 on a similar bioRxiv set of 2000 PDF (using the Deep Learning citation model). All the usual publication metadata are covered (including DOI, PMID, etc.).
    Downloads: 2 This Week
    Last Update:
    See Project
  • 7
    Amazon EC2 Metadata Mock

    Amazon EC2 Metadata Mock

    A tool to simulate Amazon EC2 instance metadata

    Instance metadata is data about your instance that you can use to configure or manage the running instance. Instance metadata is divided into categories, for example, hostname, events, and security groups. You can also use instance metadata to access user data that you specified when launching your instance. For example, you can specify parameters for configuring your instance, or include a simple script. You can build generic AMIs and use user data to modify the configuration files supplied...
    Downloads: 3 This Week
    Last Update:
    See Project
  • 8
    Portable Executable Parser

    Portable Executable Parser

    lightweight Go package to parse, analyze and extract metadata

    Saferwall PE is a lightweight Go package for parsing, analyzing, and extracting metadata from Portable Executable (PE) binaries. Designed with malware analysis in mind, it is robust against malformed PE files and provides detailed insights into executable structures.​
    Downloads: 4 This Week
    Last Update:
    See Project
  • 9
    Trafilatura

    Trafilatura

    Python & command-line tool to gather text on the Web

    Trafilatura is a Python package and command-line tool designed to gather text on the Web. It includes discovery, extraction and text-processing components. Its main applications are web crawling, downloads, scraping, and extraction of main texts, metadata and comments. It aims at staying handy and modular: no database is required, the output can be converted to various commonly used formats.
    Downloads: 0 This Week
    Last Update:
    See Project
  • Contract Management Software | Concord Icon
    Contract Management Software | Concord

    AI-powered contract management that helps businesses track spending, negotiate smarter, and never miss deadlines.

    Concord serves small and mid-sized businesses and Fortune 500 companies. This robust, web-based platform is used by human resource, sales, procurement, and legal teams, and virtually anyone who deals with contracts.
    Learn More
  • 10
    Instaloader

    Instaloader

    Download pictures (or videos) along with their captions

    Instaloader is a mature open-source utility for downloading and archiving Instagram content along with rich metadata. It enables users to retrieve posts, stories, reels, highlights, profile pictures, and associated information such as captions, comments, timestamps, and geotags. The tool supports both public and permitted private content when proper authentication is provided, making it useful for research, digital archiving, and social media analysis.
    Downloads: 6 This Week
    Last Update:
    See Project
  • 11
    PDFPatcher

    PDFPatcher

    A versatile toolkit for PDF manipulation

    PDFPatcher (aka “PDF补丁丁”) is a versatile toolkit for PDF manipulation—editing document metadata, bookmarks, page layout, content restrictions, rotation, compression, merging/splitting, image extraction, and more, all within an intuitive interface. Merge/split PDFs or images, preserve or add bookmarks, and set page dimensions. Batch style/color/target changes, regex/XPath search/replace, mid‑page positioning. Modify PDF metadata, page numbers, links, initial view mode, and remove open actions.
    Downloads: 39 This Week
    Last Update:
    See Project
  • 12
    nhentai

    nhentai

    A library for interacting with the nhentai API

    nhentai is a JavaScript and TypeScript library designed to interact with the nhentai API and retrieve doujinshi metadata and content information. It enables developers to programmatically access galleries, titles, tags, covers, and page URLs from the nhentai platform. The library supports both CommonJS and ES6 module imports, making it easy to integrate into different Node.js projects. Developers can use it to fetch specific doujin entries, explore associated metadata, and process gallery...
    Downloads: 47 This Week
    Last Update:
    See Project
  • 13
    Docspell

    Docspell

    Assist in organizing your piles of documents

    Docspell is a personal document organizer. Or sometimes called a "Document Management System" (DMS). You'll need a scanner to convert your papers into files. Docspell can then assist in organizing the resulting mess. It can unify your files from scanners, emails, and other sources. It is targeted for home use, i.e. families, households, and also for smaller groups/companies. You can associate tags, set correspondent,s and lots of other predefined and custom metadata. If your documents are...
    Downloads: 5 This Week
    Last Update:
    See Project
  • 14
    pdfly

    pdfly

    CLI tool to extract (meta)data from PDF and manipulate PDF files

    A Python library designed for manipulating PDF files with functionalities for extraction, transformation, and document generation.
    Downloads: 6 This Week
    Last Update:
    See Project
  • 15
    X-osint

    X-osint

    Open source OSINT tool for gathering data on emails, phones, and IPs

    X-osint is an open source intelligence framework designed to collect and analyze publicly available information from multiple sources. It focuses on gathering useful and credible data about entities such as phone numbers, email addresses, and IP addresses using a range of automated OSINT techniques. It provides investigators and researchers with a centralized interface for running information-gathering tasks that would normally require multiple separate tools. X-osint can also perform...
    Downloads: 47 This Week
    Last Update:
    See Project
  • 16
    Bili23 Downloader

    Bili23 Downloader

    Cross platform GUI tool for downloading videos from Bilibili sites

    Bili23-Downloader is an open source desktop application designed for downloading video content from the Bilibili platform. It provides a graphical interface that allows users to download various types of media including user-uploaded videos, series episodes, movies, and other hosted content. It focuses on ease of use with a zero-configuration setup, making it accessible to both beginners and experienced users. It supports high performance downloads through multi-threading and includes resume...
    Downloads: 19 This Week
    Last Update:
    See Project
  • 17
    Copybara

    Copybara

    Copybara: A tool for transforming and moving code between repositories

    ...The tool supports advanced transformations—such as file relocation, content replacement, and metadata adjustments—defined declaratively in configuration files. It operates in a stateless manner, storing synchronization state within commit metadata to ensure reproducibility and collaboration among multiple users. Copybara currently supports Git repositories (with experimental Mercurial support) and can be integrated with CI/CD systems or run manually.
    Downloads: 58 This Week
    Last Update:
    See Project
  • 18
    ExtractThinker

    ExtractThinker

    ExtractThinker is a Document Intelligence library for LLMs

    ExtractThinker is a tool designed to facilitate the extraction and analysis of information from various data sources, aiding in data processing and knowledge discovery.
    Downloads: 5 This Week
    Last Update:
    See Project
  • 19
    KaraKeep

    KaraKeep

    A self-hostable bookmark-everything app

    ...Automatic fetching of link titles, descriptions, and images streamlines saving content without manual edits, while rule-based management lets users define customized workflows. With support for image OCR and structured data extraction, Karakeep functions as a flexible personal knowledge base for researchers, content creators, and heavy bookmarkers.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 20
    MDCx

    MDCx

    Movie metadata scraper and organizer for media libraries and NFO

    MDCx is an open source media metadata scraping and organization tool designed to automate the process of collecting detailed information for movie files. It retrieves metadata from multiple online sources and applies it to local media collections, helping users maintain structured and well-organized libraries. MDCx can download information such as titles, cast data, artwork, and other metadata, then generate standardized NFO files compatible with media management systems. ...
    Downloads: 10 This Week
    Last Update:
    See Project
  • 21
    NeMo Retriever Library

    NeMo Retriever Library

    Document content and metadata extraction microservice

    NeMo Retriever Library is a scalable microservice framework designed for extracting, structuring, and enriching content from documents to support downstream generative AI applications. It processes various document types by splitting them into components such as text, tables, charts, and images, and then applies OCR and contextual analysis to convert them into structured data formats. The system is built on NVIDIA NIM microservices, enabling high-performance parallel processing and efficient...
    Downloads: 2 This Week
    Last Update:
    See Project
  • 22
    Coomer Downloader App

    Coomer Downloader App

    Coomer downloader

    CoomerDL is an open-source command-line tool designed to download and archive content from websites supported by the Coomer platform, which aggregates publicly available content from various creators. The tool enables users to automate the retrieval of media such as images, videos, and metadata associated with creator profiles. It is built to handle large volumes of content efficiently, allowing users to download entire collections or selectively retrieve specific posts. ...
    Downloads: 123 This Week
    Last Update:
    See Project
  • 23
    CommunityScrapers

    CommunityScrapers

    This is a public repository containing scrapers

    Stash Community Scrapers is a large open-source collection of metadata extraction tools designed to work with the Stash media management platform, enabling automated scraping of content information from various online sources. The repository contains hundreds of scraper definitions written primarily in YAML and Python, each tailored to extract structured metadata such as titles, performers, tags, and media details from specific websites.
    Downloads: 2 This Week
    Last Update:
    See Project
  • 24
    Atuin

    Atuin

    Magical shell history

    Atuin is a modern shell history replacement tool and CLI utility that records all your shell commands in a SQLite database alongside contextual metadata. It offers encrypted sync across devices, full-text search, usage statistics, and a desktop application to run executable runbooks as native documents.
    Downloads: 97 This Week
    Last Update:
    See Project
  • 25
    get_iplayer

    get_iplayer

    A utility for downloading TV and radio programmes from BBC iPlayer

    get_iplayer is a command-line tool for downloading and streaming content from BBC iPlayer and BBC Sounds. It provides access to TV programs and radio broadcasts available on the BBC's platforms and allows users to archive content for offline use. The tool includes search, recording, and metadata tagging features and is popular among users looking to maintain access to BBC content globally.
    Downloads: 38 This Week
    Last Update:
    See Project
  • Previous
  • You're on page 1
  • 2
  • 3
  • 4
  • 5
  • Next
MongoDB Logo MongoDB