Trafilatura
Trafilatura is a Data & Analytics tool that gathers text and metadata from the web. It supports crawling, scraping, and extraction, and outputs data in formats such as CSV, JSON, HTML, and XML.
About
Trafilatura is a Python package and command-line tool for web data extraction. It converts raw HTML into structured data and covers web crawling, scraping, and extraction of main text, metadata, and comments. The tool is configurable and aims to balance precision (limiting noise) against recall (including all valid content). It uses sitemaps and feeds for text discovery, processes both online and offline input, and supports output as TXT, Markdown, CSV, JSON, HTML, XML, and XML-TEI. The project reports adoption by companies and institutions, as well as strong results against other open-source libraries in text extraction benchmarks.
Pricing & Plans
Open Source
Free