unstructured
Unstructured is an open source service, also available as a SaaS, that uses machine learning to efficiently extract your data into usable text and images. It currently handles plain text files (.txt/.text), PDFs (.pdf), Word documents (.doc/.docx), PowerPoints (.ppt/.pptx), images (.jpg/.jpeg), emails (.eml/.msg), HTML (.html) and Markdown files (.md).
The Unstructured core module is a simple API module that can be extended by any service.
Version 2.0 comes with Automator types that work together with the AI Automator module, which ships with the AI module. If you are just starting to use chaining/automation, please use this instead of the AI Interpolator module, which is being sunset.
Version 1.0 comes with a submodule that can be used together with the AI Interpolator to take any of these file types and fill a long formatted text or long plain text field with the structured content.
Features
- Import txt, pdf, doc, ppt, jpg, eml, html or md into a text field.
- Output can be in plain text, markdown or html.
- With markdown and html, the images inside the document also get extracted.
- Extract tables from Excel, PDFs, Word Files, Images into a TableField.
- Extract images from PDFs and Images into image fields.
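As a rough illustration of the import formats listed above, the following Python sketch checks whether a file name has one of the documented extensions. It is not part of the module; the names SUPPORTED_EXTENSIONS and is_supported are hypothetical, and the set only mirrors the extensions named in the introduction (it does not cover the Excel files mentioned for table extraction).

```python
from pathlib import Path

# Extensions taken from the list in the introduction of this page.
# Hypothetical helper for illustration only; the module does its own checks.
SUPPORTED_EXTENSIONS = {
    ".txt", ".text",   # plain text
    ".pdf",            # PDFs
    ".doc", ".docx",   # Word documents
    ".ppt", ".pptx",   # PowerPoints
    ".jpg", ".jpeg",   # images
    ".eml", ".msg",    # emails
    ".html",           # HTML
    ".md",             # Markdown
}


def is_supported(filename: str) -> bool:
    """Return True if the file's final extension is in the documented list."""
    # Path.suffix only looks at the last extension; compare case-insensitively.
    return Path(filename).suffix.lower() in SUPPORTED_EXTENSIONS


if __name__ == "__main__":
    for name in ("report.PDF", "slides.pptx", "data.csv"):
        print(name, is_supported(name))
```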
Post-Installation
Visit admin/config/unstructured/settings to set up whether you want to connect to your own Unstructured machine or to the SaaS. If it's the SaaS, an API key is required as well.
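To make the two connection modes more concrete, here is a minimal Python sketch of how a client might build the request target and headers for either a self-hosted server or the SaaS. It is not how the module itself is implemented. The endpoint path general/v0/general and the unstructured-api-key header are assumptions based on the publicly documented Unstructured API, and the host names and port in the usage lines are placeholders; check the documentation of your own server or account for the real values.

```python
from typing import Dict, Optional, Tuple
from urllib.parse import urljoin

# Assumption: path of the partition endpoint exposed by the Unstructured API.
PARTITION_PATH = "general/v0/general"


def build_partition_request(
    base_url: str, api_key: Optional[str] = None
) -> Tuple[str, Dict[str, str]]:
    """Return (url, headers) for a partition call against base_url.

    Hypothetical helper: a self-hosted server is assumed to need no key,
    while the SaaS is assumed to expect one in a request header.
    """
    # Normalise the base so urljoin appends the path instead of replacing it.
    url = urljoin(base_url.rstrip("/") + "/", PARTITION_PATH)
    headers = {"accept": "application/json"}
    if api_key:
        # Assumption: header name used by the Unstructured API for the key.
        headers["unstructured-api-key"] = api_key
    return url, headers


if __name__ == "__main__":
    # Placeholder hosts; replace with your own server or SaaS address.
    print(build_partition_request("http://localhost:8000"))
    print(build_partition_request("https://example.invalid/", "my-secret-key"))
```

The file to be processed would then be sent to that URL as a multipart form upload; that part is left out here to keep the sketch free of third-party dependencies.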
DDEV/Self-hosted
Roberto Peruzzo added a DDEV plugin that can be used as a starting point to get it working locally. Check out https://github.com/robertoperuzzo/ddev-unstructured for instructions.
Additional Requirements
You need an Unstructured server or an account on the SaaS (or free trial).