AI File to Text automatically extracts content from uploaded document files and converts them to plain text, HTML, Markdown, or structured JSON. Built on an extensible extractor architecture, it integrates with the Drupal AI module and Document Loader module — providing AI Automator plugins, AI Agent function calls, and a Document Loader plugin as three consumers that all flow through the same extraction pipeline. No external services, APIs, or server-side applications required. Everything runs in pure PHP on your server.

Features

  • 10 file extensions supported out of the box across 7 extractors: Word (.docx, .doc), OpenDocument (.odt), PDF (.pdf), Spreadsheets (.xlsx, .xls, .ods, .csv), Plain Text (.txt), and Markdown (.md).
  • 4 output formats: plain text, styled HTML, Markdown, or structured JSON.
  • HTML output preserves headings, bold, italic, underline, font sizes, colors, links, lists, and tables.
  • JSON output produces a structured DOM tree ({"tag": "p", "attributes": {...}, "children": [...]}) for non-tabular files, or an array of objects keyed by column headers for spreadsheets/CSV.
  • Document Loader plugin (document_loader:file) — enables any third-party code to load documents programmatically via the Document Loader API.
  • AI Automator plugins for text_long and string_long fields — upload a file and the text is extracted automatically on entity save.
  • AI Agent function call (file_to_text) — AI agents can read and process documents autonomously.
  • Extensible architecture — other modules can register new file-type extractors as tagged services without modifying this module.
  • Dynamic type registration — extractor types and output capabilities are automatically discovered and registered with the Document Loader plugin system.
  • Per-type output accuracy — when extractors have different output capabilities, plugin definitions are automatically split into capability groups so that getLoaderByType() returns accurate results (no false positives).
  • No external services, APIs, or server-side applications required.
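The two JSON shapes described in the list above can be illustrated with a short, self-contained PHP sketch. This is not the module's code: the helper names `node()` and `csv_to_records()` are hypothetical, and representing text children as plain strings is an assumption, since the source only shows the outline of a node.

```php
<?php

/**
 * Builds one DOM-tree node in the {"tag", "attributes", "children"} shape.
 * Hypothetical helper: text children are assumed to be plain strings.
 */
function node(string $tag, array $attributes = [], array $children = []): array {
  // Cast to object so an empty attribute set encodes as {} rather than [].
  return ['tag' => $tag, 'attributes' => (object) $attributes, 'children' => $children];
}

/**
 * Converts CSV text into a list of records keyed by the header row.
 * Simplification: splits on line breaks, so quoted multi-line cells
 * are not handled.
 */
function csv_to_records(string $csv): array {
  $lines = preg_split('/\R/', trim($csv));
  $headers = str_getcsv(array_shift($lines), ',', '"', '\\');
  $records = [];
  foreach ($lines as $line) {
    if ($line === '') {
      continue;
    }
    $cells = str_getcsv($line, ',', '"', '\\');
    // Pad or trim so every row has exactly one cell per header.
    $cells = array_slice(array_pad($cells, count($headers), ''), 0, count($headers));
    $records[] = array_combine($headers, $cells);
  }
  return $records;
}

// Non-tabular shape: a nested tree of nodes.
echo json_encode(node('p', [], ['Hello ', node('strong', [], ['world'])])), "\n";

// Tabular shape: one object per row, keyed by column header.
echo json_encode(csv_to_records("name,qty\nApple,3\nPear,5")), "\n";
```

Keying rows by header keeps spreadsheet output self-describing, while the tree shape keeps enough structure to rebuild headings, lists, and links from non-tabular files.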

Architecture

All consumers go through a single unified path:

Consumers (Automator / FunctionCall / Document Loader API)
  └─ FileDocumentLoader  (document_loader plugin)
       └─ FileExtractorManager  (routes by file extension)
            └─ Extractors  (auto-discovered tagged services)
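
The routing step in the diagram above can be sketched in plain PHP. This is a minimal illustration of dispatching by file extension, not the module's implementation: `ExtractorInterface`, `PlainTextExtractor`, and `ExtractorManager` are hypothetical names, and in the real module the extractors are collected from tagged services rather than added by hand.

```php
<?php

/** Hypothetical extractor contract; the module's real interface may differ. */
interface ExtractorInterface {

  /** @return string[] Lowercase extensions, without the leading dot. */
  public function extensions(): array;

  /** Returns the extracted text for the file at $path. */
  public function extract(string $path): string;

}

/** Trivial extractor that returns the file contents unchanged. */
final class PlainTextExtractor implements ExtractorInterface {

  public function extensions(): array {
    return ['txt'];
  }

  public function extract(string $path): string {
    return (string) file_get_contents($path);
  }

}

/** Routes a file to the extractor registered for its extension. */
final class ExtractorManager {

  /** @var array<string, ExtractorInterface> */
  private array $byExtension = [];

  public function addExtractor(ExtractorInterface $extractor): void {
    foreach ($extractor->extensions() as $extension) {
      $this->byExtension[strtolower($extension)] = $extractor;
    }
  }

  public function extract(string $path): string {
    // Match case-insensitively so "REPORT.TXT" and "report.txt" route alike.
    $extension = strtolower(pathinfo($path, PATHINFO_EXTENSION));
    if (!isset($this->byExtension[$extension])) {
      throw new InvalidArgumentException("No extractor registered for '.$extension' files.");
    }
    return $this->byExtension[$extension]->extract($path);
  }

}

// Usage: register an extractor, then let the manager pick it by extension.
$manager = new ExtractorManager();
$manager->addExtractor(new PlainTextExtractor());
$file = sys_get_temp_dir() . '/example_' . uniqid() . '.txt';
file_put_contents($file, 'Hello from a text file.');
echo $manager->extract($file), "\n";
unlink($file);
```

Because the manager only knows the contract, supporting a new file type means adding one more extractor and leaving the consumers untouched.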

Installation

To enable Poppler support in a DDEV environment, add the following to .ddev/config.yaml:

webimage_extra_packages:
  - poppler-utils

Extending — Adding Custom Extractors

Other modules can add support for new file types by registering their own extractor as a tagged service, without modifying this module. See the module's README.md for a full extractor implementation example.

Requirements

PHP libraries, including smalot/pdfparser, are installed automatically via Composer.

Optional system package for improved PDF extraction:

  • poppler-utils — When installed, the module uses pdftohtml for higher-quality PDF output with better table, link, and style detection. Falls back to smalot/pdfparser if not available.
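
The fallback described above amounts to checking whether the pdftohtml binary is available and choosing a backend accordingly. The sketch below shows one way to do that check in plain PHP. It is an assumption about the approach, not the module's actual detection logic, and `find_executable()` and `choose_pdf_backend()` are hypothetical helper names.

```php
<?php

/**
 * Returns the full path of an executable found on PATH, or NULL.
 * Simplification: does not try Windows suffixes such as ".exe".
 */
function find_executable(string $name, ?string $path = null): ?string {
  $path ??= (string) getenv('PATH');
  foreach (explode(PATH_SEPARATOR, $path) as $directory) {
    if ($directory === '') {
      continue;
    }
    $candidate = $directory . DIRECTORY_SEPARATOR . $name;
    if (is_file($candidate) && is_executable($candidate)) {
      return $candidate;
    }
  }
  return null;
}

/** Prefers pdftohtml when a binary was found, else the pure-PHP parser. */
function choose_pdf_backend(?string $pdftohtmlPath): string {
  return $pdftohtmlPath !== null ? 'pdftohtml' : 'smalot/pdfparser';
}

// Usage: report which backend would be used on this machine.
$backend = choose_pdf_backend(find_executable('pdftohtml'));
echo "PDF backend: $backend\n";
```

Searching PATH directly avoids spawning a shell just to test for the binary, and the pure-PHP parser stays available as the default when nothing is found.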

Recommended modules

  • AI Agents — Enables AI agent workflows where agents can call file_to_text to read and process uploaded documents autonomously.
  • AI Context — Provides additional context to AI operations, useful when combining file extraction with other AI tasks.

Similar projects

  • AI Simple PDF to Text — Handles PDF files, converts to plain text only.
  • Unstructured — Supports a similar range of file types but requires an external Unstructured API server or a SaaS account.

Supporting this Module

Contributions, bug reports, and feature requests are welcome in the issue queue.

Community Documentation

Documentation, architecture details, and usage examples are included in the module's README.md file.

Activity

Total releases: 1
First release: Feb 2026
Latest release: 2 months ago
Stability: 0% stable

Releases

Version      Type   Release date
1.0.x-dev    Dev    Feb 11, 2026