file_extractor
Synopsis
This module adds a new computed field on the File entity: "File extractor: extracted file".
This field gives access to the extracted content of the file:
- in webservices like JSON:API
- in a field formatter (file field)
- in Search API
The module provides the following extraction methods:
- Docconv binary
- Pdftotext binary
- Python Pdf2txt binary
- Solr built-in extractor (Search API Solr)
- Tika App JAR
- Tika Server JAR
History
This project is a fork of Search API Attachments. For more information on the module's origins, see #3126845: Version 2.0.0.
Requirements
Each extractor plugin can require different modules or libraries. If its requirements are not satisfied, the plugin does not show up in the settings.
Each extractor plugin can also require a different binary on your server. When you configure the extraction, a test is run to check that the extraction works. See the module documentation for installation instructions for each extractor plugin.
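Before opening the settings page, it can help to check which extractor binaries are present on the server. The following is a minimal sketch in shell, not part of the module. It assumes the binaries are on PATH under the names docconv, pdftotext, pdf2txt.py and java (for the Tika JARs). These names are assumptions and may differ on your system, and the check does not replace the module's own extraction test.

```shell
#!/bin/sh
# Report whether a given binary can be found on PATH.
# Returns 0 when found, 1 when missing.
check_bin() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "found: $1"
    return 0
  fi
  echo "missing: $1"
  return 1
}

# Candidate binary names are assumptions; adjust them to your installation.
for bin in docconv pdftotext pdf2txt.py java; do
  check_bin "$bin" || true
done
```

A binary reported as missing only means it is not on the PATH of the user running the script. The web server user may see a different PATH.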
Configuration
- Enable the File Extractor module on your site.
- Go to the configuration page (/admin/config/media/file-extractor) and configure the extraction settings.
The module provides its own cache bin, 'file_extractor', so you can override the cache backend for this bin in your settings.php file. For example, to use the File Cache module:
$settings['cache']['bins']['file_extractor'] = 'cache.backend.file_system';