doc_to_html
Doc to HTML — Usage and Configuration Guide (2.x)
The Doc to HTML module allows editors to upload DOC/DOCX files
directly from a node edit form and automatically convert them into HTML stored in a
text_long or text_with_summary field.
Requirements
The module relies on LibreOffice being available on the server and callable
from the command line. Drupal invokes LibreOffice via a configurable command (for example
soffice or libreoffice) to perform the DOC/DOCX → HTML conversion.
- Install LibreOffice on the same environment where your Drupal PHP runtime
  is executed (web server or container).
- Make sure the LibreOffice binary is available in the PATH of the web/PHP user, or
  note its full absolute path (e.g. /usr/bin/libreoffice).
- From the command line, you should be able to run a simple test such as
  libreoffice --version (or the equivalent binary) without errors.
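The availability check described above can also be scripted. The following is a minimal sketch, not part of the module: the function names find_libreoffice and libreoffice_version are hypothetical, and it assumes the binary is called libreoffice or soffice as mentioned in this guide.

```python
import shutil
import subprocess


def find_libreoffice(candidates=("libreoffice", "soffice")):
    """Return the absolute path of the first candidate found on PATH, or None."""
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    return None


def libreoffice_version(binary, timeout=30):
    """Run `<binary> --version` and return its output; raises if the call fails."""
    result = subprocess.run(
        [binary, "--version"],
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,  # a non-zero exit code raises CalledProcessError
    )
    return result.stdout.strip()
```

Run it as the same user that executes PHP: if find_libreoffice() returns None, the binary is not on that user's PATH and the full absolute path should be configured instead.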
Example DDEV configuration (config.yaml)
On local environments that use DDEV, you can install LibreOffice inside the
web container by adding an extra package to your project's .ddev/config.yaml:
    name: doc-to-html
    type: drupal11
    docroot: web
    php_version: "8.3"
    webserver_type: nginx-fpm
    webimage_extra_packages:
      - libreoffice
With this configuration, DDEV will install libreoffice inside the web container,
so that the Doc to HTML module can execute the LibreOffice command via CLI.
Global Module Configuration
Before using the widget on content types, you must configure the global settings
used by the Doc to HTML module. These settings are shared by all fields that use
the widget.
Typical global configuration steps:
- Define the LibreOffice command
  Specify the command that Drupal should execute to run LibreOffice.
  This can be either:
  - The bare command name (e.g. libreoffice or soffice), if it is in the PATH.
  - The full path to the binary (e.g. /usr/bin/libreoffice).
- Define base settings for all fields
  Configure the base (global) options that will be inherited by every field using the
  Doc to HTML widget. These may include:
  - Default output format produced by LibreOffice.
  - Default body extraction regex (to extract the content inside the <body> tag).
  - Optional DOM post-processing rules to clean or transform the generated HTML.
  - Timeouts or command options for the conversion process.
These global settings act as a “baseline” that individual fields can reuse.
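To make the role of the command, output format, and timeout settings concrete, here is a hedged sketch of how a conversion call could be assembled from them. It is an illustration only: the helper names are hypothetical, and the exact arguments the module passes are not documented here. The --headless, --convert-to, and --outdir options are standard LibreOffice command-line options.

```python
import shlex
import subprocess


def build_convert_command(command, source_path, out_dir, output_format="html"):
    """Build an argv list for a headless LibreOffice conversion.

    `command` is the configured value, e.g. "soffice" or "/usr/bin/libreoffice".
    """
    argv = shlex.split(command)  # tolerates a configured command with extra options
    argv += [
        "--headless",
        "--convert-to", output_format,
        "--outdir", str(out_dir),
        str(source_path),
    ]
    return argv


def run_conversion(argv, timeout=60):
    """Execute the conversion; raises on a non-zero exit code or on timeout."""
    return subprocess.run(argv, capture_output=True, text=True, timeout=timeout, check=True)
```

Passing an argument list rather than a single shell string avoids quoting problems with uploaded file names that contain spaces.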
Testing the Conversion with TestWizard
The module provides a TestWizard tool to validate your configuration
and see a live preview of the extraction and conversion process.
This is useful to verify that LibreOffice is correctly reachable and that
the regex settings behave as expected.
Using the TestWizard typically involves:
- Navigate to the Doc to HTML test wizard page in the Drupal administration UI.
- Upload a sample DOC or DOCX document using the wizard form.
- Run the conversion to see:
  - The raw HTML generated by LibreOffice.
  - The extracted <body> segment after applying the body regex.
  - The final HTML after any DOM regex or post-processing pipeline is executed.
The TestWizard uses the global configuration, so any changes you make there will
be reflected in the test results. Once the test behaves as expected, you can move on
to enabling the widget on content fields.
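The three stages shown by the TestWizard (raw HTML, extracted body, final HTML) can be pictured with a small sketch. The patterns below are assumptions chosen for illustration, not the module's defaults; adapt them to the regex values in your own configuration.

```python
import re

# Assumed body extraction pattern: capture everything between <body ...> and </body>.
BODY_REGEX = re.compile(r"<body[^>]*>(.*?)</body>", re.IGNORECASE | re.DOTALL)

# Assumed cleanup rules: (pattern, replacement) pairs applied in order.
CLEANUP_RULES = [
    (re.compile(r'\s(?:style|class|lang)="[^"]*"', re.IGNORECASE), ""),
    (re.compile(r"<font[^>]*>|</font>", re.IGNORECASE), ""),
]


def extract_body(raw_html):
    """Return the content inside <body>, or the input unchanged if no body is found."""
    match = BODY_REGEX.search(raw_html)
    return match.group(1).strip() if match else raw_html


def clean_html(fragment):
    """Apply each cleanup rule in sequence to the extracted fragment."""
    for pattern, replacement in CLEANUP_RULES:
        fragment = pattern.sub(replacement, fragment)
    return fragment.strip()
```

Checking each stage separately, as the TestWizard does, makes it easier to tell whether an unexpected result comes from LibreOffice, from the body regex, or from a cleanup rule.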
Enabling the Widget on Content Fields
After configuring the global settings and validating them with the TestWizard,
you can enable the Doc to HTML widget on specific fields in your content types.
The widget is designed to work with:
- text_long fields
- text_with_summary fields
To enable the widget for a given field:
- Go to Structure → Content types and select the content type
  you want to configure.
- Open the Manage form display tab.
- Locate the text_long or text_with_summary field that should
  receive the converted HTML from DOC/DOCX files.
- In the widget selector for that field, choose
  “Doc to HTML” as the widget.
- Save the form display settings.
How the Doc to HTML Widget Works
Once the Doc to HTML widget is enabled for a field, the node edit form changes its behavior:
- The widget adds a virtual upload field used only to upload the
  DOC/DOCX file. This virtual field is part of the widget UI and is not stored
  as a separate field on the node.
- When an editor uploads a document and saves or updates the node, the module:
  - Receives the uploaded DOC/DOCX file through the virtual upload element.
  - Calls LibreOffice (using the globally configured command) to convert the file to HTML.
  - Extracts the <body> content using the configured body regex.
  - Optionally applies DOM regex rules or a post-processing pipeline for cleanup and normalization.
  - Writes the final HTML into the target text_long or text_with_summary field of the node.
- The uploaded file itself is treated as an intermediate input used only for the conversion.
  The canonical source of content for the node becomes the converted HTML stored in the text field.
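The save-time behavior above, where the upload is only an intermediate input and the HTML is what ends up stored, can be sketched as follows. This is a language-neutral illustration in Python, not the module's PHP implementation: import_document, the dict standing in for a node, and the injected convert and postprocess callables are all hypothetical.

```python
import tempfile
from pathlib import Path


def import_document(node, field_name, upload_name, upload_bytes,
                    convert, postprocess=lambda html: html):
    """Convert an uploaded document and store only the resulting HTML on the node.

    `convert` takes the path of the temporary source file and returns raw HTML;
    `postprocess` stands in for body extraction and cleanup.
    """
    with tempfile.TemporaryDirectory() as work_dir:
        # Keep only the file name so the upload cannot escape the work directory.
        source = Path(work_dir) / Path(upload_name).name
        source.write_bytes(upload_bytes)
        raw_html = convert(source)
        node[field_name] = postprocess(raw_html)
    # The temporary directory, including the uploaded file, is gone at this point.
    return node
```

Only the target text field changes on the node; no file field is added, which mirrors the "virtual upload field" described above.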
Typical Workflow Summary
- Install LibreOffice on the server (or container) and ensure it is callable from the command line.
- Configure the global Doc to HTML settings:
  - LibreOffice command / binary path.
  - Default body extraction regex.
  - Optional DOM regex / post-processing options.
- Use the TestWizard to upload sample documents and validate:
  - That LibreOffice works as expected.
  - That the extraction and cleanup produce the desired HTML.
- On each content type that should support DOC/DOCX import:
  - Use a text_long or text_with_summary field.
  - On Manage form display, select the Doc to HTML widget for that field.
- Editors can now:
  - Upload DOC/DOCX files via the Doc to HTML widget on the node edit form.
  - Let the module convert and inject the HTML into the text field automatically.
  - Optionally refine the converted HTML directly in the editor (CKEditor) if needed.
With this setup, the Doc to HTML module provides a consistent and configurable workflow
to convert Word documents into clean HTML content inside Drupal, leveraging LibreOffice
and the Drupal Form API.