
Migrate Process Newspaper3k provides a Migrate process plugin that requests and extracts data from the Python-based Newspaper3k article download framework.

This plugin also works with Newspaper4k. Note that the Newspaper4k schema differs slightly; for example, the text key becomes _text.
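
Because the two libraries name the article text differently, code that consumes the decoded output may want to accept either key. The following is a minimal Python sketch, assuming the output has already been decoded into a dictionary. `article_text` is a hypothetical helper and not part of this module; only the text / _text difference comes from the note above.

```python
from typing import Any, Mapping, Optional


def article_text(article: Mapping[str, Any]) -> Optional[str]:
    """Return the article body from either schema, or None if absent.

    Hypothetical helper: checks the Newspaper3k key ("text") first,
    then the Newspaper4k key ("_text").
    """
    for key in ("text", "_text"):
        value = article.get(key)
        # Accept only non-empty strings so an empty "text" falls through.
        if isinstance(value, str) and value:
            return value
    return None
```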

Newspaper3k Features

  1. Multi-threaded article download framework
  2. News URL identification
  3. Text extraction from HTML
  4. Top image extraction from HTML
  5. All image extraction from HTML
  6. Keyword extraction from text
  7. Summary extraction from text
  8. Author extraction from text
  9. Google trending terms extraction
  10. Works in 10+ languages (English, Chinese, German, Arabic, …)

Prerequisites

A LAMP server with Python 3 support.

See https://github.com/2dareis2do/newspaper3k-php-wrapper for installation and setup instructions.

Example Usage

process:
   'body/value':
     -
       plugin: migrate_process_js_redirect_link
       source: link
     -
       plugin: migrate_process_newspaper3k
     -
       plugin: skip_on_empty
       method: row
       message: 'migrate_process_newspaper3k import failed'
     -
       plugin: extract
       index:
         - summary
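
The chain above passes each plugin's output to the next one. The Python sketch below models only that flow, under stated assumptions: the function names are hypothetical stand-ins, the `fetch` callable replaces a real Newspaper3k request, and the behaviour of skip_on_empty and extract is simplified from the plugins of the same name. The first plugin in the chain is left out because its behaviour is not described here.

```python
from typing import Any, Callable, Dict, List, Optional


class SkipRow(Exception):
    """Models skip_on_empty with "method: row": the whole row is skipped."""


def skip_on_empty(value: Any, message: str) -> Any:
    # An empty value (None, "", {}, []) aborts processing of the row.
    if not value:
        raise SkipRow(message)
    return value


def extract(value: Dict[str, Any], index: List[str]) -> Any:
    # Walk the keys listed under "index", one nesting level per entry.
    for key in index:
        value = value[key]
    return value


def process_body(link: str,
                 fetch: Callable[[str], Optional[Dict[str, Any]]]) -> Any:
    # "fetch" stands in for migrate_process_newspaper3k (assumption: it
    # returns the decoded article data, or nothing on failure).
    article = fetch(link)
    article = skip_on_empty(article, "migrate_process_newspaper3k import failed")
    return extract(article, ["summary"])
```

With a stub such as `{"summary": "Short summary."}` returned by `fetch`, `process_body` yields the summary string; when `fetch` returns nothing, `SkipRow` is raised with the configured message.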

Schema Keys

See the included stub JSON for the full schema. Keys of note include:

summary (string)
source_url (string)
url (string)
title (string)
top_img (string)
top_image (string)
meta_img (string)
imgs (array)
images (array)
movies (array)
text (string)
keywords (array)
meta_keywords (array)
tags (array)
authors (array)
publish_date (string)
html (string)
meta_data (array object)
meta_description (string)
article_html (string)
top_node (string)
doc (string)
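
A small type check is one way to see whether a decoded response matches the keys listed above. This sketch assumes the JSON has been decoded in Python, where "string" maps to `str` and "array" maps to `list`. `EXPECTED_TYPES` covers only a handful of the keys, and `schema_problems` is a hypothetical name. Real responses may leave some values empty or null, which this check would report as a mismatch.

```python
import json
from typing import Any, Dict, List

# Subset of the keys listed above, mapped to their Python types (assumption).
EXPECTED_TYPES = {
    "title": str,
    "text": str,
    "summary": str,
    "top_image": str,
    "authors": list,
    "keywords": list,
    "images": list,
}


def schema_problems(article: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable mismatches; empty means all matched."""
    problems = []
    for key, expected in EXPECTED_TYPES.items():
        if key not in article:
            problems.append(f"missing key: {key}")
        elif not isinstance(article[key], expected):
            problems.append(f"{key}: expected {expected.__name__}")
    return problems


if __name__ == "__main__":
    # Illustrative payload only, not the module's stub file.
    payload = json.loads('{"title": "Example", "authors": "Jane Doe"}')
    for problem in schema_problems(payload):
        print(problem)
```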

DDEV support

If you are running DDEV locally, you can add the following to the config.yaml in your project's .ddev folder to install Newspaper3k, its supporting libraries, and the necessary natural language data sets (corpora).

Newspaper3k

hooks:
    post-start:
        - exec: 'pip3 install newspaper3k'
        - exec: 'curl https://raw.githubusercontent.com/codelucas/newspaper/master/download_corpora.py | python3'

Newspaper4k

hooks:
    post-start:
        - exec: 'pip3 install newspaper4k'
        - exec: 'pip3 install typing-extensions'
        - exec: 'pip3 install lxml_html_clean'

Newer DDEV (1.23)

webimage_extra_packages: [python3, python-is-python3, python3-pip, python3-typing-extensions]

hooks:
    post-start:
        - exec: 'pip3 install newspaper4k --break-system-packages'
        - exec: 'pip3 install lxml_html_clean --break-system-packages'
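
After the hooks have run, it can be useful to confirm that the Python packages are importable inside the web container. The following is a minimal sketch, assuming the import names `newspaper`, `lxml_html_clean`, and `typing_extensions`; `missing_modules` is a hypothetical helper and not part of this module or of DDEV.

```python
import importlib.util
from typing import Iterable, List


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Import names are assumptions based on the packages installed above.
    missing = missing_modules(["newspaper", "lxml_html_clean", "typing_extensions"])
    print("missing:", ", ".join(missing) if missing else "none")
```

Saved to a file, the script could be run in the container with python3 to list any packages that still need installing.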

More info

For more information on how Newspaper3k parses articles, see https://newspaper.readthedocs.io/en/latest/user_guide/quickstart.html#pa...

Releases

Version Type Release date
1.1.0 Stable Feb 11, 2025