migrate_process_newspaper3k
Migrate Process Newspaper3k provides a Migrate process plugin that requests and extracts data from Newspaper3k, the Python-based article download framework.
This plugin also works with Newspaper4k. Note that the Newspaper4k schema differs slightly, e.g. the text key becomes _text.
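The key rename can be absorbed with a small lookup helper. The following is a minimal Python sketch, assuming the article data has already been decoded into a dictionary. The helper name `get_article_field` and the sample dictionaries are hypothetical; only the `text` to `_text` rename comes from the note above.

```python
def get_article_field(article, key):
    """Return article[key], falling back to the underscore-prefixed key.

    Hypothetical helper: covers the Newspaper4k rename noted above
    (e.g. ``text`` becomes ``_text``). Returns None if neither key exists.
    """
    if key in article:
        return article[key]
    return article.get("_" + key)


# Sample data is made up for illustration.
newspaper3k_style = {"title": "Example", "text": "Body from Newspaper3k"}
newspaper4k_style = {"title": "Example", "_text": "Body from Newspaper4k"}

print(get_article_field(newspaper3k_style, "text"))
print(get_article_field(newspaper4k_style, "text"))
```

In a migration, the equivalent adjustment would presumably be to point the extract plugin's index at `_text` rather than `text` when the data comes from Newspaper4k.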
Newspaper3k Features
- Multi-threaded article download framework
- News url identification
- Text extraction from html
- Top image extraction from html
- All image extraction from html
- Keyword extraction from text
- Summary extraction from text
- Author extraction from text
- Google trending terms extraction
- Works in 10+ languages (English, Chinese, German, Arabic, …)
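For orientation, several of the features above map onto a handful of calls in the Newspaper3k Python API. The sketch below is illustrative and is not the code this module runs: `fetch_article` and `article_to_dict` are hypothetical names, and the field selection is an assumption. Calling `fetch_article` requires the `newspaper` package, network access, and the natural language corpora.

```python
from types import SimpleNamespace


def article_to_dict(article):
    """Copy a few attributes of an article object into a plain dict.

    The selected fields are an assumption chosen for illustration.
    """
    publish_date = getattr(article, "publish_date", None)
    return {
        "title": article.title,
        "text": article.text,
        "authors": list(article.authors),
        "top_image": article.top_image,
        "keywords": list(article.keywords),
        "summary": article.summary,
        # publish_date may be a datetime or None; store it as a string.
        "publish_date": str(publish_date) if publish_date else "",
    }


def fetch_article(url):
    """Download, parse, and analyse one article with Newspaper3k."""
    # Imported lazily so the rest of the sketch runs without the package.
    from newspaper import Article

    article = Article(url)
    article.download()  # fetch the HTML
    article.parse()     # extract title, text, authors, images
    article.nlp()       # derive keywords and summary
    return article_to_dict(article)


# Stand-in object so the conversion can be tried without network access.
fake = SimpleNamespace(
    title="Example", text="Body", authors=["A. Writer"],
    top_image="https://example.com/a.jpg", keywords=["example"],
    summary="Short.", publish_date=None,
)
print(article_to_dict(fake))
```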
Prerequisites
A LAMP server with Python 3 support.
See https://github.com/2dareis2do/newspaper3k-php-wrapper for installation and setup instructions.
Example Usage
process:
  'body/value':
    -
      plugin: migrate_process_js_redirect_link
      source: link
    -
      plugin: migrate_process_newspaper3k
    -
      plugin: skip_on_empty
      method: row
      message: 'migrate_process_newspaper3k import failed'
    -
      plugin: extract
      index:
        - summary
Schema Keys
See the included stub JSON for the full schema. Notable keys include:
summary (string)
source_url (string)
url (string)
title (string)
top_img (string)
top_image (string)
meta_img (string)
imgs (array)
images (array)
movies (array)
text (string)
keywords (array)
meta_keywords (array)
tags (array)
authors (array)
publish_date (string)
html (string)
meta_data (array object)
meta_description (string)
article_html (string)
top_node (string)
doc (string)
DDEV support
If you run DDEV locally, add the following to the config.yaml in your project's .ddev folder to install Newspaper3k and the natural language data sets (corpora) it needs.
Newspaper3k
hooks:
  post-start:
    - exec: 'pip3 install newspaper3k'
    - exec: 'curl https://raw.githubusercontent.com/codelucas/newspaper/master/download_corpora.py | python3'
Newspaper4k
hooks:
  post-start:
    - exec: 'pip3 install newspaper4k'
    - exec: 'pip3 install typing-extensions'
    - exec: 'pip3 install lxml_html_clean'
Newer DDEV versions (1.23)
webimage_extra_packages: [python3, python-is-python3, python3-pip, python3-typing-extensions]
hooks:
  post-start:
    - exec: 'pip3 install newspaper4k --break-system-packages'
    - exec: 'pip3 install lxml_html_clean --break-system-packages'
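After the hooks have run, it can be useful to confirm that the Python packages are importable inside the web container. A minimal sketch using only the standard library follows. The function name `missing_modules` is hypothetical, and the module names listed are assumptions about how the pip packages above are imported (`newspaper` for both Newspaper3k and Newspaper4k).

```python
import importlib.util


def missing_modules(names):
    """Return the module names from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Assumed import names for the packages installed by the hooks above.
    required = ["newspaper", "lxml_html_clean", "typing_extensions"]
    missing = missing_modules(required)
    if missing:
        print("Missing Python modules: " + ", ".join(missing))
    else:
        print("All required Python modules are importable.")
```

Saved under a file name of your choosing (for example check_newspaper.py, a hypothetical name), it could be run with `ddev exec python3 check_newspaper.py`.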
More info
For more info on how Newspaper3k parses articles please see https://newspaper.readthedocs.io/en/latest/user_guide/quickstart.html#pa...