migrate_process_newspaper_playwright
This module provides a Migrate process plugin that lets you request and extract data from the Python-based Newspaper web scraper running behind a Playwright wrapper.
Features
- Ability to migrate content using Newspaper3k/4k with a Playwright wrapper.
- Define a custom path to the `ArticleScraping.py` script by passing `cwd` (current working directory).
- Support for utilising your own `ArticleScraping.py` script per migration.
Prerequisite
A web server capable of running a Python 3 script.
See newspaper-playwright-wrapper for installation and setup instructions.
On older distributions it may not be possible to run the latest version of Playwright, because it depends on installing the latest web browser builds, which in turn require glibc 2.27 or later. For example, the latest version that I found to be compatible with CentOS 7 (which predates glibc 2.27) is Playwright 1.30:
pip3 install playwright==1.30.0
Note that earlier versions of Playwright are not as fully featured as later versions, i.e. they offer fewer commands.
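As a rough way to check which case applies on a given server, the sketch below compares the C library version against the 2.27 threshold mentioned above. The helper name `needs_pinned_playwright` is hypothetical, and `platform.libc_ver()` can return empty strings on systems that do not use glibc, so treat this as a starting point rather than a definitive test.

```python
import platform

# Threshold taken from the note above: recent Playwright browsers need glibc 2.27+.
MIN_GLIBC = (2, 27)


def parse_version(version: str) -> tuple:
    """Turn a dotted version string such as '2.17' into a tuple of ints."""
    return tuple(int(part) for part in version.split(".") if part.isdigit())


def needs_pinned_playwright(libc_version: str) -> bool:
    """Return True when the given glibc version is older than 2.27.

    Hypothetical helper for illustration; an empty or unparsable version
    returns False because nothing can be concluded from it.
    """
    parsed = parse_version(libc_version)
    if not parsed:
        return False
    return parsed < MIN_GLIBC


if __name__ == "__main__":
    lib, version = platform.libc_ver()
    if lib == "glibc" and needs_pinned_playwright(version):
        print(f"glibc {version} detected: consider playwright==1.30.0")
    else:
        print(f"libc reported as {lib!r} {version!r}: no pin suggested by this check")
```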
Example Usage
process:
  _processed_newspaper_playwright:
    -
      source: link
      plugin: migrate_process_newspaper_playwright
      debug: true # default false
      cwd: '../python' # default null
  _processed_title:
    -
      plugin: get
      source: '@_processed_newspaper_playwright'
    -
      plugin: extract
      default: "title"
      index:
        - _title
  _processed_description:
    -
      plugin: get
      source: '@_processed_newspaper_playwright'
    -
      plugin: extract
      default: "description"
      index:
        - _text
In this example the cwd (current working directory) is defined relative to the docroot (e.g. /var/www/html/web).
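To make the relative path concrete, this small sketch resolves a `cwd` value against a docroot the way a shell would. The function name `resolve_cwd` is hypothetical and the docroot is just the example value given above; the plugin's own resolution logic is not shown here.

```python
import posixpath


def resolve_cwd(docroot: str, cwd: str) -> str:
    """Resolve a relative cwd against the docroot (hypothetical helper)."""
    return posixpath.normpath(posixpath.join(docroot, cwd))


print(resolve_cwd("/var/www/html/web", "../python"))  # /var/www/html/python
```

With the example values, the script would therefore be looked up in a `python` directory that sits next to the docroot rather than inside it.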
Schema Keys
_summary (string)
_source_url (string)
_url (string)
_title (string)
_top_img (string)
_top_image (string)
_meta_img (string)
_imgs (array)
_images (array)
_movies (array)
_text (string)
_keywords (array)
_meta_keywords (array)
_tags (array)
_authors (array)
_publish_date (string)
_html (string)
_meta_data (array object)
_meta_description (string)
_article_html (string)
_top_node (string)
_doc (string)
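The keys above are what the `extract` steps in the example usage look up. As an illustration only, the sketch below mimics that lookup in Python: it walks a list of keys into the scraped result and falls back to a default when a key is missing. The function `extract` and the sample `article` dictionary are made up for this example and are not the module's code.

```python
from typing import Any, Sequence

_MISSING = object()  # sentinel so that stored None values are still returned


def extract(data: Any, index: Sequence[str], default: Any = None) -> Any:
    """Follow each key in `index` into nested dicts; return `default` if absent."""
    current = data
    for key in index:
        if not isinstance(current, dict):
            return default
        current = current.get(key, _MISSING)
        if current is _MISSING:
            return default
    return current


# Made-up sample shaped like the schema keys listed above.
article = {
    "_title": "Example headline",
    "_authors": ["A. Writer"],
    "_meta_data": {"og": {"type": "article"}},
}

print(extract(article, ["_title"], default="title"))
print(extract(article, ["_text"], default="description"))
print(extract(article, ["_meta_data", "og", "type"]))
```

The second call shows why the `default` values in the example migration matter: when the scraper returns no `_text`, the destination field receives the default instead of an empty value.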
More Info
For an introduction to Playwright, see: https://playwright.dev/python/docs/intro
For more information on how Newspaper3k parses articles, see: https://newspaper.readthedocs.io/en/latest/user_guide/quickstart.html#pa...