This module provides a Migrate process plugin that requests and extracts data from the Python-based Newspaper web scraper through a Playwright wrapper.

Features

  • Ability to migrate content using Newspaper3k/4k with a Playwright wrapper.
  • Define a custom path to the `ArticleScraping.py` script by passing `cwd` (current working directory).
  • Support for using your own `ArticleScraping.py` script per migration.

Prerequisites

A web server capable of running Python 3 scripts.

See newspaper-playwright-wrapper for installation and setup instructions.

On older distributions it may not be possible to run the latest version of Playwright. Playwright depends on recent web browser builds, which in turn require glibc 2.27 or later. For example, the latest version that I found to be compatible with CentOS 7 (which predates glibc 2.27) is Playwright 1.30.

pip3 install playwright==1.30.0

Note that earlier versions of Playwright are not as fully featured as later
versions, i.e. they support fewer commands.
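
As a rough way to decide whether the pin above is needed, the host's glibc version can be compared against 2.27. The following is a minimal sketch and not part of this module: the helper name `needs_playwright_pin` is hypothetical, and it relies only on Python's standard `platform.libc_ver()`.

```python
import platform

# Minimum glibc version stated above for recent Playwright browser builds.
MIN_GLIBC = (2, 27)


def needs_playwright_pin(libc_version: str) -> bool:
    """Return True when the given glibc version string is older than 2.27.

    Hypothetical helper for illustration; not part of this module.
    """
    parts = tuple(int(p) for p in libc_version.split(".")[:2])
    return parts < MIN_GLIBC


if __name__ == "__main__":
    lib, version = platform.libc_ver()
    if lib == "glibc" and version:
        if needs_playwright_pin(version):
            print(f"glibc {version}: consider pip3 install playwright==1.30.0")
        else:
            print(f"glibc {version}: a current Playwright release should work")
    else:
        print("Could not detect glibc; check your distribution manually")
```

On a host reporting an older glibc this would suggest the pinned install; treat the result as a hint rather than a guarantee of compatibility.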

Example Usage

process:
  _processed_newspaper_playwright:
    -
      source: link
      plugin: migrate_process_newspaper_playwright
      debug: true # default: false
      cwd: '../python' # default: null
  _processed_title:
    - 
      plugin: get
      source: '@_processed_newspaper_playwright'
    -
      plugin: extract
      default: "title"
      index:
        - _title
  _processed_description:
    - 
      plugin: get
      source: '@_processed_newspaper_playwright'
    -
      plugin: extract
      default: "description"
      index:
        - _text

In this example, `cwd` (the current working directory) is defined relative to the docroot (e.g. /var/www/html/web).
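
To illustrate how such a relative value resolves, here is a small sketch. It assumes the path is simply joined onto the docroot and normalised; the helper name `resolve_cwd` is hypothetical, and the module's actual resolution logic may differ.

```python
import posixpath


def resolve_cwd(docroot: str, cwd: str) -> str:
    """Join a relative cwd onto the docroot and collapse any '..' segments.

    Hypothetical helper for illustration only.
    """
    return posixpath.normpath(posixpath.join(docroot, cwd))


print(resolve_cwd("/var/www/html/web", "../python"))  # /var/www/html/python
```

Under that assumption, `../python` points at a `python` directory that sits alongside `web` rather than inside it.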

Schema Keys

_summary (string)
_source_url (string)
_url (string)
_title (string)
_top_img (string)
_top_image (string)
_meta_img (string)
_imgs (array)
_images (array)
_movies (array)
_text (string)
_keywords (array)
_meta_keywords (array)
_tags (array)
_authors (array)
_publish_date (string)
_html (string)
_meta_data (array object)
_meta_description (string)
_article_html (string)
_top_node (string)
_doc (string)
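
When writing your own `ArticleScraping.py`, it can help to check that its output matches the keys above. The sketch below assumes the script's result is decoded into a Python dict (for example from JSON). That format, the `EXPECTED_TYPES` table (a subset of the keys), and the `check_schema` helper are illustrative assumptions, not part of this module.

```python
import json

# Subset of the schema keys listed above, mapped to the Python type
# each would decode to. Assumption: the payload is a JSON object.
EXPECTED_TYPES = {
    "_title": str,
    "_text": str,
    "_url": str,
    "_authors": list,
    "_keywords": list,
    "_images": list,
}


def check_schema(payload: dict) -> list:
    """Return a list of problems found; an empty list means no mismatches.

    Missing keys are ignored, since a scraper may not populate every field.
    """
    problems = []
    for key, expected in EXPECTED_TYPES.items():
        if key in payload and not isinstance(payload[key], expected):
            problems.append(f"{key}: expected {expected.__name__}")
    return problems


sample = json.loads('{"_title": "Example", "_text": "Body", "_authors": "Jane"}')
print(check_schema(sample))  # ['_authors: expected list']
```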

More Info

For an introduction to Playwright, see: https://playwright.dev/python/docs/intro

For more information on how Newspaper3k parses articles, see: https://newspaper.readthedocs.io/en/latest/user_guide/quickstart.html#pa...

Releases

Version Type Release date
1.0.0 Stable Mar 20, 2025