
This module provides a Migrate process plugin that lets you request and extract data from the Python-based Newspaper web scraper, run through a Playwright wrapper.

Features

  • Ability to migrate content using Newspaper3k/Newspaper4k with a Playwright wrapper.
  • Define a custom path to the `ArticleScraping.py` script by passing `cwd` (current working directory).
  • Support for using your own `ArticleScraping.py` script per migration.

Prerequisite

A web server capable of running Python 3 scripts.

See newspaper-playwright-wrapper for installation and setup instructions.

On older distributions, it may not be possible to run the latest version of Playwright, because it depends on recent browser builds that in turn require glibc 2.27 or later. For example, the latest version that I found to be compatible with CentOS 7 (which predates glibc 2.27) is Playwright 1.30.

pip3 install playwright==1.30.0

Note that earlier versions of Playwright are not as fully featured as later
versions, i.e. they support fewer commands.
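
The version choice above can be sketched as a small Python check. This is only an illustration, not part of the module: the helper names `parse_version` and `playwright_requirement` are hypothetical, and the 2.27 threshold and the 1.30.0 pin are simply the values mentioned in the text. `platform.libc_ver()` is a standard-library call that reports the C library the Python interpreter was linked against.

```python
import platform
import re


def parse_version(text):
    """Turn a dotted version string such as '2.17' into a tuple of ints."""
    parts = []
    for piece in text.split("."):
        match = re.match(r"\d+", piece)
        parts.append(int(match.group()) if match else 0)
    return tuple(parts)


def playwright_requirement(glibc_version, minimum="2.27", fallback_pin="1.30.0"):
    """Return a pip requirement string for the given glibc version.

    Assumption: systems below `minimum` get the pinned fallback release,
    everything else gets the unpinned (latest) package.
    """
    if parse_version(glibc_version) >= parse_version(minimum):
        return "playwright"
    return f"playwright=={fallback_pin}"


if __name__ == "__main__":
    lib, version = platform.libc_ver()
    if lib == "glibc" and version:
        print("pip3 install " + playwright_requirement(version))
    else:
        print("Could not detect glibc; choose a Playwright version manually.")
```

On a host that reports glibc 2.17, this would suggest the pinned install; treat the output as a hint and confirm against the Playwright release notes for your platform.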

Example Usage

process:
  _processed_newspaper_playwright:
    -
      source: link
      plugin: migrate_process_newspaper_playwright
      debug: true # default false
      cwd: '../python' # default null
  _processed_title:
    - 
      plugin: get
      source: '@_processed_newspaper_playwright'
    -
      plugin: extract
      default: "title"
      index:
        - _title
  _processed_description:
    - 
      plugin: get
      source: '@_processed_newspaper_playwright'
    -
      plugin: extract
      default: "description"
      index:
        - _text

In this example the cwd (current working directory) is defined relative to the docroot (e.g. /var/www/html/web).
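
To make the relative path concrete, here is a minimal Python sketch of how a value such as `../python` resolves against a docroot. It assumes ordinary POSIX path resolution; how the plugin itself resolves `cwd` is not shown in this page, and the `resolve_cwd` helper is hypothetical.

```python
import posixpath


def resolve_cwd(docroot, cwd):
    """Resolve a cwd value against the docroot using normal POSIX path rules."""
    return posixpath.normpath(posixpath.join(docroot, cwd))


# With the docroot from the example, '../python' points one level above web/.
print(resolve_cwd("/var/www/html/web", "../python"))
```

Under these assumptions the script directory would be `/var/www/html/python`, i.e. a sibling of the docroot rather than a folder inside it.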

Schema Keys

_summary (string)
_source_url (string)
_url (string)
_title (string)
_top_img (string)
_top_image (string)
_meta_img (string)
_imgs (array)
_images (array)
_movies (array)
_text (string)
_keywords (array)
_meta_keywords (array)
_tags (array)
_authors (array)
_publish_date (string)
_html (string)
_meta_data (array object)
_meta_description (string)
_article_html (string)
_top_node (string)
_doc (string)
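
As a rough illustration of how these keys are consumed, the Python sketch below mimics the `get` + `extract` steps from the example usage: look up a key in the scraper result and fall back to a default when it is missing. The `extract` function here is a hypothetical stand-in, loosely modelled on the behaviour of the migrate `extract` plugin, and the sample result is invented for the example. When no default is given the sketch simply returns `None`, which may differ from what the real plugin does.

```python
def extract(value, index, default=None):
    """Walk the keys in `index`; return `default` if any key is missing."""
    current = value
    for key in index:
        if isinstance(current, dict) and key in current:
            current = current[key]
        else:
            return default
    return current


# Hypothetical scraper result containing only a few of the schema keys.
result = {
    "_title": "Example headline",
    "_authors": ["A. Writer"],
    "_keywords": [],
}

title = extract(result, ["_title"], default="title")
description = extract(result, ["_text"], default="description")
print(title)
print(description)
```

Because `_text` is absent from the sample result, the second lookup falls back to the configured default, mirroring the `default: "description"` line in the migration example.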

More Info

For an introduction to Playwright, see: https://playwright.dev/python/docs/intro

For more info on how Newspaper3k parses articles please see: https://newspaper.readthedocs.io/en/latest/user_guide/quickstart.html#pa...

Releases

Version Type Release date
1.0.0 Stable Mar 20, 2025