
Migrate Source Scraper is a Drupal module that adds a new data source to the Migrate ecosystem. The source imports content by web scraping with Symfony's BrowserKit. The core of the module is the php_scraper plugin, which scrapes content from the URLs you specify.

Features

  • links_list: A list of URLs to scrape content from.
  • links_file: The path (relative to the module where the migration is defined) of a file containing the URLs to scrape, one URL per line. It is an alternative to links_list and is used only if links_list is not defined.
  • fields: Defines the fields to scrape and the scraping method for each field.
    • Each field uses one of two filter types:
      • selector: filters with a CSS selector;
      • xpath: filters with an XPath expression.

      Additionally, you can specify the following options for each field:

      • get (optional, string): Either "text" or "outerHtml". "text" retrieves only the text of the DOM element, while "outerHtml" retrieves its entire HTML content. Default: "text".
      • multiple (optional, boolean): If set to true, scrapes every element matching the selector/XPath. Each element is treated as a separate value. Default: false.
      • key (optional, string): Specifies the identifier under which each extracted element is exposed when multiple: true. This key is used to reference individual elements during migration processing (e.g., in sub_process).
  • ids: Specifies the unique identifiers for the scraped content.
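
To make the field options concrete, here is a minimal Python sketch of what an XPath filter combined with get and multiple amounts to. It is an analogy only, not the module's code (the module is PHP and relies on Symfony components). The page fragment and the scrape helper are hypothetical, and the sketch assumes that "outerHtml" includes the element's own tag, as the name suggests.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragment used only for illustration.
PAGE = """
<html><body>
  <h1 id="firstHeading">Amalfi <i>Coast</i></h1>
  <ul>
    <li><a href="/category/12">Coast</a></li>
    <li><a href="/category/34">Italy</a></li>
  </ul>
</body></html>
""".strip()


def scrape(root, xpath, get="text", multiple=False):
    """Mimic one field definition: filter by XPath, then extract a value."""

    def extract(el):
        if get == "outerHtml":
            # Serialise the element itself together with its children.
            return ET.tostring(el, encoding="unicode").strip()
        # "text": concatenated text of the element and its descendants.
        return "".join(el.itertext()).strip()

    matches = root.findall(xpath)
    if multiple:
        # One value per matching element.
        return [extract(el) for el in matches]
    # Without multiple, only the first match is used (assumption).
    return extract(matches[0]) if matches else None


root = ET.fromstring(PAGE)
title_text = scrape(root, ".//*[@id='firstHeading']")
title_html = scrape(root, ".//*[@id='firstHeading']", get="outerHtml")
link_texts = scrape(root, ".//a", multiple=True)
```

Here title_text holds only the heading's text, title_html holds the heading with its markup, and link_texts holds one entry per matching link.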

Post-Installation

After installation, you can configure the scraping source by defining the necessary options in your migration YAML files.

Example (links_list)

id: wikipedia_south_italy
label: Scraping wikipedia.org about south Italy

source:
  plugin: php_scraper

  links_list:
    - "https://en.wikipedia.org/wiki/Diego_Maradona"
    - "https://en.wikipedia.org/wiki/SSC_Napoli"
    - "https://en.wikipedia.org/wiki/Naples"
    - "https://en.wikipedia.org/wiki/Royal_Palace_of_Caserta"
    - "https://en.wikipedia.org/wiki/Southern_Italy"
    - "https://en.wikipedia.org/wiki/Amalfi_Coast"

  fields:
    title:
      xpath: '//*[@id="firstHeading"]'
      get: text
    body:
      selector: "#bodyContent"
      get: outerHtml
  ids:
    - id

process:
  body/value: body
  body/format:
    plugin: default_value
    default_value: full_html
  title:
    - plugin: callback
      callable: strip_tags
      source: title
    - plugin: default_value
      default_value: "No title"

destination:
  plugin: entity:node
  default_bundle: article 

Example (links_file)

Suppose the module that implements the migration is called "wiki_migration" and the migration is defined in the folder "wiki_migration/migrations". The path "fixtures/wiki_links.txt" is then resolved from the module folder itself, giving the absolute path:
[drupal_root]/web/modules/[custom|contrib]/wiki_migration/fixtures/wiki_links.txt

id: wikipedia_south_italy
label: Scraping wikipedia.org about south Italy

source:
  plugin: php_scraper

  links_file: "fixtures/wiki_links.txt"

  fields:
    title:
      xpath: '//*[@id="firstHeading"]'
      get: text
    body:
      selector: "#bodyContent"
      get: outerHtml
  ids:
    - id

process:
  body/value: body
  body/format:
    plugin: default_value
    default_value: full_html
  title:
    - plugin: callback
      callable: strip_tags
      source: title
    - plugin: default_value
      default_value: "No title"

destination:
  plugin: entity:node
  default_bundle: article 
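
The way links_file is resolved and read can be sketched in Python as follows. This is an illustration under stated assumptions, not the module's implementation: read_links is a hypothetical helper, the module folder is a temporary directory created for the demo, and skipping blank lines is an assumption rather than documented behaviour.

```python
import tempfile
from pathlib import Path


def read_links(module_dir, links_file):
    """Resolve links_file against the module folder; return one URL per line."""
    path = Path(module_dir) / links_file
    lines = path.read_text(encoding="utf-8").splitlines()
    # Skipping blank lines is an assumption, not documented module behaviour.
    return [line.strip() for line in lines if line.strip()]


# Demo: build a throwaway "wiki_migration" module folder with a links file.
with tempfile.TemporaryDirectory() as tmp:
    module_dir = Path(tmp) / "wiki_migration"
    (module_dir / "fixtures").mkdir(parents=True)
    (module_dir / "fixtures" / "wiki_links.txt").write_text(
        "https://en.wikipedia.org/wiki/Naples\n"
        "https://en.wikipedia.org/wiki/Amalfi_Coast\n",
        encoding="utf-8",
    )
    links = read_links(module_dir, "fixtures/wiki_links.txt")
```

The point of the sketch is that the path is joined to the module folder, not to the "migrations" folder that holds the YAML file.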

Example (multiple)

id: wikipedia_south_italy
label: "Scraping using multiple: true"
 
source:
  plugin: php_scraper

  links_file: "fixtures/custom_links.txt"

  fields:
    categories:
      xpath: '//a/@href'
      get: text
      multiple: true
      key: example_id
  ids:
    - id

process:
  upload:
    plugin: sub_process
    source: categories
    process:
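      target_id:
        # Strip every non-digit character from the scraped value.
        -
          plugin: str_replace
          regex: true
          source: example_id
          search: /[^0-9]/
          replace: ''
        # Look up the resulting ID in the categories migration.
        -
          plugin: migration_lookup
          migration: example_taxonomy_categories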

migration_dependencies:
  required:
    - example_taxonomy_categories
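
The data flow in this example can be sketched in Python. The sketch assumes that a field with multiple: true and key: example_id yields one row per match, each keyed by example_id, which is the shape sub_process expects. The href values are hypothetical. The str_replace step with search /[^0-9]/ then leaves only the digits, which are passed on to migration_lookup.

```python
import re

# Assumed shape of the "categories" field (hypothetical scraped values).
categories = [
    {"example_id": "/category/12"},
    {"example_id": "/category/34"},
]


def digits_only(value):
    """Remove every character matched by /[^0-9]/, i.e. every non-digit."""
    return re.sub(r"[^0-9]", "", value)


# One lookup ID per scraped element, as sub_process would produce them.
lookup_ids = [digits_only(row["example_id"]) for row in categories]
```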

Activity

Total releases: 1
First release: Jun 2025
Latest release: 10 months ago
Stability: 100% stable

Releases

Version   Type     Release date
1.0.1     Stable   Jun 16, 2025