migrate_source_scraper

The Migrate Source Scraper module is a Drupal module that introduces a new data source for the Migrate ecosystem. This source imports content via web scraping using Symfony's BrowserKit. The core of the module is the php_scraper source plugin, which scrapes content from the specified URLs.

Features

links_list: A list of URLs to scrape content from.
links_file: The path (relative to the module where the migration is defined) of the file containing URLs to scrape. Each URL should be listed on a separate line. Can be used as an alternative to links_list, but only if links_list is not defined.
fields: Defines the fields to scrape and the scraping method for each field.
- Each field takes one of two filter types:
  - selector: filters with a CSS selector;
  - xpath: filters with an XPath expression.
  Additionally, you can specify the following options for each field:
  - get (optional, string): Either "text" or "outerHtml". "text" retrieves only the text of the DOM element, while "outerHtml" retrieves the element's entire HTML markup. Default: "text".
  - multiple (optional, boolean): If set to true, scrapes every element matching the selector/XPath. Each element is treated as a separate value. Default: false.
  - key (optional, string): Names each extracted element when multiple is true. This key is used to reference individual elements during migration processing (e.g., in sub_process).
ids: Specify unique identifiers for the scraped content.
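
The options listed above can be combined within a single source definition. The fragment below is a minimal sketch rather than a tested configuration: the URL, the field names (heading, content, tags) and the selectors are hypothetical and only illustrate where each option goes.

```yaml
# Illustrative sketch only: URL, field names and selectors are hypothetical.
source:
  plugin: php_scraper
  links_list:
    - "https://example.com/page"
  fields:
    # Single value, text only ("text" is the default "get" method).
    heading:
      selector: "h1"
    # Single value, keeping the HTML markup of the matched element.
    content:
      xpath: '//div[@id="content"]'
      get: outerHtml
    # One value per matching element, each referenced as "tag_name".
    tags:
      selector: "ul.tags li"
      multiple: true
      key: tag_name
  ids:
    - id
```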

Post-Installation

After installation, you can configure the scraping source by defining the necessary options in your migration YAML files.

Example (links_list)

id: wikipedia_south_italy
label: Scraping wikipedia.org about south Italy

source:
  plugin: php_scraper

  links_list:
    - "https://en.wikipedia.org/wiki/Diego_Maradona"
    - "https://en.wikipedia.org/wiki/SSC_Napoli"
    - "https://en.wikipedia.org/wiki/Naples"
    - "https://en.wikipedia.org/wiki/Royal_Palace_of_Caserta"
    - "https://en.wikipedia.org/wiki/Southern_Italy"
    - "https://en.wikipedia.org/wiki/Amalfi_Coast"

  fields:
    title:
      xpath: '//*[@id="firstHeading"]'
      get: text
    body:
      selector: "#bodyContent"
      get: outerHtml
  ids:
    - id

process:
  body/value: body
  body/format:
    plugin: default_value
    default_value: full_html
  title:
    - plugin: callback
      callable: strip_tags
      source: title
    - plugin: default_value
      default_value: "No title"

destination:
  plugin: entity:node
  default_bundle: article

Example (links_file)

Suppose the module implementing the migration is called "wiki_migration" and the migration is defined in the folder "wiki_migration/migrations". The path "fixtures/wiki_links.txt" is then resolved to an absolute path starting from the module folder itself:
[drupal_root]/web/modules/[custom|contrib]/wiki_migration/fixtures/wiki_links.txt

id: wikipedia_south_italy
label: Scraping wikipedia.org about south Italy

source:
  plugin: php_scraper

  links_file: "fixtures/wiki_links.txt"

  fields:
    title:
      xpath: '//*[@id="firstHeading"]'
      get: text
    body:
      selector: "#bodyContent"
      get: outerHtml
  ids:
    - id

process:
  body/value: body
  body/format:
    plugin: default_value
    default_value: full_html
  title:
    - plugin: callback
      callable: strip_tags
      source: title
    - plugin: default_value
      default_value: "No title"

destination:
  plugin: entity:node
  default_bundle: article

Example (multiple)

id: wikipedia_south_italy
label: "Scraping using multiple: true"
 
source:
  plugin: php_scraper

  links_file: "fixtures/custom_links.txt"

  fields:
    categories:
      xpath: '//a/@href'
      get: text
      multiple: true
      key: example_id
  ids:
    - id

process:
  upload:
    plugin: sub_process
    source: categories
    process:
      target_id:
        -
          plugin: str_replace
          regex: true
          source: example_id
          search: /[^0-9]/
          replace: ''
        -
          plugin: migration_lookup
          migration: example_taxonomy_categories

migration_dependencies:
  required:
    - example_taxonomy_categories
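
For orientation, sub_process iterates over a list of keyed items, so with multiple: true and key: example_id the categories value presumably resembles the sketch below. The exact structure is an assumption inferred from how the key is used in the example, and the href values are hypothetical.

```yaml
# Assumed shape of the "categories" source value after scraping.
# Illustrative only: the hrefs are made up and the structure is inferred,
# not taken from the module's documentation.
categories:
  - example_id: "/category/12"
  - example_id: "/category/34"
```

Under that assumption, the str_replace step strips every non-digit character, so "/category/12" becomes "12" before it is passed to migration_lookup.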
