migrate_source_scraper
The Migrate Source Scraper module is a Drupal module that adds a new data source to the Migrate ecosystem. The source imports content by scraping web pages with Symfony's BrowserKit. The core of the module is the php_scraper plugin, which scrapes content from the specified URLs.
Features
- links_list: A list of URLs to scrape content from.
- links_file: The path, relative to the module that defines the migration, of a file containing the URLs to scrape, one URL per line. It is an alternative to links_list and is used only if links_list is not defined.
- fields: Defines the fields to scrape and the scraping method for each field.
  - Each field uses one of two filter types:
    - selector: filters with a CSS selector;
    - xpath: filters with an XPath expression.
  - Additionally, you can specify the following options for each field:
    - multiple (optional, boolean): If set to true, scrapes every element matching the selector/XPath. Each element is treated as a separate value. Default: false.
    - key (optional, string): Specifies the identifier under which each extracted element is stored when multiple: true. This key is used to reference individual elements during migration processing (e.g., in sub_process).
    - get (optional, string): The extraction method, either "text" or "outerHtml". "text" retrieves only the text of the DOM element, while "outerHtml" retrieves the element's full HTML content. Default: "text".
- ids: Specify unique identifiers for the scraped content.
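The options above can be combined in a single source definition. The following is a minimal sketch only: the URL, the field names (headline, tags), the selectors, and the key name tag_name are hypothetical, and only the option names themselves come from the list above.

```yaml
source:
  plugin: php_scraper
  links_list:
    - "https://example.com/some-page"   # hypothetical URL
  fields:
    # Single value picked with a CSS selector; "get" is omitted,
    # so the default ("text") applies.
    headline:
      selector: "h1"
    # Several values picked with XPath; with multiple: true each match
    # is kept as a separate value, referenced later through "key".
    tags:
      xpath: '//ul[@class="tags"]/li/a'
      get: text
      multiple: true
      key: tag_name
  ids:
    - id
```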
Post-Installation
After installation, you can configure the scraping source by defining the necessary options in your migration YAML files.
Example (links_list)
id: wikipedia_south_italy
label: Scraping wikipedia.org about south Italy
source:
  plugin: php_scraper
  links_list:
    - "https://en.wikipedia.org/wiki/Diego_Maradona"
    - "https://en.wikipedia.org/wiki/SSC_Napoli"
    - "https://en.wikipedia.org/wiki/Naples"
    - "https://en.wikipedia.org/wiki/Royal_Palace_of_Caserta"
    - "https://en.wikipedia.org/wiki/Southern_Italy"
    - "https://en.wikipedia.org/wiki/Amalfi_Coast"
  fields:
    title:
      xpath: '//*[@id="firstHeading"]'
      get: text
    body:
      selector: "#bodyContent"
      get: outerHtml
  ids:
    - id
process:
  body/value: body
  body/format:
    plugin: default_value
    default_value: full_html
  title:
    - plugin: callback
      callable: strip_tags
      source: title
    - plugin: default_value
      default_value: "No title"
destination:
  plugin: entity:node
  default_bundle: article
Example (links_file)
Suppose the module implementing the migration is called "wiki_migration" and the migration is defined in the folder "wiki_migration/migrations". The path "fixtures/wiki_links.txt" is then resolved, as an absolute path, from the module folder itself:
[drupal_root]/web/modules/[custom|contrib]/wiki_migration/fixtures/wiki_links.txt.
id: wikipedia_south_italy
label: Scraping wikipedia.org about south Italy
source:
  plugin: php_scraper
  links_file: "fixtures/wiki_links.txt"
  fields:
    title:
      xpath: '//*[@id="firstHeading"]'
      get: text
    body:
      selector: "#bodyContent"
      get: outerHtml
  ids:
    - id
process:
  body/value: body
  body/format:
    plugin: default_value
    default_value: full_html
  title:
    - plugin: callback
      callable: strip_tags
      source: title
    - plugin: default_value
      default_value: "No title"
destination:
  plugin: entity:node
  default_bundle: article
Example (multiple)
id: wikipedia_south_italy
label: "Scraping using multiple: true"
source:
  plugin: php_scraper
  links_file: "fixtures/custom_links.txt"
  fields:
    categories:
      xpath: '//a/@href'
      get: text
      multiple: true
      key: example_id
  ids:
    - id
process:
  upload:
    plugin: sub_process
    source: categories
    process:
      target_id:
        -
          plugin: str_replace
          regex: true
          source: example_id
          search: /[^0-9]/
          replace: $1
        -
          plugin: migration_lookup
          migration: example_taxonomy_categories
migration_dependencies:
  required:
    - example_taxonomy_categories
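To see why key matters in this example, recall that sub_process iterates over a list of associative arrays. Assuming the plugin wraps each scraped match under the configured key (an inference from the example above, not something this README states explicitly), the categories value for one scraped page might look roughly like the sketch below. The href values are made up for illustration.

```yaml
# Hypothetical shape of the "categories" source value for one page,
# assuming each match is stored under the configured key (example_id).
categories:
  - example_id: "/taxonomy/term/12"
  - example_id: "/taxonomy/term/34"
```

Under that assumption, the str_replace step strips every non-digit character from each example_id, leaving "12" and "34", and migration_lookup then resolves those values against example_taxonomy_categories.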