feeds_paginated_fetcher
This project is not covered by Drupal's security advisory policy.
A Drupal 10/11 module that extends the Feeds module with a Paginated HTTP Fetcher: a fetcher plugin that automatically walks through every page of a paginated API endpoint and delivers the combined result to the standard Feeds parse/process pipeline.
The standard Feeds HTTP fetcher retrieves a single URL and hands the raw response to the parser. This module replaces that single-request fetch with a loop that:
- Fetches page 1 of the API.
- Extracts the items array from the response (configurable).
- Determines whether there is a next page (using one of four strategies — see below).
- Repeats until there are no more pages, or a configured page limit is reached.
- Merges all collected items into a single JSON array.
- Returns that merged array to the Feeds pipeline as a RawFetcherResult.
The parser and processor configured on the feed type receive the merged data exactly as if the entire dataset had come from one page.
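The loop above can be sketched in a few lines. This is an illustrative Python sketch, not the module's PHP implementation; `fetch_page`, `extract_items`, and `get_next_url` are hypothetical stand-ins for the HTTP request, the configured items-key extraction, and the pagination strategy.

```python
# Illustrative sketch of the paginated fetch-merge loop (the real module is
# a PHP Feeds fetcher plugin; these names are not its API).

def fetch_all(first_url, fetch_page, extract_items, get_next_url, max_pages=0):
    """Walk every page starting at first_url and merge the extracted items."""
    merged = []
    url, pages = first_url, 0
    while url:
        response = fetch_page(url)
        merged.extend(extract_items(response))
        pages += 1
        if max_pages and pages >= max_pages:
            break  # "Maximum pages" hard cap (0 = unlimited)
        url = get_next_url(response, url)  # None ends the loop
    return merged
```

In the real module the merged array is handed back to the Feeds pipeline as a RawFetcherResult.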
Features
Pagination Strategies
- Page number — increments a query parameter (e.g. ?page=1&per_page=100); configurable parameter name and starting value (0 or 1)
- Offset — increments an offset parameter (e.g. ?offset=0&limit=100); configurable offset and limit parameter names
- Link header (RFC 8288, formerly RFC 5988) — follows rel="next" from HTTP Link response headers
- JSON next link — reads the next-page URL from a dot-notation path inside the JSON response body (e.g. links.next, pagination.next_url)
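For the page-number and offset strategies, computing the next URL amounts to incrementing a single query parameter. A minimal Python sketch of that idea, assuming the parameter names configured above:

```python
from urllib.parse import parse_qs, urlencode, urlparse, urlunparse

def bump_query_param(url, name, step):
    """Page-number / offset strategies: add step to one query parameter.

    A missing parameter is treated as 0, which covers a configurable
    starting value of 0 as well as an offset that is absent on page 1.
    """
    parts = urlparse(url)
    query = {k: v[0] for k, v in parse_qs(parts.query).items()}
    query[name] = str(int(query.get(name, "0")) + step)
    return urlunparse(parts._replace(query=urlencode(query)))
```

The page-number strategy calls this with step 1 on the page parameter; the offset strategy calls it with the configured limit on the offset parameter.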
Item Extraction
- Items key — dot-notation path to extract the items array from a response wrapper (e.g. data, results.items)
- Root-level JSON array support — when items key is empty, uses the response array directly
- Single-object wrapping — a bare JSON object response is automatically wrapped in an array
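All three extraction rules fit in one small function. A hedged Python sketch (the module's actual extraction is implemented in PHP):

```python
def extract_items(response, items_key=""):
    """Resolve a dot-notation path into decoded JSON.

    Empty items_key: use the response as-is (root-level array support);
    a bare object that is not a list gets wrapped in a single-item list.
    """
    node = response
    for part in [p for p in items_key.split(".") if p]:
        node = node[part]
    return node if isinstance(node, list) else [node]
```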
Batch Mode
- Pages per batch — spreads fetching across multiple cron runs; persists resumption state between runs
- Resumes correctly from the exact page, base URL, and current URL on the next cron run
- Signals Feeds with setCompleted() when all pages are done
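The batch behaviour can be pictured as a small state machine persisted between cron runs. The state keys below are illustrative, not the module's actual storage schema, and the in-memory items list stands in for the module's streaming temp file:

```python
def run_batch(state, fetch_page, get_next_url, pages_per_batch):
    """Fetch at most pages_per_batch pages, then record where we stopped."""
    fetched = 0
    url = state["current_url"]
    while url and fetched < pages_per_batch:
        response = fetch_page(url)
        state["items"].extend(response["data"])  # module streams to a temp file
        url = get_next_url(response)
        fetched += 1
    state["current_url"] = url          # resume point for the next cron run
    state["completed"] = url is None    # maps to Feeds' setCompleted()
    return state
```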
Memory & Execution Time Protection
- Streaming temp file accumulation — writes items to a PHP tmpfile() page-by-page instead of accumulating in a PHP array, reducing peak memory usage
- Memory threshold — stops the current batch early and saves state if PHP memory usage exceeds a configurable % of memory_limit (default 80%); skipped when memory_limit = -1
- Execution time threshold — stops early if elapsed time exceeds a configurable % of max_execution_time (default 80%); skipped when max_execution_time = 0
- Both thresholds log a warning and persist state so the next cron run resumes without data loss
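The two threshold checks reduce to a simple predicate evaluated between pages. A Python sketch of the decision logic, with the PHP sentinel values (-1 and 0) disabling each check as described above:

```python
def should_stop(used_bytes, memory_limit, elapsed_s, max_exec_s, threshold=0.8):
    """True when either resource exceeds its threshold fraction.

    memory_limit == -1 (unlimited) skips the memory check;
    max_exec_s == 0 (unlimited) skips the execution-time check.
    """
    if memory_limit != -1 and used_bytes > threshold * memory_limit:
        return True
    if max_exec_s != 0 and elapsed_s > threshold * max_exec_s:
        return True
    return False
```

When this returns true, the module logs a warning, saves the batch state, and lets the next cron run resume from the saved page.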
Resilience & Retry
- Retry on transient failures — configurable retry count (default 3) with exponential backoff
- Retries on: connection errors (ConnectException), HTTP 5xx (ServerException), HTTP 429 rate limit (ClientException 429)
- Retry-After header support — respects the server-supplied delay on 429 responses
- Non-retryable errors (4xx other than 429, invalid JSON) fail immediately
- Each retry attempt is always logged as a warning, regardless of the verbose logging setting
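The retry policy can be sketched as follows. This is a generic illustration of exponential backoff with Retry-After support, not the module's Guzzle-based code; `do_request` is a hypothetical callable returning a status, headers, and body:

```python
import time

TRANSIENT = {429, 500, 502, 503, 504}  # retryable: rate limit and 5xx

def fetch_with_retry(do_request, retries=3, base_delay=1.0, sleep=time.sleep):
    """Retry transient failures with exponential backoff.

    A 429 with a Retry-After header uses the server-supplied delay instead;
    other 4xx responses fail immediately with no retry.
    """
    for attempt in range(retries + 1):
        status, headers, body = do_request()
        if status < 400:
            return body
        if status not in TRANSIENT or attempt == retries:
            raise RuntimeError(f"HTTP {status}")
        if status == 429 and "Retry-After" in headers:
            sleep(float(headers["Retry-After"]))
        else:
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
```

Connection errors (Guzzle's ConnectException in the real module) would be treated like the transient statuses above.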
Request Configuration
- Request timeout — per-request read timeout in seconds (default 30)
- Connection timeout — separate Guzzle connect timeout (default 10), independent of the read timeout
- Extra query parameters — static URL-encoded params appended to every request (e.g. api_key=abc&format=json)
- Custom request headers — one Name: Value per line (e.g. Authorization: Bearer token)
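The "one Name: Value per line" headers field implies a small parsing step before each request. A Python sketch of that parsing, assuming malformed lines without a colon are simply skipped (the module's actual handling may differ):

```python
def parse_headers(raw):
    """Parse a 'Name: Value' per-line textarea into a header dict."""
    headers = {}
    for line in raw.splitlines():
        if ":" in line:
            name, value = line.split(":", 1)  # split on the first colon only
            headers[name.strip()] = value.strip()
    return headers
```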
Safety & Security
- Maximum pages — hard cap on total pages fetched per import run (0 = unlimited)
- SSRF protection — server-supplied next-page URLs (Link headers, JSON next link) are validated; only http:// and https:// schemes accepted
- HTTP header injection prevention — CR/LF characters stripped from all custom header names and values
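Both protections are small validation steps applied to untrusted input. A hedged Python sketch of the two checks, with function names of my own choosing:

```python
from urllib.parse import urlparse

def safe_next_url(url):
    """SSRF guard: accept only http/https server-supplied next-page URLs."""
    return url if urlparse(url).scheme in ("http", "https") else None

def sanitize_header_value(value):
    """Strip CR/LF so a value cannot smuggle extra headers into the request."""
    return value.replace("\r", "").replace("\n", "")
```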
Logging
- Verbose import logging — per-feed toggle; logs page URLs, item counts, pagination decisions, batch state changes, and resource threshold triggers
- Always-on error logging — HTTP failures, invalid JSON, non-array responses, and retry attempts are always written to the feeds_paginated_fetcher watchdog channel regardless of the verbose flag
UI & Configuration
- Per-feed configuration form with conditional field visibility (strategy-specific fields shown/hidden via #states)
- Settings placed in a Pagination settings vertical tab alongside Feeds' built-in tabs
- Form validation for URL scheme, extra query params format, per-page minimum, next-link path requirement, and percentage field ranges
- All settings have sensible defaults; existing feeds without new config keys automatically receive defaults
Compatibility
- Drupal 10 and 11
- Requires the Feeds 3.x contrib module
- PHP 8.3+
Post-Installation
See README.md for all post-installation configuration steps, with examples.