Advanced Scraping

Advanced scraping settings in PullNode let you navigate and extract data from complex web pages efficiently. This document covers dynamic parameters, selectors, pagination, and infinite scroll handling in your scraper projects.

Dynamic Parameters

Dynamic parameters allow your scraper to iterate through a series of URLs by changing specified parts of the URL according to a list of values or a range of numbers. This feature is crucial for scraping data across multiple pages or categories efficiently.

Setting Up Dynamic Parameters

  1. Add/Update URL: Navigate to your project dashboard and find the option to add or update a URL.
  2. Identify Dynamic Sections: In the URL, identify the sections that will change across requests.
  3. Replace with Parameters: Replace the dynamic sections with parameters enclosed in curly braces {}.
  4. Define Parameter Values:
    • Range of Numbers: For numerical values that iterate, such as page numbers, use the format [START_NUM,END_NUM].
    • List of Values: For parameters like search queries, separate each value with a semicolon ; or a space.

Example

For a URL template https://example.com/{SEARCH_QUERY}?page={PAGE_NUMBER}, you could define:

  • SEARCH_QUERY with a list cars;planes
  • PAGE_NUMBER with a range [1,10]

This setup will generate URLs covering all combinations of the defined SEARCH_QUERY and PAGE_NUMBER. In this case, the scraper will visit the following URLs:

  • https://example.com/cars?page=1
  • https://example.com/cars?page=2
  • ...
  • https://example.com/planes?page=9
  • https://example.com/planes?page=10 (20 URLs in total)
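
To make the expansion concrete, here is a minimal Python sketch of how such a template could be expanded into the full URL list. The expand_values and expand_url functions, and the exact parsing rules, are illustrative assumptions for this example, not PullNode's internal implementation.

    # Illustrative sketch only: mimics how a {PARAM} template expands into URLs.
    # Function names and parsing rules are assumptions, not PullNode internals.
    from itertools import product

    def expand_values(spec):
        """Turn a parameter spec into a list of concrete values."""
        if spec.startswith("[") and spec.endswith("]"):        # range: [START_NUM,END_NUM]
            start, end = (int(n) for n in spec[1:-1].split(","))
            return [str(n) for n in range(start, end + 1)]
        return spec.replace(";", " ").split()                  # list: semicolon- or space-separated values

    def expand_url(template, params):
        """Yield every URL produced by all combinations of the parameter values."""
        names = list(params)
        value_lists = [expand_values(params[name]) for name in names]
        for combo in product(*value_lists):
            url = template
            for name, value in zip(names, combo):
                url = url.replace("{" + name + "}", value)
            yield url

    urls = list(expand_url(
        "https://example.com/{SEARCH_QUERY}?page={PAGE_NUMBER}",
        {"SEARCH_QUERY": "cars;planes", "PAGE_NUMBER": "[1,10]"},
    ))
    print(len(urls))   # 20
    print(urls[0])     # https://example.com/cars?page=1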

Selectors

Selectors are used to identify specific elements on a web page from which you want to scrape data. Each selector is associated with a target that determines the type of data to retrieve.

Selector Targets

  Target   Returns
  ------   --------------------------------------------
  Text     The text content of the selected element.
  Link     The URL of the element's href attribute.
  Image    The URL of an image's src attribute.
  HTML     The HTML content of the selected element.
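
As a rough illustration of what each target returns, the sketch below uses BeautifulSoup as a stand-in; the HTML snippet and the div.card a selector are made-up examples, and this is not how PullNode works internally.

    # Made-up example showing roughly what each selector target extracts,
    # using BeautifulSoup as a stand-in (not PullNode's actual implementation).
    from bs4 import BeautifulSoup

    html = '<div class="card"><a href="/item/42"><img src="/img/42.jpg">Red car</a></div>'
    soup = BeautifulSoup(html, "html.parser")

    element = soup.select_one("div.card a")          # the selected element

    text   = element.get_text(strip=True)            # Text target  -> "Red car"
    link   = element.get("href")                     # Link target  -> "/item/42"
    image  = element.select_one("img").get("src")    # Image target -> "/img/42.jpg"
    markup = str(element)                            # HTML target  -> the element's markup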

Pagination

For websites that split content across multiple pages, you can use the pagination settings to navigate through them automatically.

Settings

  • CSS Selector or XPath: Provide the selector or XPath for the "Next" button or link.
  • Max Pages to Load: Set the maximum number of pages the scraper should navigate through.

Example

  • CSS Selector for Next Page: .next-page
  • Max Pages: 5

This will make the scraper click the "Next" button identified by the .next-page selector, stopping after 5 pages have been loaded.
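
For intuition, the sketch below shows the equivalent click-the-"Next"-button loop written with Playwright. The URL is a placeholder, and PullNode performs this navigation for you, so the code is only illustrative.

    # Illustrative Playwright sketch of the pagination loop; PullNode handles
    # this internally, and the URL below is a placeholder.
    from playwright.sync_api import sync_playwright

    NEXT_SELECTOR = ".next-page"   # CSS Selector for Next Page
    MAX_PAGES = 5                  # Max Pages to Load

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto("https://example.com/listing")

        for page_number in range(1, MAX_PAGES + 1):
            # ... extract data from the current page here ...
            if page_number == MAX_PAGES:
                break                          # stop after MAX_PAGES pages have loaded
            next_button = page.query_selector(NEXT_SELECTOR)
            if next_button is None:            # no "Next" button: last page reached
                break
            next_button.click()
            page.wait_for_load_state("networkidle")

        browser.close()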

Scroll

Some websites load more content as the user scrolls down. Handle this by setting the maximum number of times the scraper should simulate scrolling.

Settings

  • Max Times to Scroll: Indicate how many times the scraper should simulate a scroll down action to load new content.

Example

  • Max Times to Scroll: 10 (set this to a value greater than 0 to enable scroll handling)

This setting is essential for websites that do not use traditional pagination but instead load new content dynamically as the user scrolls.
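
The loop below sketches the same idea with Playwright: scroll to the bottom, wait for new content, and repeat up to the configured limit. The URL and the one-second wait are placeholder assumptions, and PullNode performs the scrolling itself, so this is only an illustration.

    # Illustrative Playwright sketch of scroll handling; PullNode performs the
    # scrolling itself, and the URL and wait time below are placeholders.
    from playwright.sync_api import sync_playwright

    MAX_SCROLLS = 10   # Max Times to Scroll

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto("https://example.com/feed")

        for _ in range(MAX_SCROLLS):
            # scroll to the bottom so the site loads the next batch of content
            page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
            page.wait_for_timeout(1000)   # give newly loaded content time to appear

        # ... extract data from the fully loaded page here ...
        browser.close()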

Conclusion

Understanding and utilizing these advanced scraping settings allows for more effective data collection from a wide range of websites. Whether dealing with dynamic URLs, extracting data from specific elements, navigating paginated content, or handling infinite scroll pages, PullNode provides the tools you need to streamline your web scraping projects.