Advanced Scraping

Advanced scraping settings in PullNode let you navigate and extract data from complex web pages efficiently. This document covers dynamic parameters, selectors, pagination, and infinite scroll handling in your scraper projects.

Dynamic Parameters

Dynamic parameters allow your scraper to iterate through a series of URLs by changing specified parts of the URL according to a list of values or a range of numbers. This feature is crucial for scraping data across multiple pages or categories efficiently.

Setting Up Dynamic Parameters

  1. Add/Update URL: Navigate to your project dashboard and find the option to add or update a URL.
  2. Identify Dynamic Sections: In the URL, identify the sections that will change across requests.
  3. Replace with Parameters: Replace the dynamic sections with parameters enclosed in curly braces {}.
  4. Define Parameter Values:
    • Range of Numbers: For numerical values that iterate, such as page numbers, use the format [START_NUM,END_NUM].
    • List of Values: For parameters like search queries, separate each value with a semicolon ; or a space.

Example

For a URL template https://example.com/{SEARCH_QUERY}?page={PAGE_NUMBER}, you could define:

  • SEARCH_QUERY with a list cars;planes
  • PAGE_NUMBER with a range [1,10]

This setup will generate URLs covering all combinations of the defined SEARCH_QUERY and PAGE_NUMBER. In this case, the scraper will visit the following URLs:

  • https://example.com/cars?page=1
  • https://example.com/cars?page=2
  • ...
  • https://example.com/planes?page=9
  • https://example.com/planes?page=10 (20 URLs in total)
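
To make the expansion concrete, here is a minimal Python sketch of how such a template could be expanded into the full URL list. The expand_values and expand_url functions, and the exact parsing rules, are illustrative assumptions for this example, not PullNode's internal implementation.

    # Illustrative sketch only: mimics how a {PARAM} template expands into URLs.
    # Function names and parsing rules are assumptions, not PullNode internals.
    from itertools import product

    def expand_values(spec):
        """Turn a parameter spec into a list of concrete values."""
        if spec.startswith("[") and spec.endswith("]"):        # range: [START_NUM,END_NUM]
            start, end = (int(n) for n in spec[1:-1].split(","))
            return [str(n) for n in range(start, end + 1)]
        return spec.replace(";", " ").split()                  # list: semicolon- or space-separated values

    def expand_url(template, params):
        """Yield every URL produced by all combinations of the parameter values."""
        names = list(params)
        value_lists = [expand_values(params[name]) for name in names]
        for combo in product(*value_lists):
            url = template
            for name, value in zip(names, combo):
                url = url.replace("{" + name + "}", value)
            yield url

    urls = list(expand_url(
        "https://example.com/{SEARCH_QUERY}?page={PAGE_NUMBER}",
        {"SEARCH_QUERY": "cars;planes", "PAGE_NUMBER": "[1,10]"},
    ))
    print(len(urls))   # 20
    print(urls[0])     # https://example.com/cars?page=1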

Selectors

Selectors are used to identify specific elements on a web page from which you want to scrape data. Each selector is associated with a target that determines the type of data to retrieve.

Selector Targets

  Target   Returns
  ------   --------------------------------------------
  Text     The text content of the selected element.
  Link     The URL of the element's href attribute.
  Image    The URL of an image's src attribute.
  HTML     The HTML content of the selected element.
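
As a rough illustration of what each target returns, the sketch below uses BeautifulSoup as a stand-in; the HTML snippet and the div.card a selector are made-up examples, and this is not how PullNode works internally.

    # Made-up example showing roughly what each selector target extracts,
    # using BeautifulSoup as a stand-in (not PullNode's actual implementation).
    from bs4 import BeautifulSoup

    html = '<div class="card"><a href="/item/42"><img src="/img/42.jpg">Red car</a></div>'
    soup = BeautifulSoup(html, "html.parser")

    element = soup.select_one("div.card a")          # the selected element

    text   = element.get_text(strip=True)            # Text target  -> "Red car"
    link   = element.get("href")                     # Link target  -> "/item/42"
    image  = element.select_one("img").get("src")    # Image target -> "/img/42.jpg"
    markup = str(element)                            # HTML target  -> the element's markup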

Pagination

For websites that split content across multiple pages, you can use the pagination settings to navigate through them automatically.

Settings

  • CSS Selector or XPath: Provide the selector or XPath for the "Next" button or link.
  • Max Pages to Load: Set the maximum number of pages the scraper should navigate through.

Example

  • CSS Selector for Next Page: .next-page
  • Max Pages: 5

This will make the scraper click the "Next" button identified by the .next-page selector, stopping after 5 pages have been loaded.
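
For intuition, the sketch below shows the equivalent click-the-"Next"-button loop written with Playwright. The URL is a placeholder, and PullNode performs this navigation for you, so the code is only illustrative.

    # Illustrative Playwright sketch of the pagination loop; PullNode handles
    # this internally, and the URL below is a placeholder.
    from playwright.sync_api import sync_playwright

    NEXT_SELECTOR = ".next-page"   # CSS Selector for Next Page
    MAX_PAGES = 5                  # Max Pages to Load

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto("https://example.com/listing")

        for page_number in range(1, MAX_PAGES + 1):
            # ... extract data from the current page here ...
            if page_number == MAX_PAGES:
                break                          # stop after MAX_PAGES pages have loaded
            next_button = page.query_selector(NEXT_SELECTOR)
            if next_button is None:            # no "Next" button: last page reached
                break
            next_button.click()
            page.wait_for_load_state("networkidle")

        browser.close()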

Scroll

Some websites load more content as the user scrolls down. Handle this by setting the maximum number of times the scraper should simulate scrolling.

Settings

  • Max Times to Scroll: Indicate how many times the scraper should simulate a scroll down action to load new content.

Example

  • Max Times to Scroll: 10 (set this to a value greater than 0 to enable scroll handling)

This setting is essential for websites that do not use traditional pagination but instead load new content dynamically as the user scrolls.
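
The loop below sketches the same idea with Playwright: scroll to the bottom, wait for new content, and repeat up to the configured limit. The URL and the one-second wait are placeholder assumptions, and PullNode performs the scrolling itself, so this is only an illustration.

    # Illustrative Playwright sketch of scroll handling; PullNode performs the
    # scrolling itself, and the URL and wait time below are placeholders.
    from playwright.sync_api import sync_playwright

    MAX_SCROLLS = 10   # Max Times to Scroll

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto("https://example.com/feed")

        for _ in range(MAX_SCROLLS):
            # scroll to the bottom so the site loads the next batch of content
            page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
            page.wait_for_timeout(1000)   # give newly loaded content time to appear

        # ... extract data from the fully loaded page here ...
        browser.close()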

Conclusion

Understanding and utilizing these advanced scraping settings allows for more effective data collection from a wide range of websites. Whether dealing with dynamic URLs, extracting data from specific elements, navigating paginated content, or handling infinite scroll pages, PullNode provides the tools you need to streamline your web scraping projects.