WebScrapingTools

Documentation for WebScrapingTools.

This package provides utilities for scraping dynamic web pages. I tried using Webdriver.jl but couldn't figure out how to configure it and get started.

Currently only Firefox with geckodriver is supported.

External Program Management

Scraping a dynamic web page depends on a live web browser that runs the page's JavaScript and interprets the page to produce its DOM tree.

We currently use the combination of Firefox and geckodriver to process the web page. FirefoxGeckodriverSession encapsulates the processes for the external firefox and geckodriver commands. The functions startup, isactive, and teardown can be used to manage these processes.
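
As a rough illustration of what such a lifecycle involves, the sketch below wraps one external process in a hypothetical `DemoSession` type with `demo_startup`, `demo_isactive`, and `demo_teardown`. These names are invented for the example and are not this package's API, and the Unix `sleep` command stands in for firefox and geckodriver.

```julia
# Hypothetical stand-in for a session type; NOT the package's FirefoxGeckodriverSession.
mutable struct DemoSession
    process::Union{Nothing, Base.Process}
end

DemoSession() = DemoSession(nothing)

# A session is active while it holds a process that is still running.
demo_isactive(s::DemoSession) = s.process !== nothing && process_running(s.process)

# Launch the external program without blocking.
# Assumption: a Unix `sleep` command is available on the PATH.
function demo_startup(s::DemoSession)
    if !demo_isactive(s)
        s.process = run(`sleep 30`; wait=false)
    end
    return s
end

# Stop the external program and drop the process handle.
function demo_teardown(s::DemoSession)
    if demo_isactive(s)
        kill(s.process)   # sends SIGTERM by default
        wait(s.process)   # block until the process has actually exited
    end
    s.process = nothing
    return s
end

session = demo_startup(DemoSession())
try
    println("active after startup: ", demo_isactive(session))
finally
    demo_teardown(session)   # always clean up the external process
end
println("active after teardown: ", demo_isactive(session))
```

Putting the teardown in a `finally` clause means the external process is stopped even if the work in between throws.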

Fetching a Web Page

The function fetch_page is used to fetch a web page. It returns a parsed HTML DOM tree if successful. Otherwise an error is thrown.

Index

Definitions

WebScrapingTools.NavigateTo (Type)
NavigateTo(url)

Webdriver command to navigate to the specified URL. See [https://www.w3.org/TR/webdriver2/#dfn-navigate-to](https://www.w3.org/TR/webdriver2/#dfn-navigate-to).
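
For orientation, the W3C WebDriver specification defines Navigate To as a `POST` to `/session/{session id}/url` with a JSON body containing a `url` field. The sketch below only assembles those pieces as plain strings. The helpers `json_string` and `navigate_to_request` and the session id `abc123` are hypothetical, and this is not necessarily how `NavigateTo` is implemented in this package.

```julia
# Minimal JSON string escaping (backslash and double quote only).
# Assumption: good enough for simple URLs; a real client would use a JSON library.
json_string(s::AbstractString) =
    "\"" * replace(replace(s, "\\" => "\\\\"), "\"" => "\\\"") * "\""

# Hypothetical helper: describe the Navigate To request for a given session id.
function navigate_to_request(session_id::AbstractString, url::AbstractString)
    return (method = "POST",
            path   = "/session/$(session_id)/url",
            body   = "{\"url\": $(json_string(url))}")
end

req = navigate_to_request("abc123", "https://example.com/")
println(req.method, " ", req.path)
println(req.body)
```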

WebScrapingTools.WebdriverSession (Type)
WebdriverSession

WebdriverSession is the abstract supertype of the types that manage the external programs needed for Webdriver scraping. The intent is to have one subtype per browser.

WebScrapingTools.with_webdriver_session (Method)
with_webdriver_session(::Function, ::WebdriverSession)

Ensure that the WebdriverSession has running processes, then call the function.

The session's processes are terminated once the function returns.

The return value of the function is returned.
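
Because the function is the first argument, Julia's `do` block syntax applies. The sketch below shows the general ensure-running, call, always-clean-up pattern with a hypothetical in-memory `FlagSession` type and `with_demo_session` function. Both are invented for illustration, start no browser, and are not taken from the package source.

```julia
# Hypothetical in-memory stand-in for a session; no real browser is started.
mutable struct FlagSession
    running::Bool
end

# Sketch of the pattern: make sure the session runs, call `f`, always clean up.
function with_demo_session(f::Function, s::FlagSession)
    s.running = true          # stands in for starting the external processes
    try
        return f(s)           # the function's result becomes our result
    finally
        s.running = false     # stands in for teardown; runs even if `f` throws
    end
end

s = FlagSession(false)
result = with_demo_session(s) do session
    session.running ? "called while running" : "called while stopped"
end
println(result)
println("running afterwards: ", s.running)
```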
