WebScrapingTools

Documentation for WebScrapingTools.

This package provides utilities for scraping dynamic web pages. I tried using Webdriver.jl but couldn't figure out how to configure it and get started.

Currently only Firefox with geckodriver is supported.

External Program Management

Scraping a dynamic web page depends on a live web browser that executes the page's JavaScript and otherwise interprets the page to produce its DOM tree.

We currently use the combination of Firefox and geckodriver to process the web page. FirefoxGeckodriverSession encapsulates the processes for the external firefox and geckodriver commands. The functions startup, isactive, and teardown can be used to manage these processes.

Fetching a Web Page

The function fetch_page is used to fetch a web page. It returns a parsed HTML DOM tree if successful. Otherwise an error is thrown.
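
Because fetch_page throws on failure, a caller that scrapes many pages may prefer to log the error and continue. The following is a minimal sketch of such a wrapper. The name try_fetch is hypothetical, and the fetching function is passed in as an argument (an assumption made so that the sketch runs without a browser); in real use one would pass fetch_page.

```julia
# Hypothetical helper: call `fetcher(uri)` and return `nothing` instead of
# propagating an error. `fetcher` stands in for fetch_page.
function try_fetch(fetcher::Function, uri::AbstractString)
    try
        return fetcher(uri)
    catch err
        # Record what failed, then signal failure with `nothing`.
        @warn "Could not fetch page" uri exception = err
        return nothing
    end
end
```

A call site can then test the result with `isnothing` before working with the DOM tree.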

Index

WebScrapingTools.FirefoxGeckodriverSession
WebScrapingTools.GetCurrentURL
WebScrapingTools.GetPageSource
WebScrapingTools.NavigateTo
WebScrapingTools.NewSession
WebScrapingTools.WebdriverCommand
WebScrapingTools.WebdriverSession
WebScrapingTools.WebdriverStatus
WebScrapingTools.fetch_page
WebScrapingTools.isactive
WebScrapingTools.json_payload
WebScrapingTools.startup
WebScrapingTools.teardown
WebScrapingTools.uri_path
WebScrapingTools.with_webdriver_session

Definitions

WebScrapingTools.FirefoxGeckodriverSession — Type

FirefoxGeckodriverSession

FirefoxGeckodriverSession encapsulates the external processes that are necessary to scrape web pages using Firefox.

WebScrapingTools.GetCurrentURL — Type

GetCurrentURL()

Webdriver command to get the current URL. https://www.w3.org/TR/webdriver2/#dfn-get-current-url.

WebScrapingTools.GetPageSource — Type

GetPageSource()

Webdriver command to get the content of the current web page. https://www.w3.org/TR/webdriver2/#dfn-get-page-source.

WebScrapingTools.NavigateTo — Type

NavigateTo(url)

Webdriver command to navigate to the specified URL. https://www.w3.org/TR/webdriver2/#dfn-navigate-to.

WebScrapingTools.NewSession — Type

NewSession()

Webdriver command for creating a new session. https://www.w3.org/TR/webdriver2/#dfn-new-sessions.

WebScrapingTools.WebdriverCommand — Type

WebdriverCommand

Abstract supertype for all supported Webdriver commands.

WebScrapingTools.WebdriverSession — Type

WebdriverSession

WebdriverSession is the abstract supertype for the types used to manage the external programs that are needed for Webdriver scraping. There would be one subtype per browser.

WebScrapingTools.WebdriverStatus — Type

WebdriverStatus()

Webdriver status command. https://www.w3.org/TR/webdriver2/#dfn-status.

WebScrapingTools.fetch_page — Method

fetch_page(uri)

Fetch the dynamic content of the specified web page.

WebScrapingTools.isactive — Function

isactive(::WebdriverSession)

Returns true if the session is ready to serve Webdriver commands.

WebScrapingTools.json_payload — Method

json_payload(cmd::WebdriverCommand)

Returns the payload for the HTTP request that will be sent for `cmd`.

WebScrapingTools.startup — Function

startup(::WebdriverSession)

Starts the processes that are required for this type of browser.

WebScrapingTools.teardown — Function

teardown(::WebdriverSession)

Terminates the processes that are required for this type of browser.

WebScrapingTools.uri_path — Function

uri_path(cmd::WebdriverCommand, ::WebdriverSession)

Returns the URI path for the request represented by cmd.
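
The W3C WebDriver specification gives each command an endpoint path and, for some commands, a JSON request body, which is presumably what uri_path and json_payload compute. The following is a minimal sketch of that dispatch pattern, not the package's implementation. The Sketch-prefixed types, the sketch_uri_path and sketch_payload names, and the use of a plain session ID string in place of a WebdriverSession are all assumptions made for illustration; the endpoint paths are the ones given in the specification.

```julia
# Hypothetical stand-ins for the package's command types.
abstract type SketchCommand end

struct SketchNewSession <: SketchCommand end
struct SketchStatus <: SketchCommand end
struct SketchNavigateTo <: SketchCommand
    url::String
end
struct SketchGetCurrentURL <: SketchCommand end
struct SketchGetPageSource <: SketchCommand end

# Endpoint paths from the W3C WebDriver specification.
# `session_id` is a plain string here (an assumption of this sketch).
sketch_uri_path(::SketchNewSession, session_id) = "/session"
sketch_uri_path(::SketchStatus, session_id) = "/status"
sketch_uri_path(::SketchNavigateTo, session_id) = "/session/$(session_id)/url"
sketch_uri_path(::SketchGetCurrentURL, session_id) = "/session/$(session_id)/url"
sketch_uri_path(::SketchGetPageSource, session_id) = "/session/$(session_id)/source"

# Request body as a Dict, ready to be serialized to JSON.
# Commands sent with GET carry no body; that is modeled as an empty Dict.
sketch_payload(::SketchCommand) = Dict{String,Any}()
sketch_payload(::SketchNewSession) =
    Dict{String,Any}("capabilities" => Dict{String,Any}())
sketch_payload(cmd::SketchNavigateTo) = Dict{String,Any}("url" => cmd.url)
```

One method per command type keeps the HTTP layer generic: it only needs to ask a command for its path and payload, and supporting a new command means adding a type and its methods.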

WebScrapingTools.with_webdriver_session — Method

with_webdriver_session(::Function, ::WebdriverSession)

Ensure that the WebdriverSession has running processes, then call the function.

The session's processes are terminated once the function returns.

The return value of the function is returned.
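
The behavior described above follows a common resource-management pattern: start, call, then always clean up. The following is a minimal sketch of that pattern using a mock session, so that it runs without firefox or geckodriver. MockSession and the mock_-prefixed functions are hypothetical. Passing the session to the function and cleaning up in a `finally` clause (so that teardown also happens when the function throws) are assumptions of this sketch, not confirmed behavior of with_webdriver_session.

```julia
# Hypothetical stand-in for a browser session; it only tracks a flag
# instead of launching external processes.
mutable struct MockSession
    running::Bool
end

mock_startup(s::MockSession) = (s.running = true; s)
mock_teardown(s::MockSession) = (s.running = false; s)
mock_isactive(s::MockSession) = s.running

# Taking the function as the first argument allows `do` block syntax.
function with_mock_session(f::Function, s::MockSession)
    # Start the session only if it is not already running.
    mock_isactive(s) || mock_startup(s)
    try
        return f(s)
    finally
        # Runs whether `f` returns normally or throws.
        mock_teardown(s)
    end
end

session = MockSession(false)
result = with_mock_session(session) do s
    mock_isactive(s) ? "session was active" : "session was not active"
end
```

After the `do` block finishes, `result` holds the block's return value and the mock session is no longer running.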