WebScrapingTools
Documentation for WebScrapingTools.
This package provides utilities for scraping dynamic web pages. I tried using Webdriver.jl but couldn't figure out how to configure it and get started.
Currently only Firefox with geckodriver is supported.
External Program Management
Scraping a dynamic web page requires a live web browser to execute the page's JavaScript and produce the page's DOM tree.
We currently use Firefox together with geckodriver to process the web page. FirefoxGeckodriverSession encapsulates the external firefox and geckodriver processes. The functions startup, isactive, and teardown manage these processes.
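The lifecycle described above can be illustrated with a small self-contained sketch. ToySession is a hypothetical stand-in and is not part of WebScrapingTools: it only tracks a flag where a real FirefoxGeckodriverSession would launch and stop external processes. The three functions are defined locally and merely mirror the names startup, isactive, and teardown.

```julia
# Hypothetical stand-in for a browser session. A real session would hold
# handles to the external firefox and geckodriver processes.
mutable struct ToySession
    running::Bool
end
ToySession() = ToySession(false)

# Local functions that mirror the names used in the text (not the package's own).
startup(s::ToySession) = (s.running = true; s)
isactive(s::ToySession) = s.running
teardown(s::ToySession) = (s.running = false; s)

session = ToySession()
startup(session)
try
    # Only issue Webdriver commands once the session reports that it is ready.
    isactive(session) || error("session failed to start")
    println("active during work: ", isactive(session))
finally
    # Always stop the external processes, even if the scraping code throws.
    teardown(session)
end
println("active after teardown: ", isactive(session))
```

Placing teardown in a finally clause is a design choice of this sketch, so that stray browser processes are not left behind after an error.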
Fetching a Web Page
The function fetch_page is used to fetch a web page. It returns a parsed HTML DOM tree if successful. Otherwise an error is thrown.
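As a rough illustration of what a fetch involves at the protocol level, the sketch below maps two W3C WebDriver commands (Navigate To and Get Page Source) to their HTTP method, endpoint path, and payload. The endpoint paths come from the W3C WebDriver specification. Everything else is an assumption made for illustration: the Toy* types, http_method, toy_uri_path, toy_payload, and describe_fetch are hypothetical names, and whether fetch_page issues exactly these requests is not confirmed by this documentation.

```julia
# Toy command types, loosely modelled on the command types this package lists.
abstract type ToyCommand end
struct ToyNavigateTo <: ToyCommand
    url::String
end
struct ToyGetPageSource <: ToyCommand end

# HTTP method and endpoint path, as given by the W3C WebDriver specification.
http_method(::ToyNavigateTo) = "POST"
http_method(::ToyGetPageSource) = "GET"
toy_uri_path(::ToyNavigateTo, session_id) = "/session/$session_id/url"
toy_uri_path(::ToyGetPageSource, session_id) = "/session/$session_id/source"

# Request parameters: Navigate To carries the target URL, Get Page Source none.
toy_payload(cmd::ToyNavigateTo) = Dict("url" => cmd.url)
toy_payload(::ToyGetPageSource) = Dict{String,String}()

# The requests a fetch would issue, in order, within an existing session.
function describe_fetch(session_id, url)
    cmds = ToyCommand[ToyNavigateTo(url), ToyGetPageSource()]
    return [(http_method(c), toy_uri_path(c, session_id), toy_payload(c)) for c in cmds]
end

for (method, path, payload) in describe_fetch("abc123", "https://example.org")
    println(method, " ", path, " ", payload)
end
```

Dispatching on the command type keeps each command's path and payload in one place, which is one plausible reason to model commands as subtypes of a common abstract type.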
Index
WebScrapingTools.FirefoxGeckodriverSession
WebScrapingTools.GetCurrentURL
WebScrapingTools.GetPageSource
WebScrapingTools.NavigateTo
WebScrapingTools.NewSession
WebScrapingTools.WebdriverCommand
WebScrapingTools.WebdriverSession
WebScrapingTools.WebdriverStatus
WebScrapingTools.fetch_page
WebScrapingTools.isactive
WebScrapingTools.json_payload
WebScrapingTools.startup
WebScrapingTools.teardown
WebScrapingTools.uri_path
WebScrapingTools.with_webdriver_session
Definitions
WebScrapingTools.FirefoxGeckodriverSession — Type
FirefoxGeckodriverSession
FirefoxGeckodriverSession encapsulates the external processes that are necessary to scrape web pages using Firefox.
WebScrapingTools.GetCurrentURL — Type
GetCurrentURL()
Webdriver command to get the current URL. See https://www.w3.org/TR/webdriver2/#dfn-get-current-url
WebScrapingTools.GetPageSource — Type
GetPageSource()
Webdriver command to get the content of the current web page. See https://www.w3.org/TR/webdriver2/#dfn-get-page-source
WebScrapingTools.NavigateTo — Type
NavigateTo(url)
Webdriver command to navigate to the specified URL. See https://www.w3.org/TR/webdriver2/#dfn-navigate-to
WebScrapingTools.NewSession — Type
NewSession()
Webdriver command for creating a new session. See https://www.w3.org/TR/webdriver2/#dfn-new-sessions
WebScrapingTools.WebdriverCommand — Type
WebdriverCommand
Abstract supertype for all supported Webdriver commands.
WebScrapingTools.WebdriverSession — Type
WebdriverSession
WebdriverSession is the abstract supertype for the types used to manage the external programs that are needed for Webdriver scraping. Each supported browser is intended to have its own subtype.
WebScrapingTools.WebdriverStatus — Type
WebdriverStatus()
Webdriver status command. See https://www.w3.org/TR/webdriver2/#dfn-status
WebScrapingTools.fetch_page — Method
fetch_page(uri)
Fetch the dynamic content of the specified web page.
WebScrapingTools.isactive — Function
isactive(::WebdriverSession)
Returns true if the session is ready to serve Webdriver commands.
WebScrapingTools.json_payload — Method
json_payload(cmd::WebdriverCommand)
Returns the payload for the HTTP request that will be sent for `cmd`.
WebScrapingTools.startup — Function
startup(::WebdriverSession)
Starts the processes that are required for this type of browser.
WebScrapingTools.teardown — Function
teardown(::WebdriverSession)
Terminates the processes that are required for this type of browser.
WebScrapingTools.uri_path — Function
uri_path(cmd::WebdriverCommand, ::WebdriverSession)
Returns the URI path for the request represented by cmd.
WebScrapingTools.with_webdriver_session — Method
with_webdriver_session(::Function, ::WebdriverSession)
Ensure that the WebdriverSession has running processes, then call the function.
The session's processes are terminated once the function returns.
The return value of the function is returned.
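The function-first argument order suggests that with_webdriver_session is meant to be called with Julia's do block syntax. The following self-contained sketch shows that pattern with hypothetical stand-ins (DemoSession, start!, stop!, and with_demo_session are not part of the package). Cleaning up when the callback throws is an assumption of this sketch; the description above only covers the case where the function returns.

```julia
# Hypothetical stand-in for a session; the flag replaces real external processes.
mutable struct DemoSession
    running::Bool
end

start!(s::DemoSession) = (s.running = true; s)
stop!(s::DemoSession) = (s.running = false; s)

# Taking the function as the first argument enables `do` block syntax.
function with_demo_session(f::Function, s::DemoSession)
    s.running || start!(s)   # ensure the processes are running
    try
        return f(s)          # the callback's value becomes the result
    finally
        stop!(s)             # assumption: clean up even when `f` throws
    end
end

session = DemoSession(false)
result = with_demo_session(session) do s
    s.running ? "scraped while running" : "not running"
end
println(result)
println("running afterwards: ", session.running)
```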