When I have to pull data from the web and give it structure, the source generally falls into one of three categories: a structured API, a “static” website, or a “dynamic” website.
The Challenge of Scraping HTML5 Sites
For example, to get the circle count from Google+ you have to load the page in a browser. The browser sends AJAX requests to fetch the data and then writes the count into the page. A plain HTTP request for the page returns the markup before those requests have run, so the count is not in it. If you open Chrome’s inspector window and enable “Log XMLHttpRequests”, you can see every request the page makes.
Using PhantomJS as a Scraper
PhantomJS is a headless browser. That means it’s a full browser that you can control programmatically, but it doesn’t show anything on the screen. Its original purpose was to help programmers automate website testing. That makes it a good fit for scraping, but you have to run a PhantomJS instance for every site you want to scrape.
Scaling PhantomJS with Capybara
To scale PhantomJS across multiple threads I used Capybara. It is also an automated testing tool, and it provides easy-to-use functions for starting and killing threads, navigating pages, and parsing HTML.
To run multiple instances of PhantomJS, I wrote a simple wrapper API that starts a thread with PhantomJS running, interacts with a website, grabs the information I need, shuts down the thread, and returns the information. A job queue manages those instances, which makes it painless to run many of them in parallel.
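As a rough illustration of the job-queue idea (not the production setup), the sketch below feeds IDs to a fixed pool of worker threads using only Ruby's standard library. The name `run_jobs` and the block passed to it are hypothetical; in practice the block would drive a PhantomJS instance, while here it is just a stand-in.

```ruby
# Sketch: a job queue drained by a fixed pool of worker threads.
# `run_jobs` and the scrape block are hypothetical names for illustration.
def run_jobs(ids, workers: 4, &scrape)
  jobs = Queue.new
  ids.each { |id| jobs << id }
  jobs.close                # once drained, a closed queue returns nil from pop

  results = Queue.new       # thread-safe place to collect results

  pool = Array.new(workers) do
    Thread.new do
      # Each worker keeps pulling jobs until the queue is closed and empty.
      while (id = jobs.pop)
        begin
          results << [id, scrape.call(id)]
        rescue StandardError => e
          # Record the failure instead of letting one bad page kill the worker.
          results << [id, { error: e.message }]
        end
      end
    end
  end

  pool.each(&:join)

  # Turn the collected [id, result] pairs into a hash keyed by id.
  Array.new(results.size) { results.pop }.to_h
end

# Example call with a stand-in block instead of a real browser session.
profiles = run_jobs(%w[101 102 103], workers: 2) { |id| { id: id } }
```

The `workers` argument caps how many scrapes run at once, which in a real setup would also cap how many PhantomJS instances are alive at the same time.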
The Source Code
The pattern I used is similar to the HTTParty gem. The idea is to create a class that encapsulates a specific job, in this case scraping Google+ and returning a hash with the results.
First, I use a mix-in module that creates a basic DSL for building wrapper APIs. It provides the two things I need in every class I write: starting and stopping a thread, and getting the HTML.
Finally, I write an encapsulating class for Google+. Create an instance of the class with the Google+ ID number passed in, wait a few seconds for the page to pull in the relevant data, then parse the HTML with Nokogiri. From there, we can query the XPath and read the circle counts.
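A mix-in along those lines might look like the sketch below. It is an approximation, not the original module: the names `HeadlessScraper`, `base_url`, `with_session`, and `fetch_html` are made up for illustration, and it assumes the `capybara` and `poltergeist` gems (Poltergeist being one Capybara driver for PhantomJS) plus a PhantomJS binary on the path. It manages a Capybara session rather than a raw thread.

```ruby
# Sketch of a mix-in that gives wrapper classes a tiny DSL.
# All names in this module are hypothetical.
module HeadlessScraper
  def self.included(base)
    base.extend(ClassMethods)
  end

  module ClassMethods
    # DSL: `base_url "https://example.com"` sets the root URL;
    # calling `base_url` with no argument reads it back.
    def base_url(url = nil)
      @base_url = url if url
      @base_url
    end
  end

  # Starts a session, yields it, and always shuts it down afterwards.
  def with_session
    session = build_session
    yield session
  ensure
    session&.driver&.quit
  end

  # Loads a path, waits for the page's AJAX calls, and returns the rendered HTML.
  def fetch_html(path, wait: 3)
    with_session do |session|
      session.visit("#{self.class.base_url}#{path}")
      sleep(wait) # crude fixed wait for the page to fill in its data
      session.html
    end
  end

  private

  # Assumption: the capybara and poltergeist gems are installed.
  # Required here so the module itself loads without them.
  def build_session
    require "capybara/poltergeist"
    Capybara::Session.new(:poltergeist)
  end
end
```

Keeping session creation in one private method means a class can swap in a different driver, or a fake session in tests, without touching the rest of the module.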
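The parsing half of such a class could be sketched as follows. It covers only the Nokogiri step and takes already-rendered HTML as a string, so loading the page and waiting for it are left to whatever drives the browser. The class name, the hash keys, and above all the XPath expressions are hypothetical placeholders; the real Google+ markup is not shown in this post. It assumes the `nokogiri` gem is installed.

```ruby
# Sketch of the parsing step only. The XPath expressions and hash keys are
# hypothetical placeholders, not the real Google+ markup.
class GooglePlusCircles
  XPATHS = {
    in_circles:     "//span[@class='in-circles-count']",
    has_in_circles: "//span[@class='has-in-circles-count']"
  }.freeze

  def initialize(id)
    @id = id
  end

  # Takes rendered HTML and returns a hash such as
  # { id: ..., in_circles: ..., has_in_circles: ... }.
  # A count is nil when its node is missing or contains no digits.
  def parse(html)
    require "nokogiri" # assumption: the nokogiri gem is available
    doc = Nokogiri::HTML(html)
    counts = XPATHS.transform_values { |xpath| extract_count(doc, xpath) }
    { id: @id }.merge(counts)
  end

  private

  def extract_count(doc, xpath)
    text = doc.at_xpath(xpath)&.text
    digits = text.to_s[/\d[\d,]*/] # first run of digits, commas allowed
    digits && digits.delete(",").to_i
  end
end
```

Returning `nil` for a missing node, rather than `0`, keeps "the page did not render in time" distinguishable from "this profile is in no circles".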
(Note: This is not my production code.)
Now that we have the code, getting circles from Google+ takes a single line of code.
Performance Considerations When Scaling
I use this method of scraping only when it’s necessary. I always prefer an API or regular scraping when those options are available, because the performance is much better. If you’re not careful, scaling this approach can become a big memory hog. But in some cases PhantomJS is the only way to get something done.
It’s a big hammer. Use it sparingly.