Scraping HTML5 Sites Using Capybara + PhantomJS

When I have to get data from the web and add structure to it, the source generally falls into one of three categories: structured API data, data from a “static” website, or data from a “dynamic” website.

I define a “dynamic” website as a page that requires executing JavaScript to get to the data. In other words, whatever I need to scrape off the page was added to the DOM after the page loaded.

The Challenge of Scraping HTML5 sites

For example, to get the circle count from Google+, you have to load the page in a browser. The browser sends AJAX requests to fetch the data and then adds the count to the page. If you open Chrome’s inspector window and enable “Log XMLHttpRequests”, you can see every request it makes.

All that really means is that you can’t get those counts without automating a browser. That’s where Capybara and PhantomJS come into play.

Using PhantomJS as a scraper

PhantomJS is a headless browser. That means it’s a full browser that you can control programmatically, but it doesn’t show anything on the screen. Its original purpose was to help programmers automate website testing. For scraping purposes this is perfect, but you have to run PhantomJS for every site you want to scrape.

Scaling PhantomJS with Capybara

To scale PhantomJS across multiple threads, I used Capybara. It’s also an automated testing tool, and it provides easy-to-use functions for starting and killing processor threads, navigating pages, and parsing HTML.

To run multiple instances of PhantomJS, I wrote a simple wrapper API that starts up a thread with PhantomJS running, interacts with a website, grabs the information I need, shuts down the thread, and returns the information. Each of those instances is managed by a job queue, which makes it painless to run many of them in parallel.
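
A minimal sketch of the job-queue idea, assuming a plain in-process worker pool built from Ruby's standard Queue and Thread classes rather than any particular queueing library. The run_in_parallel helper and its workers: parameter are hypothetical names used only for illustration, and the jobs are placeholders; a real job would start PhantomJS, scrape one page, and shut down.

```ruby
# Hypothetical helper: runs callable jobs on a small pool of worker threads
# and returns their results in the same order the jobs were given.
def run_in_parallel(jobs, workers: 4)
  queue = Queue.new
  jobs.each_with_index { |job, index| queue << [job, index] }
  queue.close # once the queue is drained, pop returns nil and workers stop

  results = Array.new(jobs.size)
  threads = Array.new(workers) do
    Thread.new do
      while (item = queue.pop)
        job, index = item
        results[index] = job.call # each job writes only to its own slot
      end
    end
  end

  threads.each(&:join)
  results
end

# Placeholder jobs; a real job would drive one PhantomJS session.
jobs = %w[111 222 333].map { |id| -> { "scraped #{id}" } }
p run_in_parallel(jobs, workers: 2)
```

Capping the number of workers is the point of the design: each PhantomJS process is heavy, so the pool size bounds how many browsers are alive at once.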

The Source Code

The pattern I used is similar to the HTTParty gem. The idea is to create a class that encapsulates a specific job, in this case scraping Google+ and returning a hash with the results.

First, I use a mix-in module that creates a basic DSL for writing wrapper APIs. It provides the two things that every class I write needs: starting and stopping a thread, and getting the HTML.
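
A minimal sketch of such a mix-in, assuming the capybara and poltergeist gems are installed and a PhantomJS binary is on the PATH. The module name is a guess based on the file name mentioned in the comments; new_session and html are the method names discussed there, while end_session is a hypothetical addition. Requiring the gems lazily inside new_session is a choice of this sketch, not something the post prescribes.

```ruby
# Sketch of a mix-in that owns one headless browser session.
module CapybaraWithPhantomJs
  # Start a new browser session. The default driver is :poltergeist
  # (headless PhantomJS); pass :selenium to watch a visible browser.
  def new_session(driver = :poltergeist)
    # Loaded lazily so that including the module does not need the gems.
    require 'capybara'
    require 'capybara/poltergeist' # also registers the :poltergeist driver

    @session = Capybara::Session.new(driver)
  end

  # HTML of the current page, including anything JavaScript added to the DOM.
  def html
    @session.html
  end

  # Hypothetical helper: shut down the browser behind this session.
  def end_session
    @session.driver.quit
    @session = nil
  end
end
```

A class that includes this module would call new_session, visit a page through @session, read html, and then call end_session so the PhantomJS process does not linger.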

Finally, I write an encapsulating class for Google+. Create an instance of the class with the Google+ ID number passed in, wait a few seconds for the page to pull in the relevant data, then parse the HTML with Nokogiri. From there, we can look up the XPath and get the circle counts.
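
A minimal, self-contained sketch of what such a class might look like, assuming the capybara, poltergeist, and nokogiri gems. The class name, the profile URL pattern, the XPath, and the shape of the returned hash are all placeholders chosen for illustration, not the real Google+ markup. The session: and wait: parameters are hypothetical and exist so the class can be exercised with a stand-in session.

```ruby
# Sketch only: URL pattern, XPath, and hash keys are illustrative placeholders.
class GooglePlus
  COUNT_XPATH = "//span[contains(@class, 'circle-count')]".freeze # hypothetical

  def initialize(id, session: nil, wait: 3)
    @id = id
    @session = session # optional injected session (useful for testing)
    @wait = wait       # seconds to let the page's AJAX calls finish
  end

  def url
    "https://plus.google.com/#{@id}" # assumed profile URL pattern
  end

  # Load the page, wait for JavaScript to fill in the data, return the HTML.
  def fetch_html
    session = @session || default_session
    session.visit(url)
    sleep(@wait)
    session.html
  ensure
    session.driver.quit if session # always shut the browser down
  end

  # Parse the rendered HTML and return a hash with the counts found.
  def circles
    require 'nokogiri'
    doc = Nokogiri::HTML(fetch_html)
    counts = doc.xpath(COUNT_XPATH).map do |node|
      node.text[/\d[\d,]*/].to_s.delete(',').to_i
    end
    { id: @id, circle_counts: counts }
  end

  private

  # Headless PhantomJS session; requiring poltergeist registers the driver.
  def default_session
    require 'capybara'
    require 'capybara/poltergeist'
    Capybara::Session.new(:poltergeist)
  end
end
```

With a class like this, a call such as GooglePlus.new('1234567890').circles (the ID is made up) would start PhantomJS, load the profile, and return the hash.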

(Note: This is not my production code.)

Now that we have the code, getting circles from Google+ is as simple as calling one line of code.

Performance Considerations When Scaling

I use this method of scraping only when it’s necessary. I will always prefer an API or regular scraping if those options are available, because the performance is much better. If you’re not careful, scaling this can become a big memory hog. But in some cases PhantomJS is the only way to get something done.

It’s a big hammer. Use it sparingly.

  • Robin

    Hi Chris, what is driver supposed to be in line 16 of capybara_with_phantom_js.rb?

    Also, how did you set up Capybara? I’m using your example, and I am getting errors saying that rack-test requires a Rack application. Thanks!

    • http://twitter.com/iamchrisle Chris Le

      Hi Robin! Thanks for noticing that. I just fixed it in the above code. The driver should simply be :poltergeist.

      In my production code it was “def new_session(driver = nil)” because I would flip the driver back and forth between “:selenium” and “:poltergeist”. Basically, in development, I would use :selenium to see what was happening. Then in production it would use “:poltergeist” to be headless.

      Thanks for pointing that out. Hope it helps.

    • http://twitter.com/iamchrisle Chris Le

      Oh, and to set up Capybara and Poltergeist, add the following to your Gemfile:

      gem 'capybara'
      gem 'poltergeist'

  • satb

    Thanks for the writeup.

    def html
      session.html
    end

    Should that be @session.html?

    Also, I am getting an empty response back. I posted on stackoverflow. Any idea?

    http://stackoverflow.com/questions/15733827/capybara-poltergeist-and-phantomjs-and-giving-an-empty-response