Rob Dodson - Home

Building a Simple Scraper with Nokogiri in Ruby


Since I've been talking so much about D3.js lately, I thought it might be fun to start a little project that combines D3 and Ruby. The idea is to build a very simple page scraper that counts how often certain words are used in each post. I've also decided to start adding a little block of metadata at the end of each post so I can graph that over time as well.

So how do we get started? Well, first we'll need to build a page scraper of some kind. This program will have to consume the contents of an HTML page, find the node that contains our blog post, and count how often each word occurs. For right now that should be more than enough to get us started. We'll look at grabbing the metadata and drawing graphs in future posts. I should point out that this idea was inspired by the wonderful site 750words.com, which creates a beautiful exploration section any time you write a new journal entry. Definitely check out that site, it's amazing.

Hello Noko

I decided early on that I wanted the scraper to use Nokogiri because I've heard so much about it. As the authors describe it:

Nokogiri (鋸) is an HTML, XML, SAX, and Reader parser. Among Nokogiri’s many features is the ability to search documents via XPath or CSS3 selectors.

Using CSS selectors means that working with Nokogiri is a lot like working with jQuery. Here's a quick demonstration:

require 'open-uri'
require 'nokogiri'

# URI.open is provided by open-uri; a bare open() no longer accepts URLs on Ruby 3+
doc = Nokogiri::HTML(URI.open('https://www.google.com/search?q=unicorns'))

doc.css('h3.r a').each do |link|
  puts link.content
end

Easy enough, right? Taking it a step further, let's iterate over each blog entry on the page, split its text into words, and tally those words in a Hash.

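require 'open-uri'
require 'nokogiri'

# Word => count; unseen words default to 0
@counts = Hash.new(0)

# Split a string into lowercase words (letters, digits, underscores, apostrophes)
def words_from_string(string)
  string.downcase.scan(/[\w']+/)
end

# Add each word in the list to the running totals
def count_frequency(word_list)
  word_list.each do |word|
    @counts[word] += 1
  end
  @counts
end

# URI.open is provided by open-uri; a bare open() no longer accepts URLs on Ruby 3+
doc = Nokogiri::HTML(URI.open('http://robdodson.me'))

# Search for nodes by CSS selector
entries = doc.css('div.entry-content')
puts "Parsing #{entries.length} entries"
entries.each do |entry|
  words = words_from_string(entry.content)
  count_frequency(words)
end

# Sort ascending by count, so the most frequent words print last
sorted = @counts.sort_by { |word, count| count }
puts sorted.map { |word, count| "#{word}: #{count}" }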

The output from this script should look (kind of) like this:

...
ruby: 66
rvm: 66
our: 68
can: 71
3: 75
if: 77
for: 82
your: 88
2: 88
is: 91
this: 91
s: 94
we: 95
that: 106
i: 118
in: 119
it: 125
1: 128
and: 149
of: 170
a: 231
you: 233
to: 342
the: 382
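
Since the list is sorted in ascending order, the most frequent words end up at the bottom. If you only care about the top few, here's a small sketch of another way to do the counting, assuming Ruby 2.7 or newer for `Enumerable#tally`. The `top_words` method and the sample sentence are made up for illustration and aren't part of the scraper.

```ruby
# Counts words and returns the n most frequent as [word, count] pairs.
# Ties are broken alphabetically so the result is stable.
def top_words(text, n = 3)
  text.downcase.scan(/[\w']+/)
      .tally
      .sort_by { |word, count| [-count, word] }
      .first(n)
end

sample = "the cat and the hat and the bat"
top_words(sample).each { |word, count| puts "#{word}: #{count}" }
```

Negating the count in `sort_by` gives a descending sort without needing a separate `reverse`.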

It looks like our regex could use a bit of work so it doesn't grab single letters like 's' or bare numbers, but it's definitely a good start. Tomorrow we'll put everything into a Module and back it with tests.
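
As a rough sketch of where that regex cleanup could go, here's one possible version. It assumes the stray 's' comes from curly apostrophes (which `[\w']+` treats as a word break), and it keeps 'a' and 'i' since those are real words. The `filtered_words` name is made up for this example, and the filtering rules are just one reasonable guess.

```ruby
# Splits text into lowercase words, skipping numbers and stray single letters.
def filtered_words(string)
  string.downcase
        .tr("’", "'")                # treat curly apostrophes like straight ones
        .scan(/[a-z]+(?:'[a-z]+)*/)  # letters only, with optional inner apostrophes
        .reject { |word| word.length < 2 && !%w[a i].include?(word) }
end

p filtered_words("It’s 2012 and Rob's 3 posts don't mention s or RVM")
```

One trade-off: because digits are dropped entirely, a token like "d3" shrinks to "d" and gets filtered out, so names that mix letters and numbers would need a special case.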


  • Mood: Relaxed, Tired
  • Sleep: 6.5
  • Hunger: 5
  • Coffee: 1
