Object Oriented Scraper Backed With Tests Pt...9?

18th June 2012 — 5 minute read


I just spent a few hours talking to my friend Derek (@derekebradley) about Ruby and it occurred to me that I never finished this scraper project. We got awfully far with it but then it kind of died on the vine. Thankfully, I wrote it all down.

The fact of the matter is I didn't know where to take the data. I didn't have a design or a layout that I could put it all into. I want to change all that. I want to turn this into something useful. But first I have to make sense of all the code that was written so many weeks ago.

Tests as documentation...bullshit.

Ok ok. I shouldn't say it's total bullshit to call your tests the documentation, because they are helpful. But the fact of the matter is you can get so crafty with RSpec that the tests become difficult to read in a useful way. I'm not saying they're illegible, it's just that they leverage features which add to the mental overhead of reading them. Before you go off saying that I wrote them wrong and tests should be all the documentation you need...shut up. They're helpful but I would love it if I had written a bit of a Markdown README to go with all this...

Explain yourself

Let's see if I can regurgitate what this thing currently does in plain English.

There's a config.yml file. It says what page to scrape, what the CSS selector for a post looks like and what the CSS selector for metadata looks like. The metadata is the list at the bottom of every page listing the time, amount of sleep, coffee, etc.
There's a command line object, tentacles. It initiates runner.rb. Runner creates an instance of Options. Options loads the config.yml file and parses it, turning its properties into members of the options object.
It doesn't actually do anything beyond that. runner.rb stops right there, but we have RSpec tests which fake data and check to see if our other classes work. Those other classes are...
crawler.rb should be the real meat of our program. Funny, seeing as how I wrote all this, that I totally can't remember who does what...
crawler.rb has two primary methods: words_by_selector and metadata_by_selector.
words_by_selector returns an array of words and the number of times they've occurred. This array should be in order from most used to least used.
metadata_by_selector returns the content of one of our metadata lists.... I think.
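Here's roughly what that description looks like as code. This is a sketch, not the real Tentacles classes: SketchOptions and word_counts are made-up names, the config keys are guesses based on what the runner reads (metadata_selector in particular is an assumption), and the page fetching and CSS selector parts are left out entirely so it runs on nothing but the standard library.

```ruby
require 'yaml'

# Hypothetical stand-in for Tentacles::Options: turns each top-level key of a
# YAML config into a reader method on the object.
class SketchOptions
  def initialize(yaml_text)
    YAML.safe_load(yaml_text).each do |key, value|
      define_singleton_method(key) { value }
    end
  end
end

# Hypothetical stand-in for the counting half of words_by_selector: takes text
# that has already been pulled out of the page and returns [word, count]
# pairs, most used first (ties broken alphabetically).
def word_counts(text)
  counts = Hash.new(0)
  text.downcase.scan(/[a-z0-9_]+/) { |word| counts[word] += 1 }
  counts.sort_by { |word, count| [-count, word] }
end

# Key names and values here are assumptions, not the real config.yml.
config = <<~YAML
  uri: http://example.com/blog
  post_selector: .post
  metadata_selector: ul:last-child
YAML

options = SketchOptions.new(config)
puts options.post_selector

word_counts("We need npm. We should add npm to our package.").each do |word, count|
  puts "#{word}: #{count}"
end
```

The real crawler presumably does the selector lookup before any counting happens; the point here is only the shape of the data moving between the pieces.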

Make it work

With Tim Gunn's mantra we're gonna make this thing work. The tests verify that everything should be at least somewhat functioning. Since I'm a little drunk I can't do a super deep dive but let's see if we can get our runner to write out the contents of words_by_selector to a text file.

require 'yaml'
require_relative 'options'
require_relative 'crawler'

module Tentacles
  class Runner

    def initialize(config)
      @options = Tentacles::Options.new(config)
    end

    def run
      @crawler = Tentacles::Crawler.from_uri(@options.uri)
      output = @crawler.words_by_selector(@options.post_selector, 'ul:last-child')
      File.open("output.txt", "w") do |file|
        output.each do |line|
          file.puts line
        end
      end
    end
  end
end
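One thing worth knowing about the file.puts line call in there: puts flattens arrays, so if words_by_selector handed back [word, count] pairs, the word and the count would land on separate lines. I can't remember which shape the crawler returns, so here's a minimal sketch that assumes pairs and formats them explicitly. write_counts is a hypothetical helper, not part of Tentacles, and it writes into a temp directory so it doesn't clobber anything.

```ruby
require 'tmpdir'

# Hypothetical helper, not part of Tentacles: writes [word, count] pairs as
# "word: count" lines, one per entry.
def write_counts(path, pairs)
  File.open(path, "w") do |file|
    pairs.each do |word, count|
      file.puts "#{word}: #{count}"
    end
  end
end

# Write a few sample pairs to a throwaway file and print it back.
Dir.mktmpdir do |dir|
  path = File.join(dir, "output.txt")
  write_counts(path, [["we", 8], ["to", 8], ["npm", 6]])
  puts File.read(path)
end
```

If the crawler already returns finished strings, the plain file.puts line in the runner is all that's needed.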

To get this working I cd into the lib/ folder where all the code lives and do an irb -I . so I can require the local files.

require 'runner'
runner = Tentacles::Runner.new('config.yml')
runner.run

After doing that we do get a text file, with copy that looks somewhat correct...

we: 8
to: 8
npm: 6
should: 5
package: 4
our: 4
compliment: 4
git: 3
0: 3
4: 3
need: 3
2: 3
it: 3
node_modules: 3
the: 3
have: 3
be: 3
json: 2
your: 2
any: 2
dependencies: 2
module: 2
and: 2
node: 2
add: 2
xml2json: 2
how: 2
s: 2
in: 2
you: 2
json1: 2
an: 2
3: 2
awesome: 2
version: 2

It looks like the copy from my most recent blog post, plus or minus a few words. Horrible regex aside it kinda works and that's what we're after. Maybe tomorrow we can turn it into some JSON :D Till then. - Rob


Mood: Drunk, Sleepy
Sleep: 3
Hunger: 4
Coffee: 1
