Rob Dodson - Home

Object Oriented Scraper Backed with Tests Pt. 4

— 11 minute read
It looks like you've found one of my older posts 😅 It's possible that some of the information may be out of date (or just plain wrong!) If you've still found the post helpful, but feel like it could use some improvement, let me know on Twitter.

Continuing from our previous post we're going to keep working on our Crawler and our specs to see if we can start pulling real data from our site.

The first thing I did this morning was to run my tests:

bundle exec rspec spec/

..............

Finished in 0.01271 seconds
14 examples, 0 failures

As someone totally new to TDD/BDD this is kind of an awesome feeling. I left my code for a few days and now I can come back and verify that everything still works. We can take it even further and run rspec with a documentation formatter to get some pretty printed output:

    bundle exec rspec spec/ -cf d

Tentacles::Crawler
constructors
#from_uri
should respond
should return an instance
instances
should respond to #get_words_by_selector
should respond to #get_metadata_by_selector

Tentacles::Options
should respond to #uri
should respond to #post_selector
should respond to #metadata_selector
#initialize
when parsing the URI
when URI is valid
should display the right URI
when URI is invalid
should raise an exception
when URI does not contain a scheme
should raise an IO exception
when URI does not contain a host
should raise an IO exception

Tentacles::Runner
should respond to #run
when parsing the config file
should raise an error if the config file is missing
should raise an error if the config file is invalid

Finished in 0.01359 seconds
14 examples, 0 failures

In rspec the -c flag enables color in the output. The -f flag sets a formatter and d specifies the documentation format.

-f, --format FORMATTER           Choose a formatter.
[p]rogress (default - dots)
[d]ocumentation (group and example names)
[h]tml
[t]extmate
custom formatter class name

Neat.

In crawler_spec.rb I'm going to add a test that checks to see if our instance has actually stored the content from our mocked web request.

require_relative '../lib/tentacles/crawler'
require 'fakeweb'

describe Tentacles::Crawler do

before do
# Create a mock options object
@options = {
uri: 'http://robdodson.me',
post_selector: '.entry-content',
metadata_selector: '.personal-metadata'
}

# Create a mock web request
FakeWeb.register_uri(:get, @options[:uri], :body => "Hello World! Hello San Francisco!")
end

describe "constructors" do
describe "#from_uri" do
it "should respond" do
Tentacles::Crawler.should respond_to(:from_uri)
end

it "should return an instance" do
crawler = Tentacles::Crawler.from_uri(@options[:uri])
crawler.should be_an_instance_of(Tentacles::Crawler)
end
end
end

describe "instances" do
before do
@crawler = Tentacles::Crawler.from_uri(@options[:uri])
end

subject { @crawler }

it { should respond_to(:get_words_by_selector) }
it { should respond_to(:get_metadata_by_selector) }

context "post-construct" do
it "should have the right document" do
@crawler.doc.content.should =~ /Hello World! Hello San Francisco!/
end
end
end
end

I want to write a test to parse the content for keywords but I realize now that our FakeWeb request returns a string without any classes or id's. Gotta go back and wrap it in some HTML to match our selectors. So I'm changing the mock web request to look like this:

# Create a mock web request
FakeWeb.register_uri(:get, @options[:uri],
:body => '<div class="' + @options[:post_selector] + '">Hello World! Hello San Francisco!</div>')

Hello Hello Hello World!

After a lot of back and forth I finally get my test to pass. I realize along the way that there are a bunch of things I need to change. For starters having most of my words be the same count doesn't really help me to validate that my keyword counting is working all that well. So I'm changing our FakeWeb request and the subsequent specs which test against it.

# Create a mock web request
FakeWeb.register_uri(:get, @options[:uri],
:body => '<div class="' + @options[:post_selector].delete(".") + '">Hello Hello Hello World World Foobar!</div>')


context "post-construct" do
it "should have the right document" do
@crawler.doc.content.should =~ /Hello Hello Hello World World Foobar!/
end
end

Next I need to make sure that my get_words_by_selector method is accepting a selector.

def get_words_by_selector(selector)
entries = doc.css('div.entry-content')
entries.each do |entry|
words = words_from_string(entry.content)
count_frequency(words)
end

sorted = @counts.sort_by { |word, count| count }
sorted.reverse!
sorted.map { |word, count| "#{word}: #{count}"}
end

I also realize that I'd like my Array of keywords to be in desceding order so I reverse it after the initial sort.

Next I'm going to write the test to verify that we've received a group of words, counted them up and tossed them into an Array in descending order:

describe "#get_words_by_selector" do
it "should produce an Array of keywords" do
expected_array = ['hello: 3', 'world: 2', 'foobar: 1']
actual_array = @crawler.get_words_by_selector(@options[:post_selector])
actual_array.should eq(expected_array)
end
end

I actually wrote the test first and did everything else to make it pass. But at this point it should all be passing and we can verify that given a request with the appropriate selector we should be able to build a basic word frequency list. Yay!

You should follow me on Twitter here.

  • Mood: Calm, Awake, Curious
  • Sleep: 7
  • Hunger: 4
  • Coffee: 0

Filed under: