NOTE: This page has moved to the Scrubyt wiki. Edit that page if available; edits here may be lost.

This document describes the entire API provided by Scrubyt 0.3.

Navigation

Before you extract data, you need to find it. You always start with a call to fetch(), then you can optionally navigate from page to page until you find the data that needs to be extracted. Here's a simple example:

require 'rubygems'
require 'scrubyt'

extractor = Scrubyt::Extractor.define do
  fetch 'http://www.google.com/ncr'
  fill_textfield 'q', 'ruby'
  submit
end
extractor.to_text.write($stdout, 1)

TODO: why is this script illegal? I just want it to print the final page, but Scrubyt says "[ERROR] No extractor defined, exiting..."

fetch(url)

Fetches the given URL or file path. Accepts any number of options after the URL:

  • :proxy => "HOST:PORT" -- Specify the proxy to use in a string of host:port.
  • :mechanize_doc => TODO
  • :resolve => ?? -- :full, :host or a default string that gets prepended to the doc_url. TODO: what do :full and :host mean?
  • :basic_auth => [login, password] -- Specify the login and password for basic HTTP authentication.
  • :user_agent => STRING -- Specify the user agent to use here. By default it uses a very generic string like "Mozilla/5.0 (compatible; Konqueror/3.5; Linux) KHTML/3.5.5 (like Gecko) (Kubuntu)"

Examples:

  • fetch "http://digg.com/"
  • fetch File.join(File.dirname(__FILE__), "input.html")
  • fetch "test_record.html"
  • fetch("http://...", :proxy => "localhost")

click_link(text, index=0)

Clicks the link containing the given text. Supply the visible text only, without embedded tags: if the link reads "go here", call click_link("go here"). "Books[2]" clicks the second link whose text contains Books. TODO: this is a problem because then how do I specify a link that contains braces? How do I turn this feature off? And why not just use the index parameter?

You can also supply a compound example: click_link({:begins_with => 'to', :contains => /\d+/}) would find "to 999 zebra". The compound may contain any number of the keys :contains, :begins_with and :ends_with, and the values can be either regexps or strings.
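The matching semantics can be sketched in plain Ruby. The helper below is illustrative only (it is not Scrubyt's own code); it shows how each key of the compound descriptor would be checked against the link text, with string and regexp values handled separately:

```ruby
# Illustrative sketch of compound-descriptor matching; not Scrubyt internals.
def matches_compound?(text, descriptor)
  descriptor.all? do |key, value|
    case key
    when :begins_with
      value.is_a?(Regexp) ? text =~ /\A#{value.source}/ : text.start_with?(value)
    when :ends_with
      value.is_a?(Regexp) ? text =~ /#{value.source}\z/ : text.end_with?(value)
    when :contains
      value.is_a?(Regexp) ? !!(text =~ value) : text.include?(value)
    end
  end
end

matches_compound?("to 999 zebra", :begins_with => 'to', :contains => /\d+/)  # => true
```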

click_image_map(index)

Clicks the image map on the page. TODO: there's no way to say where to click it, right?

Forms

The functions in this section fill out and submit forms as if a user were doing it. AJAX-driven forms may require FireWatir (in development), but static forms are trivial.

fill_textfield(name, string)

Fills the given input field with the content given by string.

fill_textarea(name, text)

Fills the named textarea with the supplied text.

select_option(list_name, option)

Selects an item from a drop-down list.

check_checkbox(name)

Checks the named checkbox. TODO: is there any way to uncheck it? Or does it toggle? If it toggles, then how do I set it to a known value?

check_radiobutton(name, index=0)

Selects the indexed radio button in the named group.

submit()

Submits a form.

  • submit() -- submits the form for the last item that was edited
  • submit(form) -- specify the form that should be submitted
  • submit(form,button) -- specify the form and the button to submit it with.
  • submit(form,button,type) -- TODO??
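The no-argument form works because the extractor remembers which form the last edited field belonged to. The tiny class below is a hypothetical plain-Ruby sketch of that bookkeeping (the class name, data shapes, and method bodies are all made up for illustration; this is not Scrubyt's implementation):

```ruby
# Hypothetical sketch: each fill_* call records which form it touched,
# and a bare submit() falls back to that form.
class FormTracker
  def initialize(forms)
    @forms = forms          # e.g. { 'search' => { 'q' => nil } }
    @last_form = nil
  end

  def fill_textfield(name, value)
    # Find the form that owns this field and remember it as the last edited.
    @last_form, fields = @forms.find { |_, fields| fields.key?(name) }
    fields[name] = value
  end

  # submit()          -> submits the last-edited form
  # submit(form_name) -> submits the named form
  def submit(form_name = nil)
    form_name || @last_form or raise "no form to submit"
  end
end

tracker = FormTracker.new('search' => { 'q' => nil })
tracker.fill_textfield('q', 'ruby')
tracker.submit   # => "search"
```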

end

Reserved but not used? TODO: it looks like including 'end' will cause a method missing error? Check this.

todo

TODO: how do I handle cookies?

Extraction

In the extractor body, you specify what data you want copied to the results by specifying one or more patterns.

 require 'rubygems'
 require 'scrubyt'
 
 extractor = Scrubyt::Extractor.define do
   fetch "http://lwn.net"      # navigate to the desired page
   headlines "Headlines for"   # select results with a pattern
 end

Root Pattern

Any symbol that is not recognized as a Navigation call (above) is interpreted as a root pattern. You can only have one root pattern per extraction. In the example above, headlines is the root pattern. Patterns nest, so you can have as many as you want in the result set.

 record do
   item_name 'APPLE NEW IPOD MINI 6GB MP3 PLAYER SILVER'
   price '$71.99'
 end

Assuming there's an item with that name on the first page, Scrubyt generalizes the example and produces a record for each similar item, e.g.:

   <record>
     <item_name>APPLE IPOD NANO 4GB - PINK - MP3 PLAYER</item_name>
     <price>$149.95</price>
   </record>

If no result nodes match, you can specify the content using :default: price "$71.99", :default => "$0.00". TODO: is this right??

TODO: what on earth does :generalize => false do?

Shortcut Patterns

This is a short-hand way of specifying patterns with default behavior. For instance, detail_url is equivalent to detail_url 'href', :type => :attribute.

You can always override default values in shortcut patterns.

Right now there's only one shortcut pattern, which matches any href attribute:

  • name_url -- equivalent to name_url 'href', :type => :attribute

TODO: is name_detail (type => :detail_page) also a shortcut pattern? What is it? How is it used?

Other

You can use a string that looks like an XPath selector. TODO: How does Scrubyt tell the difference between example text and paths?

  • user_count "/html/body/table/tr/td/table/tr[1]/td[2]" -- sets 'user_count' to everything matched by this expression

select_indices

Allows you to only store a subset of everything that was matched.

  • article_title("xkcd: Commitment").select_indices([:first, :every_third])

You can pass:

  • A range of indices: 3..5
  • An array of indices: [3,4,7,12]
  • Keywords: :first, :last, :all_but_last, :all_but_first, :every_even, :every_odd, :every_second, :every_third, :every_fourth TODO: why is there no :all?
  • Any combination of the above; :first, :all_but_first would return everything.
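The way these selectors could map onto plain array indices can be sketched as follows. This is an illustrative plain-Ruby reimplementation of a subset of the keywords, not Scrubyt's internals:

```ruby
# Sketch: resolve select_indices arguments to concrete array indices.
# Only a subset of the keywords is implemented, for illustration.
def resolve_indices(selectors, size)
  indices = (0...size).to_a
  selectors.flat_map do |sel|
    case sel
    when Range, Array   then Array(sel)
    when :first         then [0]
    when :last          then [size - 1]
    when :all_but_first then indices[1..-1]
    when :all_but_last  then indices[0..-2]
    when :every_second  then indices.select { |i| i % 2 == 0 }  # 1st, 3rd, ...
    when :every_third   then indices.select { |i| i % 3 == 0 }
    end
  end.compact.uniq.sort
end

resolve_indices([:first, :every_third], 7)  # => [0, 3, 6]
```

Note that under this reading, combining :first with :all_but_first does indeed select every index, matching the remark above.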

Constraints

If your extractor is producing too much data, you could pare the results down with constraints.

 item do
   item_name "Canon Vertical Battery Grip BG-E3 For EOS Digital Rebel XT"
 end.ensure_presence_of_pattern 'price'

  • ensure_presence_of_pattern(tag) -- the given pattern must be somewhere in the node's ancestry. TODO: what pattern? what's allowed here? How is this different from ensure_presence_of_ancestor_node?
  • ensure_presence_of_attribute() -- ensure_presence_of_attribute("attr", "value"); value is optional, and if not specified it matches any value.
  • ensure_absence_of_attribute()
  • ensure_presence_of_ancestor_node() -- ensure_presence_of_ancestor_node :span, 'class' => 'searchProductPrice'
  • ensure_absence_of_ancestor_node()
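The ancestor-node check can be sketched in plain Ruby against the stdlib REXML parser. This is illustrative only, not Scrubyt's own code; it walks up the parent chain of a candidate node looking for an element with the given name and attributes:

```ruby
require 'rexml/document'

# Illustrative sketch of the check behind ensure_presence_of_ancestor_node.
def has_ancestor?(node, name, attrs = {})
  while (node = node.parent).is_a?(REXML::Element)
    next unless node.name == name.to_s
    return true if attrs.all? { |k, v| node.attributes[k] == v }
  end
  false
end

doc = REXML::Document.new(<<XML)
<div>
  <span class="searchProductPrice"><b>$71.99</b></span>
  <b>unrelated</b>
</div>
XML

# Keep only <b> nodes that sit under the matching <span> ancestor.
prices = REXML::XPath.match(doc, '//b').select do |b|
  has_ancestor?(b, :span, 'class' => 'searchProductPrice')
end
prices.map(&:text)  # => ["$71.99"]
```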

Learning

First, you write a quick skeleton script that runs using text that it finds on the page. For instance, this

   article_title("xkcd: Commitment").select_indices([:first,:every_third])

searches for an element containing that string, then stores all elements similar to it in the results.

There's no need for this step if you manually locate the information yourself using Firebug.

Production

In theory, to turn a learning script into a production script, you just print the results like this:

   extractor.export(__FILE__)

Unfortunately, all I get is a "integer 46950979137200 too big to convert to int" from RubyInlineAcceleration and a big fat stack trace. TODO: what is the difference between a learning script and a production script?

next_page

If you place this call after your root pattern, you can repeat the pattern and collect information from a sequence of pages. As long as the Next button is always named the same, Scrubyt will keep clicking Next and pulling data from the page until TODO: (what? other than :limit, how do I stop this?)

next_page 'Next >', :limit => 5

TODO: This is the only code that uses :limit? It's also the only call that can appear after the root pattern. I guess it just keeps clicking on the link to name until the page won't load or it hits the limit? TODO: Is there any other magic to next_page? I see something in the code about patterns...

TODO: "I collected a bunch of links from a page, and I was wondering if it was possible, in the same extractor, to fetch each of these links and extract info on the resulting page." Is there any way to do this other than next_page?
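The crawl loop next_page implies can be sketched in plain Ruby. The Page struct and crawl function below are stand-ins invented for illustration (they are not part of Scrubyt): follow the named link from page to page, collecting results, until the link disappears or the :limit is reached:

```ruby
# Stand-in page type: records on the page, plus the page behind "Next >"
# (nil when no such link exists).
Page = Struct.new(:records, :next_page)

def crawl(first_page, limit = nil)
  results, page, visited = [], first_page, 0
  while page
    results.concat(page.records)
    visited += 1
    break if limit && visited >= limit   # honour :limit => n
    page = page.next_page                # nil ends the crawl
  end
  results
end

last  = Page.new([:c], nil)
pages = Page.new([:a], Page.new([:b], last))
crawl(pages)       # => [:a, :b, :c]
crawl(pages, 2)    # => [:a, :b]
```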

Results

This is how you print the results of your extraction. TODO: all of these calls accept a pattern. What's it for?

  • to_xml -- returns the results in xml format
  • to_text -- returns the results as plain text
  • to_csv -- returns the results in csv format
  • to_hash -- returns the results as nested Ruby hashes.
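As a rough illustration of the shapes involved (the sample data below is made up), to_hash yields plain Ruby hashes you can index directly, while to_xml yields a REXML document whose write method takes an IO and an indent, which is why the examples below pass $stdout and 1:

```ruby
require 'rexml/document'
require 'stringio'

# Made-up sample data in the shape to_hash returns: one hash per record.
records = [
  { :item_name => 'APPLE IPOD NANO 4GB - PINK - MP3 PLAYER', :price => '$149.95' }
]
puts records[0][:price]   # plain hash access, as with to_hash

# Build an equivalent REXML document by hand to show the to_xml shape.
doc = REXML::Document.new
root = doc.add_element('root')
records.each do |rec|
  el = root.add_element('record')
  rec.each { |k, v| el.add_element(k.to_s).text = v }
end
out = StringIO.new
doc.write(out, 1)         # same (io, indent) signature the examples use
out.string                # XML markup for the record above
```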

Utility Functions

  • print_statistics -- prints some simple statistics like the count of hits by each pattern.
  • remove_empty_leaves -- removes empty leaves before printing the results.
  extractor.remove_empty_leaves.to_xml    TODO: is this right??

More examples:

  • extractor.to_xml.write($stdout, 1) -- prints the result of the scraping as XML
  • amazon_books.to_xml.write(open('result.xml', 'w'), 1) -- print to a file
  • puts amazon_books.book[1].title[0].to_text -- find individual items in the result set (BROKEN IN 0.3?)
  • extractor.to_hash -- books = amazon_books.to_hash; puts books[1][:title]

TODO

Scrubyt automatically recognizes and searches inside frames. Very impressive.

TODO: what does the :write_text option do?

TODO: what is this? "Temp patterns are skipped in the output (their ancestors are appended to the parent of the pattern which was skipped" Something to do with :output_type=>:model (default) vs :output_type=>:temp.

TODO: what the heck is a :detail_page?