Product Description
The Internet, with its profusion of information, has made us hungry for ever more, ever better data. Out of necessity, many of us have become pretty adept with search engine queries, but there are times when even the most powerful search engines aren't enough. If you've ever wanted your data in a different form than it's presented, or wanted to collect data from several sites and see it side-by-side without the constraints of a browser, then Spidering Hacks is for you.
Spidering Hacks takes you to the next level in Internet data retrieval--beyond search engines--by showing you how to create spiders and bots to retrieve information from your favorite sites and data sources. You'll no longer feel constrained by the way host sites think you want to see their data presented--you'll learn how to scrape and repurpose raw data so you can view it in a way that's meaningful to you.
Written for developers, researchers, technical assistants, librarians, and power users, Spidering Hacks provides expert tips on spidering and scraping methodologies. You'll begin with a crash course in spidering concepts, tools (Perl, LWP, out-of-the-box utilities), and ethics (how to know when you've gone too far: what's acceptable and unacceptable). Next, you'll collect media files and data from databases. Then you'll learn how to interpret and understand the data, repurpose it for use in other applications, and even build authorized interfaces to integrate the data into your own content. By the time you finish Spidering Hacks, you'll be able to:
- Aggregate and associate data from disparate locations, then store and manipulate the data as you like
- Gain a competitive edge in business by knowing when competitors' products are on sale, and comparing sales ranks and product placement on e-commerce sites
- Integrate third-party data into your own applications or web sites
- Make your own site easier to scrape and more usable to others
- Keep up-to-date with your favorite comic strips, news stories, stock tips, and more without visiting the site every day
About the Author
Kevin Hemenway, coauthor of Mac OS X Hacks, is better known as Morbus Iff, the creator of disobey.com, which bills itself as "content for the discontented." Publisher and developer of more home cooking than you could ever imagine, he'd love to give you a Fry Pan of Intellect upside the head. Politely, of course. And with love.
Tara Calishain is the creator of the site ResearchBuzz. She is an expert on Internet search engines and how they can be used effectively in business situations.
Excerpted from Spidering Hacks by Kevin Hemenway, Tara Calishain. Copyright © 2003. Reprinted by permission. All rights reserved.
Give a visual indication that a download is progressing smoothly.
With all this downloading, it's often helpful to have some visual representation of its progress. In most of the scripts in this book, there's always a bit of visual information being displayed to the screen: that we're starting this URL here, processing this data there, and so on. These helpful bits usually come before or after the actual data has been downloaded. But what if we want visual feedback while we're in the middle of a large MP3, movie, or database leech?
If you're using a fairly recent vintage of the LWP library, you'll be able to interject your own subroutine to run at regular intervals during download. In this hack, we'll show you four different ways of adding various types of progress bars to your current applications. To get the most from this hack, you should have ready a URL that's roughly 500 KB or larger; it'll give you a good chance to see the progress bar in action.
The Code
The first progress bar is the simplest, providing only a visual heartbeat so that you can be sure things are progressing and not just hanging. Save the following code to a file called progress_bar.pl and run it from the command line as perl progress_bar.pl URL, where URL is the online location of your appropriately large piece of sample data:
#!/usr/bin/perl -w
#
# Progress Bar: Dots - Simple example of an LWP progress bar.
#
# This code is free software; you can redistribute it and/or
# modify it under the same terms as Perl itself.
#
use strict; $|++;
my $VERSION = "1.0";

# make sure we have the modules we need, else die peacefully.
eval("use LWP 5.6.9;"); die "[err] LWP 5.6.9 or greater required.\n" if $@;

# now, check for passed URLs for downloading.
die "[err] No URLs were passed for processing.\n" unless @ARGV;

# our downloaded data.
my $final_data = undef;

# loop through each URL.
foreach my $url (@ARGV) {
    print "Downloading URL at ", substr($url, 0, 40), "... ";

    # create a new useragent and download the actual URL.
    # all the data gets thrown into $final_data, which
    # the callback subroutine appends to.
    my $ua = LWP::UserAgent->new( );
    my $response = $ua->get($url, ':content_cb' => \&callback, );
    print "\n"; # after the final dot from downloading.
}

# called once per chunk: append the data and print a dot.
sub callback {
    my ($data, $response, $protocol) = @_;
    $final_data .= $data;
    print ".";
}
None of this code is particularly new, save the addition of our primitive progress bar. We use LWP's standard get method, but add the :content_cb header with a value that is a reference to a subroutine that will be called at regular intervals as our content is downloaded. These intervals can be suggested with an optional :read_size_hint, which is the number of bytes you'd like received before they're passed to the callback.
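To make the :read_size_hint option concrete, here is a minimal sketch that asks LWP for roughly 8 KB per callback. The helper name make_dot_callback and the 8192 figure are our own assumptions for illustration, not part of the original script, and LWP treats the value as a suggestion only, so chunks may arrive larger or smaller.

```perl
#!/usr/bin/perl
use strict;
use warnings;
$|++;    # unbuffer STDOUT so each dot appears as its chunk arrives

# Hypothetical helper (not from the original script): builds a callback
# that appends each chunk to the scalar behind $buffer_ref and prints
# one dot per chunk.
sub make_dot_callback {
    my ($buffer_ref) = @_;
    return sub {
        my ($data, $response, $protocol) = @_;
        $$buffer_ref .= $data;
        print ".";
    };
}

# Only touch the network when a URL is passed on the command line.
if (@ARGV) {
    require LWP::UserAgent;
    my $url  = shift @ARGV;
    my $body = '';
    my $ua   = LWP::UserAgent->new;

    # 8192 is an arbitrary example value; it is a hint, not a guarantee.
    my $response = $ua->get(
        $url,
        ':content_cb'     => make_dot_callback(\$body),
        ':read_size_hint' => 8192,
    );
    print "\n", length($body), " bytes received (",
        $response->status_line, ")\n";
}
```

Run it as perl scriptname URL. A smaller hint should generally produce more dots for the same download, which is an easy way to see the option taking effect.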
In this example, we've defined that the data should be sent to a subroutine named callback. You'll notice that the routine receives the actual content, $data, that has been downloaded. Since we're overriding LWP's normal $response->content or :content_file features, we now have to take full control of the data. In this hack, we store all our results in $final_data, but we don't actually do anything with them.
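Taking full control of the data might look something like the following sketch, which writes each chunk straight to a file instead of holding everything in memory, and prints a percentage when the server reports a Content-Length. The helper names percent_done and make_saving_callback are hypothetical, and this is one possible approach rather than the code used elsewhere in this hack.

```perl
#!/usr/bin/perl
use strict;
use warnings;
$|++;    # unbuffer STDOUT so progress appears immediately

# Hypothetical helper: whole-number percentage of $total received so
# far, or undef when no usable total is known.
sub percent_done {
    my ($received, $total) = @_;
    return unless $total;
    return int(100 * $received / $total);
}

# Hypothetical helper: builds a callback that writes each chunk to the
# filehandle $fh and reports progress as it goes.
sub make_saving_callback {
    my ($fh) = @_;
    my $received = 0;
    return sub {
        my ($data, $response, $protocol) = @_;
        print {$fh} $data;
        $received += length $data;

        # content_length comes from the Content-Length header, which
        # not every server sends; fall back to dots when it is absent.
        my $total = $response ? $response->content_length : undef;
        my $pct   = percent_done($received, $total);
        print defined $pct ? "\r$pct% " : ".";
    };
}

# Only touch the network when both a URL and a filename are passed.
if (@ARGV == 2) {
    require LWP::UserAgent;
    my ($url, $file) = @ARGV;
    open my $fh, '>', $file or die "[err] Cannot write $file: $!\n";
    binmode $fh;
    my $response = LWP::UserAgent->new->get(
        $url, ':content_cb' => make_saving_callback($fh),
    );
    close $fh or die "[err] Cannot close $file: $!\n";
    print "\n", $response->status_line, "\n";
}
```

Run it as perl scriptname URL filename. Because nothing accumulates in a scalar, memory use should stay roughly flat no matter how large the download is.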