Spidering Hacks: 100 Industrial-Strength Tips & Tools [Paperback]

Kevin Hemenway , Tara Calishain

Product Description

The Internet, with its profusion of information, has made us hungry for ever more, ever better data. Out of necessity, many of us have become pretty adept with search engine queries, but there are times when even the most powerful search engines aren't enough. If you've ever wanted your data in a different form than it's presented, or wanted to collect data from several sites and see it side-by-side without the constraints of a browser, then Spidering Hacks is for you.

Spidering Hacks takes you to the next level in Internet data retrieval--beyond search engines--by showing you how to create spiders and bots to retrieve information from your favorite sites and data sources. You'll no longer feel constrained by the way host sites think you want to see their data presented--you'll learn how to scrape and repurpose raw data so you can view it in a way that's meaningful to you.

Written for developers, researchers, technical assistants, librarians, and power users, Spidering Hacks provides expert tips on spidering and scraping methodologies. You'll begin with a crash course in spidering concepts, tools (Perl, LWP, out-of-the-box utilities), and ethics (how to know when you've gone too far: what's acceptable and unacceptable). Next, you'll collect media files and data from databases. Then you'll learn how to interpret and understand the data, repurpose it for use in other applications, and even build authorized interfaces to integrate the data into your own content. By the time you finish Spidering Hacks, you'll be able to:

  • Aggregate and associate data from disparate locations, then store and manipulate the data as you like
  • Gain a competitive edge in business by knowing when competitors' products are on sale, and comparing sales ranks and product placement on e-commerce sites
  • Integrate third-party data into your own applications or web sites
  • Make your own site easier to scrape and more usable to others
  • Keep up-to-date with your favorite comic strips, news stories, stock tips, and more without visiting the site every day
Like the other books in O'Reilly's popular Hacks series, Spidering Hacks brings you 100 industrial-strength tips and tools from the experts to help you master this technology. If you're interested in data retrieval of any type, this book provides a wealth of data for finding a wealth of data.

About the Author

Kevin Hemenway, coauthor of Mac OS X Hacks, is better known as Morbus Iff, the creator of disobey.com, which bills itself as "content for the discontented." Publisher and developer of more home cooking than you could ever imagine, he'd love to give you a Fry Pan of Intellect upside the head. Politely, of course. And with love.

Tara Calishain is the creator of the site ResearchBuzz. She is an expert on Internet search engines and how they can be used effectively in business situations.

Excerpted from Spidering Hacks by Kevin Hemenway, Tara Calishain. Copyright © 2003. Reprinted by permission. All rights reserved.

Hack #18 Adding Progress Bars to Your Scripts
Give a visual indication that a download is progressing smoothly.

With all this downloading, it’s often helpful to have some visual representation of progress. Most of the scripts in this book display a bit of information on the screen: that we’re starting this URL here, processing this data there, and so on. These helpful bits usually come before or after the actual data has been downloaded. But what if we want visual feedback while we’re in the middle of a large MP3, movie, or database leech?

If you’re using a fairly recent vintage of the LWP library, you’ll be able to interject your own subroutine to run at regular intervals during download. In this hack, we’ll show you four different ways of adding various types of progress bars to your current applications. To get the most from this hack, you should have ready a URL that’s roughly 500 KB or larger; it’ll give you a good chance to see the progress bar in action.

The Code
The first progress bar is the simplest, providing only a visual heartbeat so that you can be sure things are progressing and not just hanging. Save the following code to a file called progress_bar.pl and run it from the command line as perl progress_bar.pl URL, where URL is the online location of your appropriately large piece of sample data:

#!/usr/bin/perl -w
#
# Progress Bar: Dots - Simple example of an LWP progress bar.
#
# This code is free software; you can redistribute it and/or
# modify it under the same terms as Perl itself.
#

use strict; $|++;
my $VERSION = "1.0";

# make sure we have the modules we need, else die peacefully.
eval("use LWP 5.6.9;"); die "[err] LWP 5.6.9 or greater required.\n" if $@;

# now, check for passed URLs for downloading.
die "[err] No URLs were passed for processing.\n" unless @ARGV;

# our downloaded data.
my $final_data = undef;

# loop through each URL.
foreach my $url (@ARGV) {
    print "Downloading URL at ", substr($url, 0, 40), "... ";

    # create a new useragent and download the actual URL.
    # all the data gets thrown into $final_data, which
    # the callback subroutine appends to.
    my $ua = LWP::UserAgent->new( );
    my $response = $ua->get($url, ':content_cb' => \&callback, );
    print "\n"; # after the final dot from downloading.
}

# called once per chunk of downloaded data.
sub callback {
    my ($data, $response, $protocol) = @_;
    $final_data .= $data;
    print ".";
}

None of this code is particularly new, save the addition of our primitive progress bar. We use LWP’s standard get method, but add the special :content_cb field, whose value is a reference to a subroutine that is called at regular intervals as our content is downloaded. These intervals can be suggested with an optional :read_size_hint, which is the number of bytes you’d like received before they’re passed to the callback.
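
As a minimal sketch of how the size hint fits into the call, the snippet below passes :read_size_hint alongside :content_cb. The 8192-byte value and the make_dot_callback helper are illustrative assumptions, not part of the original hack, and the hint is only a suggestion: chunks may still arrive larger or smaller.

```perl
#!/usr/bin/perl -w
use strict; $|++;

# Hypothetical helper (not from the book): builds a callback that
# appends each chunk to the buffer referenced by $buffer_ref and
# prints one dot per chunk received.
sub make_dot_callback {
    my ($buffer_ref) = @_;
    return sub {
        my ($data, $response, $protocol) = @_;
        $$buffer_ref .= $data;
        print ".";
    };
}

# Only touch the network when a URL is given on the command line.
if (@ARGV) {
    require LWP::UserAgent;
    my $content = "";
    my $ua = LWP::UserAgent->new;
    my $response = $ua->get(
        $ARGV[0],
        ':content_cb'     => make_dot_callback(\$content),
        ':read_size_hint' => 8192,   # assumed value; a hint, not a guarantee
    );
    print "\n", length($content), " bytes received (",
        $response->status_line, ")\n";
}
```

A smaller hint should produce more dots for the same download, and a larger one fewer, which makes it a convenient knob for how chatty the heartbeat is.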

In this example, we’ve specified that the data should be sent to a subroutine named callback. You’ll notice that the routine receives the actual content, $data, that has been downloaded. Since we’re overriding LWP’s normal $response->content or :content_file features, we now have to take full control of the data. In this hack, we store all our results in $final_data, but we don’t actually do anything with them.
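
One way to do something with the data, sketched here under assumptions of our own, is to write each chunk to a file as it arrives rather than accumulating it in memory. The helper name make_file_callback and the output filename download.out are hypothetical and do not come from the hack itself.

```perl
#!/usr/bin/perl -w
use strict; $|++;

# Hypothetical helper: returns a callback that writes each chunk to
# the filehandle $fh, adds the chunk's length to the counter that
# $bytes_ref points to, and prints a dot as a heartbeat.
sub make_file_callback {
    my ($fh, $bytes_ref) = @_;
    return sub {
        my ($data, $response, $protocol) = @_;
        print {$fh} $data;
        $$bytes_ref += length $data;
        print ".";
    };
}

# Only touch the network when a URL is given on the command line.
if (@ARGV) {
    require LWP::UserAgent;
    my $outfile = "download.out";   # assumed output name
    open(my $fh, '>', $outfile)
        or die "[err] Can't write $outfile: $!\n";
    binmode $fh;                    # keep binary downloads intact

    my $bytes = 0;
    my $ua = LWP::UserAgent->new;
    my $response = $ua->get(
        $ARGV[0],
        ':content_cb' => make_file_callback($fh, \$bytes),
    );
    close($fh) or die "[err] Can't close $outfile: $!\n";
    print "\nSaved $bytes bytes to $outfile.\n" if $response->is_success;
}
```

LWP’s own :content_file saves to disk without a callback; the callback version is worth the extra lines only when you also want the per-chunk hook for progress output. Either way, once a callback handles the data, $response->content comes back empty, so anything you need later has to be kept by the callback.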
