on 11 October 2008
This is a brilliant book.
The book has a few niggles in that I spotted some flaws in the code. The book and code download could do with some JUnit tests!
But I guess the author can be forgiven for this, because of the usefulness, clarity, commenting and detailed coverage of the code that he develops.
When all is said and done, this is a recipe book of examples.
There is a complete implementation of an HTML parser. I was a bit surprised to see the book wasn't using JTidy or NekoHTML here...
By the time you're done you'll have a great appreciation of HTTP and of tools like WireShark that help you create a "bot" (i.e. a bespoke screen scraper designed to extract data from a specific site), as well as the more generic "spider".
You'll find this book a great resource for learning about concurrent programming in Java as you work through the code that makes up the highly configurable Heaton Research Spider. There are various implementations here: an in-memory version for a single-host site, and a couple of SQL-based ones for MySQL and Oracle.
The book also shows how to call into the Google search APIs to create what it calls a "hybrid bot". You could then use this to set up the seed data for your spider. (Setting up this seed data is where you are left to your own devices, and perhaps where the book could have been expanded slightly.)
You'll also get a bit of exposure to Axis web services and RSS feeds along the way.
I'd thoroughly recommend this to anyone wanting to learn more about harvesting information from the web and to expand their knowledge of Java, HTTP, and multi-threading/concurrency.