[lug] Web crawler advice
George Sexton
gsexton at mhsoftware.com
Mon May 5 10:38:58 MDT 2008
gordongoldin at aim.com wrote:
> I'm doing a project to analyze text content on the web:
>
> I need to:
>
> start with a list of URLs
> for each URL in the URL list
> fetch the page
> throw away non-English pages
> extract the sentence text content, (not hidden text, menus, lists, etc.)
> write that content to a file
> extract all the links
> add just the new links to the URL list (not those already in the
> list of URLs)
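
That loop is mostly bookkeeping. A minimal Python sketch of the dedup-and-queue
part, where fetch_page and extract_links are placeholders for whichever
fetcher/parser you end up choosing:

```python
from collections import deque

def crawl(start_urls, fetch_page, extract_links, max_pages=100):
    """Breadth-first crawl: visit each URL at most once.

    fetch_page(url) -> html string (or None on failure)
    extract_links(html, base_url) -> iterable of absolute URLs
    Both are injected so the loop stays independent of any one library.
    """
    seen = set(start_urls)      # every URL ever queued -- prevents re-adding
    queue = deque(start_urls)   # URLs waiting to be fetched
    pages = {}                  # url -> html for pages fetched so far
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch_page(url)
        if html is None:
            continue            # fetch failed; move on
        pages[url] = html
        for link in extract_links(html, url):
            if link not in seen:        # "add just the new links"
                seen.add(link)
                queue.append(link)
    return pages
```

The set gives O(1) membership tests, which matters once the URL list gets
large.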
>
> I could just use Java, but then I would have to write everything.
OTOH, threading in Java is dead easy, and this kind of app would benefit
from multi-threading.
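
That said, fetching is I/O-bound, so even plain threads in Python buy you the
same win. A sketch using the standard-library thread pool, with fetch_page
again standing in for whatever fetcher you pick:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def fetch_all(urls, fetch_page, workers=8):
    """Fetch many URLs concurrently; returns {url: result or None}."""
    results = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Submit every fetch up front; futures complete in arbitrary order.
        futures = {pool.submit(fetch_page, u): u for u in urls}
        for fut in as_completed(futures):
            url = futures[fut]
            try:
                results[url] = fut.result()
            except Exception:
                results[url] = None   # record failures instead of crashing
    return results
```

Eight workers is an arbitrary starting point; for a polite crawler you would
also want per-host rate limiting, which neither language gives you for free.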
> Beautiful Soup (written in Python) would probably work well to parse the
> pages, but I don't see that it can fetch pages.
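
Right -- Beautiful Soup only parses; the fetching is normally done with urllib
(urllib.request in current Python). And even without Beautiful Soup, the
standard html.parser module can pull out links. A sketch:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen   # only needed for the actual fetch

class LinkExtractor(HTMLParser):
    """Collect absolute URLs from <a href=...> tags."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page's URL.
                    self.links.append(urljoin(self.base_url, value))

def extract_links(html, base_url):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links

def fetch_page(url):
    # Network call -- wrap in try/except and check robots.txt in real use.
    with urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")
```

Beautiful Soup is still the better choice for the messy HTML you'll actually
meet in the wild, but the fetch itself is a solved problem either way.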
> I can't tell to what extent Nutch can parse the pages. I know it can
> give me the links, but I don't know if it can extract just the text I
> care about.
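
For the text-extraction step, the easy part is skipping script/style content
and markup; a rough stdlib sketch follows. The hard part -- telling menus and
navigation apart from sentence text -- needs heuristics no parser gives you
for free (e.g. dropping blocks with a high link density).

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text nodes, ignoring content inside script/style tags."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self.skip_depth = 0          # >0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())

def visible_text(html):
    p = TextExtractor()
    p.feed(html)
    return " ".join(p.chunks)
```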
>
>
>
> Gordon Golding
>
> ------------------------------------------------------------------------
>
> _______________________________________________
> Web Page: http://lug.boulder.co.us
> Mailing List: http://lists.lug.boulder.co.us/mailman/listinfo/lug
> Join us on IRC: lug.boulder.co.us port=6667 channel=#colug
--
George Sexton
MH Software, Inc.
Voice: +1 303 438 9585
URL: http://www.mhsoftware.com/