[lug] ADD-ON to Web crawler advice
gordongoldin at aim.com
Mon May 5 11:25:26 MDT 2008
See question below - can one fetch only the text, to speed up a text-only search?
To get only English pages - how reliable is the lang="en" attribute, e.g.:
<html xmlns="http://www.w3.org/1999/xhtml" lang="en">
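As a rough answer: the lang attribute is only a hint - many pages omit it or set it incorrectly, so a crawler should treat it as a first-pass filter and fall back to content-based language detection. A minimal standard-library sketch of the attribute check (function names here are illustrative):

```python
import re

def html_lang(page):
    """Return the value of the lang attribute on the <html> tag, if any."""
    m = re.search(r'<html\b[^>]*\blang\s*=\s*["\']([^"\']+)["\']',
                  page, re.IGNORECASE)
    return m.group(1).lower() if m else None

def looks_english(page):
    # Accepts "en", "en-US", "en-gb", etc.; returns False when the
    # attribute is missing, so unmarked pages need a content-based check.
    lang = html_lang(page)
    return lang is not None and lang.startswith("en")
```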
-----Original Message-----
From: gordongoldin at aim.com
To: lug at lug.boulder.co.us
Sent: Mon, 5 May 2008 10:18 am
Subject: Web crawler advice
I'm doing a project to analyze text content on the web:
I need to:

start with a list of URLs
for each URL in the URL list:
    fetch the page
    throw away non-English pages
    extract the sentence text content (not hidden text, menus, lists, etc.)
        write that content to a file
    extract all the links
        add just the new links to the URL list (not those already in the list of URLs)
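The link-extraction and dedup steps above can be sketched with just the standard library (no Beautiful Soup or nutch needed for this part); class and function names here are illustrative:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect href targets from <a> tags, resolved against a base URL."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))

def extract_new_links(page, base_url, seen):
    """Return links on the page not already in `seen`, and record them."""
    parser = LinkExtractor(base_url)
    parser.feed(page)
    new = [u for u in parser.links if u not in seen]
    seen.update(new)
    return new
```

Keeping `seen` as a set (rather than scanning the URL list) makes the "not already in the list" check O(1) per link, which matters once the frontier grows.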
I could just use Java, but then I would have to write everything myself.
Beautiful Soup (written in Python) would probably work well to parse the pages, but I don't see that it can fetch them.
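That's by design: Beautiful Soup only parses. Fetching can be done with the standard library's urllib and the body handed to whatever parser you choose. A minimal sketch (charset handling here is naive):

```python
import urllib.request

def fetch(url, timeout=10):
    """Fetch a URL and return the decoded page body."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        # Fall back to UTF-8 when the server doesn't declare a charset.
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")
```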
I can't tell to what extent Nutch can parse the pages. I know it can give me the links, but I don't know if it can extract just the text I care about.
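For the "just the text" part, even the standard-library parser can get you a first approximation by skipping script and style content; a real crawler would still want Beautiful Soup or similar to cope with malformed HTML and to drop menus and boilerplate. A hedged sketch:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, ignoring anything inside <script> or <style>."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())

def visible_text(page):
    p = TextExtractor()
    p.feed(page)
    return " ".join(p.chunks)
```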
Gordon Golding