[lug] Web crawler advice
Jason Vallery
jason at vallery.net
Mon May 5 10:21:47 MDT 2008
Hi Gordon,
I did something similar, harvesting content from RSS feeds as the
source. For my application I started with the fantastic PHP application
Sphider:
http://www.sphider.eu/
-Jason
On Mon, May 5, 2008 at 10:18 AM, <gordongoldin at aim.com> wrote:
>
> I'm doing a project to analyze text content on the web:
>
> I need to:
>
> start with a list of URLs
> for each URL in the list:
> fetch the page
> throw away non-English pages
> extract the sentence text content (not hidden text, menus, lists, etc.)
> write that content to a file
> extract all the links
> add only the new links to the URL list (skipping any already seen), as
> in the sketch below
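>
> A minimal sketch of that loop in Python 2, pairing urllib2 (fetching)
> with Beautiful Soup (parsing). The English check is only a crude
> stopword ratio, and the seed list, output file, and threshold are
> illustrative assumptions, not fixed choices:
>
>     import urllib2
>     import urlparse
>     from BeautifulSoup import BeautifulSoup
>
>     # crude English check: what fraction of words are common stopwords?
>     STOPWORDS = set(['the', 'and', 'of', 'to', 'in', 'is', 'that', 'for'])
>
>     def looks_english(text):
>         words = text.lower().split()
>         if not words:
>             return False
>         hits = sum(1 for w in words if w in STOPWORDS)
>         return float(hits) / len(words) > 0.02
>
>     seeds = ['http://example.com/']   # placeholder seed list
>     seen = set(seeds)
>     queue = list(seeds)
>     out = open('corpus.txt', 'a')
>
>     while queue:
>         url = queue.pop(0)
>         try:
>             html = urllib2.urlopen(url).read()
>         except Exception:
>             continue                  # unreachable page: skip it
>         soup = BeautifulSoup(html)
>         for tag in soup.findAll(['script', 'style']):
>             tag.extract()             # drop non-content text
>         text = ' '.join(soup.findAll(text=True))
>         if looks_english(text):
>             out.write(text.encode('utf-8', 'replace') + '\n')
>         for a in soup.findAll('a', href=True):
>             link = urlparse.urljoin(url, a['href'])
>             if link not in seen:      # only enqueue unseen links
>                 seen.add(link)
>                 queue.append(link)
>
> A set covers the "only new links" requirement; on a real crawl you
> would also want to normalize URLs (lowercase the host, strip
> fragments) before the membership test, and respect robots.txt.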
>
> I could just use Java, but then I would have to write everything myself.
> Beautiful Soup (written in Python) would probably work well for parsing
> the pages, but I don't see that it can fetch them.
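>
> Fetching is the easy half, though: urllib2 from the Python standard
> library pulls the page down, and Beautiful Soup parses the string it
> returns. A tiny example (the URL is a placeholder), which also shows
> one way to keep just paragraph text and skip most menus and link lists:
>
>     import urllib2
>     from BeautifulSoup import BeautifulSoup
>
>     html = urllib2.urlopen('http://example.com/').read()
>     soup = BeautifulSoup(html)
>     # <p> contents only -- menus and link lists mostly live elsewhere
>     for p in soup.findAll('p'):
>         print ' '.join(p.findAll(text=True))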
> I can't tell to what extent Nutch can parse the pages. I know it can
> give me the links, but I don't know whether it can extract just the
> text I care about.
>
> Gordon Golding
>
> _______________________________________________
> Web Page: http://lug.boulder.co.us
> Mailing List: http://lists.lug.boulder.co.us/mailman/listinfo/lug
> Join us on IRC: lug.boulder.co.us port=6667 channel=#colug
>
--
Jason Vallery
jason at vallery.net
mobile: +1.720.352.8822
home: +1.303.993.3712
web: http://vallery.net/