[lug] Web crawler advice

Jeffrey Haemer jeffrey.haemer at gmail.com
Mon May 5 10:26:11 MDT 2008


Gordon,

I have an ORA book on web spidering that you can probably cannibalize useful
stuff from.  If you're coming to the BLUG talk this Thursday and want to
borrow it, let me know and I'll bring it.
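
Meanwhile, here's a minimal sketch of the loop you describe, assuming
Python 2 with Beautiful Soup 3 installed.  urllib2 does the fetching,
since Beautiful Soup only parses, and a set takes care of "just the new
links".  The looks_english() helper is a made-up heuristic, sketched
below your quoted message.  Treat all of it as untested starting
material, not a finished crawler:

    import urllib2
    from urlparse import urljoin
    from BeautifulSoup import BeautifulSoup

    def crawl(seed_urls, out):
        seen = set(seed_urls)    # every URL we've ever queued
        queue = list(seed_urls)  # URLs still waiting to be fetched
        while queue:
            url = queue.pop(0)
            try:
                html = urllib2.urlopen(url).read()
            except Exception:
                continue         # unfetchable page; skip it
            soup = BeautifulSoup(html)
            # Strip scripts and styles so hidden text stays out of the
            # output.  Filtering menus and navigation lists is harder
            # and needs page-specific heuristics.
            for tag in soup.findAll(['script', 'style']):
                tag.extract()
            text = ' '.join(soup.findAll(text=True))
            if not looks_english(text):
                continue         # throw away non-English pages
            out.write(text.encode('utf-8') + '\n')
            # Add only links we haven't already seen, resolving
            # relative URLs against the current page.
            for a in soup.findAll('a', href=True):
                link = urljoin(url, a['href'])
                if link not in seen:
                    seen.add(link)
                    queue.append(link)

There's no robots.txt handling, politeness delay, or retry logic in
there; you'd want all three before turning it loose on the real web.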

On Mon, May 5, 2008 at 10:18 AM, <gordongoldin at aim.com> wrote:

>  I'm doing a project to analyze text content on the web:
>
> I need to:
>
> start with a list of URLs
> for each URL in the URL list
>    fetch the page
>    throw away non-English pages
>    extract the sentence text content (not hidden text, menus, lists,
> etc.)
>       write that content to a file
>    extract all the links
>       add just the new links to the URL list (not those already in the
> list of URLs)
>
> I could just use Java, but then I would have to write everything.
> Beautiful Soup (written in Python) would probably work well to parse the
> pages, but I don't see that it can fetch pages.
> I can't tell to what extent Nutch can parse the pages. I know it can give
> me the links, but I don't know whether it can extract just the text I care
> about.
>
>
>
> Gordon Golding
>
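
P.S.  For the "throw away non-English pages" step, the looks_english()
helper in my sketch can be as crude as a stopword count.  This is a
rough guess I'm improvising, not real language identification; short or
code-heavy pages will fool it:

    # Guess "English" by the fraction of words that are common English
    # stopwords.  A real language identifier (an n-gram classifier,
    # say) would do much better.
    ENGLISH_STOPWORDS = set(['the', 'and', 'of', 'to', 'in', 'is',
                             'that', 'it', 'for', 'was', 'with', 'on'])

    def looks_english(text, threshold=0.05):
        words = text.lower().split()
        if not words:
            return False
        hits = sum(1 for w in words if w in ENGLISH_STOPWORDS)
        return float(hits) / len(words) >= threshold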



-- 
Jeffrey Haemer <jeffrey.haemer at gmail.com>
720-837-8908 [cell]
http://goyishekop.blogspot.com