[lug] ADD-ON to Web crawler advice

George Sexton gsexton at mhsoftware.com
Mon May 5 11:36:58 MDT 2008



gordongoldin at aim.com wrote:
> 
> See question below - can one get only text - to speed up the text-only 
> search?
> To get only English - how reliable is the  lang="en" ?

You could spot-check, but I'm guessing that 99% of the pages don't set it.

The charset really won't help much. I use UTF-8, for example, and that tells you nothing about the language.

I suppose a non-US charset like Windows-1255, or any ISO-8859-x other than ISO-8859-1, might be slightly helpful.
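
For the spot check, here is a rough sketch in Python using only the standard library's html.parser. The names HintParser and language_hint are made up for illustration, and the charset rule is only my guess at a heuristic: a declared lang wins, UTF-8 and ISO-8859-1 count as "unknown", and other Windows-125x or ISO-8859-x charsets count as a weak hint that the page is not English.

```python
from html.parser import HTMLParser


class HintParser(HTMLParser):
    """Collects the <html lang="..."> value and any <meta> charset (illustrative)."""

    def __init__(self):
        super().__init__()
        self.lang = None
        self.charset = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "html" and self.lang is None:
            self.lang = attrs.get("lang") or attrs.get("xml:lang")
        elif tag == "meta" and self.charset is None:
            if attrs.get("charset"):
                # <meta charset="...">
                self.charset = attrs["charset"].strip().lower()
            else:
                # <meta http-equiv="Content-Type" content="text/html; charset=...">
                content = (attrs.get("content") or "").lower()
                if "charset=" in content:
                    value = content.split("charset=", 1)[1]
                    self.charset = value.split(";")[0].strip()


def language_hint(html_text):
    """Return 'en', 'not-en', or 'unknown' based on the markup alone."""
    parser = HintParser()
    parser.feed(html_text)
    if parser.lang:
        # Compare the primary subtag, so "en-US" counts as English.
        primary = parser.lang.strip().lower().split("-")[0]
        return "en" if primary == "en" else "not-en"
    charset = parser.charset or ""
    # Assumed heuristic: UTF-8 and the Western defaults say nothing.
    if charset.startswith("windows-125") and charset != "windows-1252":
        return "not-en"
    if charset.startswith("iso-8859-") and charset != "iso-8859-1":
        return "not-en"
    return "unknown"
```

Running language_hint over a few hundred fetched pages and counting the "unknown" results would show how often lang is actually set. Pages that come back "unknown" would still need a content-based check.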

> 
>  >>>      <html xmlns="http://www.w3.org/1999/xhtml" lang="en">
> 
> -----Original Message-----
> From: gordongoldin at aim.com
> To: lug at lug.boulder.co.us
> Sent: Mon, 5 May 2008 10:18 am
> Subject: Web crawler advice
> 
> I'm doing a project to analyze text content on the web:
> 
> i need to:
> 
> start with a list of URLs
> for each URL in the URL list
>    fetch the page
>    throw away non-English pages
>    extract the sentence text content, (not hidden text, menus, lists, etc.)
>       write that content to a file
>    extract all the links
>       add just the new links to the URL list (not those already in the 
> list of URLs)
> 
> i could just use java, but then i would have to write everything.
> beautiful soup (written in python) would probably work well to parse the 
> pages, but i don't see that it can fetch pages.
> i can't tell to what extent nutch can parse the pages. i know it can 
> give me the links, but i don't know if it can extract just the text i 
> care about.
> 
> 
> 
> Gordon Golding
> 
> ------------------------------------------------------------------------
> Plan your next roadtrip with MapQuest.com 
> <http://www.mapquest.com/?ncid=mpqmap00030000000004>: America's #1 
> Mapping Site.
> 
> 
> ------------------------------------------------------------------------
> 
> _______________________________________________
> Web Page:  http://lug.boulder.co.us
> Mailing List: http://lists.lug.boulder.co.us/mailman/listinfo/lug
> Join us on IRC: lug.boulder.co.us port=6667 channel=#colug
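
As for the fetching, the loop in the quoted message (fetch, pull out links, queue only the new ones) could be sketched with the Python standard library alone. This is a minimal outline and not a finished crawler: LinkParser, fetch, crawl and max_pages are hypothetical names, the UTF-8 fallback is an assumption, and the English filter and sentence extraction steps are left out.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def fetch(url):
    """Download a page and decode it; falling back to UTF-8 is an assumption."""
    with urlopen(url, timeout=10) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


def crawl(seeds, fetch_page=fetch, max_pages=100):
    """Breadth-first crawl; returns {url: html} for each page fetched."""
    queue = deque(seeds)
    seen = set(seeds)  # every URL ever queued, so nothing is added twice
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            html = fetch_page(url)
        except (OSError, ValueError, LookupError):
            continue  # unreachable page, bad URL, or unknown charset: skip it
        pages[url] = html
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            # Make the link absolute and drop any #fragment before comparing.
            link = urldefrag(urljoin(url, href)).url
            if link.startswith(("http://", "https://")) and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

Keeping the "seen" set separate from the queue is what makes the "add just the new links" step cheap. A real crawler would also want to honor robots.txt and rate-limit its requests, which this sketch does not do.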

-- 
George Sexton
MH Software, Inc.
Voice: +1 303 438 9585
URL:   http://www.mhsoftware.com/


