[lug] Web crawler advice

Jason Davis mohadib at openactive.org
Fri May 9 20:01:16 MDT 2008


The JDIC project for Java has a wrapper around IE and Firefox. The
wrapper exposes a JavaScript eval() method, which gives you full access
to the DOM and full JS support, all from Java. The other option is to
use one of the hundred-odd Java HTML parser libraries out there.
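If you go the parser route, here's a rough sketch that sticks to what
ships in the JDK (javax.swing.text.html.parser.ParserDelegator), pulling
the visible text and the href links out of a single page. The class name
and URL are just placeholders, and real code would want charset and
error handling:

import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class PageScraper {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://lug.boulder.co.us"); // placeholder URL
        final StringBuilder text = new StringBuilder();
        final List<String> links = new ArrayList<String>();

        // Callback receives events as the parser walks the page.
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            public void handleText(char[] data, int pos) {
                text.append(data).append('\n'); // visible text only
            }
            public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
                if (t == HTML.Tag.A) {
                    Object href = a.getAttribute(HTML.Attribute.HREF);
                    if (href != null) links.add(href.toString());
                }
            }
        };

        Reader in = new InputStreamReader(url.openStream());
        try {
            new ParserDelegator().parse(in, callback, true);
        } finally {
            in.close();
        }

        System.out.println("Extracted " + links.size() + " links");
        System.out.println(text);
    }
}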
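And for the loop in your list below (fetch each URL, extract the links,
add only the new ones), the usual shape is a work queue plus a HashSet
of URLs you've already seen. The class name and the extractLinks() stub
here are placeholders; the point is just the dedup logic:

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class CrawlLoop {
    // Stub: plug in the ParserDelegator sketch above, or a parser library.
    static List<String> extractLinks(String url) {
        return new ArrayList<String>();
    }

    public static void main(String[] args) {
        // Seed the queue with the starting URLs from the command line.
        Deque<String> queue = new ArrayDeque<String>(Arrays.asList(args));
        Set<String> seen = new HashSet<String>(queue);

        while (!queue.isEmpty()) {
            String url = queue.removeFirst();
            // ...fetch the page, skip non-English ones, write the text out...
            for (String link : extractLinks(url)) {
                // HashSet.add() returns false for duplicates, so each
                // URL gets queued at most once.
                if (seen.add(link)) {
                    queue.addLast(link);
                }
            }
        }
    }
}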

Good Luck,
jd

On Mon, May 5, 2008 at 10:18 AM,  <gordongoldin at aim.com> wrote:
> I'm doing a project to analyze text content on the web:
>
> I need to:
>
> start with a list of URLs
> for each URL in the URL list
>    fetch the page
>    throw away non-English pages
>    extract the sentence text content (not hidden text, menus, lists, etc.)
>       write that content to a file
>    extract all the links
>       add just the new links to the URL list (not those already in the list of URLs)
>
> I could just use Java, but then I would have to write everything.
> Beautiful Soup (written in Python) would probably work well to parse the
> pages, but I don't see that it can fetch pages.
> I can't tell to what extent Nutch can parse the pages. I know it can give
> me the links, but I don't know if it can extract just the text I care
> about.
>
>
>
> Gordon Golding
>


