[lug] wget page-requisites

Will will.sterling at gmail.com
Wed Jan 12 13:57:55 MST 2011


wget is really your best option.  cURL is much better for grabbing a large
number of files when you need regular expressions, cookies, form submissions,
etc., but it needs a wrapper script to mirror a website.

Here is an example from the cURL project.

http://curl.haxx.se/programs/curlmirror.txt
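
For comparison, a single wget command already gets you most of a mirror.
A rough sketch using the flags mentioned in this thread (example.com is
only a placeholder, not your actual site):

  $ wget --mirror --page-requisites --convert-links http://example.com/

There is also a sketch for the span-hosts question below, after the
quoted message.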

On Wed, Jan 12, 2011 at 1:05 PM, Davide Del Vento <
davide.del.vento at gmail.com> wrote:

> >> $ wget --page-requisites
> >>
> >>
> http://blog.javacorner.net/2009/08/cheap-50mm-f14-portrait-lent-for-micro.html
> >> (snip)
> >> Downloaded: 1 files, 49K in 0.1s (404 KB/s)
> >>
> >> It is not downloading any of the 5 inline images (1 in the header, 4
> >> in the body). What am I doing wrong?
> >
> > It probably has to do with the fact that the images are hosted on
> different
> > domains, and wget doesn't want to follow them.
> Yes, I was suspecting this.
>
> > Try with the "--span-hosts"
> > option, but you may also want to play with "--convert-links" if you want
> to
> > view everything locally.
>
> Thanks. This solves the simple single-page example, but of course life
> is always harder than simple examples. My actual wget is doing
> --mirror of the whole domain, and adding --span-hosts messes that
> up.
> What I want is a --span-hosts that applies only to the --page-requisites
> and not to the recursion. It doesn't seem like a weird request at
> all: I want the pages that I am downloading to be complete with their
> requisites (images) even if those are hosted somewhere else, but I
> don't want to recurse over the whole web (which is what happens if I
> use --span-hosts). Any ideas?
>
> I guess I could count the deepest level of the domain I am mirroring,
> and use that as the recursion depth instead of the infinite depth that
> --mirror uses. But if I get that wrong, I don't mirror the whole site,
> and I would have to continuously maintain that number, which is a pain.
> And even then, while not quite the whole internet, I am still
> downloading the world and his dog. This must be possible, mustn't it?
>
> Using curl or anything else instead of wget is an option, if they are
> more flexible than wget.
>
> Thanks,
> Dav
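
On the span-hosts question above: as far as I know wget has no option
that limits --span-hosts to page requisites only. One workaround is to
allow host spanning but whitelist the hosts with -D/--domains, so the
recursion cannot wander across the whole web. A rough sketch, untested
against your site, with img.example.net standing in for whatever host
actually serves the images:

  $ wget --mirror --page-requisites --convert-links \
         --span-hosts --domains=blog.javacorner.net,img.example.net \
         http://blog.javacorner.net/

With --domains in place, --span-hosts only follows links into the listed
hosts, so the off-site images get fetched while the crawl stays bounded.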