[lug] wget page-requisites

Wed Jan 12 13:05:24 MST 2011

>> $ wget --page-requisites
>>
>> http://blog.javacorner.net/2009/08/cheap-50mm-f14-portrait-lent-for-micro.html
>> (snip)
>> Downloaded: 1 files, 49K in 0.1s (404 KB/s)
>>
>> It is not downloading any of the 5 inline images (1 in the header, 4
>> in the body). What am I doing wrong?
>
> It probably has to do with the fact that the images are hosted on different
> domains, and wget doesn't want to follow them.
Yes, I suspected as much.

> Try with the "--span-hosts"
> option, but you may also want to play with "--convert-links" if you want to
> view everything locally.

Thanks. This solves the simple single-page example, but of course life
is always harder than simple examples. My actual wget run does a
--mirror of the whole domain, and adding --span-hosts messes that
up.
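
For reference, the two cases look roughly like this. It is only a
sketch: the site root in SITE is my assumption (taken from the page URL
above), and the commands are echoed as a dry run rather than executed.

```shell
# Page URL from the original example; SITE is an assumed mirror root.
PAGE="http://blog.javacorner.net/2009/08/cheap-50mm-f14-portrait-lent-for-micro.html"
SITE="http://blog.javacorner.net/"

# Single page: fetch the page plus its requisites, even from other hosts,
# and rewrite links for local viewing.
echo wget --page-requisites --span-hosts --convert-links "$PAGE"

# Whole site: --mirror recurses without a depth limit, and --span-hosts
# applies to the recursion as well, so this follows links to other hosts.
echo wget --mirror --page-requisites --span-hosts --convert-links "$SITE"
```

Remove the "echo" to actually run either command.
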
What I want is a --span-hosts that applies only to the --page-requisites
and not to the recursion. That doesn't seem like a weird request at
all: I want the pages I am downloading to be complete with their
requisites (images) even if those are hosted somewhere else, but I
don't want to recurse through the whole web (which is what happens with
--span-hosts). Any ideas?

I guess I could work out the deepest level of the domain I am mirroring
and use that as the recursion level instead of the infinite depth that
--mirror uses. But if I get that wrong, I don't mirror the whole site, and
I have to keep maintaining that number, which is a pain. And even then,
while I may not be downloading the whole internet, I am still downloading
the world and his dog. Surely this must be possible?
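
The depth-limited variant would be something like the sketch below. The
depth of 5 is a made-up placeholder, not a measured value, and SITE is
again an assumed site root. Since --mirror is shorthand for
-r -N -l inf --no-remove-listing, the sketch spells out the recursion
options so the infinite level can be replaced by a finite one.

```shell
# Hypothetical depth guess; it has to be at least the deepest link level
# of the site, or parts of the site are silently skipped.
DEPTH=5
SITE="http://blog.javacorner.net/"

# Same as --mirror except for the finite --level (and without
# --no-remove-listing, which only matters for FTP).
OPTS="--recursive --timestamping --level=$DEPTH --page-requisites --span-hosts --convert-links"

# Dry run: print the command. $OPTS is left unquoted on purpose so it
# splits into separate options.
echo wget $OPTS "$SITE"
```

This still follows links onto other hosts for up to DEPTH levels, so it
limits the damage rather than avoiding it.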

Using curl or anything else instead of wget is an option, if it is
more flexible than wget.

Thanks,
Dav