[lug] Scripting help, lynx

Matt Dew marcoz at osource.org
Tue May 3 09:31:11 MDT 2011


find's -exec option can help too.

find . -name "*.html" -exec sh -c 'lynx -nolist -dump "$1" > "${1%.html}.txt"' sh {} \;

w3m can also convert HTML to text.
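
A redirect typed directly on a find command line is handled by the calling shell, once, before find starts, so per-file output needs a small shell wrapper inside -exec. Below is a minimal sketch of that idea, not a tested production script. The function name convert_tree and its arguments are made up for illustration; it assumes the converter prints the text to standard output, as "lynx -nolist -dump" does.

```shell
#!/bin/sh
# convert_tree DIR [CONVERTER ARGS...]
# Hypothetical helper: runs a converter on every *.html file under DIR and
# writes the result next to the source as NAME.txt.
convert_tree() {
    dir=$1
    shift
    # With no converter given, fall back to the lynx call from this thread.
    if [ $# -eq 0 ]; then
        set -- lynx -nolist -dump
    fi
    # The redirect lives inside sh -c so that it is applied once per file.
    # Inside the wrapper, $1 is the file find matched and the remaining
    # arguments are the converter command.
    find "$dir" -type f -name '*.html' -exec sh -c '
        src=$1
        shift
        "$@" "$src" > "${src%.html}.txt"
    ' sh {} "$@" \;
}
```

Calling "convert_tree ." would use lynx. Passing a different command after the directory swaps in another converter, provided it also takes the file name as its last argument and writes to standard output.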

On 05/03/2011 07:20 AM, Chip Atkinson wrote:
> In order to do this recursively one would have to use the find command:
>
> for file in $(find . -name "*.html"); do
>    echo "lynx -nolist -dump $file > $file.txt"
> done
>
> I like to put the echo in front of a command before I have it working how
> I want.  This is especially handy if your command does something
> potentially destructive such as deleting files or filling up your disk.
>
> Once the output looks correct remove the echo and quotes.  The quotes are
> needed in this case because of the output redirect (>). You could also
> escape the redirect instead:
>    echo lynx -nolist -dump $file \> $file.txt
>
> There is also a utility html2text that I've used which works well if lynx
> doesn't fill the bill.
>
> Chip
>
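
The echo-first habit described above can also be kept as a switch, so the preview and the real run share one code path. This is only a sketch: the function name to_text and the DRY_RUN variable are invented for illustration, and it defaults to preview mode so nothing is written by accident.

```shell
#!/bin/sh
# to_text SRC
# Hypothetical helper: with DRY_RUN=1 (the default) it only prints the
# command it would run; with DRY_RUN=0 it converts SRC and writes SRC.txt.
to_text() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        # The ">" sits inside a quoted string, so it is printed rather
        # than treated as a redirect.
        printf 'lynx -nolist -dump %s > %s.txt\n' "$1" "$1"
    else
        lynx -nolist -dump "$1" > "$1.txt"
    fi
}

# Preview first; set DRY_RUN=0 once the printed commands look right.
DRY_RUN=1
for file in ./*.html; do
    [ -e "$file" ] || continue   # skip the literal pattern when nothing matches
    to_text "$file"
done
```

Keeping the preview and the real command side by side in one function makes it harder for the two to drift apart while the loop is being debugged.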
> On Tue, 3 May 2011, Dan Ferris wrote:
>
>> for file in *.html
>> do
>>       lynx -nolist -dump "$file" > "$file.txt"
>> done
>>
>> That will write the output to file.html.txt. I'll leave it as an
>> exercise for you to figure out how to change it to file.txt.
>>
>> Dan
>>
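
One way to approach the exercise above is the shell's suffix removal, ${var%pattern}, which strips a trailing match from a variable's value. A small sketch follows; the helper name txt_name is made up for illustration.

```shell
#!/bin/sh
# txt_name PATH
# Hypothetical helper: prints PATH with a trailing ".html" replaced by
# ".txt". A PATH without that suffix simply gets ".txt" appended.
txt_name() {
    printf '%s\n' "${1%.html}.txt"
}

txt_name "docs/file1.html"   # prints docs/file1.txt
```

Inside the loop, the redirect target would then be "${file%.html}.txt" rather than "$file.txt".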
>> On 5/3/2011 6:58 AM, Paul Nowosielski wrote:
>>> Dear All,
>>>
>>> I'm trying to convert all the html files
>>> into text using lynx. The files are in many directories
>>> with meaningful names.
>>>
>>> Can anyone assist me in creating a script
>>> that will go through each directory recursively
>>> and convert the files to text and preserve the base name.
>>>
>>> ex: file1.html -> file1.txt, file2.html -> file2.txt (or something close to this)
>>>
>>> I have this so far, which correctly traverses the directories
>>> and spits out the text. But I don't understand how to direct
>>> the output to a txt file with the same name as the html file.
>>>
>>> find ./ -name "*.html" | xargs -I '{}' lynx -nolist -dump '{}'
>>>
>>> Any thoughts?
>>>
>>> Thank you,
>>>
>>> Paul
>>> _______________________________________________
>>> Web Page:  http://lug.boulder.co.us
>>> Mailing List: http://lists.lug.boulder.co.us/mailman/listinfo/lug
>>> Join us on IRC: irc.hackingsociety.org port=6667 channel=#hackingsociety
