[lug] sed, tr...escaping non-printable characters for display

stimits at comcast.net stimits at comcast.net
Sat Feb 21 16:51:39 MST 2015


Hi,
 
The more I need to deal with comparing two file systems using only bash and core utilities, the more I miss languages like C/C++. In particular, it is very rare that a non-printable control character is included in a file name or directory name, yet Linux itself is quite good at working with almost any embedded control character sequence possible. A newline, carriage return, tab...all of these can be embedded in file names or directories. I've encoded these names so working with them isn't too bad, but if I can't display the results when done, the purpose of the script is lost. So to get around this, I'm trying to substitute non-printing (mostly control) characters with either "hat" notation (e.g., '^M') or hex or octal notation (e.g., '\0x1D' or '\012').
 
tr almost does this in a trivial way, e.g.:
printf '%s' "${filename}" | tr '\001-\037' 'A-_'
 ...the result of the above would be to replace control characters (decimal 1 through 31) with ASCII characters between capital 'A' and underscore '_'. But this leaves out the "hat", e.g., changing a carriage return to 'M' would actually need to show as '^M' to distinguish it from printable characters. How can I use tr to convert one character into a constant hat '^' plus a printable character?
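
For what it's worth, one possible route to hat notation is cat rather than tr. The sketch below assumes a cat that supports the -v and -t options (GNU coreutils does; neither option is required by POSIX), and hat_encode is only a hypothetical helper name. cat -vt prints each control byte as '^' plus the byte value offset by 64, but it passes newlines through untouched, so the function rewrites those itself.

```bash
#!/usr/bin/env bash

# Hypothetical helper: print one name with control characters in hat notation.
# Assumes cat supports -v (show non-printing bytes) and -t (show tab as ^I).
hat_encode() {
    local out
    # The trailing 'x' is a sentinel: command substitution strips trailing
    # newlines, and a name may legitimately end in one.
    out=$(printf '%sx' "$1" | cat -vt)
    out=${out%x}
    # cat leaves newlines alone, so convert them to ^J here.
    printf '%s\n' "${out//$'\n'/^J}"
}

# Caveats: a literal '^' in a name is not escaped, so the output can be
# ambiguous, and bytes above 127 are shown in cat's M- notation, which
# alters UTF-8 names.
```

For example, hat_encode $'bad\rname' prints bad^Mname. This spawns one cat per name, so it is a display helper rather than something tuned for very large trees.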
 
Using sed almost does the job too, e.g.:
echo "${filename}" | sed 's/\([^[:print:]]\)/?/g'
...this will substitute a question mark for every non-printable character. So far this seems to be the best method, but it still doesn't give meaning to the non-printable character the way hat notation or hex notation would. Having sed capable of using the matched character and transforming it into a sequence of hat+transformed printable characters would be great, but I'm at a loss as to how to do that.
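
For the escaped (octal-style) form, bash's own printf builtin has a %q format that quotes its argument so that it can be reused as shell input. In bash, a name containing control characters comes out in $'...' form with backslash escapes. A minimal sketch, assuming bash rather than plain sh; quote_name is a hypothetical helper name and the sample name is made up:

```bash
#!/usr/bin/env bash

# Hypothetical helper: print one name in bash's reusable quoted form.
quote_name() {
    printf '%q\n' "$1"
}

# Made-up sample name containing a carriage return and a tab.
name=$'report\r\tfinal'
quoted=$(quote_name "$name")
printf '%s\n' "$quoted"

# Because %q output is valid shell input, it can be turned back into the
# original name. Only eval strings that came from printf '%q'.
eval "decoded=$quoted"
```

The output is not hat notation, but it is unambiguous and contains no raw control characters, and printf is a builtin, so no extra process is started per name.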
 
Yet this must be a common problem, and I feel like I am reinventing the wheel while trying to solve it. Does anyone have a suggestion on how to print these non-printable file and directory names in a meaningful way, without using a non-bash script and without using non-core utilities? sed and tr are core, gawk and perl are not. It's hard to imagine how inefficient it would be to use bash to traverse every character of every file or directory name one at a time looking for non-printable characters in some enormous loop.
 
Thanks!