[ale] text file with non-ascii
Michael H. Warfield
mhw at wittsend.com
Mon Sep 19 16:49:07 EDT 2005
On Mon, 2005-09-19 at 16:03 -0400, James P. Kinney III wrote:
> I need to remove all non-ascii text from a very large file.
> What shows as non-ascii are things like ~B ~C ~R ~@ ~Z ~E and ~\
Oh, this could be fun. I know of several "things" which could be
considered "non-ascii" and how you deal with them varies with what they
are.
> I have used sed to remove the dos newlines (^M) and another control code
> (^G) but I'm stumped how these are formatted.
Ok... ^M and ^G are control codes which fall in the ASCII range 0x00 ->
0x1f. Non-printable, yes, but ASCII nonetheless.
What you showed above is probably escape codes. That's the ESC
character 0x1b (033 octal, 27 decimal) followed by an ASCII sequence of
one or more characters. All of the characters ARE ASCII but it's the
combination that is significant. If you are dealing with things like
VT100 codes or Epson printer sequences, the escape codes can be long and
convoluted but they will still all be ASCII characters and you would
have to interpret the codes to know how long each sequence is.
If you have a limited set of escape codes to fight with, like only
two-character sequences such as ESC-B or ESC-C or ESC-R, then a simple
sed rule like 's/\x1b.//g' (GNU sed syntax; \o033 is the octal
equivalent) might do the job. You could also enter a real ESC instead of
using the escape notation by typing ^V and then hitting Esc.
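
As a rough sketch of that sed rule in a form that should not depend on
GNU-specific escapes: printf produces the literal ESC byte and it is
spliced into the sed expression. The function name strip_esc2 is made up
for illustration, and this only handles two-character sequences (ESC
plus exactly one character), nothing longer.

```shell
# Strip two-character escape sequences: ESC followed by any one character.
# strip_esc2 is a hypothetical helper name, not something from the thread.
strip_esc2() {
    esc=$(printf '\033')          # literal ESC byte (octal 033, hex 0x1b)
    LC_ALL=C sed "s/${esc}.//g"   # C locale so "." matches any single byte
}

# Example: the ESC-B and ESC-R pairs are removed, the text around them stays.
printf 'foo\033Bbar\033Rbaz\n' | strip_esc2
```

Longer sequences (VT100 CSI codes and the like) would need a pattern
that knows where each sequence ends.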
Then there are the "extended ASCII" codes... 0x80 -> 0xff. Strictly
speaking those are outside ASCII, which is only 7 bits, and you can pick
them off using tr or sed.
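
A minimal sketch of picking off the 0x80 -> 0xff range with tr, assuming
you simply want those bytes deleted. The helper name strip_high is
invented here, and LC_ALL=C is set so tr treats the input as plain
bytes.

```shell
# Delete every byte in the range 0x80-0xff (octal 200-377).
# strip_high is a hypothetical helper name.
strip_high() {
    LC_ALL=C tr -d '\200-\377'
}

# Example: the two bytes of a UTF-8 e-acute (0xc3 0xa9) are dropped.
printf 'caf\303\251 ok\n' | strip_high
```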
Then you've got Unicode or the stuff they use for IDN
(Internationalized Domain Names). :-(
Then... What do you want to do with them? Squeeze them out? Replace
them with spaces? Replace them with their hex/octal codes?
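
To make the last two choices concrete, here is a hedged sketch for bytes
in 0x80 -> 0xff: one function replaces each byte with a space, the other
replaces it with a visible hex tag. The function names and the <XX> tag
format are assumptions for illustration only. The hex version goes
through od and awk one byte at a time, so it will be slow on a very
large file.

```shell
# Replace each high byte (0x80-0xff) with a single space.
# "[ *]" tells tr to repeat the space as many times as SET1 needs.
high_to_space() {
    LC_ALL=C tr '\200-\377' '[ *]'
}

# Replace each high byte with a tag such as <E9>; other bytes pass through.
# od prints every byte as a decimal number, awk re-emits it as text or tag.
high_to_hex() {
    LC_ALL=C od -An -v -tu1 | tr -s ' ' '\n' |
    awk 'NF { if ($1 + 0 >= 128) printf "<%02X>", $1 + 0
              else               printf "%c", $1 + 0 }'
}

printf 'a\351b\n' | high_to_space
printf 'a\351b\n' | high_to_hex
```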
You could get rid of a lot just with "tr -c -d '[:print:][:space:]'".
That will blow away all the non-printable characters, just leaving
printable characters and white space (with [:print:] alone, tr would
also delete the newlines and tabs). The ESC characters will be gone, but
the rest of each escape sequence would be left behind, which is probably
NOT what you want, so you would want to filter out escape sequences
BEFORE filtering out non-printables.
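
The ordering could be sketched as a two-stage pipeline like the one
below. It assumes only two-character escape sequences are present, and
it keeps just printable characters, newline and tab, so stray carriage
returns are dropped as well. The name clean_text is hypothetical.

```shell
# Stage 1: remove ESC plus the one character after it.
# Stage 2: delete everything that is not printable, newline, or tab.
# clean_text is a hypothetical helper name.
clean_text() {
    esc=$(printf '\033')
    LC_ALL=C sed "s/${esc}.//g" | LC_ALL=C tr -cd '[:print:]\n\t'
}

# Example: ESC-B, the bell (^G) and the carriage return (^M) all go away.
printf 'one\033B two\007\r\n' | clean_text
```

Running the tr stage alone on the same input would leave the stray "B"
from ESC-B in the text, which is the reason for doing the sed pass
first.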
I would examine some of the file with hexdump and determine what the
byte sequences are and how complex they are before settling on a scheme
to get rid of them.
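
Besides paging through "hexdump -C file | less", a quick tally of which
odd bytes occur at all can help decide what you are up against. The
sketch below counts control bytes (other than tab and newline), DEL, and
anything in 0x80 -> 0xff. The name suspicious_bytes is made up, and this
only counts single bytes; it does not decode sequences.

```shell
# Print each suspicious byte value (hex) with the number of times it occurs.
# Reads the named files, or stdin when no file is given.
# suspicious_bytes is a hypothetical helper name.
suspicious_bytes() {
    LC_ALL=C od -An -v -tu1 "$@" | tr -s ' ' '\n' |
    awk 'NF && ($1 + 0 < 32 && $1 != 9 && $1 != 10 || $1 + 0 >= 127) { n[$1 + 0]++ }
         END { for (b in n) printf "0x%02x %d\n", b, n[b] }' | sort
}

# Example input contains ESC (0x1b), CR (0x0d) and one high byte (0xe9).
printf 'a\033Bb\r\n\351\n' | suspicious_bytes
```

A lot of 0x1b hits points at escape sequences; hits in 0x80 -> 0xff
point at 8-bit or multi-byte text instead.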
Mike
--
Michael H. Warfield | (770) 985-6132 | mhw at WittsEnd.com
/\/\|=mhw=|\/\/ | (678) 463-0932 | http://www.wittsend.com/mhw/
NIC whois: MHW9 | An optimist believes we live in the best of all
PGP Key: 0xDF1DD471 | possible worlds. A pessimist is sure of it!