Bob Ippolito (@etrepum) on Haskell, Python, Erlang, JavaScript, etc.

Borked Text Encodings


Unfortunately when you're dealing with text across platforms, you're likely to run into a situation where you see characters that don't belong. When I need to fixup such text, I put a snippet of the borked text on the clipboard and break open a Python interpreter. On platforms other than Mac OS X, you'll probably have to fiddle around with the I/O boundary codecs, but on Mac OS X it is generally going to be UTF-8 to and from standard I/O.

>>> import sys, codecs
>>> sys.stdout = codecs.getwriter('utf-8')(sys.stdout)
>>> borked = 'CaliforniaÕs'.decode('utf-8')
>>> borked

At this point we have the borked unicode text in a unicode instance named borked. From there, I just guess at encode/decode pairs until I get sensible looking text without errors:

>>> print borked.encode('latin-1').decode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 10-11: invalid data
>>> print borked.encode('macroman').decode('latin-1')
>>> print borked.encode('latin-1').decode('macroman')

So apparently, this text originated in macroman format, but ended up on a Windows machine and got marked as latin-1. From here, I'll leave it as an exercise to the reader to do this treatment in batch.

Are there any applications on the market that twiddle text encodings like this? I've certainly run into the problem a number of times. Would you pay for something that was really smart about guessing how to fix text encodings? If it could do the fix for rich text documents?

blog comments powered by Disqus