Unicode in Rich Text Format

written on Sat, 16 Dec 2006 22:32

Today I wrote an RTF formatter for Pygments. In doing so I found out not only that the most recent RTF specification is basically just about Microsoft® Word®, but also that Unicode in RTF is really tricky.

RTF is really one of the worst markup languages ever created. It looks like a human-unreadable version of LaTeX and is heavily bound to Microsoft® Word®. Until, I guess, Microsoft® Word® 97, RTF just saved documents in the ANSI charset. So, what the hell is an ANSI charset? I have no idea. It looks like ANSI is a mixed bag of charsets where each font definition can refer to a different codepage. In fact I don't care, because there is a way to embed Unicode:

Binärcode

Looks like this in RTF:

Bin\ud{\u228\'e4}rcode

First you start the small adventure by opening a new block prefixed with the \ud keyword, which lets the RTF reader know that a Unicode part starts here. Next comes \u followed by the character's code point in the Unicode table as a 16-bit signed integer. Yeah. Signed. And only 16 bits long... So code points above 32767 have to be written as negative numbers, and not even everything you can encode in UTF-8 fits into what RTF thinks is Unicode.
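The signed 16-bit part is easier to see in code. This is only a minimal sketch of the arithmetic, not the code from the formatter; the function name rtf_unicode_number is made up for illustration, and refusing everything above U+FFFF is a simplification on my side.

```python
def rtf_unicode_number(char):
    """Return the number to write after \\u for a single character.

    Hypothetical helper: RTF wants a signed 16-bit value, so code
    points above 0x7FFF wrap around into the negative range.
    """
    code = ord(char)
    if code > 0xFFFF:
        # Assumption: characters outside the 16-bit range are simply
        # rejected in this sketch.
        raise ValueError("code point does not fit into 16 bits")
    if code > 0x7FFF:
        return code - 0x10000
    return code


print(rtf_unicode_number("ä"))       # 228
print(rtf_unicode_number("\ufffd"))  # -3
```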

Anyway. The part after the Unicode code point number is the ANSI fallback: the same character as a hex-escaped byte in the ANSI codepage. Because I'm lazy, in my example I just encoded it into iso-8859-1. If the character doesn't exist in ANSI, you should use the closest possible character. (Just replace it with ? ^^)
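Put together, the escaping described above can be sketched roughly like this. It is an illustration under assumptions, not necessarily what the formatter module does: the name rtf_escape is hypothetical, iso-8859-1 stands in for "ANSI" as in the example, and characters that don't fit into 16 bits are just turned into a plain ?.

```python
def rtf_escape(text, fallback_encoding="iso-8859-1"):
    """Escape text for RTF using the \\ud{\\uN\\'xx} form shown above.

    Hypothetical sketch: fallback_encoding plays the role of the ANSI
    codepage, and unknown characters fall back to "?".
    """
    out = []
    for char in text:
        code = ord(char)
        if char in "\\{}":
            # Backslash and braces are RTF syntax and must be escaped.
            out.append("\\" + char)
        elif code < 0x80:
            # Plain ASCII passes through unchanged.
            out.append(char)
        elif code <= 0xFFFF:
            # Signed 16-bit number for \u, then the ANSI fallback byte.
            signed = code - 0x10000 if code > 0x7FFF else code
            fallback = char.encode(fallback_encoding, "replace")[0]
            out.append("\\ud{\\u%d\\'%02x}" % (signed, fallback))
        else:
            # Simplification: no attempt to represent anything above U+FFFF.
            out.append("?")
    return "".join(out)


print(rtf_escape("Binärcode"))  # Bin\ud{\u228\'e4}rcode
```

A character that iso-8859-1 doesn't know, such as the euro sign, ends up with 3f (the ?) as its fallback byte.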

Check out the module for more information about the implementation if you like.

And here are some funny facts from the specification (which is, by the way, a Microsoft® Word® document packed in a zip exe file, which is in turn packed in a cab extract exe file):

The hidden style property can only be accessed using Microsoft® Visual Basic® for Applications.

That's the footnote for the \shidden style control word.

\oldas
Use Word 95 Auto spacing
\lnbrkrule
Don’t use Word 97 line breaking rules for Asian text.
\bdrrlswsix
Use Word 6.0/Word 95 borders rules.

Great, huh? I'm looking forward to discovering the differences in the spacing and line breaking rules between different Microsoft® Word® versions.
