Nonetheless, most of our current tools are still adapted primarily for vendor-specific character sets that can't handle more than a few languages at one time. Thus, learning how to convert your documents from proprietary to more standard character sets is crucial.
Some of the better XML and HTML editors let you choose the character set you wish to save in and perform automatic conversions from the native character set you use for editing. On Unix, the native character set is likely one of the standard ISO character sets, and you can save into that format directly. On the Mac, you can avoid problems if you stick to pure ASCII documents. On Windows, you can go a little further and use Latin-1, if you're careful to stay away from the extra characters that aren't part of the official ISO-8859-1 specification. Otherwise, you'll have to convert your document from its native, platform-dependent encoding to one of the standard platform-independent character sets.
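The "extra characters" that Windows adds to Latin-1 occupy the byte range 0x80 through 0x9F in Cp1252, where ISO-8859-1 defines no printable characters. As a rough illustration, the following Python sketch scans a document's raw bytes for that range. The function name find_cp1252_extras is hypothetical, and the check assumes the file was saved in a single-byte Windows encoding; it is not a general-purpose encoding detector.

```python
def find_cp1252_extras(data):
    """Return (offset, byte value) pairs for bytes in the 0x80-0x9F range.

    Cp1252 puts printable characters such as curly quotes and the euro
    sign in this range; ISO-8859-1 has no printable characters there.
    Assumes `data` is a bytes object from a single-byte encoded file.
    """
    return [(i, b) for i, b in enumerate(data) if 0x80 <= b <= 0x9F]


# Sample text with curly quotes, encoded the way a Windows editor might save it.
sample = "caf\u00e9 \u201cquoted\u201d".encode("cp1252")

for offset, value in find_cp1252_extras(sample):
    print(offset, hex(value))
# prints:
# 5 0x93
# 12 0x94
```

An empty result suggests the document can be labeled ISO-8859-1 as it stands. The é in the sample (byte 0xE9) is not reported because it is a legitimate Latin-1 character.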
François Pinard has written an open source character-set conversion tool called recode for Linux and Unix, which you can download from http://www.iro.umontreal.ca/contrib/recode/ as well as from the usual GNU mirror sites. Wojciech Galazka has ported recode to DOS; see ftp://ftp.simtel.net/pub/simtelnet/gnu/djgpp/v2gnu/rcode34b.zip. You can also use the Java Development Kit's native2ascii tool, documented at http://java.sun.com/products/jdk/1.3/docs/tooldocs/win32/native2ascii.html. First convert the file from its native encoding to Java's special ASCII-encoded Unicode format, then use the same tool in reverse to convert from the Java format to the encoding you actually want. For example, to convert the file myfile.xml from the Windows Cp1252 encoding to UTF-8, execute these two commands in sequence:
% native2ascii -encoding Cp1252 myfile.xml myfile.jtx
% native2ascii -reverse -encoding UTF-8 myfile.jtx myfile.xml
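The same decode-then-encode round trip can be sketched in a few lines of Python for readers without a JDK at hand. This is an illustrative alternative, not part of native2ascii or recode; the function name convert_encoding and its parameter names are hypothetical. It works on raw bytes so that line endings pass through unchanged.

```python
from pathlib import Path


def convert_encoding(src, dst, src_encoding="cp1252", dst_encoding="utf-8"):
    """Re-encode the file at `src` and write the result to `dst`.

    Decoding turns the source bytes into Unicode text (the step that
    native2ascii performs with its intermediate ASCII-escaped file);
    encoding then writes that text in the target character set.
    """
    raw = Path(src).read_bytes()
    # Raises UnicodeDecodeError if a byte is undefined in src_encoding.
    text = raw.decode(src_encoding)
    # Raises UnicodeEncodeError if dst_encoding cannot represent a character.
    Path(dst).write_bytes(text.encode(dst_encoding))
```

Calling convert_encoding("myfile.xml", "myfile-utf8.xml") would correspond to the two commands above, assuming myfile.xml really is in Cp1252. The sketch does not touch the file's contents, so any encoding declaration inside the XML document still has to be updated by hand to match the new encoding.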
Copyright © 2002 O'Reilly & Associates. All rights reserved.