
Java Servlet Programming

12.4. Multiple Languages

Now it's time to push the envelope a little and attempt something that has only recently become possible. Let's write a servlet that includes several languages on the same page. In a sense, we have already written such a servlet. Our last example, HelloJapan, included both English and Japanese text. That, however, is a special case. Adding English text to a page is almost always possible, because nearly all charsets include the 128 US-ASCII characters. In the more general case, when the text on a page contains a mix of languages and none of the previously mentioned charsets contains all the necessary characters, we need an alternative technique.
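The point about US-ASCII can be checked directly. The following is only a sketch, not part of the book's examples: the class name AsciiSubset and its helper methods are hypothetical, and it assumes a much later JDK than the one discussed in this chapter (java.nio.charset.StandardCharsets arrived in Java 7). It shows that plain English text encodes to the same bytes under several charsets, while a string mixing Spanish, Chinese, and Russian cannot be represented in ISO-8859-1 at all.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class AsciiSubset {

  // True if every character of s has a representation in the given charset
  static boolean canEncode(Charset cs, String s) {
    return cs.newEncoder().canEncode(s);
  }

  // True if s encodes to identical bytes in both charsets
  static boolean sameBytes(String s, Charset a, Charset b) {
    return Arrays.equals(s.getBytes(a), s.getBytes(b));
  }

  public static void main(String[] args) {
    String english = "Hello World!";
    // Spanish punctuation, two Chinese characters, and a Russian word
    String mixed = "\u00a1Hola! \u4f60\u597d "
                 + "\u0417\u0434\u0440\u0430\u0432\u0441\u0442\u0432\u0443\u0439";

    // US-ASCII text is byte-for-byte identical in these charsets
    System.out.println(sameBytes(english, StandardCharsets.US_ASCII,
                                 StandardCharsets.ISO_8859_1));   // true
    System.out.println(sameBytes(english, StandardCharsets.ISO_8859_1,
                                 StandardCharsets.UTF_8));        // true

    // The mixed string does not fit in Latin-1 but does fit in UTF-8
    System.out.println(canEncode(StandardCharsets.ISO_8859_1, mixed)); // false
    System.out.println(canEncode(StandardCharsets.UTF_8, mixed));      // true
  }
}
```

Only charsets that every Java platform is required to provide are used here, so the sketch does not depend on optional charsets such as Shift_JIS being installed.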

12.4.1. UCS-2 and UTF-8

The best way to generate a page containing multiple languages is to output 16-bit Unicode characters to the client. There are two common ways to do this: UCS-2 and UTF-8. UCS-2 (Universal Character Set, 2-byte form) sends Unicode characters in what could be called their natural format, two bytes per character. All characters, including US-ASCII characters, require two bytes. UTF-8 (UCS Transformation Format, 8-bit form) is a variable-length encoding. With UTF-8, each 16-bit Unicode character is transformed into a 1-, 2-, or 3-byte representation. UTF-8 tends to be more efficient than UCS-2 for pages dominated by US-ASCII characters, because it encodes each of them in a single byte; the trade-off is that many Asian characters take three bytes instead of two. For this reason, the use of UTF-8 on the Web far exceeds that of UCS-2. For more information on UTF-8, see RFC 2279 at http://www.ietf.org/rfc/rfc2279.txt.
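To make the size difference concrete, here is a minimal sketch that counts the bytes each encoding needs for three sample characters. The class EncodingSizes and its methods are hypothetical names introduced for illustration. It also assumes a modern JDK (StandardCharsets is Java 7 or later) and uses the UTF-16BE charset as a stand-in for UCS-2, which is byte-for-byte the same for the 16-bit characters discussed here.

```java
import java.nio.charset.StandardCharsets;

public class EncodingSizes {

  // Number of bytes needed to encode s as UTF-8
  static int utf8Length(String s) {
    return s.getBytes(StandardCharsets.UTF_8).length;
  }

  // Number of bytes needed to encode s as big-endian 16-bit units,
  // with no byte order mark (assumed equivalent to UCS-2 here)
  static int ucs2Length(String s) {
    return s.getBytes(StandardCharsets.UTF_16BE).length;
  }

  public static void main(String[] args) {
    // A US-ASCII letter, a Latin-1 letter, and a CJK ideograph
    String[] samples = { "A", "\u00f1", "\u4e16" };
    for (String s : samples) {
      System.out.println("U+" + Integer.toHexString(s.charAt(0))
          + " UTF-8=" + utf8Length(s)
          + " UCS-2=" + ucs2Length(s));
    }
  }
}
```

Run as written, this should report 1, 2, and 3 bytes under UTF-8 for the three samples, against a constant 2 bytes under UCS-2.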

Before we proceed, you should know that support for UTF-8 is just beginning to appear on the Web. Netscape first added support for the UTF-8 encoding in Netscape Navigator 4, and Microsoft first added support in Internet Explorer 4.

12.4.2. Writing UTF-8

Example 12-7 shows a servlet that uses the UTF-8 encoding to say "Hello World!" and tell the current time (in the local time zone) in English, Spanish, Japanese, Chinese, Korean, and Russian. A screen shot of the servlet's output is shown in Figure 12-5.

Example 12-7. A servlet version of the Rosetta Stone

import java.io.*;
import java.text.*;
import java.util.*;
import javax.servlet.*;
import javax.servlet.http.*;

import com.oreilly.servlet.ServletUtils;

public class HelloRosetta extends HttpServlet {

  public void doGet(HttpServletRequest req, HttpServletResponse res)
                               throws ServletException, IOException {
    Locale locale;
    DateFormat full;

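    try {
      // Set the content type and charset before calling getWriter(), so the
      // returned PrintWriter encodes its output as UTF-8
      res.setContentType("text/plain; charset=UTF-8");
      PrintWriter out = res.getWriter();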

      locale = new Locale("en", "US");
      full = DateFormat.getDateTimeInstance(DateFormat.LONG, 
                                            DateFormat.LONG,
                                            locale);
      out.println("In English appropriate for the US:");
      out.println("Hello World!");
      out.println(full.format(new Date()));
      out.println();

      locale = new Locale("es", "");
      full = DateFormat.getDateTimeInstance(DateFormat.LONG, 
                                            DateFormat.LONG,
                                            locale);
      out.println("En Espa\u00f1ol:");
      out.println("\u00a1Hola Mundo!");
      out.println(full.format(new Date()));
      out.println();

      locale = new Locale("ja", "");
      full = DateFormat.getDateTimeInstance(DateFormat.LONG,
                                            DateFormat.LONG,
                                            locale);
      out.println("In Japanese:");
      out.println("\u4eca\u65e5\u306f\u4e16\u754c");
      out.println(full.format(new Date()));
      out.println();

      locale = new Locale("zh", "");
      full = DateFormat.getDateTimeInstance(DateFormat.LONG,
                                            DateFormat.LONG,
                                            locale);
      out.println("In Chinese:");
      out.println("\u4f60\u597d\u4e16\u754c");
      out.println(full.format(new Date()));
      out.println();

      locale = new Locale("ko", "");
      full = DateFormat.getDateTimeInstance(DateFormat.LONG,
                                            DateFormat.LONG,
                                            locale);
      out.println("In Korean:");
      out.println("\uc548\ub155\ud558\uc138\uc694\uc138\uacc4");
      out.println(full.format(new Date()));
      out.println();

      locale = new Locale("ru", "");
      full = DateFormat.getDateTimeInstance(DateFormat.LONG,
                                            DateFormat.LONG,
                                            locale);
      out.println("In Russian (Cyrillic):");
      out.print("\u0417\u0434\u0440\u0430\u0432\u0441\u0442");
      out.println("\u0432\u0443\u0439, \u041c\u0438\u0440");
      out.println(full.format(new Date()));
      out.println();
    }
    catch (Exception e) {
      log(ServletUtils.getStackTraceAsString(e));
    }
  }
}

Figure 12-5. A true hello world

For this servlet to work as written, your server must run JDK 1.1.6 or later. Earlier versions of Java throw an UnsupportedEncodingException when trying to get the PrintWriter, and the page is left blank. The problem is a missing charset alias. Java has supported the UTF-8 encoding since JDK 1.1 was first introduced. Unfortunately, the JDK used the name "UTF8" for the encoding, while browsers expect the name "UTF-8". So, who's right? It wasn't clear until early 1998, when the IANA (Internet Assigned Numbers Authority) declared "UTF-8" to be the preferred name. (See http://www.isi.edu/in-notes/iana/assignments/character-sets.) Shortly thereafter, JDK 1.1.6 added "UTF-8" as an alias for the "UTF8" encoding. For maximum portability across Java versions, you can use the "UTF8" name directly with the following code:

res.setContentType("text/plain; charset=UTF-8");
PrintWriter out = new PrintWriter(
  new OutputStreamWriter(res.getOutputStream(), "UTF8"), true);
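A variation on the same idea is to try the IANA name first and fall back to the older JDK name only if the runtime rejects it. The sketch below is an assumption-laden illustration rather than code from the book: the class Utf8Writers and the method utf8Writer() are hypothetical, and a ByteArrayOutputStream stands in for the stream a servlet would obtain from res.getOutputStream().

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.io.UnsupportedEncodingException;
import java.io.Writer;

public class Utf8Writers {

  // Hypothetical helper: wrap a byte stream in a UTF-8 PrintWriter,
  // preferring the name "UTF-8" and falling back to the legacy "UTF8"
  static PrintWriter utf8Writer(OutputStream out)
                                throws UnsupportedEncodingException {
    Writer writer;
    try {
      writer = new OutputStreamWriter(out, "UTF-8");
    }
    catch (UnsupportedEncodingException e) {
      // Runtimes that predate the "UTF-8" alias may still know "UTF8"
      writer = new OutputStreamWriter(out, "UTF8");
    }
    return new PrintWriter(writer, true);  // autoflush on println()
  }

  public static void main(String[] args) throws IOException {
    // Stand-in for res.getOutputStream() in a servlet
    ByteArrayOutputStream buf = new ByteArrayOutputStream();
    PrintWriter out = utf8Writer(buf);
    out.print("\u00a1Hola Mundo!");
    out.flush();
    // The inverted exclamation mark takes 2 bytes, the other 11 take 1 each
    System.out.println(buf.size() + " bytes");  // prints "13 bytes"
  }
}
```

In a servlet, the content type would still be set with the "UTF-8" name, since that header is what the browser reads; only the name passed to OutputStreamWriter varies.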

Also, your client must support the UTF-8 encoding and have access to all the necessary fonts. Otherwise, some of your output is likely to appear garbled.




Copyright © 2001 O'Reilly & Associates. All rights reserved.