Working with ServletRequest's setCharacterEncoding for UTF-8 form submissions
Aspire, the product that I developed over the last few years while working with Java, servlets, JSP and XML, is being used on some SBIR projects involving JetSpeed and a few other open source tools. I got a call the other day indicating that one of the forms was not accepting Vietnamese input: invalid characters were ending up in the MySQL database. The usage is pretty straightforward. The user sees a form in a browser and types Vietnamese into it (perhaps copied and pasted from a Vietnamese app). The form is then submitted to the server, which is expected to parse the form input into parameters. Mahaveer suggested that perhaps I should be using some kind of character encoding to retrieve the parameters.
When it comes to complicated things such as setting digital watches, programming VCRs, ordering a sub for your spouse (who always has a very discriminating taste), ordering at a local McDonald's drive-through, self-checkouts at Home Depot, and of course software, I am a minimalist. If something is not hampering my imagination and productivity, I usually don't upgrade. To exaggerate and make a point, I would be quite happy with Windows 95, Tomcat 1.x, FrontPage and JBuilder 3 for developing server-side Java applications that run in any container, while David prods me all the time to upgrade to XP and tells me how wonderful it is. (You can tell I am relegated to the autopilot Windows world.) Anyway, having made the point: I still compile Aspire with servlets 2.1 although the current release may be at 2.4. Not that I don't like the new stuff in this particular instance, but I want to be as backward compatible as possible as long as it is not crippling my style :-). For the curious, I did upgrade to XP because I needed better photo printing software.
Anyway, back to the encoding issue. I don't remember seeing any encoding issues while reading form submission parameters before. I remember porting one of our web sites to Japanese without any problem last year. Based on that, I advised Mahaveer that it should not be a problem and that servlets would probably figure out the necessary details to retrieve the parameters. Of course, I was proven wrong.
Character encoding, Form submissions and URLs in browsers
This is what happens when a user clicks on a submit button on a web page. The browser collects all the arguments on that form and gets ready to send a data stream to the web server. If the arguments are not ASCII, the browser needs to encode them in an alternate format. For example, in IE there is an advanced option (normally checked) that allows this encoding to be UTF-8, permitting foreign characters along with the English alphabet. Although the name says '8', that only refers to the 8-bit units it is built from: a single character in UTF-8 can take from one to four bytes, and hence UTF-8 can represent all the variations in the world's alphabets. In fact the 2.4 servlet spec describes this in a bit more detail. To cut a long story short, IE will dispatch this form to the server side in UTF-8 format.
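To make the "multiple bytes" point concrete, here is a small sketch in plain Java. The class name `Utf8FormEncodingDemo` and the sample word are my own choices for illustration. It counts the bytes UTF-8 needs for a character and shows roughly what a form value looks like on the wire, assuming the browser has picked UTF-8.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Illustrative only: the class and method names are hypothetical.
class Utf8FormEncodingDemo {

    // Number of bytes UTF-8 needs for the given text.
    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    // Percent-encodes a form value in the application/x-www-form-urlencoded style,
    // assuming the browser has chosen UTF-8.
    static String formEncode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on the Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // "Việt", written with an escape to avoid source-file encoding issues.
        String word = "Vi\u1EC7t";
        System.out.println(utf8Length("V"));       // a plain ASCII letter
        System.out.println(utf8Length("\u1EC7"));  // the Vietnamese letter in the middle of the word
        System.out.println(formEncode(word));      // what the form field looks like on the wire
    }
}
```

The ASCII letter takes one byte while the Vietnamese letter takes three, and those three bytes are what show up as percent escapes in the submitted form data.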
Default Servlet Behavior
The browser is supposed to set the character encoding in the content type of the post stream. As per the servlet documentation, most browsers do not do this at this time. This means the server side has to assume a certain encoding to read the stream and to retrieve any parameters from it. Apparently servlets will use the Latin-1 (ISO-8859-1) character set. Why they do this instead of UTF-8 (UTF-8 being seemingly more common) I am not sure. When the browser has actually sent UTF-8, decoding it as Latin-1 turns every non-ASCII character into garbage, and this is what is happening in the above case. Whether this behavior is any different in servlets before 2.3 I am not sure. Either way I have a problem, as Mahaveer is running Tomcat 4.x, which certainly runs on a servlet API that is at least 2.3.
get/setCharacterEncoding(): Suggested mechanism in 2.3
The 2.3 servlet API provides two methods, getCharacterEncoding and setCharacterEncoding, on the ServletRequest interface. In cases where neither the client nor the container sets the encoding, the get method returns null. It is clear that a client has the responsibility and the means to set the encoding. But one might ask how a container such as Tomcat can do this. Well, as Tomcat can intercept requests, it can use some external scheme, such as the locale of the browser, to actively determine what the encoding could be. I will discuss this topic in more depth later. Although it is not clear what the best strategy for determining the character encoding is, servlets do offer a set method to set it. This method takes a string value naming the encoding, such as "UTF-8". If this method is called before anything else is done with the incoming stream, in particular before any parameter is read, we will be in good shape.
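The failure is easy to reproduce outside a container. The following is only a sketch with names I made up (`WrongCharsetDemo`, `misreadAsLatin1`, `repair`); it decodes UTF-8 bytes as if they were Latin-1, which is my understanding of what the default behavior amounts to.

```java
import java.nio.charset.StandardCharsets;

// Illustrative only: the class and method names are hypothetical.
class WrongCharsetDemo {

    // Takes the bytes a UTF-8 browser would send and decodes them as Latin-1.
    static String misreadAsLatin1(String original) {
        byte[] wire = original.getBytes(StandardCharsets.UTF_8);
        return new String(wire, StandardCharsets.ISO_8859_1);
    }

    // Latin-1 maps every byte to exactly one char, so the original bytes
    // can be recovered and decoded again with the right charset.
    static String repair(String garbled) {
        byte[] wire = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(wire, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String typed = "Vi\u1EC7t";               // what the user typed
        String stored = misreadAsLatin1(typed);   // what ends up in the database
        System.out.println(stored.length());      // longer than the four characters typed
        System.out.println(repair(stored).equals(typed));
    }
}
```

The round trip through `repair` is there only to show that the bytes themselves arrive intact and that the damage comes from interpreting them with the wrong character set.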
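To see why the call has to come before anything else touches the request, here is a toy model. `SimulatedRequest` is a stand-in I wrote for this post, not the servlet API. It only mimics the two encoding methods and assumes, as I understand containers to behave, that parameters are parsed once on first access with Latin-1 as the fallback.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a container's request object. It is NOT the
// servlet API; it only mimics the encoding methods and lazy parameter parsing.
class SimulatedRequest {
    private final String rawBody;        // url-encoded form body as sent by the browser
    private String encoding;             // null until someone sets it
    private Map<String, String> parsed;  // filled on first parameter access

    SimulatedRequest(String rawBody) {
        this.rawBody = rawBody;
    }

    String getCharacterEncoding() {
        return encoding;
    }

    void setCharacterEncoding(String encoding) {
        this.encoding = encoding;
    }

    String getParameter(String name) {
        if (parsed == null) {
            // Parse once, falling back to Latin-1 when no encoding is known.
            parsed = parse(encoding != null ? encoding : "ISO-8859-1");
        }
        return parsed.get(name);
    }

    private Map<String, String> parse(String charset) {
        Map<String, String> result = new HashMap<>();
        for (String pair : rawBody.split("&")) {
            int eq = pair.indexOf('=');
            if (eq < 0) {
                continue; // skip fragments without a value
            }
            try {
                result.put(URLDecoder.decode(pair.substring(0, eq), charset),
                           URLDecoder.decode(pair.substring(eq + 1), charset));
            } catch (UnsupportedEncodingException e) {
                throw new IllegalArgumentException("Unknown encoding: " + charset, e);
            }
        }
        return result;
    }
}
```

With a body such as `name=Vi%E1%BB%87t`, calling `setCharacterEncoding("UTF-8")` before the first `getParameter` gives back the word as typed. Calling it after the first `getParameter` changes nothing in this model, because the parameters were already decoded with the fallback.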
Determining the encoding
In this particular case, as we are testing with IE and as IE seems to use UTF-8 all the time, the choice is easy enough. I would first check whether the get method returns null. If it does, I need to set the character encoding to UTF-8. Perhaps I can be a bit nicer and put this in a config file in case one has to change it globally. Another way to do this might be to examine the incoming locale and use a locale-to-character-encoding map. The web.xml already allows this by providing such a table for the response. Even in that case there is a problem: in the case that Mahaveer is dealing with, his locale is English but the encoding is UTF-8 as opposed to Latin-1. In my mind this is still an open question. A quick search on the internet on the subject has not been as fruitful as I had hoped.
Delving into the Locale dependency
The web.xml allows for a configuration like the following. Provided these mappings are in place, one doesn't have to set the encoding explicitly for the response. In the servlet spec this section is described as part of providing encoding hints for the response; it has no impact on the incoming request. Nevertheless, if this methodology works for you on the request side, it won't be hard to write a servlet filter that applies the same kind of mapping to requests.
<pre><code>
<locale-encoding-mapping-list>
  <locale-encoding-mapping>
    <locale>ja</locale>
    <encoding>Shift_JIS</encoding>
  </locale-encoding-mapping>
</locale-encoding-mapping-list>
</code></pre>
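If one wanted to try the same idea on the request side, the lookup itself is small. What follows is a sketch under my own assumptions: `RequestEncodingResolver` and its methods are hypothetical names, the mapping is keyed by language code only, and a declared encoding always wins over the locale.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper: picks an encoding for an incoming request.
final class RequestEncodingResolver {
    private final Map<String, String> byLanguage = new HashMap<>();
    private final String fallback;

    RequestEncodingResolver(String fallback) {
        this.fallback = fallback;
    }

    // Registers a language-to-encoding pair, e.g. "ja" -> "Shift_JIS".
    RequestEncodingResolver withMapping(String language, String encoding) {
        byLanguage.put(language, encoding);
        return this;
    }

    // declared: what getCharacterEncoding() returned (may be null).
    // locale: the locale reported for the request (may be null).
    String resolve(String declared, Locale locale) {
        if (declared != null) {
            return declared; // the client or container already told us
        }
        if (locale != null) {
            String mapped = byLanguage.get(locale.getLanguage());
            if (mapped != null) {
                return mapped;
            }
        }
        return fallback; // e.g. UTF-8 for an English locale that posts UTF-8
    }
}
```

A resolver built with `new RequestEncodingResolver("UTF-8").withMapping("ja", "Shift_JIS")` would return Shift_JIS for a Japanese locale and fall back to UTF-8 for an English one, which happens to cover Mahaveer's case. Whether a locale is a trustworthy hint for the request encoding is exactly the open question, so treat this as an experiment rather than a recommendation.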
So how to fix the problem
Based on what I have learned, it is not hard to change the code so that the request object has its encoding set right up front. But the trouble with this is that the code is then no longer backward compatible with servlets 2.2 and before. Moreover the change is invasive: if you later think of a different strategy to determine the character encoding, you have to change the code again. The obvious way, then, is to make the change via a servlet filter. With such an approach you can swap in a different filter, or drop in more servlet filters, in the future. In my case of Aspire there is an internal HTTP event model that allows me to do the same via configuration without depending on entries in web.xml.
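The shape of such a filter can be sketched without pulling in the servlet jar. Everything below is a stand-in of my own: `EncodableRequest` mirrors only the two ServletRequest methods discussed here, and `RequestHandler` plays the role that the filter chain plays in a real container. An actual filter would implement the servlet Filter interface instead.

```java
// Hypothetical stand-in: only the two ServletRequest methods this sketch needs.
interface EncodableRequest {
    String getCharacterEncoding();
    void setCharacterEncoding(String encoding);
}

// Hypothetical stand-in for "the rest of the chain".
interface RequestHandler {
    void handle(EncodableRequest request);
}

// Sets a default encoding only when nobody declared one, then passes the request on.
final class DefaultEncodingHandler implements RequestHandler {
    private final String defaultEncoding;
    private final RequestHandler next;

    DefaultEncodingHandler(String defaultEncoding, RequestHandler next) {
        this.defaultEncoding = defaultEncoding;
        this.next = next;
    }

    @Override
    public void handle(EncodableRequest request) {
        // Must happen before anything downstream reads a parameter.
        if (request.getCharacterEncoding() == null) {
            request.setCharacterEncoding(defaultEncoding);
        }
        next.handle(request);
    }
}

// Trivial request implementation used to try the handler out.
final class PlainRequest implements EncodableRequest {
    private String encoding;

    PlainRequest(String encoding) {
        this.encoding = encoding;
    }

    @Override
    public String getCharacterEncoding() {
        return encoding;
    }

    @Override
    public void setCharacterEncoding(String encoding) {
        this.encoding = encoding;
    }
}
```

The point of the design is that the strategy lives in one small class in front of the application code. Changing the default, or replacing the null check with something smarter, does not touch the servlets behind it.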
Support for encoding in JSPs
If you are using JSPs to process your form inputs you may be inclined to think that the page directive that sets the content type can automatically resolve this issue. It doesn't. This directive only works for the response. You still have to set the character encoding as in direct servlets. Nevertheless, for those who use JSTL there is a tag, fmt:requestEncoding, that can do the same. JSP also has a page encoding directive that deals with the encoding used to write the JSP page itself. For the discussion here, that directive has no relevance to request processing.
Well, I don't do any such fancy thing but it still works!
Even if you don't set the request encoding, things seem to work. My suspicion is that this is because the characters being submitted, such as plain English text, are encoded identically in Latin-1 and UTF-8. The two encodings agree only on the ASCII range, so the problem stays hidden until someone types a character outside it.
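The suspicion is easy to check by comparing bytes. This is a small sketch; `EncodingOverlapDemo` is just a name I picked.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Illustrative only: the class and method names are hypothetical.
class EncodingOverlapDemo {

    // True when the text produces identical bytes in UTF-8 and Latin-1,
    // in which case decoding with the "wrong" one of the two does no harm.
    static boolean sameBytesInBothEncodings(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1);
        return Arrays.equals(utf8, latin1);
    }

    public static void main(String[] args) {
        System.out.println(sameBytesInBothEncodings("hello world")); // plain English
        System.out.println(sameBytesInBothEncodings("caf\u00E9"));   // one accented letter
        System.out.println(sameBytesInBothEncodings("Vi\u1EC7t"));   // Vietnamese
    }
}
```

Plain English comes out the same either way. A single accented letter is already enough to make the two encodings disagree, so a site can run for a long time without anyone noticing the missing encoding.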
What are the allowed character encodings for servlets?
Because the set method takes an untyped string value to identify the encoding, it is important to get the names right. The URL http://www.iana.org/assignments/character-sets has a large list of these character encodings. Whether the servlet API supports all or only a portion of them is not clear from the docs. The Java Web Services Tutorial has a good article on understanding these character sets as well. Appendix F of the same document states that the Java platform supports only 4 encoding schemes:
US-ASCII
ISO-8859-1
UTF-8
UTF-16
It is worth reading Appendix F, as it explains the difference between UTF-8 and UTF-16 concisely and nicely.
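Since the name is just a string, one way to guard against typos is to ask the JVM whether it knows the name before using it. This is a sketch with a hypothetical helper, `CharsetCheck.isUsable`. It relies on `java.nio.charset.Charset`, which arrived with Java 1.4, and in my experience a given JVM usually reports many more charsets than the four listed above.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;

// Illustrative only: the class and method names are hypothetical.
class CharsetCheck {

    // True when this JVM can encode and decode with the named charset.
    static boolean isUsable(String name) {
        try {
            return Charset.isSupported(name);
        } catch (IllegalCharsetNameException e) {
            // The name contains characters that are not allowed in charset names.
            return false;
        }
    }

    public static void main(String[] args) {
        String[] names = {"US-ASCII", "ISO-8859-1", "UTF-8", "UTF-16", "Shift_JIS"};
        for (String name : names) {
            System.out.println(name + ": " + isUsable(name));
        }
    }
}
```

Running a check like this at startup on whatever name comes out of the config file catches a misspelled encoding early, before the first form is submitted.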
References
1. Java Web Services Tutorial, Chapter 23
2. List of Java encoding schemes
3. An explanation of Character sets and encodings in Servlets
4. Servlets 2.4 spec, The Request section under internationalization