Thursday, February 14, 2013

Testing with foreign character sets on a Mac

To help with testing foreign character sets I have created two additional character maps on my Mac that remap the standard British QWERTY keyboard to produce readable but non-standard characters. This allows me to easily enter characters outside the normal ASCII range that are still readable. I created these character maps with Ukelele, which lets me load the British character map and then remap the keys. Once I've remapped the keys I export the character maps as a bundle, put them in /Library/Keyboard Layouts, and then use System Preferences to add them to the list of available keyboard maps.
So the two character maps are:

  • British - Foreign : This uses accented characters that are below 0xFFFF in Unicode and so in plane 0. Ḥėṙė ïṡ ȧṅ ėẋȧṁṗĺė ȯḟ ṫẏṗïṅġ ẇïṫḥ ṫḥïṡ ḳėẏḃȯȧṙḋ ṁȧṗ.
    Ukelele File
  • British - Mathematical : This uses mathematical versions of the standard Latin characters, which sit above 0xFFFF in Unicode, in this case in plane 1 (there's a short sketch of why that matters just after this list). 𝖧𝖾𝗋𝖾 𝗂𝗌 𝖺𝗇 𝖾𝗑𝖺𝗆𝗉𝗅𝖾 𝗈𝖿 𝗍𝗒𝗉𝗂𝗇𝗀 𝗐𝗂𝗍𝗁 𝗍𝗁𝗂𝗌 𝗄𝖾𝗒𝖻𝗈𝖺𝗋𝖽 𝗆𝖺𝗉.
    Ukelele File
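
As an aside, the plane 1 characters are particularly good at flushing out bugs because in Java (and anything else using UTF-16 strings) they take two char values each. A minimal sketch, assuming the source file is saved and compiled as UTF-8:

    public class PlaneOneDemo {
        public static void main(String[] args) {
            String plane0 = "ḟȯṙïėġṅ";                   // accented Latin, all in plane 0
            String plane1 = "\uD835\uDDA7\uD835\uDDBE";  // "𝖧𝖾" - mathematical sans-serif, plane 1

            // Plane 0 characters fit in a single UTF-16 code unit each.
            System.out.println(plane0.length());                           // 7
            System.out.println(plane0.codePointCount(0, plane0.length())); // 7

            // Plane 1 characters need surrogate pairs, so length() and
            // codePointCount() disagree - a common source of bugs.
            System.out.println(plane1.length());                           // 4
            System.out.println(plane1.codePointCount(0, plane1.length())); // 2
        }
    }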

Character encoding on requests and Sakai

So an issue has come to light in our local Sakai deployment with character encodings. Some requests were being incorrectly interpreted as encoded in ISO-8859-1 instead of UTF-8. Before I explain what was going on, here is some background.

HTML specification

The HTML specification has stuff about encodings, and the short of it is that if you are making a GET request to a web server you shouldn't have any foreign characters in the URL; you should just use ASCII. In practice you can use foreign characters if you UTF-8 encode them, as that's the encoding browsers commonly assume, but it's not part of the standard. For example, here is Google Chrome displaying a URL with UTF-8 encoded characters at the end.
The URL in the referencing page ends 096/%E1%B8%9F%C8%AF%E1%B9%99%C3%AF%C4%97%C4%A1%E1%B9%85.%E1%B9%AB%E1%BA%8B%E1%B9%AB which is the URL encoded version of the UTF-8 characters "096/ḟȯṙïėġṅ.ṫẋṫ".
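
To produce that sort of path yourself you just percent-encode the UTF-8 bytes. A minimal sketch in Java (note that URLEncoder is really aimed at application/x-www-form-urlencoded data, but for a name with no spaces the output is the same percent-encoding):

    import java.io.UnsupportedEncodingException;
    import java.net.URLEncoder;

    public class EncodeDemo {
        public static void main(String[] args) throws UnsupportedEncodingException {
            // Percent-encodes the UTF-8 bytes of each character; '.' is left alone.
            String encoded = URLEncoder.encode("ḟȯṙïėġṅ.ṫẋṫ", "UTF-8");
            System.out.println(encoded);
            // %E1%B8%9F%C8%AF%E1%B9%99%C3%AF%C4%97%C4%A1%E1%B9%85.%E1%B9%AB%E1%BA%8B%E1%B9%AB
        }
    }
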
If you have a browser making POSTs to the server, then you have a choice of two ways of submitting the data: application/x-www-form-urlencoded (which is the default for a form tag) or multipart/form-data. If you are using characters in your form outside ASCII then you should use multipart/form-data, as browsers don't typically say what encoding they are using when performing application/x-www-form-urlencoded submissions, although lots of people assume it to be UTF-8.

Servlet request decoding

When a request comes in to Tomcat an HttpServletRequest object is built, and this includes the raw request path as well as a decoded one. Containers such as Tomcat commonly use ISO-8859-1 to decode the path, although this can be overridden in configuration. If the request is a POST and the content type is application/x-www-form-urlencoded then the container must also make the form data available as parameters, and it will decode them using the character set supplied by the browser. However, most browsers don't appear to send a character set when submitting urlencoded POSTs, so the container falls back to its default, which in the case of Tomcat is ISO-8859-1. This can be overridden by calling ServletRequest.setCharacterEncoding(String).
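
As a minimal sketch of that override (the servlet and parameter names here are made up for illustration), the important thing is that setCharacterEncoding() runs before anything touches the parameters, because the first getParameter() call is what triggers the decode:

    import java.io.IOException;
    import javax.servlet.ServletException;
    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;

    public class CommentServlet extends HttpServlet {
        @Override
        protected void doPost(HttpServletRequest req, HttpServletResponse resp)
                throws ServletException, IOException {
            // Must run before the first getParameter()/getParameterMap() call;
            // once the container has decoded the body the encoding is fixed.
            if (req.getCharacterEncoding() == null) {
                req.setCharacterEncoding("UTF-8");
            }
            String comment = req.getParameter("comment"); // decoded as UTF-8
            resp.setContentType("text/plain; charset=UTF-8");
            resp.getWriter().println(comment);
        }
    }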

If a request is a POST and the content type is multipart/form-data then the container doesn't do any decoding and it's up to the application to decode the body of the request and extract any parameters from it. This is one reason why people adopt application/x-www-form-urlencoded forms as it means they don't have to deal with parsing the requests, although there are lots of frameworks that help with this.
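
If you're on a Servlet 3.0 container you can lean on the Part API rather than parsing the body by hand; older stacks typically reach for something like Apache Commons FileUpload. A minimal sketch, with a made-up field name, that simply decodes a text field as UTF-8:

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import javax.servlet.ServletException;
    import javax.servlet.annotation.MultipartConfig;
    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;
    import javax.servlet.http.Part;

    @MultipartConfig // needed so the container will parse multipart/form-data
    public class UploadServlet extends HttpServlet {
        @Override
        protected void doPost(HttpServletRequest req, HttpServletResponse resp)
                throws ServletException, IOException {
            Part field = req.getPart("description"); // a plain text field, not a file
            // Decode the raw bytes as UTF-8; a fuller version would honour any
            // charset declared on the part's own Content-Type header.
            String value = read(field.getInputStream());
            resp.setContentType("text/plain; charset=UTF-8");
            resp.getWriter().println(value);
        }

        private String read(InputStream in) throws IOException {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[4096];
            for (int n; (n = in.read(buffer)) != -1; ) {
                out.write(buffer, 0, n);
            }
            return new String(out.toByteArray(), "UTF-8");
        }
    }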

Part of the reason for only having the container decode application/x-www-form-urlencoded requests is that multipart/form-data is often used for file uploads, which may be large, in which case you have to be careful about when you consume the upload and where you put the data.

Sakai and character sets

So Sakai supports Unicode and uses UTF-8 as its default encoding, but it does this by configuring all requests to use UTF-8. In the Tomcat connector configuration the URI encoding is specified as UTF-8 instead of ISO-8859-1, and the Sakai request filter that preprocesses all requests to Sakai (RequestFilter) sets the request encoding (if not already set) to UTF-8 for any URL encoded form submissions.
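
A much-simplified sketch of that second part (not Sakai's actual RequestFilter, just the idea; the connector side is the URIEncoding="UTF-8" attribute on Tomcat's Connector element):

    import java.io.IOException;
    import javax.servlet.Filter;
    import javax.servlet.FilterChain;
    import javax.servlet.FilterConfig;
    import javax.servlet.ServletException;
    import javax.servlet.ServletRequest;
    import javax.servlet.ServletResponse;

    // Default the request encoding to UTF-8 for url-encoded submissions
    // when the browser hasn't declared one.
    public class Utf8DefaultFilter implements Filter {
        public void init(FilterConfig config) { }
        public void destroy() { }

        public void doFilter(ServletRequest request, ServletResponse response,
                FilterChain chain) throws IOException, ServletException {
            String contentType = request.getContentType();
            if (request.getCharacterEncoding() == null && contentType != null
                    && contentType.startsWith("application/x-www-form-urlencoded")) {
                // Must happen before anything reads the parameters.
                request.setCharacterEncoding("UTF-8");
            }
            chain.doFilter(request, response);
        }
    }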

This means that you can create a form in Sakai and leave the encoding as application/x-www-form-urlencoded, and because the RequestFilter sets the encoding to UTF-8 everything works. Really this is a bug and the form should be changed to use the correct encoding, but as it generally works nobody notices. The more technically correct solution would be to have the original form submission made using multipart/form-data, as that way you normally get the character encoding used by the browser when submitting the request.

Back to the problem...

We had a filter that was doing some authentication (OAuth) before the standard Sakai request filter. It was all working correctly, but we started seeing bugs when people submitted requests containing foreign characters. After some investigation it turned out that the cause was the OAuth filter's inspection of the request parameters.

The OAuth filter needs to look at the request parameters to extract any authentication information, but in doing so it causes the servlet container to decode all the request parameters. By default (and following the spec) a servlet container will decode URL encoded parameters using the ISO-8859-1 character set. Once decoded, the parameters stay decoded with that initial character set; setting the encoding afterwards has no effect on them.
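
The fix, in outline, is just to make sure the encoding has been set before anything inspects the parameters. A hypothetical sketch of the corrected ordering (the filter and parameter names are illustrative, not our actual OAuth code):

    import java.io.IOException;
    import javax.servlet.Filter;
    import javax.servlet.FilterChain;
    import javax.servlet.FilterConfig;
    import javax.servlet.ServletException;
    import javax.servlet.ServletRequest;
    import javax.servlet.ServletResponse;

    public class OAuthCheckFilter implements Filter {
        public void init(FilterConfig config) { }
        public void destroy() { }

        public void doFilter(ServletRequest request, ServletResponse response,
                FilterChain chain) throws IOException, ServletException {
            // If we called request.getParameter() here first, the container would
            // decode the whole body as ISO-8859-1 and that decode would be final.
            // So default the encoding before inspecting anything.
            if (request.getCharacterEncoding() == null) {
                request.setCharacterEncoding("UTF-8");
            }
            String token = request.getParameter("oauth_token"); // now decoded as UTF-8
            // ... validate the token, then continue down the chain ...
            chain.doFilter(request, response);
        }
    }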
