
request.setCharacterEncoding() && request.getParameter()



Hi, everybody. Sorry for my poor English and for my ignorance too.

    We've built an application that uses UTF-8 as its default encoding (it
runs on an English Linux box, so the default Java encoding is UTF-8). A
few days ago, I added a new Servlet Filter to our application (to
change how the jsessionid is encoded into URLs). This filter
(URLSessionEncodingFilter) was placed before another filter
(SetRequestEncodingFilter) that performs

if (request.getCharacterEncoding() == null) {
   request.setCharacterEncoding(this.defaultEncoding);
}

    Today I found a bug in our application: except for multipart forms,
all non-English characters (like á and ç) sent in the HttpServletRequest were
garbled.
    The only cause I can think of is the call to
request.getParameter() inside URLSessionEncodingFilter. Because
request.getCharacterEncoding() is still null at that point and the request data
has to be read (for parameter parsing), "ISO-8859-1" was taken as the default
(I'm just guessing).

http://java.sun.com/j2ee/sdk_1.3/techdocs/api/javax/servlet/[..] (java.lang.String)
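
The guess above matches the servlet API contract: setCharacterEncoding() has no effect once parameters have been read. Below is a minimal plain-Java sketch of that ordering problem. FakeRequest and FilterOrderDemo are hypothetical stand-ins written for this illustration (they are not the real HttpServletRequest or the poster's filters), and the lazy, parse-once behaviour is a simplified model of what a container does.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical stand-in for a servlet request; NOT the real HttpServletRequest. */
final class FakeRequest {
    private final byte[] rawValue;   // bytes of one parameter as sent by the browser
    private Charset encoding;        // null until a filter sets it
    private String parsedValue;      // cached after the first read

    FakeRequest(byte[] rawValue) { this.rawValue = rawValue; }

    Charset getCharacterEncoding() { return encoding; }

    /** Ignored once the parameter has been parsed. */
    void setCharacterEncoding(Charset cs) {
        if (parsedValue == null) encoding = cs;
    }

    /** Parses lazily and only once, falling back to ISO-8859-1 when no encoding is set. */
    String getParameter() {
        if (parsedValue == null) {
            Charset cs = (encoding != null) ? encoding : StandardCharsets.ISO_8859_1;
            parsedValue = new String(rawValue, cs);
        }
        return parsedValue;
    }
}

final class FilterOrderDemo {
    /** Mimics the check done by SetRequestEncodingFilter in the post. */
    static void setRequestEncoding(FakeRequest r) {
        if (r.getCharacterEncoding() == null) r.setCharacterEncoding(StandardCharsets.UTF_8);
    }

    /** Wrong order: a parameter is read before the encoding is set. */
    static String readThenSet(byte[] body) {
        FakeRequest r = new FakeRequest(body);
        r.getParameter();        // the early read, as in URLSessionEncodingFilter
        setRequestEncoding(r);   // too late, the value is already decoded
        return r.getParameter();
    }

    /** Right order: the encoding is set before anything reads a parameter. */
    static String setThenRead(byte[] body) {
        FakeRequest r = new FakeRequest(body);
        setRequestEncoding(r);
        return r.getParameter();
    }

    public static void main(String[] args) {
        // Two accented characters (a-acute, c-cedilla) as UTF-8 bytes: C3 A1 C3 A7
        byte[] body = "\u00E1\u00E7".getBytes(StandardCharsets.UTF_8);
        System.out.println(readThenSet(body).length()); // prints 4 (garbled)
        System.out.println(setThenRead(body).length()); // prints 2 (correct)
    }
}
```

In the wrong order each UTF-8 byte becomes its own ISO-8859-1 character, which is the kind of garbling described above.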

    Well, I've switched the Servlet Filter execution order and everything is
working again. I'm wondering if there is a better way of doing this. Is
there?
    We've added "<meta http-equiv="Content-Type"
content='text/html;charset=UTF-8'>" to all our pages. I thought
that this way web browsers would make a better guess and send the request
charset as UTF-8 (I really don't know how this part of the HTTP
specification works).

    Any suggestions or ideas ?

    Thanks in advance !


Daniel Henrique Alves Lima Wed, 08 Jul 2009 06:30:28 -0700

If you're using JSP have you also checked that you've got:

  <%@ page contentType="text/html; charset=utf-8" %>

and not:

  <%@ page contentType="text/html; charset=iso-8859-1" %>

Also it's worth checking the request/response headers between each
browser type to check that there aren't any unexpected behaviours.

Firefox has a plugin called something like: LiveHttpHeaders, IE has an
equivalent, Safari has a development mode & tool.

Please keep us posted.

p


Pid Wed, 08 Jul 2009 06:58:46 -0700

Hi, P. Thanks for your answer.

The JSPs in our application already include this page directive.
Encoding is a really messy/boring issue for non-US apps :-(

I will take a look.

How about this "accept-charset"? I've never used it before...

http://www.w3.org/TR/html401/interact/forms.html#adef-accept[..]

Ok. Thanks !

--
"If there must be trouble, let it be in my day,
that my child may have peace."

Thomas Paine


Daniel Henrique Alves Lima Wed, 08 Jul 2009 08:00:11 -0700

IE is the best :-)

"Note: The accept-charset attribute does not work properly in Internet
Explorer. If accept-charset='ISO-8859-1', IE will send data encoded as
'Windows-1252'."

http://www.w3schools.com/TAGS/att_form_accept_charset.asp


Daniel Henrique Alves Lima Wed, 08 Jul 2009 08:04:32 -0700

Encoding issues come up on the list fairly frequently.  There is no "one
size fits all" answer.

The first thing to do is ensure that your app is absolutely, definitely,
100%, doing what you think it should be doing, every single time.

It may be worth building a small test app to develop against.

AFAIK accept-charset is not reliably supported by all browsers.

From recent-ish memory, I think, it was reported to the list that
browser clients tend to send content in the encoding format that the
previous document was received in - but I have experienced unpredictable
variations recently myself.

URL encoding can be set on the Connector element in server.xml too -
just to complicate matters even further.

p


Pid Wed, 08 Jul 2009 08:16:36 -0700

That is only one of the issues (browser inconsistencies).

If you want to really tackle this complex issue, you need to be
systematic, make sure you understand the bits and pieces, and do
everything right.
A short overview :

1) choose Unicode/UTF-8 as your charset/encoding, for *everything*.
Don't try to mix and match, or you'll get in trouble. Promise.

Applying #1 above :

2) find out the available "locales" on the Linux host where you run this
Tomcat.
"locale -a | more"
Pick one locale that has "utf8" in the name, note its name.
In the system script that starts Tomcat, add
export LC_ALL="pt_PT.utf8@euro"
(or whichever locale you have chosen)
That sets the "system locale" for the JVM that runs Tomcat, and is a way
to make it independent from whatever may be the system's configured
"default locale".

3) All your html pages should have a declaration like :
<meta http-equiv="content-type" content="text/html; charset=UTF-8" />

4) All your html <form> tags should have an attribute :
accept-charset="UTF-8"

5) a URL is in no particular charset.  A URL is *bytes*.
Any byte in a URL, that is not (generally speaking) such that it can be
represented by an ASCII letter a-zA-Z0-9, will be encoded as %xy, where
xy is the hexadecimal representation of this byte.
After decoding these %xy things, the result is again bytes, and that's
how your application sees it.

6) In your application, you can decide to interpret this series of
bytes, as a string in the UTF-8 encoding, and decode it as such into
Unicode *characters*.
Forget about any parameters to specify the charset of URLs, they only
confuse things totally.
The only way you know what was the underlying encoding, is when you know
for sure that all URLs that will hit your server, come from a known
source of which you controlled the encoding.

7) When submitting the values of the <input> tags of a form, browsers
will generally respect the basic encoding of the html page in which the
form was included, and (usually) also the "accept-charset" attribute.
By specifying both, you almost always win, as long as the submitted form
comes from your application, and has the right encoding.

8) In theory, you should also make sure that all responses sent by your
server to a browser, if they are html pages, contain the correct HTTP
header :
Content-type: text/html; charset=UTF-8
That, you can check with a browser add-on such as
- LiveHttpHeader for Firefox
- Fiddler2 for IE
and examine what goes out and what comes in.
You can also use Wireshark.
The good news is that most webservers do this correctly.
The bad news is that IE usually ignores this header, and makes its own
decision as to what the content is.  Newer IE versions may be better.

9) Java's internal charset is Unicode.
So when you do request.getParameter(), you will always get what Java
considers to be the proper Unicode translation of how the parameter came in.
The problem is to not let Java get confused about what it receives from
the browser.  By doing all the above, you minimise the chances that it
will be confused.

10) If you want to really make sure, include in all your forms some
hidden input value, containing a known string with "accented" characters
(áàéèÜÖ and such).
Then, before you process any other parameter in your webapp, check if
that string matches one that you have defined in your servlet.
If it does not, then something is wrong.
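
The check in item 10 can be sketched as below. This is only an illustration under assumptions: EncodingCanary and its method names are made up for this example, the canary string is the one suggested above (written with Unicode escapes so the source file stays pure ASCII), and in a real webapp the received value would come from request.getParameter() on the hidden field rather than from a simulated decode.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper for the hidden-field check; names are illustrative only. */
final class EncodingCanary {
    /** The known accented string from item 10, as Unicode escapes. */
    static final String EXPECTED = "\u00E1\u00E0\u00E9\u00E8\u00DC\u00D6";

    /** True when the hidden field arrived exactly as it was sent. */
    static boolean looksCorrect(String received) {
        return EXPECTED.equals(received);
    }

    /**
     * Simulates a browser sending the canary as UTF-8 bytes and a server
     * decoding those bytes with the given charset.
     */
    static String simulateRoundTrip(Charset serverCharset) {
        byte[] onTheWire = EXPECTED.getBytes(StandardCharsets.UTF_8);
        return new String(onTheWire, serverCharset);
    }

    public static void main(String[] args) {
        // Server decodes as UTF-8: the canary survives.
        System.out.println(looksCorrect(simulateRoundTrip(StandardCharsets.UTF_8)));
        // Server falls back to ISO-8859-1: the canary is garbled and the check fails.
        System.out.println(looksCorrect(simulateRoundTrip(StandardCharsets.ISO_8859_1)));
    }
}
```

A filter that runs this comparison before any other parameter is used can reject or log the request as soon as the decoding goes wrong, instead of letting garbled text reach the database.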


André Warnier Wed, 08 Jul 2009 09:14:59 -0700

11) Check any .java files are also encoded in UTF-8?
Might need one of the grandees to say whether that is meaningful.

p


Pid Wed, 08 Jul 2009 10:03:47 -0700

The encoding of the .java files shouldn't matter, as long as the glyphs used in any strings correspond to the encoding used *and* the default charset for the system where the compilation is done matches the encoding.

However, it is critical that .jsp files are stored in UTF-8 when using that as the encoding for the server.

- Chuck

THIS COMMUNICATION MAY CONTAIN CONFIDENTIAL AND/OR OTHERWISE PROPRIETARY MATERIAL and is thus for use only by the intended recipient. If you received this in error, please contact the sender and delete the e-mail and its attachments from all computers.


Charles R Caldarale Wed, 08 Jul 2009 10:30:11 -0700

Great how-to; any interest in adding it to the FAQ? http://wiki.apache.org/tomcat/FAQ/CharacterEncoding

- Chuck



Charles R Caldarale Wed, 08 Jul 2009 10:32:33 -0700

Hi, everybody. Thanks for the answers !

    Just to make myself clear:

    1. Always setting the request charset before doing anything else fixes the
bug;
    2. When the bug is "on", only the input data (request) is wrong. Data
previously encoded as UTF-8 is rendered correctly (response). At least, Firefox
says that the pages are using UTF-8 as their encoding.

    Andre:

Inconsistencies ? In Microsoft IE ? Never ! ;-)

Checked.

I'll change the startup script to set this before Tomcat starts.
I used to use LANG=C or JVM system properties directly (like
file.encoding, user.???? etc.).

Checked.

I'll change the jsp files to include this.

Ok. I think there is nothing like that in this webapp.

?

Ok.

Ok. Page properties (in Firefox) is showing UTF-8 as encoding.

Ok.

Ok.


Daniel Henrique Alves Lima Wed, 08 Jul 2009 11:28:26 -0700

I'll double check this, but we avoid using "special" characters in our
JSP and Java files. So even if the encoding is wrong, it will make no
difference: the files contain only printable "normal" (not "extended")
ASCII characters.
    Even in ResourceBundles (.properties), we apply native2ascii to escape
any of these "special" characters.


Daniel Henrique Alves Lima Wed, 08 Jul 2009 11:34:32 -0700

To use an example :

Suppose you give me the URL to your webapp, and it is http://your-server.somewhere.br/yourapp

Suppose I use this URL, and add a query string, so that it arrives to
your server as a GET request for
/yourapp?param=%45abcd%f3%b9123%c4%20xy

then, you have absolutely no way, after URL-decoding the above into a
series of bytes, to know under which character set I actually composed
that query string.

It /could be/, that the sequence %c4%20 that you see above, is actually
the UTF-8 encoding of a single Unicode character.(**)

But it could also be that in fact it is the two iso-8859-1 characters
"Ä" and "space".
And it could also be that, together with the "x" which follows, it is
the tri-byte encoding of the Klingon symbol for breakfast.(*)

In order to decide on an interpretation of that query string using a
certain character set and encoding, you would have to know something
about me and my browser, which on the WWW you don't know.
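
To make the ambiguity concrete, here is a small sketch that turns the query value above into raw bytes and then tries two interpretations. QueryBytes and its methods are names invented for this example, and the percent-decoder is deliberately minimal (it does not treat "+" as a space and does not validate the hex digits).

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Optional;

/** Hypothetical helper: percent-decoding gives bytes, not characters. */
final class QueryBytes {
    /** Turns %xy escapes into raw bytes; other characters are taken as their ASCII byte. */
    static byte[] percentDecode(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '%' && i + 2 < s.length()) {
                out.write(Integer.parseInt(s.substring(i + 1, i + 3), 16));
                i += 2; // skip the two hex digits just consumed
            } else {
                out.write(c);
            }
        }
        return out.toByteArray();
    }

    /** Strict decode: empty when the bytes are not valid in the given charset. */
    static Optional<String> tryDecode(byte[] bytes, Charset cs) {
        try {
            return Optional.of(cs.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString());
        } catch (CharacterCodingException e) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        byte[] raw = percentDecode("%45abcd%f3%b9123%c4%20xy");
        // Every byte sequence is valid ISO-8859-1, so this always "succeeds".
        System.out.println(tryDecode(raw, StandardCharsets.ISO_8859_1).isPresent());
        // These particular bytes are not well-formed UTF-8.
        System.out.println(tryDecode(raw, StandardCharsets.UTF_8).isPresent());
    }
}
```

A strict UTF-8 decode can at least rule UTF-8 out for a given request, but a successful ISO-8859-1 decode proves nothing, because every byte sequence is valid in that charset.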

The only way you could /assume/ a certain character set and encoding,
would be if this request could only originate from a page that your
application sent to my browser beforehand, in which you have done your
best to ensure that whatever "click" results in a request with a known
charset and encoding.
That's why all the previous details are important.

Note that some people variously assume that a HTTP URL is necessarily
expressed in US-ASCII, or iso-latin-1, or UTF-8.
They are generally mistaken, as per http://www.w3.org/Protocols/rfc2616/rfc2616-sec3.html#sec3.4 and http://www.apps.ietf.org/rfc/rfc3986.html

So, let me add an item to the previous shortlist :

11) in html <form> elements, always specify the attribute
method="POST"
This way, form input elements will be passed in the /body/ of the HTTP
request (and not in the URL, like in my GET example above).
At least for the body of a HTTP request, the browser can, and /should/
send charset/encoding information allowing the server to know how the
submitted parameters are encoded.

There seems to be a recent /tendency/ for browsers to use UTF-8 for
encoding request URLs, but it is by no means yet a universal thing.
(In IE for instance, it is a setting that must be turned on in "Internet
Options").

(*) This is a little-known fact, but there exists in fact a Klingon
relay station on Earth connected to our Internet, and the Klingons in
their spaceships use it from time to time to access Wikipedia and have a
good laugh.  Their keyboards and browsers are different from ours of course.

(**) and I bet someone is going to get back here and say that this
cannot possibly be a valid UTF-8 sequence.


André Warnier Wed, 08 Jul 2009 13:57:17 -0700

Daniel,

Sorry for the terse reply, but this page has lots of good information:

http://wiki.apache.org/tomcat/FAQ/CharacterEncoding

-chris


Christopher Schultz Wed, 08 Jul 2009 14:42:22 -0700

Hi, Chris. The only item missing from my checklist ("What can you
recommend to just make everything work?") is the first one (Set
URIEncoding="UTF-8" on your <Connector> in server.xml).
    I didn't know that "Most web browsers today do not specify the
character set of a request". Well, better late than never.

    Thanks !

--
"If there must be trouble, let it be in my day,
that my child may have peace."

Thomas Paine


Daniel Henrique Alves Lima Wed, 08 Jul 2009 15:12:04 -0700

Another question: even when the response's content type is not text (like
pdf/odt/doc or image streams), should I set the response charset? Does
"application/pdf; charset=UTF-8" make sense?

"If there must be trouble, let it be in my day,
that my child may have peace."

Thomas Paine


Daniel Henrique Alves Lima Wed, 08 Jul 2009 15:20:05 -0700

No.
It only makes sense for MIME types that start with "text/" (such as
"text/plain" and "text/html").

And, about the URIEncoding attribute for the Connector, understand the
following : it means that Tomcat is going to (try to) decode ALL URLs of
ALL requests that arrive on that port, for ALL Hosts and webapps, as if
these URIs are ALL UTF-8 encoded.
That may, or may not, fit your situation.


André Warnier Wed, 08 Jul 2009 15:42:47 -0700

I would suggest starting at the request level (that way only your request is affected)
and widening the scope when you want the same encoding for all other webapps: http://java.sun.com/products/servlet/2.3/javadoc/javax/servl[..] (java.lang.String)

Some background on UTF-8 vs UTF-16 is available at: http://download-west.oracle.com/otn_hosted_doc/jdeveloper/90[..]

You may also wonder whether multibyte scripts such as Chinese need special
treatment. They do not: UTF-8 is a variable-length encoding that covers every
Unicode character, so it is as safe for Mandarin text as it is for European
languages, and there is no need to switch to UTF-16.

hth
Martin


Martin Gainty Wed, 08 Jul 2009 19:21:29 -0700

André,

This shouldn't really matter: the default locale for the JVM does not
affect the encoding used for reading request URIs and bodies: the body
is always decoded using the Content-Type request header (or ISO-8859-1
if none is provided) and the URI is always decoded using ISO-8859-1
unless you have overridden it using the appropriate <Connector> attribute.
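
As a rough illustration of the body rule just described (take the charset from the Content-Type request header, otherwise fall back to ISO-8859-1), here is a simplified sketch. BodyCharset is a hypothetical helper written for this example; it is not Tomcat's actual parser, and falling back to ISO-8859-1 for an unknown charset name is an assumption of the sketch.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: picks the charset used to decode a request body. */
final class BodyCharset {
    private static final String KEY = "charset=";

    /** Returns the charset named in the header value, or ISO-8859-1 when none is usable. */
    static Charset forContentType(String contentType) {
        if (contentType != null) {
            for (String part : contentType.split(";")) {
                String p = part.trim();
                if (p.regionMatches(true, 0, KEY, 0, KEY.length())) {
                    String name = p.substring(KEY.length()).trim().replace("\"", "");
                    try {
                        return Charset.forName(name);
                    } catch (IllegalArgumentException e) {
                        break; // unknown or malformed name: use the default (assumption)
                    }
                }
            }
        }
        return StandardCharsets.ISO_8859_1;
    }

    public static void main(String[] args) {
        System.out.println(forContentType("application/x-www-form-urlencoded; charset=UTF-8"));
        System.out.println(forContentType("application/x-www-form-urlencoded"));
    }
}
```

Since most browsers send the second form, with no charset parameter, the default is what usually applies unless the application calls request.setCharacterEncoding() first.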

Reading files off the disk /is/ usually done using the default encoding.
I haven't read the spec wrt JSP files, but I would hesitate to use any
non-ASCII characters in these files - just like you should when saving
.java files. Any non-ASCII characters can be expressed using \uxxxx
syntax. Another poster mentioned using native2ascii with .properties
files, which can be used for this purpose as well.
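
For reference, the \uxxxx escaping mentioned above amounts to something like the following sketch. AsciiEscaper is a made-up name for this example and only approximates what native2ascii produces: every character outside 7-bit ASCII becomes a four-digit lowercase hex escape, and everything else is copied unchanged.

```java
/** Hypothetical helper approximating native2ascii-style escaping. */
final class AsciiEscaper {
    /** Replaces every non-ASCII UTF-16 code unit with a backslash-u escape. */
    static String escape(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 0x80) {
                sb.append(c); // plain ASCII is copied as-is
            } else {
                sb.append(String.format("\\u%04x", (int) c));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The input contains a-acute and c-cedilla, written as escapes to keep this file ASCII.
        System.out.println(escape("\u00E1rea de servi\u00E7o"));
    }
}
```

Files escaped this way read the same under any ASCII-compatible default encoding, which is why the approach sidesteps the file-encoding question altogether.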

This is debatable :)

Since you mentioned that you had done this at some point, I've been
thinking about a way to do this in an automated way, so you could sort
of "turn it on" for your entire site. I think the only way to wave a
magic wand and have this work is if you were already using some kind of
custom <xyz:form> JSP tag library, and you were to subclass and replace
the class that implements the <xyz:form> tag to add a hidden <input>
parameter. A corresponding Filter would need to be written to check for
the proper decoding of the GET parameter, but this could be used
site-wide with no further invasiveness. (Of course, using <xyz:form> is
probably relatively invasive unless you are already using a tag library
such as Struts's).

-chris


Christopher Schultz Sat, 11 Jul 2009 12:50:31 -0700


