Question

I'm trying to create a form validation unit that, in addition to the "regular" tests, checks encoding as well.

According to this article, http://www.w3.org/International/questions/qa-forms-utf-8, the only allowed characters in the range 0-31 are CR, LF and TAB, and DEL (127) is not allowed.

On the other hand, there are control characters in the range 0x80-0xA0. Different sources disagree on whether they are allowed. I have also seen that the rules differ between XHTML, HTML and XML.

Some articles say that FF (form feed) is allowed as well?

Can someone provide a good answer, with sources, on what is allowed and what isn't?

EDIT: Even http://www.w3.org/International/questions/qa-controls is somewhat ambiguous. It says:

The C1 range is supported

But the table there shows that they are illegal, while the UTF-8 validation shown earlier allows them?

Solution

The Unicode characters in these ranges are valid in HTML 4.01:

0x09..0x0A
0x0D
0x20..0x7E
0x00A0..0xD7FF
0xE000..0x10FFFF    
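
As a rough illustration, the ranges listed above can be turned into a small code point check. This is only a sketch in Python; the names `ALLOWED_RANGES`, `is_allowed` and `find_disallowed` are made up for this example and do not come from any specification or library.

```python
# Code point ranges listed above as valid in HTML 4.01 (inclusive bounds).
ALLOWED_RANGES = [
    (0x09, 0x0A),        # TAB, LF
    (0x0D, 0x0D),        # CR
    (0x20, 0x7E),        # printable ASCII
    (0xA0, 0xD7FF),      # from NBSP up to just before the surrogates
    (0xE000, 0x10FFFF),  # everything after the surrogates
]


def is_allowed(ch: str) -> bool:
    """Return True if the single character falls inside one of the ranges."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in ALLOWED_RANGES)


def find_disallowed(text: str) -> list:
    """Return (index, 'U+XXXX') for every character outside the ranges."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if not is_allowed(ch)
    ]
```

For example, `find_disallowed("a\tb\x7f\x85")` reports DEL and the C1 control U+0085 but not the TAB.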

In XHTML 1.0... it's unclear. See http://cmsmcq.com/2007/C1.xml#o127626258

OTHER TIPS

I think you're looking at this the wrong way around. The resources you link specify what encoded values are valid in (X)HTML, but it sounds like you want to validate the "response" from a web form, that is, the values of the various form controls as passed back to your server. In that case, you shouldn't be looking at what's valid in (X)HTML, but at what's valid in the application/x-www-form-urlencoded, and possibly also multipart/form-data, MIME types. The HTML 4.01 specification for <FORM> elements clearly states that for application/x-www-form-urlencoded, "Non-alphanumeric characters are replaced by '%HH'":

This is the default content type. Forms submitted with this content type must be encoded as follows:

  1. Control names and values are escaped. Space characters are replaced by '+', and then reserved characters are escaped as described in [RFC1738], section 2.2: Non-alphanumeric characters are replaced by '%HH', a percent sign and two hexadecimal digits representing the ASCII code of the character. Line breaks are represented as "CR LF" pairs (i.e., `%0D%0A').
  2. The control names/values are listed in the order they appear in the document. The name is separated from the value by '=' and name/value pairs are separated from each other by '&'.

As for which character encoding the escaped bytes are in (i.e. whether %A0 is a non-breaking space or an error), that's negotiated by the accept-charset attribute on your <FORM> element and the Content-Type header of the response (well, really of the GET or POST request).
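
To make the %A0 point concrete, here is a minimal Python sketch that decodes an application/x-www-form-urlencoded body with the standard library's `urllib.parse.parse_qs`. The wrapper name `decode_form_body` and its `charset` parameter are assumptions for this example; in a real application the charset would come from the request, not from a default.

```python
from urllib.parse import parse_qs


def decode_form_body(body: str, charset: str = "utf-8") -> dict:
    """Decode a urlencoded form body into {name: [values]}.

    errors="strict" makes percent-escaped bytes that are not valid in
    `charset` raise UnicodeDecodeError instead of being silently replaced.
    strict_parsing=True makes malformed name/value pairs raise ValueError.
    """
    return parse_qs(
        body,
        keep_blank_values=True,
        strict_parsing=True,
        encoding=charset,
        errors="strict",
    )
```

With this, `decode_form_body("x=%A0", charset="latin-1")` yields a non-breaking space, while `decode_form_body("x=%A0")` raises `UnicodeDecodeError`, because a lone 0xA0 byte is not valid UTF-8.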

Postel's Law: Be conservative in what you do; be liberal in what you accept from others.

If you're generating documents for others to read, you should avoid/escape all control characters, even if they're technically legal. And if you're parsing documents, you should endeavor to accept all control characters even if they're technically illegal.

First of all any octet is valid. The mentioned regular expression for UTF-8 sequences just omits some of them as they are rather uncommon in practice to be entered by a user. But that doesn’t mean that they are invalid. They are just not expected to occur.

The first link you mention does not have anything to do with validating the allowed characters in XHTML. The example on that link simply shows a common, generic pattern for detecting whether raw data is UTF-8 encoded.

This is a quote from the second link:

HTML, XHTML and XML 1.0 do not support the C0 range, except for HT (Horizontal Tabulation) U+0009, LF (Line Feed) U+000A, and CR (Carriage Return) U+000D. The C1 range is supported, i.e. you can encode the controls directly or represent them as NCRs (Numeric Character References).

The way I read this is:

Any control character in the C1 range is supported if you encode it (using base64 or hex representations) or represent it as an NCR.

Only U+0009, U+000A, and U+000D are supported in the C0 range. No other control code in that range can be represented.

If the document is known to be XHTML, then you should just load it and validate it against the schema.

What programming language do you use? At least for Java, there are libraries that check the encoding of a string (or byte array). I guess similar libraries exist for other languages too.

Do I understand your question correctly: you want to check whether the data submitted by a form is valid, and properly encoded?

If so, why do several things at once? It would be a lot easier to separate those checks, and perform them step by step, IMHO.

  1. You want to check that the submitted form data is correctly encoded (in UTF-8, I gather). As Archchancellor Ridcully says, that's easy to check in most languages.
  2. Then, if the encoding is correct, you can check whether it's valid form data.
  3. Then, if the form data is valid, you can check whether the data contains what you expect.
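
The three steps above could look roughly like this in Python. It is only a sketch under stated assumptions: `validate_field` and `max_len` are hypothetical names, rejecting every control character except TAB, LF and CR (including the disputed C1 range) is one possible policy rather than something the specifications settle, and the step 3 rule is a stand-in for whatever the application actually expects.

```python
import unicodedata


def validate_field(raw: bytes, max_len: int = 100):
    """Return (True, text) on success or (False, reason) on failure."""
    # Step 1: is the submitted data correctly encoded? A strict UTF-8
    # decode rejects stray bytes, overlong forms and surrogates.
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        return False, f"not valid UTF-8: {exc.reason}"

    # Step 2: is it valid form data? Category "Cc" covers C0, DEL and C1.
    # Assumption: only TAB, LF and CR are accepted.
    for ch in text:
        if unicodedata.category(ch) == "Cc" and ch not in "\t\n\r":
            return False, f"control character U+{ord(ch):04X}"

    # Step 3: does it contain what we expect? (hypothetical rule)
    if not text.strip() or len(text) > max_len:
        return False, "empty or too long"

    return True, text
```

Keeping the steps in this order means a later check never has to deal with input that an earlier check would have rejected.
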
Licensed under: CC-BY-SA with attribution