Question

Intention

I'm trying to do some very minimal validation of e-mail addresses, despite seeing a lot of advice against doing that. The reason is that the spec I am implementing requires e-mail addresses to be in this format:

mailto:<uri-encoded local part>@<domain part>

I'd like to simply split on the starting mailto: and the final @, and assume the "local part" is between these. I'll verify that the "local part" is URI encoded.

I don't want to do much more than this, and the spec allows for me to get away with "best effort" validation for most of this, but is very specific on the URI encoding and the mailto: prefix.

Problem

From everything I've read, splitting on the @ seems risky to me.

I've seen a lot of conflicting advice on the web and in Stack Overflow answers. Most of it says "read the RFCs", and some of it says that the domain part can only contain certain characters, i.e. 0-9 a-z A-Z - and ., maybe a couple of other characters, but not much more than this.

When I read various RFCs on domain names, I see that "any CHAR" (dtext) or "any character between ASCII 33 and 90" (dtext) is allowed, which implies that an @ symbol (ASCII 64) is allowed. This is further compounded because "comments" are allowed in parentheses ( ) and can contain characters between ASCII 42 and 91, which also include @.

RFC 1035 seems to support the letters+digits+dashes+periods requirement, but the "domain literal" syntax in RFC 5322 seems to allow more characters.

Am I misunderstanding the RFCs, or is there something I'm missing that disallows an @ in the domain part of an e-mail address? Is "domain literal" syntax something I don't have to worry about?

Solution

RFC 5322 is the most recent RFC for the format of email on the internet, and its section 3.4.1 specifically defines the syntax of addresses:

addr-spec       =   local-part "@" domain
local-part      =   dot-atom / quoted-string / obs-local-part

The dot-atom form allows only a highly restricted set of characters, defined in the spec, and @ is not one of them. The quoted-string form is where you can run into trouble. It's not often used, but if you do encounter it, the text inside the quotation marks can itself contain an @ character.
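To illustrate why a quoted-string local part matters, here is a minimal Python sketch. The address is a made-up example, not one from the spec in question. It compares splitting on the first @ with splitting on the last @:

```python
# Hypothetical address whose quoted-string local part contains an "@".
address = '"john@home"@example.com'

# Splitting on the first "@" cuts the quoted local part in half.
first_local, _, first_domain = address.partition("@")

# Splitting on the last "@" keeps the quoted local part intact.
last_local, _, last_domain = address.rpartition("@")

print(first_local, "|", first_domain)
print(last_local, "|", last_domain)
```

Only the second split yields a local part and a domain that can each be checked on their own.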

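However, if you split the string at the last @, you should safely have located the local-part and the domain, provided the domain is an ordinary host name (a dot-atom) rather than a bracketed domain literal or a domain carrying comments, both of which can syntactically contain an @. The host name form is well defined in the specification, so you can verify it.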
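A rough sketch of that split for the mailto: format in the question might look like the following. The function name `split_mailto` is hypothetical, and the code makes two assumptions: that "URI encoded" means only RFC 3986 unreserved characters or well-formed %HH escapes, and that the mailto: prefix is matched case-sensitively. Adjust both to whatever your spec actually says.

```python
import re

# Assumption: "URI encoded" means only RFC 3986 unreserved characters
# or well-formed %HH escapes appear in the local part.
_URI_ENCODED = re.compile(r"(?:[A-Za-z0-9._~-]|%[0-9A-Fa-f]{2})+")

# Assumption: the prefix must match exactly, including case.
PREFIX = "mailto:"


def split_mailto(value):
    """Return (local_part, domain) or raise ValueError."""
    if not value.startswith(PREFIX):
        raise ValueError("missing mailto: prefix")
    # rpartition splits on the LAST "@" in the remainder.
    local, at, domain = value[len(PREFIX):].rpartition("@")
    if not at or not local or not domain:
        raise ValueError("expected <local part>@<domain part>")
    if not _URI_ENCODED.fullmatch(local):
        raise ValueError("local part is not URI encoded")
    return local, domain
```

Under these assumptions, an @ inside the local part has to arrive as %40, so `mailto:john%40home@example.com` splits into `john%40home` and `example.com`, while a raw second @ is rejected.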

The complication comes with internationalized domain names. Punycode maps almost any Unicode character into a valid ASCII DNS label (recognizable by its xn-- prefix). If the system you are front-ending accepts domain names in their Unicode form and converts them to Punycode, then you have to handle almost anything that contains valid Unicode characters. If you know you will only ever see the ASCII form, then you can check against a much more restricted set: letters, digits, and the hyphen character, with periods separating the labels.
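As a best-effort illustration, the sketch below checks a domain against the letters, digits, and hyphen rule, with an optional conversion step for internationalized names. The helper names are hypothetical. The conversion relies on Python's built-in "idna" codec, which implements the older IDNA 2003 rules, so treat it as an approximation rather than a full implementation of current IDNA.

```python
import re

# One DNS label: letters, digits, hyphens; no leading or trailing hyphen,
# and at most 63 characters.
_LABEL = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?")


def to_ascii_domain(domain):
    """Convert a possibly internationalized domain to its ASCII form."""
    # Python's built-in "idna" codec implements the older IDNA 2003 rules.
    return domain.encode("idna").decode("ascii")


def is_ldh_domain(domain):
    """Best-effort check that every label uses only letters, digits, hyphens."""
    if not domain or len(domain) > 253:
        return False
    return all(_LABEL.fullmatch(label) for label in domain.split("."))
```

If Unicode domains are in scope, one option is to call `to_ascii_domain` first and then run `is_ldh_domain` on the result. If they are not, `is_ldh_domain` alone already rejects an @ in the domain.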

To quote the late, great Jon Postel: "TCP implementations should follow a general principle of robustness: be conservative in what you do, be liberal in what you accept from others."

Side note on the local part: keep in mind that many systems on the internet don't require strict adherence to the specs, so addresses that fall outside the spec may still work in practice because of this long-standing liberal-acceptance/conservative-transmission philosophy.

Licensed under: CC-BY-SA with attribution