Question

I'm writing a simple, small app that allows me to share information. I have a question about using regex to validate an email address. I'm learning on my own, but when it comes to real-world examples, such as strings that can be validated with regular expressions, I'm kind of stuck.

Exercise: Untangle the following regular expression that validates an email address:

  [a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?

It looks like a jumble of characters.

Can someone please explain how this works?

I tried to use the online resources by Jan Goyvaerts. I would appreciate any help.

Solution 5

Two things to keep in mind:

  1. Escaping special characters is messy.
  2. Email addresses are complicated.

I recommend studying this post if you are really interested. Please also check out these other posts: Validation in Regex and Regex Help.

OTHER TIPS

First of all, there is a good thread about exactly the same thing: Using a regular expression to validate an email address

Below is an explanation of your regular expression, piece by piece:

[a-z0-9!#$%&'*+/=?^_`{|}~-]+

- The square brackets define a character class, which matches any single character listed inside the brackets. The plus sign ('+') is a quantifier meaning that the character class must match one or more times, so this part must be at least one character long.

Also, the '+' is greedy, so this part of the pattern matches the longest possible run of such characters.

As for the contents of the square brackets, 'a-z' means any character in the range from a to z inclusive, and '0-9' works the same way for digits. All the other characters are plain literals here; the trailing '-' is literal because it comes last in the class.
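To see the greedy '+' in action, here is a small Python sketch that applies only this first piece. The name LOCAL_CHUNK and the sample address are made up for illustration.

```python
import re

# First piece of the pattern: one or more characters from the class.
# LOCAL_CHUNK is a hypothetical name used only in this sketch.
LOCAL_CHUNK = re.compile(r"[a-z0-9!#$%&'*+/=?^_`{|}~-]+")

# The greedy '+' takes as many class characters as it can and stops
# at the first character outside the class (here, the dot).
m = LOCAL_CHUNK.match("john_doe+tag.smith@example.com")
print(m.group())  # john_doe+tag

# Uppercase letters are not in a-z, so nothing matches at position 0.
print(LOCAL_CHUNK.match("John"))  # None
```

Note that the class only lists lowercase letters, so the pattern as written rejects uppercase input.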

(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*

- In regular expressions, parentheses group part of a pattern, and the asterisk ('*') is a greedy quantifier that means "occurs zero or more times". So this group may not appear at all, but it may also repeat any number of times.

Inside the parentheses, the leading ?: makes the group non-capturing: whatever it matches is not stored as a sub-string for later reference.

Going further, \. means a literal dot (see Escape sequence). The backslash is needed because an unescaped dot is a metacharacter in regex that matches any character.

After the dot we again see the character class explained above.
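Putting the first two pieces together gives everything before the '@'. The following Python sketch checks a few candidate names against just that part; CHUNK, LOCAL_PART and is_local_part are hypothetical names, not something from the original pattern.

```python
import re

# The character class piece, reused for both parts (hypothetical names).
CHUNK = r"[a-z0-9!#$%&'*+/=?^_`{|}~-]+"
# One chunk, then zero or more ".chunk" repetitions.
LOCAL_PART = re.compile(CHUNK + r"(?:\." + CHUNK + r")*")

def is_local_part(text):
    """Return True if the whole string fits the part before the '@'."""
    return LOCAL_PART.fullmatch(text) is not None

print(is_local_part("john"))           # True  (the group repeats zero times)
print(is_local_part("john.q.public"))  # True  (the group repeats twice)
print(is_local_part("john."))          # False (a dot needs characters after it)
print(is_local_part(".john"))          # False (must start with a class character)
print(is_local_part("john..doe"))      # False (two dots in a row)

# Because the group is non-capturing, the match stores no sub-strings.
print(LOCAL_PART.fullmatch("john.q.public").groups())  # ()
```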

@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+

- Here we see the at sign ('@'), which is just a literal character. It is followed by a non-capturing group that must occur one or more times (because of the + after it). That group contains a single character of the [a-z0-9] class, then another non-capturing group, and finally a literal dot. Everything in the inner group is covered by my explanations above except the question mark ('?'), which, when used as a quantifier as it is here, means "either once or not at all".

[a-z0-9](?:[a-z0-9-]*[a-z0-9])?

- This last part is the same as the contents of the group explained above, minus the trailing dot, so I believe you now have enough information to understand it.
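The part after the '@' can be tried on its own as well. This is a minimal Python sketch under the assumption that we only care about what this pattern accepts, not about what real DNS allows; LABEL, DOMAIN and is_domain are names invented for the example.

```python
import re

# One domain label: starts with a letter or digit, and if it is longer
# than one character it must also end with a letter or digit.
LABEL = r"[a-z0-9](?:[a-z0-9-]*[a-z0-9])?"
# One or more "label." groups, then a final label without a dot.
DOMAIN = re.compile(r"(?:" + LABEL + r"\.)+" + LABEL)

def is_domain(text):
    """Return True if the whole string fits the part after the '@'."""
    return DOMAIN.fullmatch(text) is not None

print(is_domain("example.com"))         # True
print(is_domain("mail.example.co.uk"))  # True
print(is_domain("a.b"))                 # True  (the '?' group matched "not at all")
print(is_domain("localhost"))           # False (at least one dot is required)
print(is_domain("-example.com"))        # False (label starts with a hyphen)
print(is_domain("example-.com"))        # False (label ends with a hyphen)
```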

More on quantifier types here: Greedy vs. Reluctant vs. Possessive Quantifiers.

A good Regular Expressions reference: Regular Expression Language - Quick Reference

Some information on capturing in Regular Expressions: Regex Tutorial - Parentheses for Grouping and Capturing

About special characters: Regex Tutorial - Literal Characters and Special Characters

Regex statements can be fun yet tricky to follow. There are 5 parts to this statement.

One or more valid characters for a username

[a-z0-9!#$%&'*+/=?^_`{|}~-]+

Check for a single '.' followed by one or more valid characters (this group can repeat zero or more times)

(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*

The '@' symbol

@

One or more valid second- or lower-level domains, each followed by a '.'

(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+

A valid top level domain

[a-z0-9](?:[a-z0-9-]*[a-z0-9])?
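If it helps, the five parts can be wrapped in named groups so that a single match reports which piece of an address each part consumed. This is only a sketch in Python; the group names (local_start, local_rest, at, subdomains, tld) and the helper split_address are my own labels, not part of the original expression.

```python
import re

# The five parts, in order, each with a hypothetical label.
PARTS = [
    ("local_start", r"[a-z0-9!#$%&'*+/=?^_`{|}~-]+"),
    ("local_rest", r"(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*"),
    ("at", r"@"),
    ("subdomains", r"(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+"),
    ("tld", r"[a-z0-9](?:[a-z0-9-]*[a-z0-9])?"),
]

# Joining the parts unchanged gives back the pattern from the question.
FULL_PATTERN = "".join(pattern for _, pattern in PARTS)

# Wrapping each part in a named group lets one match report every piece.
LABELLED = re.compile(
    "".join(f"(?P<{name}>{pattern})" for name, pattern in PARTS)
)

def split_address(address):
    """Return a dict of part name -> matched text, or None if no match."""
    match = LABELLED.fullmatch(address)
    return match.groupdict() if match else None

pieces = split_address("john.doe@mail.example.com")
for name, text in pieces.items():
    print(f"{name:12} {text!r}")
```

For the sample address, local_start is 'john', local_rest is '.doe', subdomains is 'mail.example.' and tld is 'com'.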

I recommend http://www.ultrapico.com/expresso.htm. It will break the statement down for you.

I've found a remarkable tool for visualizing regular expressions here: http://regexper.com

It shows me that your regular expression breaks down like this. Hopefully this helps explain it.

  1. [a-z0-9!#$%&'*+/=?^_`{|}~-]+
    This looks for at least one of the characters given here (a-z, 0-9, and those special characters).
  2. (?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*
    This looks for the same as above, but only when it follows a dot. This part is optional and can be repeated indefinitely. Because every dot must be followed by at least one character, the name cannot end with a dot.
  3. @
    Matches the @ symbol.
  4. (?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+
    This matches a label that starts and ends with a-z or 0-9, with optional - characters in the middle, followed by a dot. This has to be matched at least once.
  5. [a-z0-9](?:[a-z0-9-]*[a-z0-9])?
    This looks for a-z or 0-9, optionally followed by more a-z, 0-9, or - characters, but it can't end with a -.
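As a quick check of the whole expression, here is a Python sketch that runs it against a handful of addresses. EMAIL and looks_like_email are hypothetical names, and fullmatch is used on the assumption that the entire string should be an address; search would instead find an address anywhere inside a longer text.

```python
import re

# The full pattern from the question, split over two lines for readability.
EMAIL = re.compile(
    r"[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*"
    r"@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?"
)

def looks_like_email(text):
    """Return True if the entire string matches the pattern."""
    return EMAIL.fullmatch(text) is not None

print(looks_like_email("john.doe@example.com"))  # True
print(looks_like_email("a@b.c"))                 # True
print(looks_like_email("john.@example.com"))     # False (name ends with a dot)
print(looks_like_email("john@example"))          # False (no dot in the domain)
print(looks_like_email("john@example-.com"))     # False (label ends with '-')
print(looks_like_email("John@example.com"))      # False (uppercase not in a-z)

# search() picks the address out of surrounding text instead.
found = EMAIL.search("contact: john@example.com, thanks")
print(found.group())  # john@example.com
```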

See this answer. The problem is probably too difficult to solve completely. You have two problems here: 1. Regular expressions are not easy. 2. Escaping special characters is messy. Finally, email addresses are complicated. I recommend studying this post if you are really interested.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow