Email validation - Part 1 : DEFINING A RECIPIENT
After stumbling across the RFC 2822 e-mail spec, I thought I'd take a shot at defining an e-mail address and, along the way, attempt to create a regex to validate it. I'll work on it over the next week or so, so bear with me:
*******************************************************
NOTE:
To understand, technically, how difficult it can be to write an all-encompassing e-mail validation script, refer to the actual RFC 2822 specification (a link to an explanation of the RFC 822 spec can be found here:
http://www.cs.tut.fi/~jkorpela/rfc/822addr.html). That article also contains a simple e-mail validation regex. On the high end, Jeffrey Friedl, in his book Mastering Regular Expressions (1997, O'Reilly & Associates), presents an 11-page explanation of his 4,724-byte e-mail validation script.
You can also find other regex variations at the regex library:
http://regexlib.com/
*******************************************************
I think that at the simplest level you can safely make the following assumptions:
These characters are not allowed anywhere within the individual parts (the names themselves) of an e-mail address:
( ) < > @ , ; : \ " . [ ]
Some of them do, of course, play other roles in the full e-mail address: the @ character and the . character, for example, act as separators.
Therefore, put simply, an address is made up of two main parts: the part before the @ and the part after it.
At a high level, the part before the @ character defines the recipient and the part after the @ character defines the domain.
Therefore, we have:
Recipient@Domain
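As a rough sketch of that split (not a validator), the snippet below cuts an address at the "@". The function name splitAddress and the sample address at example.com are made up for illustration, and it assumes the simplified model above, where exactly one "@" appears.

```javascript
// Hypothetical helper: split an address into recipient and domain at the "@".
// Assumes the simplified model above, in which exactly one "@" is present.
function splitAddress(address) {
    var at = address.indexOf("@");
    // Reject a missing "@", a second "@", or an empty part on either side.
    if (at < 1 || at !== address.lastIndexOf("@") || at === address.length - 1) {
        return null;
    }
    return {
        recipient: address.substring(0, at),
        domain: address.substring(at + 1)
    };
}

var parts = splitAddress("darren.neimke@example.com");
// parts.recipient is "darren.neimke" and parts.domain is "example.com"
```

Returning null for anything that does not fit Recipient@Domain keeps the later steps simple: they only ever see two non-empty strings.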
DEFINING A RECIPIENT
A recipient can contain 1 or more "phrases". Multiple "phrases" are separated by the "." character, for example:
darren
darren.neimke
darren.neimke.is
darren.neimke.is.really
darren.neimke.is.really.cool
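Before reaching for a regex, the same rule can be sketched with a plain split. The helper name splitPhrases is hypothetical, and it only checks the "." separators, not which characters each phrase contains.

```javascript
// Hypothetical helper: break a recipient into its "."-separated phrases.
// Returns null if any phrase is empty (a leading, trailing or doubled ".").
function splitPhrases(recipient) {
    var phrases = recipient.split(".");
    for (var i = 0; i < phrases.length; i++) {
        if (phrases[i].length === 0) {
            return null;
        }
    }
    return phrases;
}

var phrases = splitPhrases("darren.neimke.is");
// phrases is ["darren", "neimke", "is"]
```

An empty phrase is exactly what the regex has to rule out too: every "." must be followed by at least one valid character.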
Before we start delving too deeply into how we'd construct a regex to match these patterns, let's create a negated character class that excludes the invalid characters:
var RE_VALID_PHRASE = "[^()<>@,;:\\\\\".\\[\\]]+"; // keep matching until we hit one of the invalid characters (backslashes are doubled because this is a string literal)
var RE_ADDITIONAL_PHRASE = "(\\." + RE_VALID_PHRASE + ")*"; // each additional phrase must start with a "." and there can be zero or more
The recipient ends when we hit the "@" character.
var RE_RECIPIENT = "^" + RE_VALID_PHRASE + RE_ADDITIONAL_PHRASE + "@";
Or, in hard-to-read longhand:
var RE_RECIPIENT = "^[^()<>@,;:\\\\\".\\[\\]]+(\\.[^()<>@,;:\\\\\".\\[\\]]+)*@";
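To see the recipient pattern in action, here is a small, self-contained sketch that compiles it with new RegExp and tries a few addresses. The function name hasValidRecipient and the sample addresses at example.com are my own inventions for illustration. Note that the backslashes are doubled so that they survive the string literal and reach the regex engine.

```javascript
// Pattern strings for the recipient, escaped for a JavaScript string literal.
var RE_VALID_PHRASE = "[^()<>@,;:\\\\\".\\[\\]]+";          // one or more characters outside the invalid set
var RE_ADDITIONAL_PHRASE = "(\\." + RE_VALID_PHRASE + ")*"; // zero or more ".phrase" groups
var RE_RECIPIENT = "^" + RE_VALID_PHRASE + RE_ADDITIONAL_PHRASE + "@";

var recipientPattern = new RegExp(RE_RECIPIENT);

// Hypothetical helper: true if the text up to and including the "@" looks
// like a valid recipient. The domain after the "@" is not checked at all.
function hasValidRecipient(address) {
    return recipientPattern.test(address);
}

var samples = [
    "darren@example.com",          // single phrase
    "darren.neimke@example.com",   // two phrases
    ".darren@example.com",         // leading "."
    "darren..neimke@example.com",  // empty phrase between two dots
    "darren.@example.com",         // trailing "."
    "dar(ren@example.com"          // invalid character inside a phrase
];
for (var i = 0; i < samples.length; i++) {
    console.log(samples[i] + " -> " + hasValidRecipient(samples[i]));
}
```

The first two samples match and the remaining four do not. One caveat: the character class does not exclude spaces or control characters, which RFC 822 also disallows in an unquoted name, so a recipient such as "dar ren" would still pass this check.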
DEFINING A DOMAIN
The hard bit :-)... Coming soon.... :-)
MY HUMBLE OPINION
After reading the specification document, I think that, unless you have extremely well-defined requirements, you are probably better off using a simple validation strategy rather than a complicated one, for fear of leaving someone out!