Email validation - Part 1 : DEFINING A RECIPIENT

Tuesday, March 4, 2003

Regex

After stumbling across the RFC 2822 e-mail spec, I thought that I'd take a shot at defining an e-mail address and, along the way, attempt to create a regex to validate it. I'll work through it over the next week or so, so bear with me:

*******************************************************

NOTE:

To understand, technically, how difficult it can be to write an all-encompassing e-mail validation script, refer to the actual RFC 2822 specification (an explanation of the older RFC 822 spec can be found here: http://www.cs.tut.fi/~jkorpela/rfc/822addr.html). That article also contains a simple e-mail validation RegEx.

On the high end, Jeffrey Friedl, in his book Mastering Regular Expressions (1997, O'Reilly & Associates), presents an 11-page explanation of his 4,724-byte email-validation script.

You can also find other regex variations at the regex library:

http://regexlib.com/

*******************************************************

I think that at the simplest level you can safely make the following assumptions:

These characters (RFC 2822 calls them "specials") are not allowed anywhere within the actual address members of an e-mail address:

( ) < > @ , ; : \ " . [ ]

Obviously, some of them play other roles in the full e-mail address, such as the @ character and the . character, which act as separators.

Therefore, in a simplified manner we can say that an address is made up of (primarily) 2 parts: the part before the @ and the part after it.

At a high level, the part before the @ character defines the recipient and the part after it defines the domain.

Therefore, we have:

Recipient@Domain
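
As a rough sketch of that split (the helper name splitAddress and the example.com address are my own inventions for illustration, and splitting at the last "@" is an assumption rather than anything the spec dictates):

```javascript
// Hypothetical helper: split an address into its recipient and domain parts.
// Assumption: we split at the last "@" and reject an empty recipient or domain.
function splitAddress(address) {
    var at = address.lastIndexOf("@");
    if (at <= 0 || at === address.length - 1) {
        return null; // no "@", nothing before it, or nothing after it
    }
    return {
        recipient: address.substring(0, at),
        domain: address.substring(at + 1)
    };
}

var parts = splitAddress("darren.neimke@example.com");
// parts.recipient is "darren.neimke" and parts.domain is "example.com"
```

This only separates the two halves; it says nothing yet about whether either half is valid.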

DEFINING A RECIPIENT

A recipient can contain 1 or more "phrases". Multiple "phrases" are separated by the "." character, for example:

darren
darren.neimke
darren.neimke.is
darren.neimke.is.really
darren.neimke.is.really.cool
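
Before reaching for a regex, the phrase rule can be sketched with plain string handling. This is only an illustration under the simplified assumptions in this post; the names FORBIDDEN and isValidRecipient are hypothetical:

```javascript
// The characters this post treats as not allowed inside a phrase:
// ( ) < > @ , ; : \ " . [ ]
var FORBIDDEN = "()<>@,;:\\\".[]";

// Hypothetical helper: check a recipient phrase by phrase, without a regex.
function isValidRecipient(recipient) {
    var phrases = recipient.split(".");
    for (var i = 0; i < phrases.length; i++) {
        var phrase = phrases[i];
        if (phrase.length === 0) {
            return false; // empty phrase: leading, trailing, or doubled "."
        }
        for (var j = 0; j < phrase.length; j++) {
            if (FORBIDDEN.indexOf(phrase.charAt(j)) !== -1) {
                return false; // phrase contains a forbidden character
            }
        }
    }
    return true;
}

isValidRecipient("darren.neimke.is.really.cool"); // true
isValidRecipient("darren..neimke");               // false
```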

Before we start delving too deeply into how we'd construct a RegEx to match these patterns, let's create a negated character class that excludes the invalid characters:

// Backslashes are doubled because the pattern is built from a string literal.
var RE_VALID_PHRASE = "[^\\(\\)<>@,;:\\\\\"\\.\\[\\]]+"; // keep matching until we hit one of these.
var RE_ADDITIONAL_PHRASE = "(\\." + RE_VALID_PHRASE + ")*"; // each additional phrase must start with a "." and there can be zero or more

Then we hit the "@" character.

var RE_RECIPIENT = "^" + RE_VALID_PHRASE + RE_ADDITIONAL_PHRASE + "@";

Or, in hard to read longhand:

var RE_RECIPIENT = "^[^\\(\\)<>@,;:\\\\\"\\.\\[\\]]+(\\.[^\\(\\)<>@,;:\\\\\"\\.\\[\\]]+)*@";
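
To try the recipient pattern out, here is a small sketch that writes it as a regex literal instead of a string (my own choice, since a literal sidesteps the string escaping). The helper name hasValidRecipient and the example.com addresses are hypothetical:

```javascript
// One phrase, then zero or more ".phrase" groups, then the "@".
// Inside the character class, \\ is a literal backslash.
var recipientRegex = /^[^()<>@,;:\\".\[\]]+(\.[^()<>@,;:\\".\[\]]+)*@/;

// Hypothetical helper: does the address begin with a valid recipient and "@"?
function hasValidRecipient(address) {
    return recipientRegex.test(address);
}

hasValidRecipient("darren.neimke@example.com");  // true
hasValidRecipient("darren..neimke@example.com"); // false: empty phrase
hasValidRecipient(".darren@example.com");        // false: leading "."
hasValidRecipient("darren.@example.com");        // false: trailing "."
```

Note that this only checks everything up to and including the "@"; whatever follows it is not examined at all.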

DEFINING A DOMAIN

The hard bit :-)... Coming soon.... :-)

MY HUMBLE OPINION

After reading the specification document, I think that, unless you have extremely well-defined requirements, you are probably better off using a simple validation strategy rather than a complicated one, for fear of leaving someone out!

Was there ever a part two?

Michael Ash - Thursday, July 15, 2004 3:48:00 PM

Ummm.... no, I think that I decided that it probably wasn't worth it. I'm pretty sure that I started looking at it but decided the same as you... that this is clearly a case for the 80/20 rule :-)

Darren Neimke - Friday, July 16, 2004 10:46:00 AM
