The Sexy Filter Extension and Fear of Regex

Regular Expressions

Regular expressions are a powerful way to parse and evaluate strings. They can even be fun, but when one fails to match there is rarely any indication of why. There are editors that let you test a regex against a data set, but Notepad isn’t going to tell you anything about the problem, and the language itself won’t tell you much either.

Without external tools, the best approach is to keep testing, but that gets tedious. I remember writing a regex for a log file and spending extra time testing to catch the few cases that wouldn’t match the normal set of rules. Most people probably would have been happy with multiple ‘(.*)’ groups, but it came out great. It took one week on a school project, though to be honest, I started on it early because I was fascinated with the whole Perl and regex concept.
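The keep-testing routine gets less tedious with a small table of cases. Here is a minimal sketch of that idea. The log format, the pattern, and the check_cases() helper are all made up for illustration; this is not the regex from my school project.

```php
<?php
// Hypothetical log format: "2006-10-01 12:30:45 [ERROR] message text"
$pattern = '/^(\d{4}-\d{2}-\d{2}) (\d{2}:\d{2}:\d{2}) \[([A-Z]+)\] (.+)$/';

// Each case maps an input line to whether the pattern is expected to match.
$cases = array(
    '2006-10-01 12:30:45 [ERROR] Disk full'       => true,
    '2006-10-01 12:30:45 [warn] lower case level' => false,
    '2006-10-01 [ERROR] missing time'             => false,
);

// Returns the inputs whose match result differs from the expectation.
function check_cases($pattern, array $cases)
{
    $failures = array();
    foreach ($cases as $input => $expected) {
        $result = preg_match($pattern, $input);
        if ($result === false) {
            // preg_match() returns false on an engine error, not on a mismatch.
            $failures[] = $input . ' (PCRE error code ' . preg_last_error() . ')';
        } elseif (($result === 1) !== $expected) {
            $failures[] = $input;
        }
    }
    return $failures;
}

$failures = check_cases($pattern, $cases);
echo count($failures) === 0 ? "all cases pass\n" : implode("\n", $failures) . "\n";
```

Every time a new odd line turns up in the data, it goes into the table, and the whole set gets rerun after each change to the pattern.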

It isn’t so much fear as it is remembering how long it took to get the regular expression working for that project, and not wanting to spend that much time again.

Telling someone that they have to write a regular expression in order to test for a correct email address usually means writing it for them an hour or two later.

Examples of Email Regex

My Email Regex

preg_match("/^[_a-zA-Z-]+[\._a-zA-Z0-9-]+@[a-zA-Z0-9-]+[\.a-zA-Z0-9]+$/", $email);

It isn’t flawless. If the address begins with a period, this test will fail. It does cope well with domains such as ‘co.uk’ and even subdomains. It also doesn’t fully conform to the RFCs mentioned below, meaning it doesn’t allow ‘+’ or any other special characters.
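To make those limits concrete, here is a small sketch that runs the pattern against three sample addresses. The passes_my_regex() helper and the sample addresses are only for illustration.

```php
<?php
// Same pattern as above, single-quoted so PHP leaves the backslashes alone.
$myPattern = '/^[_a-zA-Z-]+[\._a-zA-Z0-9-]+@[a-zA-Z0-9-]+[\.a-zA-Z0-9]+$/';

// Hypothetical helper: true when the address passes the pattern.
function passes_my_regex($pattern, $email)
{
    return preg_match($pattern, $email) === 1;
}

$samples = array(
    'john.doe@example.co.uk', // multi-part domain: accepted
    '.john@example.com',      // leading period: rejected
    'john+tag@example.com',   // plus sign: rejected
);

foreach ($samples as $email) {
    echo $email, ' => ', passes_my_regex($myPattern, $email) ? 'match' : 'no match', "\n";
}
```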

Another Email Regex

// Thanks to "mail(at)philipp-louis.de" from php.net!
preg_match("/^[-_.[:alnum:]]+@((([[:alnum:]]|[[:alnum:]][[:alnum:]-]*[[:alnum:]])\.)+
(ad|ae|". (Truncated by Author) ."ws|ye|yt|yu|za|zm|zw)$|
(([0-9][0-9]?|[0-1][0-9][0-9]|[2][0-4][0-9]|[2][5][0-5])\.){3}([0-9][0-9]?|
[0-1][0-9][0-9]|[2][0-4][0-9]|[2][5][0-5]))$/i",$email);

Used in the Dragon Knight code. This is also not perfect. Its long list of top-level domains slows the regular expression down and would have to be updated whenever new domains are added. New domains aren’t added often, but even so, maintaining the list is pointless.

‘Best’ Email Validation

There is no way to tell if the user is full of crap until you get a bounced email back or can somehow communicate with the mail server and ask it whether the email address exists. However, most SMTP servers won’t answer that because of spammers.

preg_match("/^(.+)@(.+)\.(.+)$/iU", $email);

This will match anything that has one or more characters at the start, an ‘@’, one or more characters after the at sign, a period, and one or more characters after the period. It is the most basic and quickest way to test for an email address. Is it the best way? Nope. RFC 822, RFC 2821, and RFC 2822 detail what a valid address looks like.
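A quick sketch of how loose this pattern is, and of what its three groups capture. The split_basic() helper and the sample strings are made up for illustration.

```php
<?php
$basicPattern = '/^(.+)@(.+)\.(.+)$/iU';

// Hypothetical helper: returns the three captured parts, or null on no match.
function split_basic($pattern, $email)
{
    if (preg_match($pattern, $email, $m) === 1) {
        return array($m[1], $m[2], $m[3]);
    }
    return null;
}

// Captures "user", "mail", and "example.com".
var_dump(split_basic($basicPattern, 'user@mail.example.com'));

// NULL: there is no period after the at sign.
var_dump(split_basic($basicPattern, 'user@localhost'));

// bool(true): spaces are accepted, because "." matches almost anything.
var_dump(split_basic($basicPattern, 'a b@c d.e f') !== null);
```

The U modifier makes the groups ungreedy, which is why the second group stops at the first period after the at sign instead of the last one.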

A better way is to build a system that handles the bounced message and discards the email address along with the account. Alternatively, you could keep the account and give the user another chance to submit their email address.

The Filter Extension

I knew it would be satisfying once I saw this in passing on the php.net site. It uses the procedural style, which I think fits perfectly. If you throw objects at novices, then most aren’t likely to use what you give them. It is really simple to use, so there is no longer any excuse for a developer not to be using it, if using PHP 5.2+.

The Zend Framework also offers Filtering and Input Filtering, but as of 0.1.3, the email validation was incomplete. The Filter Extension won’t replace the Zend Framework Filtering and Input Filtering, but they should complement each other nicely.

Email Validation Revisited

// Called filter_data() in pre-release builds of the extension; PHP 5.2 ships it as filter_var().
filter_var($email, FILTER_VALIDATE_EMAIL);

Compare this to the preg_match examples: which one would you choose? I haven’t yet tested whether FILTER_VALIDATE_EMAIL handles all available domains and the RFCs, so I’m placing a lot of trust in it. From what I’ve read, it will be experimental when it is bundled with PHP 5.2. Fixes should come fairly quickly once it hits the mainstream.
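For a sense of how it reads in practice, here is a minimal sketch using filter_var(), the name the function carries in released versions of PHP 5.2 and later. The is_valid_email() wrapper and the sample addresses are my own illustration, not part of the extension.

```php
<?php
// filter_var() returns the filtered value on success and false on failure,
// so compare against false explicitly.
function is_valid_email($email)
{
    return filter_var($email, FILTER_VALIDATE_EMAIL) !== false;
}

$samples = array(
    'john.doe@example.co.uk',
    'john+tag@example.com',
    'not an email',
    'john@',
);

foreach ($samples as $email) {
    echo $email, ' => ', is_valid_email($email) ? 'valid' : 'invalid', "\n";
}
```

This only checks the format. It says nothing about whether the mailbox exists, so the bounce handling idea above still applies.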

That isn’t the only thing it can do, and getting a grasp of its other features won’t be all that difficult, even without a proper tutorial.
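A short sketch of a few of the other filters, again using the released filter_var() name. The variable names and sample values are made up; the filter constants and the min_range and max_range options come from the extension itself.

```php
<?php
// Integer validation with a range: returns the integer, or false when out of range.
$range = array('options' => array('min_range' => 0, 'max_range' => 130));
$age    = filter_var('42', FILTER_VALIDATE_INT, $range);
$tooOld = filter_var('200', FILTER_VALIDATE_INT, $range);

// URL validation: returns the URL string, or false.
$url    = filter_var('http://www.php.net/filter', FILTER_VALIDATE_URL);
$notUrl = filter_var('not a url', FILTER_VALIDATE_URL);

// Sanitizing instead of validating: strips everything except digits, plus, and minus.
$digits = filter_var('abc123def', FILTER_SANITIZE_NUMBER_INT);

var_dump($age);    // int(42)
var_dump($tooOld); // bool(false)
var_dump($url);
var_dump($notUrl); // bool(false)
var_dump($digits); // string(3) "123"
```

The validate filters hand back the value or false, while the sanitize filters always hand back a cleaned-up string, so the two families need to be checked differently.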

4 Comments.

  1. Hello,

    JFYI, Vim has partial support for PCRE highlighting. Although it isn’t finished yet and has a few bugs, the colors do work reliably for single-quoted patterns on a single line.

    If you are interested, you can find Vim at http://www.vim.org and the in-development PHP syntax is available here.

    - Peter

  2. You know that an @ might appear before the actual @ that separates the user and host parts of the address? If I read the RFCs correctly, even an email address like “foo\@bar foo”@host.com should be possible. Yet I doubt that this is supported by any mail server at all.

  3. Yeah, I agree. Following the RFC is really good practice if you are building an application for other people but full compliance isn’t always feasible.

    In the Zend Framework, they have the email portion incomplete with a comment to use the RFC. I said to myself, “Good luck with that!” The Zend Framework has to conform to the RFC so that they match every possible valid test case, unlike mine and the other, which do not conform to the RFC.

  4. Kelvin Grove Says:

    This is really cool, I am a regex flunky. I read more on this extension in a tutorial at http://phpro.org/tutorials/Filtering-Data-with-PHP.html