Verifying a Valid Username
Verifying Valid Dates
Validating IP Addresses
Matching Phone Numbers, Postcodes, Social Security numbers etc
Validating Email Addresses
Verifying Credit Card Numbers
Replacing Static Links with Hyperlinks in HTML
Removing Swearing and Profanity from blogs and forums
Stripping out Bad Attributes and Javascript from HTML code
BBCode - Custom Formatting Definitions
Conclusion
Regular expressions are one of the most powerful features of PHP. They’re also one of the most confusing and bewildering if you’re not familiar with their syntax. They can be used for a wide range of purposes, from validating user names to grabbing data from a website. In this article, I’ll take you through 10 regular expression functions which I have found extremely useful in the past and which can play an important role in any modern website or application.
Sometimes we need to impose rules on a website about what sort of names a user can choose. For our own needs, and for readability, it’s often useful to state rules which force users to choose names containing only letters, numbers or underscores.
The following regular expression function checks a given username string against these rules:
- A username must start with a lower-case or upper-case letter
- A username can contain only letters, numbers or underscores
- A username must be between 8 and 24 characters
- A username cannot end in an underscore
function valid_name($username){
    return preg_match("#^[a-z][\da-z_]{6,22}[a-z\d]\$#i", $username);
}
$usernames = array(
    "RoughGuide98",
    "_invalidUsername",
    "%423f@''#",
    "25UserName",
    "I_am_valid_user",
    "I am not a valid user",
    "short",
    "ThisUserNameIsTooLongAndWontBeAccepted",
    "ThisIsTheRightSize"
);

foreach ($usernames as $name){
    if(valid_name($name)) echo "$name - Ok<br />";
    else echo "$name - Unsuitable!<br />";
}

/* This produces...
RoughGuide98 - Ok
_invalidUsername - Unsuitable!
%423f@''# - Unsuitable!
25UserName - Unsuitable!
I_am_valid_user - Ok
I am not a valid user - Unsuitable!
short - Unsuitable!
ThisUserNameIsTooLongAndWontBeAccepted - Unsuitable!
ThisIsTheRightSize - Ok
*/
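If you want to tell the user which rule they broke, the four rules above can also be checked one at a time. The following is only a sketch: username_problems() is a hypothetical helper of my own naming, not part of the function above, and it assumes plain ASCII usernames.

```php
<?php
// Hypothetical helper (not from the original function): returns a list of
// the rules a username breaks, or an empty array if it passes all four.
function username_problems($username) {
    $problems = array();
    if (!preg_match('#^[a-z]#i', $username)) {
        $problems[] = 'must start with a letter';
    }
    if (preg_match('#[^a-z\d_]#i', $username)) {
        $problems[] = 'may contain only letters, numbers or underscores';
    }
    $length = strlen($username);
    if ($length < 8 || $length > 24) {
        $problems[] = 'must be between 8 and 24 characters';
    }
    if (substr($username, -1) === '_') {
        $problems[] = 'cannot end in an underscore';
    }
    return $problems;
}

// Lists two problems: the leading underscore and the length
print_r(username_problems('_short'));
```

Checking the rules separately is a little slower than a single pattern, but the error messages it gives are far more helpful on a sign-up form.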
This regular expression function will check if a date provided is valid in the DD-MM-YYYY format and can successfully check for an incorrect date value for specific months. It will also deal successfully with February 29th on leap years. It can accept the characters “/”, “-” and “.” as date separators.
function check_date($date){
    $separator = "[\/\-\.]";
    //Days 1-28 in any month, 29-30 in every month except February, 31 in the long months,
    //followed by a year from 1900 to 9999; or February 29th in a leap year
    return preg_match("#^(((0?[1-9]|1\d|2[0-8]){$separator}(0?[1-9]|1[012])|(29|30){$separator}(0?[13456789]|1[012])|31{$separator}(0?[13578]|1[02])){$separator}(19|[2-9]\d)\d{2}|29{$separator}0?2{$separator}((19|[2-9]\d)(0[48]|[2468][048]|[13579][26])|(([2468][048]|[3579][26])00)))$#", $date) === 1;
}
Here’s the function in action
check_date("30.9.2005"); //valid
check_date("32.9.2005"); //invalid
check_date("29.1.2005"); //valid
check_date("29.2.2005"); //invalid
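A pattern this long is hard to maintain, so it can be worth cross-checking it against PHP’s built-in checkdate() function. The sketch below uses a much simpler pattern just to split the date apart and then lets checkdate() do the calendar work. Note that check_date_builtin() is a hypothetical name of mine, and that unlike the pattern above it also accepts years before 1900.

```php
<?php
// Hypothetical helper: split a DD-MM-YYYY style date with a simple pattern,
// then let PHP's built-in checkdate() decide whether that day exists.
function check_date_builtin($date) {
    if (!preg_match('#^(\d{1,2})[/.\-](\d{1,2})[/.\-](\d{4})$#', $date, $m)) {
        return false;
    }
    // checkdate() takes its arguments in month, day, year order.
    return checkdate((int) $m[2], (int) $m[1], (int) $m[3]);
}

var_dump(check_date_builtin('29.2.2004')); // leap year
var_dump(check_date_builtin('29.2.2005')); // not a leap year
```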
This regular expression function will check if a given IP address is within the valid IPv4 range, specifically between 0.0.0.0 and 255.255.255.255.
function valid_ip($ip_address){
    $val_0_to_255 = "(25[0-5]|2[0-4]\d|[01]?\d\d?)";
    $pattern = "#^($val_0_to_255\.$val_0_to_255\.$val_0_to_255\.$val_0_to_255)$#";
    return preg_match($pattern, $ip_address);
}
This regular expression is relatively simple. It uses a sub-pattern to find a number between 0 and 255 and repeats it 4 times, separated by dots. Here’s the function in use
valid_ip("0.0.0.255"); //valid
valid_ip("0.0.0.256"); //invalid
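For comparison, PHP’s filter extension can do a similar check without any pattern at all. This is a minimal sketch, assuming the filter extension is enabled (it is in a default build); valid_ipv4_builtin() is a hypothetical wrapper name.

```php
<?php
// Hypothetical wrapper around PHP's built-in filter extension.
function valid_ipv4_builtin($ip_address) {
    // filter_var() returns the filtered value on success and false on failure.
    return filter_var($ip_address, FILTER_VALIDATE_IP, FILTER_FLAG_IPV4) !== false;
}

var_dump(valid_ipv4_builtin("0.0.0.255"));
var_dump(valid_ipv4_builtin("0.0.0.256"));
```

The regular expression version is still handy when you need to pull addresses out of a larger block of text, which filter_var() can’t do.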
All of these are fairly similar, so are lumped into the same category here. I’ve left out “Credit Card Numbers” since they can require a little extra consideration, and I’ve given them their own section later on. But for now, let’s look at how we can use pattern matching to check some of the most common number sequences we use in our daily lives.
Let’s look first at matching a valid US telephone number
US phone numbers are of the format 000 000 0000, a group of three numbers, followed by another three then a final four. We can write this in regular expression format as follows
$reg = "#^\d{3}\s\d{3}\s\d{4}$#";
This will match any phone number in the above format, like 351 234 5555. But we run into problems if we’re using a different notation, like (123)-555-5555. No worries though, the following alterations will allow us to use all sorts of variations like this
$reg = "#^\(?\d{3}\)?[\s\.-]?\d{3}[\s\.-]?\d{4}$#";
Let’s try this now
function matchUSPhone($number) {
    $reg = "#^\(?\d{3}\)?[\s\.-]?\d{3}[\s\.-]?\d{4}$#";
    return preg_match($reg, $number);
}

if(matchUSPhone("(123)-555-5555")) echo "This matches"; //indeed, this does match
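Once a number matches, you’ll usually want to store it in one consistent format. Here’s a rough sketch that adds capturing groups to the same pattern and rebuilds the number with dashes. The function name normalize_us_phone() and the NNN-NNN-NNNN output format are my own assumptions.

```php
<?php
// Hypothetical helper: accepts the same notations as the pattern above and
// returns the number as "NNN-NNN-NNNN", or false if it does not match.
function normalize_us_phone($number) {
    $reg = '#^\(?(\d{3})\)?[\s.-]?(\d{3})[\s.-]?(\d{4})$#';
    if (!preg_match($reg, $number, $m)) {
        return false;
    }
    // $m[1], $m[2] and $m[3] hold the three captured groups of digits.
    return $m[1] . '-' . $m[2] . '-' . $m[3];
}

echo normalize_us_phone('(123)-555-5555'); // 123-555-5555
```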
Here are some other useful functions you can use
//Match a UK format phone number
function matchUKPhoneNumber($number){
    //UK phone numbers are 10 or 11 digits long, and have a 3, 4, 5 or 6 digit area code
    $s = "(([ \-\.])*)?";
    $result = preg_match("#^((\(?\d{3}\)?$s\d{7,8})|(\(?\d{4}\)?$s\d{6,7})|(\(?\d{5}\)?$s\d{5,6})|(\(?\d{6}\)?$s\d{4,5}))$#", $number);
    return $result;
}

//Match a US social security number
function matchUSSocialSecurity($ss_num){
    //Anchored with ^ and $ so that the whole string must be the number
    return preg_match("#^\d{3}-\d{2}-\d{4}$#", $ss_num);
}

//Match a UK National Insurance number
function matchUKNationalInsurance($ni){
    $letters_1_and_2 = "[ABCEGHJKLMNOPRSTWXYZ][ABCEGHJKLMNPRSTWXYZ]";
    $the_rest = "\d{6}[A-D]?";
    $reg = "#^".$letters_1_and_2.$the_rest."$#";
    return preg_match($reg, $ni);
}
Checking that an email address is valid is a fairly tricky thing. You have to make sure the domain name is correct, and that the first part of the email contains valid characters. Using regular expressions though, something like this becomes fairly trivial. Here’s a simple example to match a very basic email address
$reg = "#(.*)@(.*)\.com#";
This is pretty useless though, since it will only match email addresses ending in “.com”. A more concrete example is as follows
function checkEmail($email){
    $reg = "#^(((([a-z\d][\.\-\+_]?)*)[a-z0-9])+)\@(((([a-z\d][\.\-_]?){0,62})[a-z\d])+)\.([a-z\d]{2,6})$#i";
    return preg_match($reg, $email);
}
And a couple of examples
checkEmail("robin@roughguidetophp.com"); //valid
checkEmail("not-an-email@address"); //invalid
Note that this function won’t determine whether a given email address actually exists. To achieve that, you’d need to implement more checks. One possibility is to use the PHP function checkdnsrr() to check that the domain name is valid. Before PHP 5.3, checkdnsrr() was available only on Unix builds of PHP, so on older Windows installations you’d need to write your own custom function.
You would then still need to check that the mailbox itself exists. One way is to send a confirmation email to the user: if they receive it and click on the link it contains, you’ll know it’s a valid address.
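As a rough sketch of how the syntax check and the domain check could be combined, the following uses PHP’s built-in filter_var() in place of the pattern above, followed by an optional checkdnsrr() lookup. The name check_email_builtin(), the $check_dns flag and the choice to accept either an MX or an A record are all assumptions of mine.

```php
<?php
// Hypothetical helper: syntax check with the built-in filter extension,
// optionally followed by a DNS lookup for the domain part.
function check_email_builtin($email, $check_dns = false) {
    if (filter_var($email, FILTER_VALIDATE_EMAIL) === false) {
        return false;
    }
    if ($check_dns) {
        // Everything after the last "@" is treated as the domain.
        $domain = substr(strrchr($email, '@'), 1);
        // Assumption: a domain with an MX or A record can receive mail.
        return checkdnsrr($domain, 'MX') || checkdnsrr($domain, 'A');
    }
    return true;
}

var_dump(check_email_builtin("robin@roughguidetophp.com"));
var_dump(check_email_builtin("not-an-email@address"));
```

The DNS lookup needs network access and can be slow, so it is left switched off by default here.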
When developing e-commerce applications, it becomes necessary to process credit card details, and make sure they’re accurate. A simple check you can perform is to see if the credit card number a customer provides fits the card number definitions laid out by the card issuers. For example, a Visa card number starts with a 4, and can contain 16 or 13 digits.
The following regular expression function will check to see if a card number conforms to the appropriate rules. This function checks for some of the most common card types. An optional second parameter enables a Luhn algorithm check [1], a checksum that card issuers build into valid card numbers, which distinguishes them from random strings of digits. Read more about the Luhn algorithm at Wikipedia, if you’re interested.
check_cc() will return false if the card number isn’t valid and if it is valid, will return a string containing the type of card matched.
function validatecard($cardnumber) {
    $cardnumber = preg_replace("/\D/", "", $cardnumber); # strip any non-digits
    $cardlength = strlen($cardnumber);
    $parity = $cardlength % 2;
    $sum = 0;
    for ($i = 0; $i < $cardlength; $i++) {
        $digit = $cardnumber[$i];
        if ($i % 2 == $parity) $digit = $digit * 2; # double every second digit, counting from the right
        if ($digit > 9) $digit = $digit - 9;
        $sum = $sum + $digit;
    }
    $valid = ($sum % 10 == 0);
    return $valid;
}

function check_cc($cc, $extra_check = false){
    //Each pattern has exactly one capturing group, wrapped around the whole number
    $cards = array(
        "visa" => "(4\d{12}(?:\d{3})?)",
        "amex" => "(3[47]\d{13})",
        "jcb" => "(35(?:2[89]|[3-8]\d)\d{12})", //JCB numbers start with 3528 to 3589
        "maestro" => "((?:5020|5038|6304|6579|6761)\d{12}(?:\d\d)?)",
        "solo" => "((?:6334|6767)\d{12}(?:\d\d)?\d?)",
        "mastercard" => "(5[1-5]\d{14})",
        "switch" => "((?:(?:4903|4905|4911|4936|6333|6759)\d{12})|(?:(?:564182|633110)\d{10})(?:\d\d)?\d?)",
    );
    $names = array("Visa", "American Express", "JCB", "Maestro", "Solo", "Mastercard", "Switch");
    $matches = array();
    $pattern = "#^(?:".implode("|", $cards).")$#";
    $result = preg_match($pattern, str_replace(" ", "", $cc), $matches);
    if($extra_check && $result > 0){
        $result = (validatecard($cc))?1:0;
    }
    //The position of the last group that matched tells us which card type it was
    return ($result>0)?$names[sizeof($matches)-2]:false;
}
And here’s a sample of the function in action
$cards = array(
    "4111 1111 1111 1111",
    "4111 1111 1111 1",
    "4111 1111 1111 111",
    "3400 0000 0000 009",
    "3500 0000 0000 009",
    "5500 1545 0000 0004",
    "5940 0000 0000 0004"
);

foreach($cards as $c){
    $check = check_cc($c); //pattern check only, no Luhn check
    if($check!==false) echo $c." - ".$check;
    else echo "$c - Not a match";
    echo "<br/>";
}

/* This gives us
4111 1111 1111 1111 - Visa
4111 1111 1111 1 - Visa
4111 1111 1111 111 - Not a match
3400 0000 0000 009 - American Express
3500 0000 0000 009 - Not a match
5500 1545 0000 0004 - Mastercard
5940 0000 0000 0004 - Not a match

Using the Luhn algorithm, we can also check further to see if any of these numbers actually represents a valid card

check_cc("5500 1545 0000 0004", true); //matches
check_cc("5500 1585 0000 0004", true); //doesn't match
*/
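To see how the Luhn arithmetic works from the other direction, here is a small sketch that works out the final check digit for a card number whose last digit is missing. The function luhn_check_digit() is a hypothetical helper of mine; it is not used by check_cc().

```php
<?php
// Hypothetical helper: given a card number WITHOUT its final digit,
// return the check digit that makes the full number pass the Luhn test.
function luhn_check_digit($partial) {
    $digits = preg_replace('/\D/', '', $partial); // strip spaces and other non-digits
    $sum = 0;
    $double = true; // the digit next to the missing check digit is doubled
    for ($i = strlen($digits) - 1; $i >= 0; $i--) {
        $d = (int) $digits[$i];
        if ($double) {
            $d *= 2;
            if ($d > 9) {
                $d -= 9; // same as adding the two digits of the product
            }
        }
        $sum += $d;
        $double = !$double;
    }
    // The check digit brings the total up to the next multiple of 10.
    return (10 - $sum % 10) % 10;
}

echo luhn_check_digit('5500 1545 0000 000'); // 4
```

For the Mastercard test number above, the doubled digits add up to 11 and the others to 15, giving 26, so a check digit of 4 is needed to reach 30.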
A useful feature on most forums is the ability to type in a website address, like http://www.google.com, for example, and have that address automatically linked, via an HTML hyperlink. The following regular expression function will automatically replace any website addresses with the appropriate HTML code to produce an active link
function replace_urls($string){
    $host = "([a-z\d][-a-z\d]*[a-z\d]\.)+[a-z][-a-z\d]*[a-z]";
    $port = "(:\d{1,})?";
    $path = "(\/[^?<>\#\"\s]+)?";
    $query = "(\?[^<>\#\"\s]+)?";
    //$1 is the whole matched URL
    return preg_replace("#((ht|f)tps?:\/\/{$host}{$port}{$path}{$query})#i", "<a href=\"$1\">$1</a>", $string);
}
Here, the four different components of the average URL - host (e.g. www.google.com), port (e.g. www.website.com:8080), path (e.g. www.website.com/mysite/file.php) and the query string (e.g. www.website.com/index.php?var=1&var2=100) - are combined to generate the final regular expression pattern.
An example of this in action
$string = "Visit http://www.google.com for all your searching requirements! And here are some pictures of cheese - http://images.google.co.uk/images?q=cheese";
echo replace_urls($string);

/* This produces...
Visit <a href="http://www.google.com">http://www.google.com</a> for all your searching requirements! And here are some pictures of cheese - <a href="http://images.google.co.uk/images?q=cheese">http://images.google.co.uk/images?q=cheese</a>
*/
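If the text comes straight from a user, you’ll probably want to escape it as well as link it, and keep a trailing full stop or exclamation mark out of the link. The following is a sketch of one way to do both. It swaps in a deliberately simpler URL pattern than the one above, and linkify_escaped() is a hypothetical name.

```php
<?php
// Hypothetical variant: splits the text around URLs, escapes every piece,
// and wraps only the URL pieces in links.
function linkify_escaped($string) {
    // Simplified URL pattern (an assumption, looser than the host/port/path/query one).
    $parts = preg_split('#((?:https?|ftp)://[^\s<>"\']+)#i', $string, -1, PREG_SPLIT_DELIM_CAPTURE);
    $out = '';
    foreach ($parts as $i => $part) {
        if ($i % 2 === 1) {
            // Odd indexes hold the captured URLs.
            $url = rtrim($part, '.,!?;:');           // keep trailing punctuation out of the link
            $tail = substr($part, strlen($url));     // the punctuation that was trimmed, if any
            $safe = htmlspecialchars($url, ENT_QUOTES, 'UTF-8');
            $out .= '<a href="' . $safe . '">' . $safe . '</a>' . $tail;
        } else {
            // Even indexes hold ordinary text, which is escaped so markup is displayed, not run.
            $out .= htmlspecialchars($part, ENT_QUOTES, 'UTF-8');
        }
    }
    return $out;
}

echo linkify_escaped('Check out http://www.google.com!');
// Check out <a href="http://www.google.com">http://www.google.com</a>!
```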
Sometimes it’s useful to protect the youngsters and more fragile-minded members of our communities from rude words by filtering unsuitable words out of any comments posted by site members.
The following regular expression function makes use of the preg_replace_callback() function in PHP to perform a function call “stars” on every matched instance for the words in the $swears array, replacing each letter except the first with star symbols.
This function can be used after retrieving data from a database, to filter out any unsuitable content. You can also use this function to filter out any other keywords, perhaps to prevent users from advertising your competitor’s products on your forum, for example.
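//Wrap each word in regular expression delimiters, with the case-insensitive flag
function prep_regexp_array(&$item){
    $item = "#$item#i";
}

//Keep the first letter of the matched word and replace the rest with stars
function stars($matches){
    return substr($matches[0], 0, 1).str_repeat("*", strlen($matches[0])-1);
}

//The helpers are declared outside deswear() so it can safely be called more than once
function deswear($string){
    $swears = array("darn", "heck", "blast", "shoot");
    array_walk($swears, "prep_regexp_array");
    return preg_replace_callback($swears, "stars", $string);
}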
And an example of this in action
$string = "Aw, darn, HECK, Blast and ShOoT!";
echo deswear($string);
//Aw, d***, H***, B**** and S****!
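One thing to watch for is that these patterns match anywhere in the text, so “heck” would also be starred out inside “check”. Here’s a sketch of a variant that only touches whole words, using the \b word boundary and a single combined pattern. The name deswear_whole_words() and its optional $words parameter are my own additions, and the anonymous function needs PHP 5.3 or later.

```php
<?php
// Hypothetical variant of the filter that only stars out whole words.
function deswear_whole_words($string, $words = array('darn', 'heck', 'blast', 'shoot')) {
    // Escape each word, then build one alternation: \b(?:darn|heck|blast|shoot)\b
    $quoted = array();
    foreach ($words as $word) {
        $quoted[] = preg_quote($word, '#');
    }
    $pattern = '#\b(?:' . implode('|', $quoted) . ')\b#i';
    return preg_replace_callback($pattern, function ($m) {
        // Keep the first letter and replace the rest with stars.
        return substr($m[0], 0, 1) . str_repeat('*', strlen($m[0]) - 1);
    }, $string);
}

echo deswear_whole_words('Oh heck, please check the blaster.');
// Oh h***, please check the blaster.
```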
When working with a forum or blog where user comments are welcomed, it’s often necessary to maintain strict control over exactly what a user can post on the site. The simplest method is just to strip out all unwanted content using the PHP function strip_tags(). This function also has an optional argument that lets you leave certain tags in place, like <b> or <u>, if you wish to give users the ability to specify custom formatting.
A problem with this, though, is that strip_tags() will not remove whatever is inside the <b> or <u> tags if you leave them in place. For example, the following code will not get stripped out using strip_tags()
This is <b style='font-size:100px' onMouseOver='alert("Hello")'>malicious text!</b>
This will trigger a javascript alert window whenever you move your mouse over the bold text and also - because of the “style” attribute - make the text 100 pixels tall. Not what you want to be seeing on your message board!
The following function will not only strip out any unwanted tags from your HTML code, but will also clean up whatever attributes are inside those tags, preventing people from messing up your page
//Strip the unwanted attributes from a single matched tag
function clean($matched) {
    $attribs = "javascript:|onclick|ondblclick|onmousedown|onmouseup|onmouseover|".
               "onmousemove|onmouseout|onkeypress|onkeydown|onkeyup|".
               "onload|class|id|src|style";
    $quot = "\"|\'|\`";
    $stripAttrib = "' ($attribs)\s*=\s*($quot)(.*?)(\\2)'i";
    $clean = stripslashes($matched[0]);
    $clean = preg_replace($stripAttrib, '', $clean);
    return $clean;
}

function cleanTags($source, $tags = null) {
    //Tags that are allowed to remain, unless the caller supplies its own list
    $allowedTags = ($tags !== null) ? $tags : '<a><b><br><i><li><ol><p><strong><u><ul>';
    $clean = strip_tags($source, $allowedTags);
    //Run every remaining tag through clean() and return the cleaned result
    $clean = preg_replace_callback('#<(.*?)>#', "clean", $clean);
    return $clean;
}
Here’s the function in action
$string = "This is <b style='font-size:100px;' onMouseOver='alert(\"Hello\")'>malicious text!</b>";
echo $string;
echo cleanTags($string);
//gives us...
// This is <b>malicious text!</b>
A more comprehensive approach to the previous problem of disallowing certain tags in HTML user input is to completely strip out all tags, except for your own custom-defined ones.
Instead of declaring your tags using angle brackets (< and >) you can create custom tags, most commonly with square brackets ([ and ]). This is known as BBCode (Bulletin Board Code). It is a simple way of defining custom formatting rules for forums and gives more control to the board owner over what users can and cannot do. Any range of tags can be defined, the most common of which are [b] for bold, [u] for underline and [url] to place a link.
Here is a simple implementation of some basic BBCode rules
function BBcode($string) {
    //get rid of all HTML tags
    $string = strip_tags($string);
    $patterns = array(
        "bold" => "#\[b\](.*?)\[/b\]#is",
        "italics" => "#\[i\](.*?)\[/i\]#is",
        "underline" => "#\[u\](.*?)\[/u\]#is",
        "link_title" => "#\[url=(.*?)](.*?)\[/url\]#i",
        "link_basic" => "#\[url](.*?)\[/url\]#i",
        "color" => "#\[color=(red|green|blue|yellow)\](.*?)\[/color\]#is"
    );
    $replacements = array(
        "bold" => "<b>$1</b>",
        "italics" => "<i>$1</i>",
        "underline" => "<u>$1</u>",
        "link_title" => "<a href=\"$1\">$2</a>",
        "link_basic" => "<a href=\"$1\">$1</a>",
        "color" => "<span style='color:$1;'>$2</span>"
    );
    return preg_replace($patterns, $replacements, $string);
}
This regular expression function searches through a given string for any instances of matching text within square brackets, followed by a matching closing pair of brackets with the same text inside. So, for example, the text “[b]this is bold[/b]” would be matched by the first expression in the array. Note how we’re specifying that the search between each pair of tags mustn’t be greedy, which is denoted by the ? after the * symbol. Otherwise, a string like “[b]this is bold[/b] and [b]this is also bold[/b]” would return a single match for everything between the first [b] and the last [/b], which is not what we require.
Here it is in action
echo BBcode("[b]this is[/b] me being [b]bold[/b], and [url=http://www.google.com]here's a link[/url] and another - [url]http://www.google.com[/url]");
//<b>this is</b> me being <b>bold</b>, and <a href="http://www.google.com">here's a link</a> and another - <a href="http://www.google.com">http://www.google.com</a>

echo BBcode("Here's some [color=red]Red Text![/color]");
//Here's some <span style='color:red;'>Red Text!</span>
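The difference between a greedy and a non-greedy search is easiest to see side by side. This short sketch runs the bold pattern both ways over the same string; the variable names are just for illustration.

```php
<?php
$text = '[b]this is bold[/b] and [b]this is also bold[/b]';

// Greedy: .* runs on to the LAST [/b], so both phrases end up in one <b> element.
$greedy = preg_replace('#\[b\](.*)\[/b\]#is', '<b>$1</b>', $text);
echo $greedy, "\n";
// <b>this is bold[/b] and [b]this is also bold</b>

// Non-greedy: .*? stops at the FIRST [/b], so each phrase gets its own <b> element.
$lazy = preg_replace('#\[b\](.*?)\[/b\]#is', '<b>$1</b>', $text);
echo $lazy, "\n";
// <b>this is bold</b> and <b>this is also bold</b>
```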
So, there you go. Hopefully you’ve gained some useful knowledge from this article. At the very least, you should now have an understanding of the power of regular expressions and be able to see how they can fit in with your website applications to provide extra functionality.
These expressions and functions can also be translated into other programming languages fairly easily, like JavaScript or Perl (Perl’s regular expression handling is actually far superior to PHP’s, and I’d recommend checking it out just to get a grasp of regular expression handling in its purest form). The expressions themselves will need very little alteration when used elsewhere, although the PHP functions used will most likely have different counterparts.
Try messing around with the expressions to obtain different results, and feel free to use any code in this article in your own programs.
[1] Thanks to www.planzero.org




September 26th, 2008 at 7:39 am
Congratulations, and thank you for sharing your valuable knowledge. I hope you’ll allow me to translate it into Spanish.
September 30th, 2008 at 4:08 pm
Great tutorial. I will use these
October 27th, 2008 at 11:14 am
Please adjust your username validator to allow the . character.
November 18th, 2008 at 6:53 pm
This has been really useful; one minor improvement is to edit the URL detector so that the path and query parts don’t match punctuation marks at the end of a url. For example, if someone had written “Check out http://www.google.com!” it would match the exclamation mark (which you don’t want), but changing the path match to “(\/[^?\#\"\s]+[^?\#\"\s\.!])?” will resolve this. You can change the query path similarly, though I’m less sure about escape characters here.
Cheers