  • Verifying a Valid Username
  • Verifying Valid Dates
  • Validating IP Addresses
  • Matching Phone Numbers, Postcodes, Social Security numbers etc
  • Validating Email Addresses
  • Verifying Credit Card Numbers
  • Replacing Static Links with Hyperlinks in HTML
  • Removing Swearing and Profanity from blogs and forums
  • Stripping out Bad Attributes and Javascript from HTML code
  • BBCode - Custom Formatting Definitions
  • Conclusion

    Regular expressions are one of the most powerful features of PHP. They’re also one of the most confusing and bewildering if you’re not familiar with their syntax. They can be used for a wide range of purposes, from validating user names to grabbing data from a website. In this article, I’ll take you through 10 regular expression functions that I have found extremely useful in the past and that can play an important role in any modern website or application.

    Verifying a Valid Username

    Sometimes we need to impose rules on a website about what sort of names a user can choose. For our own needs, and for readability, it’s often useful to state rules which force users to choose names containing only letters, numbers or underscores.

    The following regular expression function checks a given username string against the following rules:

    • A username must start with a lower-case or upper-case letter
    • A username can contain only letters, numbers or underscores
    • A username must be between 8 and 24 characters
    • A username cannot end in an underscore
    function valid_name($username){
        return preg_match("#^[a-z][\da-z_]{6,22}[a-z\d]\$#i", $username);
    }
    $usernames = array(
      "RoughGuide98", "_invalidUsername", "%423f@''#",  "25UserName",
      "I_am_valid_user",  "I am not a valid user", "short",
      "ThisUserNameIsTooLongAndWontBeAccepted", "ThisIsTheRightSize"
      );
     
    foreach ($usernames as $name){
        if(valid_name($name))
            echo "$name - Ok<br />";
        else
            echo "$name - Unsuitable!<br />";  
    }
     
    /* This produces...
    RoughGuide98 - Ok
    _invalidUsername - Unsuitable!
    %423f@''# - Unsuitable!
    25UserName - Unsuitable!
    I_am_valid_user - Ok
    I am not a valid user - Unsuitable!
    short - Unsuitable!
    ThisUserNameIsTooLongAndWontBeAccepted - Unsuitable!
    ThisIsTheRightSize - Ok */
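
    The pattern packs all four rules into one line, which makes it hard to read. As a rough sketch, the same rules can be written with the x (extended) modifier, which lets each part of the pattern carry its own comment. The function name valid_name_verbose() is made up for this illustration.

```php
<?php
// Hypothetical helper: the same rules as valid_name(), written with the
// x (extended) modifier so that every part of the pattern can be commented.
function valid_name_verbose($username) {
    $pattern = '/
        ^
        [a-z]          # first character: a letter
        [a-z\d_]{6,22} # middle: 6 to 22 letters, digits or underscores
        [a-z\d]        # last character: a letter or digit, never an underscore
        $
    /ix';
    // preg_match() returns 1 on a match, so compare to get a real boolean
    return preg_match($pattern, $username) === 1;
}

var_dump(valid_name_verbose("RoughGuide98"));     // bool(true)
var_dump(valid_name_verbose("_invalidUsername")); // bool(false)
```

    One letter, plus 6 to 22 middle characters, plus one final letter or digit gives the overall length of 8 to 24 characters.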

    Verifying Valid Dates

    This regular expression function checks whether a date in DD-MM-YYYY format is valid, including the number of days in each month and February 29th on leap years. It accepts the characters “/”, “-” and “.” as date separators.

    function check_date($date){
        //any one of these characters may separate the day, month and year
        $separator = "[\/\-\.]";
        return preg_match("#^(((0?[1-9]|1\d|2[0-8]){$separator}(0?[1-9]|1[012])|(29|30){$separator}(0?[13456789]|1[012])|31{$separator}(0?[13578]|1[02])){$separator}(19|[2-9]\d)\d{2}|29{$separator}0?2{$separator}((19|[2-9]\d)(0[48]|[2468][048]|[13579][26])|(([2468][048]|[3579][26])00)))$#", $date) === 1;
    }

    Here’s the function in action

    check_date("30.9.2005"); //valid
    check_date("32.9.2005"); //invalid
    check_date("29.1.2005"); //valid
    check_date("29.2.2005"); //invalid
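
    If the single large pattern is more than you want to maintain, one possible alternative is to let a much smaller regular expression split the string and hand the calendar logic to PHP's built-in checkdate(). This is only a sketch: the name check_date_builtin() is made up here, and unlike check_date() it insists that both separators are the same character and accepts any four-digit year from 0001 upwards.

```php
<?php
// Hypothetical alternative to check_date(): the regex only splits the string,
// and PHP's built-in checkdate() decides whether the calendar date exists.
function check_date_builtin($date) {
    // Groups: 1 = day, 2 = separator, 3 = month, 4 = year.
    // \2 forces the second separator to be the same character as the first.
    if (!preg_match('#^(\d{1,2})([/.\-])(\d{1,2})\2(\d{4})$#', $date, $m)) {
        return false;
    }
    // checkdate() takes its arguments in month, day, year order
    return checkdate((int) $m[3], (int) $m[1], (int) $m[4]);
}

var_dump(check_date_builtin("29-02-2004")); // bool(true), 2004 is a leap year
var_dump(check_date_builtin("29.2.2005"));  // bool(false)
```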

    Validating IP Addresses

    This regular expression function will check if a given IP address is within the valid range, specifically between 0.0.0.0 and 255.255.255.255

    function valid_ip($ip_address){
        $val_0_to_255 = "(25[012345]|2[01234]\d|[01]?\d\d?)";
        $pattern = "#^($val_0_to_255\.$val_0_to_255\.$val_0_to_255\.$val_0_to_255)$#";
        return preg_match($pattern, $ip_address, $matches);
    }

    This regular expression is relatively simple. It defines a sub-pattern that matches a number between 0 and 255 and repeats it four times, separated by dots. Here’s the function in use

    valid_ip("0.0.0.255"); //valid
    valid_ip("0.0.0.256"); //invalid
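
    For comparison, PHP (5.2 and later) also ships a built-in validator that can do the same job without a hand-written pattern. The wrapper below is a minimal sketch; the name valid_ipv4_builtin() is invented for this example, while filter_var(), FILTER_VALIDATE_IP and FILTER_FLAG_IPV4 are part of PHP's filter extension.

```php
<?php
// Hypothetical wrapper around PHP's filter extension.
// filter_var() returns the validated string on success and false on failure.
function valid_ipv4_builtin($ip_address) {
    return filter_var($ip_address, FILTER_VALIDATE_IP, FILTER_FLAG_IPV4) !== false;
}

var_dump(valid_ipv4_builtin("0.0.0.255")); // bool(true)
var_dump(valid_ipv4_builtin("0.0.0.256")); // bool(false)
```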

    Matching Phone Numbers, Postcodes, Social Security numbers etc

    All of these are fairly similar, so are lumped into the same category here. I’ve left out “Credit Card Numbers” since they can require a little extra consideration, and I’ve given them their own section later on. But for now, let’s look at how we can use pattern matching to check some of the most common number sequences we use in our daily lives.

    Let’s look first at matching a valid US telephone number

    US phone numbers are of the format 000 000 0000, a group of three numbers, followed by another three then a final four. We can write this in regular expression format as follows

    $reg = "#^\d{3}\s\d{3}\s\d{4}$#";

    This will match any phone number in the above format, like 351 234 5555. But we run into problems if we’re using a different notation, like (123)-555-5555. No worries though, the following alterations will allow us to use all sorts of variations like this

    $reg = "#^\(?\d{3}\)?[\s\.-]?\d{3}[\s\.-]?\d{4}$#";

    Let’s try this now

    function matchUSPhone($number) {
      $reg = "#^\(?\d{3}\)?[\s\.-]?\d{3}[\s\.-]?\d{4}$#";
      return preg_match($reg, $number);
    }
     
    if(matchUSPhone("(123)-555-5555"))
      echo "This matches";
    //indeed, this does match
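
    Once several notations are accepted, it is often handy to store them all in one format. The sketch below adds capturing groups around the three blocks of digits in the same pattern and joins them back together with spaces. The function name normaliseUSPhone() and the choice of output format are assumptions made for this example.

```php
<?php
// Hypothetical helper: validates with the same pattern as matchUSPhone(),
// but captures the three digit groups so they can be re-joined in one format.
function normaliseUSPhone($number) {
    $reg = '#^\(?(\d{3})\)?[\s.\-]?(\d{3})[\s.\-]?(\d{4})$#';
    if (!preg_match($reg, $number, $m)) {
        return false; // not a recognisable US phone number
    }
    // $m[1], $m[2] and $m[3] hold the area code, exchange and line number
    return "$m[1] $m[2] $m[3]";
}

echo normaliseUSPhone("(123)-555-5555"); // 123 555 5555
```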

    Here are some other useful functions you can use

    //Match a UK format phone number
    function matchUKPhoneNumber($number){
      //UK Phone numbers are 10 or 11 digits long, and have a 3, 4, 5 or 6 digit area code
      $s = "(([ \-\.])*)?";
      $result = preg_match("#^((\(?\d{3}\)?$s\d{7,8})|(\(?\d{4}\)?$s\d{6,7})|(\(?\d{5}\)?$s\d{5,6})|(\(?\d{6}\)?$s\d{4,5}))$#", $number);
      return $result;
    }
     
    //Match a US social security number
    function matchUSSocialSecurity($ss_num){
      //anchored with ^ and $ so that the whole string must be in 000-00-0000 format
      return preg_match("#^\d{3}-\d{2}-\d{4}$#", $ss_num);
    }
     
    //This regular expression matches a UK National Insurance number  
    function matchUKNationalInsurance($ni){
      $digits_1_and_2 = "[ABCEGHJKLMNOPRSTWXYZ][ABCEGHJKLMNPRSTWXYZ]";  
      $the_rest = "\d{6}[A-D]?";
      $reg = "#^".$digits_1_and_2.$the_rest."$#";
      return preg_match($reg, $ni);
    }

    Validating Email Addresses

    Checking that an email address is valid is a fairly tricky thing. You have to make sure the domain name is correct, and that the first part of the email contains valid characters. Using regular expressions though, something like this becomes fairly trivial. Here’s a simple example to match a very basic email address

    $reg = "#(.*)@(.*)\.com#";

    This is pretty useless though, since it will only match email addresses containing “.com” and accepts almost anything on either side of the “@”. A more complete example is as follows

    function checkEmail($email){
        $reg = "#^(((([a-z\d][\.\-\+_]?)*)[a-z0-9])+)\@(((([a-z\d][\.\-_]?){0,62})[a-z\d])+)\.([a-z\d]{2,6})$#i";
        return preg_match($reg, $email);   
    }

    And a couple of examples

    checkEmail("robin@roughguidetophp.com"); //valid
    checkEmail("not-an-email@address"); //invalid

    Note that this function won’t determine whether a given email address actually exists. To achieve that, you’d need to implement more checks. One possibility is to use the PHP function checkdnsrr() to check that the domain name has DNS records. Before PHP 5.3.0 that function was only available on Unix installations of PHP, so on older Windows installations you’d need to write your own custom function.

    You would then still need to check that the mailbox itself exists. This can be achieved by sending a confirmation email to the user: if they receive it and click on the link it contains, you’ll know it’s a valid address.
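
    The syntax check and the DNS lookup described above could be combined along the following lines. This is a sketch under a few assumptions: the function name checkEmailWithDns() and its $check_dns switch are invented here, the syntax check uses PHP's built-in FILTER_VALIDATE_EMAIL filter (PHP 5.2 and later) rather than the pattern from checkEmail(), and the DNS lookup is off by default because it needs network access.

```php
<?php
// Hypothetical helper combining a syntax check with an optional DNS lookup.
function checkEmailWithDns($email, $check_dns = false) {
    // Syntax check using PHP's built-in validator rather than a hand-written regex
    if (filter_var($email, FILTER_VALIDATE_EMAIL) === false) {
        return false;
    }
    if ($check_dns) {
        // Everything after the last @ is the domain
        $domain = substr(strrchr($email, "@"), 1);
        // Accept the domain if it publishes an MX record or, failing that, an A record
        return checkdnsrr($domain, "MX") || checkdnsrr($domain, "A");
    }
    return true;
}

var_dump(checkEmailWithDns("robin@roughguidetophp.com")); // bool(true)
var_dump(checkEmailWithDns("not-an-email@address"));      // bool(false)
```

    Even with the DNS lookup switched on, this only shows that the domain can receive mail, so the confirmation email is still needed to prove that the mailbox exists.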

    Verifying Credit Card Numbers

    When developing e-commerce applications, it becomes necessary to process credit card details, and make sure they’re accurate. A simple check you can perform is to see if the credit card number a customer provides fits the card number definitions laid out by the card issuers. For example, a Visa card number starts with a 4, and can contain 16 or 13 digits.

    The following regular expression function will check to see if a card number conforms to the appropriate rules. This function checks for some of the most common card types. An optional second parameter enables a Luhn algorithm check (see note 1), a checksum that card issuers build into valid card numbers so that they can be told apart from random strings of digits. Read more about the Luhn algorithm at Wikipedia, if you’re interested.

    check_cc() will return false if the card number isn’t valid. If it is valid, it will return a string containing the type of card matched.

    function validatecard($cardnumber) {
        $cardnumber=preg_replace("/\D|\s/", "", $cardnumber);  # strip any non-digits
        $cardlength=strlen($cardnumber);
        $parity=$cardlength % 2;
        $sum=0;
        for ($i=0; $i<$cardlength; $i++) {
          $digit=$cardnumber[$i];
          if ($i%2==$parity) $digit=$digit*2;
          if ($digit>9) $digit=$digit-9;
          $sum=$sum+$digit;
        }
        $valid=($sum%10==0);
        return $valid;
    }
     
     
    function check_cc($cc, $extra_check = false){
        //card name => pattern; the patterns are tried in the order listed
        $cards = array(
            "Visa" => "4\d{12}(?:\d{3})?",
            "American Express" => "3[47]\d{13}",
            "JCB" => "35[2-8][89]\d\d\d{10}",
            "Maestro" => "(?:5020|5038|6304|6579|6761)\d{12}(?:\d\d)?",
            "Solo" => "(?:6334|6767)\d{12}(?:\d\d)?\d?",
            "Mastercard" => "5[1-5]\d{14}",
            "Switch" => "(?:4903|4905|4911|4936|6333|6759)\d{12}|(?:564182|633110)\d{10}(?:\d\d)?\d?",
        );
        $number = str_replace(" ", "", $cc);
        foreach($cards as $name => $pattern){
            if(preg_match("#^(?:$pattern)$#", $number)){
                //optionally confirm the number with the Luhn check as well
                if($extra_check && !validatecard($number)) return false;
                return $name;
            }
        }
        return false;
    }

    And here’s a sample of the function in action

    $cards = array(
        "4111 1111 1111 1111",
        "4111 1111 1111 1",
        "4111 1111 1111 111",
        "3400 0000 0000 009",
        "3500 0000 0000 009",
        "5500 1545 0000 0004",
        "5940 0000 0000 0004"
    );
     
    foreach($cards as $c){
        $check = check_cc($c);
        if($check!==false)
            echo $c." - ".$check;
        else
            echo "$c - Not a match";
        echo "<br/>";
    }
     
    /*
    This gives us
    4111 1111 1111 1111 - Visa
    4111 1111 1111 1 - Visa
    4111 1111 1111 111 - Not a match
    3400 0000 0000 009 - American Express
    3500 0000 0000 009 - Not a match
    5500 1545 0000 0004 - Mastercard
    5940 0000 0000 0004 - Not a match
     
    Using the Luhn algorithm, we can also check further to 
    see if any of these numbers actually represents a valid card
    check_cc("5500 1545 0000 0004", true); //matches
    check_cc("5500 1585 0000 0004", true); //doesn't match
     
    */
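
    To see how the Luhn check ties a card number together, it can help to run the algorithm the other way round and compute the final digit from the digits before it. The sketch below does that; the function name luhn_check_digit() is invented for this illustration and it is not part of the code above.

```php
<?php
// Hypothetical helper: given a card number without its final digit,
// work out the check digit that makes the whole number pass the Luhn test.
function luhn_check_digit($partial) {
    $digits = preg_replace('/\D/', '', $partial); // strip any non-digits
    $sum = 0;
    $double = true; // the digit next to the (missing) check digit is doubled
    for ($i = strlen($digits) - 1; $i >= 0; $i--) {
        $d = (int) $digits[$i];
        if ($double) {
            $d *= 2;
            if ($d > 9) $d -= 9; // same as adding the two digits of the product
        }
        $sum += $d;
        $double = !$double; // every second digit, moving leftwards
    }
    // the check digit tops the sum up to the next multiple of 10
    return (10 - $sum % 10) % 10;
}

echo luhn_check_digit("5500 1545 0000 000"); // 4
```

    For 5500 1545 0000 000 the weighted sum is 26, so the check digit is 4, which gives the number 5500 1545 0000 0004 used in the sample above.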

    Replacing Static Links with Hyperlinks in HTML

    A useful feature on most forums is the ability to type in a website address, like http://www.google.com, for example, and have that address automatically linked, via a HTML hyperlink, like so - http://www.google.com. The following regular expression function will automatically replace any website addresses with the appropriate HTML code to produce an active link

    function replace_urls($string){
        $host = "([a-z\d][-a-z\d]*[a-z\d]\.)+[a-z][-a-z\d]*[a-z]";
        $port = "(:\d{1,})?";
        $path = "(\/[^?<>\#\"\s]+)?";
        $query = "(\?[^<>\#\"\s]+)?";
        return preg_replace("#((ht|f)tps?:\/\/{$host}{$port}{$path}{$query})#i", "<a href=\"$1\">$1</a>", $string);
    }

    Here, the four different components of the average URL - host (e.g. www.google.com), port (e.g. www.website.com:8080), path (e.g. www.website.com/mysite/file.php) and the query string (e.g. www.website.com/index.php?var=1&var2=100) - are combined to generate the final regular expression pattern.

    An example of this in action

    $string = "Visit http://www.google.com for all your searching requirements! And here are some pictures of cheese - http://images.google.co.uk/images?q=cheese";
    echo replace_urls($string);
     
    /*
    This produces...
    Visit <a href="http://www.google.com">http://www.google.com</a> for all your searching requirements!
    And here are some pictures of cheese - <a href="http://images.google.co.uk/images?q=cheese">http://images.google.co.uk/images?q=cheese</a>
    */
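
    One weakness of matching URLs in running text is that punctuation at the end of a sentence can be swallowed into the link. A possible way round this is sketched below, using preg_replace_callback() with an anonymous function (which needs PHP 5.3 or later). The function name replace_urls_trimmed(), the much simpler URL pattern and the list of trimmed characters are all assumptions made for this example, and the trade-off is that a URL which genuinely ends in one of those characters loses it.

```php
<?php
// Hypothetical variant of replace_urls(): a deliberately simpler pattern,
// with trailing punctuation moved outside the link and the URL HTML-escaped.
function replace_urls_trimmed($string) {
    return preg_replace_callback('#\bhttps?://[^\s<>"]+#i', function ($m) {
        // punctuation that probably ends the sentence rather than the URL
        $url = rtrim($m[0], '.,!?;:)');
        // whatever was trimmed off is put back after the closing tag
        $tail = substr($m[0], strlen($url));
        // escape &, quotes and angle brackets before building the HTML
        $safe = htmlspecialchars($url, ENT_QUOTES);
        return '<a href="' . $safe . '">' . $safe . '</a>' . $tail;
    }, $string);
}

echo replace_urls_trimmed("Check out http://www.google.com!");
// Check out <a href="http://www.google.com">http://www.google.com</a>!
```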

    Removing Swearing and Profanity from blogs and forums

    Sometimes it’s useful to be able to protect the youngsters and more fragile minded members of our communities from having to see rude words on their computer screens by filtering out any words deemed to be unsuitable from any comments posted by site members.

    The following regular expression function uses PHP’s preg_replace_callback() to call the function stars() on every match for the words in the $swears array, replacing each letter except the first with a star.

    This function can be used after retrieving data from a database, to filter out any unsuitable content. You can also use this function to filter out any other keywords, perhaps to prevent users from advertising your competitor’s products on your forum, for example.

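    //the helpers are declared at the top level so that deswear() can be called more than once

    //wraps a word in regular expression delimiters and makes it case-insensitive
    function prep_regexp_array(&$item){
        $item = "#$item#i";
    }

    //keeps the first letter of the match and replaces the rest with stars
    function stars($matches){
        return substr($matches[0], 0, 1).str_repeat("*", strlen($matches[0])-1);
    }

    function deswear($string){
        $swears = array("darn", "heck", "blast", "shoot");
        array_walk($swears, "prep_regexp_array");
        return preg_replace_callback($swears, "stars", $string);
    }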

    And an example of this in action

    $string = "Aw, darn, HECK, Blast and ShOoT!";
    echo deswear($string);
    //Aw, d***, H***, B**** and S****!
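
    Because the words are matched anywhere in the text, a filter like this will also star out harmless words that happen to contain one of them. A possible refinement, sketched below, is to build a single pattern with \b word boundaries and to pass each word through preg_quote(). The function name deswear_words() and its second parameter are assumptions for this example, and the anonymous functions need PHP 5.3 or later.

```php
<?php
// Hypothetical variant of deswear() that only matches whole words.
function deswear_words($string, array $swears) {
    // preg_quote() escapes any regex metacharacters in the words themselves
    $quoted = array_map(function ($w) { return preg_quote($w, '#'); }, $swears);
    // \b on both sides means "heck" no longer matches inside "checker"
    $pattern = '#\b(?:' . implode('|', $quoted) . ')\b#i';
    return preg_replace_callback($pattern, function ($m) {
        // keep the first letter, replace the rest with stars
        return substr($m[0], 0, 1) . str_repeat('*', strlen($m[0]) - 1);
    }, $string);
}

echo deswear_words("Heck, the checker shoots.", array("darn", "heck", "blast", "shoot"));
// H***, the checker shoots.
```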

    Stripping out Bad Attributes and Javascript from HTML code

    When working with a forum or blog where user comments are welcomed, it’s often necessary to maintain strict control over exactly what a user can post on the site. The simplest method is just to strip out all unwanted content using the PHP function strip_tags(). This function also has an optional second argument that lets you leave certain tags in place, like <b> or <u>, if you wish to give users some custom formatting.

    A problem with this, though, is that strip_tags() will not remove whatever is inside the <b> or <u> tags if you leave them in place. For example, the following code will not get stripped out using strip_tags()

    This is <b style='font-size:100px' onMouseOver='alert("Hello")'>malicious text!</b>

    This will trigger a JavaScript alert window whenever you move your mouse over the bold text and also, because of the “style” attribute, make the text 100 pixels tall. Not what you want to be seeing on your message board!

    The following function will not only strip out any unwanted tags from your HTML code, but will also clean up whatever attributes are inside those tags, preventing people from messing up your page

    //strips the unwanted attributes from a single matched tag
    function clean($matched)
    {
          $attribs = "javascript:|onclick|ondblclick|onmousedown|onmouseup|onmouseover|".
                     "onmousemove|onmouseout|onkeypress|onkeydown|onkeyup|".
                     "onload|class|id|src|style";
          $quot = "\"|\'|\`";
          $stripAttrib = "' ($attribs)\s*=\s*($quot)(.*?)(\\2)'i";
          $clean = stripslashes($matched[0]);
          $clean = preg_replace($stripAttrib, '', $clean);
          return $clean;
    }

    function cleanTags($source, $tags = null)
    {
        //tags that survive strip_tags(), unless the caller supplies its own list
        $allowedTags = ($tags === null) ? '<a><b><br><i><li><ol><p><strong><u><ul>' : $tags;
        $clean = strip_tags($source, $allowedTags);
        //run every remaining tag through clean() to remove its unwanted attributes
        $clean = preg_replace_callback('#<(.*?)>#', "clean", $clean);
        return $clean;
    }

    Here’s the function in action

    $string = "This is <b style='font-size:100px;' onMouseOver='alert(\"Hello\")'>malicious text!</b>";
     
    echo $string;
    echo cleanTags($string);
    //gives us...
    //  This is <b>malicious text!</b>

    BBCode - Custom Formatting Definitions

    A more comprehensive approach to the previous problem of disallowing certain tags in HTML user input is to completely strip out all tags, except for your own custom-defined ones.

    Instead of declaring your tags using pointed brackets (< and >) you can create custom tags, most commonly with square brackets ([ and ]). This is known as BBCode (Bulletin Board Code). It is a simple way of defining custom formatting rules for forums and gives more control to the board owner over what users can and cannot do. Any range of tags can be defined, the most common of which are [b] for bold, [u] for underline and [url] to place a link.

    Here is a simple implementation of some basic BBCode rules

    function BBcode($string) {
     
        //get rid of all HTML tags
        $string = strip_tags($string);
     
        $patterns = array(
            "bold" => "#\[b\](.*?)\[/b\]#is",
            "italics" => "#\[i\](.*?)\[/i\]#is",
            "underline" => "#\[u\](.*?)\[/u\]#is",
            "link_title" => "#\[url=(.*?)](.*?)\[/url\]#i",
            "link_basic" => "#\[url](.*?)\[/url\]#i",
            "color" => "#\[color=(red|green|blue|yellow)\](.*?)\[/color\]#is"
        );
     
        $replacements = array(
            "bold" => "<b>$1</b>",
            "italics" => "<i>$1</i>",
            "underline" => "<u>$1</u>",
            "link_title" => "<a href=\"$1\">$2</a>",
            "link_basic" => "<a href=\"$1\">$1</a>",
            "color" => "<span style='color:$1;'>$2</span>"
        );
     
        return preg_replace($patterns, $replacements, $string);
     
    }

    This regular expression function searches through a given string for an opening tag in square brackets, followed by the matching closing tag with the same name inside. So, for example, the text “[b]this is bold[/b]” would be matched by the first expression in the array. Note how we’re specifying that the search for the contents of each pair of tags mustn’t be greedy, denoted by the ? after the * symbol. Otherwise, a string like “[b]this is bold[/b] and [b]this is also bold[/b]” would return a single match for everything between the first [b] and the last [/b], which is not what we require.

    Here it is in action

    echo BBcode("[b]this is[/b] me being [b]bold[/b], and [url=http://www.google.com]here's a link[/url] and another - [url]http://www.google.com[/url]");
    //<b>this is</b> me being <b>bold</b>, and <a href="http://www.google.com">here's a link</a> and another - <a href="http://www.google.com">http://www.google.com</a>
     
    echo BBcode("Here's some [color=red]Red Text![/color]");
    //Here's some <span style='color:red;'>Red Text!</span>
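
    The difference between a greedy and a non-greedy search is easy to see by running the [b] pattern both ways on the same text. This is a small stand-alone demonstration, separate from the BBcode() function above; the variable names are made up for the example.

```php
<?php
$text = "[b]this is bold[/b] and [b]this is also bold[/b]";

// Greedy: .* runs on to the LAST [/b], so both pairs collapse into one match
$greedy = preg_replace('#\[b\](.*)\[/b\]#is', '<b>$1</b>', $text);
echo $greedy, "\n";
// <b>this is bold[/b] and [b]this is also bold</b>

// Non-greedy: .*? stops at the FIRST [/b], so each pair is handled separately
$lazy = preg_replace('#\[b\](.*?)\[/b\]#is', '<b>$1</b>', $text);
echo $lazy, "\n";
// <b>this is bold</b> and <b>this is also bold</b>
```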

    Conclusion

    So, there you go. Hopefully you’ve gained some useful knowledge from this article. At the very least, you should now have an understanding of the power of regular expressions and be able to see how they can fit in with your website applications to provide extra functionality.

    These expressions and functions can also be translated into other programming languages fairly easily, like JavaScript or Perl (Perl’s regular expression handling is actually far superior to PHP’s, and I’d recommend checking it out just to get a grasp of regular expression handling in its purest form). The expressions themselves will need very little alteration when used elsewhere, although the PHP functions used will most likely have different counterparts.

    Try messing around with the expressions to obtain different results, and feel free to use any code in this article in your own programs.

    1. Thanks to www.planzero.org

4 Responses

  • Javier Says:

    Congratulations, and thank you for sharing your valuable knowledge. I hope you will allow me to translate it into Spanish.

  • Ben Shelock Says:

    Great tutorial. I will use these :)

  • MarkJ Says:

    Please adjust your username validator to allow the . character.

  • Matt Says:

    This has been really useful; one minor improvement is to edit the URL detector so that the path and query parts don’t match punctuation marks at the end of a URL. For example, if someone had written “Check out http://www.google.com!” it would match the exclamation mark (which you don’t want), but changing the path match to “(\/[^?\#\"\s]+[^?\#\"\s\.!])?” will resolve this. You can change the query part similarly, though I’m less sure about escape characters here.

    Cheers
