Tuesday, March 17, 2009

PHP - Regular Expressions: simple functions and common regular expressions

Regular Expressions

We will start with a simple function that returns TRUE or FALSE.
$regex is the regular expression to match against and $text is the text to search.

 

function do_reg($text, $regex)
{
    // preg_match() returns 1 on a match, 0 on no match and FALSE on error
    return preg_match($regex, $text) === 1;
}
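
As a quick usage sketch of the TRUE/FALSE idea, the snippet below checks whether a string consists only of digits. The helper name matches_regex and the digits-only pattern are assumptions made for this example, not part of the original post.

```php
<?php
// Hypothetical helper name; same idea as the boolean matcher above.
function matches_regex($text, $regex)
{
    // preg_match() returns 1 on a match, 0 on no match, false on error.
    return preg_match($regex, $text) === 1;
}

// Illustrative pattern: one or more digits and nothing else.
$digitsOnly = '/^[0-9]+$/';

var_dump(matches_regex('12345', $digitsOnly)); // bool(true)
var_dump(matches_regex('12a45', $digitsOnly)); // bool(false)
```

Note that the pattern is passed with delimiters (the two slashes), which the preg functions require.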
 

The next function returns the part of a given string ($text) matched by the regex ($regex), using $regs as storage for the match and its groups. By changing $regs[0] to $regs[1] we can return a capturing group (in this case group 1) instead of the whole match. A capturing group can also be referenced by name ($regs['groupname']):

 

function do_reg($text, $regex)
{
    // preg_match() fills $regs: [0] is the whole match, [1], [2], ... are the groups
    if (preg_match($regex, $text, $regs)) {
        $result = $regs[0];
    } else {
        $result = "";
    }

    return $result;
}
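
To make the difference between $regs[0], $regs[1] and a named group concrete, here is a minimal sketch. The sample text and the group names (year, month, day) are assumptions chosen for illustration.

```php
<?php
// Illustrative pattern: a yyyy-mm-dd date with three named capturing groups.
$regex = '/(?<year>(?:19|20)[0-9]{2})-(?<month>0[1-9]|1[012])-(?<day>0[1-9]|[12][0-9]|3[01])/';
$text  = 'Posted on 2009-03-17 by the author.';

if (preg_match($regex, $text, $regs)) {
    echo $regs[0], "\n";       // whole match: 2009-03-17
    echo $regs[1], "\n";       // capturing group 1: 2009
    echo $regs['year'], "\n";  // the same group, by name: 2009
}
```

Named groups are also stored under their numeric index, which is why $regs[1] and $regs['year'] hold the same value here.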

The following function will return an array of all regex
matches in a given string ($text):
function do_reg($text, $regex)
{
    preg_match_all($regex, $text, $result, PREG_PATTERN_ORDER);

    // $result[0] holds every full match
    return $result[0];
}
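
A short usage sketch of collecting all matches, assuming a hypothetical helper name (all_matches) and a made-up sample sentence:

```php
<?php
// Hypothetical helper name; returns every full match as a flat array.
function all_matches($text, $regex)
{
    preg_match_all($regex, $text, $result, PREG_PATTERN_ORDER);
    return $result[0];
}

// Illustrative pattern: runs of digits.
$numbers = all_matches('Order 66 shipped 3 items on day 17', '/[0-9]+/');
print_r($numbers); // contains '66', '3' and '17', in that order
```

When nothing matches, the function returns an empty array, so the result can be passed straight to count() or foreach.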
Next we can iterate (loop) over all matches in a string ($text)
and output the results:

function do_reg($text, $regex)
{
    preg_match_all($regex, $text, $result, PREG_PATTERN_ORDER);

    for ($i = 0; $i < count($result[0]); $i++) {
        // $result[0][$i] is the i-th full match
        echo $result[0][$i], "\n";
    }
}
Extending the function above, we can iterate over all matches
and their capturing groups in a string ($text):

function do_reg($text, $regex)
{
    preg_match_all($regex, $text, $result, PREG_SET_ORDER);

    for ($matchi = 0; $matchi < count($result); $matchi++) {
        for ($backrefi = 0; $backrefi < count($result[$matchi]); $backrefi++) {
            // index 0 is the full match, 1 and up are the capturing groups
            echo $result[$matchi][$backrefi], "\n";
        }
    }
}
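
The two loops rely on how preg_match_all() arranges its results, so here is a small sketch comparing PREG_PATTERN_ORDER and PREG_SET_ORDER on the same input. The sample text and pattern are assumptions for illustration only.

```php
<?php
$text  = 'a=1, b=22';
$regex = '/([a-z])=([0-9]+)/';

preg_match_all($regex, $text, $byPattern, PREG_PATTERN_ORDER);
preg_match_all($regex, $text, $bySet, PREG_SET_ORDER);

// PREG_PATTERN_ORDER groups by pattern part:
//   $byPattern[0] = all full matches, $byPattern[1] = all group-1 values, ...
// PREG_SET_ORDER groups by match:
//   $bySet[0] = first match with its groups, $bySet[1] = second match, ...
foreach ($bySet as $match) {
    echo $match[0], ' -> key ', $match[1], ', value ', $match[2], "\n";
}
```

PREG_SET_ORDER is usually the more convenient layout when each match has to be processed together with its own groups.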
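Now let's look at some useful regular expressions. Most of them are listed as bare patterns. To use one with the preg functions, wrap it in delimiters such as '/.../', escape any delimiter character that occurs inside the pattern, and add the i modifier where the pattern relies on case-insensitive matching (for example the [A-Z] classes in the email and URL patterns).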

Addresses

//Address: State code (US)
'/\\b(?:A[KLRZ]|C[AOT]|D[CE]|FL|GA|HI|I[ADLN]|K[SY]|LA|M[ADEINOST]|N[CDEHJMVY]|O[HKR]|PA|RI|S[CD]|T[NX]|UT|V[AT]|W[AIVY])\\b/'

//Address: ZIP code (US)
'\b[0-9]{5}(?:-[0-9]{4})?\b'
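
As a sketch of how the two address patterns might be combined, the snippet below pulls a state code and ZIP code out of one address line. Joining the patterns with \s+ and the group names state and zip are assumptions made for this example. The address is just sample data.

```php
<?php
// Both sub-patterns come from the list above; combining them and naming
// the groups is an assumption made for this example.
$state = 'A[KLRZ]|C[AOT]|D[CE]|FL|GA|HI|I[ADLN]|K[SY]|LA|M[ADEINOST]|N[CDEHJMVY]|O[HKR]|PA|RI|S[CD]|T[NX]|UT|V[AT]|W[AIVY]';
$zip   = '[0-9]{5}(?:-[0-9]{4})?';
$regex = '/\b(?<state>' . $state . ')\s+(?<zip>' . $zip . ')\b/';

$text = '1600 Pennsylvania Ave NW, Washington, DC 20500';

if (preg_match($regex, $text, $m)) {
    echo $m['state'], ' ', $m['zip'], "\n"; // DC 20500
}
```

The state pattern is case sensitive on purpose here, so ordinary lowercase words are not mistaken for state codes.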
Dates
//Date d/m/yy and dd/mm/yyyy
//1/1/00 through 31/12/99 and 01/01/1900 through 31/12/2099
//Matches invalid dates such as February 31st
'\b(0?[1-9]|[12][0-9]|3[01])[- /.](0?[1-9]|1[012])[- /.](19|20)?[0-9]{2}\b'

//Date dd/mm/yyyy
//01/01/1900 through 31/12/2099
//Matches invalid dates such as February 31st
'(0[1-9]|[12][0-9]|3[01])[- /.](0[1-9]|1[012])[- /.](19|20)[0-9]{2}'

//Date m/d/y and mm/dd/yyyy
//1/1/99 through 12/31/99 and 01/01/1900 through 12/31/2099
//Matches invalid dates such as February 31st
//Accepts dashes, spaces, forward slashes and dots as date separators
'\b(0?[1-9]|1[012])[- /.](0?[1-9]|[12][0-9]|3[01])[- /.](19|20)?[0-9]{2}\b'

//Date mm/dd/yyyy
//01/01/1900 through 12/31/2099
//Matches invalid dates such as February 31st
'(0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])[- /.](19|20)[0-9]{2}'

//Date yy-m-d or yyyy-mm-dd
//00-1-1 through 99-12-31 and 1900-01-01 through 2099-12-31
//Matches invalid dates such as February 31st
'\b(19|20)?[0-9]{2}[- /.](0?[1-9]|1[012])[- /.](0?[1-9]|[12][0-9]|3[01])\b'


//Date yyyy-mm-dd
//1900-01-01 through 2099-12-31
//Matches invalid dates such as February 31st
'(19|20)[0-9]{2}[- /.](0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])'
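
Since all of these patterns accept impossible dates such as February 31st, one way to close the gap is to let the regex check the format and then hand the numbers to PHP's checkdate(). The sketch below does that for the dd/mm/yyyy pattern. The helper name is_real_date, the added ^ and $ anchors, and the extra group around the full year are assumptions made for this example.

```php
<?php
// Hypothetical helper: accept dd/mm/yyyy only when the calendar date exists.
function is_real_date($text)
{
    // dd/mm/yyyy pattern from the list, anchored, with the whole year captured.
    // '#' is used as delimiter because the pattern itself contains '/'.
    $regex = '#^(0[1-9]|[12][0-9]|3[01])[- /.](0[1-9]|1[012])[- /.]((?:19|20)[0-9]{2})$#';

    if (!preg_match($regex, $text, $m)) {
        return false; // wrong format
    }

    // checkdate() expects month, day, year (in that order).
    return checkdate((int) $m[2], (int) $m[1], (int) $m[3]);
}

var_dump(is_real_date('17/03/2009')); // bool(true)
var_dump(is_real_date('31/02/2009')); // bool(false): format is fine, date is not
var_dump(is_real_date('29/02/2008')); // bool(true): 2008 is a leap year
```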
Email address 

//Email address
//Use this version to seek out email addresses in random documents and texts.
//Does not match email addresses using an IP address instead of a domain name.
//Does not match email addresses on new-fangled top-level domains with more than 4 letters such as .museum.
//Including these increases the risk of false positives when applying the regex to random documents.
'\b[A-Z0-9._%-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b'

//Email address (anchored)
//Use this anchored version to check if a valid email address was entered.
//Does not match email addresses using an IP address instead of a domain name.
//Does not match email addresses on new-fangled top-level domains with more than 4 letters such as .museum.
//Requires the "case insensitive" option to be ON.
'^[A-Z0-9._%-]+@[A-Z0-9.-]+\.[A-Z]{2,4}$'

//Email address (anchored; no consecutive dots)
//Use this anchored version to check if a valid email address was entered.
//Improves on the original email address regex by excluding addresses with consecutive dots such as john@aol...com
//Does not match email addresses using an IP address instead of a domain name.
//Does not match email addresses on new-fangled top-level domains with more than 4 letters such as .museum.
//Including these increases the risk of false positives when applying the regex to random documents.
'^[A-Z0-9._%-]+@(?:[A-Z0-9-]+\.)+[A-Z]{2,4}$'

//Email address (no consecutive dots)
//Use this version to seek out email addresses in random documents and texts.
//Improves on the original email address regex by excluding addresses with consecutive dots such as john@aol...com
//Does not match email addresses using an IP address instead of a domain name.
//Does not match email addresses on new-fangled top-level domains with more than 4 letters such as .museum.
//Including these increases the risk of false positives when applying the regex to random documents.
'\b[A-Z0-9._%-]+@(?:[A-Z0-9-]+\.)+[A-Z]{2,4}\b'

//Email address (specific TLDs)
//Does not match email addresses using an IP address instead of a domain name.
//Matches all country code top level domains, and specific common top level domains.
'^[A-Z0-9._%-]+@[A-Z0-9.-]+\.(?:[A-Z]{2}|com|org|net|biz|info|name|aero|jobs|museum)$'

//Email address: Replace with HTML link
'\b(?:mailto:)?([A-Z0-9._%-]+@[A-Z0-9.-]+\.[A-Z]{2,4})\b'
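
The last pattern is meant for search and replace, so here is a minimal sketch of using it with preg_replace(). The sample text and the exact shape of the replacement string are assumptions. The pattern is the one listed above, with delimiters and the i modifier added.

```php
<?php
// Pattern from the list above, wrapped in delimiters and made case insensitive.
$regex = '/\b(?:mailto:)?([A-Z0-9._%-]+@[A-Z0-9.-]+\.[A-Z]{2,4})\b/i';

$text = 'Write to john@example.com or mailto:jane@example.org today.';

// $1 is the address itself, without any leading "mailto:".
$html = preg_replace($regex, '<a href="mailto:$1">$1</a>', $text);

echo $html, "\n";
```

Because the optional mailto: prefix sits outside the capturing group, an address written with the prefix and one written without it both end up as the same kind of link.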

URLs


//URL: Different URL parts
//Protocol, domain name, page and CGI parameters are captured into backreferences 1 through 4
'\b((?#protocol)https?|ftp)://((?#domain)[-A-Z0-9.]+)((?#file)/[-A-Z0-9+&@#/%=~_|!:,.;]*)?((?#parameters)\?[-A-Z0-9+&@#/%=~_|!:,.;]*)?'

//URL: Different URL parts
//Protocol, domain name, page and CGI parameters are captured into named capturing groups.
//Works as it is with .NET, and after conversion by RegexBuddy on the Use page with Python, PHP/preg and PCRE.
'\b(?<protocol>https?|ftp)://(?<domain>[-A-Z0-9.]+)(?<file>/[-A-Z0-9+&@#/%=~_|!:,.;]*)?(?<parameters>\?[-A-Z0-9+&@#/%=~_|!:,.;]*)?'

//URL: Find in full text
//The final character class makes sure that if a URL is part of some text, punctuation such as a
//comma or full stop after the URL is not interpreted as part of the URL.
'\b(https?|ftp|file)://[-A-Z0-9+&@#/%?=~_|!:,.;]*[-A-Z0-9+&@#/%=~_|]'

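//URL: Replace URLs with HTML links
//preg_replace needs delimiters around the pattern, so the forward slashes inside it are escaped.
//The i modifier makes the [A-Z] classes match lowercase letters too.
preg_replace('/\b(https?|ftp|file):\/\/[-A-Z0-9+&@#\/%?=~_|!:,.;]*[-A-Z0-9+&@#\/%=~_|]/i', '<a href="\0">\0</a>', $text);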
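
To show what the named-group version of the URL pattern gives back, here is a small sketch using preg_match(). The sample URL is made up, and the choice of curly braces as delimiters is an assumption made so that none of the characters inside the pattern need escaping.

```php
<?php
// Named-group URL pattern from the list above. Curly braces serve as the
// delimiters because '/', '#' and '~' all occur inside the pattern.
$regex = '{\b(?<protocol>https?|ftp)://(?<domain>[-A-Z0-9.]+)(?<file>/[-A-Z0-9+&@#/%=~_|!:,.;]*)?(?<parameters>\?[-A-Z0-9+&@#/%=~_|!:,.;]*)?}i';

$text = 'See http://www.example.com/docs/index.php?id=7&lang=en for details.';

if (preg_match($regex, $text, $url)) {
    echo $url['protocol'], "\n";    // http
    echo $url['domain'], "\n";      // www.example.com
    echo $url['file'], "\n";        // /docs/index.php
    echo $url['parameters'], "\n";  // ?id=7&lang=en
}
```

The file and parameters groups are optional in the pattern, so real code should check that they are set before reading them for URLs that have no path or query string.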
