Liferay Tech Support: Regular Expression

Regular expressions come up frequently in text processing: searching, parsing, validation, and checking XML document integrity. Java provides the java.util.regex package to make working with regular expressions easier. Below is a summary of the constructs I use most often with Java regex.

Common matching symbols:

Regular Expression	Description
.	Matches any single character (by default, any character except a line terminator)
^regex	regex must match at the beginning of the line (of the whole input unless Pattern.MULTILINE is set)
regex$	regex must match at the end of the line (of the whole input unless Pattern.MULTILINE is set)
[abc]	Set definition, matches the letter a or b or c
[abc[vz]]	Union of sets, matches a single character that is a, b, c, v or z
[^abc]	A "^" as the first character inside [] negates the set. Matches any character except a, b or c
[a-d1-7]	Ranges, matches a single letter between a and d or a single digit from 1 to 7; it does not match the two-character string d1
X|Z	Finds X or Z
XZ	Finds X directly followed by Z
$	Checks if a line end follows
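
The symbols in the table above can be tried out with a few lines of Java. This is a minimal sketch; the class name CommonSymbolsDemo and the helpers found and whole are made up for illustration and are not part of java.util.regex.

```java
import java.util.regex.Pattern;

class CommonSymbolsDemo {
    // True when the regex is found anywhere in the input (Matcher.find semantics).
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    // True only when the regex matches the entire input (Matcher.matches semantics).
    static boolean whole(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(found("^cat", "catalog"));      // true: "cat" is at the start
        System.out.println(found("cat$", "bobcat"));       // true: "cat" is at the end
        System.out.println(whole("[abc[vz]]", "z"));       // true: union of [abc] and [vz]
        System.out.println(found("[^abc]", "abc"));        // false: every character is excluded
        System.out.println(whole("[a-d1-7]", "d1"));       // false: a set matches one character
        System.out.println(whole("[a-d1-7]{2}", "d1"));    // true: two characters from the set
        System.out.println(found("X|Z", "xyZ"));           // true: the Z alternative matches
    }
}
```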

Metacharacters:

Regular Expression	Description
\d	Any digit, short for [0-9]
\D	A non-digit, short for [^0-9]
\s	A whitespace character, short for [ \t\n\x0b\r\f]
\S	A non-whitespace character, short for [^\s]
\w	A word character, short for [a-zA-Z_0-9]
\W	A non-word character, short for [^\w]
\S+	One or more non-whitespace characters
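
In Java source code each backslash of a metacharacter has to be doubled inside a string literal, so \d is written "\\d". A small sketch of the metacharacters in use follows; the class MetacharDemo and its method names are hypothetical helpers, only String.split, String.replaceAll and String.matches are standard API.

```java
import java.util.Arrays;

class MetacharDemo {
    // Splits on runs of whitespace (\s+) after trimming the ends.
    static String[] tokens(String input) {
        return input.trim().split("\\s+");
    }

    // Removes every non-digit (\D), leaving only the digits.
    static String digitsOnly(String input) {
        return input.replaceAll("\\D", "");
    }

    // True when the whole input consists of word characters (\w).
    static boolean isWord(String input) {
        return input.matches("\\w+");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(tokens("  order 42\tshipped \n"))); // [order, 42, shipped]
        System.out.println(digitsOnly("+1 (555) 010-9999"));                   // 15550109999
        System.out.println(isWord("user_01"));                                 // true
        System.out.println(isWord("user-01"));                                 // false: '-' is not a word character
    }
}
```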

Characters:

Characters	Description
x	The character x
`\\`	The backslash character
`\0`n	The character with octal value `0`n (0 <= n <= 7)
`\0`nn	The character with octal value `0`nn (0 <= n <= 7)
`\0`mnn	The character with octal value `0`mnn (0 <= m <= 3, 0 <= n <= 7)
`\x`hh	The character with hexadecimal value `0x`hh
`\u`hhhh	The character with hexadecimal value `0x`hhhh
`\t`	The tab character (`'\u0009'`)
`\n`	The newline (line feed) character (`'\u000A'`)
`\r`	The carriage-return character (`'\u000D'`)
`\f`	The form-feed character (`'\u000C'`)
`\a`	The alert (bell) character (`'\u0007'`)
`\e`	The escape character (`'\u001B'`)
`\c`x	The control character corresponding to x
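
The escapes above are interpreted by the regex engine itself, so in Java source the backslash is doubled to get it past the compiler. The following sketch checks that the hexadecimal, Unicode and octal forms all denote the same character; EscapeDemo and its matches helper are illustrative names only.

```java
import java.util.regex.Pattern;

class EscapeDemo {
    // True when the regex matches the entire input.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(matches("\\x41", "A"));              // true: hexadecimal 41 is 'A'
        System.out.println(matches("\\u0041", "A"));            // true: four hex digits, resolved by the regex engine
        System.out.println(matches("\\0101", "A"));             // true: octal 101 is decimal 65, 'A'
        System.out.println(matches("a\\tb", "a\tb"));           // true: regex tab escape matches a tab character
        System.out.println(matches("C:\\\\temp", "C:\\temp"));  // true: escaped backslash matches one backslash
    }
}
```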

Quantifier:

Regular Expression	Description	Examples
*	Occurs zero or more times, short for {0,}	X* – finds zero or more occurrences of the letter X, .* – any character sequence
+	Occurs one or more times, short for {1,}	X+ – finds one or more occurrences of the letter X
?	Occurs zero or one time, short for {0,1}	X? – finds zero or one occurrence of the letter X
{X}	Occurs exactly X times; {} gives the number of repetitions of the preceding element	\d{3} – three digits, .{10} – any character sequence of length 10
{X,Y}	Occurs between X and Y times	\d{1,4} – \d must occur at least once and at most four times
*?	? after a quantifier makes it a "reluctant quantifier"; it tries to find the smallest match.
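
The difference between a greedy and a reluctant quantifier is easiest to see side by side. This is a small sketch; QuantifierDemo and firstMatch are hypothetical names, and the sample strings are made up.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuantifierDemo {
    // Returns the text of the first match, or null when nothing matches.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String html = "<b>bold</b>";
        System.out.println(firstMatch("<.+>", html));              // <b>bold</b>  (greedy: longest match)
        System.out.println(firstMatch("<.+?>", html));             // <b>          (reluctant: shortest match)
        System.out.println(firstMatch("\\d{3}", "order 12345"));   // 123
        System.out.println(firstMatch("\\d{1,4}", "order 12345")); // 1234
        System.out.println(firstMatch("X+", "abc"));               // null: at least one X is required
    }
}
```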

A simple example of case-insensitive URL matching using Java regex is given below:
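
A minimal sketch of case-insensitive URL matching, assuming a deliberately simple URL shape (scheme, host, optional port, optional path). The pattern, the class name UrlMatcher and the method isUrl are illustrative assumptions, not a complete URL validator; only Pattern.compile and the Pattern.CASE_INSENSITIVE flag come from java.util.regex.

```java
import java.util.regex.Pattern;

class UrlMatcher {
    // Assumed, simplified URL shape: scheme://host[:port][/path]
    // CASE_INSENSITIVE lets "HTTP", "Http" and "http" all match the scheme.
    private static final Pattern URL = Pattern.compile(
            "^(https?|ftp)://[\\w.-]+(:\\d+)?(/\\S*)?$",
            Pattern.CASE_INSENSITIVE);

    // True when the whole string has the assumed URL shape.
    static boolean isUrl(String candidate) {
        return URL.matcher(candidate).matches();
    }

    public static void main(String[] args) {
        System.out.println(isUrl("HTTP://Www.Liferay.com/web/guest")); // true
        System.out.println(isUrl("https://example.com:8080"));         // true
        System.out.println(isUrl("not a url"));                        // false
    }
}
```

Without the flag, the same pattern would reject the first input because the scheme is written in upper case.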

Friday, 11 May 2012