User:Breinbaas/scratch/regular expression intro
by Will Trillich
Finding data in PostgreSQL using LIKE or ~ (REGULAR EXPRESSIONS):
LIKE
You're probably familiar with the WHERE clause
of the SELECT statement
SELECT somefield FROM table WHERE somefield = somevalue;
For example, to show all customers within the post code "60201" you could try
SELECT * FROM cust WHERE zip = '60201';
But if you also have nine-digit zips stored in the zip field (in the U.S.A. we have standard five-digit postal codes and also a four-digit extension for further geographic resolution) then this search will OMIT customers having more-specific zips such as '60201-9876' or '60201-1234'.
The Percent "%"
Try using LIKE instead of = (EQUALS):
SELECT * FROM cust WHERE zip LIKE '60201%';
With LIKE, your search values can use the percent (%) to tell PostgreSQL that anything can occupy that spot -- one character, twelve symbols or zero digits -- and still satisfy the search.
On Unix or Linux, this is basically the same as the ASTERISK (*) at the command line, when dealing with file names:
# list all files whose names begin with '.bash' ls .bash*
# list all files containing 'out' anywhere in the name ls *out*
# list all file names ending with '.pl' ls *.pl
# list file starting with 'proj', ending with '.c' ls proj*.c
With PostgreSQL using the LIKE operator, use the percent, instead:
-- list all customers within the 47610 postal code SELECT * FROM cust WHERE zip LIKE '47610%';
-- display customers who have 'Corp' in their names SELECT * FROM cust WHERE name LIKE '%Corp%';
-- show customers whose names end in 'LLC' SELECT * FROM cust WHERE name LIKE '%LLC';
-- documents beginning with 'We', ending with 'rica' SELECT * FROM doc WHERE contents LIKE 'We%rica';
Wherever the '%' appears (using the LIKE operator) Postgres allows anything at all to match -- from a lengthy string of text, to one single character, to a zero-length string -- i.e. nothing at all.
...ILIKE 'A%Z' -- matches 'AZ' and 'A to Z' and 'ABC2468XYZ' and 'A@$*Z'
The Underscore "_"
For the LIKE operator, the UNDERSCORE (_) takes on the same meaning as the question mark does to shell operations for Unix and Linux file names:
# list files starting with dot, followed by at least two chars ls .??*
The underscore matches just one character exactly:
SELECT * FROM atable WHERE afield LIKE '_'; -- shows records where afield is exactly one character -- omitting ones where it's blank or has two (or more) -- characters in it)
You can restrict your searches to finding fields of certain lengths this way:
SELECT * FROM cust WHERE zip LIKE '_____' OR zip LIKE '_________'; -- display all five- or nine-character zip codes
Combine it with percent to find fields over a certain length:
SELECT * FROM atable WHERE afield LIKE '___%'; -- display records where afield has three or more characters
Or you can accomodate some language idiosyncracies, as well:
SELECT * FROM activity WHERE venue LIKE 'Theat__'; -- find u.k. 'theatre' and u.s. 'theater'
ILIKE
But what if a customer's name is all capitals, as in 'THE ACME CORPORATION'? The "...LIKE '%Corp%'" won't find it, because uppercase 'O' is not identical to lowercase 'o'!
Finding text independent of uppercase/lowercase is something we often need to do. And so we have the ILIKE operator.
Use ILIKE instead of LIKE when your search should ignore case -- the "I" stands for case-[I]nsensitive:
-- show all customers with 'corp' in the name SELECT * FROM cust WHERE name ILIKE '%Corp%'; -- ignoring whether the field is upper- or lowercase
That'll find 'Bubba Gump Shrimp Corp' and 'ACME CORP' and even 'Amalgamated Switch and Relay corporation' because anything can ('%') precede or ('%') follow the string 'Corp' which can be uppercase or lowercase.
But there's even more power in "regular expressions"! See below...
Notes
Always include some real data to search for!
-- show everything, doing a lot of unnecessary work: SELECT * FROM atable WHERE afield LIKE '%';
It would be rather silly to make PostgreSQL compare every record to see if it matched 'anything goes, in "afield"'. Just say no.
Convert your spoken-language request to a LIKE clause like this:
"afield ENDS WITH avalue" reword => "afield starts with anything, ending with avalue" sql => "... WHERE afield LIKE '%avalue'"
"afield STARTS WITH avalue" reword => "afield starts with avalue, ending with anything" sql => "... WHERE afield LIKE 'avalue%'"
See how that works?
"afield CONTAINS avalue" sql => "... WHERE afield LIKE '%avalue%'"
If you're looking for something at the beginning of a field, that's where your data goes, and you END with the percent. If you're looking for something at the end of a field, put your data there and allow anything at the front by putting the percent there.
ALSO -- if your field is indexed, you'll defeat the index unless you anchor your search to the beginning of the field.
-- find fields starting with 'something' ... afield LIKE 'something%'; -- uses afield's index
The example below isn't anchored at the start of the field, so it does no good to use the index:
-- find fields containing 'pooh' ... afield LIKE '%pooh%'; -- can't use afield's index
Afield must contain 'pooh' BUT it could be anywhere in the field. The index would be useless.
LIMITATION
Okay. Let's say you have a full-name field that has first, middle, and last name all in the one field. Sometimes you'll have a middle name, sometimes you won't. Sometimes the first name will be spelled out, other times it'll be shortened or even abbreviated to just the first initial.
Here's how you'd search that field for "Abraham Lincoln" using the LIKE operator:
SELECT * FROM peron WHERE fullname LIKE 'A%Lincoln';
It'll find 'Abraham Lincoln' and 'A. Lincoln' and 'Abe Lincoln' because anything can go between the initial 'A' and the ending 'Lincoln'.
But -- it'll also match 'Andrew M. Lincoln' and 'Abner Sasquatch Lincoln' as well... and this is why some bright souls created regular expressions!
REGULAR EXPRESSIONS
A regular expression is a pattern to search for.
The structure of the pattern has to match a
rigid set of rules so that PostgreSQL will know
what you're trying to search for.
This introduction will barely scratch the surface, but hopefully it'll get you started--
If you've ever tinkered with perl, you've probably encountered regular expressions:
perl -ne 'print if /[A-Za-z]/;' somefile
Here, the "[A-Za-z]" is a regular expression. It matches any line containing uppercase (A-Z) or lowercase (a-z) letters. In this example, the lines will come from the file "somefile" and lines that match will be printed out.
Well, PostgreSQL can do that, too!
Example
SELECT * FROM person WHERE fullname ~ 'A(be|braham|\\.) *Lincoln';
The REGULAR EXPRESSION operator is the TILDE (~). The value you're searching for then must be a valid regular expression. In the above example, we're asking PostgreSQL to display all rows containing
'A'
followed by EITHER
'be'
OR
'braham'
OR
'\\.' -- a dot
which is then followed by
' ' -- a space '*' -- zero or more times
followed by
'Lincoln'
and that's all. Whew!
As you can guess, these will all successfully match:
'A. Lincoln' 'AbeLincoln' -- without any space at all 'Abraham Lincoln' -- with lots of spaces
and these won't match:
'A Lincoln' -- no dot, no 'braham', no 'be' 'Abe Gump Lincoln' -- nothing will match 'Gump'
Quoting "\\"
So what's with the BACKSLASH in front of the dot? Well, just as LIKE has the UNDERSCORE (_) to denote "any single character", REGULAR EXPRESSIONS use the DOT (.) for that very same purpose. So we have to "escape" the dot to alter its normal interpretation, using the.
Note that string literals in PostgreSQL already use the backslash, so you'll have to double any backslash you use:
psql=# SELECT * FROM person psql-# WHERE fullname ~ 'A(be|braham|\\.) *Lincoln' psql-# -- a second backslash added psql-# ;
Grouping "()"
How about the PARENTHESES () and the BAR |, hmm? Well, that's how you can group permitted alternatives:
... ~ 'A(be|braham|\\.)'
That says that 'A' can be followed by any of the three sub-expressions within the parentheses, namely 'be' or 'braham' or <dot>.
These all produce the same matches:
... ~ 'A(be|braham|\\.)' ... ~ 'Abe|Abraham|A\\.' ... ~ 'A(b(e|raham)|\\.)'
Let's dissect this last one -- it specifies that:
'A'
can be followed by either
'b' followed by some more stuff
or
<dot>
nestling into the details following the 'b' -- if 'A' is followed by 'b', then the 'b' must be followed by either
'e' -- which matches 'Abe'
or
'raham' -- which matches 'Abraham'
Do you see how powerful this can be?
Character Classes "[]"
So you're trying to find "McAndrews" -- but maybe it's spelled with a lower-case "a" as in "Mcandrews", instead!
SELECT * FROM person WHERE fullname ~ 'Mc[Aa]ndrews';
Using the BRACKETS tells the regular expression parser that you're allowing any one of a whole class of characters in that spot.
'[Aa]'
It specifies that you're looking for either an uppercase 'A' or a lowercase 'a' in that spot.
Here's how you can find fields containing vowels:
... ~ '[AEIOUaeiou]'
Search for fields containing lowercase letters:
... ~ '[abcdefghijklmnopqrstuvwxyz]'
Of course, there's a shortcut for specifying character classes that cover a whole range:
... ~ '[a-z]' -- also matches any lowercase letter
Display fields that contain digits:
... ~ '[0-9]'
Here's an example using a standard U.S. phone pattern (neglecting the area code for clarity and space):
SELECT * FROM cust WHERE descr ~ '[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]';
That will show customers whose "descr" field contains a pattern of digits that looks like a U.S. phone number:
'[0-9][0-9][0-9]' -- three digits
followed by
'-' -- a hyphen
followed by
'[0-9][0-9][0-9][0-9]' -- four digits
As you can see, the hyphen doesn't do anything special for a regular expression unless it's inside the square brackets of a character class -- in which case it means "anything between".
What if you want to allow a hyphen within a character class? Simply make it the first character inside the brackets:
'[-.,_]'
That class will match any hyphen, dot, comma or underscore.
Bounds "{}"
Rather than having to specify each of those [0-9] monstrosities every time, let's abbreviate. Instead of:
'[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]'
we can just say
'[0-9]{3}-[0-9]{4}'.
The BOUNDS specification, inside braces, comes immediately after whatever you want a certain number of. That is, first you specify what you're looking for, then you specify how many you need. (The default is {1,1} one, as you can tell: one 'A' followed by one 'b' followed by...)
Inside the BRACES that specify your "bounds" you can say you want exactly so many, as we did above (3 and then 4) or you can give a low-to-high pair:
... ~ 'Z{3}' -- need three Z's ... ~ '@{5,}' -- five or more @'s ... ~ 'Q{2,5}' -- at least two, at most five, Q's ... ~ '(This|That){0,3}'
The last one allows for 'This' or 'That', anywhere from zero to three times. So 'ThisThatThis' and will match! Convenient, hmm?
And for very common cases, there are handy abbreviations:
'x{1,1}' 'x' one (the default) 'r{0,1}' 'r?' zero or one, i.e. 'x' is optional 'B{0,}' 'B*' zero or more 'z{1,}' 'z+' one or more
These three abbreviations will pop up often:
-- allow 'Ms' and 'Mrs' (the 'r' is optional): ... title ~ '(Mr|Mr?s|Dr)'
-- maybe there's a space, maybe there isn't: ... lname ~ 'Mac ?Affee'
-- don't care how many trailing spaces: ... lname ~ 'Smith *'
-- find records with two or more even digits: ... afield ~ '[02468]{2,}'
or
... afield ~ '[02468][02468]+'
That last one breaks down to
'[02468]' -- an even digit
followed by
'[02468]' -- an even digit '+' -- one or more times
Remember: the plus means '{1,}' -- i.e. that the preceeding item must match one or more times.
Getting back to our phone number pattern... this is the same request as above, but it's easier to see what's going on, using the {bounds} feature:
SELECT * FROM cust WHERE descr ~ '[0-9]{3}-[0-9]{4}'; -- three digits, hyphen, four digits
Excluding Characters "[^]"
Note that the above pattern will also match 00000-0000 because the last three digits of the five in the left chunk, with the four digits of the right chunk, fit perfectly into the constraints of what we asked for. And this ain't no phone number.
Here's an approach to fixing that snag:
SELECT * FROM cust WHERE descr ~ '[^0-9][0-9]{3}-[0-9]{4}';
Note that we've prefixed the previous pattern with
'[^0-9]'
because, within the square brackets of a character class, the CARAT (^) means "anything EXCEPT..."
So now, 00000-0000 will NOT match; this is what we're after.
But there's a new problem: if the phone number is the very first thing in the "descr" field it would never match because we're demanding that there be SOMETHING (besides a digit) before the phone number. What to do?
At The Very Beginning "^"
You can match "beginning-of-field" with the CARAT (^). (When it's the first thing inside square brackets, it negates the character class; outside brackets, a carat means "beginning-of-text". Got that? It's important!)
SELECT * FROM cust WHERE descr ~ '(^|[^0-9])[0-9]{3}-[0-9]{4}';
Here's what this pattern looks for:
Either
'^' -- at the very beginning of text
or
'[^0-9]' -- find something that's a non-digit
followed by
'[0-9]{3}' -- three digits
and then a
'-' -- hyphen
and finally
'[0-9]{4}' -- four digits
and we've solved the 'at-beginning-of-field' problem.
At The Very End "$"
Still, we could inadvertently match something like 000-000000000 (probably an id number for a replacement part) which we don't want to do.
Note how this is the same problem, for the end of the field, that we had earlier, with the beginning of the field? The solution is similar:
SELECT * FROM cust WHERE descr ~ '(^|[^0-9])[0-9]{3}-[0-9]{4}($|[^0-9])';
The DOLLAR SIGN ($) represents 'end-of-text' just as carat, outside of square brackets, means 'beginning-of- text'. The above pattern is the same as before, except we appended
'($|[^0-9])'
to the end, which means we're looking for
either
'$' -- at the very end-of-text
or
'[^0-9]' -- something that's not a digit
after our previous pattern.
And, finally, we have a comprehensive phone pattern -- it'll match ###-#### while excluding other numeric arrangements -- in a text field.
Anchoring
Did you notice that regular expressions aren't "anchored", as the LIKE expressions are? To specify that you're looking only at the beginning of a field, you need to use '^' and to include the end of a field, you must use '$'.
It's a good thing regular expressions aren't anchored automatically -- we wouldn't have been able to specify our phone number pattern above! Things like '(^|[^0-9])' are possible because YOU get to specify when and how beginning-of-text is required. Cool, eh?
And, just as in the LIKE operator, unless regular expressions ARE anchored at the beginning of a field, you'll defeat any index you have for that field. Indexes help alphabetize by comparing the beginning of your fields to each other; unless you're looking at the beginning of the field, your index can't be used.
Here are some comparisons between operators LIKE and ~ :
-- list all customers within the 47610 postal code: SELECT * FROM cust WHERE zip LIKE '47610%'; -- zip begins with '47610' then has anything SELECT * FROM cust WHERE zip ~ '^47610'; -- zip begins with '47610'
-- display customers who have 'Corp' in their names SELECT * FROM cust WHERE name LIKE '%Corp%'; -- name contains 'Corp' with anything before and after SELECT * FROM cust WHERE name ~ 'Corp'; -- name contains 'Corp'
-- show customers whose names end in 'LLC' SELECT * FROM cust WHERE name LIKE '%LLC'; -- name can have anything, with 'LLC' at the end SELECT * FROM cust WHERE name ~ 'LLC$'; -- name must have 'LLC' at the end
-- documents beginning with 'We', ending with 'rica' SELECT * FROM doc WHERE contents LIKE 'We%rica'; -- starts with 'We', has anything, ending with 'rica' SELECT * FROM doc WHERE contents ~ '^We.*rica$'; -- start with 'We', zero or more chars, end with 'rica'
Remember that in regular expressions, DOT means 'any character'. Thus '.*' means 'any character, zero times or more' which, in English, means "anything can go here, including nothing at all".
There are ways to anchor your searches to word boundaries, as well -- not just beginning-of-field and end-of-field. See your documentation for details.
Case-Insensitive "~*"
If you're not worried about differentiating between uppercase and lowercase in your regular expressions, you could go full boar and try
SELECT * FROM cust WHERE lname ~ '[Oo][Cc][Tt][Aa][Vv][Ii][Aa][Nn]';
Fortunately there's a quickie to make it easier for you -- where you've been using the tilde (~) as your operator, use tilde-star (~*) instead:
SELECT * FROM cust WHERE lname ~* 'Octavian';
ILIKE is case-insensitive (where LIKE is case-specific) just as ~* is case-insensitive for regular expressions (where ~ is case-specific). Very handy!
Example
Email addresses can look like any of these:
m...@my.net someone-unimportant9...@this.little.org first.last.ti...@obscure.sub-net.biggie.com _wei...@somewhere.out_there.net
A reasonably-functional email address pattern might be something like this:
'[a-z0-9_]+([\\-\\.][a-z0-9_]+)*@[a-z0-9_]+([\\-\\.][a-z0-9_]+)+'
That breaks down to (using case-indifferent via ~*, of course):
'[a-z0-9_]'
any alphanumeric (including underscore) character
'+'
one or more times, followed by
'([\\-\\.][a-z0-9_]+)'
email stuff [1],
'*'
zero or more times, followed by
'@'
which is then followed by
'[a-z0-9_]'
alphanumerics (including underscore)
'+'
one or more times, followed by
'([\\-\\.][a-z0-9_]+)'
site stuff [2]
'+'
one or more times.
As for [1] email stuff, it can be (zero or more of):
'[\\-\\.]'
hyphen or dot, followed by
'[a-z0-9_]'
alphanumerics (or underscore)
'+'
one or more times.
And as for [2] site stuff, it can be (one or more of):
'[\\-\\.]'
hyphen or dot, followed by
'[a-z0-9_]'
alphanumerics (or underscore)
'+'
one or more times.
So the personal part of the address can just be a single word, with optional dot-or-hyphen joining other words (ZERO or more) onto it; this is followed by at-sign, which is followed by the site portion of the address, namely: any word, followed by ONE or more dot-or-hyphen- followed-by-another-word combo's.
It's not a perfect match for all legal email patterns, (for example, bad@my-addr would match) but something like it might be sufficient depending on your requirements.
CONCLUSION
Regular expressions are complicated, because they're powerful! (Or is it the other way around?)
Check out "Pattern Matching" in the PostgreSQL manual for more details -- on my Debian "Potato" system it's in html/postgres/functions-matching.html under the /usr/share/doc/postgresql-doc/ directory.