Culturally aware initcap

From PostgreSQL wiki

Revision as of 15:11, 7 August 2010 by Sternocera (Talk | contribs)

(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)
Jump to: navigation, search

Snippets

Culturally aware initcap

Works with PostgreSQL

Any version

Written in

SQL

Depends on

Nothing


Written by Peter Geoghegan ("sternocera")


The following is an implementation of what I call "culturally aware initcap", which is usable as a drop-in replacement for initcap, provided input is in English. It solves a number of problems I've often encountered when using regular initcap, particularly with possessive apostrophes, but also with various common contractions (isn't, can't, won't, I'm, you've, you're). It does this without stepping on the toes of people with Irish names like O'Shaughnessy and O'Sullivan.

with regular initcap: SELECT initcap($$O'SULLIVAN'S friend DOESN'T LIVE HERE anymore$$)

output: "O'Sullivan'S Friend Doesn'T Live Here Anymore"

with cul_initcap: SELECT cul_initcap($$O'SULLIVAN'S friend DOESN'T LIVE HERE anymore$$)

output: "O'Sullivan's Friend Doesn't Live Here Anymore"

CREATE OR REPLACE FUNCTION cul_initcap(input_val text) RETURNS text AS
$function_body$
 
  SELECT 
    REPLACE(
      REPLACE(
        REPLACE(
          REPLACE(
            REPLACE(
              REPLACE(
                REPLACE(
                  regexp_replace(initcap($1), 
                      $$'(([MSTD]|Ve|Re|Ll))([^[:upper:][:lower:]]|$)$$, 
                      $$'{@*#!\1!#*@}\3$$, 
                      'g'
                      )
                  , '{@*#!M!#*@}', 'm')
                , '{@*#!S!#*@}', 's')
              , '{@*#!T!#*@}', 't')
            , '{@*#!D!#*@}', 'd')
          , '{@*#!Ve!#*@}', 've')
        , '{@*#!Re!#*@}', 're')
      , '{@*#!Ll!#*@}', 'll');
 
$function_body$
LANGUAGE 'sql' IMMUTABLE STRICT;

One notable limitation of my implementation is that it will incorrectly convert "magical brace enclosed literals" into their corresponding output values: '{@*#!T!#*@}' ,'{@*#!S!#*@}', '{@*#!M!#*@}', '{@*#!Ve!#*@}' (and the rest) become 't', 's', 'm' and 've' respectively. I doubt that this is much of a concern generally. If it is a concern, use more elaborate magical braces that are even less likely to occur in strings that exist in production.

Personal tools