Soundex

From PostgreSQL wiki

Revision as of 13:59, 9 February 2012 by Intgr (Talk | contribs)

(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)
Jump to: navigation, search

Library Snippets

Compute string soundex

Works with PostgreSQL

Tested on 9.0, 9.1. Probably works in earlier versions

Written in

PL/pgSQL

Depends on

Nothing


Soundex is a simple algorithm for comparing/normalizing English words based on phonetic similarity.

PostgreSQL 8.3 and later include a soundex(text) function in the fuzzystrmatch extension. Note that this function behaves differently from some other implementations, as it does not skip similar consonants that are separated by 'w' and 'h'.

PL/pgSQL version

If you can't install the fuzzystrmatch extension (due to limited privieges or other reasons), here is an alternative version in PL/pgSQL. This function behaves identically to the fuzzystrmatch implementation. (Tested on PostgreSQL 9.0 and 9.1, using dictionaries in many different languages).


CREATE OR REPLACE FUNCTION soundex(input text) RETURNS text
IMMUTABLE STRICT COST 500 LANGUAGE plpgsql
AS $$
DECLARE
  soundex text = '';
  char text;
  symbol text;
  last_symbol text = '';
  pos int = 1;
BEGIN
  WHILE length(soundex) < 4 LOOP
    char = upper(substr(input, pos, 1));
    pos = pos + 1;
    CASE char
    WHEN '' THEN
      -- End of input string
      IF soundex = '' THEN
        RETURN '';
      ELSE
        RETURN rpad(soundex, 4, '0');
      END IF;
    WHEN 'B', 'F', 'P', 'V' THEN
      symbol = '1';
    WHEN 'C', 'G', 'J', 'K', 'Q', 'S', 'X', 'Z' THEN
      symbol = '2';
    WHEN 'D', 'T' THEN
      symbol = '3';
    WHEN 'L' THEN
      symbol = '4';
    WHEN 'M', 'N' THEN
      symbol = '5';
    WHEN 'R' THEN
      symbol = '6';
    ELSE
      -- Not a consonant; no output, but next similar consonant will be re-recorded
      symbol = '';
    END CASE;
 
    IF soundex = '' THEN
      -- First character; only accept strictly English ASCII characters
      IF char ~>=~ 'A' AND char ~<=~ 'Z' THEN
        soundex = char;
        last_symbol = symbol;
      END IF;
    ELSIF last_symbol != symbol THEN
      soundex = soundex || symbol;
      last_symbol = symbol;
    END IF;
  END LOOP;
 
  RETURN soundex;
END;
$$;

Example

db=# select column1, soundex(column1) from (values ('Robert'), ('Rupert'), ('Rubin'), ('Ashcroft')) data;
 column1  | soundex 
----------+---------
 Robert   | R163
 Rupert   | R163
 Rubin    | R150
 Ashcroft | A226

db=# create table words(word text);
db=# copy words from '/usr/share/dict/words';
db=# create index on words(soundex(word));
db=# select * from words where soundex(word)=soundex('hotel');
  word  
--------
 hotel
 hotly
 huddle
Personal tools