Truncate text by byte

From PostgreSQL wiki

Jump to: navigation, search

Library Snippets

Truncate UTF-8 Text by byte width

Works with PostgreSQL

Any version

Written in

PL/pgSQLl

Depends on

Nothing


Several different ways to truncate a String/Text that is encoded in UTF-8 or other variable encoding method to specified byte width:



Method 1: author Laruenz Albe is an SQL implementation generating a record set containing all possible byte lengths to character length in a table then filters the result down

CREATE OR REPLACE FUNCTION get_prefix (
   string text,
   max_bytes BIGINT
) RETURNS text
   LANGUAGE SQL STRICT AS
$$SELECT p
FROM (SELECT p.p, octet_length(p.p) AS len
      FROM generate_series(0, LENGTH($1)) AS len
         CROSS JOIN LATERAL substr($1, 1, len.len) AS p) AS q
WHERE len <= $2
ORDER BY len DESC
LIMIT 1$$;



Method 2: Author Zsheep is an PLpg/SQL implementation, iterating over the string 1 character at a time moving left to right until the string byte width is approximately the input width

CREATE OR REPLACE FUNCTION max_bytea_length(pchars text, bytea_length int)
RETURNS bytea
LANGUAGE plpgsql
 
COST 100
VOLATILE 
AS $BODY$
 
DECLARE 
_i int;
_length_of_chars int;
_newchars text;
_testchars text;
BEGIN
 
IF octet_length(pchars::bytea) <= bytea_length THEN
   RETURN pchars::bytea;
END IF;
_i = LEAST( octet_length(pchars)-4, bytea_length-4);
_length_of_chars =  char_length(pchars);
LOOP 
   _newchars= left(pchars, _i);
    _testchars = left(pchars, _i+1); 
  IF octet_length(_testchars::bytea) > bytea_length OR _i = _length_of_chars  THEN
     RETURN _newchars::bytea;
  END IF ;
  _i = _i+1;
END LOOP ;
 
END;
$BODY$



Method 3: Author Daniel Veritte is an SQL function that inspects the UTF-8 byte sequence looking for the high order bytea that is <= max byte width.

CREATE FUNCTION utf8_truncate(str text, len int) returns text
AS $$
SELECT CASE WHEN octet_length(str) <= len THEN str
ELSE (
   WITH bstr(s) AS (SELECT convert_to(str, 'UTF-8'))
   SELECT
   CASE
   WHEN len>=1 AND (get_byte(s, len) & 192) <> 128
   THEN convert_from(substring(s, 1, len), 'UTF-8')
   ELSE
     CASE
     WHEN len>=2 AND (get_byte(s, len-1) & 192) <> 128
     THEN convert_from(substring(s, 1, len-1), 'UTF-8')
     ELSE
       CASE
       WHEN len>=3 AND (get_byte(s, len-2) & 192) <> 128
       THEN convert_from(substring(s, 1, len-2), 'UTF-8')
       ELSE
         CASE
         WHEN len>=4 AND (get_byte(s, len-3) & 192) <> 128
         THEN convert_from(substring(s, 1, len-3), 'UTF-8')
         ELSE ''
         END
       END
     END
   END
 FROM bstr)
   END;
$$ language SQL strict immutable parallel safe;


Examples of usage

SELECT get_prefix('abc€', 4);
 
SELECT max_bytea_length('i ♥ u function changed it to be faster', 56);
 
SELECT utf8_truncate('method 3 €', 9);
Personal tools