Improving PL/PgSQL (September 2014)

From PostgreSQL wiki
Jump to: navigation, search

This page currently contains all the changes I (Marko Tiikkaja) would like to see changed in PL/PgSQL if breaking backwards compatibility was allowed, in response to the thread started here: http://www.postgresql.org/message-id/CAASwCXchFh_SOrMtteQszoNnhQdU9v-QysUQ5ToFBEcLrw4F-Q@mail.gmail.com . It's not supposed to be an exhaustive list, but it probably has the ones I'm most annoyed with on a daily basis.

SELECT .. INTO vs. TOO_MANY_ROWS

The current version of PL/PgSQL has an implicit LIMIT 1 in all SELECT .. INTO queries. This can be considered to be problematic for a number of reasons:

1. Enforcing the "0 or 1 rows" behaviour gets difficult. Probably the least horrible way is to SELECT count(*) OVER() into a separate variable and check its value after the query, but that gets really ugly.

2. If the user is using SELECT .. INTO, he's likely expecting only a single row back, so enforcing that constraint is the sensible default. Also, going from that default to the other one is trivial; just add 'LIMIT 1'. That also serves as a documentation to the person reading your code a year later. Additionally, as pointed out above, going from the current default to the other one is painful.

3. INSERT/UPDATE/DELETE .. RETURNING .. INTO ..; works differently. The inconsistency is dismissed in the documentation on the basis that there's no ORDER BY for INSERT, UPDATE or DELETE, but it's not difficult to argue that to be nonsense considering that SELECT also has LIMIT when that's the required behavior.

The suggested solution is to make INTO raise a TOO_MANY_ROWS error if a SELECT returns more than one row instead of choosing the "first" one. If the old behaviour is desired, add a LIMIT 1 (and consider adding an ORDER BY, too).

SELECT .. INTO and the number of result columns

With the PL/PgSQL of today, this function would run successfully:

 CREATE FUNCTION foof() RETURNS VOID AS $$
 DECLARE
 _f1 int;
 _f2 int;
 BEGIN
 SELECT 1, 2 INTO _f1;
 SELECT 1 INTO _f1, _f2;
 SELECT a b INTO _f1, _f2 FROM foo; -- note the missing comma
 
 FOR _f1 IN SELECT 1, 2 LOOP
 END LOOP;
 
 END $$ LANGUAGE PLPGSQL;

Even though all of the statements are clearly wrong in some way.

The suggested fix is to make PL/PgSQL raise an error if the number of columns returned by the query doesn't match the number of expressions in the INTO (or FOR .. IN <query>) target.

Though it's not clear whether that's a good idea for known row types or not. If the rowtype of a table is used, updating the table structure might result in errors for concurrently running transactions.

Variable assignments

Currently when doing assignments to variables, either via the var := expr; syntax or SELECT .. INTO, PL/PgSQL does no type checking. If the text representation of expr is valid input syntax for the type of var, the conversion is done automatically.

For example, consider the following function:

 CREATE FUNCTION foof() RETURNS VOID AS $$
 DECLARE
 _date DATE;
 BEGIN
 _date := (SELECT intcol FROM tbl);
 RAISE NOTICE '%', _date;
 END $$ LANGUAGE PLPGSQL;
 =# create table tbl as select 20040101 as intcol;
 SELECT 1
 =# select foof();
 NOTICE:  2004-01-01
 CONTEXT:  PL/pgSQL function "foof" line 6 RAISE
  foof 
  ------
   
  (1 row)

If the integers happen to be in the correct range, this might not break during testing, but probably will at some point over the life time of the application.

The suggested solution is that if there isn't an implicit or an assignment cast from the type of the expression to the variable's type, a runtime error will be thrown.

Single-row operations

Our observation at Trustly has been that in our code base, nearly all INSERTs, UPDATEs and DELETEs should affect exactly one row or something's wrong (could be a bug in the query itself or invalid data in one or more of the related tables). We feel that it's very important that execution halts immediately if some assumptions of the code have been violated. This both helps people catch bugs earlier during testing (since you get an obvious error message instead of data which subtly breaks some assumptions elsewhere in the code), but also helps in production should someone commit code with a bug in it.

Our solution so far has been to use RETURNING TRUE INTO STRICT _OK; a lot, but that's ugly, somewhat painful to write over and over, and requires a special variable _OK (declared at the beginning of the function, of course). In the past, a syntax syntax was proposed which looked roughly like this:

 STRICT UPDATE foo SET ..;

but that didn't seem to catch on.

The exact choice of keyword does not matter all that much (though RETURNINGTRUEINTOSTRICT_OK would likely not be the first choice), but at least a few people feel that something should be done here to improve the situation.

EXECUTE and FOUND

This would probably also be a good chance to make EXECUTE set FOUND. This inconsistency has caused numerous bugs and hours of head scratching.

GET DIAGNOSTICS ROW_COUNT

The current way to get the number of rows affected by a statement looks like this:

 CREATE FUNCTION foof() RETURNS VOID AS $$
 DECLARE
 _row_count int;
 BEGIN
 UPDATE foo SET bar = 1;
 GET DIAGNOSTICS _row_count = row_count;
 RAISE NOTICE 'updated % rows', _row_count;
 END $$ LANGUAGE PLPGSQL;

Which is problematic because the syntax is unintuitive and requires a special variable to hold the row count (remember that variables can only be declared at the beginning of a block; usually that would be the beginning of the function).

The suggested solution is to have a special variable, ROW_COUNT, which behaves similarly to FOUND. So the code above would look like this instead:

 CREATE FUNCTION foof() RETURNS VOID AS $$
 BEGIN
 UPDATE foo SET bar = 1;
 RAISE NOTICE 'updated % rows', row_count;
 END $$ LANGUAGE PLPGSQL;

OUT parameters

To avoid conflicts between column names and PL/PgSQL variables, it's a relatively common practice to prefix them with something (e.g. v_, or just _). However, doing the same for OUT parameters would make for some ugly looking code, consider:

 SELECT foo.bar, bar._bar, bar._baz
 FROM foo
 LATERAL bar(foo.id)
 WHERE bar._baz = 1;

so a common practice is not to prefix them with anything, which means that variable conflicts with OUT parameters can for some applications be relatively common inside the function body. To disambiguate, you can use the name of the function:

 CREATE FUNCTION foof(_fooid int)
 RETURNS TABLE (bar int, baz int)
  AS $$ BEGIN
 SELECT tbl.bar, tbl.baz INTO foof.bar, foof.baz
 FROM tbl
 WHERE tbl.fooid = _fooid;
 RETURN NEXT;
 END $$ LANGUAGE plpgsql;

However, this can be seen as problematic for a couple of reasons:

1. If the function name is long, trying to format the SQL in a nice way is hard, if not impossible.

2. If you rename the function, you also have to rename all references.

This pain could be alleviated with having a special OUT alias, which would only allow references to the OUT parameters of the current function. This would also allow making some code more readable, for example:

 CREATE FUNCTION foof(OUT bar int, OUT baz int)
 RETURNS RECORD
 AS $$
 DECLARE
 _f1 int;
 BEGIN
 
 SELECT tbl.bar, tbl.f1
 INTO OUT.bar, _f1
 FROM tbl;
 ASSERT(_f1 < 0);
 OUT.baz := 1;
 
 END $$ LANGUAGE plpgsql;

It's clear which variables affect the row returned by the function, and which ones don't, and you avoid the problem of conflicts altogether.

Assertions

Assertions are one way of improving the robustness of a code base, as I'm sure Postgres developers are aware. In the past a suggestion was made to add assertions into PL/PgSQL. The original proposal was:

1. Simple syntax: ASSERT <expr>; If <expr> evaluates to FALSE, an exception is raised.

2. A GUC to turn off assertions with zero run-time overhead. This allows heavy use of assertions in the code without taking a performance penalty in the production environment (similarly to how Assert() works in the PostgreSQL source code). A mechanism was also provided to override the global default on a per-function basis.

However, it was met with some criticism:

1. It requires making ASSERT an unreserved keyword.

2. It only supports one "severity level".

3. It doesn't support custom error messages.

More discussion on -HACKERS is probably necessary. Adding support for custom error messages could look natural if it was done this way, for example:

 RAISE <expr>, 'error message: %, %', param1, param2;

but it's not clear what specifying the severity level would look like (or whether being able to do that was even considered desirable).