Referential Integrity Tutorial & Hacking the Referential Integrity tables
by Joel Burton
Last updated 3rd September 2001
Referential integrity is the feature of a database ensuring implied relationships in the database are enforced. It is a feature of most database systems, and protects users from accidentally (or intentionally!) creating discrepencies in their database.
In this chapter, we'll work with a sample set of data involving people, tasks they are responsible for, and appointments you have scheduled with them. As you fire employees, you delete them from your person table, but want to be certain you can't leave tasks no one is in charge of, or leave appointments where the persons name cannot be located.
If you're not familiar with the concepts of Primary Keys and Foreign Keys, it might be helpful to review a basic database primer, as this tutorial assumes basic knowledge of these concepts.
Lets have some fun!
First, because we'll be playing with the system tables, it's best to do this in a practice database :
# CREATE DATABASE RI_test; # \c ri_test
Now, let's create practice tables. We'll have one parent table, 'pers', and two child tables, 'tasks' and 'appts'.
The parent table is straightforward :
# CREATE TABLE pers (pid INT NOT NULL PRIMARY KEY, pname TEXT NOT NULL);
The first child table will hold tasks for which a person is responsible :
# CREATE TABLE tasks (taskid SERIAL NOT NULL PRIMARY KEY, pid INT NOT NULL CONSTRAINT tasks__ref_p REFERENCES pers, task TEXT NOT NULL);
A few notes :
- Foreign key references are handled by PostgreSQL as a type of CONSTRAINT. Therefore, we can begin the referential declaration with CONSTRAINT <name of constraint>. If this is done, the error messages generated by PostgreSQL will return this error name, which the user could helpfully see, or we could parse and handle in a front-end system. By wrapping it in system identifiers ("), we could even make it look like an error message (i.e. CONSTRAINT "Table tasks refers to table pers"), but we are limited in the length of this name/message, and therefore it's usefulness. CONSTRAINT names are not neccessary, and we could leave this off (i.e. ...pid INT NOT NULL REFERENCES pers...). Everything still works the same, except our constraint is unnamed, and therefore the error messages are more generic.
- After REFERENCES is the name of the table we are referring to, and optionally, the name of the field in parentheses. If the field is the primary key, the fieldname is optional. If we had not made the 'pid' field the primary key for 'pers', we would have to use 'REFERENCES pers(pid)'.
- Optionally, we can inform PostgreSQL how to handle deletes and updates on table 'pers' by adding ON DELETE and ON UPDATE declarations. If these are not made specific, then PostgreSQL defaults to "no action", which (at the time of this writing, 12/2000) means the same thing as "restrict". This means we can neither delete a parent nor update the parent 'pid' field if a child exists who depends on the parent. In our other child table, we'll see another way to handle this.
- Optionally, we can tell PostgreSQL if this is "deferrable", that is, if we are allowed to violate referential integrity while inside of a transaction as long as we have fixed our integrity problem at the end of the transaction. We'll see examples of this below, in DEFERRING.
So, let's add some sample data :
INSERT INTO pers VALUES (1, 'Jeff Brown'); INSERT INTO pers VALUES (2, 'Maria Lane'); INSERT INTO tasks (pid, task) VALUES (1, 'Write contract'); INSERT INTO tasks (pid, task) VALUES (1, 'Upgrade database');
So far so good.
If we try to insert a child that has no parent, i.e. :
INSERT INTO tasks (pid, task) VALUES (3, 'Install Linux');
we'll get a referential integrity error. This error will refer to our constraint name (if any), and will block the entering of this data into 'tasks'. In addition, if this is part of a transaction, it will rollback the entire transaction, as always happens in PostgreSQL.
Also, if we try to change a parent that has children :
DELETE FROM pers WHERE pname = 'Jeff Brown';
We'll get the same problem, as we cannot delete Jeff while he has tasks assigned.
This will work, as there are presently no child tasks for Maria :
DELETE FROM pers WHERE pname = 'Maria Lane';
If you did the above step, you'll need to add Maria back for the examples further on :
INSERT INTO pers VALUES (2, 'Maria Lane');
Default (NO ACTION) or RESTRICT - which mean the same thing to PostgreSQL - ON UPDATE and ON DELETE rules make sense for many situations like this. You wouldn't want to be able to delete a staff person from your database if they had certain resposibilities, otherwise you would never know who was in charge of different tasks.
A More Permissive Child Table : Appointments
Let's create a second child, 'appts' (short for Appointments) :
# CREATE TABLE appts (apptid SERIAL NOT NULL PRIMARY KEY, pid INT NOT NULL CONSTRAINT appt__ref_pers REFERENCES pers ON UPDATE CASCADE ON DELETE CASCADE DEFERRABLE, apptsubj TEXT NOT NULL, apptdate DATE NOT NULL);
and insert some data into it :
INSERT INTO appts (pid, apptsubj, apptdate) VALUES (1, 'Discuss raise', '2001-01-15'); INSERT INTO appts (pid, apptsubj, apptdate) VALUES (2, 'Plan project', '2001-01-20');
This works just fine. Also, just like before :
INSERT INTO appts (pid, apptsubj, apptdate) VALUES (3, 'Termination', '2001-01-20');
does not work, as there is no pid = 3 in the 'pers' table.
However, unlike the 'tasks' table, 'appts' is set to CASCADE deletes and updates. This means if we update a person's pid, or delete a person entirely, PostgreSQL will allow and assist this by first deleting or updating any tasks which exist for the person.
UPDATE pers SET pid=14 WHERE pid=2;
This will change Maria's pid to 14, first changing the connecting pid in the 'appts' table.
DELETE FROM pers WHERE pid=14;
This will delete Maria's appointment, then delete Maria.
Notice that CASCADE in this class only refers to the CASCADE between the 'pers' and 'appt' tables. Trying to DELETE Jeff would still fail because even though the pers-appt relationship would CASCADE, the pers-tasks relationship would fail, and PostgreSQL would report the DELETE attempt as a violation of this referential integrity. Possible actions
- NO ACTION (the default)
Stops the action if an update or delete would fail referential integrity checks.
Presently means the same thing as NO ACTION.
Deletes the child data, then deletes the parent data, subject to other relationships, as noted above.
- SET NULL
Sets the foreign key field to NULL, then updates or deletes the parent.
- SET DEFAULT
Sets the foreign key field to its DEFAULT value, then updates or deletes the parent.
SET NULL and SET DEFAULT can be useful options, especially for ON DELETE. For example, we might have a table, 'offices', keeping track of which office a worker uses. It might contain fields for officeid, officelocation, and persid. If we wanted to delete a person, we shouldn't be stopped just because the person has a related office, but similarly, we don't want to delete an office only because this person is being deleted. In some cases, the best option may be to set the persid field for 'offices' to NULL or DEFAULT, leaving the office in place, but making it clear this office is now unused.
Note that ON UPDATE and ON DELETE can have different rules. It's very common for instance to ON UPDATE CASCADE but ON DELETE RESTRICT - allowing people to change their ID's, but not allowing deletion of tasks if a related person exists.
It's also possible to change the actions for a relationship, and this requires a little hacking in the system catalog tables. See Hacking Referential Integrity, below.
By default, referential integrity is checked for every single relationship, for every single insert, delete or update which could affect this relationship.
This means :
INSERT INTO tasks (pid, task) VALUES (5, 'Open sales office'); INSERT INTO pers (pid, name) VALUES (5, 'Helen Kim');
would fail, because at the time of the first attempted insert, there is no person with pid=5. Much of the time, this is the most intuitive setting.
However, sometimes you may not be able to predict the exact order data arrives. Perhaps you are receiving data loaded from a text file, or from across the web. It's possible the data may not arrive in the order above (task, then associated person). When this happens, you can choose to defer the transaction checking.
In order to defer a transaction, three things must happen :
- The referential integrity relationship must have been defined as DEFERRABLE. This is not the default (NOT DEFERRABLE is), so it must be declared explicity, as we did for 'appts', above.
- You must be inside an explicit transaction.
- You must either have the initial default for this relationship be DEFERRED, or have SET the relationship constraint to DEFERRED for this transaction.
For the table 'appts' relationship to 'pers', we have declared this as DEFERRED. So, if we do :
BEGIN; SET CONSTRAINTS ALL DEFERRED; INSERT INTO appts (pid, apptsubj, apptdate) VALUES (5, 'Negotiation', '2001-01-20'); INSERT INTO pers VALUES (5, 'Helen Kim'); COMMIT;
This works just fine.
Notice having the relationship declared as DEFERRABLE is not enough - we must also use SET to explicity set CONSTRAINTS to deferred. In this example, we have set all relationships to DEFERRED; instead we could set only a single constraint to deferred, as in :
SET CONSTRAINTS appts__ref_pers DEFERRABLE;
It may be convenient to have a relationship already set, rather than having to set this for every transaction. To do this, add "INITIALLY DEFERRED" to the CONSTRAINT ... REFERENCES declaration for the table.
# CREATE TABLE appts (apptid SERIAL NOT NULL PRIMARY KEY, pid INT NOT NULL CONSTRAINT appt__ref_pers REFERENCES pers ON UPDATE CASCADE ON DELETE CASCADE DEFERRABLE INITIALLY DEFERRED, apptsubj TEXT NOT NULL, apptdate DATE NOT NULL);
Now, we can simply :
BEGIN; INSERT INTO appts (pid, apptsubj, apptdate) VALUES (5, 'Negotiation', '2001-01-20'); INSERT INTO pers VALUES (5, 'Helen Kim'); COMMIT;
With no explicit SET command. This is especially convenient if you work in a programming setting which abstracts SQL commands and makes it difficult to execute an arbitrary, nonstandard SQL command such as "SET...".
Hacking Referential Integrity
Referential integrity works great in PostgreSQL. However, PostgreSQL does not yet have SQL-synax commands to change actions, turn on/off referential integrity, etc.
These actions can be performed by editing the information stored in the system catalog tables directly.
To edit the system catalog tables, you must be a superuser in PostgreSQL. In addition, you should be VERY CAREFUL when editing these tables, and make sure you have a backup first (via pg_dump). An accidental table-wide UPDATE or DELETE could delete all of your tables, ruin your indexes, corrupt your database, etc. Practice this by working in a test database, preferrably on a machine without any other critical databases.
There are several system catalog tables of interest to us :
all "classes" including tables, views, sequences, etc.
all triggers. PostgreSQL handles referential integrity using behind-the-scenes triggers, so this is where all of your referential integrity controls are stored.
all PostgreSQL procedures. We won't need to make any changes to this, but we will use it to show what the referential integrity procedures used actually are.
To see (most) of the system tables, you can use the command \dS in psql.
For example, let's collect information from pg_class about our test tables :
# SELECT oid, relname FROM pg_class WHERE relname IN ('pers','appts','tasks');
PostgreSQL usually lowercases unquoted system identifiers such as tables for us automatically (i.e. I can "CREATE TABLE foo" and "SELECT * FROM FOO" still works.) However, when examining pg_class, you must work in a case-sensitive manner, or use case-insensitive operators.
This query returns :
oid relname 9100 pers 9110 tasks 9120 appts (3 rows)
Your OIDs will be different. Don't worry, but do take notice of what they are.
Now, if we look in pg_trigger, we can find the triggers used by our tables.
- SELECT * from pg_trigger WHERE tgrelid in (9100, 9110, 9120);
Note, the output table from this command was too wide to fit in this HTML page. You can find it here.
The columns in pg_trigger are :
The table the trigger is on. The table which is inserted/updated/deleted and calls the trigger.
Name of the trigger. Trigger names are generated by the referential integrity feature of PostgreSQL and are unimportant, except they are unique.
OID of the function to be is called. This is an important column; by changing the function called, we can change which action (CASCADE, etc.) is performed. See below.
What kind of trigger is this? UPDATE, DELETE, etc.
Is this trigger enabled?
Is this trigger part of a constraint? Non-Referential integrity triggers user-declared by CREATE TRIGGER may be FALSE for this, but referential integrity triggers will always be true.
The name of the constraint which calls this trigger. If you named your constraints (as we did, i.e. pers__ref_tasks), this will be the constraint name, otherwise it will be unnamed.
OID of table which had the constraint set.
Can this constraint be deferred? Equivalent to DEFERRABLE. See above.
Is this constraint initially deferred? Equivalent to INITIALLY DEFERRED. See above.
Number of arguments for the referential integrity function. As of PostgreSQL 7.1, this always seems to be 6, and should not be edited.
The actual arguments to the referential integrity function.
To help make this more understandable and usable, a view is helpful :
CREATE VIEW dev_ri AS SELECT t.oid as trigoid, c.relname as trig_tbl, t.tgfoid, f.proname as trigfunc, t.tgenabled, t.tgconstrname, c2.relname as const_tbl, t.tgdeferrable, t.tginitdeferred FROM pg_trigger t, pg_class c, pg_class c2, pg_proc f WHERE t.tgrelid=c.oid AND t.tgconstrrelid=c2.oid AND tgfoid=f.oid AND tgname ~ '^RI_' ORDER BY t.oid;
This view requires PostgreSQL 7.1 because of the ORDER BY statement. For PostgreSQL versions earlier than 7.1, you should remove the ORDER BY statement.
# select * from dev_ri;
Note, the output table from this command was too wide to fit in this HTML page. You can find it here.
Now, it's much easier to understand what's happening. For example, from trigger oid=263753, we can see for inserts to the 'tasks' table, RI_FKey_check_ins is called, which checks the 'pers' table.
From here, we can :
1) Temporarily disable a trigger. If you want to load lots of data, and not slow down with referential integity checks (and without having to deal with transaction-level deferrment) :
UPDATE pg_trigger SET tgenabled=FALSE WHERE oid=xxx;
Get the trigoid column from our view to find which trigger you want to affect.
Note that pg_dump files do this so table data can be inserted in any order, without having to worry about referential integrity rules.
2) Make a trigger DEFERRABLE, if it orginally wasn't, or vice-versa :
UPDATE pg_trigger SET tgdeferrable=[ TRUE | FALSE ] WHERE oid=xxx;
Or, to make a trigger INITIALLY DEFERRED (or turn this off):
UPDATE pg_trigger SET tginitdeferred=[ TRUE | FALSE ] WHERE oid=xxx;
3) Change the action for a trigger. If you've created a trigger with an action (or with the default NO ACTION action), you can change your mind by changing the function called. To get the list of all referential integrity trigger functions :
SELECT oid, proname FROM pg_proc where proname ~ '^RI_'; oid proname 1646 RI_FKey_cascade_del 1647 RI_FKey_cascade_upd 1644 RI_FKey_check_ins 1645 RI_FKey_check_upd 1654 RI_FKey_noaction_del 1655 RI_FKey_noaction_upd 1648 RI_FKey_restrict_del 1649 RI_FKey_restrict_upd 1652 RI_FKey_setdefault_del 1653 RI_FKey_setdefault_upd 1650 RI_FKey_setnull_del 1651 RI_FKey_setnull_upd
Your OIDs will probably be different. Note and use your own.
In our example, to set updates on pers(pid) to CASCADE, rather than NO ACTION on pers-tasks :
UPDATE pg_trigger SET tgfoid=1647 WHERE oid=xxx;
where xxx is our current noaction_upd trigger for pers-tasks
Note these changes often require a new backend. Quit and restart psql, or reset your client connection, and you should be able to test out your new settings.