DATALINK
This is a design document describing planning for a feature that may not yet exist in PostgreSQL. See the official documentation for your version to get information on available features.
Introduction
SQL Datalinks are a part of SQL/MED ISO/IEC 9075-9:2003 specification. Note that this is distinct from and complements efforts described here.
Datalink is a special SQL type intended to store URLs in database columns, and a number of functions, which can be used in SQL queries.
There aren't many RDBMSs which implement SQL/MED datalinks. IBM DB2 does. A lot of it is defined by the SQL standard to be "implementation specific".
Values of DATALINK type are opaque, constructed by using function DLVALUE. There are several functions available for converting DATALINK back to text, for example DLURLCOMPLETE.
There is not much software, which demands SQL/MED datalinks. However, there is functionality concerning files, URLs and web, which is often needed in modern applications. Often, it is awkward to use these together with RDBMS data. SQL/MED standardizes some of this.
Features
A column containing a value of type DATALINK can optionally have some special semantics:
- referential integrity - file pointed to by a datalink cannot be renamed or deleted
- access control - either SQL or file system mediated
- point in time recovery of file changes
Only references to external files are stored in the database, not the content of the files themselves.
Datalinks as defined by SQL/MED should provide:
- transactional semantics
- checking if file exists
- protection of linked file against renaming or deletion
- read access control through database
- write access control through database
- point-in-time recovery
- deletion of files no longer referenced
- access to files on different servers
The following is not specified by SQL/MED, but is often needed in practice (and probably needed to implement datalink):
- compact URL storage for lots and lots of URLs (say int instead of string; sort-of like enum, but for URLs)
- URL to filesystem mapping, access control to file system hierarchies
- local server file access (read, write)
- remote server file access (get, put)
- extended file system attributes (get, set)
- file meta data (list, get, set)
Use cases
Common use cases include:
- website content management - ensuring that linked-to URLs exist
- workflow management - providing for locking and versioning of files
- configuration - managing configuration files, ie have a change in database recreate text configuration file. This file should be locked and writeable only from database.
- web server log storage (many referer URLs)
- RDF databases (many URLs)
Datalinks should enable you to:
- efficiently store URLs (datalinks are stored as integers, not strings)
- conveniently map between URLs and file system paths
- safely read and write files from SQL (SQL/MED does not specify functions for this)
- keep backups of changed files
- manage files on different servers
- manage datalink metadata (unified interface to extended filesystem attributes)
Specification
Synopsys
create table my_files ( ..., link datalink, .... ) insert into my_files (link) select dlvalue('file:///etc/passwd'); select into url dlurlcomplete(link) from my_files; update my_files set link=dlnewcopy(url,0); update my_files set link=dlpreviouscopy(url,0);
Datalink support functions
Constructors for values of type datalink:
- DLVALUE(url) → datalink (INSERT only)
- DLNEWCOPY(url,tokenp) → datalink (UPDATE only)
- DLPREVIOUSCOPY(url,tokenp) → datalink (UPDATE only)
Functions for extracting information from datalink type:
- DLURLCOMPLETE(datalink) → url
- DLURLCOMPLETEONLY(datalink) → url
- DLURLCOMPLETEWRITE(datalink) → url
- DLURLPATH(datalink) → file path
- DLURLPATHONLY(datalink) → file path
- DLURLPATHWRITE(datalink) → file path
- DLURLSCHEME(datalink) → URL scheme ('HTTP' or 'FILE')
- DLURLSERVER(datalink) → server address
- DLREPLACECONTENT(url, url, comment) → datalink (new in SQL 2008)
Additional functions in another well known RDBMS
- DLVALUE(address,linktype [,comment]) → datalink
- DLCOMMENT(datalink) → text
- DLLINKTYPE(datalink) → link type ('FILE' or 'URL')
Datalink attributes per SQL spec
These are refered to as link control options by SQL spec. They are insanely verbose as text, but can be stored in a typmod. They are specified per column and thus apply to all URLs stored.
LINK CONTROL
- NO
- just store the datalink
- file is not "linked", no further control
- FILE
- file is "linked"
- file has to exist
- level of control can be specified with further options
INTEGRITY
- ALL - linked files cannot be deleted or renamed
- SELECTIVE - linked files can only be deleted or modified using file manager operations, if no datalinker is installed
- NONE - referenced files can be deleted or modified using file manager operations, not compatible with FILE LINK CONTROL
READ ACCESS
- DB
- read access is controlled by SQL server, based on access privileges to the datalink value
- involves read access tokens
- encoded into the URL by the SQL server
- verified by external file manager/data linker
- FS
- read access is determined by file manager
WRITE ACCESS
- ADMIN REQUIRING TOKEN FOR UPDATE
- write access governed by SQL server (and datalinker)
- involves write access token for modifying file content
- ADMIN NOT REQUIRING TOKEN FOR UPDATE
- write access governed by SQL server (and datalinker)
- BLOCKED
- linked files cannot be modified
- FS
- write access controlled by file manager
RECOVERY
- YES
- NO
ON UNLINK
- DELETE
- file is deleted when unlinked
- RESTORE
- original properties (ownership, permissions) restored as well
- NONE
- ownership and permissions are not restored
Implementation
Implementation of SQL/MED requires a lot of hocus-pocus and is probably even impossible to do properly on plain unix. Many parts, however seem to be implementable.
System should support at least http: and file: URL schemes. These two require quite different implementations. Perhaps a support for pluggable URL schemes could be useful - consider for example RDF namespaces.
When storing many URLs, it should be possible to greatly reduce space requirements by storing URLs only once in a master URL table and using an int4 or int8 as an actual DATALINK type. This greatly reduces storage requirements needed for many URLs, provided for the use case mentioned. It should be possible to partition the master URL table according to several criteria.
Datalink software is divided in several parts:
- datalink database schema. Master URL table is stored here.
- external process datalinker. It uses datalink schema. It manages files on behalf of postgresql server. It uses postgresql for implementing transactional semantics.
QUESTION: what to do for multiple databases on same server?
Parts
- DL - DATALINK SQL/MED + support functions
- FILEIO - local file access functions
- URL - parsing and manipulation - one often needs to get and set various parts of URLs and filenames
Some sort of inter-server communication is needed (to support remote files). Accessing datalinker api in a remote postgresql instance over libpq (say dblink) could maybe work.
Misc notes
On ext3 filesystem, files can be locked with 'chattr +i' command, but only by root. This prevents file from beeing changed, no matter what unix file permissions say, even for root user.
By changing file owner/group to postgres and setting unix file permissions, file can be made effectively owned by postgresql server (which can than use normal unix functions to use them).
PL/PerlU seems the right choice for this, with Perl for external datalinker. Proper security and safety will be hard to get right!
SQL Conformance
Identifier | Description | Status |
---|---|---|
M001 | Datalinks | |
M002 | Datalinks via SQL/CLI | |
M003 | Datalinks via Embedded SQL | |
M011 | Datalinks via Ada | |
M012 | Datalinks via C | |
M013 | Datalinks via COBOL | |
M014 | Datalinks via Fortran | |
M015 | Datalinks via M | |
M016 | Datalinks via Pascal | |
M017 | Datalinks via PL/I |
Code | Meaning |
---|---|
22000 | DATA EXCEPTION |
22017 | INVALID DATA SPECIFIED FOR DATALINK |
2201A | NULL ARGUMENT PASSED TO DATALINK CONSTRUCTOR |
2201D | DATALINK VALUE EXCEEDS MAXIMUM LENGTH |
HW000 | DATALINK EXCEPTION |
HW001 | EXTERNAL FILE NOT LINKED |
HW002 | EXTERNAL FILE ALREADY LINKED |
HW003 | REFERENCED FILE DOES NOT EXIST |
HW004 | INVALID WRITE TOKEN |
HW005 | INVALID DATALINK CONSTRUCTION |
HW006 | INVALID WRITE PERMISSION FOR UPDATE |
HW007 | REFERENCED FILE NOT VALID |
HY093 | INVALID DATALINK VALUE |