DATALINK

From PostgreSQL wiki
Jump to navigationJump to search

This is a design document describing planning for a feature that may not yet exist in PostgreSQL. See the official documentation for your version to get information on available features.

Introduction

SQL Datalinks are a part of SQL/MED ISO/IEC 9075-9:2003 specification. Note that this is distinct from and complements efforts described here.

Datalink is a special SQL type intended to store URLs in database columns, and a number of functions, which can be used in SQL queries.

There aren't many RDBMSs which implement SQL/MED datalinks. IBM DB2 does. A lot of it is defined by the SQL standard to be "implementation specific".

Values of DATALINK type are opaque, constructed by using function DLVALUE. There are several functions available for converting DATALINK back to text, for example DLURLCOMPLETE.

There is not much software, which demands SQL/MED datalinks. However, there is functionality concerning files, URLs and web, which is often needed in modern applications. Often, it is awkward to use these together with RDBMS data. SQL/MED standardizes some of this.

Features

A column containing a value of type DATALINK can optionally have some special semantics:

  1. referential integrity - file pointed to by a datalink cannot be renamed or deleted
  2. access control - either SQL or file system mediated
  3. point in time recovery of file changes

Only references to external files are stored in the database, not the content of the files themselves.

Datalinks as defined by SQL/MED should provide:

  • transactional semantics
  • checking if file exists
  • protection of linked file against renaming or deletion
  • read access control through database
  • write access control through database
  • point-in-time recovery
  • deletion of files no longer referenced
  • access to files on different servers

The following is not specified by SQL/MED, but is often needed in practice (and probably needed to implement datalink):

  • compact URL storage for lots and lots of URLs (say int instead of string; sort-of like enum, but for URLs)
  • URL to filesystem mapping, access control to file system hierarchies
  • local server file access (read, write)
  • remote server file access (get, put)
  • extended file system attributes (get, set)
  • file meta data (list, get, set)

Use cases

Common use cases include:

  • website content management - ensuring that linked-to URLs exist
  • workflow management - providing for locking and versioning of files
  • configuration - managing configuration files, ie have a change in database recreate text configuration file. This file should be locked and writeable only from database.
  • web server log storage (many referer URLs)
  • RDF databases (many URLs)

Datalinks should enable you to:

  • efficiently store URLs (datalinks are stored as integers, not strings)
  • conveniently map between URLs and file system paths
  • safely read and write files from SQL (SQL/MED does not specify functions for this)
  • keep backups of changed files
  • manage files on different servers
  • manage datalink metadata (unified interface to extended filesystem attributes)

Specification

Synopsys

create table my_files (
  ...,
  link datalink,
  ....
)

insert into my_files (link) select dlvalue('file:///etc/passwd');
 
select into url dlurlcomplete(link) from my_files;

update my_files set link=dlnewcopy(url,0);

update my_files set link=dlpreviouscopy(url,0);

Datalink support functions

Constructors for values of type datalink:

  • DLVALUE(url) → datalink (INSERT only)
  • DLNEWCOPY(url,tokenp) → datalink (UPDATE only)
  • DLPREVIOUSCOPY(url,tokenp) → datalink (UPDATE only)

Functions for extracting information from datalink type:

  • DLURLCOMPLETE(datalink) → url
  • DLURLCOMPLETEONLY(datalink) → url
  • DLURLCOMPLETEWRITE(datalink) → url
  • DLURLPATH(datalink) → file path
  • DLURLPATHONLY(datalink) → file path
  • DLURLPATHWRITE(datalink) → file path
  • DLURLSCHEME(datalink) → URL scheme ('HTTP' or 'FILE')
  • DLURLSERVER(datalink) → server address
  • DLREPLACECONTENT(url, url, comment) → datalink (new in SQL 2008)

Additional functions in another well known RDBMS

  • DLVALUE(address,linktype [,comment]) → datalink
  • DLCOMMENT(datalink) → text
  • DLLINKTYPE(datalink) → link type ('FILE' or 'URL')

Datalink attributes per SQL spec

These are refered to as link control options by SQL spec. They are insanely verbose as text, but can be stored in a typmod. They are specified per column and thus apply to all URLs stored.

LINK CONTROL

  • NO
    • just store the datalink
    • file is not "linked", no further control
  • FILE
    • file is "linked"
    • file has to exist
    • level of control can be specified with further options

INTEGRITY

  • ALL - linked files cannot be deleted or renamed
  • SELECTIVE - linked files can only be deleted or modified using file manager operations, if no datalinker is installed
  • NONE - referenced files can be deleted or modified using file manager operations, not compatible with FILE LINK CONTROL

READ ACCESS

  • DB
    • read access is controlled by SQL server, based on access privileges to the datalink value
    • involves read access tokens
      • encoded into the URL by the SQL server
      • verified by external file manager/data linker
  • FS
    • read access is determined by file manager

WRITE ACCESS

  • ADMIN REQUIRING TOKEN FOR UPDATE
    • write access governed by SQL server (and datalinker)
    • involves write access token for modifying file content
  • ADMIN NOT REQUIRING TOKEN FOR UPDATE
    • write access governed by SQL server (and datalinker)
  • BLOCKED
    • linked files cannot be modified
  • FS
    • write access controlled by file manager

RECOVERY

  • YES
  • NO

ON UNLINK

  • DELETE
    • file is deleted when unlinked
  • RESTORE
    • original properties (ownership, permissions) restored as well
  • NONE
    • ownership and permissions are not restored

Implementation

Implementation of SQL/MED requires a lot of hocus-pocus and is probably even impossible to do properly on plain unix. Many parts, however seem to be implementable.

System should support at least http: and file: URL schemes. These two require quite different implementations. Perhaps a support for pluggable URL schemes could be useful - consider for example RDF namespaces.

When storing many URLs, it should be possible to greatly reduce space requirements by storing URLs only once in a master URL table and using an int4 or int8 as an actual DATALINK type. This greatly reduces storage requirements needed for many URLs, provided for the use case mentioned. It should be possible to partition the master URL table according to several criteria.

Datalink software is divided in several parts:

  • datalink database schema. Master URL table is stored here.
  • external process datalinker. It uses datalink schema. It manages files on behalf of postgresql server. It uses postgresql for implementing transactional semantics.

QUESTION: what to do for multiple databases on same server?

Parts

  • DL - DATALINK SQL/MED + support functions
  • FILEIO - local file access functions
  • URL - parsing and manipulation - one often needs to get and set various parts of URLs and filenames

Some sort of inter-server communication is needed (to support remote files). Accessing datalinker api in a remote postgresql instance over libpq (say dblink) could maybe work.

Misc notes

On ext3 filesystem, files can be locked with 'chattr +i' command, but only by root. This prevents file from beeing changed, no matter what unix file permissions say, even for root user.

By changing file owner/group to postgres and setting unix file permissions, file can be made effectively owned by postgresql server (which can than use normal unix functions to use them).

PL/PerlU seems the right choice for this, with Perl for external datalinker. Proper security and safety will be hard to get right!

SQL Conformance

Datalink features in the SQL standard
Identifier Description Status
M001 Datalinks
M002 Datalinks via SQL/CLI
M003 Datalinks via Embedded SQL
M011 Datalinks via Ada
M012 Datalinks via C
M013 Datalinks via COBOL
M014 Datalinks via Fortran
M015 Datalinks via M
M016 Datalinks via Pascal
M017 Datalinks via PL/I
Error codes for datalinks
Code Meaning
22000 DATA EXCEPTION
22017 INVALID DATA SPECIFIED FOR DATALINK
2201A NULL ARGUMENT PASSED TO DATALINK CONSTRUCTOR
2201D DATALINK VALUE EXCEEDS MAXIMUM LENGTH
HW000 DATALINK EXCEPTION
HW001 EXTERNAL FILE NOT LINKED
HW002 EXTERNAL FILE ALREADY LINKED
HW003 REFERENCED FILE DOES NOT EXIST
HW004 INVALID WRITE TOKEN
HW005 INVALID DATALINK CONSTRUCTION
HW006 INVALID WRITE PERMISSION FOR UPDATE
HW007 REFERENCED FILE NOT VALID
HY093 INVALID DATALINK VALUE