HTTP API


The benefit of an HTTP API would be easier access via tools that don't support the PostgreSQL protocol, such as curl, web browsers, new programming languages, etc.

It'd also be attractive to the NoSQL crowd, who often want *simple* access. That's particularly so when combined with PostgreSQL's new built-in JSON types, which could be returned unescaped in a JSON-format result, as a directly nested child object.

HTTP+REST is much easier to use from mobile apps. Currently, if you want to write, say, an Android app that talks to PostgreSQL, you need to do it via a web API mid-layer you host somewhere. Getting rid of that requirement would be nice (though it won't solve the security challenges of directly exposing your database, of course).

An HTTP REST API can also make the database very easy to traverse, making it discoverable and easily explored.

This page is an early draft proposal of an extension to PostgreSQL allowing clients to access the database using HTTP instead of PostgreSQL's native frontend/backend protocol (FE/BE).

RESTful URLs versus RPC URLs

The API mainly follows RESTful principles, and many database objects are modeled as resources with their own URLs. However, this isn't always suitable; for example, running a query is clearly an RPC operation, and those endpoints are marked as such (as opposed to being considered a resource of some sort).

The POST HTTP verb should be used for all RPC-like operations.

Session concept

PostgreSQL's FE/BE protocol uses long-lived TCP/IP connections with the strong concept of a user session. The user session has state, including a possibly open transaction, a bunch of session specific GUCs, any open cursor(s), zero or more prepared statements, etc.

RESTful APIs tend to be largely stateless, but that isn't practical when working with PostgreSQL. Remaining as stateless as possible is desirable, but there will also have to be a way to link requests to a particular backend's state. Because sessions are extremely security sensitive (think "SET ROLE", "SET SESSION AUTHORIZATION"), it's important that they cannot be hijacked.

No design has yet been concluded for linking a request securely, efficiently and simply with a backend session. TODO.

One possible approach

By default, each request gets its own session, with auto-commit behavior if there is no error.

However, a connection can request session behavior by:

  1. Create a session with a POST to /sessions, which returns a session ID; the session is associated with the TCP connection.
  2. Add a header X-PostgreSQL-Session-ID: ... to subsequent requests on the same TCP connection.
  3. The session is closed and rolled back if the TCP connection is closed, or it can be explicitly ended using a POST to /sessions/<id>?commit=true|false.

--Dobesv (talk) 17:03, 2 September 2013 (UTC)
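
A sketch of that proposed flow as HTTP exchanges (the X-PostgreSQL-Session-ID header comes from the list above; the session ID value and the JSON field name are illustrative):

POST /sessions

... response ...

{
    'session_id' : 'f3a9c0d1'
}

POST /db/mydb/query
X-PostgreSQL-Session-ID: f3a9c0d1

{
    'query' : 'UPDATE customer SET username = lower(username)'
}

POST /sessions/f3a9c0d1?commit=true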

Consistency

The meaning of a particular request method like GET, PUT, etc on an object should be consistent between objects of different types insofar as is possible.

The spec below proposes that GET on a schema returns the DDL of the schema. Will GET be the verb that requests the SQL definition of an object for all object types? Should it be, when the results can be quite big? Will the response include child objects, or just the definition of the object itself?

An API user should know what to expect without looking up what a particular verb does on a particular object, and should be able to walk the tree of the database structure using only responses from the server.

How to represent and communicate sub-resources

Should we provide a child list in the responses when an object is requested (as most of the JSON examples here show)? That's easy and discoverable, but it means discovery of children is specific to the response format; you have to parse JSON to get the children, for example.

The other way is to use HTTP Link headers. These aren't as visible/obvious when using tools like curl but are independent of the response format of the body. They allow HEAD requests to be used to traverse the graph without fetching body content at all.

Using both is always an option, but that level of duplication isn't desirable in a database access protocol.
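
As an illustration, traversal via Link headers might look like this (the rel values are assumptions; this draft doesn't propose any relation names):

HEAD /db/mydb/schemas/public

... response headers ...

Link: </db/mydb/schemas/public/tables/customer>; rel="child"
Link: </db/mydb/schemas/public/tables/booking>; rel="child"
Link: </db/mydb>; rel="parent"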

Parent/child relationships

Should a GET by default include children, or not? If it does, should it include just a list of child names, or full recursive definitions of children?

This is particularly tricky for schemas, because a schema may be created together with one or more children, or those children may be added later. Are tables, functions, etc part of the schema DDL itself? Or not? Or only if they were created in the same CREATE SCHEMA statement?

My opinion is: GET should by default include a list of children but not their definitions, and objects within a schema are children, not part of the schema, no matter how it was defined.

Given the example "public" schema with tables 'customer' and 'booking', view 'customer_summary' and function 'safe_delete_customer', a GET could return (if GETs returned SQL):

CREATE SCHEMA public;

or

CREATE SCHEMA public
  CREATE TABLE customer(
    .. blah ..
  )
  CREATE TABLE booking(
    .. blah ..
  )
  CREATE VIEW customer_summary AS SELECT * FROM customer
  CREATE FUNCTION safe_delete_customer(customer_id integer) RETURNS void AS $$
     ... blah ...
  $$ LANGUAGE plpgsql;

and either would be equally valid. I (CR) think both would be horrible to work with and make the API impractical to use.

I think the default should probably be whatever is most common / useful, and then have options to customize that. I'm not sure what is the most useful/appropriate default, though. --Dobesv (talk) 17:03, 2 September 2013 (UTC)

Ease of working with responses

I (CR) am strongly against GET returning SQL language text, for example a GET on a schema returning the SQL DDL for that schema. SQL is horrible to parse and work with; clients should have to ask for it explicitly if they want it. The default should be easily processed, structured JSON listing an object and its attributes, plus (depending on the children query param) possibly a listing of all children, or a recursive listing of all children with their attributes.

Here's something like what the GET above should return:

{
  'type': 'schema',
  'name': 'public',
  'owner': 'joebloggs',
  'children': ['customer', 'booking', 'customer_summary', 'safe_delete_customer']
}

This tells the client what they need to know quickly and efficiently. Params could modify the response, eg:

  • ?acls=true to list ACLs in a structured JSON child object (*NOT* Pg ACL text)
  • ?children=tables,views to list only tables & views, not functions, sequences, etc
  • ?children=none (or an empty ?children=) to suppress the child list
  • Something like ?child_detail=all to include full JSON definitions of children, not just names. We need a way to get a whole tree, or just the first level, without n queries for n children.
  • In addition to the 'children' array, another entry could be a mapping of child names to child types. The 'children' array should remain a simple array for consistency with the other interfaces.

Note that the child names are relative URIs.
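
Combining these hypothetical parameters, a request might look like:

GET /db/mydb/schemas/public?acls=true&children=tables,views&child_detail=all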

The rationale for returning DDL/SQL (at least as an option) is that it is already the standard I/O format for all existing tools (like pg_dump & psql), so it can be useful to get that out and transfer it into another tool that doesn't support JSON, and the other way around too. I can see the value of the JSON response for API clients (like ORM tools) that want to scan the schema, though. I'm not sure which would be used more often, but JSON does seem like a good default also. --Dobesv (talk) 17:03, 2 September 2013 (UTC)

Authentication

The authentication mechanisms must be designed to avoid creating vulnerabilities to cross-site request forgery (XSRF) and other common HTTP vulnerabilities.

If cookies, HTTP Basic/Digest/certificate authentication, or any other authentication method where credentials may be presented automatically by the browser is used, then request forgery protection is needed.

See Preventing CSRF Attacks for more.
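
One common mitigation, sketched here purely as an illustration (the header name is made up), is to require a custom request header on state-changing requests; cross-site HTML forms cannot set custom headers, so forged requests can be rejected before execution:

POST /db/mydb/query
X-PostgreSQL-CSRF-Token: <token issued out-of-band at login>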

Life's probably simpler if these less secure login methods are simply not supported, at least at first. --Dobesv (talk) 17:03, 2 September 2013 (UTC)

URLs

Server Root

URL: /

Server metadata / version available here.

Children: databases

With appropriate query parameters / Accept header this might produce a pg_dumpall backup of the server.

CR: IMO a GET against the server root should produce JSON structured info about the server.

GET /

... response ...

{
    'type' : 'server_root',
    'version' : 90400,
    'version_human' : '9.4.0',
    'description' : 'PostgreSQL 9.4.0 Server',
    'children' : [ 'db' ]
}

Databases

URL: /databases

(CR: "/db" not "/databases"? This'll appear a *LOT*).

Get a list of databases.

With appropriate query parameters might be usable to get a pg_dump of all the databases.

CR: IMO a pg_dump is an RPC op and should be requested by a POST to /?dumpall, with JSON parameters for the dumpall as the payload; it isn't simply a GET because of all the different ways it can be invoked. It shouldn't be invoked on /db but on the root, since a dumpall is about more than just databases.
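
Such a dumpall request might look like this (the parameter names are illustrative):

POST /?dumpall

{
    'globals_only' : false,
    'schema_only' : true
}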

CR: The response to a GET on /db could look something like:

{
    'type' : 'databases',
    'name' : 'db',
    'children' : [ 'postgres', 'template1', 'template0', 'mydb' ]
}

As long as doing a database dump is idempotent and doesn't change the database, I think GET is most appropriate regardless of the number of options. Are you aware of a way in which a database dump would change the database? --Dobesv (talk) 17:03, 2 September 2013 (UTC)

Database

URL: /databases/:database_name


GET with an appropriate Accept header and query parameters might be usable to get a pg_dump of the database.

CR: IMO a GET should return a JSON response showing the database info and a list of its schemas, with query params to add/remove detail, eg:

GET /db/mydbname

... response ...

{
   'type' : 'database',
   'name' : 'mydbname',
   'owner' : 'joebloggs',
   'children' : ['schemas/public', 'schemas/pg_catalog', 'schemas/myschema']
}

Note that 'children' should always be URIs, which means schema names containing '/' will have to be URI-encoded: eg "my/schema/name" becomes "my%2Fschema%2Fname" and thus "schemas/my%2Fschema%2Fname". Ugly, but we can always add another key, named eg 'schemas', with non-escaped names; the most important thing is that the 'children' key always contains valid relative URIs for traversing the database structure.

I don't think pg_dump should be a GET; I think it is an RPC op and should be the response to a POST, for example:

POST /db/mydbname?dump
{
  'schema_only' : false,
  'inserts' : true
}

An Accept header to select pg_dump isn't enough because of the different ways people need to run dumps.

It seems easier to pass the options as URL query parameters rather than in a JSON body, to me. And POST would imply creation of an object, or at least database modification, which I don't think a dump is. --Dobesv (talk) 17:03, 2 September 2013 (UTC)
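
Under that alternative, the same dump might be requested with query parameters instead of a JSON body (hypothetical):

GET /db/mydbname?dump=true&schema_only=false&inserts=true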

Querying a Database

URL: /databases/:database_name/query

RPC endpoint for SQL queries on this database. POST SQL, get a response in a format according to the Accept header. Uses the default schema for the database, or can change schemas in the SQL.

CR: Consider wrapping the SQL in JSON to allow out-of-line query parameters, per-query GUC settings, etc. Think:

POST /db/mydb/query
{
   'type' : 'sqlquery',
   'query' : 'SELECT * FROM mytable WHERE userid = :userid',
   'params' : {
       'userid' : ['integer', 400]
   },
   'query_gucs' : {
       'work_mem' : '1MB',
       'statement_timeout' : 5000,
       'search_path' : 'public,myschema'
   }
}
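
A response to the wrapped query above might look something like this (the structure and field names are illustrative; nothing here is settled):

{
   'type' : 'resultset',
   'columns' : [
       { 'name' : 'userid', 'type' : 'integer' },
       { 'name' : 'username', 'type' : 'text' }
   ],
   'rows' : [
       [ 400, 'joebloggs' ]
   ]
}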

Personally I don't think raw SQL should be accepted as a POST body at all; it should *have* to be in a JSON container. Very slightly slower, but lots more flexible, and the HTTP interface is quite likely going to be out-of-process anyway.

That's useful as an option, but I think accepting straight SQL will be of use to people more often. I don't know how often people attach those options to their SQL queries. --Dobesv (talk) 17:03, 2 September 2013 (UTC)

Schema

URL: /databases/:database_name/schemas/:schema_name

GET returns the DDL schema definition of the schema

CR: Totally against GET returning the DDL. It should return descriptive JSON with schema info and children. SQL DDL should probably require a GET with an Accept header, or even an RPC-style POST with params. See the discussion of API consistency above for why I think GET returning SQL here is bad.

Query using a Schema

URL: /databases/:database_name/schemas/:schema_name/query

RPC endpoint for SQL queries using this schema as the default. POST SQL and get a response in a format according to the Accept header.

CR: Given that schemas in Pg are resolved via a search path, what exactly would this mean? Set search_path to "theschema,pg_catalog"? Just prepend "theschema"? I wonder if querying under /schemas is needed at all, or if it's better satisfied by passing a search_path GUC to queries at the database level, as above.

I guess I don't really know. What do tools like pgAdmin usually do when you browse into a schema and run some SQL? --Dobesv (talk) 17:06, 2 September 2013 (UTC)

Table

URL: /databases/:database_name/schemas/:schema_name/tables/:table_name

GET returns the DDL for the table (and optionally its indexes, if requested via a query parameter).

PUT to provide a new DDL

CR: GET should return *descriptive* info, not SQL DDL that's a horrific nightmare to parse for anything useful. Maybe an Accept header for SQL, or a POST to request it. GET should respond with something like:

{
    type : 'table',
    name : 'thetablename',
    owner : 'freddy',
    database: 'mydb',
    schema: 'public',
    tablespace: 'default',
    children: [ 
        'rows',
        'columns/id', 'columns/username',
        'constraints/username_must_be_unique',
        'constraints/username_is_lowercase'
    ],
    columns : {
       id: { index: 0, type: 'integer', default: 'nextval(\'thetablename_id_seq\')', nullable: false },
       username: { index: 1, type: 'text', nullable: true }
    },
    indexes : {
       thetablename_id_idx: { on: ['id'], index_type : 'btree', unique: true, schema: 'public', tablespace: 'default', uri: '/db/mydb/schemas/public/rels/thetablename_id_idx' },
       thetablename_username_unique_idx: { on: ['username'], index_type: 'btree', unique: true, schema: 'public', tablespace: 'default', uri: '/db/mydb/schemas/public/rels/thetablename_username_idx'}
    },
    constraints: {
       username_must_be_unique: { constraint_type: 'unique', on: ['username'], implementing_index: 'thetablename_username_unique_idx' },
       username_is_lowercase: { constraint_type: 'check', expression: 'lower(username) = username', where: 'id <> 0' }
    }
}

You get the idea. That's something computer-readable that yields a lot of useful information, like "the username column is a text column and is nullable, but any values present must be unique", rather than some horrid-to-parse SQL.

'indexes', 'constraints', etc should probably be objects, not arrays, since they have no natural order, and they shouldn't include anything more than names unless a query param asks for recursive info on children. Columns have a natural order, but that can be given by an ordinal index, making them consistent. The response above would be generated with the param "?child_detail=true". If that were omitted, all that'd be produced would be a child list (where each child could of course then be queried):

{
    type : 'table',
    name : 'thetablename',
    database: 'mydb',
    schema: 'public',
    tablespace: 'default',
    children: [ 
        'rows',
        'columns/id', 'columns/username',
        'constraints/username_must_be_unique',
        'constraints/username_is_lowercase'
    ]
}

Note that as indexes are separate objects from tables, and not necessarily even in the same schema, I've not listed them in 'children'.


Also, IMO "/schema/myschema/tables" is wrong because it should be transparent whether you're seeing a table or a view. Maybe "/schema/myschema/relations". It needs to match Pg namespace rules: tables can't have the same name as views, since they share a namespace, but (eg) functions _can_ have the same name as a table.

Index

URL: /databases/:database_name/schemas/:schema_name/tables/:table_name/indexes/:index_name

GET returns the DDL schema definition of the index

CR: As above, I'm against returning DDL. This should return index info in JSON, eg:

{
    type: 'index',
    name: 'the_table_id_idx',
    owner: 'freddy',
    database: 'mydb',
    schema: 'public',
    tablespace: 'default',
    target_table: {
       uri: '/db/dbname/schemas/public/the_table',
       name: 'the_table',
       schema: 'public'
    },
    columns: ['id'],
    index_type: 'btree',
    unique: true
}

Table Rows

URL: /databases/:database_name/schemas/:schema_name/tables/:table_name/rows

GET constructs an SQL SELECT using the query parameters and the URL and returns the result. Query parameters:

  • cols - SELECT column list
  • where - Specify conditions
  • group_by - Specify grouping
  • order_by - Specify ordering

POST constructs an SQL INSERT or COPY using the row data in the body, based on the Content-Type: when the Content-Type is SQL, use INSERT; when it is CSV or text, use COPY semantics.
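
For example, a CSV bulk insert using COPY semantics might look like this (illustrative):

POST /db/mydb/schemas/public/tables/tblname/rows
Content-Type: text/csv

id,username
42,joebloggs
43,janedoe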

DELETE constructs an SQL DELETE using the query parameter where (same as for GET above).

CR: Each parameter chunk described above, like 'where' or 'order_by', would almost have to be a carved-out chunk of SQL accepting arbitrary expressions, so is it useful to separate them at all, rather than using a regular SQL query and letting people use existing query generator facilities? How do you represent "col1 DESC NULLS FIRST, col2 ASC" in a query param *without* just making it a raw SQL excerpt?

CR: It seems what you want on the table/rows interface is a simple "CRUD" style API. If that's the case, GET on rows/ should return a list of all primary keys, and GET on rows/[key] should return the data for that row. I'm extremely strongly against the idea of query param filters that are SQL snippets, since you're absolutely begging for horrible SQL injection problems.

GET /db/mydb/schemas/public/tables/tblname/rows

... response ...

{
   type: 'rows',
   table: 'tblname',
   schema: 'public',
   database: 'mydb',
   children: [ '1', '4', '8', '27' ]
}

You can then use regular GETs on the children, PUTs to replace them, POSTs on rows/ to create new children, and so on: CRUD, as sketched below. Slow and inefficient for bulk or high-intensity operation perhaps, but clean, simple, with no SQL involved, and trivial to implement.
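
A sketch of that CRUD style (URIs follow the examples above; the request and response shapes are illustrative):

GET /db/mydb/schemas/public/tables/tblname/rows/4

... response ...

{
   'id' : 4,
   'username' : 'joebloggs'
}

PUT /db/mydb/schemas/public/tables/tblname/rows/4

{
   'id' : 4,
   'username' : 'joe.bloggs'
}

POST /db/mydb/schemas/public/tables/tblname/rows

{
   'username' : 'newuser'
}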

Anything more sophisticated is probably better done with SQL. Down the track, maybe with a more machine-friendly query language (json/xml/whatever based), but I don't think you can _securely_ implement WHERE, HAVING, ORDER BY, etc without a structured query language. Dumping SQL snippets in is an instant security hole.

CR: Also worth considering the utility of matrix parameters for CRUD-style access. See matrix parameters in RESTEasy docs.
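
Matrix parameters attach filters to a path segment rather than the query string; a filtered row fetch might look like this (illustrative):

GET /db/mydb/schemas/public/tables/tblname/rows;username=joebloggs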

Row

URL: /databases/:database_name/schemas/:schema_name/tables/:table_name/row (?)

A variation on rows above: if the number of rows fetched, inserted, or deleted != 1, it returns an error and rolls back any changes that might have been made.

CR: That's inconsistent with the structure of the rest of the API, and it isn't REST. Instead, use GET rows/[id] to fetch a row, PUT rows/[id] to replace/edit it, DELETE rows/[id] to delete it, and POST to rows with data for a new row to create one, as per standard RESTful design.

Login Roles

URL: /logins

  • GET: List of login roles
  • POST: Add a new login role

CR: Should be /roles?login=true. Using "/logins" confuses namespaces because the same object can appear in both "/logins" and "/groups" (remember, a login role can have members), and not all roles are groups *or* login roles. I strongly advise merging "/logins" and "/groups" into "/roles".

Login Role

URL: /logins/:name

  • GET: Get information about a single login role
  • PUT: Update a login role
  • DELETE: Remove a login role

CR: As above, should be /roles/:name . Agree with the rest of this design element. POST /logins should create a new role.

Group Roles

URL: /groups

  • GET: List of group roles
  • POST: Create a new group role

CR: As noted above, this should be merged with "/logins" into "/roles".

Group Role

URL: /groups/:name

  • GET: Information about this group role
  • PUT: Update this group role
  • DELETE: Remove this role

CR: As for /logins/:name above.

LOBs

URL: /databases/:database/lobs

  • POST: Upload a new large object with a newly generated oid

URL: /databases/:database/lobs/:oid

  • GET: Fetch a large object body
  • PUT: Update a large object body
  • DELETE: Delete a large object
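
Usage might look like this (the response body is an illustrative guess):

POST /databases/mydb/lobs
Content-Type: application/octet-stream

<binary body>

... response ...

{
    'type' : 'lob',
    'oid' : 16435
}

GET /databases/mydb/lobs/16435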

Data Formats

The HTTP server should consider supporting XML, JSON, and SQL output formats. It determines the output format to use based on the "Accept" header of an HTTP request, or optionally the URI extension. It determines the format of the input based on the "Content-Type" header of an HTTP request, or optionally the filename extension in the Content-Disposition header.

CR: Using both the extension and the Accept header makes it hard for security rules and scanners to do their job. Look at what happened with Microsoft Internet Explorer ignoring the MIME type and favouring the file extension. I don't think using file extensions is a good idea. Just use MIME types via the Accept and Content-Type headers.

SQL

SQL output should match what you see in a pg_dump. TODO: Examples

XML

To be designed...

  • SQL Server seems to be very flexible in its XML output, such that you can produce structured XML rather than just a collection of rows
  • Need to figure out whether and how we can use existing XML schemas and formats for this

JSON

To be designed...

  • Do we need structured output here, or just a collection of rows?
  • We have the option to output rows as arrays or maps ... should we allow both? How best would the user specify which they want? (See the sketch below.)
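
The two candidate row representations might look like this (illustrative):

Rows as arrays:

{
    'columns' : [ 'id', 'username' ],
    'rows' : [ [ 1, 'alice' ], [ 2, 'bob' ] ]
}

Rows as maps:

{
    'rows' : [
        { 'id' : 1, 'username' : 'alice' },
        { 'id' : 2, 'username' : 'bob' }
    ]
}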

SQL MIME Types

There is currently no registered MIME type for SQL source code. MIME types that look SQL-ish should be treated as application/vnd.postgresql.sql, including but not limited to:

  • application/sql
  • text/sql
  • application/x-sql
  • text/x-sql

Versioning

The fields defined for the PostgreSQL data model might change over time. The MIME type can contain the version of the schema, which should be tied to the PostgreSQL release. For example:

Accept: application/vnd.postgresql.login-role+json;version=9.3

means the client only wants to receive a JSON response compatible with the data model of PG 9.3.

File Extensions

Applying a suffix such as .sql, .csv, or .html to a URI overrides the provided Accept: header(s) (if any) with the matching content type for that suffix. When the Content-Type is text/plain or application/octet-stream, we will look at the file extension of the filename in the Content-Disposition header, if any, and override the Content-Type based on that.

  • .sql -> application/vnd.postgresql.sql
  • .csv -> text/csv
  • .html -> text/html
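
For example (illustrative), these two requests would be equivalent:

GET /db/mydb/schemas/public/tables/tblname/rows.csv

GET /db/mydb/schemas/public/tables/tblname/rows
Accept: text/csv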

CR: Dangerous, see Internet Explorer. Why have this feature? What purpose does it serve? What problem does it solve?

Some features to think about

  • Ability to put EXPLAIN ANALYZE and other timing/performance information in a query response
  • SPDY protocol support (possibly without encryption / compression if we have a fat / secure pipe)
  • Partition IDs (possibly specifying a cluster node or partition in the request)
  • Virtual Hosts (direct request to a completely different DB server based on Host header)
  • How to reliably and securely associate requests with backend session state
    • What if each HTTP connection were handled just like a normal session connection, rolled back as soon as the connection is lost? You'd have to account for that in your client application design. It's workable though; many javascript applications out there do exactly that anyway.

Database Procedure Based Approach

Not necessarily contradictory, but to my mind a more secure approach than generic REST access to schema and data would be to expose database procedures.

This is how it would work in my mind:

  1. An extension of PostgreSQL makes the main server process listen on another port (443) for HTTPS connections.
  2. The main server process forks a normal (though HTTPS) client session to handle each incoming connection.
  3. After the cryptographic handshake, the responsible process parses the first incoming HTTP request.

For each HTTP request coming through the HTTPS connection

  1. The responsible PostgreSQL process executes SQL specified in the general configuration, with the parsed HTTP request as input, to select the intended database and schema as well as the authentication and request handler procedures to call there.
    • a typical mapping might be from the first part of the domain name (<key>.example.com) or the first URI path element (example.com/<key>) to a (database, schema, authn_proc_name, request_handler_proc_name) tuple.
  2. The selected authentication procedure is called and is expected to return a username for normal PostgreSQL access control.
    • This would allow applications to implement their own security, such as two-way authentication.
    • Specifying a fallback procedure to produce an HTTP response in the failure case might be supported as well.
    • It'd be good to provide sample procedure sets for HTTP digest, cookie, and client certificate based authentication.
  3. The request handler procedure associated with the selected database and schema is called after authentication. It should use the HTTP request information to produce an HTTP response object, probably by passing the call on to further procedures in a layered application (see the sketch after this list).
    • a typical mapping to actual "business procedures" might be to use the next path element from the URI as the name of another procedure to call, passing subsequent path elements as positional parameters, or GET and POST parameters as key-value pairs.
  4. The returned HTTP response object is serialized as an HTTP response to the client through the HTTPS connection.
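
A minimal sketch of such a request handler in PL/pgSQL, assuming hypothetical http_request and http_response composite types (every type, field, and name below is an illustration of the idea, not part of any implemented interface):

-- Hypothetical types: http_request(path_elements text[], ...),
-- http_response(status integer, content_type text, body text).
CREATE FUNCTION handle_request(req http_request) RETURNS http_response AS $$
DECLARE
    resp http_response;
BEGIN
    -- Dispatch on the first path element, eg /customers/42
    IF req.path_elements[1] = 'customers' THEN
        SELECT 200, 'application/json', row_to_json(c)::text
          INTO resp.status, resp.content_type, resp.body
          FROM customer c
         WHERE c.id = req.path_elements[2]::integer;
    ELSE
        SELECT 404, 'text/plain', 'not found'
          INTO resp.status, resp.content_type, resp.body;
    END IF;
    RETURN resp;
END;
$$ LANGUAGE plpgsql;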

Subsequent requests should probably be subjected to the same handling, complete with database selection and authentication, because any request might point to a different database, and authenticating each request separately is in line with the stateless-ish nature and tradition of HTTP.

Once the HTTPS TCP connection breaks or is closed, the session ends like sessions normally do; any uncommitted changes in a transaction are rolled back.

This approach is radically different in the sense that it makes PostgreSQL a standalone application platform, complete with a fully programmable web GUI. If both are implemented, it should be possible to choose which of the two approaches, REST schema/data access and this application-style access, to expose, or to expose both.

Sample request handler procedures should clarify the idea of using HTTP request parts to select a "business procedure" to call, along with some parameters for it, and show how the results of "business procedures" might be turned into different output formats (HTML/JSON/XML/CSV) and laid out appropriately depending on the requested type, keeping formatting separate from the business logic.

This procedure call based approach could be used to implement generic REST exposure of the whole database system, but not vice versa. Likewise, the procedures may provide more control over who gets to see or do what than a flat approach relying only on common PostgreSQL ACLs.

--Korpiq 21:11, 28 September 2012 (UTC)

I like this idea in principle, but in practice I wonder about its implications. Normally the database is a special-purpose component with its own resources and scaling methods. If someone actually did apply the full power of this system to build their whole web application logic into the database server, they might run into problems if/when they need to scale up, due to the coupling of the database and application layers. And everyone who didn't build their web application logic in the database would pay the performance penalty of parsing and authenticating requests in a scripting language instead of compiled C code. The database layer is one place, possibly a rare one, where people will count milliseconds of overhead per request as a problem.

--Dobesv (talk) 16:39, 2 September 2013 (UTC)

Implementations

Many of the ideas on this page are now implemented in PostgREST.