SEPostgreSQL Architecture
This chapter introduces the architecture of SE-PostgreSQL.
How SE-PostgreSQL works with SELinux
SELinux security architecture
SELinux acts as a reference monitor in the Linux kernel. By definition, a reference monitor is a small, tamperproof module that checks every access and makes a decision whenever a user requests access to a data object managed by the system. In an operating system, users must invoke system calls to access the data objects it manages, such as files and sockets. SELinux intercepts system call invocations via security hooks placed at strategic points, and makes its decision based on its security model and the security policy.
SELinux's security model is very simple. It assigns a security identifier called a security context, represented as a formatted string independent of the kind of object, to every object managed by the operating system, such as filesystem objects, network sockets, and processes. Note that every process also has a security context. That is, the caller of a system call has a security context, not only the target of the call.
SELinux looks up the entry in the security policy, which is a set of access control rules, for the pair of security contexts, and denies the requested access unless it is explicitly allowed for that pair. This mechanism resembles a client-server model, so it is often called the security server: it returns a decision for a given pair of security contexts.
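The default-deny lookup described above can be illustrated with a small model. This is only a sketch in Python, not SELinux's real interface: the `POLICY` table, its single rule, and the function name `security_server_allows` are hypothetical, and the object class in the key is an assumption added so that the same pair of contexts can carry different rules for different kinds of objects.

```python
# Toy model of a default-deny security server (illustrative only, not SELinux's API).
from typing import Dict, FrozenSet, Tuple

# Hypothetical policy: (subject context, object context, object class) -> allowed permissions.
Policy = Dict[Tuple[str, str, str], FrozenSet[str]]

POLICY: Policy = {
    # A made-up rule for illustration; real rules live in the loaded security policy.
    ("system_u:system_r:httpd_t:s0", "system_u:object_r:var_t:s0", "dir"):
        frozenset({"search", "getattr"}),
}


def security_server_allows(policy: Policy, scontext: str, tcontext: str,
                           tclass: str, perm: str) -> bool:
    """Return True only if the policy explicitly allows the permission."""
    # A missing entry yields an empty set, so anything not listed is denied.
    allowed = policy.get((scontext, tcontext, tclass), frozenset())
    return perm in allowed
```

Because the decision depends only on the contexts, the caller does not need to know whether the object is a file, a socket, or a database record.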
Interaction with SELinux
SELinux also exports an interface to userspace so that it can act as a security server. It provides a pseudo filesystem named selinuxfs (generally mounted on /selinux), which exposes some raw interfaces to in-kernel SELinux. The libselinux library provides a set of higher-level APIs for userspace applications. It enables userspace applications to ask SELinux whether the security policy allows the requested actions for a pair of security contexts.
SE-PostgreSQL acts as a client of the security server. It checks every query received from clients, just as SELinux checks system call invocations, and asks in-kernel SELinux whether the query should be allowed. SE-PostgreSQL can then prevent the client from accessing the requested database object if the access would violate the policy. SE-PostgreSQL is not the only userspace application that acts as a client of the security server; others include XACE/SELinux (X.org with the SELinux enhancement), the nscd daemon, password utilities and so on.
This design provides a characteristic feature called system-wide consistency in access control, because all access control decisions are made by the SELinux security server based on common criteria. For example, it prevents a user without clearance from reading information labeled as `credential`, regardless of whether it is stored in a file or in a database record.
libselinux provides the protocol for communication between SELinux and userspace applications. The protocol requires every entity that appears in access control to be abstracted as a security context, so SE-PostgreSQL needs to manage the security context of the client (the subject) and the security contexts of the database objects (literally, the objects). libselinux provides an API to obtain the security context of the peer process for a given socket descriptor, and SE-PostgreSQL uses its result as the security context of the client. For database objects, it needs to assign an individual security context to each object as simply and compactly as possible. See Management of the security context for more details.
Hooks in strategic points
At the implementation level, SE-PostgreSQL places various hooks in the original PostgreSQL code to take control and make its decisions. These points are significant, so we call them strategic points.
The following list shows some of the strategic points.
- sepgsqlCheckRTEPerms() is invoked from ExecCheckRTPerms(), just after ExecCheckRTEPerms() succeeds.
- sepgsqlCheckDatabaseSuperuse() is invoked from superuser_arg(), when it would allow the client to act as superuser.
- sepgsqlCheckBlobCreate() is invoked from LargeObjectCreate(), just before a new large object is created.
When SE-PostgreSQL is disabled at run time or build time, the hooks have no effect. In other words, the server behaves exactly like a normal PostgreSQL.
All the SE-PostgreSQL logic is encapsulated behind the hooks, which minimizes the impact on the original PostgreSQL and keeps the code maintainable.
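The hook pattern can be sketched as follows. This is a Python stand-in written for illustration, not PostgreSQL source: `HookRegistry`, `native_check` and `exec_check_rt_perms` are hypothetical names, and the native check is assumed to always pass. The point is only the ordering (native check first, then the hook) and the fact that an unset hook changes nothing.

```python
# Sketch of the hook pattern; all names are hypothetical stand-ins, not PostgreSQL code.
from typing import Callable, Optional


class PermissionDenied(Exception):
    """Raised when either the native check or the hook rejects the access."""


class HookRegistry:
    def __init__(self) -> None:
        # None models SE-PostgreSQL being disabled at build time or run time.
        self.check_rte_perms: Optional[Callable[[str, str], bool]] = None


hooks = HookRegistry()


def native_check(user: str, table: str) -> bool:
    """Stand-in for the native database privilege check (assumed to pass here)."""
    return True


def exec_check_rt_perms(user: str, table: str) -> None:
    if not native_check(user, table):
        raise PermissionDenied("native privilege check failed")
    # The hook runs only after the native check has succeeded.
    hook = hooks.check_rte_perms
    if hook is not None and not hook(user, table):
        raise PermissionDenied("security policy check failed")
```

With `hooks.check_rte_perms` left as `None`, `exec_check_rt_perms` behaves exactly like the native check alone.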
Userspace access vector cache
When SE-PostgreSQL communicates with in-kernel SELinux, the system call invocation requires a context switch. This is a relatively heavy operation, so the number of system calls must be kept small to minimize the performance loss caused by the additional privilege checks. In particular, a single query can fetch a massive number of tuples, so invoking a system call for each check would be intolerable.
The userspace AVC (access vector cache) minimizes this cost. In the SELinux security model, a massive number of objects tend to share a limited number of security contexts, and the same result is returned for an identical combination of security contexts and actions, so recently requested combinations can be cached in userspace.
When a client issues an SQL query, the SE-PostgreSQL subsystem is invoked via a hook at a strategic point to make its decision. It first checks the userspace AVC. If an entry is found, it returns the result immediately. If the given combination is not found, it must ask in-kernel SELinux according to the communication protocol. In-kernel SELinux has a similar structure: it can look up the kernel AVC at little cost, but looking up an entry in the security policy is a heavy step in comparison.
The userspace AVC has another good characteristic: its hash table can be looked up by security identifier, without any text representation of the security context. As described later, each text representation of a security context has an integer identifier, called the security id, which can be fetched from the HeapTuple data structure using a simple macro. This means SE-PostgreSQL does not need to translate a security id into the corresponding text security context on its most frequent path.
The protocol requires the security context to be delivered as text, so the security id must be translated when the userspace AVC misses. However, this happens rarely: generally, more than 99% of checks hit the userspace AVC.
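A minimal sketch of this lookup path, assuming a cache keyed by integer security ids and an object class. The class `UserspaceAVC`, its callbacks `sid_to_context` and `ask_kernel`, and the hit/miss counters are all hypothetical; the real implementation lives in C inside SE-PostgreSQL and must also invalidate the cache when the policy is reloaded, which is omitted here.

```python
# Illustrative userspace AVC: cache decisions by security id, translate to text only on a miss.
from typing import Callable, Dict, FrozenSet, Tuple


class UserspaceAVC:
    def __init__(self,
                 sid_to_context: Callable[[int], str],
                 ask_kernel: Callable[[str, str, str], FrozenSet[str]]) -> None:
        self._sid_to_context = sid_to_context  # security id -> text security context
        self._ask_kernel = ask_kernel          # stand-in for the costly kernel query
        self._cache: Dict[Tuple[int, int, str], FrozenSet[str]] = {}
        self.hits = 0
        self.misses = 0

    def has_perm(self, ssid: int, tsid: int, tclass: str, perm: str) -> bool:
        key = (ssid, tsid, tclass)
        allowed = self._cache.get(key)
        if allowed is None:
            # Miss: only now are the ids translated into text for the protocol.
            self.misses += 1
            allowed = self._ask_kernel(self._sid_to_context(ssid),
                                       self._sid_to_context(tsid),
                                       tclass)
            self._cache[key] = allowed
        else:
            # Hit: no string handling and no system call.
            self.hits += 1
        return perm in allowed
```

Because many tuples share the same security id, a scan over a large table typically touches only a handful of cache keys after the first few misses.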
Management of the security context
What is a security context
A security context is a short formatted text that abstracts all the attributes of an entity labeled under SELinux's access control. On a system with SELinux enabled, every data object managed by the operating system is labeled with a security context.
For example, ls -Z shows the security context of files as follows. Major filesystems can associate extended attributes (xattr) with individual files, and SELinux uses this feature to assign a security context to every file.
[kaigai@saba ~]$ ls -Z /var/
drwxr-xr-x. root root system_u:object_r:acct_data_t:s0 account/
drwxr-xr-x. root root system_u:object_r:var_t:s0 cache/
drwxr-xr-x. root root system_u:object_r:cvs_data_t:s0 cvs/
drwxr-xr-x. root root system_u:object_r:var_t:s0 db/
drwxr-xr-x. root root system_u:object_r:var_t:s0 empty/
drwxr-xr-x. root root system_u:object_r:games_data_t:s0 games/
drwxrwx--T. root gdm system_u:object_r:xserver_log_t:s0 gdm/
- (snip) -
Files are not the only objects to which SELinux assigns a security context. ps -Z or pstree -Z shows the security context of processes. The Linux kernel provides a private field for the active security module, and SELinux uses it to store the security context of each process.
[kaigai@saba ~]$ pstree -Z
init(`system_u:system_r:init_t:s0')
 ├─auditd(`unconfined_u:system_r:auditd_t:s0')
 │  ├─audispd(`unconfined_u:system_r:audisp_t:s0')
 │  │  ├─sedispatch(`unconfined_u:system_r:audisp_t:s0')
 │  │  └─{audispd}(`unconfined_u:system_r:audisp_t:s0')
 │  └─{auditd}(`unconfined_u:system_r:auditd_t:s0')
 ├─bash(`system_u:system_r:initrc_t:s0')
 ├─httpd(`system_u:system_r:httpd_t:s0')
 │  ├─httpd(`system_u:system_r:httpd_t:s0')
 │  ├─httpd(`system_u:system_r:httpd_t:s0')
 │  └─httpd(`system_u:system_r:httpd_t:s0')
 ├─postgres(`system_u:system_r:postgresql_t:s0')
 │  ├─postgres(`system_u:system_r:postgresql_t:s0')
 │  ├─postgres(`system_u:system_r:postgresql_t:s0')
 │  └─postgres(`system_u:system_r:postgresql_t:s0')
 ├─smbd(`system_u:system_r:smbd_t:s0')
 │  ├─smbd(`system_u:system_r:smbd_t:s0')
 │  └─smbd(`system_u:system_r:smbd_t:s0')
- (snip) -
A security context consists of four fields separated by colon characters.
Examples of security contexts:
system_u:system_r:postgresql_t:s0 for PostgreSQL server process
system_u:object_r:shadow_t:s0 for /etc/shadow file
unconfined_u:object_r:user_home_t:s0 for user home directory
The first field is the SELinux user, the second is the role, the third is the type (also called the domain when the security context is assigned to a process), and the last field is the range.
SELinux has several security models (TE, MLS and RBAC); each model picks up certain fields and makes its own decision. The SELinux security server answers that a given access should be allowed only when all the security models allow it. See the SELinux Overview for more details.
The essential point is that a security context abstracts all the attributes of the object labeled with it, and SELinux makes its decision based only on security contexts, independent of any other factor.
This means SE-PostgreSQL also needs to provide a facility to manage the security contexts of database objects, just as major filesystems provide the xattr capability. This section introduces how SE-PostgreSQL manages the security contexts of database objects, and what interfaces are provided to users.
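The field layout can be made concrete with a small parser. This is a sketch only: `SecurityContext` and `parse_security_context` are hypothetical helpers, not part of libselinux. Two assumptions are built in: the range field may itself contain colons (as in `s0:c2`), so the string is split at most three times, and a context written with only three fields is accepted with no range.

```python
# Illustrative parser for the user:role:type:range layout (hypothetical helper).
from typing import NamedTuple, Optional


class SecurityContext(NamedTuple):
    user: str
    role: str
    type: str
    range: Optional[str]


def parse_security_context(text: str) -> SecurityContext:
    # Split at most three times so that a range such as "s0:c2" stays intact.
    parts = text.split(":", 3)
    if len(parts) < 3 or not all(parts):
        raise ValueError("malformed security context: %r" % text)
    user, role, type_ = parts[:3]
    range_ = parts[3] if len(parts) == 4 else None
    return SecurityContext(user, role, type_, range_)
```

For `system_u:system_r:postgresql_t:s0` this yields the user `system_u`, the role `system_r`, the type `postgresql_t` and the range `s0`.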
Interaction with the pg_security system catalog
A typical security context is a short string a few dozen bytes long. We need to be able to assign a security context to each database object for access control. PostgreSQL manages a database object as a tuple within the system catalogs, so the problem can be viewed as how to associate a tuple with a security context.
A characteristic of security contexts is that a massive number of objects tend to share a limited number of them: because a security context abstracts all the security attributes of an object, a set of objects governed by the same rules can share an identical security context.
A single security context is not large, but the number of tuples is massive, so security contexts must be stored as compactly as possible.
The pg_security system catalog stores pairs of a text representation and an object identifier. Instead of the text representation, every database object holds a security identifier, which is the object identifier of an entry in the pg_security system catalog. The HeapTupleHeaderData structure allows a variable-length padding field, and the security identifier is stored in that field in the same way as an object identifier is. This design reduces the space needed to store a security attribute, and allows the userspace AVC to be looked up without comparing strings.
When a security context is imported or exported, the security identifier is automatically translated from or to its text representation, so users handle it as text. If the given text representation is not in pg_security, it is implicitly inserted as a new record, and the object identifier of that record is used as the security identifier of the new security context.
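The translation between text and identifier amounts to string interning. The sketch below models it with two in-memory dictionaries; `SecurityLabelTable`, its method names, and the starting identifier are hypothetical and only stand in for the catalog described above.

```python
# Illustrative model of the text <-> security id translation (not the real catalog code).
from typing import Dict


class SecurityLabelTable:
    def __init__(self, first_id: int = 1) -> None:
        self._next_id = first_id  # arbitrary starting value for this sketch
        self._id_by_text: Dict[str, int] = {}
        self._text_by_id: Dict[int, str] = {}

    def label_to_sid(self, text: str) -> int:
        """Import path: return the id for a label, inserting a new record if needed."""
        sid = self._id_by_text.get(text)
        if sid is None:
            sid = self._next_id
            self._next_id += 1
            self._id_by_text[text] = sid
            self._text_by_id[sid] = text
        return sid

    def sid_to_label(self, sid: int) -> str:
        """Export path: return the text representation for a stored id."""
        return self._text_by_id[sid]
```

However many tuples carry the same label, the table holds that label once, and each tuple stores only the integer.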
The security_label system column
The security identifier of a tuple is not stored as a regular field, so an alternative way to access it is needed, just as the oid system column gives users a way to refer to the object identifier of a tuple.
A new system column named security_label was added. System columns are implicitly created for every relation (except for oid), and are not expanded by SELECT * FROM ... statements. These characteristics preserve compatibility with existing queries embedded in application software.
Users can use the security_label system column as a field in SELECT statements as follows. It is declared as TEXT type, so a client gets the security context of the tuple in its text representation. Internally, the security identifier of the tuple is translated to the corresponding security context in the pg_security system catalog.
postgres=# SELECT security_label, * FROM drink ORDER BY id;
                 security_label                  | id | name  | price
-------------------------------------------------+----+-------+-------
 system_u:object_r:sepgsql_table_t               |  1 | water |   100
 system_u:object_r:sepgsql_ro_table_t:Classified |  2 | coke  |   120
 system_u:object_r:sepgsql_table_t:Classified    |  3 | juice |   130
 system_u:object_r:sepgsql_ro_table_t            |  4 | cofee |   180
(4 rows)
A distinctive characteristic of the new security_label system column is that it is writable, whereas the existing system columns are read-only. It allows users to change the security context of tuples using the UPDATE statement, as long as they have sufficient privileges.
postgres=# UPDATE drink SET security_label = 'system_u:object_r:sepgsql_secret_table_t' WHERE id in (1,4);
UPDATE 2
postgres=# SELECT security_label, * FROM drink ORDER BY id;
                 security_label                  | id | name  | price
-------------------------------------------------+----+-------+-------
 system_u:object_r:sepgsql_secret_table_t        |  1 | water |   100
 system_u:object_r:sepgsql_ro_table_t:Classified |  2 | coke  |   120
 system_u:object_r:sepgsql_table_t:Classified    |  3 | juice |   130
 system_u:object_r:sepgsql_secret_table_t        |  4 | cofee |   180
(4 rows)
We can also insert a new tuple with an explicit security context using the security_label system column. When no security context is given, SE-PostgreSQL assigns a default security context. See the Default security context section for more details.
postgres=# INSERT INTO drink (security_label, id, name, price) VALUES ('system_u:object_r:sepgsql_table_t:Secret', 5, 'beer', 280);
INSERT 16493 1
postgres=# SELECT security_label, * FROM drink ORDER BY id;
                 security_label                  | id | name  | price
-------------------------------------------------+----+-------+-------
 system_u:object_r:sepgsql_secret_table_t        |  1 | water |   100
 system_u:object_r:sepgsql_ro_table_t:Classified |  2 | coke  |   120
 system_u:object_r:sepgsql_table_t:Classified    |  3 | juice |   130
 system_u:object_r:sepgsql_secret_table_t        |  4 | cofee |   180
 system_u:object_r:sepgsql_table_t:Secret        |  5 | beer  |   280
(5 rows)
The COPY statement also supports the security_label system column. It can export and import security contexts when the user specifies the column explicitly. If a COPY FROM statement has no column list that includes security_label, the default security context is assigned to all the new tuples.
The SELECT INTO and CREATE TABLE AS statements also support the writable system column. When the field list contains security_label, its value is used as the explicitly specified security context.
postgres=# SELECT 'system_u:object_r:sepgsql_table_t:s0:c' || id AS security_label, * INTO t FROM drink WHERE id in (2,4,6);
SELECT
postgres=# SELECT security_label, * FROM t;
             security_label              | id | name  | price
-----------------------------------------+----+-------+-------
 system_u:object_r:sepgsql_table_t:s0:c2 |  2 | coke  |   120
 system_u:object_r:sepgsql_table_t:s0:c4 |  4 | cofee |   180
 system_u:object_r:sepgsql_table_t:s0:c6 |  6 | sake  |   320
(3 rows)