BinaryFilesInDB

From PostgreSQL wiki
Jump to navigationJump to search

Storing Binary files in the Database

New users to PostgreSQL often ask on the PostgreSQL mailing lists how/if to store binary files like images, Word documents, spreadsheets, HTML, PDF, compressed archives, or even code in the database. There are two prevailing schools of thought on that, each having pluses and minus detailed below:

  • Save the binary file to the regular filesystem, then write metadata and some sort of symbolic link in the database to its location.
  • Store the large binary file of unstructured data directly in a database field.

Caveat: many people come to this page looking for PostgreSQL's equivalent of the MySLQ "BLOB*" data types. The equivalent in Postgres is BYTEA. For details on PostgreSQL's "binary large object" (which are quite different from MySQL BLOB's and provide random seeking, etc), see below.

Methods:

Storing the large binary* file aka unstructured data streams in a database.

Postgres along with other databases offer similar basic structures.

  • Storing binary data using bytea or text data types
    • Pluses
      • Storing and Accessing entry utilizes the same interface when accessing any other data type or record.
      • No need to track OID of a "large object" you create
    • Minuses
      • bytea and text data types both use TOAST (details here)
        • Limited to 1G per entry
        • 4 Billion (> 2KB) entries per table max.
      • Need to escape/encode binary data before sending to DB then do the reverse after retrieving the data
      • Memory requirements on the server can be steep even on a small record set.
  • BLOB binary large object see Large Object Support
    • Pluses
      • limited 4TB (PostgreSQL 9.3+) per entry, & 4 Billion per database
      • Can stream, and seek over entries (can reduce memory requirements on DB server and client)
      • No encoding or escaping required.
    • Minuses
      • Must use different interface from what is normally used to access BLOBs.
      • Need to track OID. Normally a separate table with additional meta data is used to describe what each OID is.
      • (8.4 and <8.4) No access controls in database.
      • Sometimes advised against (basically you only need them if your entry is so large you need/want to seek and read bits and pieces of it at a time).

Database storage has some common characteristics whichever specific storage you pick:

  • Pluses overall
    • Security and access control are simplified
    • Version controlling is easier
    • ACID
    • Files saved into a primary database will stream to standby replicas automatically
    • Backups are easier there is no need to track external files
  • Minuses overall
    • Performance hit storing files in database.
    • Memory requirements higher, for Database
    • Backups can take significantly longer
    • Access to files to external applications is complicated. Normally a temporary file is copied to the client to access and modify the file and then needs to be copied back.

Storing Meta data and symbolic link in the database to where a Binary file is located

  • Pluses
    • Performance accessing binary file skips DB access layer.
    • Number of files limited only by file system
    • Size limited by file system
  • Minuses
    • The database user needs permissions to write outside of the database. That's not normally an option in hosted PostgreSQL environments.
    • Need to develop an interface to keep track of externally attached files
    • External files and database can get out of sync
      • Binary files can have missing entries in the database
      • Database having entries to external files that's been deleted or moved by users or other events out of control of the database.
    • Security settings between file system and database are independent from each other. The application may need to may need multiple security logins to access all resources. This is dependent on security needs of the application, and methods used for access control. There are several methods to create a single log on infrastructure.
    • Multiple points of failure depending on complexity of the network.

When should files be stored in the database?

The common suggestion here is when the files have to be ACID.

Example: Photos taken during a Quality Inspection process:

A normal/ordinary record in a database would consist of an Inspector, time stamp of inspection, process used to inspect the part, was the part accepted or rejected, and the parts serial number. In this scenario photos are also taken showing the part inspected. The photos are an integral piece of the inspection record keeping process.

The photos being a critical piece making up the record/inspection report should not to be stored separately. Consider if the photos being stored on file sever are moved to new location as part of upgrade, the link to the record is bad if it has not been updated. So now the inspection record by itself is worthless as it lost the piece it is describing and the photos are worthless without the record describing them. So the two pieces must stay together to have any value as an inspection report.

When storing the photos outside the database, there is no easy way to guarantee atomicity, consistency, isolation and durability of the photos. Storing the photos in the database is the only practical solution where the programmer can guarantee rules are followed in the future. The file data will automatically replicate and flow to database backups. And since writing to arbitrary files on the database server requires extra trust, saving to the database may be the only option available or considered safe by your PostgreSQL administrator.

What is the best way to store the files in the Database?

There are three methods to choose from in PostgreSQL BLOB, bytea or text data type.

The main author of this article has changed to using bytea exclusively now that Python and pl/Python make it easy to use bytea vs text type.

bytea

  • Data storage requirements are the same as the object being stored in the file system
  • Standard SQL statements to INSERT, UDPATE, DELETE, and SELECT the object
  • How easy is it to get the data in and out depends on the programming tools. Python & .Net have the easiest automatic casting to and from languages type to postgres type (depending on db driver). Other languages database access layers require escaping or casting in a separate step before additional processing can be done.
  • Server side tools tools to encode or decode into other formats

text

  • Data storage requirements can be 30% higher than if the file is stored in the file system
  • Standard SQL statements to INSERT, UDPATE, DELETE, and SELECT the object
  • always has to be encoded and decoded into some format such as (not a complete list)

this can add significant overhead.

  • Server side tools to encode or decode into other formats

BLOB aka Large Objects

  • Completely different interface has to be used, read through Large-Objects documentation to have a full understanding.
  • Does have the advantage of being able to seek over data and stream data out.
  • Data storage requirements are the same as when saved in the file system
  • Limited server side tools to manipulate the objects.

What Meta Data should be stored in a database?

Here is a basic list of what should be recorded in the database along side the binary file

  • Original file name
  • Data type
    • This is needed to tell us what kind of binary file is stored in the database
    • The author suggest following the Internet Media Type it makes it easier to use the same information for a website or attaching to email.
  • Long description
  • Date added

When is it bad idea to store binary files in the database?

Very large files (100MB+), where performance is critical to the application. The database layer adds a lot of overhead and complexity that may not be required.