BinaryFilesInDB

From PostgreSQL wiki

Jump to: navigation, search

Contents

Storing Binary files in the Database

Purpose: to cover the pluses and minuses of storing files in a database versus storing the path to the binary file and other meta data.

Reason: this topic has come up several times in the past on the PostgreSQL mailing lists. With new user asking, is it OK store pictures, word, spreadsheet, html, pdf, or etc. in the database. Of course like everything thing there is 2 prevailing schools of thought each method having pluses and minus detailed below.

Caveat: many people come to this page looking for PostgreSQL's equivalent of the MySLQ "BLOB*" data types. The equivalent in Postgres is BYTEA. For details on PostgreSQL's "binary large object" (which are quite different from MySQL BLOB's and provide random seeking, etc), see below.

Methods:

Storing the large binary* file aka unstructured data streams in a database.

Postgres along with other databases offer similar basic structures.

  • BLOB binary large object see Large Object Support
    • Pluses
      • limited 4TB (PostgreSQL 9.3+) per entry, & 4 Billion per database
      • Can stream, and seek over entries ( can reduce memory requirements on DB server and client)
      • no encoding or escaping required.
    • Minuses
      • must use different interface from what is normally used to access BLOBs.
      • Need to track OID. Normally a separate table with additional meta data is used to describe what each OID is.
      • (8.4 and <8.4) No access controls in database.
      • Sometimes advised against (basically you only need them if your entry is so large you need/want to seek and read bits and pieces of it at a time).
  • Storing binary data using bytea or text data types
    • Pluses
      • Storing and Accessing entry utilizes the same interface when accessing any other data type or record.
      • No need to track OID of a "large object" you create
    • Minus
      • bytea and text data type both use TOAST (details here)
        • limited to 1G per entry
        • 4 Billion (> 2KB) entries per table max.
      • Need to escape/encode binary data before sending to DB then do the reverse after retrieving the data
      • Memory requirements on the server can be steep even on a small record set.
  • Pluses overall
    • Security and access control is simplified
    • version controlling is easier.
    • ACID
    • Backups are easier no need to track external files
  • Minus overall
    • Performance hit storing files in database.
    • Memory requirements higher, for Database
    • Backups can take significantly longer
    • Access to files to external applications is complicated. Normally a temporary file is copied to the client to access and modify the file then needs to be copied back.

Storing Meta data and symbolic link in the database to where a Binary file is located

  • Pluses
    • Performance accessing binary file skips DB access layer.
    • Number of files limited only by file system
    • Size limited by file system
  • Minuses
    • Need to develop an interface to keep track of externally attached files
    • external files and database can get out of sync
      • binary files missing entries in the database
      • database having entries to external files that's been deleted or moved by users or other events out of control of the database.
    • Security settings between file system and database are independent from each other. The application may need to may need multiple security logins to access all resources. This is dependent on security needs of the application, and methods used for access control. There are several methods to create a single log on infrastructure.
    • Multi points of failure depending on complexity of the network.

When should files be stored in the database?

The common suggestion here is when the files have to be ACID.

Example: Photos taken during a Quality Inspection process:

A normal/ordinary record in a database would consist of a Inspector, time stamp of inspection, process used to inspect the part, was the part accepted or rejected, and the parts serial number. In this scenario photos are also taken showing the part inspected. The photos are a integral piece of the inspection record keeping process.

The photos being a critical piece making up the record/inspection report should not to be stored separately. Consider if the photos being stored on file sever are moved to new location as part of upgrade, the link to the record is bad if it was not updated. So now the inspection record by itself is worth less as it lost the piece it is describing and the photos are worthless without the record describing them. So the two pieces must stay together to have any value as a inspection report.

Storing the photos outside the database, there is no easy way to guarantee atomicity, consistency, isolation and durability of the photos. Storing the photos in the database is the only practical solution where the programmer can guarantee rules are followed in the future.

What is the best way to store the files in the Database?

There are three methods to choose from in PostgreSQL BLOB, bytea or text data type.

BLOB aka Large Objects

  • Completely different interface has to be use, read through Large-Objects documentation to have a full understanding.
  • Does have advantage of being able to seek over data and stream data out.
  • data storage requirements are the same as if saved in the file system
  • limited server side tools to manipulate the objects.

bytea

  • data storage requirements are the same as the object being stored in the file system
  • Standard SQL statements to insert, update, delete, and select the object
  • How easy is it to get the data in and out depends on the programming tools. Python & .Net have the easiest automatic casting to and from languages type to postgres type (depends on db driver). Other languages database access layers require escaping or casting in a separate step before additional processing can be done.
  • Server side tools tools to encode or decode into other formats

text

  • data storage requirements can be 30% more required space than if the file stored in the file system
  • Standard SQL statements to insert, update, delete, and select the object
  • always has to be encoded and decoded into some format such as (not a complete list)

this can add significant overhead.

  • Server side tools to encode or decode into other formats

The author has changed to using bytea exclusively now as Python and plPython make it easy to use bytea vs text type.

What Meta Data should be stored in a database?

here is a basic list what should be recorded in the database along side the binary file

  • original file name
  • data type
    • this is needed to tell us what kinda of binary file is stored in the database
    • the author suggest following the Internet Media Type it makes it easier to use the same information for a website or attaching to email.
  • long description
  • date added

When is it bad idea to store binary files in the database?

Very large files (100meg+), where performance is critical to the application. The database layer adds a lot of overhead and complexity that may not be required.

Personal tools