BinaryFilesInDB

From PostgreSQL wiki

Revision as of 10:49, 13 February 2012 by Mbalmer (Talk | contribs)

Jump to: navigation, search

Contents

Storing Binary files in the Database

Purpose: to cover the pluses and minuses of storing files in a database versus storing the path to the binary file and other meta data.

Reason: this topic has come up several times in the past on the PostgreSQL mailing lists. With new user asking, is it OK store pictures, word, spreadsheet, html, pdf, or etc. in the database. Of course like everything thing there is 2 prevailing schools of thought each method having pluses and minus detailed below.


Methods:

Storing the large binary* file aka unstructured data streams in a database.

Postgres along with other databases offer similar basic structures.

  • BLOB binary large object see Large Object Support
    • Pluses
      • limited 2G's per entry, & 4 Billion per database
      • Can stream, and seek over entries ( can reduce memory requirements on DB server and client)
      • no encoding or escaping required.
    • Minuses
      • must use different interface from what is normally used to access BLOBs.
      • Need to track OID. Normally a separate table with addition meta data is used to describe OID
      • (8.4 or older) No access controls in database.
  • Storing binary data using btyea or text data types
    • Pluses
      • Storing and Accessing entry utilizes the same interface when accessing any other data type or record.
      • No need to track OID
    • Minus
      • bytea and text data type both use TOAST
        • limited to 1G per entry
        • 4 Billion entries per table
      • Need to escape/encode binary data before sending to DB then do the reverse after retrieving the data
      • Memory requirements on the server can be steep even on a small record set.
  • Pluses overall
    • Security and access control is simplified
    • version controlling is easier.
    • ACID
    • Backups are easier no need to track external files
  • Minus overall
    • Performance hit storing files in database.
    • Memory requirements higher, for Database
    • Backups can take significantly longer
    • Access to files to external applications is complicated. Normally a temporary file is copied to the client to access and modify the file then needs to be copied back.

Storing Meta data and symbolic link in the database to where a Binary file is located

  • Pluses
    • Performance accessing binary file skips DB access layer.
    • Number of files limited only by file system
    • Size limited by file system
  • Minuses
    • Need to develop an interface to keep track of externally attached files
    • external files and database can get out of sync
      • binary files missing entries in the database
      • database having entries to external files that's been deleted or moved by users or other events out of control of the database.
    • Security settings between file system and database are independent from each other. The application may need to may need multiple security logins to access all resources. This is dependent on security needs of the application, and methods used for access control. There are several methods to create a single log on infrastructure.
    • Multi points of failure depending on complexity of the network.

When should files be stored in the database?

The common suggestion here is when the files have to be ACID.

Example: Photos taken during a Quality Inspection process:

A normal/ordinary record in a database would consist of a Inspector, time stamp of inspection, process used to inspect the part, was the part accepted or rejected, and the parts serial number. In this scenario photos are also taken showing the part inspected. The photos are a integral piece of the inspection record keeping process.

The photos being a critical piece making up the record/inspection report should not to be stored separately. Consider if the photos being stored on file sever are moved to new location as part of upgrade, the link to the record is bad if it was not updated. So now the inspection record by itself is worth less as it lost the piece it is describing and the photos are worthless without the record describing them. So the two pieces must stay together to have any value as a inspection report.

Storing the photos outside the database, there is no easy way to guarantee atomicity, consistency, isolation and durability of the photos. Storing the photos in the database is the only practical solution where the programmer can guarantee rules are followed in the future.

What is the best way to store the files in the Database?

There are two methods to choose from in PostgreSQL BLOB or bytea/text data types. The authors personal preference is using text data type to store files in the database as it simplifies getting to the file and eases encoding the file for storage.

The BLOB type requires using a different interface and adds some complexity in programming. Although it does support a far larger file. This added complexity the author tries to avoid.

The author dislikes the bytea data type as it can be very difficult to escape and then unescape depending on programming tools used. There is nothing wrong with the bytea data type but it can be more difficult to use. Yet this has been made easier in some programming languages such as PHP, which includes a escape and unescape bytea functions. Yet this is not a common tool available in other programming languages and as a general rule/suggestion the database should be designed to work with as many programming languages and or frameworks as makes sense.

Storing files in the text data type requires the files to be encoded in some character to binary representation. There are several to chose. This is not exhausted list just commonly used ones.

Base64 probably today is the most common encoding method for binary files thanks to email. Many programing languages, and frameworks support base64. So the author encodes the files in base64 when storing files in the database. Postgresql also has base64 encoding and decoding "encode() & decode()"

see section 9.4 String Function and Operators for more details

Biggest draw back using text data type and base64 encoding is it inflates the data by 33%. This adds a significant increase to storage requirements and overhead that bytea and BLOB do not suffer from. So keep in mind when designing a database what appears to be a simple decision now, can have big impacts down the road.

This 33% hit in overhead can have a significant impact. Consider the process time and IO hit on a 10 gig table with binary files. Postgresql is having to process an additional 3.3 gigs of more data because of the encoding method.

What Meta Data should be stored in a database?

here is a basic list what should be recorded in the database along side the binary file

  • original file name
  • data type
    • this is needed to tell us what kinda of binary file is stored in the database
    • the author suggest following the Internet Media Type it makes it easier to use the same information for a website or attaching to email.
  • long description
  • date added

When is it bad idea to store binary files in the database?

Very large files (100meg+), where performance is critical to the application. The database layer adds a lot of overhead and complexity that may not be required.

Personal tools