PostgreSQL for Enterprise Business Applications
Introduction
A frequent question about PostgreSQL, and open source databases in general, concerns their suitability for demanding, high-dollar business applications. While many programmers have trusted PostgreSQL for simple bulletin boards, web counters, and small ordering systems, there have not been many public descriptions of more complex business systems in place, or helpful setup guidelines for people seeking to build such systems.
In the process of building up some truly enterprise-level systems (400+ websites, $600K (USD) per month of monetary transactions, and 200+ physical satellite locations interacting with the systems and data on a daily basis) with a PHP/Apache/PostgreSQL combination, I've learned a few things. Some apply to working methods, some to large-business practice, some to simply tuning a system for maximum efficiency.
Ron Chmara (2002-01-29)
Things I've learned
- Treat it like any other accounting program. Separate user-entered data from system-recorded data. Use separate tables, and either external or internal (to PostgreSQL) logic to validate the data. If your normal business logic and accounting practices are based on double entry, mirror that as well. If your systems have entry-only tables and monthly record locks, you may find it desirable to have archival tables to store the now-locked data. (A schema sketch follows the list.)
- Use transactions. This will save time, money, and headaches. For large looping inserts, for example, a transaction dramatically reduces the amount of time required for disk writes, and if you have a program failure, or a need to abort the data entry, you aren't left with half-entered tables. (Sketch below.)
- Tune and optimize your code to do as much as possible in one pg_exec() statement; multiple statements should be avoided whenever possible (let the database engine do the work, not PHP). This allows for quicker database interaction, keeping the overhead of database communication (and Apache CPU time) low. Since applications and tables are often built piecemeal, it's good to revisit older code, which may have more than 15 database interactions, and reduce it to one or two statements. (Sketch below.)
- Decide upon numeric precision ahead of time (how many decimals?) and code for that. Always. Without fail. Different companies and industries have different requirements, but by setting this ruleset up ahead of time you avoid running into difficult precision and rounding debates later in the code's lifecycle. (Sketch below.)
- Run pg_dump as frequently as you desire, and copy the dump off to a safe server. Not because PostgreSQL goes down (I had a programmer familiar with one of the other open-source databases (*cough*) ask if the database had gone down, and was confused for a bit... because PostgreSQL *hasn't* gone down), but because you always want to be able to isolate a database at given breakpoints. While you can do the same kind of thing with entry-only database systems (where an update is instead a "later" version of the record), it doesn't give you the same level of isolation.
- Did I mention having a backup server? It doesn't matter how good the app is; lightning will damage the best of them. The more mission-critical a system is, the more you'll appreciate having off-site, or multi-site, backup database (and web) servers.
- You may need to recompile or otherwise adjust your kernel for larger shared memory (shm) and/or semaphore (sem) space; otherwise, you'll run out of memory for backends when PostgreSQL needs them. The default on Red Hat 6.2, for example, only allowed for 32. The pg_pconnect() PHP function needs a backend per database connection for each Apache process, which means that for 50 Apache processes connecting to one database, you need 50 postmaster backends. If you are running ten databases and have 20 active Apache processes, you would need 200 backends. One way around this issue is to set the httpd.conf MaxRequestsPerChild directive low to rotate the pool, which closes the Apache processes frequently (ensuring that you gain some speed with persistent connections, but that connections are closed after a certain amount of time). Another solution is to connect to only one database (which has all of your tables), to reduce the number of required backends. See also: http://www.postgresql.org/docs/current/interactive/kernel-resources.html
- Investigate AOLserver as an alternative; its connection pooling features take care of the above issue.
- If you're porting old PHP code, get the latest CVS copy, or PHP >= 4.0.7, which should have the row-grabbing option in the pg_* functions and makes life much easier. It's not completely documented (as of 6/30/01), but it works like the PHP MySQL functions now. The pg_close() function also works, which can help to manage the connection pooling issues with Apache. (Sketch below.)
- A note about oid/currval/nextval: ignore older documentation if you're using a recent version of PostgreSQL; several websites and books were written when these functions behaved in a different manner. One thing that does still hold true is that using currval and nextval may not guarantee perfect serialization (there may be holes if some interactions are aborted, stopped, or unused). If you need perfect serialization, you can perform an insert and retrieve the record serial number based on oid, or, as noted above, use an entry table (with expected serial holes) which then feeds linearly into a table (without holes). (Sketch below.)
- If you haven't done accounting apps before, hire an accountant/programmer. Their services can be pricey until you learn the ropes, but after that, the knowledge is invaluable. Even having a non-coding accountant "collaborator" look through the code diagrams and results helps to ensure a more stable, more business-rules-oriented product. Depending on the level of accounting logic required, it may also be helpful to have a manual of GAAP (Generally Accepted Accounting Principles) standards handy.
- Did I mention backups? Even inside of PG? Set up development-only tables, or a dev-only machine, so your building and testing happen independently of the live data. Updating millions of customer orders to be $0.00 on a live server is not an experience you want to have. As your code complexity increases, you may wish to have two testing servers, one for code development, and another for version (Apache, PHP, PostgreSQL, etc.) testing. This way you can test new versions of application software separately from your ongoing development environment.
- Watch out for binary (floating point) math. Many systems and function calls use it by default, so compensate (and/or cast) accordingly. One approach is to use only integer math and then work backwards to your decimal places; another is to work at a higher precision (say, 8 decimal places) and return data at the lower, required precision (4 decimal places). (Sketch below.)
- Watch out for bad data. Never, ever, under any circumstances, grant the web-server user too many rights... or, even worse, put an SQL statement in a GET. Filter all input data for size, string type, expected user, etc. Filter, filter, filter. Then filter again. Most security warnings related to programming with PHP, or other web-based data systems, come from issues where the data is not being appropriately validated before it is used. (Sketch below.)
- Force small data fields as part of the filtering. While it is tempting to make all fields TEXT, doing so creates massive database rows in the long run and will slow down the entire system. Sized fields also enforce another layer of filtering, in cases where bad (oversized) data could be inadvertently entered and used. (Sketch below.)
- Timestamp every record. This can really save you time when you need to find an exact range (such as everything done on Tuesday, because you had some bad code, or Tuesday's export is broken...). I find it useful to timestamp a creation date and a modification date. If you're using some form of internal user authorization (such as mod_ldap), it can also be valuable to stamp records with things like the creator's uid and modifier's uid. (Sketch below.)
- For some systems, you should never use "update"; just insert a reversing record. (See the earlier tips about treating it like an accounting program and about entry-only tables.) This allows you to maintain a granular record of all transactions, preventing loss of historical data. It all depends on the accounting practices required, and how much data you will need to preserve. For example, rather than zeroing out a purchase amount or deleting an invoice number, you would insert a negative purchase as an additional record. (Sketch below.)
- Vacuum often. Daily, or more often, as needed. After large record changes, VACUUM and VACUUM ANALYZE can boost performance by 1000% or more. (Sketch below.)
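To make some of these tips more concrete, a few rough sketches follow. They use PHP 4-era pg_* calls against PostgreSQL, and every table, column, and variable name is invented for illustration, so treat them as sketches under those assumptions rather than drop-in code. First, the accounting-style separation of user-entered and system-recorded data:

    <?php
    // User-entered data lives in an entry-only table; validated,
    // system-recorded rows live in a separate double-entry ledger.
    $db = pg_connect("dbname=orders user=webapp");

    pg_exec($db, "CREATE TABLE order_entry (
        entry_id    serial PRIMARY KEY,
        entered_by  varchar(32) NOT NULL,
        amount      numeric(12,4) NOT NULL,
        entered_at  timestamp DEFAULT now()
    )");

    pg_exec($db, "CREATE TABLE ledger (
        ledger_id   serial PRIMARY KEY,
        entry_id    integer REFERENCES order_entry,
        account     varchar(32) NOT NULL,
        debit       numeric(12,4) DEFAULT 0,
        credit      numeric(12,4) DEFAULT 0,
        recorded_at timestamp DEFAULT now()
    )");

    // When a month is locked, its rows can move into an archival table
    // with the same layout as the live entry table.
    pg_exec($db, "CREATE TABLE order_entry_archive AS
                  SELECT * FROM order_entry WHERE false");
    ?>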
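The transaction tip, assuming $items is an already-validated order form and $order_id is known; the inserts either all commit or all disappear, and the disk sync happens once at COMMIT instead of once per INSERT:

    <?php
    $db = pg_connect("dbname=orders user=webapp");

    pg_exec($db, "BEGIN");
    $ok = true;
    for ($i = 0; $i < count($items); $i++) {
        $qty = (int) $items[$i]['qty'];
        $sku = addslashes($items[$i]['sku']);
        if (!pg_exec($db, "INSERT INTO line_items (order_id, sku, qty)
                           VALUES ($order_id, '$sku', $qty)")) {
            $ok = false;
            break;
        }
    }
    // No half-entered order: either everything above is kept, or none of it.
    pg_exec($db, $ok ? "COMMIT" : "ROLLBACK");
    ?>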
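The "one pg_exec()" tip; the usual win is replacing a PHP loop of small queries with a single statement that lets the backend find the rows itself (the invoices table and its columns are made up):

    <?php
    $db = pg_connect("dbname=orders user=webapp");

    // Slow: one database round trip per invoice.
    // foreach ($overdue_ids as $id) {
    //     pg_exec($db, "UPDATE invoices SET status = 'overdue'
    //                   WHERE invoice_id = $id");
    // }

    // Better: one statement, one round trip.
    pg_exec($db, "UPDATE invoices
                  SET status = 'overdue'
                  WHERE status = 'open'
                    AND due_date < now() - interval '30 days'");
    ?>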
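The numeric precision tip, with the agreed-upon precision (four decimal places here, purely as an example) declared in the schema and matched in the PHP code:

    <?php
    $db = pg_connect("dbname=orders user=webapp");

    pg_exec($db, "CREATE TABLE payments (
        payment_id serial PRIMARY KEY,
        amount     numeric(12,4) NOT NULL
    )");

    // Round to the same ruleset before the value ever reaches the table.
    $amount = sprintf("%.4f", $raw_amount);
    pg_exec($db, "INSERT INTO payments (amount) VALUES ($amount)");
    ?>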
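The PHP porting tip, assuming PHP >= 4.0.7; rows are grabbed by row number, much like the MySQL functions, and the connection is closed explicitly:

    <?php
    $db = pg_connect("dbname=orders user=webapp");

    $result = pg_exec($db, "SELECT customer_id, name FROM customers ORDER BY name");

    for ($row = 0; $row < pg_numrows($result); $row++) {
        $data = pg_fetch_array($result, $row);
        echo $data['customer_id'] . ": " . $data['name'] . "\n";
    }

    // Closing non-persistent connections keeps the backend count down.
    pg_close($db);
    ?>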
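The serial-number note; the sequence name below follows the usual serial-column naming convention but is an assumption, and remember that aborted transactions can leave holes:

    <?php
    $db = pg_connect("dbname=orders user=webapp");

    // Grab the id first, then insert with it...
    $row      = pg_fetch_row(pg_exec($db, "SELECT nextval('orders_order_id_seq')"), 0);
    $order_id = $row[0];
    pg_exec($db, "INSERT INTO orders (order_id, status) VALUES ($order_id, 'new')");

    // ...or insert first, then read the value this session just used.
    pg_exec($db, "INSERT INTO orders (status) VALUES ('new')");
    $row      = pg_fetch_row(pg_exec($db, "SELECT currval('orders_order_id_seq')"), 0);
    $order_id = $row[0];
    ?>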
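The binary (floating point) math tip, showing both approaches mentioned above; the figures are arbitrary:

    <?php
    // Integer approach: work in cents and convert back only at the edges.
    $price_cents = 1999;                       // $19.99
    $qty         = 3;
    $total_cents = $price_cents * $qty;        // 5997, no rounding drift
    $display     = sprintf("%.2f", $total_cents / 100);

    // Higher-precision approach: compute at 8 places, return/store at 4.
    $rate   = 0.07125;
    $amount = 1234.5678;
    $tax    = round($amount * $rate, 8);       // intermediate precision
    $stored = sprintf("%.4f", $tax);           // required precision
    ?>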
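The filtering tip; every value is checked for type, shape, and size before it goes anywhere near an SQL string (field names and limits are made up):

    <?php
    $customer_id = (int) $_GET['customer_id'];          // force integer type
    if ($customer_id <= 0) {
        die("Bad customer id.");
    }

    $zip = $_GET['zip'];
    if (!preg_match('/^[0-9]{5}$/', $zip)) {            // expected shape only
        die("Bad zip code.");
    }

    $note = addslashes(substr($_POST['note'], 0, 200)); // size limit, then escape

    $db = pg_connect("dbname=orders user=webapp");
    pg_exec($db, "UPDATE orders SET note = '$note'
                  WHERE customer_id = $customer_id AND zip = '$zip'");
    ?>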
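The field-size tip; the schema itself becomes the last layer of filtering, with lengths that are only examples:

    <?php
    $db = pg_connect("dbname=orders user=webapp");

    // Sized columns instead of TEXT everywhere: smaller rows, and the
    // database rejects oversized values outright.
    pg_exec($db, "CREATE TABLE customers (
        customer_id serial PRIMARY KEY,
        name        varchar(60) NOT NULL,
        zip         varchar(10) NOT NULL,
        phone       varchar(20)
    )");
    ?>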
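The timestamp tip; creation time defaults in the schema, and each update stamps who changed what and when (column names are invented):

    <?php
    $db = pg_connect("dbname=orders user=webapp");

    pg_exec($db, "CREATE TABLE invoices (
        invoice_id  serial PRIMARY KEY,
        total       numeric(12,4) NOT NULL,
        created_at  timestamp DEFAULT now(),
        created_by  varchar(32),
        modified_at timestamp,
        modified_by varchar(32)
    )");

    // "Everything done on Tuesday by this user" becomes a simple WHERE clause.
    pg_exec($db, "UPDATE invoices
                  SET total = 150.0000, modified_at = now(), modified_by = 'someuser'
                  WHERE invoice_id = 42");
    ?>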
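The "insert a reversing record" tip; the original purchase is never updated or deleted, and a correction is simply a negating row (the purchases layout is assumed):

    <?php
    $db = pg_connect("dbname=orders user=webapp");

    // Original purchase stays untouched...
    pg_exec($db, "INSERT INTO purchases (invoice_no, amount, memo)
                  VALUES (1001, 250.0000, 'widget order')");

    // ...and the correction is a new, negative record, not an UPDATE.
    pg_exec($db, "INSERT INTO purchases (invoice_no, amount, memo)
                  VALUES (1001, -250.0000, 'reversal of widget order')");

    // The invoice still nets out correctly, and the history is preserved.
    $result = pg_exec($db, "SELECT sum(amount) FROM purchases WHERE invoice_no = 1001");
    ?>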
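And the vacuum tip, which can run from cron via psql or, as sketched here, from a maintenance script after a large batch change:

    <?php
    $db = pg_connect("dbname=orders user=webapp");

    // Reclaim dead rows and refresh the planner's statistics.
    pg_exec($db, "VACUUM ANALYZE purchases");
    ?>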
Hope this helps.