Microsoft SQL Server to PostgreSQL Migration by Ian Harding

From PostgreSQL wiki

Revision as of 23:41, 7 November 2012 by Dmsoft (Talk | contribs)

Jump to: navigation, search

by Ian A. Harding <ianh@tpchd.org> v1.00, Last updated 17th September 2001

How to move a database from a popular proprietary database to the world's most powerful open source database.

Contents

Disclaimer

The following document is offered in good faith as comprising only safe programming and procedures. No responsibility is accepted by the author for any loss or damage caused in any way to any person or equipment, as a direct or indirect consequence of following these instructions.

Introduction

Microsoft SQL Server is very popular relational database management systems (RDBMS) with highly restrictive licensing and high cost of ownership if the database is of significant size, or is used by a significant number of clients. It does, however, provide a very user-friendly interface, is easy to learn and use, and has low cost entry level configurations. This has resulted in a very large installed user base.

PostgreSQL now challenges MS SQL Server in basic feature set, reliability and performance, has a much less restrictive license, and is open source. As a matter of course, users are migrating to PostgreSQL from MS SQL Server as the cost of ownership becomes an issue, and as their knowledge of relational database systems increases.

This HOW-TO is intended for the MS SQL Server user who is now ready to migrate databases to PostgreSQL.

Considerations

RDBMS features are implemented differently and to different degrees by programmers. Some applications rely heavily on so-called middleware, or on the client application to handle business logic. Others attempt to put as much logic as possible in the database. Your migration will be far more difficult if your application is in the latter group. While it is a sound design choice to put logic in the database server, it will require programming in a vendor specific Structured Query Language (SQL) extension such as Microsoft's Transact SQL (T-SQL). This is also the case with PostgreSQL. There is no easy way to migrate stored procedures, triggers, or rules. On the bright side, PostgreSQL provides several language options, all of which are more graceful than T-SQL.

RDBMS all provide built-in functions. However, like procedural extensions to SQL, they are not portable. Fortunately, there is some overlap, and the simple syntax makes migration relatively easy.

Finally, the programmer's choice of SQL syntax can affect this process. Most RDBMS are approaching the evolving SQL standards. That is, they are leaning away from vendor specific syntax such as the '*=' syntax for a left outer join. This syntax is still supported in MS SQL Server as of version 7.0, but was never supported in PostgreSQL.

This process will require either a mind-numbing amount of hand editing of script and data files, or use of a scripting language to programmatically modify these files, followed by a somewhat less enormous amount of editing. I am not smart enough to identify every possible option for the migration, or to accomodate them in a script. I have done this migration on a relatively complex database application in a reasonable amount of time. This, rather than a technically flawless script, should be your goal.

I use Tool Command Language (TCL) for almost everything, so I use it here. You can use whatever language you like.

Tables

Dump the table defininitions with the MS SQL Server scripting tool. From the Enterprise Manager, right click on your database and select 'All Tasks', then 'Generate SQL Scripts' from the context menu. Uncheck 'Script All Objects', and select 'All Tables'. On the 'Formatting' tab, de-select 'Generate DROP...'. On the 'Options' tab, select 'Script indexes' and Script PRIMARY KEYS...'. Select the 'MS-DOS' file format, and make sure 'Create one file' is checked. Click OK, give it a name, and put it somewhere you can find it.

A brief look at this file will show you what we are up against. MS uses square brackets around all identifiers, to protect you from poor design choices such as using reserved keywords so crazy things like:

CREATE TABLE [dbo].[Select] ([Union] [int])

are possible. PostgreSQL uses double quotes instead. MS uses the object owner qualification for all objects, 'dbo' in this case. PostgreSQL has no such qualifications in object names.

Another thing to note is that MS SQL identifiers are case preserved, but in practice most installations are installed as case insensitive. PostgreSQL is case agnostic in the case of SQL keywords and unquoted identifiers, forcing all queries to lower case. It is not the same as being case insensitive, in that you can create tables using the double quote protection mentioned above, such that they can only be accessed using the same double quoting method. I find it is best to abandon case in object identifiers when migrating to PostgreSQL. Also, it is safest to avoid any identifiers that require quoting to avoid problems down the road.

It is worth noting that for data comparisons, PostgreSQL is case sensitive and there is no option to change this behaviour. You will have to force data to upper or lower on both sides of text comparisons if case is not important to the operation and there is a chance of it being different. This conversion might be a good time to force data used in joins and comparisons to all upper or lower case. You will also need to look at the application for code that does comparisons of user-entered information taking advantage of MS SQL Server's typical case insensitivity.

Another issue that is not immediately evident is that MS SQL Server supports ALTER TABLE statements that contain comma separated lists of alterations. PostgreSQL currently does not. This is important since the MS SQL Server scripting program creates all constraints in ALTER TABLE statements. If any tables have more than one constraint, you will have to surgically separate them into their own ALTER TABLE statements, or move them in the CREATE TABLE statement.

Indexes are a bright spot, mostly. The CLUSTER keyword in PostgreSQL is not the same as the CLUSTERED keyword in a MS SQL Server index creation. PostgreSQL will allow you to 'cluster' a table, that is, rearranging the tuples in a table in order for that field. This sounds good, except that the cluster is not maintained for updates and inserts, and the fact that it will break all your other indexes whenever you generate the clustering.

Having said all that, here is a partial list of things to correct:

  1. Force to lower case.
  2. Remove all square brackets.
  3. Remove all object owner prefixes (i.e. "dbo.")
  4. Remove all reference to filegroup (i.e. "ON PRIMARY")
  5. Remove all non-supported optional keywords (i.e. "WITH NOCHECK", "CLUSTERED")
  6. Update all non-supported data types (i.e. "DATETIME" becomes "TIMESTAMP") Also, this is a good time to get away from MONEY.It is supported in PostgreSQL, but is on its way out.Use NUMERIC(19,4).
  7. Replace the T-SQL batch terminator "GO" with the PostgreSQL batch terminator ";"

Put this file somewhere safe, and now let's get the data.

Data

Data is data. It is brought over in text form and cast into it's proper form by the database according to the datatypes you used in creating your tables. If you have binary data, I am the wrong guy to ask.

There are a couple gotchas here too, of course. Since we use the COPY command, and it interprets a newline as the end of a tuple, you need to clean out all those newlines lurking in your text fields in MS SQL Server. This is easy enough to do. Also, the data dump from MS SQL Server will use the standard cr/lf line terminator, which needs to be changed to lf or it will cause havoc in comparisons of strings, among other problems. I did this the easy way, downloading the dumps to my machine running my favorite Unix-like operating system via ftp, which does this translation for you.

The first step in dumping the data out of MS SQL Server is to type all the names of your fields into a text file on the Win32 machine. You can cheat and issue:

select name from sysobjects where type = 'U'

in Query Analyzer (ISQL-W) to get the list, then save the results to a file. Then, write a handy little script to call bcp, the Bulk Copy Program. Mine looks like this:

set file [open "C:\\inetpub\\ftproot\\tablelist.txt" r] while {![eof $file]} {
    set table [gets $file]
    exec bcp ..$table out $table -c -k -S192.168.100.1 -Usa -Ppassword -r ~
}
close $file

This will dump all the listed tables into files of the same name in the current directory. The -c flag means to use plain character format. The -k flag tells bcp to "keep nulls". This is important later when we import the data. The -r is the "row terminator". To make cleaning up the carriage returns easier, I use this to signal the end of a row. I put this script in the C:\InetPub\ftproot directory, so I can go right to the next step.

From the Unix-like machine, start ftp and get the file listing you created earlier. Put it in a work directory. Change to the new work directory and get the files:

ftp> lcd /home/homer/workdir Local directory now /home/homer/workdir
ftp> fget tablelist.txt

This should download all of the data files to the work directory, magically converting line terminators to Unix compatible format. If you can't use FTP, there are other ways to get files from here to there. Just be advised that you may need a little sed script to fix the cr/lf problem.

Now, let's fix the embedded line feed issue.

#!/usr/pkg/bin/tclsh
set file [open tblnames r]
set flist [read -nonewline $file]
close $file
set flist [split $flist \n]
foreach f $flist {
    set file [open $f r]
    set data [read -nonewline $file]
    close $file
    regsub -all {\000} $data {} data
    regsub -all {\n} $data \\\n data
    regsub -all {~} $data \n data
    set file [open $f w]
    puts -nonewline $file $data
    close $file
}

The regsub lines are where the work gets done. They replace all nulls (\000) with an empty string, then all linefeeds with a literal "\n" which will tell COPY what to do when we import the file, then my line terminators get replaced with a linefeed, which is what COPY is expecting. There are cleaner and easier ways to do this, but you get the point.

Now, go back to the sql file you edited to create your database objects. I assume it is on the Unix-like box at this point. It should have a series of CREATE TABLE statements, followed by ALTER TABLE and CREATE INDEX, etc statements. What we need to do now is tell it we want to load data after the tables are created, but before anything else.

For each CREATE TABLE statement, follow it with a COPY statment. Something like :

COPY tablename FROM '/home/homer/workdir/tablename' with null as ;

Once you have this done, execute it against your PostgreSQL database, something like

$ psql newdb < modifiedscript.sql &> outfile

should work. The output file is good to have for looking for problems. It gets messy so :

$ grep ERROR outfile

can give you an idea how things went. I guarantee you have some troubleshooting to do.

Views

Views are pretty easy, as long as you didn't use too many functions in them. A favorite of mine is isnull(). Like most functions, it has a PostgreSQL counterpart, coalesce(). A surprising number of functions will work just fine. For example, round() is exactly the same. datepart() becomes date_part(), but the arguments are the same, althought PostgreSQL may be more particular about format strings. For example, SQL Server will accept datepart(yyyy, mydatefield) as well as datepart(year, mydatefield). PostgreSQL wants to see date_part('year', mydatefield) (note single quotes).

Generating sql for views is pretty much the same as for tables. From the Enterprise Manager, right click on your database and select 'All Tasks', then 'Generate SQL Scripts' from the context menu. Uncheck 'Script All Objects', and select 'All Views'. On the 'Formatting' tab, de-select 'Generate DROP...'. On the 'Options' tab, Select the 'MS-DOS' file format, and make sure 'Create one file' is checked. Click OK, give it a name, and put it somewhere you can find it.

Run this file through the same script you created to clean the sql for your tables, and see if it will work on PostgreSQL. If not, you will have to do some fixing of functions.

Summary

Converting a database from MS SQL Server is not always easy. It is, however, always worth it. You will find PostgreSQL to be an extremely powerful and flexible product, with the best tech support in the world, the actual developers and users of the product. If you spent days trying to get xp_sendmail to work on SQL Server version 7.0, or wondered what was in those enormous "Service Packs" then you will appreciate this.

Extension Links

Personal tools