Porting data from dBASE IV to PostgreSQL

From PostgreSQL wiki
Jump to navigationJump to search

Porting data from dBASE IV to PostgreSQL

Last updated 7th September 2002.

Copyright © 2002 Vijay Deval.

I have ported a few tables from dBASE IV to PostgreSQL. This page illustrates the steps I followed for one of these tables, in case it's helpful to people.

The table we are porting is called client. In it, there is a column called CLID that is defined as a sequence in PostgreSQL. Because PostgreSQL automatically generates its sequence numbers, the COPY command can't be used to load the converted data in. Data must be transferred record by record. This can be done by converting records from dBASE into a series of SQL statements.

Before we do it, it is necessary to exclude the CLID column from the text file. A "cut" (that's a *nix command) or a (g)awk script can easily do this for us. In order to provide a simple check on the data transfered, the CLID data from the text file is inserted into a new OLDCLID column in the PostgreSQL table definition. If CLID and OLDCLID match all the way through after the transfer, it confirms a correct and successful data transfer. Once we know that has happened, the OLDCLID column is dropped.

If all the columns in a dBASE IV table are not required, then the unwanted columns need to be dropped. The sequence of the remaining columns in the text data file may also need rearranging. If this is the case, then using the comma character as a field separator proves unsuitable because commas can occur as part of text data. A (g)awk script used to rearrange and/or remove some of the fields from the text data file can't distinguish between commas in text data and commas inserted as a field separator. Field separator needs to be a character that is not likely to occur in the data.

With CLID in this case, we need all the columns of the client table so using a comma as the field separator would work fine. I have chosen to use a | character as the field separator here though, just to illustrate the freedom of choice. '|' is replaced with ',' before converting the text file into SQL commands (as clipop.sql).

The SQL COPY command lets you change the field separator from comma to something else, but for SQL statements it is still necessary to have only commas as field separators.

The actual steps are as follows.

In dBASE IV:

  1. Use client
  2. Copy stru extended to clistru
  3. From control center export clientf.dbf and clistru.dbf as fixed field length text files. client.txt is data and clistru.txt gives details of the structure of client.dbf.
  4. Back in Linux, mread client.txt and clistru.txt
  5. Put field separators in the text file using a C script. I have named it fields.c.
  6. Create a gawk script and convert each record into a sql statement.

The clistru.txt file, describing the structure of the client table in dBASE IV is as follows.

CLID N 4 0Y
NAME C 45 0N
ADDR1 C 40 0N
ADDR2 C 40 0N
ADDR3 C 30 0N

Using this file the corresponding table in PostgreSQL can be created.

A small extract of client.txt:

1
Philips India Limited.
Loni Kalbhor
Near(C.Rly)

Pune 412 201
2
Champion Electronics Pvt. Ltd.
S-17, M.I.D.C.
Bhosari
Pune 411 026

The table called client was created using this command:

create table CLIENT (
    CLID         serial,
    OLDCLID      INT,
    NAME         VARCHAR(40),
    ADDR1        VARCHAR(45),
    ADDR2        VARCHAR(45),
    ADDR3        VARCHAR(45),
    PRIMARY KEY(CLID) );

CLID is the primary key for this table.

The column OLDCLID can be dropped after data migration is confirmed to be accurate.

Compile fields.c

$ a.out < client.txt > tmp

This creates the file named tmp. It is the same as client.txt, but with field separators in place. An extract of tmp follows:

1|'Philips India Limited. '|'Loni Kalbhor '|'Near(C.Rly) '|'Pune 412201 ' 2|'Champion Electronics Pvt. Ltd. '|'S-17, M.I.D.C. '|'Bhosari '|'Pune 411 026 '

with help of sed or emacs replacing '|' with ',' .

gawk script

I named the gawk script as clipop.awk. It is as follows:

{print "insert into client (oldclid, name,addr1,addr2,addr3) values (" $0, ");"}

$ gawk -f clipop.awk tmp>clipop.sql

converts the file to SQL statements.

A sample line from clipop.sql:

insert into client (oldclid, name,addr1,addr2,addr3) values ( 1,'Philips India Limited. ','Loni Kalbhor ','Near(C.Rly) ','Pune 412 201 ');
\i clopop.sql

loads the file into PostgreSQL.


C script to put field separators at field boundaries

#include <stdio.h>

#define FNM 5 /* number of fields */
#define FS "|" /* field separator */
#define BLANKDATE "NULL" /* select default for blank date field. (8) */
#define BLANKCHAR "NULL" /* select default value for blank char field */
#define BLANKNUM "0" /* NULL if required */

main()
{
int c,i,len,count,blank;

int flen[] = {4,45,40,40,30};
int nmchr[] = {0,1,1,1,2}; /* (1) */
/* User need not edit the code that follows. */
/*---------------------------------------------------------------------*/
 blank=0;
 count=0;
 i=0;
 len=flen[0];
 c=getchar();
 while (c != EOF){
   if( (nmchr[0]==1)&&(count==0))
     printf("'"); /* (2) */
   if(count==len) { /* at the boundary, 1st char of new field */
     if ((nmchr[i]==1)||(nmchr[i]==2)) /* value of i is of prev field */
       printf("'"); /* (3) */
     if(i<FNM-1)
       printf("%s", FS); /*(4)*/
     i++;
     len=len+flen[i]; /* add len of new field */
     if((nmchr[i]==1)||(nmchr[i]==2))
       printf("'"); /* leading inv comma if reqd */
     blank=0; /* before starting work on new field balnk is set to zero */
   }
   if (c==' ')
     blank++;
   else
     blank=0;

   if((blank>1)||((count==(len-flen[i]))&&(c==' '))) /* (5) */
     ;
   else
     putchar(c);

   count=count+1;
   c=getchar();

   if ((count==len-1)&&(c== ' ')&&(blank==flen[i]-1)){
      if (nmchr[i]==0)
         printf("%s",BLANKNUM); /*(6) */
      if (nmchr[i]==2)
         printf("%s",BLANKDATE); /*(9)*/
      if (nmchr[i]==1)
         printf("%s",BLANKCHAR);
   }
   if(c=='\n'){
     if ((nmchr[i]==1)||(nmchr[i]==2))
       printf("'"); /*(7)*/
       count=0;
       i=0;
       len=flen[0];
       putchar(c);
       c=getchar();
     }
 }
 return 0;
}

Outline of the program When at the border of fields, you do the following:

  1. If it is char or date type, then put an inverted comma at end of previous field, or
  2. Put a field separator, or
  3. Put an inverted comma at the beginning of current field, if it is char or date
  4. Trim blanks

Identify blank fields, if any and replace with pre-set defaults:

  1. flen[ ] holds information about lengths of fields. nmchr[ ] holds information about type of the field. Numeric fields are assigned value 0 (zero), char fields 1 and date fields 2.
  2. If the first field is char or date , put inverted comma first
  3. The char read is first char of field i+1. It is not yet put. If nmchr[i] is not numeric, it needs a trailing inverted comma.
  4. After making decision about trailing inverted comma, put field separator. Do not put FS at end of last field. (i<FNM-1)
  5. Counter "blank" is incremented if char read into "c" is balnk space. If the counter exceeds 1, or if the first char of any field is blank, program proceeds without making any output. This trims blanks. Counter is reset to 0 when c is non space. condition "((count==(len-flen[i]))&&(c==' '))" identifies that the first character of the field is blank., and prevents output.
  6. If blank increments to flen[i]-1, it means that the field is blank. dBASE exports undefined numeric field as blank string. This causes mismatched data type error. If a numeric is blank, it is replaced by zero.
  7. When carriage return is encountered, put inverted comma if the last field is non numeric or date.
  8. Form of date is yyyymmdd */
  9. dBASE will export an undefined date field as a blank field. If blank date field is encountered, it needs to be replaced by some date.