COPY
COPY is the PostgreSQL method of bulk data loading. It comes in two distinct variants: the COPY SQL command, which is server-based, and the \COPY psql meta-command, which is client-based.
COPY is run by the PostgreSQL backend (user "postgres"), so that operating-system user needs permission to read or write the data file, and the file must be given as an absolute path on the server. \COPY, on the other hand, runs as the current $USER, with that user's environment, and it accepts relative paths. The psql \COPY is accordingly much easier to use when it handles what you need.
With either variant you also need the appropriate table privileges: SELECT permission to COPY from a table, and INSERT permission to COPY into it.
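The difference is easiest to see side by side. A minimal sketch, assuming a hypothetical database db with a table mytable and requiring a running server:

```shell
# Server-side COPY: the path is resolved on the *server*, must be
# absolute, and must be readable by the postgres OS user.
psql -d db -c "COPY mytable FROM '/var/lib/pgsql/import/data.txt'"

# Client-side \COPY: the path is resolved on the *client*, may be
# relative, and is read with the invoking user's permissions.
# Note: no trailing semicolon -- \copy is a psql meta-command.
psql -d db -c "\copy mytable from 'data.txt'"
```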
Syntax
Here is the syntax for COPY, as returned by the 8.3 client:
db=# \h COPY
Command:     COPY
Description: copy data between a file and a table
Syntax:
COPY tablename [ ( column [, ...] ) ]
    FROM { 'filename' | STDIN }
    [ [ WITH ]
          [ BINARY ]
          [ OIDS ]
          [ DELIMITER [ AS ] 'delimiter' ]
          [ NULL [ AS ] 'null string' ]
          [ CSV [ HEADER ]
                [ QUOTE [ AS ] 'quote' ]
                [ ESCAPE [ AS ] 'escape' ]
                [ FORCE NOT NULL column [, ...] ] ] ]

COPY { tablename [ ( column [, ...] ) ] | ( query ) }
    TO { 'filename' | STDOUT }
    [ [ WITH ]
          [ BINARY ]
          [ OIDS ]
          [ DELIMITER [ AS ] 'delimiter' ]
          [ NULL [ AS ] 'null string' ]
          [ CSV [ HEADER ]
                [ QUOTE [ AS ] 'quote' ]
                [ ESCAPE [ AS ] 'escape' ]
                [ FORCE QUOTE column [, ...] ] ] ]
The syntax for \COPY is slightly different: (a) being a psql command, it is not terminated by a semicolon, and (b) file paths are relative to the client's current working directory.
Many of these options can also follow the CSV keyword; for example, WITH CSV NULL AS '' is perfectly permissible.
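For instance, a CSV export combining several options after the CSV keyword might look like this (database, table, and file names are hypothetical, and a running server is required):

```shell
# Export mytable as CSV with a header row, writing NULLs as empty fields.
psql -d db -c "COPY mytable TO '/tmp/mytable.csv' WITH CSV HEADER NULL AS ''"
```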
Caveats with import
COPY is not terribly smart or clever; in fact, it is dumb and simple. Here are some problems you can expect with COPY:
- Copying from a source with excess columns
- Let's say you're copying from a source that has the fields (Foo, Bar, Foo+Bar); while copying, you realize that Foo+Bar is pointless, since it is a function of Foo and Bar. PostgreSQL provides no way to ignore such a column during the load: COPY requires every column in the feed to be present in the table you're loading into. Work to fix this has been going on under the term "ragged csvs"
- Embedded delimiters
- This is all too common, especially with user-created memo text. COPY expects tabs as delimiters by default, but you can specify something else with DELIMITER AS 'whatever'. If your data contains extra delimiter characters, COPY will find too many fields to fit your table, and... you can guess the rest. Most often this shows up as a data type mismatch, as COPY attempts to stuff your text string into a date field or something similar.
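One cheap pre-flight check is to count fields per line before loading. A sketch, assuming tab-delimited data (the demo file and its contents are invented for illustration):

```shell
# Pre-screen a tab-delimited feed: a clean file for an N-column table
# should report exactly one distinct field count, N.
printf 'a\tb\tc\nd\te\tf\tg\n' > data.txt  # demo: one 3-field row, one 4-field row
awk -F'\t' '{ print NF }' data.txt | sort -n | uniq -c
```

Any line count other than the expected one flags rows with embedded (or missing) delimiters before COPY ever sees them.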
- Backslash characters
- This means trouble, because the following delimiter is thereby escaped, and no longer recognized as a delimiter. All the fields get shifted down by one, with the result that (a) your COPY fails because of a data type mismatch, or (b) your data is silently accepted in a mangled state. Either way, not good.
- Carriage return characters (CR)
- If the CR is not removed, it will end up in the final field of your table. If that field is a number or date data type, your COPY will fail. If the field is a character data type, your COPY will succeed, and you will then be scratching your head trying to work out why comparisons using that field give such strange results. Some quick solutions:
- If you received the datafile from an ftp server, you can avoid these altogether by using the "ascii" transfer method. This will automatically adjust the line endings for you.
- Simply delete all of the carriage returns using a simple script (tr writes to standard output, so redirect the result to a new file):
tr -d '\r' < datafile > datafile.clean
- Use a program like GNU recode, which will alter only the CRs used in the line delimiter. The command for that is:
recode /cl datafile
- NULL confusion
- COPY expects NULLs to be represented as "\N" (backslash-N) by default. You can change this with WITH NULL AS 'something_else'. If you have empty fields in your data, I strongly recommend using WITH NULL AS ''; otherwise, COPY will assume empty fields represent empty strings, and will bomb on the first empty number or date field. Note that you cannot mix NULL representations! See also the NULL caveat under "Caveats with implementation" below.
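If a feed mixes representations, one option is to normalize empty fields to COPY's default \N marker before loading. A sketch for tab-delimited data (the demo input is invented):

```shell
# Rewrite empty tab-delimited fields as \N, COPY's default NULL marker.
printf 'a\t\tc\n' |
awk -F'\t' 'BEGIN { OFS = FS }
            { for (i = 1; i <= NF; i++) if ($i == "") $i = "\\N"; print }'
```

This turns the empty middle field into a literal \N while leaving non-empty fields untouched.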
- Wrong data for datatype
- Your data must match the format required by the relevant PostgreSQL data type. For example, if your integers are formatted with locale-specific grouping separators, such as commas or periods, as in
1,203,327
Postgres will not coerce them to an int.
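For tab-delimited feeds (where stripping commas cannot collide with the delimiter), a quick pre-filter is a sketch like:

```shell
# Remove thousands separators so values like 1,203,327 become 1203327.
# Safe for tab-delimited feeds; do NOT do this to comma-delimited CSV.
echo '1,203,327' | sed 's/,//g'   # → 1203327
```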
- Speed
- While much faster than any other import path, the single backend process running COPY can still easily become the bottleneck for an import. It's not uncommon for that backend to be pegged at 100% of one CPU while the rest of the server is loafing.
Caveats with implementation
- COPY's behavior is not symmetrical
-
- If you COPY data into a table already containing data, the new data will be appended.
- If you COPY TO a file already containing data, the existing data will be overwritten.
- NULL behavior is counter-intuitive
- WITH NULL AS '""' will not insert a NULL when empty quotes are encountered: anything quoted is treated as non-null, and this argument does not short-circuit that behavior.
- Cannot handle constraint violations
- COPY aborts on the first row that violates a unique constraint; there is no way to skip or replace duplicates. MySQL's LOAD DATA, by contrast, offers REPLACE and IGNORE options.
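The usual workaround is to COPY into an unconstrained staging table and de-duplicate on the way into the real table. A sketch, assuming a hypothetical table target with a key column id and a running server:

```shell
psql -d db <<'SQL'
-- Load into a temporary staging table with no unique constraint.
CREATE TEMP TABLE staging (LIKE target);
\copy staging from 'data.txt'
-- Then insert only rows whose key is not already present.
INSERT INTO target
SELECT * FROM staging s
WHERE NOT EXISTS (SELECT 1 FROM target t WHERE t.id = s.id);
SQL
```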
- Cannot extend Pg coercions
- The data-loading mechanism relies on the data being a formal representation of a Pg data type, or coercible (i.e., castable) by Pg. However, there is currently no way to add custom coercions for the Pg types. You cannot, for instance, make
'31,337'::int
work by overriding the coercion to int.
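The staging-table trick covers this case too: load the awkward column as text, then do the conversion in SQL, where COPY's coercion rules no longer apply. A sketch (table and file names hypothetical, running server required):

```shell
psql -d db <<'SQL'
CREATE TEMP TABLE staging (n text);
\copy staging from 'numbers.txt'
-- Strip the separators ourselves, then cast.
INSERT INTO target (n) SELECT replace(n, ',', '')::int FROM staging;
SQL
```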
Sometimes a \COPY failure is less descriptive than a COPY failure. If \COPY fails, try the server-side COPY variant and see if its error message helps you troubleshoot the problem.
Other Loading Solutions
- pgloader is much better at loading error-prone data in a more flexible format than the built-in COPY is, even though it is using the COPY protocol internally.
- pg_bulkload aims at higher performance than standard COPY by skipping some sources of overhead.
Outside Links
- Official Pg documentation for COPY
- Other database implementations:
- MySQL's LOAD DATA INFILE
- MSSQL's BULK INSERT
- Perl's CSV_XS - not directly a database interface, but useful reading on CSV variations