Loading Data with SQL

This topic describes several ways to load data into OmniSci using SQL commands.

Note: If there is a potential for duplicate entries and you prefer to avoid loading duplicate rows, see How can I avoid creating duplicate rows? on the Troubleshooting page.

COPY FROM

CSV/TSV Import

Use the following syntax for CSV and TSV files:

COPY <table> FROM '<file pattern>' [WITH (<property> = value, ...)];

<file pattern> must be local on the server. The file pattern can contain wildcards if you want to load multiple files. In addition to CSV, TSV, and TXT files, you can import compressed files in TAR, ZIP, 7-ZIP, RAR, GZIP, BZIP2, or TGZ format.

You can import client-side files (using the \copy command in mapdql), but this is significantly slower. For large files, OmniSci recommends that you first scp the file to the server and then issue the COPY command.
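
As a sketch, a client-side load from mapdql might look like the following; the file path and table name here are hypothetical:

mapdql> \copy /home/myuser/tweets.csv tweets

The file is read on the client machine and streamed to the server, which is why a server-side COPY is faster for large files.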

<property> in the optional WITH clause can be:

  • delimiter: A single-character string for the delimiter between input fields. Default: "," (a CSV file)
  • nulls: A string pattern indicating a field is NULL. Default: an empty string or \N
  • header: Either 'true' or 'false', indicating whether the input file has a header line in Line 1 that should be skipped. Default: 'true'.
  • escape: A single-character string for escaping quotes. Default: the quote character
  • quoted: 'true' or 'false' indicating whether the input file contains quoted fields. Default: 'true'.
  • quote: A single-character string for quoting a field. Default: double quote ("). All characters inside quotes are imported “as is,” except for line delimiters.
  • line_delimiter: A single-character string for terminating each line. Default: "\n"
  • array_marker: A two-character string consisting of the start and end characters surrounding an array. Default: { }. For example, data to be inserted into a table with a string array in the second column (for example, BOOLEAN, STRING[], INTEGER) can be written as true,{value1,value2,value3},3.
  • array_delimiter: A single-character string for the delimiter between input values contained within an array. Default: ",".
  • threads: Number of threads for performing the data import. Default: the number of CPU cores on the system
  • max_reject: Number of records that the COPY statement allows to be rejected before terminating the COPY command. Records can be rejected for a number of reasons, including invalid content in a field, or an incorrect number of columns. The details of the rejected records are reported in the ERROR log. COPY returns a message identifying how many records are rejected. The records that are not rejected are inserted into the table, even if the COPY stops due to the max_reject count being reached. Default: 100,000.
  • plain_text: This parameter indicates that the input file is a plain text file, bypassing the libarchive decompression utility. CSV, TSV, and TXT files are always handled as plain text by default.
  • lonlat: The default behavior is to consume two columns as longitude (x) and then latitude (y) when populating a POINT column. If the order of the coordinates in the CSV file is the opposite, you can load the data using the WITH option lonlat='false'. Default: true.
Note

By default, the CSV parser assumes one row per line. To import a file with multiple lines in a single field, specify threads = 1 in the WITH clause.

Examples:

COPY tweets from '/tmp/tweets.csv' WITH (nulls = 'NA');
COPY tweets from '/tmp/tweets.tsv' WITH (delimiter = '\t', quoted = 'false');
COPY tweets from '/tmp/*'          WITH (header='false');
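
The following additional examples are sketches only; the table names, file paths, and option values are hypothetical. The first assumes a table with an array column, the second a file with embedded newlines inside quoted fields, and the third a lower rejection threshold:

-- Hypothetical tables and files, for illustration only:
COPY items  from '/tmp/items.csv'  WITH (array_marker = '[]', array_delimiter = '|');
COPY posts  from '/tmp/posts.csv'  WITH (threads = 1);
COPY events from '/tmp/events.csv' WITH (max_reject = 10);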

Geo Import

You can use COPY FROM to import geo files. Use the following syntax, depending on the file source.

  • Local server: COPY FROM '<filepath>' WITH (geo='true', ...);
  • Web site: COPY FROM '<[http | https]://website/filepath>' WITH (geo='true', ...);
  • Amazon S3: COPY FROM '<s3://bucket/filepath>' WITH (geo='true', s3_region='region', s3_access_key='accesskey', s3_secret_key='secretkey', ...);

The following WITH options are available for geo file imports from all sources:

  • geo_coords_type: Coordinate type used; must be geography.
  • geo_coords_encoding: Coordinates encoding; can be geoint(32) or none. Default: geoint(32)
  • geo_coords_srid: Coordinates spatial reference; must be 4326 (WGS84 longitude/latitude).

The following file types are supported:

  • ESRI Shapefile (.shp).
  • GeoJSON (.geojson, .json, .geojson.gz, .json.gz).
  • KML (.kml, .kmz).
  • File bundles:
    • .zip
    • .tar
    • .tar.gz
    • .tgz
    • NOTE: The first compatible file (.shp, .geojson, .kml) in the bundle is loaded, traversing subfolders until a compatible file is found. The rest of the contents in the bundle are ignored. If the bundle contains multiple filesets, unpack the file manually and specify it for import.
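
For example, a minimal local geo import, assuming a hypothetical GeoJSON file already present on the server, might look like this:

-- Hypothetical file path, for illustration only:
COPY FROM '/tmp/zipcodes.geojson' WITH (geo='true', geo_coords_encoding='none');

Here geo_coords_encoding='none' stores the coordinates uncompressed instead of using the default geoint(32) encoding.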

S3 AWS File Import

You can use the SQL COPY FROM statement to import files stored on AWS S3 into an OmniSci table, in much the same way you would with local files. In the WITH clause, specify the S3 credentials and region information for the bucket you are accessing.

COPY <table> FROM '<S3_file_URL>' WITH (s3_access_key = '<access_key>', s3_secret_key = '<secret_key>', s3_region = '<region>');

Access key, secret key, and region are required. For information about AWS S3 credentials, see https://docs.aws.amazon.com/general/latest/gr/aws-sec-cred-types.html#access-keys-and-secret-access-keys.

The following examples show failed and successful attempts to load data into the trips table from AWS S3.

mapdql> COPY trips FROM 's3://mapd-s3-no-access/trip_data_9.gz';
Exception: failed to list objects of s3 url 's3://mapd-s3-no-access/trip_data_9.gz': AccessDenied: Access Denied
mapdql> COPY trips FROM 's3://mapd-s3-no-access/trip_data_9.gz' with (s3_access_key='xxxxxxxxxx',s3_secret_key='yyyyyyyyy');
Exception: failed to list objects of s3 url 's3://mapd-s3-no-access/trip_data_9.gz': AuthorizationHeaderMalformed: Unable to parse ExceptionName: AuthorizationHeaderMalformed Message: The authorization header is malformed; the region 'us-east-1' is wrong; expecting 'us-west-1'
mapdql> COPY trips FROM 's3://mapd-parquet-testdata/trip.compressed/trip_data_9.csv' with (s3_access_key='xxxxxxxx',s3_secret_key='yyyyyyyy',s3_region='us-west-1');
Result
Loaded: 100 recs, Rejected: 0 recs in 0.361000 secs

The following example imports all the files in the trip.compressed directory.

mapdql> copy trips from 's3://mapd-parquet-testdata/trip.compressed/' with (s3_access_key='xxxxxxxx',s3_secret_key='yyyyyyyy',s3_region='us-west-1');
Result
Loaded: 105200 recs, Rejected: 0 recs in 1.890000 secs

trips Table

The table trips is created with the following statement:

mapdql> \d trips
        CREATE TABLE trips (
        medallion TEXT ENCODING DICT(32),
        hack_license TEXT ENCODING DICT(32),
        vendor_id TEXT ENCODING DICT(32),
        rate_code_id SMALLINT,
        store_and_fwd_flag TEXT ENCODING DICT(32),
        pickup_datetime TIMESTAMP,
        dropoff_datetime TIMESTAMP,
        passenger_count SMALLINT,
        trip_time_in_secs INTEGER,
        trip_distance DECIMAL(14,2),
        pickup_longitude DECIMAL(14,2),
        pickup_latitude DECIMAL(14,2),
        dropoff_longitude DECIMAL(14,2),
        dropoff_latitude DECIMAL(14,2))
WITH (FRAGMENT_SIZE = 75000000);

SQLImporter

java -cp [OmniSci JDBC driver]:[3rd party JDBC driver]
com.mapd.utility.SQLImporter -t [OmniSci table name] -su [external source user]
-sp [external source password] -c "jdbc:[external
source]://server:port;DatabaseName=some_database" -ss "[select statement]"
usage: SQLImporter
-b,--bufferSize <arg>      Transfer buffer size
-c,--jdbcConnect <arg>     JDBC Connection string
-d,--driver <arg>          JDBC driver class
-db,--database <arg>       OmniSci Database
-f,--fragmentSize <arg>    Table fragment size
-i <arg>                   Path to initialization file.
-p,--passwd <arg>          OmniSci Password
--port <arg>               OmniSci Port
-r <arg>                   Row Load Limit
-s,--server <arg>          OmniSci Server
-sp,--sourcePasswd <arg>   Source Password
-ss,--sqlStmt <arg>        SQL Select statement
-su,--sourceUser <arg>     Source User
-t,--targetTable <arg>     OmniSci Target Table
-tr,--truncate             Drop and recreate the table, if it exists
-u,--user <arg>            OmniSci User

SQLImporter executes a SELECT statement on another database via JDBC and loads the result set into OmniSci Core.

If the target table does not exist, SQLImporter creates it in OmniSci Core.

If the truncate flag is set, SQLImporter drops and recreates the table if it already exists.

If the table exists and truncate is not set, the data import fails if the table does not match the metadata of the SELECT statement.

OmniSci recommends that you use a service account with read-only permissions when accessing data from a remote database.

The -i argument provides a path to an initialization file. Each line of the file is sent as a SQL statement to the remote server from which the data is copied. This can be used to set additional custom parameters before the data is loaded.
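
For example, assuming a hypothetical initialization file named init.sql that sets session parameters on a remote MySQL server:

-- init.sql (hypothetical): each line is sent as a statement to the remote server
SET SESSION net_read_timeout = 600;
SET SESSION sql_mode = 'ANSI_QUOTES';

Pass it with -i /path/to/init.sql alongside the connection arguments shown in the examples below.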

MySQL Example:

java -cp mapd-1.0-SNAPSHOT-jar-with-dependencies.jar:
mysql/mysql-connector-java-5.1.38/mysql-connector-java-5.1.38-bin.jar
com.mapd.utility.SQLImporter -t test1 -sp mypassword -su myuser
-c jdbc:mysql://localhost -ss "select * from employees.employees"

SQL Server Example:

java -cp
/path/to/mapd/bin/mapd-1.0-SNAPSHOT-jar-with-dependencies.jar:/path/to/sqljdbc4.jar
com.mapd.utility.SQLImporter -d com.microsoft.sqlserver.jdbc.SQLServerDriver -t
mapd_target_table -su source_user -sp source_pwd -c
"jdbc:sqlserver://server:port;DatabaseName=some_database" -ss "select top 10 *
from dbo.some_table"

PostgreSQL Example:

java -cp
/p/to/mapd/bin/mapd-1.0-SNAPSHOT-jar-with-dependencies.jar:
/p/to/postgresql-9.4.1208.jre6.jar
com.mapd.utility.SQLImporter -t mapd_target_table -su source_user -sp
source_pwd -c "jdbc:postgresql://server/database" -ss "select * from some_table
where transaction_date > '2014-01-01'"