DDL - Tables
These functions are used to create and modify data tables in OmniSci.
- Nomenclature Constraints
- CREATE TABLE
- CREATE TABLE AS SELECT
- ALTER TABLE
- DROP TABLE
- TRUNCATE TABLE
- OPTIMIZE TABLE
- VALIDATE CLUSTER
Nomenclature Constraints
Table names must use the NAME format, described in regex notation as:
[A-Za-z_][A-Za-z0-9\$_]*
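For example (hypothetical names):
-- Valid: begins with a letter or underscore
CREATE TABLE trips_2017 (id INTEGER);
-- Invalid: begins with a digit
-- CREATE TABLE 2017_trips (id INTEGER);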
CREATE TABLE
CREATE TABLE [IF NOT EXISTS] <table> (<column> <type> [NOT NULL] [ENCODING <encodingSpec>], [SHARD KEY (<column>)], [SHARED DICTIONARY (<column>) REFERENCES <table>(<column>)], ...) [WITH (<property> = value, ...)];
Create a table named <table>, specifying <columns> and table properties. Table and column names cannot include quotes, spaces, or special characters.
Supported Datatypes
* In OmniSci release 4.4.0 and higher, you can use existing 8-byte DATE columns, but you can create only 4-byte DATE columns (default) and 2-byte DATE columns (see DATE ENCODING FIXED(16)).
For more information, see Datatypes and Fixed Encoding.
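For example, a minimal sketch using a hypothetical table with both the default 4-byte DATE and a 2-byte DATE:
CREATE TABLE date_demo (
  event_date DATE,                     -- 4-byte DATE (default)
  birth_date DATE ENCODING FIXED(16)); -- 2-byte DATE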
For geospatial datatypes, see Geospatial Primitives.
Example
Create a table named tweets and specify the columns, including type, in the table.
CREATE TABLE IF NOT EXISTS tweets (
  tweet_id BIGINT NOT NULL,
  tweet_time TIMESTAMP NOT NULL ENCODING FIXED(32),
  lat FLOAT,
  lon FLOAT,
  sender_id BIGINT NOT NULL,
  sender_name TEXT NOT NULL ENCODING DICT,
  location TEXT ENCODING DICT,
  source TEXT ENCODING DICT,
  reply_to_user_id BIGINT,
  reply_to_tweet_id BIGINT,
  lang TEXT ENCODING DICT,
  followers INT,
  followees INT,
  tweet_count INT,
  join_time TIMESTAMP ENCODING FIXED(32),
  tweet_text TEXT,
  state TEXT ENCODING DICT,
  county TEXT ENCODING DICT,
  place_name TEXT,
  state_abbr TEXT ENCODING DICT,
  county_state TEXT ENCODING DICT,
  origin TEXT ENCODING DICT,
  phone_numbers BIGINT);
Supported Encoding
Encoding | Description
---|---
DICT | Dictionary encoding on string columns (default for TEXT columns). Limit of 2 billion unique string values.
NONE | No encoding. Valid only on TEXT columns. No dictionary is created. Aggregate operations are not possible on this column type.
FIXED (bits) | Fixed-length encoding of integer or timestamp columns. See Datatypes and Fixed Encoding.
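As a minimal sketch on a hypothetical table, the three encodings look like this:
CREATE TABLE encoding_demo (
  category TEXT ENCODING DICT,              -- dictionary-encoded strings (default for TEXT)
  raw_notes TEXT ENCODING NONE,             -- unencoded text; no dictionary, no aggregates
  event_time TIMESTAMP ENCODING FIXED(32)); -- timestamp stored in 32 bits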
WITH Clause Properties
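As a minimal sketch of the WITH clause syntax, assuming the fragment_size and max_rows table properties:
CREATE TABLE logs (
  log_time TIMESTAMP ENCODING FIXED(32),
  msg TEXT ENCODING DICT)
WITH (fragment_size = 75000000,  -- rows per fragment (assumed property)
      max_rows = 1000000000);    -- cap on total rows (assumed property)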
Sharding
Sharding partitions a database table across multiple servers so each server has a part of the table with the same columns but with different rows. Partitioning is based on a sharding key defined when you create the table.
Without sharding, the dimension tables involved in a join are replicated and sent to each GPU, which is not feasible for dimension tables with many rows. Specifying a shard key makes it possible for the query to execute efficiently on large dimension tables.
Currently, specifying a shard key is useful only for joins:
- If two tables specify a shard key with the same type and the same number of shards, a join on that key sends only the matching part of the dimension table column data to each GPU (see the join query sketch after the examples below).
- For multi-node installs, the dimension table does not need to be replicated, and the join executes locally on each leaf.
Constraints
- A shard key must specify a single column to shard on. There is no support for sharding by a combination of keys.
- Only one shard key can be specified per table.
- Data are partitioned according to the shard key and the number of shards (shard_count).
- A value in the column specified as a shard key is always sent to the same partition.
- The number of shards should equal the number of GPUs in the cluster.
- Only integer and dictionary-encoded columns can be shard keys.
- Tables must share the dictionary for the column to be involved in sharded joins. If the dictionary is not specified as shared, the join does not take advantage of sharding. Dictionaries are reference-counted and dropped only when the last reference drops.
Recommendations
- Set shard_count to the number of GPUs you eventually want to distribute the data table across.
- Referenced tables must also be shard_count-aligned.
- Minimize sharding, because it can introduce load skew across resources compared to unsharded tables.
Examples
Basic sharding:
CREATE TABLE customers (
  accountId TEXT,
  name TEXT,
  SHARD KEY (accountId))
WITH (shard_count = 4);
Sharding with shared dictionary:
CREATE TABLE transactions (
  accountId TEXT,
  action TEXT,
  SHARD KEY (accountId),
  SHARED DICTIONARY (accountId) REFERENCES customers(accountId))
WITH (shard_count = 4);
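Because customers and transactions are sharded on accountId with the same shard_count and share the accountId dictionary, a join on that key can execute shard-locally. A hypothetical query sketch:
SELECT c.name, COUNT(*) AS n_transactions
FROM transactions t
JOIN customers c ON t.accountId = c.accountId
GROUP BY c.name;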
CREATE TABLE AS SELECT
CREATE TABLE <newTableName> AS (<SELECT statement>) [WITH (<property> = value, ...)];
Create a table with the specified columns, copying any data that meet SELECT statement criteria.
WITH Clause Properties
Examples
Create the table newTable. Populate the table with all information from the table oldTable, effectively creating a duplicate of the original table.
CREATE TABLE newTable AS (SELECT * FROM oldTable);
Create a table named trousers. Populate it with data from the columns name, waist, and inseam from the table wardrobe.
CREATE TABLE trousers AS (SELECT name, waist, inseam FROM wardrobe);
Create a table named cosmos. Populate it with data from the columns star and planet from the table universe where planet has the class M.
CREATE TABLE cosmos AS (SELECT star, planet FROM universe WHERE class='M');
ALTER TABLE
ALTER TABLE <table> RENAME TO <table>;
ALTER TABLE <table> RENAME COLUMN <column> TO <column>;
ALTER TABLE <table> ADD [COLUMN] <column> <type> [NOT NULL] [ENCODING <encodingSpec>];
ALTER TABLE <table> ADD (<column> <type> [NOT NULL] [ENCODING <encodingSpec>], ...);
Examples
Rename the table tweets to retweets.
ALTER TABLE tweets RENAME TO retweets;
Rename the column source to device in the table retweets.
ALTER TABLE retweets RENAME COLUMN source TO device;
Add the column lang to the table tweets as a dictionary-encoded TEXT column.
ALTER TABLE tweets ADD COLUMN lang TEXT ENCODING DICT;
Add the columns lang and encode to the table tweets, each as a dictionary-encoded TEXT column.
ALTER TABLE tweets ADD (lang TEXT ENCODING DICT, encode TEXT ENCODING DICT);
Note: Currently, OmniSci does not support adding a geo column type (POINT, LINESTRING, POLYGON, or MULTIPOLYGON) to a table.
DROP TABLE
DROP TABLE [IF EXISTS] <table>;
Deletes the table structure, all data from the table, and any dictionary content unless it is a shared dictionary. (See the note regarding disk space reclamation.)
Example
DROP TABLE IF EXISTS tweets;
TRUNCATE TABLE
TRUNCATE TABLE <table>;
Use the TRUNCATE TABLE statement to remove all rows from a table without deleting the table structure.
This releases the table's on-disk and in-memory storage and removes dictionary content unless it is a shared dictionary. (See the note regarding disk space reclamation.)
Truncating is more efficient than dropping and recreating the table: a drop-and-recreate invalidates dependent objects of the table, requiring you to regrant object privileges, whereas truncating has none of these effects.
Example
TRUNCATE TABLE tweets;
Note: When you DROP or TRUNCATE, the command returns almost immediately. The directories to be purged are marked with the suffix __DELETE_ME and are removed asynchronously. In practical terms, you do not see a reduction in disk usage until the automatic task runs, which might not start for up to five minutes. You can ignore directory names appended with __DELETE_ME; they are deleted automatically over time.
OPTIMIZE TABLE
OPTIMIZE TABLE [<table>] [WITH (VACUUM='true')];
Use this statement to remove rows from storage that have been marked as deleted via DELETE statements.
When run without the vacuum option, the column-level metadata is recomputed for each column in the specified table. OmniSci Core makes heavy use of metadata to optimize query plans, so optimizing table metadata can increase query performance after metadata widening operations such as updates or deletes. OmniSci Core does not narrow metadata during an update or delete — metadata is only widened to cover a new range.
When run with the vacuum option, it removes any rows marked "deleted" from the data stored on disk. Vacuum is a checkpointing operation: new copies of the affected data are written without the deleted records. OmniSci Core currently does not remove old (pre-checkpoint) fragments after vacuuming, so you might notice your data directory growing.
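For example, assuming the tweets table created earlier:
-- Recompute column-level metadata only:
OPTIMIZE TABLE tweets;
-- Also remove rows marked as deleted from disk:
OPTIMIZE TABLE tweets WITH (VACUUM='true');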
VALIDATE CLUSTER
VALIDATE CLUSTER [WITH (REPAIR_TYPE = ['NONE' | 'REMOVE'])];
Perform checks and report discovered issues on a running OmniSci cluster. Compare metadata between the aggregator and leaves to verify that the logical components between the processes are identical.
Examples
If VALIDATE CLUSTER detects issues, it returns a report similar to the following:
[mapd@thing3 ~]$ /mnt/gluster/dist_mapd/mapd-sw2/bin/mapdql -p HyperInteractive
User mapd connected to database mapd
omnisql> validate cluster;
Result
Node        Table Count
=========== ===========
Aggregator  1116
Leaf 0      1114
Leaf 1      1114
No matching table on Leaf 0 for Table cities_dtl_POINTS table id 56
No matching table on Leaf 1 for Table cities_dtl_POINTS table id 56
No matching table on Leaf 0 for Table cities_dtl table id 80
No matching table on Leaf 1 for Table cities_dtl table id 80
Table details don't match on Leaf 0 for Table view_geo table id 95
Table details don't match on Leaf 1 for Table view_geo table id 95
If no issues are detected, it will report as follows:
Cluster OK
You can include the WITH (REPAIR_TYPE = ...) argument. REPAIR_TYPE='NONE' is the same as running the command with no argument. REPAIR_TYPE='REMOVE' removes any leaf objects that have issues. For example:
VALIDATE CLUSTER WITH (REPAIR_TYPE = 'REMOVE');