Datatypes and Fixed Encoding
Each MapD datatype uses space in memory and on disk. The default datatype sizes are listed in the following table:
Datatype | Size (bytes) | Notes |
---|---|---|
TEXT ENCODING DICT |
4 | Max cardinality 1 billion distinct string values. |
TEXT ENCODING NONE |
Variable | Size of the string + 6 bytes. |
TIMESTAMP |
8 | Linux timestamp from 30610224000000 (1/1/1000 00:00:00.000 ) through 29379542399999 (12/31/2900 23:59:59.999 ). Can also be inserted and stored in human-readable format (YYYY-MM-DD HH:MM:SS ). |
TIME |
8 | Minimum value: 00:00:00 ; maximum value: 23:59:59 . |
DATE |
8 | Minimum value: 1000-01-01 ; maximum value: 2900-12-31 . |
FLOAT |
4 | Variable precision. Minimum value: -3.4 x e^38 ; maximum value: 3.4 x e^38 . |
DOUBLE |
8 | Variable precision. Minimum value: -1.79 x e^308 ; maximum value: 1.79 x e^308 |
INTEGER |
4 | Minimum value: -2,147,483,647 ; maximum value: 2,147,483,647 . |
TINYINT |
1 | Minimum value: -127 ; maximum value: 127 . |
SMALLINT |
2 | Minimum value: -32,767 ; maximum value: 32,767 . |
BIGINT |
8 | Minimum value: -9,223,372,036,854,775,807 ; maximum value: 9,223,372,036,854,775,807 . |
BOOLEAN |
1 | TRUE: 'true' ,
'1' , 't' . FALSE: 'false' , '0' , 'f' . Text values are not case-sensitive. |
DECIMAL/NUMERIC |
8 | Precise. Maximum precision: 19 . Scale must be less than precision. |
Variable Length Array | n/a | For example, myIntArray int[] . |
Fixed Length Array | n/a | For example, myFixedLengthBigIntArray bigint[3] . |
Note |
|
For certain datatypes, you can use a more compact representation of these values. The options for these datatypes are listed in the following table:
Encoding | Size (bytes) | Notes |
---|---|---|
TIMESTAMP ENCODING FIXED(32) |
4 | Range: 1901-12-13 20:45:53 - 2038-01-19 03:14:07 |
TIME ENCODING FIXED(32) |
4 | Range: 00:00:00 - 23:59:59 |
DATE ENCODING FIXED(32) |
4 | Range: 1901-12-13 - 2038-01-19 |
TEXT ENCODING DICT(16) |
2 | Max cardinality 64K distinct string values |
TEXT ENCODING DICT(8) |
1 | Max cardinality 255 distinct string values |
INTEGER ENCODING FIXED(16) |
2 | Same as SMALLINT |
INTEGER ENCODING FIXED(8) |
1 | Range: -127 – 127 |
SMALLINT ENCODING FIXED(8) |
1 | Range: -127 – 127 |
BIGINT ENCODING FIXED(32) |
4 | Same as INTEGER |
BIGINT ENCODING FIXED(16) |
2 | Same as SMALLINT |
BIGINT ENCODING FIXED(8) |
1 | Range: -127 – 127 |
To use these fixed length fields, the data range must fit into the constraints as described.
These encodings are most effective on low-cardinality TEXT
fields, where you can achieve large savings of storage space and improved processing speed, and on TIMESTAMP
fields where the timestamps range between 1901-12-13 20:45:53 and 2038-01-19 03:14:07.
All encoding options are shown. Some of the INTEGER options overlap. For example, INTEGER ENCODINGFIXED(8) and SMALLINT ENCODINGFIXED(8) are essentially the same.
If a TEXT ENCODING field does not match the defined cardinality, MapD substitutes a NULL
value and logs the change.
Understanding your schema and the scope of potential values in each field helps you achieve significant savings by carefully applying these fixed encoding types.
Geospatial Primitives
Type | Description | Example |
---|---|---|
POINT |
A point described by two coordinates. | POINT(0 0) |
LINESTRING |
A sequence of 2 or more points and the lines that connect them. | LINESTRING(0 0,1 1,1 2) |
POLYGON |
A set of one or more rings (closed line strings), with the first representing the shape (external ring) and the rest representing holes in that shape (internal rings). | POLYGON((0 0,4 0,4 4,0 4,0 0),(1 1, 2 1, 2 2, 1 2,1 1)) |
MULTIPOLYGON |
A set of one or more polygons. | MULTIPOLYGON(((0 0,4 0,4 4,0 4,0 0),(1 1,2 1,2 2,1 2,1 1)), ((-1 -1,-1 -2,-2 -2,-2 -1,-1 -1))) |
Shared Dictionaries
You can improve performance of string operations and optimize storage using shared dictionaries. You can share dictionaries within a table or between different tables in the same database. The table with which you want to share dictionaries must exist when you create the table that references the TEXT ENCODING DICT
field, and the column that you are referencing in that table must also exist. The following small DDL shows the basic structure:
CREATE TABLE text_shard (
i TEXT ENCODING DICT(32),
s TEXT ENCODING DICT(32),
SHARD KEY (i))
WITH (SHARD_COUNT = 2);
CREATE TABLE text_shard1 (
i TEXT,
s TEXT ENCODING DICT(32),
SHARD KEY (i),
SHARED DICTIONARY (i) REFERENCES text_shard(i))
WITH (SHARD_COUNT = 2);
In the table definition, make sure that referenced columns appear before the referencing columns.
For example, this DDL is a portion of the schema for the flights
database. Because airports are both origin and destination locations, it makes sense to reuse the same dictionaries for name
, city
, state
, and country
values.
create table flights (
*
*
*
dest_name TEXT ENCODING DICT,
dest_city TEXT ENCODING DICT,
dest_state TEXT ENCODING DICT,
dest_country TEXT ENCODING DICT,
*
*
*
origin_name TEXT,
origin_city TEXT,
origin_state TEXT,
origin_country TEXT,
*
*
*
SHARED DICTIONARY (origin_name) REFERENCES flights(dest_name),
SHARED DICTIONARY (origin_city) REFERENCES flights(dest_city),
SHARED DICTIONARY (origin_state) REFERENCES flights(dest_state),
SHARED DICTIONARY (origin_country) REFERENCES flights(dest_country),
*
*
*
)
WITH(
*
*
*
)
To share a dictionary in a different existing table, replace the table name in the REFERENCES
instruction. For example, if you have an existing table called us_geography
, you can share the dictionary by following the pattern in the DDL fragment below.
create table flights (
*
*
*
SHARED DICTIONARY (origin_city) REFERENCES us_geography(city),
SHARED DICTIONARY (origin_state) REFERENCES us_geography(state),
SHARED DICTIONARY (origin_country) REFERENCES us_geography(country),
SHARED DICTIONARY (dest_city) REFERENCES us_geography(city),
SHARED DICTIONARY (dest_state) REFERENCES us_geography(state),
SHARED DICTIONARY (dest_country) REFERENCES us_geography(country),
*
*
*
)
WITH(
*
*
*
);
NOTE: The referencing column cannot specify the encoding of the dictionary because it uses the encoding from the referenced column.