Optimizing Performance

Hardware

  • Even though MapD is an “in-memory” database, when the database first starts up, it reads data from disk. A large database can take a long time to read from a slow hard disk. Import and execution performance rely on disks with high performance characteristics to match the general nature of the database. As a starting point, MapD recommends fast SSD drives on a good hardware controller in RAID 10 configuration. If you use a virtual machine such as Amazon Web Services, MapD recommends you use Provisioned IOPS SSD disks in RAID configuration for storage.
  • Do not run unnecessary daemons. Ideally, only MapD services run on your MapD server.
  • For a production server, set the performance setting to performance instead of power saving. The performance setting is typically controlled by the system BIOS and prevents throttling back of the CPU. You also have to change the settings in the Linux power governor setup.
  • A large amount of swap activity on the machine probably indicates a memory shortage. Compare the amount of data the database is attempting to process in memory to the amount of memory available.
  • Because some work is always done on the CPUs, speed is important. MapD recommends you use systems that balance a high core count with high CPU speed.
  • Use the nvidia-smi -pm and nvidia-smi -ac commands to set the clock speeds of the GPUs to their maximum. On an NVIDIA Tesla K80, the commands look like this:
sudo nvidia-smi -pm 1
sudo nvidia-smi -ac 3004,875
sudo nvidia-smi --ecc-config=0

Database Design

Review a representative sample of the data from which your table is to be created. This helps you determine the datatypes best suited to your columns. Where possible, place data into columns with the smallest representation that can fit the cardinality involved.

Look for these areas of potential optimization:

  • Can you apply fixed encoding to TIMESTAMP fields?
  • Can you apply fixed sizes to FIXED ENCODING DICT fields?
  • What kind of INTEGER is appropriate for the values involved?
  • Is DOUBLE required, or is FLOAT enough to store expected values?
  • Is ENCODING NONE set for high-cardinality TEXT fields?
  • Can the data be converted from its current form to a more denormalized form?

Using the smallest possible encoding speeds up all aspects of MapD, from initial load to query execution.

Loading Data

  • Loading large flat files of 100M or more is the most efficient way to import data to MapD.
  • Consider increasing the block sizes of StreamInserter or SQLImporter to reduce the overhead of records loaded or streamed.
  • If you use a particular column on a regular basis to restrict the queries to a table, load the table sorted on the data in that column. For example, if most queries have a DATE dimension, then load data in date order for best performance.

Preloading Data

On startup, you can load data from a standard list of queries. You can customize the queries for each analyst and load into memory the data they commonly use first.

Creating the Query List

Create a file with a code block for each user. The keyword USER must be uppercase. Provide the user name and database name, followed by a series of SQL query statements enclosed in curly braces.

USER user_name db_name {
query_a1;
query_a2;
...
query_aN;
}

Add code blocks for all users who benefit from pre-loaded data.

...
USER user_name_2 db_name_2 {
query_b1;
query_b2;
...
query_bN;
}

Example

This code preloads queries to the flights database for the user Ziegfried in a database named mapd:

USER Ziegfried mapd {
select count (*) from flights_2008_10k;
select * from flights_2008_10k limit 10;
select flight_year, flight_month, flight_dayofmonth, flight_dayofweek, deptime, crsdeptime from flights_2008_10k limit 10;
select plane_model, plane_status, plane_aircraft_type, plane_engine_type, plane_year, origin_name, origin_city, origin_state, origin_country,
origin_lat, origin_lon, dest_name from flights_2008_10k limit 10;
}

Loading Data Using the Query List

These are the steps to load the default queries on startup:

  1. Create your query list file and store it on your MapD server.
  2. If necessary, stop your MapD server.
  3. Restart your server with the following option:

    start_mapd_server --db-query-list <path-to-db-query-file>

On startup, your queries are loaded automatically, speeding up query results you defined for your users.

Parallel GPUs

Parsing, optimization, and parts of rendering can overlap between queries, but most of the execution occurs single file. In general, you get the most throughput on the GPU by letting a query have all the resources. Contention is not a concern for buffer or cache memory. If queries are done very quickly, you get low latency, even with many simultaneous queries.

For simple queries on relatively small datasets, consider executing queries on subsets of GPUs. Different GPU groups can execute at the same time. This configuration benefits from parallelizing “fixed overheads” on each query between MapD servers on the same node.

You can implement this behavior by running multiple MapD servers on the same node and mapping each to different sets of GPUs with the --start-gpu and --num-gpus flags (see Configuration file).