Frequently Asked Questions
These questions are culled from our public forum and support team. If you have a question to contribute (or better still, a question and answer) please post it on the MapD Community Forum.
- Architecture
- Troubleshooting
- Why do I keep running out of memory for GPU queries?
- Why do I keep running out of memory for rendering?
- How can I confirm that MapD is actually running on GPUs?
- How do I compare the performance on GPUs vs. CPUs to demonstrate the performance gain of GPUs?
- Does MapD support a single server with different GPUs?
- What is watchdog, and when should it heel?
- How can I see how much (GPU) memory is being used?
- What data is brought into GPU RAM during query execution?
- Integrations and Connectors
- Licensing
- Tips and Tricks
Architecture
In Immerse, which charts are rendered server-side (using GPUs) versus in the browser (CPUs)?
When running in CPU mode (or when building from Open Source, which does not have the rendering option), the Scatter Plot and Geo Heatmap are not available. Other charts, including Pointmap, are rendered in the browser.
Is MapD backward compatible?
MapD is not backward compatible. With every release, new efficiencies are introduced that are not necessarily compatible with the previous version. As with any database, you should always back up your data before migrating to a later version, just in case you have to revert for any reason.
Troubleshooting
Why do I keep running out of memory for GPU queries?
This typically occurs when the system cannot keep the entire working set of columns in GPU memory.
MapD provides two options when your system does not have enough GPU memory available to meet the requirements for executing a query.
The first option is to turn off the watchdog (--enable_watch_dog=0). That allows the query to run in stages on the GPU. MapD orchestrates the transfer of data through layers of abstraction and onto the GPU for execution. See Advanced Configuration Flags for MapD Server.
The second option is to set --allow-cpu-retry. If a query does not fit in GPU memory, it falls back and executes on the CPU. See Configuration Flags for MapD Server.
MapD is an in-memory database. If your common use case exhausts the VRAM on the available GPUs, re-estimate the scale of the implementation required to meet your needs. MapD can scale across multiple GPUs in a single machine: up to 20 physical GPUs (the most MapD has found in one machine), and up to 64 using GPU virtualization tools such as Bitfusion Flex. MapD can also scale across multiple machines in a distributed model, allowing for many servers, each with many cards, so the operational data size limit is very flexible.
Why do I keep running out of memory for rendering?
This typically occurs when there is not enough OpenGL memory to render the query results.
Review your mapd_server.INFO log to see whether you are exceeding GPU memory. These cases appear as EVICTION messages.
You might need to increase the amount of buffer space for rendering using the --render-mem-bytes configuration flag. Try setting it to 1000000000. If that does not work, go to 2000000000.
How can I confirm that MapD is actually running on GPUs?
The easiest way to compare GPU and CPU performance is by using the mapdql command line client, which you can find at $MAPD_PATH/bin/mapdql.
To start the client, use the command bin/mapdql -p HyperInteractive, where HyperInteractive is the default password.
Once mapdql is running, use one of the following methods to see where your query is running:
- Prepend the EXPLAIN command to a SELECT statement to see a representation of the code that will run on the CPU or GPU. The first line is important; it shows either IR for the GPU or IR for the CPU. This is the most direct method.
- The server logs show a message at startup stating if MapD has fallen back to CPU mode. The logs are in your MAPD_DATA directory (default /var/lib/mapd/data), in a directory named mapd_log.
- After you perform some queries, the \memory_summary command shows how much memory is in use on the CPU and on each GPU. MapD manages memory itself, so you will see separate columns for in use (actual memory being used) and allocated (memory assigned to mapd_server, but not necessarily in use yet). Data is loaded lazily from disk, which means that you must first perform a query before the data is moved to CPU and GPU. Even then, MapD only moves the data and columns on which you are running your queries.
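The EXPLAIN check from the first bullet can be scripted. The following is a minimal sketch, not part of MapD: it assumes you have already captured the EXPLAIN output as text, and the function name execution_device is hypothetical. It relies only on the two marker strings quoted above.

```python
def execution_device(explain_output: str) -> str:
    """Classify captured EXPLAIN output as 'GPU', 'CPU', or 'unknown'.

    Only the first non-empty line is inspected, because that is where the
    "IR for the GPU" / "IR for the CPU" marker is reported.
    """
    lines = explain_output.strip().splitlines()
    first_line = lines[0] if lines else ""
    if "IR for the GPU" in first_line:
        return "GPU"
    if "IR for the CPU" in first_line:
        return "CPU"
    # Anything else (empty output, an error message) is left undecided.
    return "unknown"
```

If the function returns 'unknown', inspect the raw output by hand rather than assuming either device.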
How do I compare the performance on GPUs vs. CPUs to demonstrate the performance gain of GPUs?
Now, to see the performance advantage of running on GPU over CPU, manually switch where your queries are run:
- Enable timing reporting in mapdql using \timing.
- Ensure that you are in GPU mode (the default): \gpu.
- Run your queries a few times. Because data is lazily moved to the GPUs, the first time you query new data/columns takes a bit longer than subsequent times.
- Switch to CPU mode: \cpu. Again, run your queries a few times.
If you are using a data set that is sufficiently large, you should see a significant difference between the two. However, if the sample set is relatively small (for example, the sample 7-million flights dataset that comes preloaded in MapD) some of the fixed overhead of running on the GPUs causes those queries to appear to run slower than on the CPU.
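If you drive the comparison from a script instead of mapdql, it helps to report the first (cold) run separately from the later (warm) runs, since the first run includes the lazy data load. This is a minimal sketch under stated assumptions: run_query is a hypothetical zero-argument callable that executes your query through whatever client you use, and time_query is an illustrative name, not a MapD API.

```python
import statistics
import time
from typing import Callable, Dict, List


def time_query(run_query: Callable[[], object], repeats: int = 5) -> Dict[str, float]:
    """Run a query several times and separate the cold run from the warm runs."""
    if repeats < 2:
        raise ValueError("repeats must be at least 2 to have a warm run")
    timings: List[float] = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()  # hypothetical: executes the query and waits for the result
        timings.append(time.perf_counter() - start)
    return {
        "first_run_s": timings[0],  # includes moving data to CPU/GPU memory
        "warm_median_s": statistics.median(timings[1:]),
    }
```

Call it once with the server in GPU mode and once in CPU mode, then compare the warm_median_s values.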
Does MapD support a single server with different GPUs? For example, can I install MapD on one server with two NVIDIA GTX 760 GPUs and two NVIDIA GTX TITAN GPUs?
MapD does not support mixing different GPU models. Initially, you might not notice many issues with that configuration because the GPUs are the same generation. However, in this case you should consider removing the GTX 760 GPUs, or configure MapD to not use them.
To configure MapD to use specific GPUs:
- Run the nvidia-smi command to see the GPU IDs of the GTX 760s. Most likely, the GPUs are grouped together by type.
- Edit the mapd_server config file as follows:
  - If the GTX 760 GPUs are 0,1, configure mapd_server with the option start-gpu=2 to use the remaining two TITAN GPUs.
  - If the GTX 760s are 2,3, add the option num-gpus=2 to the config file.
The location of the config file depends on how you installed MapD.
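The choice between start-gpu and num-gpus can be worked out mechanically from the GPU order that nvidia-smi reports. The sketch below is illustrative only: gpu_options is a hypothetical helper, the GPU name strings are made up for the example, and emitting both options together for a block of GPUs in the middle of the list is an assumption that the source does not confirm.

```python
from typing import Dict, List


def gpu_options(gpu_names: List[str], keep_model: str) -> Dict[str, int]:
    """Suggest start-gpu / num-gpus values that select only one GPU model.

    gpu_names is the list of GPU names in ID order (index 0 is GPU 0).
    """
    keep = [i for i, name in enumerate(gpu_names) if keep_model in name]
    if not keep:
        raise ValueError(f"no GPU matches {keep_model!r}")
    first, last = keep[0], keep[-1]
    if keep != list(range(first, last + 1)):
        raise ValueError("GPUs to keep are not contiguous; these options cannot select them")
    options: Dict[str, int] = {}
    if first > 0:
        options["start-gpu"] = first  # skip the unwanted GPUs at the front
    if last < len(gpu_names) - 1:
        options["num-gpus"] = len(keep)  # stop before the unwanted GPUs at the end
    return options
```

For the two layouts described above, this yields start-gpu=2 when the GTX 760s are GPUs 0,1 and num-gpus=2 when they are GPUs 2,3.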
Integrations and Connectors
How do I connect to MapD using Tableau?
You can connect to MapD from Tableau using an ODBC connection. The ODBC connector is included with MapD Enterprise Edition. Contact support@mapd.com for assistance and specific instructions for how to connect to MapD from Tableau.
Licensing
What am I allowed to do using MapD Community Edition?
MapD Community Edition is free for non-commercial use.
What are the capabilities of MapD Open Source and Community Edition?
MapD Open Source and Community Edition support as many GPUs as will fit in a single node. They do not support High Availability, Distributed Systems, or server-based rendering.
Tips and Tricks
When should I use dictionary encoding?
The short answer is "whenever possible." MapD uses a Dictionary Coder to optimize storage and performance. When you import data with Immerse, MapD scans text-based fields for duplicate values. If 70% of the values appear more than once, the field is defined as TEXT ENCODING DICT. A dictionary table stores the common values, while the database stores references to the full values. This can result in huge savings in storage and processing time.
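To estimate ahead of an import whether a text column is likely to qualify, you can apply the same kind of check to a sample of its values. This is a rough sketch and not MapD's implementation: it assumes "70% of the values appear more than once" means that at least 70% of the rows hold a value that occurs in more than one row, and suggest_dict_encoding is a hypothetical name.

```python
from collections import Counter
from typing import Sequence


def suggest_dict_encoding(values: Sequence[str], threshold: float = 0.7) -> bool:
    """Return True if enough rows hold a repeated value to favor dictionary encoding."""
    if not values:
        return False
    counts = Counter(values)
    # Count the rows whose value occurs more than once in the sample.
    repeated_rows = sum(count for count in counts.values() if count > 1)
    return repeated_rows / len(values) >= threshold
```

A column of airport codes would typically pass this check; a column of unique identifiers would not.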
How can I optimize compression?
Use the mapdql \o command to output an optimized CREATE TABLE statement, based on the size of the actual data stored in your table. See mapdql.
How can I ensure that certain columns are ready at startup?
MapD uses a smart cache mechanism that caches parts of columns (chunks) into main memory or GPU memory.
You can use the command line argument db-query-list to provide a path to a file that contains SELECT queries you want performed at start-up.
Begin the file with the line USER [super-user-name] [database-name].
For example:
USER mapd mapd
Specify columns that you want to cache in a WHERE condition that does not have any other filters. For example, in an employee database a query to cache salary might be the following.
select count(*) from employee where salary > 1;
This query attempts to fit the entire salary column into the hottest parts of the cache. Keep the query as simple as possible so that the entire column (or columns) is cached.
Note: A query such as select * from employee; does not cache anything of interest, since there is nothing to "process."
See Preloading Data.
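If you maintain many warm-up queries, you can generate the file instead of editing it by hand. The sketch below follows only the layout described above (a USER line followed by SELECT queries); build_query_list is a hypothetical helper, and placing one query per line with a terminating semicolon is an assumption based on the example query.

```python
from typing import Iterable


def build_query_list(user: str, database: str, queries: Iterable[str]) -> str:
    """Build the text of a start-up query file: a USER line, then one SELECT per line."""
    lines = [f"USER {user} {database}"]
    for query in queries:
        text = query.strip()
        if not text.upper().startswith("SELECT"):
            raise ValueError(f"only SELECT queries are expected, got: {text!r}")
        # Terminate each query with a semicolon, as in the example above.
        lines.append(text if text.endswith(";") else text + ";")
    return "\n".join(lines) + "\n"
```

Write the returned text to a file and pass that file's path with db-query-list.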
When should I use shared dictionaries?
You can improve performance of string operations and optimize storage using shared dictionaries. You can share dictionaries within a table or between different tables in the same database.
When should I use max_rows?
The max_rows setting defines the maximum number of rows allowed in a table. When you reach the limit, the oldest fragment is removed. This can be helpful when executing operations that insert and retrieve records based on insertion order. The default value for max_rows is 2^62.
What are some best practices for inserting data?
When inserting data, it is best to load data in batches rather than loading one row at a time (as you might with a streaming data source). The overhead for loading data is comparatively high for each transaction, regardless of the number of rows you insert.
See Database Design.
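When rows arrive one at a time, as from a streaming source, a small buffering step on the client side lets you load them in batches. This is a generic sketch rather than a MapD API: batched is an illustrative helper, and the batch size and the call that actually loads each batch are left to your client library.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(rows: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Group an incoming stream of rows into lists of at most batch_size rows."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return  # stream exhausted
        yield batch  # load this batch in a single transaction
```

Each yielded list can then be loaded in one transaction, so the per-transaction overhead is paid once per batch instead of once per row.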