Running Hive Queries on Tez¶
To run Hive queries on Tez, you need to:
- Configure ApplicationMaster Memory
- Enable Offline Job History
- Start or Restart the Cluster
- Configure Tez as the Hive Execution Engine
- Configure YARN ATS Version with Tez
- Configure Custom Tez Shuffle Handler
Note
While running a Tez query on a JDBC table, you may get an exception that you can debug by using the workaround described in Handling Unsuccessful Tez Queries While Querying JDBC Tables.
Configure and Start a Hadoop (Hive) Cluster¶
A Hadoop (Hive) cluster is configured by default in QDS. The default cluster should work well for Hive queries on Tez, but if you modify it, make sure the instances you choose for the cluster nodes have plenty of local storage; disk space used for queries is freed up only when the Tez DAG is complete.
The ApplicationMaster also takes up more memory for multi-stage jobs than it needs for similar MapReduce jobs, because in the Tez case it must keep track of all the tasks in the DAG, whereas MapReduce processes one job at a time.
Configure ApplicationMaster Memory¶
To make sure that the ApplicationMaster has sufficient memory, set the following parameters for the cluster on which you are going to run Tez:
tez.am.resource.memory.mb=<Size in MB>;
tez.am.launch.cmd-opts=-Xmx<Size in MB>m;
To set these parameters in QDS, go to the Control Panel and choose the pencil item next to the cluster you are going to use; then on the Edit Cluster page enter the parameters into the Override Hadoop Configuration Variables field.
Do pre-production testing to determine the best values. Start with the value currently set for MapReduce, that is, the value of yarn.app.mapreduce.am.resource.mb (stored in the Hadoop file mapred-site.xml). You can see the current (default) value in the Recommended Configuration field on the Edit Cluster page. If out-of-memory (OOM) errors occur under a realistic workload with that setting, increase the value in increments of yarn.scheduler.minimum-allocation-mb, but do not exceed the value of yarn.scheduler.maximum-allocation-mb.
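For example, suppose the Recommended Configuration field shows the MapReduce value as 1536 MB and yarn.scheduler.minimum-allocation-mb is 1024 (both values are illustrative; check your own cluster's settings). You might start with:
tez.am.resource.memory.mb=1536;
tez.am.launch.cmd-opts=-Xmx1229m;
A common rule of thumb is to set the -Xmx heap to roughly 80% of the container size (here 1229 ≈ 0.8 × 1536), leaving headroom for non-heap memory. If OOM errors persist, the next values to try would be 2560, then 3584 (increments of 1024), up to the yarn.scheduler.maximum-allocation-mb limit.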
Enable Offline Job History¶
To enable offline job history, set the following parameter for the cluster on which you are going to run Tez:
yarn.ahs.leveldb.backup.enabled=true
To set this parameter in QDS, go to the Control Panel and choose the pencil item next to the cluster you are going to use; then on the Edit Cluster page enter the parameter into the Override Hadoop Configuration Variables field.
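All overrides for a cluster go into the same field, one per line. For example, a cluster configured for both of the preceding steps might use the following (memory values are illustrative):
tez.am.resource.memory.mb=1536;
tez.am.launch.cmd-opts=-Xmx1229m;
yarn.ahs.leveldb.backup.enabled=true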
Start or Restart the Cluster¶
To start the cluster, click on the arrow to the right of the cluster’s entry on the Clusters page in the QDS Control Panel.
Configure Tez as the Hive Execution Engine¶
You can configure Tez as the Hive execution engine either globally (for all queries) or for a single query at query time.
To use Tez as the execution engine for all queries, enter the following text into the bootstrap file:
set hive.execution.engine=tez;
To use Tez as the execution engine for a single Hive query, use the same command, but enter it before the query itself in the QDS UI.
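For example, a Hive command in the QDS UI might look like this (the table and column names are hypothetical):
set hive.execution.engine=tez;
SELECT dept, COUNT(*) FROM employees GROUP BY dept;
The set statement affects only the queries submitted with it; other commands continue to use the engine configured in the bootstrap.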
To use Tez globally across your QDS account, set it in the account-level Hive bootstrap file. For more information, see Managing Hive Bootstrap and the bootstrap API reference (set-view-bootstrap-api).
Configure YARN ATS Version with Tez¶
You may want to choose YARN ATS (Application Timeline Server) v1.5 instead of the default, ATS v1, because v1.5 provides better scalability and reliability. In particular, consider switching to v1.5 if you run many concurrent queries using Tez.
ATS v1.5 for Tez is supported only in Hive versions 2.1.1 and 2.3. To switch to ATS v1.5, create a ticket with Qubole Support.
Configure Custom Tez Shuffle Handler¶
Qubole supports a custom Tez shuffle handler in Hive 3.1.1 (beta), which can speed up downscaling of worker nodes in a Hadoop (Hive) cluster. It is not enabled by default; create a ticket with Qubole Support to enable it.
Qubole Hive uses the MapReduce shuffle handler by default. A running Tez application executes several DAGs sequentially before the application itself completes, and the shuffle data of a completed DAG is not cleared until the application terminates. This leftover shuffle data prevents the cluster from downscaling. So, to speed up downscaling and reduce the cost of running worker nodes, Qubole recommends switching to the custom Tez shuffle handler if you are using Hive 3.1.1 (beta) on a Hadoop (Hive) cluster.