Optimizing Hive Queries

This section describes optimizations related to Hive queries.

Changing SerDe to DelimitedJSONSerDe for Results with Complex Datatypes

Qubole Hive converts all SELECT queries to INSERT-OVERWRITE-DIRECTORY (IOD) format to save results back to a Cloud location.

When writing data to a directory, Apache Hive uses LazySimpleSerde for serialization (writing) of results/data. But LazySimpleSerde does not honor a <key,value> structure and ignores keys. Honoring the keys plays an important role in displaying data-typed columns.

In Qubole Hive, setting hive.qubole.directory.serde=org.apache.hadoop.hive.serde2.DelimitedJSONSerDe changes the SerDe to DelimitedJSONSerDe, which honors more complex datatypes such as Maps and Arrays. This configuration when set, helps you to view and use query results with complex datatypes correctly.

Handling Direct Writes of INSERT OVERWRITE Query Results

For INSERT OVERWRITE queries, Qubole Hive allows you to directly write the results to Google Cloud Storage. Apache Hive normally writes data to a temporary location and then moves it to Google Cloud Storage. In Google Cloud Storage, moving data is expensive because it requires copy and delete operations. So, directly writing the INSERT OVERWRITE query results to Google Cloud Storage is an optimization that Qubole Hive offers.

However, there is an issue that you may face while writing INSERT OVERWRITE query results to Google Cloud Storage. While writing to a partitioned and bucketed table using the INSERT OVERWRITE command, there is a chance that multiple reducers will simultaneously write the result file to a specific table’s location in Google Cloud Storage. As it is an INSERT OVERWRITE command, the existing files written before the current job in the table’s location are deleted before the reducer tasks write new result files. In this situation, a scenario can occur where the delete request sent to Google Cloud Storage by one reduce task (say R1) gets throttled due to the Google Cloud Storage issue and by this time, another reduce task (say R2) deletes the old files and writes a new file. The delete request sent by reduce task R1 is now processed by Google Cloud Storage and it ends up deleting the file written by reduce task R2. To overcome this issue, Qubole provides an enhancement to avoid files from being deleted by reduce tasks. The enhancement is not enabled on the QDS account by default. To enable it on the account, create a ticket with Qubole Support.

When the enhancement is enabled, a prefix which is unique per job is added to the result files. This ensures that only old files which do not have the latest prefix are deleted. Thus it solves the data loss issue which can happen due to Google Cloud Storage throttling when multiple reducers try to simultaneously write query results.