Troubleshooting Errors and Exceptions in Hive Jobs

This topic provides information about the errors and exceptions that you might encounter when running Hive jobs or applications. You can resolve these errors and exceptions by following the respective workarounds.

Container memory requirement exceeds physical memory limits

Problem Description

A Hive job fails, and the error message below appears in the Qubole UI under the Logs tab of the Workbench page, or in the Mapper logs, Reducer logs, or ApplicationMaster logs:

Container [pid=18196,containerID=container_1526931816701_34273_02_000003] is running beyond physical memory limits.
Current usage: 2.2 GB of 2.2 GB physical memory used; 3.2 GB of 4.6 GB virtual memory used. Killing container.

Diagnosis

Three different kinds of failure can result in this error message:

  • Mapper failure
    • This error can occur when a Mapper requests more memory than is configured for it. The parameter mapreduce.map.memory.mb sets the Mapper memory.
  • Reducer failure
    • This error can occur when a Reducer requests more memory than is configured for it. The parameter mapreduce.reduce.memory.mb sets the Reducer memory.
  • ApplicationMaster failure
    • This error can occur when the container hosting the ApplicationMaster requests more memory than is assigned to it. The parameter yarn.app.mapreduce.am.resource.mb sets the memory allocated to the ApplicationMaster.

Solution

Mapper failure: Modify the two parameters below to increase the memory for Mapper tasks if a Mapper fails with the above error.

  • mapreduce.map.memory.mb: The upper memory limit that Hadoop allows to be allocated to a Mapper, in megabytes.
  • mapreduce.map.java.opts: Sets the heap size for a Mapper.

Reducer failure: Modify the two parameters below to increase the memory for Reducer tasks if a Reducer fails with the above error.

  • mapreduce.reduce.memory.mb: The upper memory limit that Hadoop allows to be allocated to a Reducer, in megabytes.
  • mapreduce.reduce.java.opts: Sets the heap size for a Reducer.

ApplicationMaster failure: Modify the two parameters below to increase the memory for the ApplicationMaster if the ApplicationMaster fails with the above error.

  • yarn.app.mapreduce.am.resource.mb: The amount of memory the ApplicationMaster needs, in megabytes.
  • yarn.app.mapreduce.am.command-opts: Sets the heap size for the ApplicationMaster.

Make sure that the heap size set through yarn.app.mapreduce.am.command-opts (the -Xmx value) is less than yarn.app.mapreduce.am.resource.mb. Qubole recommends setting the heap size to around 80% of yarn.app.mapreduce.am.resource.mb.

Example: Use the set command to update the configuration property at the query level.

set yarn.app.mapreduce.am.resource.mb=3500;

set yarn.app.mapreduce.am.command-opts=-Xmx2000m;
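
Similarly, if a Mapper or Reducer fails with this error, you can raise the corresponding limits at the query level. The values below are illustrative only; keep each -Xmx heap size below the matching *.memory.mb container limit (around 80% is a reasonable starting point):

set mapreduce.map.memory.mb=2500;

set mapreduce.map.java.opts=-Xmx2000m;

set mapreduce.reduce.memory.mb=3000;

set mapreduce.reduce.java.opts=-Xmx2400m;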

To update configs at the cluster level:

  • Add or update the parameters under Override Hadoop Configuration Variables in the Advanced Configuration tab in Cluster Settings and restart the cluster (see the example after this list).
  • See also: MapReduce Configuration in Hadoop 2
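
For example, assuming the override field accepts one key=value entry per line, the ApplicationMaster settings could be added as shown below (values are illustrative):

yarn.app.mapreduce.am.resource.mb=3500
yarn.app.mapreduce.am.command-opts=-Xmx2800m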

GC overhead limit exceeded, causing out of memory error

Problem Description

A Hive job fails with an out-of-memory error “GC overhead limit exceeded,” as shown below.

java.io.IOException: org.apache.hadoop.ipc.RemoteException(java.lang.OutOfMemoryError): GC overhead limit exceeded
at org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:337)
at org.apache.hadoop.mapred.ClientServiceDelegate.getJobStatus(ClientServiceDelegate.java:422)
at org.apache.hadoop.mapred.YARNRunner.getJobStatus(YARNRunner.java:579)
at org.apache.hadoop.mapreduce.Job$1.run(Job.java:348)
at org.apache.hadoop.mapreduce.Job$1.run(Job.java:345)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)

Diagnosis

This out-of-memory error comes from the getJobStatus method call and is most likely caused by the JobHistory server running out of memory. You can confirm this by checking the JobHistory server log on the coordinator node in /media/ephemeral0/logs/mapred; the log will contain an out-of-memory exception stack trace like the one above.

The out of memory error for the JobHistory server usually happens in the following cases:

  1. The cluster coordinator node is too small and the JobHistory server heap size is set to a low value, for example 1 GB.
  2. The jobs are very large, with thousands of mapper tasks running.

Solution

  • Qubole recommends that you use a larger cluster coordinator node, with at least 60 GB RAM and a heap size of 4 GB for the JobHistory server process.
  • Depending on the nature of the job, even 4 GB for the JobHistory server heap size might not be sufficient. In this case, set the JobHistory server memory to a higher value, such as 8 GB, using the following bootstrap commands:
# Append the JobHistory server heap size (in MB) to mapred-env.sh, then restart the service so the change takes effect.
echo 'export HADOOP_JOB_HISTORYSERVER_HEAPSIZE="8192"' | sudo tee -a /etc/hadoop/mapred-env.sh
sudo -u mapred /usr/lib/hadoop2/sbin/mr-jobhistory-daemon.sh stop historyserver
sudo -u mapred /usr/lib/hadoop2/sbin/mr-jobhistory-daemon.sh start historyserver

Mapper or reducer job fails because no valid local directory is found

Problem Description

A Mapper or Reducer job fails with the following error:

Caused by: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid directory for <file_path>
at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext$DirSelector.getPathForWrite(LocalDirAllocator.java:541)
at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:627)
at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:173)
at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:154)
at org.apache.tez.runtime.library.common.task.local.output.TezTaskOutputFiles.getInputFileForWrite(TezTaskOutputFiles.java:250)
at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.MapOutput.createDiskMapOutput(MapOutput.java:100)
at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.MergeManager.reserve(MergeManager.java:404)
at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.copyMapOutput(FetcherOrderedGrouped.java:476)
at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.copyFromHost(FetcherOrderedGrouped.java:278)
at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.fetchNext(FetcherOrderedGrouped.java:178)
at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.callInternal(FetcherOrderedGrouped.java:191)
at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.callInternal(FetcherOrderedGrouped.java:54)

Diagnosis

This error can appear on the Workbench page of the QDS UI or in the Hadoop Mapper or Reducer logs.

MapReduce stores intermediate data in local directories specified by the parameter mapreduce.cluster.local.dir in the mapred-site.xml file. During job processing, MapReduce checks these directories to see if there is enough space to create the intermediate files. If there is no directory that has enough space, the MapReduce job will fail with the error shown above.
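
To check which local directories are configured before inspecting their free space (for example with df), you can print the property at the query level. This is only a quick check; the authoritative values are defined in mapred-site.xml on the cluster nodes:

set mapreduce.cluster.local.dir;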

Solution

  1. Make sure that there is enough space in the local directories, based on the requirements of the data to be processed.
  2. You can compress the intermediate output files to minimize space consumption.

Parameters to be set for compression:

set mapreduce.map.output.compress = true;
set mapreduce.map.output.compress.codec = org.apache.hadoop.io.compress.SnappyCodec; -- Snappy will be used for compression

Out of Memory error when using ORC file format

Problem Description

An Out of Memory error occurs while generating split information when the ORC file format is used.

Diagnosis

The following logs appear on the Workbench page of the QDS UI under the Logs tab:

Caused by: java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.util.concurrent.FutureTask.report(FutureTask.java:122)
at java.util.concurrent.FutureTask.get(FutureTask.java:192)
at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1098)
... 15 more
Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
at com.google.protobuf.ByteString.copyFrom(ByteString.java:192)
at com.google.protobuf.CodedInputStream.readBytes(CodedInputStream.java:324)
at org.apache.hadoop.hive.ql.io.orc.OrcProto$StringStatistics.<init>(OrcProto.java:1331)
at org.apache.hadoop.hive.ql.io.orc.OrcProto$StringStatistics.<init>(OrcProto.java:1281)
at org.apache.hadoop.hive.ql.io.orc.OrcProto$StringStatistics$1.parsePartialFrom(OrcProto.java:1374)

Solution

The Out of Memory error can occur because the default split strategy (HYBRID) may read ORC file footers while generating splits, which requires more memory. Qubole recommends setting the ORC split strategy to BI, which generates splits without reading the file footers, by setting the parameter below:

hive.exec.orc.split.strategy=BI
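
For example, at the query level (the same key=value pair can also be added to the cluster's Hadoop overrides):

set hive.exec.orc.split.strategy=BI;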

Hive job fails when “lock wait timeout” is exceeded

Problem Description

A Hive job fails with the following error message:

Lock wait timeout exceeded; try restarting transaction.

The timeout occurs during INSERT operations on partitioned tables.

Diagnosis

The following content appears in the hive.log file:

ERROR metastore.RetryingHMSHandler (RetryingHMSHandler.java:invoke(173)) - Retrying HMSHandler after 2000 ms (attempt 9 of 10) with error:
javax.jdo.JDODataStoreException: Insert of object "org.apache.hadoop.hive.metastore.model.MPartition@74adce4e" using statement
"INSERT INTO `PARTITIONS` (`PART_ID`,`TBL_ID`,`LAST_ACCESS_TIME`,`CREATE_TIME`,`PART_NAME`,`SD_ID`) VALUES (?,?,?,?,?,?)" failed :
Lock wait timeout exceeded; try restarting transaction

This MySQL transaction timeout can happen during heavy traffic on the Hive Metastore when the RDS server is too busy.

Solution

Try setting a higher value for innodb_lock_wait_timeout on the MySQL side. innodb_lock_wait_timeout defines the length of time in seconds an InnoDB transaction waits for a row lock before giving up. The default value is 50 seconds.
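
For example, a minimal sketch on a self-managed MySQL instance hosting the Hive Metastore (the 120-second value is illustrative; on Amazon RDS, change the variable through the instance's DB parameter group instead):

-- Raise the InnoDB row-lock wait timeout; requires administrative privileges.
SET GLOBAL innodb_lock_wait_timeout = 120;
-- Confirm the new global value; existing sessions keep their previous session value.
SHOW GLOBAL VARIABLES LIKE 'innodb_lock_wait_timeout';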