Spark

New Features

  • SPAR-3979 and SPAR-2953: QDS implements the following improvements to Dynamic Filtering in Spark 2.4.3 and later versions:

    • Partitions are pruned at the scan level to prevent the overhead of scanning redundant partitions.
    • Filter values generated by dynamic filtering are now pushed down to ORC (Optimized Row Columnar) data sources, in addition to Parquet data sources.

    Gradual Rollout.

  • SPAR-3713: For JOIN operations, Spark now automatically detects data skew and applies skew join optimization to handle it. Via Support.
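
As a rough illustration of the scan-level partition pruning described under SPAR-3979 and SPAR-2953, the sketch below models the idea in plain Python rather than Spark. It assumes a fact table partitioned by a join key and a small dimension table whose filter produces the set of key values worth scanning. The table names, the data, and the helper functions `dynamic_filter_values` and `prune_partitions` are hypothetical and are not part of QDS or Spark.

```python
from typing import Callable, Dict, List, Set

def dynamic_filter_values(dim_rows: List[dict],
                          predicate: Callable[[dict], bool],
                          join_key: str) -> Set[str]:
    """Collect the join-key values that survive the filter on the small side."""
    return {row[join_key] for row in dim_rows if predicate(row)}

def prune_partitions(partitions: Dict[str, List[dict]],
                     allowed: Set[str]) -> Dict[str, List[dict]]:
    """Keep only partitions whose partition value appears in the filter set."""
    return {value: rows for value, rows in partitions.items() if value in allowed}

# Hypothetical fact table, partitioned by `region`.
sales_by_region = {
    "us": [{"region": "us", "amount": 10}, {"region": "us", "amount": 5}],
    "eu": [{"region": "eu", "amount": 7}],
    "apac": [{"region": "apac", "amount": 3}],
}

# Hypothetical dimension table; the query filters on `active`.
regions = [
    {"region": "us", "active": True},
    {"region": "eu", "active": False},
    {"region": "apac", "active": True},
]

# Filter values are derived from the dimension side at run time ...
allowed = dynamic_filter_values(regions, lambda r: r["active"], "region")
# ... and applied before the scan, so the "eu" partition is never read.
scanned = prune_partitions(sales_by_region, allowed)
total = sum(row["amount"] for rows in scanned.values() for row in rows)
```

In this toy model only two of the three partitions are read. The actual feature makes the equivalent decision inside the Spark scan operator.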

Enhancements

  • SPAR-3616: Out-of-memory errors that could occur when memory was tuned appropriately are now handled so that Spark applications run reliably. Supported in Spark 2.4.3 and later versions. Via Support.
  • SPAR-3071: QDS now allocates driver memory for Spark commands based on the instance type of the cluster worker nodes, to optimize memory usage. Supported on homogeneous clusters running Spark 2.3.2 and later versions. (A homogeneous cluster has worker nodes of only one instance type.)
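
To make the homogeneous-cluster condition in SPAR-3071 concrete, here is a minimal Python sketch. It is an assumption-laden illustration, not the QDS implementation: the instance-type names, the memory table `INSTANCE_MEMORY_GIB`, and the `DRIVER_MEMORY_FRACTION` value are all invented for the example, and the release notes do not state the sizing rule QDS actually uses.

```python
from typing import Dict, List, Optional

# Hypothetical instance-type memory table (GiB); real values depend on the provider.
INSTANCE_MEMORY_GIB: Dict[str, int] = {"small": 8, "medium": 16, "large": 32}

# Hypothetical share of a worker's memory given to the driver; not a QDS setting.
DRIVER_MEMORY_FRACTION = 0.5

def is_homogeneous(worker_types: List[str]) -> bool:
    """A homogeneous cluster has worker nodes of exactly one instance type."""
    return len(set(worker_types)) == 1

def driver_memory_gib(worker_types: List[str]) -> Optional[float]:
    """Size driver memory from the worker instance type on homogeneous clusters.

    Returns None for heterogeneous (or empty) clusters, where this sizing
    does not apply.
    """
    if not is_homogeneous(worker_types):
        return None
    return INSTANCE_MEMORY_GIB[worker_types[0]] * DRIVER_MEMORY_FRACTION
```

For example, `driver_memory_gib(["large", "large"])` yields a value derived from the single worker type, while `driver_memory_gib(["large", "small"])` returns `None` because the cluster mixes instance types.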

Spark 2.1.0 and 2.0.2 Deprecated

Spark 2.1.0 and 2.0.2 are deprecated for releases after R57.

Bug Fixes

  • SPAR-3862: Driver logs were not displayed when Spark applications were deployed on a cluster. This issue is now fixed.
  • SPAR-3714: Queries with a large number of nested sub-queries ran slowly when Hive Authorization was enabled. This fix improves the performance of such queries. Fixed in Spark 2.4.3.