Using the Presto Query Retrying Mechanism
Qubole has added a query retry mechanism to handle query failures where possible. It is useful when Qubole adds nodes to the cluster during autoscaling or after the loss of a preemptible instance (that is, when the cluster composition contains preemptible instances). The new query retry mechanism:
- Retries a query that failed with the LocalMemoryExceeded error when new nodes have been added to the cluster or are in the process of being added.
- Retries a query that failed with an error caused by worker node loss.
In the above two scenarios, there is a waiting time period to let new nodes join the cluster before Presto retries the failed query. To avoid an endless waiting time period, Qubole has added appropriate timeouts. Qubole has also ensured that any actions performed by the failed query’s partial execution are rolled back before retrying the failed query.
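The wait-with-timeout behavior described above can be pictured with a minimal Python sketch. This is not Qubole's implementation; the function name `wait_for_workers`, its parameters, and the polling approach are all assumptions made for illustration.

```python
import time
from typing import Callable


def wait_for_workers(
    current_workers: Callable[[], int],
    required: int,
    timeout_s: float,
    poll_s: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Hypothetical helper: poll until enough workers have joined or the timeout expires.

    Returns True if the cluster reached `required` workers in time,
    False if the wait timed out first.
    """
    deadline = clock() + timeout_s
    while True:
        # Check the cluster size first so an already-large cluster returns at once.
        if current_workers() >= required:
            return True
        # Stop waiting once the deadline has passed, so the wait is never endless.
        if clock() >= deadline:
            return False
        sleep(poll_s)
```

The caller decides what a timeout means. As described later in this document, the outcome differs by error type, so the sketch only reports whether the wait succeeded.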
Uses of the Query Retry Mechanism
The query retry mechanism is useful in these two cases:
- When a query triggers upscaling but fails with the LocalMemoryExceeded error because it ran on a smaller-sized cluster. The retry mechanism ensures that the failed query is automatically retried on the upscaled cluster.
- When a preemptible node is lost during query execution. The retry mechanism ensures that the failed query is automatically retried when new nodes join the cluster. When a preemptible node is lost, Qubole automatically adds new nodes to stabilize the cluster after it receives the preemptible termination notification, so a new node joins the cluster immediately after the loss.
Disabling the Query Retry Mechanism
You can disable this feature at the cluster and session levels by using the corresponding properties:
- At the cluster level: Override retry.autoRetry=false in the Presto cluster overrides. On the Presto Cluster UI, you can override a cluster property under Advanced Configuration > PRESTO SETTINGS > Override Presto Configuration. This property is enabled by default.
- At the session level: Set auto_retry=false in the specific query’s session. This property is enabled by default.

Note
The session property is more useful as an option to disable the retry feature at the query level when autoRetry is enabled at the cluster level.
Configuring the Query Retry Mechanism
You can configure these parameters:
- retrier.max-wait-time-local-memory-exceeded: The maximum time that Presto waits for new nodes to join the cluster when the query has failed with the LocalMemoryExceeded error. Its value is configured in seconds or minutes, for example 2s or 2m. Its default value is 5m. If a new node does not join the cluster within this time period, Qubole gives up on the retry and returns the original query failure response.
- retrier.max-wait-time-node-loss: The maximum time that Presto waits for new nodes to join the cluster when the query has failed due to preemptible node loss. Its value is configured in seconds or minutes, for example 2s or 2m. Its default value is 3m. If a new node does not join the cluster within this configured time period, the failed query is retried on the smaller-sized cluster.
- retry.nodeLostErrors: A comma-separated list of Presto errors (in string form) that signify node loss. The default value of this property is "REMOTE_HOST_GONE","TOO_MANY_REQUESTS_FAILED","PAGE_TRANSPORT_TIMEOUT".
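To make the value formats concrete, here is a small Python sketch that interprets the defaults listed above. It is only an illustration: `parse_duration` and `parse_error_list` are hypothetical helpers, not part of Presto or Qubole, and the duration parser assumes whole numbers with only the two units mentioned here (s and m).

```python
import re
from typing import Dict, List

# Default values as stated in this section.
DEFAULTS: Dict[str, str] = {
    "retrier.max-wait-time-local-memory-exceeded": "5m",
    "retrier.max-wait-time-node-loss": "3m",
    "retry.nodeLostErrors": (
        '"REMOTE_HOST_GONE","TOO_MANY_REQUESTS_FAILED","PAGE_TRANSPORT_TIMEOUT"'
    ),
}

# Only seconds and minutes are handled, matching the units described above.
_UNIT_SECONDS = {"s": 1, "m": 60}


def parse_duration(value: str) -> int:
    """Hypothetical helper: convert a value such as '2s' or '5m' to seconds."""
    match = re.fullmatch(r"\s*(\d+)\s*([sm])\s*", value)
    if match is None:
        raise ValueError(f"unsupported duration: {value!r}")
    return int(match.group(1)) * _UNIT_SECONDS[match.group(2)]


def parse_error_list(value: str) -> List[str]:
    """Hypothetical helper: split the quoted, comma-separated error list into names."""
    return [item.strip().strip('"') for item in value.split(",") if item.strip()]
```

For example, `parse_duration(DEFAULTS["retrier.max-wait-time-node-loss"])` gives the node-loss wait in seconds, and `parse_error_list(DEFAULTS["retry.nodeLostErrors"])` gives the three default error names as a list.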
Understanding the Query Retry Mechanism
Query retries can occur multiple times. By default, three retries can occur if all conditions are met. The conditions on which the retries happen are:
- The error is retryable. Currently, LocalMemoryExceeded and the node loss errors REMOTE_HOST_GONE, TOO_MANY_REQUESTS_FAILED, and PAGE_TRANSPORT_TIMEOUT are considered retryable. This list of node loss errors is configurable using the retry.nodeLostErrors property.
- INSERT OVERWRITE DIRECTORY, INSERT OVERWRITE TABLE, and CREATE TABLE AS SELECT (CTAS) queries are considered retryable. SELECT queries that do not return data before they fail are also retryable.
- The actions of a failed query are rolled back successfully. If the rollback fails, or if Qubole times out waiting for the rollback, the query is not retried.
- A failed query has a chance to succeed if retried:
  - For the LocalMemoryExceeded error: The query has a chance to succeed if the current number of workers is greater than the number of workers handling the Aggregation stage. If Qubole times out waiting to get to this state, it does not retry.
  - For the node loss errors: The query has a chance to succeed if the current number of workers is greater than or equal to the number of workers that the query ran on earlier. If Qubole times out waiting to get to this state, it goes ahead with the retry because the query may still pass on the smaller-sized cluster.
- For the