Cluster and Node Termination Causes¶
Clusters terminate for various reasons and not necessarily by manual or a preset automatic termination. Cluster nodes get terminated during autoscaling or health check. The following lists different causes for the clusters and nodes’ termination:
Cluster Termination Causes¶
These are some of the reasons clusters terminate:
INACTIVITY: This is the reason displayed when a cluster gets terminated due to inactivity (self-explanatory). It is governed by Idle Cluster Timeout, which is configurable in hours and/or minutes. If no jobs are running on a cluster or if a cluster node reaches its timely boundary, then Qubole identifies such cluster as inactive and it terminates that cluster. For more information, see Shutting Down an Idle Cluster, aggressive-downscaling, and aggressive-downscaling-azure.
HEALTH_CHECK_FAILED: This is the reason generally displayed when QDS discovers an unhealthy cluster. For example, if there are no nodes in a running cluster, or the ResourceManager does not exist, QDS identifies the cluster as unhealthy and terminates it.
Node Termination Causes¶
These are some of the reasons due to which cluster nodes terminate:
User Initiated: This is the reason displayed when Qubole terminates the cluster node as part of autoscaling or automated cluster lifecycle management. It is also the reason displayed when a user manually terminates the cluster node.
Cluster Instance Terminated: This is the reason displayed when nodes are terminated during the cluster termination.
Cloud Provider Initiated: This is the reason displayed when nodes are terminated due to the Spot (AWS, Azure) or Preemptible VM (GCP) node interruption.
NA: This is the reason displayed when Qubole could not capture the cause for the cluster node termination.
Service Initiated: This is the reason displayed when there is a Spot node loss (initiated by the Cloud Provider).
Server.InternalError: This is the reason displayed when there is an internal server error due to which the cluster gets terminated. An error at the Cloud Provider’s end causes this error.
HEALTH_CHECK_FAILED: This is the reason displayed generally when there are unhealthy cluster nodes.