Configuring Clusters

Every new account is configured with some clusters by default; these are sufficient to run small test workloads. This section and Managing Clusters explain how to modify these default clusters and how to add and modify new ones.

Cluster Settings Page

To use the QDS UI to add or modify a cluster, choose Clusters from the drop-down list on the QDS main menu. To add a cluster, choose New, select the cluster type, and click Create; to change the configuration of an existing cluster, choose Edit.

See Managing Clusters for more information.

The following sections explain the different options available on the Cluster Settings page.

General Cluster Configuration

Many of the cluster configuration options are common across different types of clusters. The most important categories are covered first.

Cluster Labels

As explained in Cluster Labels, each cluster has one or more labels that are used to route Qubole commands. In the first form entry, you can assign one or more comma-separated labels to a cluster.

Cluster Type

QDS supports the following cluster types:

  • Airflow (not configured by default).

    Airflow is a platform for programmatically authoring, scheduling, and monitoring workflows. It supports integration with third-party platforms. You can author complex directed acyclic graphs (DAGs) of tasks inside Airflow. It comes packaged with a rich feature set suited to ETL work. The user interface and command-line utilities make it easy to visualize pipelines running in production, monitor progress, and troubleshoot issues as required. For more information about Qubole Airflow, see Airflow.

  • Hadoop 2 (one Hadoop 2 cluster is configured by default in all cases).

    Hadoop 2 clusters run a version of Hadoop that is API-compatible with Apache Hadoop 2.6. Hadoop 2 clusters use the Apache YARN cluster manager and are tuned for running MapReduce and other applications.

  • Spark (one Spark cluster is configured by default in all cases).

    Spark clusters allow you to run applications based on supported Apache Spark versions. Spark has a fast in-memory processing engine that is ideally suited for iterative applications like machine learning. Qubole’s offering integrates Spark with the YARN cluster manager.

Cluster Size and Instance Types

From a performance standpoint, this is one of the most critical sets of parameters:

  • Set the Minimum and Maximum Worker Nodes for a cluster (in addition to one fixed Coordinator node).

Note

All Qubole clusters autoscale up and down within the minimum and maximum range set in this section.

  • Coordinator and Worker Node Type

    Select the Worker Node Type according to the characteristics of the application. A memory-intensive application would benefit from memory-rich nodes (such as r3 node types in AWS, or E2-64 V3 in Azure), while a CPU-intensive application would benefit from instances with higher compute power (such as the c3 types in AWS or E2-64 V3 in Azure).

The Coordinator Node Type is usually determined by the size of the cluster. For smaller clusters and workloads, small instances suffice. But for extremely large clusters (or for running a large number of concurrent applications), Qubole recommends large-memory machines.

QDS uses Linux instances as cluster nodes.

Node Bootstrap File

This field provides the location of a Bash script used for installing custom software packages on cluster nodes.

Advanced applications often require custom software to be installed as a prerequisite. A Hadoop Mapper Python script may require access to SciPy/NumPy, for example, and this is often best arranged by simply installing these packages (using yum for example) by means of the node bootstrap script. See Understanding a Node Bootstrap Script for more information.

The account’s storage credentials are used to read the script, which runs with root privileges on both Coordinator and worker nodes; on worker nodes, it runs before any task is launched on behalf of the application.

Note

QDS does not check the exit status of the script. If software installation fails and it is unsafe to run user applications in this case, you should shut the machine down from the bootstrap script.

Qubole recommends activating Qubole’s Virtual Environment before installing or updating custom Python libraries, so that the libraries are installed in that environment.

See Running Node Bootstrap and Ad hoc Scripts on a Cluster for more information on running node bootstrap scripts.
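The failure handling described in the note above can be sketched as follows. This is a minimal illustration, not Qubole's reference script: the function names (install_packages, halt_node, install_or_halt) are hypothetical, and it assumes a yum-based Linux image on which the script runs as root and the shutdown command is available.

```shell
#!/bin/bash
# Minimal node bootstrap sketch. Function names are illustrative and are
# not part of QDS.

# Install packages with yum (assumes a yum-based image; runs as root).
install_packages() {
    yum install -y "$@"
}

# Take the node out of service (assumes `shutdown` is available).
halt_node() {
    shutdown -h now
}

# Install the given packages and halt the node if installation fails.
# QDS does not check the script's exit status, so the script itself must
# stop the node when it is unsafe to run applications without the packages.
install_or_halt() {
    if install_packages "$@"; then
        echo "bootstrap: installed $*"
    else
        echo "bootstrap: failed to install $*; halting node" >&2
        halt_node
        return 1
    fi
}
```

A real bootstrap script would end with a call such as install_or_halt followed by the package names the application needs. Keeping the install and halt steps in separate functions makes it possible to replace them with harmless stand-ins when trying the script outside a cluster.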

Other Settings

See Managing Clusters.

GCP Settings

To use the QDS UI to add or modify a GCP cluster, choose Clusters from the drop-down list on the QDS main menu. To add a cluster, choose New, select the cluster type, and click Create; to change the configuration of an existing cluster, choose Edit. Configure the cluster as described under Modifying Cluster Settings for GCP.