Understanding a Node Bootstrap Script

Bootstrap scripts allow installation, management, and configuration of tools useful for cluster monitoring and data loading. A node bootstrap script runs on all cluster nodes, including autoscaling nodes, when they come up.

A node bootstrap script must be placed in the default location, which looks similar to the following:

gs://test-vs/scripts/hadoop/node_bootstrap.sh

The logs written by the node bootstrap script are saved in node_bootstrap.log in /media/ephemeral0/logs/others.
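When a bootstrap step misbehaves, the quickest check is usually to read the end of this log on the affected node. The following is a minimal sketch of that check. The helper name `bootstrap_log_tail` and the `BOOTSTRAP_LOG` variable are hypothetical and are not part of Qubole's library; only the log path comes from the text above.

```shell
#!/bin/bash
# Log location as given in the documentation above. The variable name is an
# assumption made for this sketch; override it to read a different file.
BOOTSTRAP_LOG="${BOOTSTRAP_LOG:-/media/ephemeral0/logs/others/node_bootstrap.log}"

# bootstrap_log_tail (hypothetical helper): print the last N lines of a log.
# Usage: bootstrap_log_tail [log_file] [line_count]
bootstrap_log_tail() {
    local log_file="${1:-$BOOTSTRAP_LOG}"
    local lines="${2:-20}"

    # Fail with a clear message if the log is not there yet or not readable.
    if [ ! -r "$log_file" ]; then
        echo "log not found or unreadable: $log_file" >&2
        return 1
    fi

    tail -n "$lines" "$log_file"
}
```

Calling `bootstrap_log_tail` with no arguments on a cluster node would print the last 20 lines of the default log. A non-zero return value means the file could not be read, for example because the script has not produced a log yet.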

The node bootstrap logs are also available in the cluster UI, in the Nodes table of a running cluster. Below a running cluster, the UI displays the number of nodes next to Nodes. Click the number to open the Nodes table. For more information on Resources, see Using the Cluster User Interface.

Note

Qubole recommends that you activate Qubole’s virtual environment before you install or update custom Python libraries, so that the libraries are installed in that environment. The virtual environment contains many popular Python libraries and has the advantages described in Using Pre-installed Python Libraries from the Qubole VirtualEnv.

You can install or update Python libraries in Qubole’s virtual environment by adding a script to the node bootstrap file as in the following example:

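# Load Qubole's bash helper library.
source /usr/lib/hustler/bin/qubole-bash-lib.sh
# Switch to Qubole's Python 2.7 virtual environment before installing.
qubole-use-python2.7
pip install <library name>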

The node bootstrap script is invoked as the root user. It does not have a terminal (TTY or text-only console), and many programs do not run without a TTY. In Hadoop clusters, on worker nodes the script is invoked after the HDFS daemons have been brought up but before the MapReduce and YARN daemons have been initialized. On the coordinator node, however, the script is invoked after the ResourceManager is started. This means that Hadoop applications run only after the node bootstrap completes.
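Because the script has no terminal, a command that prompts for input can stall or fail in ways that are hard to diagnose. One way to surface this early is to run such commands with standard input attached to /dev/null. The sketch below illustrates the idea. Both `run_noninteractive` and `has_tty` are hypothetical helper names introduced for this example and are not part of Qubole's library.

```shell
#!/bin/bash
# has_tty (hypothetical helper): succeed only if standard input is a terminal.
has_tty() {
    [ -t 0 ]
}

# run_noninteractive (hypothetical helper): run a command with stdin attached
# to /dev/null, so a program that tries to prompt reads end-of-file at once
# instead of waiting for input that can never arrive.
# Usage: run_noninteractive <command> [args...]
run_noninteractive() {
    "$@" < /dev/null
}
```

Testing a bootstrap step locally as `run_noninteractive <command>` approximates the no-TTY condition for standard input. It does not reproduce every aspect of the bootstrap environment, so treat it as a first check rather than a guarantee.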

The node bootstrap process is executed via code resident on the node. This code is executed only on the first boot cycle, not on reboot.

The cluster launch process waits without limit for the node bootstrap script to complete. Specifically, worker daemons and task execution daemons, for example NodeManager (Hadoop2), wait for the script to finish executing.
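Since nothing bounds the wait, a single hung command in the script can keep a node from ever starting its daemons. A defensive pattern is to give each long-running step its own time limit. The sketch below uses the GNU coreutils `timeout` command, assuming it is installed on the node. The helper name `run_step` is hypothetical and is not part of Qubole's library.

```shell
#!/bin/bash
# run_step (hypothetical helper): run one bootstrap step under a time limit.
# Assumes GNU coreutils `timeout` is available on the node.
# Usage: run_step <seconds> <command> [args...]
run_step() {
    local limit="$1"
    shift

    timeout "$limit" "$@"
    local rc=$?

    # GNU timeout exits with status 124 when it had to stop the command.
    if [ "$rc" -eq 124 ]; then
        echo "step timed out after ${limit}s: $*" >&2
    elif [ "$rc" -ne 0 ]; then
        echo "step failed with status ${rc}: $*" >&2
    fi
    return "$rc"
}
```

For example, `run_step 300 pip install <library name>` would give the install five minutes before giving up. Note that `timeout` runs external programs only, so shell functions cannot be passed to `run_step` directly. Whether a timed-out step should abort the whole script or merely be reported is a design choice left to the script author.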

Qubole provides a library of certified bootstrap functions for use in node bootstraps. Qubole recommends using these certified functions to avoid compatibility issues with future versions of Qubole software.

Running Node Bootstrap Scripts on a Cluster describes how to run node bootstraps on a cluster. Run Utility Commands in a Cluster describes how to run utility commands that return node-related information, such as whether a node is a worker or the coordinator, or the coordinator node’s IP address. See also How do I check if a node is a coordinator node or a worker node?.