Running Node Bootstrap and Ad hoc Scripts on a Cluster

Qubole allows you to run node bootstrap scripts, and other scripts ad hoc as needed, on cluster nodes. The following topics describe running node bootstrap and ad hoc scripts:

Running Node Bootstrap Scripts on a Cluster

You can edit the default node bootstrap script from the cluster settings page: in the QDS UI, navigate to Clusters and click Edit for the cluster you want to modify. Managing Clusters provides more information.

Note

Qubole recommends installing or updating custom Python libraries in Qubole's virtual environment, after activating it. The virtual environment already contains many popular Python libraries and has other advantages, as described in Using Pre-installed Python Libraries from the Qubole VirtualEnv.

You can install and update Python libraries in Qubole’s virtual environment by adding code to the node bootstrap script, as follows:

# Activate Qubole's virtual environment.
source /usr/lib/hustler/bin/qubole-bash-lib.sh
qubole-use-python2.7
# Install (or upgrade) the library in the virtual environment.
pip install <library name>
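
For example, a bootstrap fragment that installs a pinned version of one library and upgrades another might look like the following. The package names and versions here are illustrative only; substitute the libraries your jobs actually need.

# Activate Qubole's virtual environment before installing anything.
source /usr/lib/hustler/bin/qubole-bash-lib.sh
qubole-use-python2.7

# Illustrative packages and versions only; replace with your own.
pip install requests==2.20.0
pip install --upgrade simplejson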

The node bootstrap logs are written to node_bootstrap.log under /media/ephemeral0/logs/others. You can also view them in the QDS UI in the Nodes table for a running cluster: in the Clusters section of the UI, the number of nodes is displayed against Nodes below each active/running cluster; click that number to see the Nodes table. For more information on Resources, see Using the Cluster User Interface.
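
To follow the log directly, you can SSH to a cluster node and tail the file at the path given above (the grep pattern is just an example):

# Follow the node bootstrap log while the script runs.
tail -f /media/ephemeral0/logs/others/node_bootstrap.log

# Or scan it for problems afterwards (the search pattern is illustrative).
grep -i error /media/ephemeral0/logs/others/node_bootstrap.log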

Understanding a Node Bootstrap Script provides more information.

Examples of Bootstrapping Cluster Nodes with Custom Scripts

Example: Installing R and RHadoop on a Cluster

  1. Create a file named node_bootstrap.sh (or another name of your choice) with the following content:
# Install R.
sudo yum -y install R
# Install the R packages required by RHadoop from CRAN.
echo "install.packages(c(\"rJava\", \"Rcpp\", \"RJSONIO\", \"bitops\", \"digest\",
               \"functional\", \"stringr\", \"plyr\", \"reshape2\", \"dplyr\",
               \"R.methodsS3\", \"caTools\", \"Hmisc\"), repos=\"http://cran.uk.r-project.org\")" > base.R
Rscript base.R
# Install the rhdfs package from source.
wget https://github.com/RevolutionAnalytics/rhdfs/raw/master/build/rhdfs_1.0.8.tar.gz
echo "install.packages(\"rhdfs_1.0.8.tar.gz\", repos=NULL, type=\"source\")" > rhdfs.R
Rscript rhdfs.R
# Install the rmr2 package from source.
wget https://github.com/RevolutionAnalytics/rmr2/releases/download/3.3.1/rmr2_3.3.1.tar.gz
echo "install.packages(\"rmr2_3.3.1.tar.gz\", repos=NULL, type=\"source\")" > rmr.R
Rscript rmr.R
# Download the Hadoop streaming jar used by rmr2.
cd /usr/lib/hadoop
wget http://www.java2s.com/Code/JarDownload/hadoop-streaming/hadoop-streaming-1.1.2.jar.zip
unzip hadoop-streaming-1.1.2.jar.zip
  2. Upload the file to the appropriate location in Cloud storage, then edit the cluster in the QDS UI and enter the file name in the Node Bootstrap File field. An example upload command is shown below.
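
For example, if your account's default Cloud storage location is on S3, uploading the bootstrap file might look like the following. The bucket and prefix are placeholders; use the location configured for your account.

# Hypothetical S3 destination; substitute your account's default location.
aws s3 cp node_bootstrap.sh s3://your-default-bucket/scripts/hadoop/node_bootstrap.sh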

The above example installs R and the RHadoop packages (rmr2 and rhdfs) on the cluster nodes. You can now run R commands as well as RHadoop commands. A sample R script that uses RHadoop is given below.

# Point rmr2 at the Hadoop streaming jar downloaded by the bootstrap script.
Sys.setenv("HADOOP_STREAMING"="/usr/lib/hadoop/hadoop-streaming-1.1.2.jar")
library(rmr2)

# Write a small vector to DFS and square each value with a MapReduce job.
small.ints = to.dfs(1:1000)
mapreduce(
  input = small.ints,
  map = function(k, v) cbind(v, v^2))
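
To try the sample by hand, you can save it to a file on a cluster node and run it with Rscript, the same way the bootstrap script runs its helper files. The file name below is arbitrary.

# Quick check that the rmr2 package installed by the bootstrap is available.
Rscript -e 'library(rmr2)'

# Run the sample script saved as sample_rmr.R.
Rscript sample_rmr.R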

Running Ad hoc Scripts on a Cluster

You may want to execute scripts on a cluster in an ad hoc manner. You can use a REST API to execute a script located in Cloud storage. See Run Adhoc Scripts on a Cluster for information about the API.
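
The request follows the usual QDS REST conventions: an authenticated call that points at the script's location in Cloud storage. The endpoint path and script location in the sketch below are placeholders rather than the definitive API; see Run Adhoc Scripts on a Cluster for the exact call.

# Placeholder endpoint and script path; see "Run Adhoc Scripts on a Cluster"
# for the actual API. AUTH_TOKEN is your QDS API token.
curl -X POST \
  -H "X-AUTH-TOKEN: $AUTH_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"script_location": "s3://your-bucket/scripts/adhoc_script.sh"}' \
  "https://api.qubole.com/api/v1.2/<adhoc-script-endpoint>"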

The Run Adhoc Script functionality uses pssh (parallel SSH) to run ad hoc scripts on the cluster nodes. It has been tested under the following conditions:

  • It works in clusters that are set up using a proxy tunnel server.
  • Even if the script's execution time exceeds the pssh timeout, the script still executes on the node.

Limitations of Running Ad hoc Scripts

If a script is already running and you try to execute the same script on the same cluster, the second instance does not run. To work around this, change the path of the script and then run it as a separate invocation of the API.
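
For example, if the script is stored on S3, you could copy it to a different key and invoke the API again with the new location. The bucket name and paths below are placeholders.

# Copy the script to a new path so that it is treated as a separate script,
# then call the ad hoc script API again with this new location.
aws s3 cp s3://your-bucket/scripts/myscript.sh s3://your-bucket/scripts/myscript_run2.sh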