Running Node Bootstrap and Ad hoc Scripts on a Cluster¶
Qubole allows you to run node bootstrap scripts on cluster nodes, and to run other scripts ad hoc as needed. The following topics describe running node bootstrap and ad hoc scripts:
- Running Node Bootstrap Scripts on a Cluster
- Examples of Bootstrapping Cluster Nodes with Custom Scripts
- Running Ad hoc Scripts on a Cluster
- Limitations of Running Ad hoc Scripts
Running Node Bootstrap Scripts on a Cluster¶
You can edit the default node bootstrap script from the cluster settings page: in the QDS UI, navigate to Clusters and click Edit against a specific cluster. Managing Clusters provides more information.
Note
Qubole recommends activating Qubole’s virtual environment and installing or updating custom Python libraries inside it. The virtual environment contains many popular Python libraries and has other advantages, as described in Using Pre-installed Python Libraries from the Qubole VirtualEnv.
You can install and update Python libraries in Qubole’s virtual environment by adding code to the node bootstrap script, as follows:
source /usr/lib/hustler/bin/qubole-bash-lib.sh
qubole-use-python2.7
pip install <library name>
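For example, a bootstrap snippet that installs a couple of libraries might look like the following (the library names are only illustrative):
# Activate Qubole's virtual environment before installing libraries
source /usr/lib/hustler/bin/qubole-bash-lib.sh
qubole-use-python2.7
# Install the libraries your jobs need (names are illustrative)
pip install numpy
pip install requests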
The node bootstrap logs are written to node_bootstrap.log under /media/ephemeral0/logs/others. You can also view them from the QDS UI in the Nodes table of a running cluster: in the Clusters section of the UI, the number of nodes is displayed against Nodes below each active cluster; click the number to open the Nodes table. For more information, see Using the Cluster User Interface.
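If you have SSH access to a cluster node, you can also inspect the log directly; a minimal sketch:
# View the most recent bootstrap output on a node (assumes SSH access)
tail -n 200 /media/ephemeral0/logs/others/node_bootstrap.log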
Understanding a Node Bootstrap Script provides more information.
Examples of Bootstrapping Cluster Nodes with Custom Scripts¶
Example: Installing R and RHadoop on a Cluster¶
- Create a file named node_bootstrap.sh (or another name of your choice) with the following content:
# Install R
sudo yum -y install R

# Install the R packages required by RHadoop
echo "install.packages(c(\"rJava\", \"Rcpp\", \"RJSONIO\", \"bitops\", \"digest\",
\"functional\", \"stringr\", \"plyr\", \"reshape2\", \"dplyr\",
\"R.methodsS3\", \"caTools\", \"Hmisc\"), repos=\"http://cran.uk.r-project.org\")" > base.R
Rscript base.R

# Install rhdfs from source (append ?raw=true and use -O so wget fetches the
# archive itself rather than the GitHub HTML page)
wget https://github.com/RevolutionAnalytics/rhdfs/blob/master/build/rhdfs_1.0.8.tar.gz?raw=true -O rhdfs_1.0.8.tar.gz
echo "install.packages(\"rhdfs_1.0.8.tar.gz\", repos=NULL, type=\"source\")" > rhdfs.R
Rscript rhdfs.R

# Install rmr2 from source
wget https://github.com/RevolutionAnalytics/rmr2/releases/download/3.3.1/rmr2_3.3.1.tar.gz
echo "install.packages(\"rmr2_3.3.1.tar.gz\", repos=NULL, type=\"source\")" > rmr.R
Rscript rmr.R

# Download the Hadoop streaming jar that rmr2 uses
cd /usr/lib/hadoop
wget http://www.java2s.com/Code/JarDownload/hadoop-streaming/hadoop-streaming-1.1.2.jar.zip
unzip hadoop-streaming-1.1.2.jar.zip
- Edit the cluster in the QDS UI and enter the name of the bootstrap file in the Node Bootstrap File field, making sure the file itself is placed in the corresponding location in Cloud storage.
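Once the cluster comes up with this bootstrap, you can verify the installation; for example, assuming SSH access to a cluster node:
# Check that R is installed and that the RHadoop packages are present
R --version
Rscript -e 'c("rmr2", "rhdfs") %in% rownames(installed.packages())'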
The above example installs R, RHadoop, and RHDFS on the cluster nodes, after which you can run R commands as well as RHadoop commands. A sample R script using RHadoop is given below.
# Point rmr2 at the Hadoop streaming jar downloaded by the bootstrap script
Sys.setenv("HADOOP_STREAMING"="/usr/lib/hadoop/hadoop-streaming-1.1.2.jar")
library(rmr2)

# Write the integers 1 to 1000 to HDFS and square them in a map-only job
small.ints = to.dfs(1:1000)
mapreduce(
  input = small.ints,
  map = function(k, v) cbind(v, v^2))
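To try the sample, you could save it to a file on the cluster and run it with Rscript (the file name here is illustrative):
# Run the sample; assumes the bootstrap above installed rmr2 and the streaming jar
Rscript rhadoop_sample.R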
Running Ad hoc Scripts on a Cluster¶
You may want to execute scripts on a cluster in an ad hoc manner. You can use a REST API to execute a script located in Cloud storage. See Run Adhoc Scripts on a Cluster for information about the API.
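The script itself is simply an executable file uploaded to Cloud storage; a minimal sketch of an ad hoc script that reports basic node information (the contents are purely illustrative):
#!/bin/bash
# Illustrative ad hoc script: report basic information from each node
hostname
uptime
df -h /media/ephemeral0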
The Run Adhoc Script functionality uses the pssh utility to spawn ad hoc scripts on the cluster nodes. It has been tested under the following conditions:
- It works on clusters that are set up using a proxy tunnel server.
- Even if the script’s execution time is longer than the pssh timeout, the script still executes on the node.
Limitations of Running Ad hoc Scripts¶
If a script is running and you try to execute the same script on the same cluster, the second instance does not run. To work around this, change the path of the script and then run it as a separate instance through the API.
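For example, assuming the script is stored in S3 (the bucket and key names are illustrative), you could copy it to a new path and invoke the API again with that path:
# Copy the script to a new path so it runs as a separate instance
# (assumes AWS S3 as the Cloud storage; names are illustrative)
aws s3 cp s3://example-bucket/scripts/myscript.sh s3://example-bucket/scripts/myscript-run2.sh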