4. How do I install custom Python libraries through the node bootstrap?

Qubole recommends installing and updating custom Python libraries inside Qubole’s virtual environment: activate the environment first, then install the libraries into it. The virtual environment already contains many popular Python libraries and offers the advantages described in Using Pre-installed Python Libraries from the Qubole VirtualEnv.

To install or update Python libraries in Qubole’s virtual environment, add the following script to the node bootstrap.

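# Load Qubole's bash helper library
source /usr/lib/hustler/bin/qubole-bash-lib.sh
# Activate Qubole's Python 2.7 virtual environment
qubole-use-python2.7
# Install (or update) the library inside the virtual environment
pip install <library name>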

For more information on the node bootstrap, see Understanding a Node Bootstrap Script and Running Node Bootstrap and Ad hoc Scripts on a Cluster.
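
When several libraries are needed, the same steps can be wrapped in a small loop. The following is a minimal sketch, not an official Qubole script: the library names and versions are placeholders, the install_libs function and the PIP_CMD override are hypothetical helpers introduced only for illustration, and the file-existence guard is an added assumption so that the script does nothing on a machine without the Qubole helper library.

```shell
#!/bin/bash
# Sketch of a node bootstrap that installs several pinned libraries
# into Qubole's virtual environment. Names below are illustrative.

# Path taken from the script shown earlier in this section.
QUBOLE_LIB=/usr/lib/hustler/bin/qubole-bash-lib.sh

# Placeholder requirement specifiers; replace with your own.
LIBS="example-lib-a==1.0.0 example-lib-b==2.3.1"

install_libs() {
    # PIP_CMD is a hypothetical override (defaults to pip), useful for dry runs.
    local pip_cmd="${PIP_CMD:-pip}"
    local spec
    for spec in "$@"; do
        # Keep going if one install fails, but leave a warning in the log.
        $pip_cmd install "$spec" || echo "WARN: failed to install $spec" >&2
    done
}

# Assumption: only act where the Qubole helper library is present.
if [ -f "$QUBOLE_LIB" ]; then
    source "$QUBOLE_LIB"
    qubole-use-python2.7
    install_libs $LIBS
fi
```

Pinning versions (name==version) keeps every node on the same library version. Running with PIP_CMD=echo prints the pip commands instead of executing them, which is a quick way to check the list before adding the script to a cluster.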

4.1. Using Pre-installed Python Libraries from the Qubole VirtualEnv

Qubole also recommends checking the list of Python libraries already installed in the Qubole virtualenv to determine whether the virtualenv meets your needs. The advantage is that you do not pay the cost of installing a library that is already present. When most of the required packages are already installed in the Qubole virtualenv, the first query takes less time to run. The pre-installed libraries in the virtualenv are listed below.

airflow (1.7.0)
alembic (0.8.6)
amqp (1.4.9)
anyjson (0.3.3)
appdirs (1.4.3)
argparse (1.2.1)
awscli (1.11.70)
Babel (1.3)
backports.ssl-match-hostname (3.5.0.1)
beautifulsoup4 (4.5.1)
billiard (3.3.0.23)
boto (2.40.0)
boto3 (1.3.1)
botocore (1.4.93)
bs4 (0.0.1)
celery (3.1.23)
certifi (2016.2.28)
cffi (1.4.2)
chartkick (0.4.2)
Cheetah (2.4.1)
click (6.6)
colorama (0.3.7)
configobj (4.6.0)
croniter (0.3.12)
cryptography (1.7.1)
Cython (0.20.1)
datadog (0.12.0)
decorator (3.3.2)
dill (0.2.5)
Django (1.6.4)
django-extensions (0.9)
docutils (0.13.1)
enum34 (1.1.6)
Flask (0.10.1)
Flask-Admin (1.4.0)
Flask-Cache (0.13.1)
Flask-Login (0.2.11)
Flask-WTF (0.12)
flower (0.9.1)
future (0.15.2)
futures (3.0.5)
gunicorn (19.3.0)
idna (2.2)
inflection (0.3.1)
iniparse (0.3.1)
ipaddress (1.0.17)
itsdangerous (0.24)
Jinja2 (2.8)
jmespath (0.9.0)
kombu (3.0.35)
lxml (2.3)
Mako (1.0.4)
Markdown (2.6.6)
MarkupSafe (0.23)
mpi4py (1.3.1)
mrjob (0.3.5)
MySQL-python (1.2.5)
ndg-httpsclient (0.4.1)
networkx (1.8.1)
nltk (2.0.4)
numexpr (2.6.1)
numpy (1.11.1rc1)
ordereddict (1.1)
packaging (16.8)
pandas (0.18.1)
paramiko (1.7.7.1)
PIL (1.1.7)
pip (1.4.1)
psutil (4.3.1)
publicsuffix (1.0.4)
pyasn1 (0.1.2)
pycparser (2.14)
pycrypto (2.5)
pycurl (7.19.0)
pydot (1.0.2)
Pygments (2.1.3)
pygpgme (0.1)
pyOpenSSL (16.2.0)
pyparsing (2.2.0)
python-dateutil (2.5.3)
python-editor (1.0.1)
python-gflags (2.0)
python-magic (0.4.11)
pytz (2016.4)
PyYAML (3.10)
qds-sdk (1.9.6)
rdbtools (0.1.5, /usr/lib/virtualenv/python27/src/rdbtools)
recordclass (0.4.1)
redis (2.7.6)
requests (2.10.0)
rsa (3.4.2)
s3cmd (1.5.2)
s3transfer (0.1.10)
scikit-image (0.9.3)
scipy (0.13.3)
setproctitle (1.1.10)
setuptools (23.0.0)
simplejson (2.3.3)
six (1.10.0)
SocksiPy-branch (1.1)
spotman-client (0.2.0)
SQLAlchemy (1.1.0b1)
thrift (0.9.3)
tornado (4.2)
ujson (1.33)
urlgrabber (3.9.1)
urllib3 (1.16)
Werkzeug (0.11.10)
wheel (0.24.0)
workerpool (0.9.2)
wsgiref (0.1.2)
WTForms (2.1)
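
To check whether a particular library is already present before adding a pip install line to the bootstrap, the listing can be searched from the shell. The helper below is a minimal sketch: preinstalled_version is a hypothetical function name, and it assumes input in the "name (version)" format shown above, which is how older pip releases print pip list (newer pip releases default to a column layout, so the parsing would need adjusting there).

```shell
# Print the version of a package found in "name (version)" lines on stdin.
# Returns non-zero if the package is not listed. Matching is case-insensitive.
preinstalled_version() {
    local name="$1"
    awk -v pkg="$name" '
        tolower($1) == tolower(pkg) {
            v = $2
            gsub(/[(),]/, "", v)   # strip the parentheses and any trailing comma
            print v
            found = 1
            exit
        }
        END { exit found ? 0 : 1 }
    '
}
```

For example, after activating the virtual environment, pip list | preinstalled_version pandas prints the installed version if pandas is present, and the exit status can be used to skip an unnecessary install.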