Qubole Data Service Documentation

Getting Started

Quick Start Guides

These guides provide a quick introduction to the Qubole Data Service (QDS). For more comprehensive information, see the Administration Guide, the User Guide, and the FAQs. QDS functions can also be called programmatically; see the REST API Reference.

The Guide for the GCP platform is:

GCP Quick Start Guide

The topics in this section provide a quick introduction to getting started with Qubole Data Service (QDS) on GCP.

Prerequisites and Signup
Prerequisites

Ensure that you have the following prerequisites before beginning the setup process:

  • Google account: You must have a Google account to begin the setup process. You will use this Google account to sign up for QDS on GCP.
  • GCP project: You must have a GCP project that is tied to a valid billing account to use QDS on GCP. All of the GCP resources you create to use with QDS will be contained within this project. You can either create a new project or use an existing project. For information on creating a project, see Creating and Managing Projects in the GCP documentation.

Note

In the current version of QDS on GCP, each QDS account can be associated with only one GCP project.

  • Your GCP project must have the following APIs enabled (a sample gcloud command for enabling them follows this list):

    • Compute Engine API
    • Cloud Resource Manager API
    • Identity and Access Management (IAM) API
    • BigQuery Storage API (optional; required only if you want to use BigQuery)
  • Your GCP project can already contain service accounts, but it must have enough remaining quota for Qubole to create two more service accounts in the project. By default, GCP allows 100 service accounts per project.
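
For example, you can enable these APIs from Cloud Shell with a single gcloud command. This is a minimal sketch: <your-project-id> is a placeholder, and bigquerystorage.googleapis.com is needed only if you plan to use BigQuery.

    gcloud services enable \
        compute.googleapis.com \
        cloudresourcemanager.googleapis.com \
        iam.googleapis.com \
        bigquerystorage.googleapis.com \
        --project=<your-project-id>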

Sign up for Qubole on GCP
  1. Make sure that your GCP project meets the prerequisites described above.

  2. Click Sign up with Google on the signup page:

    _images/01-SignupPage.png
  3. Click the Google account to be used for the signup process:

    _images/02b-SelectGoogleAccount.png
  4. Provide additional details and click Save And Start Authentication with Google Cloud:

    _images/03-ProvideMoreDetails.png
Test Drive Qubole on GCP

Once you have completed the sign-up process, you can begin your free test drive of Qubole on GCP by clicking Test Drive Qubole:

_images/54-start-test-drive.png

Test Drive begins with a guided tour of the Qubole Analyze page, and guides you through running a Hive query. From there, you are free to run queries of your own and explore the QDS interface. Test Drive includes pre-loaded data sets, example use cases, and free technical support, and gives you free access to Qubole for 14 days. At the end of the 14 days, or at any time once you have run a command on the Analyze page, a banner appears at the top of the screen where you can click to unlock a 30-day free trial of QDS:

_images/55-start-30-day-free-trial.png

During your 14-day test drive, Qubole and GCP cloud services are free. During the 30-day free trial, Qubole continues to be free, but you must use your own paid GCP account.

Setting up your Qubole Account on GCP

There are two ways to set up Qubole on GCP:

  • Automated setup: The automated setup procedure makes it easy to get started with Qubole on GCP. Qubole automatically creates the service accounts you need and assigns the roles and permissions needed to run your Qubole workloads on GCP.
  • Manual setup: The manual setup procedure provides a guided experience in which you download and execute a script provided by Qubole to set up your account.
Automated Setup

This section describes the automated process for setting up a Qubole account on GCP. Before beginning the setup process, be sure that you have the prerequisites described in Prerequisites and Signup.

Required Permissions for Setup
  • Default Storage Location (defloc): QDS must have read/write access to a default storage location in Google Cloud Storage where you want QDS to save log files and write the results of the queries you run. You will enter the defloc location during the setup process.
  • QDS must have read/write access to the buckets in Cloud Storage where you will store the data you want to process with QDS.
  • To perform the automated account setup, certain permissions must be assigned to the Qubole service account (QSA), as described in step 3 of the Setup Process below.

Note

For information on creating a service account, see Creating and managing service accounts in the GCP documentation. For information on assigning roles to a service account, see Granting roles to service accounts in the GCP documentation.

Setup Process
  1. In the QDS UI, go to Control Panel > Account Settings > Access Settings. For Access Mode Type, select Automated.

  2. In this step, you will assign the required GCP permissions for your QDS account.

    1. Log in to the GCP console and navigate to the IAM & admin page (https://console.cloud.google.com/iam-admin/iam).

    2. At the top of the IAM & admin page, select the project that you want to associate with your QDS account.

    3. Click the Add button to assign an IAM policy to the project.

      _images/40-gcp-console-IAM-page.png
    4. In the QDS UI, go to Control Panel > Account Settings > Access Settings, and copy the Qubole service account’s (QSA) email address:

      _images/41-autosetup-sa-eamail.png
    5. Paste the copied email address into the New members text box on the IAM & admin page in the GCP console:

      _images/42-gcp-new-member-field.png
  3. Add either the roles or the granular permissions below to the Qubole service account (QSA). You can either assign predefined GCP roles or create a custom role with granular permissions. Note that when you assign predefined roles, you might be assigning broader permissions than what QDS requires. If any of the required permissions are missing, however, account setup may fail.

    1. Predefined roles:
      1. Service Account Admin: This role includes permissions for working with service accounts.
      2. Project IAM Admin: This role contains permissions to access and administer a project’s IAM policies.
      3. Storage Legacy Bucket Owner/Storage Admin: At least one of these two roles should appear in the IAM section of your GCP console. Add either of these roles to provide read/write access to existing buckets with object listing, creation, and deletion.
      4. Role Administrator: This role includes permissions to create custom roles.
    2. Custom granular permissions:

    To apply granular permissions, you must first create a custom role and then assign it to the Qubole service account (QSA). For information about creating custom roles, see Creating and managing custom roles in the GCP documentation. Include the following permissions in your custom role (an equivalent gcloud sketch follows the list):

    1. iam.roles.create
    2. iam.roles.delete
    3. iam.roles.get
    4. iam.roles.list
    5. iam.roles.undelete
    6. iam.roles.update
    7. iam.serviceAccounts.create
    8. iam.serviceAccounts.delete
    9. iam.serviceAccounts.get
    10. iam.serviceAccounts.getIamPolicy
    11. iam.serviceAccounts.list
    12. iam.serviceAccounts.setIamPolicy
    13. iam.serviceAccounts.update
    14. resourcemanager.projects.get
    15. resourcemanager.projects.getIamPolicy
    16. resourcemanager.projects.list
    17. resourcemanager.projects.setIamPolicy
    18. storage.buckets.getIamPolicy
    19. storage.buckets.list
    20. storage.buckets.setIamPolicy
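
    If you prefer the command line to the console for this step, the following is a minimal gcloud sketch. The role ID qubole_setup_role and the angle-bracket placeholders are illustrative, and the permission list shown is truncated; include every permission listed above.

      # Create a custom role containing the permissions listed above
      # (qubole_setup_role is an example role ID; replace the placeholders)
      gcloud iam roles create qubole_setup_role \
          --project=<project-id> \
          --title="Qubole Setup Role" \
          --permissions=iam.roles.create,iam.roles.get,iam.serviceAccounts.create  # ...add the rest of the permissions above, comma-separated

      # Grant the custom role to the Qubole service account (QSA)
      gcloud projects add-iam-policy-binding <project-id> \
          --member="serviceAccount:<qsa-email-copied-from-the-QDS-UI>" \
          --role="projects/<project-id>/roles/qubole_setup_role"
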
  4. Click Save to complete assigning the IAM permissions. It might take a few seconds for the permission changes to take effect.
  5. In the QDS UI, go to Control Panel > Account Settings > Access Settings and reload the page.

  6. From the Projects dropdown, select the project ID for the project to which you gave IAM permissions to the Qubole service account (QSA).

  7. In the Default Location field, enter the name of the bucket (without the gs:// prefix) that will serve as your default location (defloc) in Cloud Storage.

  8. Optionally, in the Data Bucket(s) field, provide a comma-separated list of data buckets (without the gs:// prefix) where you want QDS to read and write data. You can provide a maximum of five data buckets.

  9. Click Save. If there are errors, appropriate error messages are displayed.

  10. Validation of credentials after Save:

    • If your settings were saved successfully, you will see a message at the top of the page saying, “Please wait while we validate your settings. It may take up to a few minutes.” Upon completion of the validation, your account will be fully operational.
    • Qubole validates your settings in the background, allowing you to use the application while the settings are being validated, but you will not be allowed to update the access settings or perform operations that interact with GCP, such as starting a cluster.
    • Validation may take up to 5 minutes.
    • If validation is successful, you will see green check marks in the Access Settings section next to the Default Location and Data Bucket(s) fields. If validation fails, you will see a red X next to the respective field(s).
  11. Troubleshooting if validation fails:

    1. Try re-saving the access settings.
    2. If the problem persists, contact Qubole Support.
  12. Changing the project ID for the QDS account:

    1. If you update your project ID in the QDS Access Settings UI, you must assign QSA the required permissions (as described above) again on the new project.
    2. The project can be changed only when there are no running clusters.
Custom Roles Created During Automated Setup

During automated setup, Qubole creates two custom roles in your project, qbol_compute_role and qbol_storage_role, and assigns both roles to both your Compute Service Account (CSA) and Instance Service Account (ISA). The GCP permissions included in these roles are listed below. Do not modify or delete these roles from the project, as doing so might lead to unexpected behavior.

The custom qbol_compute_role includes the following GCP permissions:

  • compute.addresses.use
  • compute.addresses.useInternal
  • compute.disks.create
  • compute.disks.delete
  • compute.disks.get
  • compute.disks.list
  • compute.disks.setLabels
  • compute.disks.use
  • compute.diskTypes.list
  • compute.firewalls.create
  • compute.firewalls.delete
  • compute.firewalls.get
  • compute.firewalls.list
  • compute.firewalls.update
  • compute.globalOperations.get
  • compute.instances.attachDisk
  • compute.instances.create
  • compute.instances.delete
  • compute.instances.detachDisk
  • compute.instances.get
  • compute.instances.list
  • compute.instances.reset
  • compute.instances.resume
  • compute.instances.setLabels
  • compute.instances.setMetadata
  • compute.instances.setServiceAccount
  • compute.instances.setTags
  • compute.instances.start
  • compute.instances.stop
  • compute.instances.suspend
  • compute.instances.use
  • compute.networks.list
  • compute.networks.updatePolicy
  • compute.networks.use
  • compute.networks.useExternalIp
  • compute.regions.get
  • compute.subnetworks.list
  • compute.subnetworks.use
  • compute.subnetworks.useExternalIp
  • compute.zoneOperations.get

The custom qbol_storage_role includes the following GCP permissions:

  • storage.buckets.get
  • storage.buckets.getIamPolicy
  • storage.buckets.list
  • storage.objects.create
  • storage.objects.delete
  • storage.objects.get
  • storage.objects.list
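
If you want to inspect these roles after setup, you can do so with gcloud (a minimal sketch; the project ID is a placeholder):

    gcloud iam roles describe qbol_compute_role --project=<project-id>
    gcloud iam roles describe qbol_storage_role --project=<project-id>
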
Roles for Google BigQuery

In addition, two GCP roles are assigned to enable use of Google BigQuery as follows:

  • bigquery.dataViewer is assigned to your Compute Service Account (CSA)
  • bigquery.readSessionUser is assigned to your Instance Service Account (ISA)

For more information about these roles, see Predefined roles and permissions in the GCP BigQuery documentation.

bigquery.dataViewer contains the following GCP permissions:

  • bigquery.datasets.get
  • bigquery.datasets.getIamPolicy
  • bigquery.models.getData
  • bigquery.models.getMetadata
  • bigquery.models.list
  • bigquery.routines.get
  • bigquery.routines.list
  • bigquery.tables.export
  • bigquery.tables.get
  • bigquery.tables.getData
  • bigquery.tables.list
  • resourcemanager.projects.get
  • resourcemanager.projects.list

bigquery.readSessionUser contains the following GCP permissions:

  • bigquery.readsessions.*
  • resourcemanager.projects.get
  • resourcemanager.projects.list
Manual Setup (with Qubole script)

This section describes the manual process for setting up a Qubole account on GCP. The user running this setup will need access to a service account and its associated JSON credentials file with permissions specified below. Before beginning the setup process, be sure that you have the prerequisites described in Prerequisites and Signup.

Required Permissions for Setup
  • The user executing the manual script must be assigned the following IAM roles:

    • Service Account Key Admin: To generate JSON keys for service accounts.
    • Storage Admin: To create and assign read/write privileges on storage buckets.
  • The service account used to execute the manual script must have the following IAM roles:

    1. Service Account Admin: This role includes permissions for working with service accounts.
    2. Project IAM Admin: This role contains permissions to access and administer a project’s IAM policies.
    3. Role Administrator role: This role includes permissions to create custom roles.
  • Create a new bucket with bucket-level permissions enabled for the default storage location (defloc). From the Permissions tab on the bucket details page, assign either the Storage Legacy Bucket Owner or the Storage Admin role on the defloc bucket to the service account used to execute the setup:

    _images/39-give-bucket-role-to-sa.png

Note

At least one of the two roles Storage Legacy Bucket Owner and Storage Admin should appear in the IAM section of your GCP console. Add either of these roles to provide the permissions required for the default storage location.
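
If you prefer the command line to the console for this step, the following is a minimal gsutil sketch. The service account email and bucket name are placeholders; the shorthand role name legacyBucketOwner corresponds to Storage Legacy Bucket Owner.

    # Grant Storage Legacy Bucket Owner on the defloc bucket to the setup service account
    gsutil iam ch \
        serviceAccount:<setup-sa>@<project-id>.iam.gserviceaccount.com:legacyBucketOwner \
        gs://<defloc-bucket>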

Setup Process
  1. In the QDS UI, go to Control Panel > Account Settings > Access Settings. For Access Mode Type, select Manual.

    _images/03b-CreateQuboleServiceAccount.png
  2. Click Download account setup script to download the setup_service_accounts.sh script for the setup process.

  3. Leave the fields Compute Service Account and Instance Service Account blank. The service accounts will be created for you when you execute the setup script (see step 7), and these fields will be populated automatically.

  4. Enter your Project ID and the Default Location (defloc) for storing Qubole data and logs.

Note

Omit gs:// when specifying your defloc value here.

  5. You must provide the JSON credentials file corresponding to the service account you are using for this setup as an input to the script. For information on creating a JSON credentials file, see Creating and managing service account keys in the GCP documentation.

  6. Upload your credentials file and the downloaded setup script to Cloud Shell. For information on uploading files to Cloud Shell, see Using the Session Window in the GCP documentation.

  7. Invoke the downloaded setup script using the source command. The script uses the gcloud and gsutil command-line tools, which are available as part of the Google Cloud SDK. You should execute the script from a Google Cloud Shell under your GCP account, since all required Google Cloud SDK packages come pre-installed with Cloud Shell. For information on how to invoke Cloud Shell, see Starting Cloud Shell in the GCP documentation.

    Enter these values specific to your GCP account into the setup script:

    • Your Qubole Service Account.
    • Your JSON credentials file.
    • Your project ID.
    • A default location (defloc) on Google Cloud Storage that Qubole can use to store logs and processing output. This will be a URL of the form gs://<path-to-bucket>.

    Usage:

    source setup_service_accounts.sh \
    --qubole_sa=<qubole_service_account> \
    --credentials_file=<customer_json_credentials_file> \
    --project=<customer_ProjectID> \
    --defloc=<google_storage_bucket_for_qubole>
    

    Example:

    source setup_service_accounts.sh \
    --qubole_sa=qds1-618@testgcp-218818.iam.gserviceaccount.com \
    --credentials_file=gcp-key.json \
    --project=qubole-gce \
    --defloc=gs://vs-test
    

In this example, gcp-key.json is the credentials file uploaded to Cloud Shell.

The output of the setup script will look similar to the following. You can use this output to complete the Access Settings fields in the QDS Control Panel:

Compute Service Account : testvs-comp@qubole-gce.iam.gserviceaccount.com
Instance Service Account: testvs-inst@qubole-gce.iam.gserviceaccount.com
Project ID              : qubole-gce
Default Location        : vs-test
Data Buckets            : vs-data-buckets,arkaraj-acm

Note

You can also display the values of the Compute Service Account and the Instance Service Account with the echo command in Cloud Shell; the setup script stores them in the following environment variables:

  • Compute Service Account: $COMPUTE_SERVICE_ACCOUNT_FOR_QUBOLE
  • Instance Service Account: $INSTANCE_SERVICE_ACCOUNT_FOR_QUBOLE
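
For example (a minimal sketch, assuming you are still in the Cloud Shell session in which you ran the setup script):

    echo $COMPUTE_SERVICE_ACCOUNT_FOR_QUBOLE
    echo $INSTANCE_SERVICE_ACCOUNT_FOR_QUBOLE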

Qubole recommends that you also store the service account names in a secure place.

  8. To use Google BigQuery, you must add two additional roles (a sample gcloud sketch follows this list):

    • roles/bigquery.dataViewer on your Compute Service Account (CSA).
    • roles/bigquery.readSessionUser on your Instance Service Account (ISA).
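
    A minimal gcloud sketch for adding these two bindings from Cloud Shell. The project ID is a placeholder, and the environment variables are the ones set by the setup script; substitute the service account emails if you are in a new session.

      gcloud projects add-iam-policy-binding <project-id> \
          --member="serviceAccount:$COMPUTE_SERVICE_ACCOUNT_FOR_QUBOLE" \
          --role="roles/bigquery.dataViewer"

      gcloud projects add-iam-policy-binding <project-id> \
          --member="serviceAccount:$INSTANCE_SERVICE_ACCOUNT_FOR_QUBOLE" \
          --role="roles/bigquery.readSessionUser"
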
  9. Click Save to finish setting up your account.

  10. Validation of credentials after Save:

    • If your settings are saved successfully, you will see a message at the top of the page saying, “Please wait while we validate your settings. This may take a few minutes.” Upon completion of the validation, your account will be fully operational.
    • Qubole validates your settings in the background, so you can use the application while the settings are being validated, but you will not be allowed to update the access settings or perform any GCP operations, such as starting a cluster.
    • Validation may take up to 5 minutes.
    • If validation is successful, you will see a green check mark in the Access Settings section next to the Default Location field. If validation fails, you will see a red X in the Access Settings section next to the Default Location field.
Adding Cloud Storage Buckets and Configuring Permissions

Cloud Storage buckets used with QDS must be configured to provide read/write access to your Compute Service Account and Instance Service Account. To add Cloud Storage buckets and configure access permissions, perform the following steps:

  1. In the Navigation menu of the GCP console, click Storage and then click Create Bucket.

    _images/51-storage-create-bucket.png
  2. Provide a name for the bucket and click Create.

  3. On the Bucket details screen, click the Permissions tab.

  4. Click Add members.

    _images/53-storage-bucket-add-members.png
  5. In the QDS UI, from the Access Settings section of the Control Panel, copy the names of your Compute Service Account and Instance Service Account.

  6. Paste the service account names into the New members field in the GCP console Add members screen.

  7. In the Role field, click the role selection dropdown list and select Custom > Custom Qubole Storage Role to assign the Qubole Storage Role to the new members.

    _images/52-storage-bucket-add-role.png
  8. Click Save.

Points to remember
  1. In manual setup (as with automated setup), Qubole creates two custom roles in your project: qbol_compute_role and qbol_storage_role. Do not modify or delete these roles from the project, as doing so might lead to unexpected behavior.
  2. Every time access settings are saved in the automated process, you must ensure that the QSA has the required permissions described above.
Manual Setup: Do-it-yourself (without script)

This section describes setting up a Qubole account on GCP manually, without the use of a Qubole-provided script. With this setup, you configure all GCP permissions by creating custom roles directly in the GCP console. Before beginning the setup process, be sure that you have the prerequisites described in Prerequisites and Signup.

There are four steps in this setup procedure:

_images/48-infographic.png

The instructions for each of the four steps follow.

Step 1: Create Custom Roles

In the GCP console, create two custom roles, Qubole Custom Compute Role and Qubole Custom Storage Role, with the permissions listed below for compute and storage respectively (an equivalent gcloud sketch follows the permission lists).

_images/49-GCP-IAM-console.png

Qubole Custom Compute Role:

  • compute.addresses.use
  • compute.addresses.useInternal
  • compute.disks.create
  • compute.disks.delete
  • compute.disks.get
  • compute.disks.list
  • compute.disks.setLabels
  • compute.disks.use
  • compute.diskTypes.list
  • compute.firewalls.create
  • compute.firewalls.delete
  • compute.firewalls.get
  • compute.firewalls.list
  • compute.firewalls.update
  • compute.globalOperations.get
  • compute.instances.attachDisk
  • compute.instances.create
  • compute.instances.delete
  • compute.instances.detachDisk
  • compute.instances.get
  • compute.instances.list
  • compute.instances.reset
  • compute.instances.resume
  • compute.instances.setLabels
  • compute.instances.setMetadata
  • compute.instances.setServiceAccount
  • compute.instances.setTags
  • compute.instances.start
  • compute.instances.stop
  • compute.instances.suspend
  • compute.instances.use
  • compute.networks.list
  • compute.networks.updatePolicy
  • compute.networks.use
  • compute.networks.useExternalIp
  • compute.regions.get
  • compute.subnetworks.list
  • compute.subnetworks.use
  • compute.subnetworks.useExternalIp
  • compute.zoneOperations.get

Qubole Custom Storage Role:

  • storage.buckets.get
  • storage.buckets.getIamPolicy
  • storage.buckets.list
  • storage.objects.create
  • storage.objects.delete
  • storage.objects.get
  • storage.objects.list
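
A minimal gcloud sketch for creating the two custom roles. The role IDs and project ID are placeholders; the compute permission list shown is truncated, so include every compute permission listed above.

    gcloud iam roles create qubole_custom_compute_role \
        --project=<project-id> \
        --title="Qubole Custom Compute Role" \
        --permissions=compute.instances.create,compute.instances.delete,compute.disks.create  # ...add the rest of the compute permissions above

    gcloud iam roles create qubole_custom_storage_role \
        --project=<project-id> \
        --title="Qubole Custom Storage Role" \
        --permissions=storage.buckets.get,storage.buckets.getIamPolicy,storage.buckets.list,storage.objects.create,storage.objects.delete,storage.objects.get,storage.objects.list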

Step 2: Create Service Accounts

Create the following service accounts in your GCP project:

  1. Compute Service Account: Used to spin up clusters in the customer project
  2. Instance Service Account: Used to autoscale clusters based on workload and SLA.
_images/50-create-service-accounts.png
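
A minimal gcloud sketch for creating the two service accounts (the account IDs and project ID below are placeholders; any valid names work):

    gcloud iam service-accounts create qubole-compute \
        --project=<project-id> \
        --display-name="Qubole Compute Service Account"

    gcloud iam service-accounts create qubole-instance \
        --project=<project-id> \
        --display-name="Qubole Instance Service Account"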

Step 3: Assign Roles to Service Accounts

  • First assign the custom roles created in Step 1 to both the Compute and Instance Service Accounts you created in Step 2.

  • To use Google BigQuery, you must also add two additional roles:

    • roles/bigquery.dataViewer on your Compute Service Account (CSA).
    • roles/bigquery.readSessionUser on your Instance Service Account (ISA).
  • Next, establish the right links between the Qubole Service Account and the Compute and Instance Service Accounts you created in Step 2 (an equivalent gcloud sketch follows this list).

    Follow the steps below to link the accounts:

    • Copy the Qubole Service Account from your QDS Control Panel > Account Settings > Access Settings section, as shown below.

      _images/45-dyi-01b.png
    • Go to IAM & admin > Service accounts in the GCP UI. Click SHOW INFO PANEL if it is not already displayed to show the Permissions section of the Service accounts screen.

    • Add the Qubole Service Account as a Service Account User and Service Account Token Creator on the Compute Service Account:

      _images/46-dyi-02.png
    • The Compute Service Account will now look like this:

      _images/57-add-service-accts-to-roles.png
    • In similar fashion, add the following service accounts for the roles indicated:

      • Add the Compute Service Account as a Service Account User on the Instance Service Account.
      • Add the Instance Service Account as a Service Account User on the Instance Service Account.
    • The Instance Service Account will now look like this:

      _images/56-add-service-accts-to-roles.png
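
A minimal gcloud sketch of the bindings described above. All emails, role IDs, and the project ID are placeholders; repeat each command for the other service account or role as indicated in the comments.

    # Grant the custom roles from Step 1 to the Compute Service Account
    # (repeat both commands with the Instance Service Account as the member)
    gcloud projects add-iam-policy-binding <project-id> \
        --member="serviceAccount:<compute-sa>@<project-id>.iam.gserviceaccount.com" \
        --role="projects/<project-id>/roles/qubole_custom_compute_role"
    gcloud projects add-iam-policy-binding <project-id> \
        --member="serviceAccount:<compute-sa>@<project-id>.iam.gserviceaccount.com" \
        --role="projects/<project-id>/roles/qubole_custom_storage_role"

    # Make the Qubole Service Account a Service Account User and Token Creator
    # on the Compute Service Account
    gcloud iam service-accounts add-iam-policy-binding \
        <compute-sa>@<project-id>.iam.gserviceaccount.com \
        --member="serviceAccount:<qubole-service-account-email>" \
        --role="roles/iam.serviceAccountUser"
    gcloud iam service-accounts add-iam-policy-binding \
        <compute-sa>@<project-id>.iam.gserviceaccount.com \
        --member="serviceAccount:<qubole-service-account-email>" \
        --role="roles/iam.serviceAccountTokenCreator"

    # Make the Compute Service Account a Service Account User on the Instance Service Account
    gcloud iam service-accounts add-iam-policy-binding \
        <instance-sa>@<project-id>.iam.gserviceaccount.com \
        --member="serviceAccount:<compute-sa>@<project-id>.iam.gserviceaccount.com" \
        --role="roles/iam.serviceAccountUser"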

Step 4: Configure Storage Access

Create the following Cloud Storage buckets and provide read/write access for them to the Compute Service Account and Instance Service Account:

  • A Cloud Storage bucket for your defloc (default location)
  • The Cloud Storage buckets where you will store datasets that you want to analyze using QDS

To create these buckets and configure read/write access for your Compute Service Account and Instance Service Account, perform the following steps:

  1. In the Navigation menu of the GCP console, click Storage and then click Create Bucket.

    _images/51-storage-create-bucket.png
  2. Provide a name for the bucket and click Create.

  3. On the Bucket details screen, click the Permissions tab.

  4. Click Add members.

    _images/53-storage-bucket-add-members.png
  5. In the QDS UI, from the Access Settings section of the Control Panel, copy the names of your Compute Service Account and Instance Service Account.

  6. Paste the service account names into the New members field in the GCP console Add members screen.

  7. In the Role field, click the role selection dropdown list and select Custom > Custom Qubole Storage Role to assign the Qubole Storage Role to the new members.

    _images/52-storage-bucket-add-role.png
  8. Click Save.

    Note

    If you add additional Cloud Storage buckets later, you must follow the same procedure to give access permission to the Compute Service Account and Instance Service Account for the new buckets.

    Your account setup is now complete.


Using QDS on GCP
Running a Hive Query

Perform the following steps to run a Hive query:

  1. On the Clusters page in the QDS UI, make sure the cluster you want to use is running. If the cluster is stopped, click Start, or click New to create a new Hive cluster.

  2. On the Workbench page in the QDS UI, select the Hive engine and the cluster you want to use.

  3. Click Examples, select Hive Select Query Example, and then click Run or Run Again.

    _images/31-GCP-new-Hive-query-example.png

When the Hive query completes, you should see results like this:

_images/33-GCP-Hive-results.png
Running a Spark Query

Perform the following steps to run a Spark query:

  1. If you don’t have a Spark cluster running, go to the Clusters page and create a Spark cluster by clicking New and clicking the Spark icon.

  2. Start your Spark cluster.

  3. Go to the Workbench page in the QDS UI.

  4. Choose the Spark engine and your Spark cluster, then click on the link for Examples and click the Python Pi Example:

    _images/36-QDS-Spark-query.png
  5. Click Run to run the Python Pi Example query:

    _images/37-QDS-run-Spark-query.png
  6. The result will be displayed in the Results tab when the query has finished running:

    _images/38-QDS-Spark-query-results.png
Using a Notebook

To use notebooks, perform these steps:

  1. If you don’t have a Spark cluster running, go to the Clusters page and create a Spark cluster by clicking New and clicking the Spark icon.

  2. Start your Spark cluster.

  3. Open the Notebooks page in the QDS UI, and click on the link for Examples.

  4. Click Getting Started > Getting Started:

    _images/34-QDS-Notebooks-Examples.png
  5. Click Copy Notebook. In the dialog window, give the notebook a new name or leave the default name. Choose a location or leave the default location. Choose the cluster and click Copy.

    _images/35-QDS-new-Copy-Notebook.png
  6. This creates and runs the new notebook. You’ll see a message that the notebook is in read-only mode while the cluster is starting up. This can take a few minutes.

    _images/30-QDS-Notebooks-cluster.jpg
  7. When your notebook is ready, you can use it to learn more about notebooks:

    1. Look at the many examples included in the Examples section of the Notebooks page to help you get started quickly.
    2. For more information on managing Notebooks, see the Qubole documentation section Notebooks.

For more comprehensive information, see the Administration Guide, the User Guide, and the REST API Reference.

Release Information

QDS Release Notes

Version R56 (GCP)

This section of the Release Notes describes capabilities of the Qubole Data Service (QDS) on Google Cloud Platform (GCP), as of Release Version R56.

For details on this version, see the sections below.

Release Highlights

With R56, Qubole includes support for Google Cloud Platform (GCP).

The following are highlights of Qubole’s R56 release:

Analytics experience
  • Added error logs API for Presto commands.

GET /api/v1.2/commands/<Command-ID>/error_logs
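
A hedged curl sketch of calling this endpoint. The gcp.qubole.com base URL and the X-AUTH-TOKEN header follow common Qubole REST API conventions and may differ for your environment; the command ID is a placeholder.

    curl -X GET \
        -H "X-AUTH-TOKEN: $QUBOLE_API_TOKEN" \
        -H "Accept: application/json" \
        "https://gcp.qubole.com/api/v1.2/commands/<Command-ID>/error_logs"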

Data engineering
Airflow
  • The latest Airflow 1.10 version is now fully supported, with Python 3.5 package management.
Custom Metastore
  • You can now connect to any remote metastore server using the Thrift URI.
Support for multiple notification channels in Scheduler
  • You can now configure and send alerts to multiple endpoints, such as Slack, PagerDuty, email, and other webhooks.
Data science
  • Notebook usability improvements for paragraphs, such as active indicators and compact sizing to fit the query.
Administration and TCO
  • Added the user's email address to the commands API response to help admins check usage easily. GET /api/v1.2/commands/
Engines
  • Spark

    • Enhanced Broadcast Joins: Introduces executor-based broadcast, in which the values to be broadcast are not collected on the driver. This ensures that driver memory is not a bottleneck, enabling users with lower driver memory to use broadcast joins.
    • Cost Based Optimizer (CBO) Support: CBO, which uses table statistics to optimize queries for performance, is now enabled by default in Spark 2.4.0.
    • Hint-based Skew Join Support: Users can now specify hints for skewed columns and values for a join. Spark automatically allocates more resources to the skewed value based on the hint.
  • Spark Structured Streaming

    • Streaming State Store: New state storage management using RocksDB in Spark Structured Streaming, designed for better scalability and lower latency in stateful processing such as stream joins and deduplication.
  • Hive

    • Added support for MySQL 8.x as the Hive metastore.

Version R57 (GCP)

This section of the Release Notes describes new and changed capabilities of the Qubole Data Service (QDS) on Google Cloud Platform, as of Release Version R57.

For information about what has changed in this version, see the sections below.

What’s New

Important new features and improvements are as follows.

Note

A link in blue text next to a description in these Release Notes indicates the launch state, availability, and default state of the item (for example, Beta). The link provides more information. Unless otherwise stated, features are generally available, available as self-service (without intervention by Qubole support), and enabled by default.

  • The Account Level Concurrent Command Limit (shown under Account Settings in the QDS UI) has increased from 20 to 100. Gradual Rollout. Learn more.
  • The Clusters page of the QDS UI displays a new cluster health tile card with metrics. Learn more.
  • Other enhancements and bug fixes.
  • A new version of the Analyze page, previously released as New Analyze, is now called Workbench.
  • Cluster monitoring, including daemon status, heap usage, and coordinator node metrics, is available in Workbench and through a REST API.
  • QDS now allows you to configure buffer capacity in Hadoop and Spark clusters. Learn more.
  • QDS Hadoop now allows more containers per node, improving memory management in YARN.
  • QDS supports enterprise installations of Github and Gitlab. Via Support. Learn more.
  • The Environments UI is now available in the Control Panel by default for new users. Beta.
  • Added a new scheduler to optimally schedule tasks based on locality of data cached with Rubix. See https://www.qubole.com/blog/presto-rubix-scheduler-improves-cache-reads/.
  • Added the hive.default.clear_cache() procedure call to clear stale Hive metastore caches. This is useful when metastore updates might have occurred from outside the Presto cluster.
  • Improved performance of queries involving IN and NOT IN over a subquery. See https://prestosql.io/blog/2019/05/30/semijoin-precomputed-hasd.html.
  • Improved smart query retry to support INSERT OVERWRITE TABLE, CREATE TABLE AS, and SELECT queries that failed without returning any data. Tracking of query retries has been improved in command logs with Query Tracker links for retries.
  • Qubole supports Apache Ranger integration with Spark on Spark 2.4.0 and later versions. Beta, Via Support. Learn more.
  • Spark 2.4.3 is generally available. Learn more.
Engines
Hadoop 2
Enhancement

HADTWO-2000: A Hadoop 2 (Hive) or Spark cluster can be configured to have a specific buffer capacity. Set yarn.cluster_start.buffer_nodes.count to the number of nodes to be used for the buffer and pass it as a Hadoop override for the cluster. You can also let QDS maintain the buffer capacity by setting yarn.autoscaling.buffer_nodes.count.is_dynamic=true as a Hadoop override. Disabled | Cluster Restart Required

This buffer capacity will remain free for the lifetime of the cluster, except when the cluster reaches or exceeds its configured maximum number of nodes. The advantage of configuring buffer capacity is that a new command can run immediately without needing to wait for the cluster to upscale. For more information, see the documentation.
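
For example, the overrides might look like the following in the cluster's Hadoop configuration overrides (the value 2 is only an illustrative node count):

    yarn.cluster_start.buffer_nodes.count=2
    yarn.autoscaling.buffer_nodes.count.is_dynamic=true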

Bug Fix

HADTWO-2204: Fixed a NullPointerException that was thrown while updating a scheduler node resource that had already been removed by an asynchronous event.

Hive
New Hive Version

Hive 2.3 is generally available. Cluster Restart Required

QHIVE-4645: QDS Hive 2.3 has been updated with all changes through Apache Hive 2.3.5 but continues to use Apache ORC v1.3.3.

Read more about Apache Hive 2.3.5.

Deprecation of Qubole JDBC Storage Handler

The Qubole-Hive JDBC Storage Handler is deprecated as of QDS R57; all Qubole Hive versions will use the OSS JDBC Storage Handler.

Enhancement
  • QHIVE-3515: QDS supports blacklisting and whitelisting Hive tables for automatic statistics collection.
Bug Fix
  • QTEZ-446: QDS has added code in HiveServer2 to automatically kill the Tez application when the query has completed or failed, or has been canceled. See also open-source JIRA TEZ-3405.
Presto
New Features and Enhancements
  • PRES-2372: Cost-based optimization (CBO) for JOIN reordering and JOIN distribution type selection, using statistics in the Hive metastore, is enabled by default for Presto version 0.208.

    The following values have been added to the default cluster configuration for Qubole Presto version 0.208.

    optimizer.join-reordering-strategy=AUTOMATIC
    join-distribution-type=AUTOMATIC
    join-max-broadcast-table-size=100MB
    
  • PRES-2695: QDS allows you to override the required number of workers feature’s cluster-level properties, query-manager.required-workers-max-wait and query-manager.required-workers at the query level using the corresponding session-level properties required_workers_max_wait and required_workers.

  • PRES-2918: A new experimental configuration property, experimental.reserved-pool-enabled, has been added to Presto version 0.208 to allow you to disable the Reserved Pool. The Reserved Pool prevents deadlocks when memory is exhausted in the General Pool: the largest query is promoted to the Reserved Pool. But only one query is promoted, and the remaining queries in the General Pool are blocked whenever the pool is full. To avoid this, you can set experimental.reserved-pool-enabled to false, thereby disabling the Reserved Pool. For more information, see Disabling Reserved Pool.

  • PRES-3001: Qubole proactively replaces existing preemptible VMs with new ones before the interruption time of 24 hours, thereby preventing the duration limit from having an adverse effect on running queries.

  • PRES-2657: The path for spill-to-disk functionality, experimental.spiller-spill-path=/media/ephemeral0/presto/spill_dir, has been configured by default in Qubole Presto 0.208. This allows you to use spill-to-disk easily, either by running set session spill_enabled=true for individual queries, or by adding experimental.spill-enabled=true to the Presto cluster configuration override to enable spill-to-disk for all queries.

  • PRES-111: Added the hive.default.clear_cache() procedure call to clear stale Hive metastore caches. This is useful when metastore updates might have occurred from outside the Presto cluster. The command is supported only on Presto version 0.208.

  • PRES-2744: A new session property, qubole_max_raw_input_datasize=1TB, limits the total bytes scanned. Queries that exceed this limit fail with the RAW_INPUT_DATASIZE_READ_LIMIT_EXCEEDED exception. This ensures that rogue queries do not run for a very long time.

  • PRES-2790: Performance improvement in queries involving IN and NOT IN over a subquery. See this blog post.

  • PRES-2605: Added a new scheduler to optimally schedule tasks according to where Rubix caches the data. See https://www.qubole.com/blog/presto-rubix-scheduler-improves-cache-reads/.

  • PRES-2584: Improved smart query retry to support INSERT OVERWRITE TABLE, CREATE TABLE AS, and SELECT queries that failed without returning any data. Tracking of query retries has been improved in command logs with Query Tracker links for retries.

  • JDBC-124: QDS now supports concurrent multiple statements in Presto FastPath.

  • PRES-2510: Choosing the Presto UI from the QDS Control Panel redirects to <base-url>/presto-ui-<cluster-id>/ui/. It also redirects <coordinator-dns>:8081 to a static resource, <base-url>/ui/index.html.

  • PRES-2992: Added presto-tpcds, presto-localfile, and presto-thrift connectors to Presto 0.193 and 0.208 versions.

  • PRES-2924: Engineering updates have been made to support Presto on GCP as a beta offering in R57. Presto on GCP (beta) supports most Qubole Data Service (QDS) features except Presto Notebooks and Big Query Storage Connector.

Bug Fixes
  • PRES-2568: Fixes a problem that caused a carriage return \r to be incorrectly added wherever there was a semicolon in a query.
  • PRES-2810: Fixes a problem that caused failures in query planning when dynamic filtering is enabled.
Spark
New Features
  • SPAR-3510: QDS now supports Apache Spark 2.4.3. It is displayed as 2.4 latest (2.4.3) in the Spark Version field of the Create New Cluster page in the QDS UI. All existing 2.4.0 clusters are automatically upgraded to 2.4.3 in accordance with Qubole Spark versioning policy.
  • SPAR-2937: You can configure Ranger policies for Hive tables, and these are honored by Spark SQL for authorization. Supported on Spark 2.4.0 and later versions. Beta, Via Support.
Enhancements
  • SPAR-3650: Spark computes the size of the input table during query planning, which speeds up queries containing joins by using BroadcastHashJoin. This is supported on Spark 2.4.0 and later versions. Via Support.
  • SPAR-3616: Allows Spark applications to run reliably even in Out-of-Memory cases. This capability can be enabled in Spark 2.4.3 and later versions. Via Support.
  • SPAR-3555: The appendToTable API now supports Hive tables as well as Spark data sources.
  • SPAR-3418: ORC metadata caching in Spark improves query performance by reducing the time spent on reading ORC metadata from an object store. This is supported on Spark 2.4.3 and later versions. Via Support.
  • SPAR-3226: Spark applications handle Spot Node Loss and Spot-blocks using YARN status of Graceful-Decommission. This is supported on Spark version 2.4.0 and later. Via Support.
Bug Fixes
  • SPAR-3730: The ClassNotFoundException error occurred due to the missing Rubix caching jars in the Hive Metastore classpath. With this fix, the Rubix caching jars are now available in the Hive Metastore classpath. This issue is fixed on Spark 2.2.0 and later versions.
  • SPAR-3701: Query run times for a few TPCDS queries had increased due to filter pushdown in subqueries, which disables subquery reuse. With this fix, the overall query run time is reduced whenever applicable.
  • SPAR-3405: Hive configs such as hive.metastore.uris were not reaching the Spark Hive Authorizer plugin when passed through Spark defaults or --conf. As a result, connection errors occurred when connecting to the Hive Metastore with Hive Authorization enabled. This issue is fixed in Spark 2.4.0 and later versions. Via Support.
  • SPAR-3766: During operations such as updating table stats, the owner of the table was changed to the user running the command. With this fix, the original owner of the table is retained. This issue is fixed in Spark 2.4.0 and later versions.
Spark Structured Streaming
Enhancements
  • SPAR-3747: Memory leak issues in the RocksDB-based state store are fixed, and the RocksDB configuration is better tuned to improve read performance. The previous RocksDB state-store provider class is deprecated. Use the following Spark configuration to enable the RocksDB-based state store:

    spark.sql.streaming.stateStore.providerClass = org.apache.spark.sql.execution.streaming.state.RocksDbStateStoreProvider
    

    This is supported on Spark 2.4.0 and later versions.

Cluster Management
New Feature
  • ACM-4294: A new cluster-health tile card appears in the Clusters section of the QDS UI, on the details page for each cluster.
Enhancements
  • ACM-5555: Qubole supports using shared VPCs for GCP clusters.
  • ACM-5529: Users can now create Presto clusters from the UI on GCP.
  • ACM-5237: Cluster creation now accepts 2.3 Beta as a Hive Version in GCP. You can configure it from the Configuration tab of the Clusters UI while creating a cluster.
  • ACM-5253: Information about which user started or terminated a specific cluster will now be available in the Cluster State API.
  • ACM-5240: Autoscaling logs now contain additional details when Qubole falls back to on-demand VMs due to unavailability of preemptible VMs.
  • ACM-5050: Cluster creation now accepts Use Qubole Placement Policy in GCP. You can use it by checking the Use Qubole Placement Policy checkbox in the Composition tab of the Clusters UI while creating a cluster.
  • ACM-4997: Qubole proactively replaces existing preemptible VMs with new ones before the interruption time of 24 hours, thereby preventing the duration limit from having an adverse effect on running queries.
  • ACM-4955: Service accounts created by manual scripts in GCP now allow BigQuery storage reads.
  • ACM-4884: Cluster creation in GCP now accepts the Node Cooldown Period option from both the UI and the API. Users can configure it from the Composition page in the UI while creating or updating clusters.
  • ACM-515: QDS falls back to account credentials to terminate the cluster if the cluster credentials do not work. You can also contact Qubole Support to force-terminate a cluster that is stuck in the terminating state.
  • ACM-1493: The email address of a user who has manually terminated a cluster is now visible on the Clusters page of the QDS UI.
  • ACM-4323: QDS supports these query runtime configurations for a given cluster: Via Support
    • Set the query execution timeout in minutes for a cluster. QDS auto-terminates a query if its runtime exceeds the timeout.
    • Set a warning about the query runtime in minutes for a cluster. QDS notifies the user of the account via email if a query’s runtime exceeds the configured time.
  • ACM-4801: A QDS API provides the details of the last cleanup activity for a given cluster. The API provides the reason why the cluster was selected or skipped for auto termination. For more information, see get-cleanup-information.
  • TOOLS-1178: QDS has removed the deprecated pycrypto package as of R57. Pycryptodome 3.0, a replacement for pycrypto, is included as of QDS R53.
  • ACM-5515: The cluster console index page now displays the latest cleanup status. For example, this shows why a cluster was not terminated.
Bug Fixes
  • ACM-5221: If enable-oslogin was set to true at the project level, Qubole cluster start would fail. This issue has been fixed by setting enable-oslogin as false at the instance level to override the project-level setting.
  • ACM-5192: Configuring a bastion node in a GCP cluster worked only when the user was named “ec2-user.” Any valid username can now be used.
Applications
Administration
Enhancements
  • INFRA-1724: QDS now supports COM commands (enabled via Support).

  • INFRA-2441: The Account Level Concurrent Command Limit on the Account Settings tab has increased from 20 to 100. Gradual Rollout. Learn more.

  • AD-2637: QDS on GCP no longer requires the following IAM permissions for storage access purposes:

    • storage.objects.getIamPolicy
    • storage.objects.setIamPolicy

    This change only applies to newly-created service accounts.

Bug Fixes
  • AD-2371: Fixes an issue that caused pages to load slowly.

  • AD-2441: The Usage Status dashboard has been revamped.

  • AD-2884: There are changes to the response returned by the Account API View Information for a QDS Account:

    • The defloc attribute has been changed to storage_location.
    • The following attributes have been added: name, idle_session_timeout, and authorized_ssh_key.
Data Engineering
Airflow
New Version

AIR-352: Airflow Version 1.10.2.QDS (with RBAC) is now available on GCP.

Enhancements
  • AIR-357: Airflow clusters now have an SMTP Agent (Exim) pre-installed on the instance. This enables all the mailing features of Airflow to work out of the box. Add the email addresses to the approved senders on your mail client to ensure that the emails are not directed to the spam folder (Cluster Restart Required).
  • AIR-380: To secure the URLs on internal networks, password authentication has been introduced for the Monit dashboard on Qubole Airflow clusters. Enter Airflow as the Username and the cluster ID as the Password (Cluster Restart Required).
Data Science
Notebooks and Dashboards
New Features
  • ZEP-939: Qubole now supports Zeppelin 0.8.0. Via Support.

    The following improvements are available with the Zeppelin 0.8.0 upgrade:

    • You can run notebooks or paragraphs from within a notebook by using z.run(noteId, paragraphId) or z.runNote(noteId) functions respectively. The external notebook might belong to the same cluster or a different cluster. The external notebook runs in the same context as the caller notebook.
    • You can run all paragraphs in the notebook or dashboard sequentially by using the Run all paragraphs option. In addition, scheduled notebooks or dashboards, and notebooks run from the Analyze page, also run paragraphs sequentially. If an error occurs or the run is aborted, successive paragraphs are not run.
    • You can perform the following actions from the Paragraph drop-down menu.
      • Run all paragraphs above and below, including the current paragraph. At any given time, you can execute only one of run all, run all above, or run all below.
      • Copy paragraph ID.
      • Clone paragraph.
      • Change font size.
    • You can now have multiple outputs of different types, such as tables, text, and charts.
    • You can use notebook level dynamic input form with z.noteInput or $$.
    • You can pass Zeppelin’s dynamic input variables to Shell and SQL Interpreters.
    • Network type visualization is supported.
    • You can find and replace values anywhere in the notebook code
    • When you add a new paragraph, the new paragraph provides autofill based on the previous paragraph.
    • Memory leaks that caused the notebook UI to freeze have been fixed.
  • ZEP-3832: QDS now supports enterprise installations of Github and Gitlab. The service running Github or Gitlab can be in either a public or private subnet. You must allow access to Qubole tunnel servers. Via Support.

  • JUPY-1: JupyterLab notebooks are now supported as a closed Beta feature for interactive workloads. Jupyter notebooks are supported on Spark 2.2 and later. Beta, Via Support. See Jupyter Notebooks.

Enhancements
  • JUPY-245: Users can now open the Resource Manager and Livy pages from the JupyterLab interface. On the JupyterLab interface, navigate to the Spark menu, and click Resource Manager or Livy to open the respective pages.
  • JUPY-273: Added support for UI-based charts in Jupyterlab, which improves the user experience when switching between different charts and chart options.
Package Management
Enhancements
  • ZEP-3602: The Control Panel in the QDS UI now includes the Environments page by default for new users.

    Limitation: In the new version, packages that are installed by default cannot be uninstalled.

Others
Growth
Enhancements
  • GROWTH-157: Changes to the Qubole HelpCenter:
    • Content from the Knowledge Base section has migrated to the Troubleshooting Guide.
    • Content from the Announcements section has moved to the Qubole Support Portal.
    • Searches no longer return references to Zendesk, as there is no relevant Zendesk content.
  • GROWTH-156: The “Submit Ticket” process now more closely follows the Support Portal process. This helps Qubole gather the necessary information at the start and reduces the back-and-forth questions that can slow down problem resolution.
Security
Enhancements
Qubole Audit Logs
  • SEC-4219: Audit functionality has been expanded for detailed compliance reporting. Audit events contain information about specific actions taken by users, including views, creation, updates, deletes, uploads, downloads, and copying. The information recorded in the log for an audit event includes the action taken, the user’s account details, the time the action was taken, the request the user made, and the response to this request.

    To obtain audit logs, contact Qubole Support.

Version R58 (GCP)

This section of the Release Notes describes new and changed capabilities of the Qubole Data Service (QDS) on Google Cloud Platform, as of Release Version R58.

For information about what has changed in this version, see the sections below.

What’s New

Important new features and improvements are as follows.

Note

A link in blue text next to a description in these Release Notes indicates the launch state, availability, and default state of the item (for example, Beta). The link provides more information. Unless otherwise stated, features are generally available, available as self-service (without intervention by Qubole support), and enabled by default.

  • Organize your work as Collections as you iteratively build a query. Gradual Rollout.
  • View memory, CPU usage, and Hive Metastore connectivity when selecting a cluster for a Hive, Presto, or Spark command. Gradual Rollout.
  • Bug fixes.
  • QDS launches a new version of the JupyterLab interface with the following enhancements:

    • Jupyter notebooks integrated with Version Control Systems: GitHub, GitLab, and Bitbucket.
    • You can schedule Jupyter notebooks from the Scheduler page of the QDS UI.
    • You can view and edit Jupyter notebooks when the clusters are down.
    • Account level and object level Access Control.

    Beta, Via Support.

  • Enhancements in Zeppelin 0.8.0.

  • Zeppelin notebooks integrated with Bitbucket.

  • Other enhancements and bug fixes.

Engines
Hadoop
Enhancements
  • HADTWO-2196: Qubole has backported YARN-3933 to fix a race condition in the Fair Scheduler and prevent negative values for resources such as memory and cores.
  • HADTWO-2273: Shell scripts uploaded to the default location <default_location>/qubole_shell_scripts are now deleted after the command completes.
  • HADTWO-2201: GCS connector version for both Hadoop 2 and Hadoop 3 has been upgraded to 2.0.0. MR BigQuery connector version for both Hadoop 2 and Hadoop 3 has been upgraded to 1.0.0.
  • HADTWO-2051: Added a feature to migrate shuffle data from preemptible VMs that are lost.
Bug Fixes
  • HADTWO-2191: Fixes an issue that could cause ResourceManager to deadlock while shutting down.
Hive
Bug Fixes
  • QHIVE-4807: Fixes an error case in MapJoin conversion when no table is selected as a big table (OSS HIVE-22201).
  • QHIVE-4849: Changes the timezone in the Tez UI to UTC and the time format to D days, H hours. This eliminates differences between ResourceManager and Tez in the timezone and time format.
Hive 1.2 Deprecated

Hive 1.2 is deprecated as of March 2020.

Presto

The new features and key enhancements are:

  • PRES-3178: Support added for Presto notebooks on Google Cloud Platform.
  • PRES-2977: Presto 0.208 (GA) is now available on GCP.
Presto 317 (beta) with New Features

PRES-3070: Presto 317 (Beta) is the latest version that Qubole Presto supports. This version includes open-source changes, support for reading Hive ACID tables, and other changes, including: Beta | Cluster Restart Required

  • PRES-2950: Presto 317 now runs on Java version 11.
  • PRES-3139: Presto 317 now supports required-workers.
  • PRES-3066, PRES-3106: Presto 317 now supports Dynamic Filtering.
  • PRES-2967: Presto 317 now supports Qubole’s workload aware autoscaling.
  • PRES-2966: Presto 317 now supports strict mode.
  • PRES-2965: Presto 317 now supports integration with Apache Ranger.
  • PRES-2969: Presto 317 now supports smart query retries.
  • PRES-3141: Presto 317 now supports Kinesis connector.
  • PRES-2963: Presto 317 now supports Rubix cache.
Presto Supports Hive ACID Tables

PRES-2839: Qubole Presto 317 (beta) supports reading Hive ACID tables. It now has read support for:

  • Insert-only ACID table
  • Full ACID table
  • Non-ACID table converted to ACID table
JOIN Reordering and JOIN Type Determination Based on Table Size

PRES-2971: Table-size-based stats for determining JOIN distribution type and JOIN reordering now also work with predicates on partitioned tables. The size is calculated only for partitions that are being queried.

The distribution type of JOINs in a query is also visible in the Presto query info under the joinDistributionStats key name.

Presto Version 0.193 Deprecated

Presto 0.193 is deprecated and is labelled as deprecated on the Clusters page of the QDS UI. You can still create and use Presto 0.193 clusters, but Qubole strongly recommends you upgrade to 0.208 or a later version to take advantage of the many new features.

Presto 0.208 is the new default version.

Proactive Removal of Unhealthy Cluster Nodes

QDS has implemented the following changes to proactively remove unhealthy cluster nodes: Cluster Restart Required

  • PRES-2093: Use ascm.bad-node-removal to enable or disable this service, which, when enabled, periodically finds and removes unhealthy worker nodes. Configure the periodic interval via ascm.bad-node-removal.interval. Disabled | Cluster Restart Required
  • PRES-3044: The coordinator node periodically fetches open file descriptor counts from the worker nodes and forcibly removes nodes whose open file descriptor count exceeds a threshold.

For more information, see the documentation.
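
As a sketch, the cluster configuration overrides might look like the following; the boolean enable value and the interval format shown are assumed examples, while the property names are as described above:

    ascm.bad-node-removal=true
    ascm.bad-node-removal.interval=5m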

Buffer Capacity in Presto Clusters

PRES-2682: Presto clusters now support configuring buffer capacity. Set ascm.cluster-start-buffer-workers to configure the buffer capacity. Disabled | Cluster Restart Required

This configured buffer capacity will remain free unless the cluster reaches or exceeds its configured maximum size. Note that when this feature is enabled, the cluster uses buffer capacity as the trigger to upscale, as opposed to the triggers described in workload-aware Presto autoscaling.

For more information, see the documentation.
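
For example, as a Presto cluster configuration override (the value 2 is only an illustrative worker count):

    ascm.cluster-start-buffer-workers=2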

Dynamic Filtering Improvements

PRES-3152 introduces these improvements:

  • Improves the efficiency of dynamic partition pruning by preventing the listing and creation of Hive splits from partitions that are pruned at runtime. (PRES-2990)
  • Enables dynamic partition pruning on Hive tables at the account level. (PRES-3112) Gradual Rollout | Cluster Restart Required
  • Resolves the invalid partition value exception, and intermittent ArrayIndexOutOfBoundsException exceptions from queries with Dynamic Filtering enabled. (PRES-3051)
  • Fixes the UnsupportedOperationException that occurred with some complex outer join queries when dynamic filtering was enabled. (PRES-3249)
Other Enhancements
  • PRES-2740: The Presto Server now runs as a Presto user rather than the root user.
  • PRES-3202: The reserved memory pool is disabled by default in Presto version 317. For Presto 0.208, the reserved memory pool is being disabled as part of a Gradual Rollout | Cluster Restart Required.
Bug Fixes
  • PRES-3177: To prevent the Presto Server from starting without applying bootstrap changes, the server no longer starts if its bootstrap file fails to download.
Spark
New Features
  • SPAR-3979 and SPAR-2953: QDS implements the following improvements in Dynamic Filtering in Spark 2.4.3 and later versions:

    • Partitions are pruned at the scan level to prevent the overhead of scanning redundant partitions.
    • Filter values generated by dynamic filtering are now pushed down to ORC (Optimized Row Columnar), in addition to Parquet, data sources.

    Gradual Rollout.

  • SPAR-3713: For JOIN operations, Spark now automatically detects skew in the data and applies skew join optimization to handle it. Via Support.

Enhancements
  • SPAR-3616: Out-of-Memory errors that could occur even when memory was tuned appropriately are now handled so that Spark applications run reliably. Supported in Spark 2.4.3 and later versions. Via Support.
  • SPAR-3071: Allocates driver memory for Spark commands on the basis of the instance type of the cluster worker nodes so as to optimize memory usage. Supported on homogeneous clusters running Spark 2.3.2 and later versions. (A homogeneous cluster has worker nodes of only one instance type.)
Spark 2.1.0 and 2.0.2 Deprecated

Spark 2.1.0 and 2.0.2 are deprecated for releases after R57.

Bug Fixes
  • SPAR-3862: Driver logs were not displayed when Spark applications were deployed on a cluster. This issue is now fixed.
  • SPAR-3714: Queries with a large number of nested sub-queries ran slowly when Hive Authorization was enabled. This fix improves the performance of such queries. Fixed in Spark 2.4.3.
Spark Structured Streaming
Enhancements
  • SPAR-3755: The appendToTable API now supports mixed-case schemas and automatic data type casting, similar to df.insertInto(table), by default. This feature is supported on Spark 2.4.3 and later versions.
Applications
Data Analytics
Explore
New Features
  • AN-2079: You can now organize your work as Collections as you iteratively build toward a final query. Each Collection comes with a query composer that auto-saves your work, supports parameterization and macro substitution at runtime, and catalogs past query runs for easy look-up. You can also curate manually by moving past query runs into other or new Collections. You can drop your Collections into a Common folder, enabling other users to find and edit them, and you can use the search box to find Collections. Gradual Rollout.
Enhancements
  • AN-1584: Workbench displays clusters sorted by Up, Pending, Terminating, and Down. Within each set, cluster labels are sorted alphabetically.
  • AN-2078: When selecting a cluster for a Hive, Presto, or Spark command, you can see its memory and CPU usage and its Hive Metastore connectivity, so you can make a more informed decision on which cluster to choose. Gradual Rollout.
  • AN-2348: Workbench now supports the read-only view for Spark notebook commands. You can use these to view logs, results, and resource links for the selected command.
Bug Fixes
  • AN-2196: Table data preview does not work for Hive and Presto views, so the Preview icon has been removed for those views.
Data Engineering
Airflow
Enhancements
  • AIR-481: The QDS SDK on Airflow clusters has been upgraded to version 1.13.2 (Cluster Restart Required).
  • AIR-474: You can no longer create or clone clusters running the deprecated 1.8.2 version of Airflow. You can still edit existing 1.8.2 clusters. To upgrade, see Upgrading Airflow Clusters.
Scheduler
Enhancements
  • SCHED-259: Allows you to specify a timeout for individual sub-commands in a workflow command. This helps ensure that the workflow command as a whole times out as specified.
Data Science
Notebooks and Dashboards
Jupyter Notebooks (Beta)
New Features
  • JUPY-199: You can now schedule Jupyter notebooks, set custom parameters for the schedule, and view schedules and their execution history from the JupyterLab interface. Via Support.
  • JUPY-197, JUPY-356 and JUPY-289 : Jupyter Notebooks are now integrated with GitHub, GitLab, and BitBucket (Cloud only). You can use these version-control systems to manage notebook versions, synchronize notebooks with public and private repositories, view and compare notebook versions, and create pull requests. Via Support.
  • JUPY-195 and JUPY-334: You can now configure access control for Jupyter notebooks at both the account and object levels. Users with the system-admin role, or other roles with appropriate permissions, can configure the Jupyter Notebook resource at the account level. Notebook users can override these permissions for the objects that they own. Via Support.
  • JUPY-251: Users with appropriate permissions can now gain access to Jupyter notebooks via shareable links. Get a shareable link as follows:
    1. In the UI, navigate to the File Browser sidebar panel.
    2. Select the required notebook and right-click.
    3. From the resulting menu, select Copy Shareable Link.
  • JUPY-193: You can now create and manage Jupyter notebooks using REST APIs.
Enhancements
  • JUPY-319: You now specify the notebook or folder name when creating a notebook, instead of using the default name (Untitled*).
  • JUPY-272: You can now use the QDS Object Storage Explorer on the left side bar of the JupyterLab UI to explore Cloud storage, and to perform actions such as uploading or downloading a file.
  • JUPY-271: You can now use the Table Explorer from the left sidebar to explore the Hive metastore, schema, tables, and columns.
  • JUPY-234: Using its context menu (right-click), you can now copy a sample Jupyter notebook and paste it into the File Browser sidebar panel.
  • JUPY-413: To prevent Livy session timeout for long-running notebooks, you can now configure the kernel and Livy session idle timeout using spark.qubole.idle.timeout. Set this in the Override Spark Configuration field under the Advanced Configuration tab of the Clusters page for the attached Spark cluster. You can set it to an integer value (in minutes), or -1 for no timeout.
Bug Fixes
  • JUPY-332: Module not found errors occurred when trying to import code from the bootstrapped custom zip files in notebooks. This issue is fixed.
  • JUPY-308: Spark application startup used to fail when third party JARs were added to the Spark configuration. This issue is fixed.
  • JUPY-229: The scope of Jupyter magic commands for the listing and clearing of sessions has been limited to only the sessions of the user executing the command, so that the sessions of other users remain unchanged.
  • JUPY-214: Spark applications that are stuck in the Accepted state when the session cannot be started are now terminated when the timeout is reached.
Zeppelin Notebooks
New Features
  • ZEP-493: Bitbucket is now integrated with Notebooks. You can use Bitbucket to manage versions of your notebooks. Learn more.
  • ZEP-3792: New package management architecture (Python 2.7 and 3.7 with R 3.5) is now available on GCP.
Enhancements
  • ZEP-3915: Zeppelin 0.8.0 includes the following enhancements:

    • ZEP-2749: Pyspark and IPyspark interpreters are now supported with IPython as the default shell. To set the Python shell as the default for the Pyspark interpreter, set zeppelin.pyspark.useIPython to false in the Interpreter settings. Via Support.
    • ZEP-4077: Notebooks now support the z.run(noteId, paragraphId) and z.runNote(noteId) functions to run paragraphs or notebooks from within a notebook (see the example after this list).
    • ZEP-3317: You can now run Markdown (%md) paragraphs in edit mode even when the cluster is down.
    • ZEP-1908: The geolocation graph type is now available in the UI by default.
  • ZEP-4129: To optimize memory usage for homogeneous Spark clusters running version 2.3.2 and later, Spark driver memory is now allocated on the basis of the instance type of the cluster worker nodes. Because of this, when configuring a cluster to attach to a notebook, you should not specify the spark.driver.memory property in the Spark cluster overrides (Override Spark Configuration under the Advanced tab). A homogeneous cluster is one in which all the worker nodes are of the same instance type.

  • ZEP-4169 and ZEP-134: The Zeppelin application and all interpreters started by Zeppelin (including Spark and shell interpreters) can now be run under a YARN user. Before this release, Zeppelin applications and interpreters ran as root, a security concern for many enterprises. Via Support.
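The following is a minimal sketch of the ZEP-4077 functions in a Python (PySpark) paragraph. The note and paragraph IDs shown are placeholders; copy the real IDs from the notebook URL or the paragraph settings.

    %pyspark
    # Run a single paragraph of another (or the same) notebook by its IDs.
    # Both IDs below are placeholders.
    z.run("2F8KN6TXV", "paragraph_1581234567890_123456")

    # Run all paragraphs of a notebook by its note ID.
    z.runNote("2F8KN6TXV")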

Bug Fixes
  • ZEP-3298: In case of failure, scheduled notebooks with the retry option (number of retries) set in the Scheduler properties were not re-run. This issue is fixed.
  • ZEP-4193: Autocomplete now works for PySpark notebooks in Zeppelin 0.8.0.
  • ZEP-4194: Notebook results were not displayed when clusters running Zeppelin 0.6.0 were upgraded to Zeppelin 0.8.0 or when clusters running Zeppelin 0.8.0 were downgraded to Zeppelin 0.6.0. This issue is fixed.
  • ZEP-3122: The stacked option for graphs and charts in Zeppelin notebooks did not persist after a refresh. This issue is fixed.
  • ZEP-4198: The Notebooks home page was displayed when the cluster was started. This issue is fixed.
  • ZEP-3129: External web links referenced in a Markdown paragraph now open in a separate tab.
  • ZEP-4195 and ZEP-4199: Notebook content was not rendered correctly when a notebook was switched to a different cluster. This issue is fixed.
  • ZEP-4181: The published at field in the Dashboard information on the Notebooks page displayed an incorrect timestamp. This issue is fixed.
  • ZEP-4004 and ZEP-3562: With large cardinality in multibar charts, notebooks became unresponsive. This fix sets a limit of 50 on cardinality. To increase the limit, contact Qubole Support.
Package Management

Package management is a Beta feature.

Enhancement
  • ZEP-3880: From the Environments page of the QDS UI, you can now:
    • Choose the repo (conda or pypi) from which you want to install packages; this reduces installation time.
    • View the package installation logs while installing Python and R packages.
Bug Fixes
  • ZEP-2048: R packages installed via CRAN (along with conda) now appear in the View all user and pre-installed packages dialog.
Security
Enhancements
ACID Transaction Support

Qubole now provides out-of-the-box support for ACID transactions to meet data engineering and privacy requirements. Qubole supports ACID transactions (insert, update, and delete) on ORC tables across Spark, Presto, and Hive: write support is available in Spark and Hive, and read support is available in all three engines. Delete and update transaction support allows you to meet right-to-erasure and right-to-rectify requirements stemming from GDPR and CCPA regulations efficiently and at scale.

Enhanced Audit Framework

Qubole auditing now provides access to detailed logs of the actions of privileged or administrative users performing tasks such as managing accounts, managing users and roles, and configuring security settings.

Known Issues

There are no known issues to date.

Version R59 (GCP)

This section of the Release Notes describes new and changed capabilities of the Qubole Data Service (QDS) on Google Cloud Platform (GCP), as of Release Version R59.

For information about what has changed in this version, see:

What’s New

Important new features and improvements are as follows.

Note

A link in blue text next to a description in these Release Notes indicates the launch state, availability, and default state of the item (for example, Beta). The link provides more information. Unless otherwise stated, features are generally available, available as self-service (without intervention by Qubole support), and enabled by default.

  • You can now enable and disable features at the account level through a self-service platform.
  • Qubole has updated its API throttling policy. Gradual Rollout

Learn more.

  • QDS now supports Apache Airflow version 1.10.9QDS. This Airflow version is supported only with Python 3.7.
  • The Airflow CLI is available as Open Source software.
  • GIT is now integrated with Airflow clusters through the DAG explorer.
  • The Jupyter Notebook Command is now available in QuboleOperator as jupytercmd; users can schedule their Jupyter Notebooks in Airflow (Cluster Restart Required).
  • For Airflow version 1.10.9QDS, QDS exports metrics via statsd; Prometheus scrapes the metrics from statsd, and QDS displays them on Grafana.

Learn more.

  • Hive 1.2 is deprecated.
  • Jupyter V2 notebooks provide Qviz with the Spark driver. Qviz allows you to visualize dataframes with improved charting options and Python plots. Gradual Rollout.
  • Jupyter V2 provides autocomplete and Intellisense with docstring help.
  • Jupyter V2 provides notebook workflows with %run (see the example after this list).
  • The Package Management UI has been redesigned with improvements. Users of existing accounts should contact Qubole Support to enable the new UI. Learn more.
  • Package Management now supports private Conda channels.

Learn more about the new features in Jupyter Notebooks.
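For example, a Jupyter V2 notebook cell can chain another notebook into a workflow with %run. This is a minimal sketch; the notebook path is a placeholder, and standard IPython %run semantics are assumed.

    # Run another notebook end-to-end, then use the variables it defines.
    # The relative path below is a placeholder.
    %run ./setup/load_tables.ipynb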

  • Dynamic Partition Pruning: Using Dynamic Filtering values, Dynamic Partition Pruning selects the specific partitions within the table that need to be read at runtime. This improves job performance for queries in which the join condition is on the partitioned column, by significantly reducing the amount of data read and processed. Dynamic Partition Pruning is available in Spark 2.4.3 and later versions. Gradual Rollout.
Engines
Hadoop
Gracefully Terminating Shell CLI Commands

HADTWO-2522: Qubole now gracefully terminates shell CLI (shellcli) commands if the connection to the coordinator node fails. Gradual Rollout | Cluster Restart Required

After the feature is enabled, Qubole waits for a fixed timeout (120 seconds) to connect with the cluster logs’ location. Qubole gracefully terminates the command only when:

  1. The connection to the cluster logs location fails.
  2. The running application is stopped.
End of Life for Hadoop 2.8
  • HADTWO-2375: Hadoop 2.8 has been removed from the cluster AMI.
Bug Fixes
  • HADTWO-2365: QDS now marks a Hadoop worker node as unhealthy if its root disk is full. This unhealthy node is eventually removed from the cluster.
  • HADTWO-2490: Fixes a problem that caused Spark shell commands to fail with the following error when the cluster was being downscaled: Unable to close file because the last block does not have enough number of replicas.
  • HADTWO-2506: Eliminates a race condition which caused data to be written on decommissioned HDFS Datanodes.
Hive
Hive 1.2 is Deprecated

QHIVE-5285: Hive 1.2 is deprecated.

Enhancements
  • QHIVE-5049: Optimizes the loading time of dynamically created partitions in the Hive Metastore. To disable this optimization, set hive.qubole.optimize.dynpart.listing to false. Gradual Rollout | Cluster Restart Required
  • QHIVE-5339: The Hive Metastore Server now uses Java 8 runtime by default.
  • QHIVE-5242: Hive queries that use Hive 2.1.1 or later versions and run on the coordinator node now have their logs uploaded to <defloc>/logs/query_logs/hive/<cmdId>.log.gz.
  • QHIVE-5268: QDS now supports configuring the replication of Application Timeline Server (ATS) v1.5 HDFS Timeline data. The default replication is 2. You can override this in the QDS UI using config yarn.timeline-service.entity-group-fs-store.replication in the Hadoop Overrides field under the cluster’s Advanced Configuration tab. The related open-source Jira is HIVE-16533.
  • QTEZ-477: QDS now supports using RollingLevelDBTimelineStore for the ATS v1.5 Summary Store. Gradual Rollout
Bug Fixes
  • QHIVE-4967: As open-source Hive has deprecated hive.mapred.mode, use hive.strict.checks.* configuration properties instead. Qubole has removed qubole.compatibility.mode which was added to throw an error when hive.mapred.mode is set to strict.
Presto

The new features and key enhancements are described below, followed by other enhancements and bug fixes.

Presto 317 is Generally Available

PRES-3429: Presto version 317 is generally available. Cluster Restart Required

BigQuery Connector for Presto

PRES-3153: The BigQuery connector is now available in Presto version 317.

Dynamic Concurrency and Hybrid Autoscaling

PRES-3373: Changes to workload-aware autoscaling include dynamic concurrency and queue-aware autoscaling in conjunction with CPU-based autoscaling. Automated workload management and related changes improve performance, reliability, and TCO. Gradual Rollout | Cluster Restart Required

Enhancements in Presto for JDBC and ODBC Drivers

Enhancements in Presto for the next-generation (v3) JDBC and ODBC drivers are designed to make these drivers as fast as open-source drivers and to:

  • Support cluster lifecycle management (automatically start a cluster when a query is submitted and automatically terminate idle clusters)
  • Make query history available in the Analyze and Workbench UI
  • Provide enhanced security (HTTPS) and user authentication (through an API token)
Improvements in Dynamic Filtering

PRES-3288: Dynamic filtering (DF) improvements include the following:

  • PRES-3002: A new configuration property, hive.max-execution-partitions-per-scan, limits the maximum number of partitions that a table scan is allowed to read during query execution. Disabled | Cluster Restart Required
  • PRES-3148: Extends DF optimization to semi-joins to take advantage of a selective build side in queries with the IN clause.
  • PRES-3149: Pushes dynamic filters down to ORC and Parquet readers to reduce data scanned on the probe side for partitioned as well as non-partitioned tables. Cluster Restart Required
  • PRES-3404: Improves utilization of dynamic filters on worker nodes and reduces the load on the coordinator when dynamic filtering is enabled.
Improvements in Reading Hive ACID Tables
  • PRES-2840: Because Hive 2.0-versioned ACID transactional tables are not supported in Presto 317, QDS has added checks to fail queries using such tables.
  • PRES-3320: QDS has added checks to fail Presto queries on Hive ACID tables when the Hive metastore server’s version is older than 3.0.
Changes in Datadog Alerts

Qubole has made the following changes to Datadog alerts:

  • PRES-3360: Adds a Datadog alert to detect runaway splits occupying execution slots for more than 10 minutes, removes the presto.jmx.qubole.request_failures metric from the default Datadog dashboard, and removes the Datadog alert for CPU utilization over 80%.
  • PRES-3468: Adds a Datadog alert to detect if the Coordinator Average Heap Memory Usage is more than 90%.
  • PRES-3508: Adds a Datadog alert to detect if the coordinator’s Presto server open file descriptor count has exceeded its limit.
Enforcing Group Quotas in Resource Group-based Dynamic Cluster Sizing

PRES-3194: In resource group-based dynamic cluster sizing, QDS now enforces individual resource group quotas for CPU resources even when the cluster autoscales to the union of two resource group quotas.

Enhancements
  • PRES-3257: Presto now supports removing unhealthy nodes on the basis of disk usage. The coordinator node periodically monitors disk usage on worker nodes and gracefully shuts down nodes that have exceeded a threshold that defaults to 0.9. You can change the threshold value by means of the ascm.bad-node-removal.disk-usage-max-threshold parameter; the supported range is 0.0 - 1.0. Beta | Cluster Restart Required
  • PRES-3273: Improvements in Presto Ranger integration:
    • Support for column masking HASH for Ranger.
    • Support for the Solr audit store. You can enable auditing in the ranger.<catalog>.audit-config-xml as described in ranger-plugin-config. Disabled
  • PRES-3353: QueryHistID is now returned as part of the error message for queries executed through cloud-agnostic drivers if show_on_ui is set to true for these drivers. QueryHistID is useful in debugging. Qubole plans to provide cloud-agnostic drivers shortly.
  • PRES-3469: Backports open-source fixes to improve the performance of inequality JOINs that involve BETWEEN and GROUP BY queries.
Bug Fixes
  • PRES-1799: Presto now returns the number of files written during an INSERT OVERWRITE DIRECTORY (IOD) query in QueryInfo. The Presto client in the QDS Control Plane waits for this information to display the returned number of files at the IOD location. This fixes eventual consistency issues in reading query results through the QDS UI.
  • PRES-3411: Qubole has fixed the UnsupportedOperationException that occurred in certain multi-join queries with dynamic filtering enabled.
  • PRES-3544: Fixes a problem that caused dynamic filtering not to work on SSL-enabled clusters.
Spark
New Feature
  • SPAR-3979: Dynamic Partition Pruning: Using Dynamic Filtering values, Dynamic Partition Pruning selects the specific partitions within the table that need to be read at runtime. This improves job performance for queries in which the join condition is on the partitioned column, by significantly reducing the amount of data read and processed. Dynamic Partition Pruning is available in Spark 2.4.3 and later versions. Gradual Rollout.
Cluster Management
Account-level Node Bootstrap

ACM-6680: QDS now supports an account-level node bootstrap script, which is executed for all clusters in an account. Currently, it is supported only through a REST API. Cluster Restart Required

Gracefully Terminating Commands after Cluster Health Checks Fail

ACM-6659: QDS now gracefully terminates commands running on the cluster when the cluster is terminated either by the user or by Qubole after the cluster’s health check failures.

Handling Cluster Terminations

ACM-6255: Improves handling of cluster terminations, including terminating unhealthy or idle clusters and terminating a cluster when its start command times out. Gradual Rollout | Cluster Restart Required

Enhancements
  • ACM-5560: Improves error messaging to display the email address of a user who terminates the cluster manually.
  • ACM-5723: RubiX for Spark clusters can now be used in GCP. RubiX configuration is done on the advanced configuration page while creating or updating Spark clusters.
  • ACM-6205: Using a custom prefix for VM names is now supported on GCP. Contact Qubole support to configure this feature.
  • ACM-6234: Persistent network tags are now supported on GCP.
  • ACM-6297: New HiveServer2 clusters can now use the associated Hive cluster’s label.
  • ACM-6319: Improves the API call that lists cluster states.
Bug Fixes
  • ACM-6291: Fixes a problem that caused terminated clusters to be incorrectly marked as up. This was due to a race condition between cluster termination and background healing jobs.
  • ACM-6337: Fixes a problem where clusters with an attached public IP were sometimes terminated because of health check failures.
Applications
Data Engineering
Airflow
New Features
  • AIR-523: The Jupyter Notebook Command is now available in QuboleOperator as jupytercmd; users can schedule their Jupyter Notebooks in Airflow (Cluster Restart Required). See the example after this list.

  • AIR-501: You can now pass arguments=['true'] to the QuboleOperator get_result() method to retrieve headers (Cluster Restart Required).

  • AIR-500: Provides support for Apache Airflow version 1.10.9QDS. This version is supported only with Python version 3.7.

    The previous supported version was 1.10.2QDS, and there have been many new features and improvements since. This new version includes improvements such as the newly added GcsToGDriveOperator, support for optional configurations while triggering DAGs, configurable tags on DAGs, persistent serialised DAGs for webserver scalability, and so on. For more information on the new features and improvements, see the Apache Airflow Changelog.
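The sketch below shows how AIR-523 might look in a DAG. It is a minimal, hedged example: the connection ID, notebook path, and cluster label are placeholders, and the jupytercmd parameter names (path, cluster_label) are assumptions to verify against the QuboleOperator documentation.

    from airflow import DAG
    from airflow.contrib.operators.qubole_operator import QuboleOperator
    from airflow.utils.dates import days_ago

    with DAG("qubole_jupyter_example", start_date=days_ago(1),
             schedule_interval=None) as dag:

        # AIR-523: schedule a Jupyter notebook through QuboleOperator.
        # "path" and "cluster_label" are illustrative parameter names.
        run_notebook = QuboleOperator(
            task_id="run_jupyter_notebook",
            command_type="jupytercmd",
            path="Users/analyst/daily_report.ipynb",  # placeholder notebook path
            cluster_label="jupyter-spark",            # placeholder cluster label
            qubole_conn_id="qubole_default",
        )

For AIR-501, the release note indicates that calling the operator’s get_result() with arguments=['true'] also retrieves the result headers; check the Qubole Airflow documentation for the full calling convention.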

Explore
New Feature
  • EAM-2348: For Airflow version 1.10.9, QDS exports metrics via statsd; Prometheus scrapes the metrics from statsd, and QDS displays them on a new Grafana dashboard.

    For more information on the metrics provided by Airflow, see Metrics.

Data Science
Notebooks and Dashboards
Jupyter Notebooks (Beta)
New Features
  • JUPY-544: Jupyter notebooks provide Qviz with the Spark driver. Qviz allows you to visualize dataframes with improved charting options and Python plots. Gradual Rollout.
  • JUPY-574 and JUPY-573: You can now drag and drop objects such as tables and columns from the Object Storage Explorer to cells in a Jupyter Notebook.
  • JUPY-302: You can now transfer data from the local kernel to Spark using the %%send_to_spark magic (see the example after this list).
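A minimal sketch of JUPY-302: the first cell runs on the local Python kernel, and the second cell pushes the local variable into the Spark session (the cell magic must be the first line of its own cell). The sparkmagic-style flags -i, -t, and -n are assumptions to verify against the Qubole Jupyter documentation.

    # Cell 1 -- local Python kernel: build a small value locally.
    report_date = "2020-01-15"

    %%send_to_spark -i report_date -t str -n report_date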
Enhancements
  • JUPY-571: You can now push Jupyter notebooks without output to the VCS system.
  • JUPY-614, JUPY-613, and JUPY-598: You can now view Hot Tables, Database Views, and Partitioned Columns from the Table Explorer.
  • JUPY-650: You can now search for clusters using the Cluster selection widget when offline notebooks are enabled and more than five clusters are in the list.
  • JUPY-662: When running a Jupyter notebook using the REST API command https://api.qubole.com/api/v1.2/commands, you can now specify whether the original notebook should be updated by using the upload_to_source parameter (see the sketch after this list).
  • JUPY-667: In isolated mode, the Livy session name is now unique, to facilitate multiple executions of a notebook in parallel.
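A hedged sketch of the JUPY-662 call, using Python and the requests library: the endpoint and the upload_to_source parameter come from the release note above, while the authentication header and the remaining payload fields (command_type, path, label) are assumptions to check against the QDS Commands API documentation.

    import requests

    # Placeholder values; substitute your API token, notebook path, and cluster label.
    headers = {
        "X-AUTH-TOKEN": "<your-api-token>",
        "Content-Type": "application/json",
    }
    payload = {
        "command_type": "JupyterNotebookCommand",    # assumed command type name
        "path": "Users/analyst/daily_report.ipynb",  # assumed notebook path field
        "label": "jupyter-spark",                    # assumed cluster label field
        "upload_to_source": False,  # do not write results back to the original notebook
    }

    response = requests.post("https://api.qubole.com/api/v1.2/commands",
                             headers=headers, json=payload)
    response.raise_for_status()
    print(response.json())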
Bug Fixes
  • JUPY-694: Fixes a problem that caused the Spark session to be terminated when there were active tasks.
  • JUPY-603: Fixes a problem that caused Spark application startup to fail because of a conflict with some third party JARs.
Zeppelin Notebooks
Enhancements
  • ZEP-3231: Users can view data values for individual bars on top of single bar charts.
Bug Fixes
  • ZEP-4497: Code highlighting in Zeppelin notebooks was lost intermittently in both offline and online modes. This issue is fixed.
  • ZEP-4588: The Run all option submitted only loaded paragraphs instead of submitting all the paragraphs in a Zeppelin notebook. This issue is fixed.
  • ZEP-4303: Old paragraph results were displayed after the new paragraph run had started. This issue is fixed.
Package Management

Package management is a Beta feature.

New Features

The Package Management UI has been redesigned with the following new features:

Note

Users of existing accounts should contact Qubole Support to enable this feature.

  • ZEP-3982: You can install packages from custom channels and upload egg or wheel packages in the Python Conda environment.
  • ZEP-4289: You can upgrade to a new package version by selecting the packages from the list of packages on the Environments page of the QDS UI.
  • ZEP-3734: The Environments page now displays all types of packages, System, User, and User Package Dependency, with their respective versions in a table on the home page of each environment. You can toggle the visibility of these three package types using the filter in the column header. By default, only User Installed Packages are displayed.
  • ZEP-4213: You can now filter the list of environments by cluster label. Use the Search Filters option on the Environments page.
Enhancement
  • ZEP-4533: The default Python version when you create clusters or environments is 3.5 for existing QDS accounts and 3.7 for new accounts.
Bug Fix
  • ZEP-4040: The Environment status used to remain Pending when you mapped an incorrect Python or R version to the packages. This fix terminates the Conda process for Python packages after 20 mins of inactivity. Contact Qubole Support if you want to modify the default timeout value.
Security

There are no changes as of the publication of these Release Notes.

Known Issues

This is a known issue in version R59:

  • Qubole expired Hive 0.13 in February 2019. As a part of the expiry process, in R59, Qubole upgraded the lowest available Hive version to Hive 1.2. As a result, any existing cluster using Hive 0.13 gets automatically upgraded to Hive 1.2 in R59.

Change Log

Changelog

Changelog for gcp.qubole.com
Date and time of release Version Change type Change
16th Apr, 2021 (11:59 CST) 59.5 Bug fix JUPY-630: Fixed an error in fetching branch details from the GitLab server when the bastion node is configured with the account’s public SSH key.
14th Sep, 2020 (5:00 AM PST) 59.0.1045 Enhancement

PRES-3435: The QueryTracker link is now available in the Workbench/Analyze UI’s Logs tab for queries run through the third-generation drivers.

PRES-3722: An optimization now pushes null filters to table scans by inferring them from the JOIN criteria of equi-joins and semi-joins in Presto version 317 and later. You can enable it through optimize-nulls-in-joins as a Presto cluster override or optimize_nulls_in_join as a session property. Use this enhancement to reduce the cost of JOIN operations when the JOIN columns contain a significant number of NULLs.

PRES-3724: hive.ignore-corrupted-statistics has been backported into Presto version 0.208 to avoid query failures in case of corrupted Hive metastore statistics; it is enabled by default. Presto version 317 already supported this property, and it is now also enabled by default there.

PRES-3748: Presto query retries for memory-exceeded exceptions are triggered in a graded manner: Qubole retries the failed query in three steps. The first two steps occur at a cluster size between the minimum and the maximum cluster size, and the last step occurs at the maximum cluster size. To know more, see graded-presto-query-retry. This enhancement is part of Gradual Rollout.

PRES-3761: The Presto Mongo Connector now supports querying Cosmos DB using Mongo APIs.

PRES-3788: You can now add a comma-separated list of endpoints as the value of the qubole.bypass-authentication-endpoints cluster override if you want to skip authentication for those endpoints. For example, if qubole.bypass-authentication-endpoints=/,query,node, then only endpoints that match these are skipped for authentication. Contact Qubole Support to enable this enhancement at the account level.

Bug fix

PRES-3632: Fixed the File '000000' does not match the standard naming pattern error that Presto threw when reading bucketed Hive tables that had been written by Qubole Hive INSERT commands.

PRES-3787: Fixed the Ranger access control for Presto views in Presto version 0.208.

PRES-3790: Fixed an issue that caused queries to fail when there was no space before or after a single-line comment in Presto queries.

PRES-3847: Presto query retries were not triggered on spot loss because the spot-loss notification API call from the worker to the coordinator failed. The notification API call has been fixed to resolve the issue.

RUB-239: Fixed an issue in RubiX that sometimes caused query failures related to cache data invalidation.

PRES-3660: Fixed the Presto query failure with Error opening Hive split : <FILENAME>: null when RubiX was enabled.

PRES-3701: Fixed a connection leak in RetryingPooledThriftClient of RubiX, which caused slowness in source stages and slowed down queries.

PRES-3708: Fixed a possible deadlock between Hive loadPartitionByName and getTable when Hive metastore caching is enabled with a refresh TTL (time-to-live).

PRES-3543: Fixed an issue where the aggregation node of a UNION query (when the union sources are a tablescan node and a values node) did not use distributed partitioning, causing an OOM exception. The fix disregards the SINGLE distribution for such UNION queries.

PRES-3588: Fixed issues that slowed down table statistics updates, improving their performance. In addition, a new configuration property, hive.table-statistics-enabled (default true), lets you disable table statistics updates.

PRES-3602: Fixed an issue in Presto version 317 with reading the TEXT file collection delimiter configured in Hive versions earlier than 3.0.

PRES-3604: Fixed the Ranger access control for Presto views that had earlier failed.

PRES-3618: Presto catalog configurations for external data sources that skipped validation in Qubole were previously not added to the cluster. This issue is fixed and such configurations are now added to the cluster.

PRES-3641: Fixed a planning failure for spatial JOINs with dynamic filtering enabled.

PRES-3662: Fixed an issue where pushing configuration to a cluster corrupted the Presto configuration and caused the Presto server restart to fail.

PRES-3672: Fixed query failures that occurred when metadata for too many partitions was requested from the metastore in Presto versions 0.208 and 317.

PRES-3673: Fixed an issue where Presto cluster start failed when resource-groups.user-scaling-limits-enable was turned on and resource groups were configured by a user.

PRES-3677: Fixed an issue where the default location (DefLoc) was used as the DB location of non-default schemas in Presto version 317. The correct behavior is that DefLoc should be the DB location of only the default schema.

27th Aug, 2020 (10:22 AM PST) 59.0.1040 Bug fix JUPY-929: Dependencies that were installed through the Environments page were not accessible for the scheduled and API runs of Python notebooks. This issue is fixed.
10th Aug, 2020 (05:32 AM PST) 59.0.1033 Bug fix

ZEP-4789: The paragraph status was not updated after a web socket reconnect. This issue is fixed.

ZEP-4590: Interpreter settings were getting lost because of a _COPYING_ file present in the default location (defloc). This issue is fixed.

ZEP-4130: Notebook commands failed when the status was NOT_STARTED_RUNNING_NOTEBOOK. The fix adds up to 20 retries at 10-second intervals (about 3 minutes) when the fetched notebook command status is NOT_STARTED_RUNNING_NOTEBOOK.

ZEP-4642: Notebook rendering was delayed because of extra web socket calls made for each paragraph to fetch editor settings. This issue is fixed.

Apr 20 2020 8:10 AM PST 58.0.1088 Bug fix Airflow and other bug fixes.
Mar 24 2020 2:12 AM PST 57.0.1055 Bug fix Various bug fixes.
Feb 20 2020 1:32 AM PST 57.0.1051 Bug Fix

PRES-3249: Fixes UnsupportedOperationException occurring in some complex outer join queries when Dynamic Filtering was enabled.

PRES-3282: Adds support for lambdas in ExpressionEquivalence.

PRES-3051: Fixes “Invalid partition value” exception and intermittent ArrayIndexOutOfBoundsException in queries with Dynamic Filtering enabled.

PRES-3112: Enables dynamic partition pruning on Hive tables at the account level.

PRES-3113: Improves autoscaling through better accounting of queued work.

SQOOP-242: For new accounts, QDS does not provide a Pixie cluster by default to run DB import and export commands.

Enhancement PRES-2990: Improves the efficiency of dynamic partition pruning by preventing the listing and creation of Hive splits from partitions that are pruned at runtime.
Jan 27 2020 11:34 PM PST 57.0.1044 Bug fix ZEP-4275: QDS now uses HTTPS for Apache Maven access because Maven no longer supports HTTP.
Jan 09 2020 12:51 PM PST 57.0.1039 Bug fix

ACM-6217: Added support for a static private IP address for the coordinator node

ACM-6206: Qubole now supports N2 machine types in the following GCP regions: asia-northeast1, australia-southeast, europe-west-1, europe-west-2, and us-east-1.

Dec 24 2019 12:04PM IST 57.0.1037 Bug fix ZEP-4194: Notebook results are now displayed when clusters running Zeppelin 0.6.0 are upgraded to Zeppelin 0.8.0 or when clusters running Zeppelin 0.8.0 are downgraded to Zeppelin 0.6.0.
Nov 20 2019 03:54 AM PST 57.0.1029 Bug fix AD-3202: GCP Marketplace was broken with RAILS 4.
Nov 15 2019 08:19 AM PST 57.0.1025 Bug fix ACM-6043: The Worker Node Type field was failing to fetch the correct instance types.
Nov 12 2019 03:16 AM PST 57.0.1020 R57 Release
Nov 7 2019 03:54 AM PST 56.0.1083 Bug fix AN-2348: Workbench now supports the read-only view for spark-notebook commands.
Oct 3 2019 02:37 AM PST 56.0.1073 Bug fix

PRES-2999: Resolved a NullPointerException that occurred when local memory limits were exceeded, and a leak in operator peak memory computations, in Presto version 0.208 queries.

PRES-3009: Resolved an issue where the Presto coordinator disk filled up because RubiX logs were included in the autoscaling log file; rubix.log is now excluded from the autoscaling logs.

Sep 12 2019 11:34 PM PST 56.0.1063 Bug fix

ACM-5409: When a GCP cluster is created on a private subnet, a public IP address is not attached to the cluster.

ACM-4855: Configurable local disks are now supported on GCP.

Aug 28 2019 02:21 AM PST 56.0.1051 Bug fix

AN-1327: To make debugging easier, Qubole now displays the Cluster Instance ID under the Processing tab of the Status pane. This enables you to collect logs of the particular command by cluster instance. Beta.

AN-1324: Cluster live health metrics are now available as part of the Clusters drop-down list in Workbench. Via Support. Beta.

AN-2219: Resource links are now clickable in the Logs pane in Workbench. Clicking the link redirects the user to the corresponding cluster dashboard.

AN-2210: You can now tag commands on the History tab in Workbench. You can later use these tags to filter out commands using the Tags field (in the history filter). Beta.

AN-1814: You can resize the command query composer in Workbench for Hive, Presto, and DB Query commands. Beta.

ACM-5529: Users can now create presto clusters from the UI on GCP.

Aug 16 2019 07:00 PM PST 56.0.1047 Bug fix

AN-2240: The cluster selection drop-down list in Workbench now displays Hadoop2 clusters.

ACM-5555: Qubole supports using shared VPCs for GCP clusters.

Jul 31 2019 6:56 AM PST 56.0.1042 Bug fix

PRES-2915: Fixed an issue in which a Presto cluster with an idle cluster timeout configured did not automatically terminate even when it had been idle longer than the configured timeout.

JDBC-124: Qubole now supports concurrency of multiple statements in Presto FastPath.

ACM-5234: Hive 2.3 is generally available now.

ACM-5237: Cluster creation now accepts 2.3 Beta as a Hive Version in GCP. You can configure it from the Configuration tab of the Clusters UI while creating a cluster.

AD-2776: The gcp_project_id is now included in Update Account requests.

Jun 21 2019 01:52 AM PST 56.0.1023 Bug fix ACM-5240: Autoscaling logs contain more details when Qubole falls back to On-Demand due to unavailability of preemptible VMs.
Jun 13 2019 07:22 PM PST 56.0.1019 Bug fix ACM-5221: When enable-oslogin was set to true at the project-level, Qubole cluster start would fail. This issue has been fixed by setting enable-oslogin as false at the instance level to override the project-level setting.
Jun 3 2019 01:45 AM PST 56.0.1016 R56 Release
Changelog for gcp-eu.qubole.com
Date and time of release Version Change type Change
16th Apr, 2021 (11:59 CST) 59.5 Bug fix JUPY-630: Fixed an error in fetching branch details from the GitLab server when the bastion node is configured with the account’s public SSH key.
14th Sep, 2020 (7:54 AM PST) 59.0.1045 Enhancement

PRES-3435: The QueryTracker link is now available in the Workbench/Analyze UI’s Logs tab for queries run through the third-generation drivers.

PRES-3722: An optimization now pushes null filters to table scans by inferring them from the JOIN criteria of equi-joins and semi-joins in Presto version 317 and later. You can enable it through optimize-nulls-in-joins as a Presto cluster override or optimize_nulls_in_join as a session property. Use this enhancement to reduce the cost of JOIN operations when the JOIN columns contain a significant number of NULLs.

PRES-3724: hive.ignore-corrupted-statistics has been backported into Presto version 0.208 to avoid query failures in case of corrupted Hive metastore statistics; it is enabled by default. Presto version 317 already supported this property, and it is now also enabled by default there.

PRES-3748: Presto query retries for memory-exceeded exceptions are triggered in a graded manner: Qubole retries the failed query in three steps. The first two steps occur at a cluster size between the minimum and the maximum cluster size, and the last step occurs at the maximum cluster size. To know more, see graded-presto-query-retry. This enhancement is part of Gradual Rollout.

PRES-3761: The Presto Mongo Connector now supports querying Cosmos DB using Mongo APIs.

PRES-3788: You can now add a comma-separated list of endpoints as the value of the qubole.bypass-authentication-endpoints cluster override if you want to skip authentication for those endpoints. For example, if qubole.bypass-authentication-endpoints=/,query,node, then only endpoints that match these are skipped for authentication. Contact Qubole Support to enable this enhancement at the account level.

Bug fix

PRES-3632: Fixed the File '000000' does not match the standard naming pattern error that Presto threw when reading bucketed Hive tables that had been written by Qubole Hive INSERT commands.

PRES-3787: Fixed the Ranger access control for Presto views in Presto version 0.208.

PRES-3790: Fixed an issue that caused queries to fail when there was no space before or after a single-line comment in Presto queries.

PRES-3847: Presto query retries were not triggered on spot loss because the spot-loss notification API call from the worker to the coordinator failed. The notification API call has been fixed to resolve the issue.

RUB-239: Fixed an issue in RubiX that sometimes caused query failures related to cache data invalidation.

PRES-3660: Fixed the Presto query failure with Error opening Hive split : <FILENAME>: null when RubiX was enabled.

PRES-3701: Fixed a connection leak in RetryingPooledThriftClient of RubiX, which caused slowness in source stages and slowed down queries.

PRES-3708: Fixed a possible deadlock between Hive loadPartitionByName and getTable when Hive metastore caching is enabled with a refresh TTL (time-to-live).

PRES-3543: Fixed an issue where the aggregation node of a UNION query (when the union sources are a tablescan node and a values node) did not use distributed partitioning, causing an OOM exception. The fix disregards the SINGLE distribution for such UNION queries.

PRES-3588: Fixed issues that slowed down table statistics updates, improving their performance. In addition, a new configuration property, hive.table-statistics-enabled (default true), lets you disable table statistics updates.

PRES-3602: Fixed an issue in Presto version 317 with reading the TEXT file collection delimiter configured in Hive versions earlier than 3.0.

PRES-3604: Fixed the Ranger access control for Presto views that had earlier failed.

PRES-3618: Presto catalog configurations for external data sources that skipped validation in Qubole were previously not added to the cluster. This issue is fixed and such configurations are now added to the cluster.

PRES-3641: Fixed a planning failure for spatial JOINs with dynamic filtering enabled.

PRES-3662: Fixed an issue where pushing configuration to a cluster corrupted the Presto configuration and caused the Presto server restart to fail.

PRES-3672: Fixed query failures that occurred when metadata for too many partitions was requested from the metastore in Presto versions 0.208 and 317.

PRES-3673: Fixed an issue where Presto cluster start failed when resource-groups.user-scaling-limits-enable was turned on and resource groups were configured by a user.

PRES-3677: Fixed an issue where the default location (DefLoc) was used as the DB location of non-default schemas in Presto version 317. The correct behavior is that DefLoc should be the DB location of only the default schema.

24th Aug, 2020 (04:03 PM PST) 59.0.1040 Bug fix JUPY-929: Dependencies that were installed through the Environments page were not accessible for the scheduled and API runs of Python notebooks. This issue is fixed.
10th Aug, 2020 (08:31 AM PST) 59.0.1033 Bug fix

ZEP-4789: The paragraph status was not updated after a web socket reconnect. This issue is fixed.

ZEP-4590: Interpreter settings were getting lost because of a _COPYING_ file present in the default location (defloc). This issue is fixed.

ZEP-4130: Notebook commands failed when the status was NOT_STARTED_RUNNING_NOTEBOOK. The fix adds up to 20 retries at 10-second intervals (about 3 minutes) when the fetched notebook command status is NOT_STARTED_RUNNING_NOTEBOOK.

ZEP-4642: Notebook rendering was delayed because of extra web socket calls made for each paragraph to fetch editor settings. This issue is fixed.

Jan 09 2020 04:24PM IST 57.0.1039 Bug fix

ACM-6217: Added support for a static private IP address for the coordinator node.

ACM-6206: Qubole now supports N2 machine types in the following GCP regions: asia-northeast1, australia-southeast, europe-west-1, europe-west-2, and us-east-1.

Dec 24 2019 04:32PM IST 57.0.1037 Bug fix ZEP-4194: Notebook results are now displayed when clusters running Zeppelin 0.6.0 are upgraded to Zeppelin 0.8.0 or when clusters running Zeppelin 0.8.0 are downgraded to Zeppelin 0.6.0.
18th Nov, 2019 (3:10 AM PST) 57.0.1025 Bug fix

ACM-6043: The Worker Node Type field now fetches the correct instance types.

ACM-6030: Cluster Push config no longer fails when the cluster is in a private subnet and the coordinator node doesn’t have a public IP address.

12th Nov, 2019 (6:15 AM PST) 57.0.1020 R57 Release
Globally Rolled Out Features/Enhancements

Qubole enables certain features/enhancements as part of its gradual rollout program to different pods over a period of time. After Qubole rolls out such features/enhancements, it globally enables them on the Qubole platform.

The following table provides the list of features that are globally rolled out for you to use.

Feature/Enhancement Feature Description Qubole Component Supported Cloud Provider QDS Release Version
Auto-population of instances similar to primary worker node type Enhancement in the heterogeneous cluster configuration UI that suggests instances similar to the chosen worker node type but from different generations, instead of suggesting an instance of double the weight from the same generation. Cluster Management AWS R58 Quick Fix
Optimized version of the Beeline script It is an optimization that reduces the latency of HiveServer2 queries. Hive AWS R58
Hive MapJoin Counters Computation Support of counters that compute the number of joins in Hive, which are converted to MapJoin after a query completion. The query results are visible in the Analyze/Workbench logs. Hive AWS R58
Use of hive-exec Libraries in Tez from the local disk It is a Tez optimization that allows using the hive-exec jar, which is locally available on cluster nodes. This reduces the localization overhead and increases efficiency by avoiding additional HDFS operations. Hive Tez AWS, Azure, GCP, OPC, and Oracle R58
HiveServer2 cluster with private IP address HiveServer2 clusters use private-IP for the inter-process communication. Hive AWS R58
Cleanup of the Partial Data upon a Hive Query Failure In case of a Hive query failure, Qubole cleans up the partial data that completed mappers/reducers write. Hive AWS R57 Quick Fix
Hive Metastore Server with Java 8 Use Java 8 along with G1GC (garbage collector) for the thrift Hive Metastore Server (HMS) JVM. To use this feature, remove any bootstrap code related to Java 8 for HMS. There is no need to restart HMS JVM for Java 8 to be effective. Hive AWS R57
Spot Node Loss and Spot Blocks using graceful decommissioning Spark applications handle spot node loss and spot blocks using the YARN Graceful-Decommission status. This is supported on Spark versions 2.4.0 and later. Spark AWS, GCP R57
Private IP usage Private IP addresses are used for all nodes in Spark, as a result of which the executor logs are accessible. Spark AWS R56 Quick Fix
Direct Writes for Dynamic partition overwrite in Datasource flow Support for direct writes to improve performance for data source tables when the OSS flag spark.sql.sources.partitionOverwriteMode is set to dynamic. Supported on Spark version 2.4 and later. Spark AWS R57
Distributed Writes for better performance Users can run SQL commands with a large result size using Spark. Supported on Spark version 2.4 and later. Spark AWS R57
Improved Container Packing for efficient cluster utilization Spark on Qubole improves container packing by restarting idle executors, thus allowing YARN to move restarted executors to fewer nodes. Spark AWS R56
Direct Writes for Insert Overwrite with dynamic partitions queries Support for direct writes to improve performance for Insert Overwrite with dynamic partitions queries. Supported on Spark version 2.2 and later. Spark AWS R56

Driver Release Notes

ODBC and JDBC Drivers Release Notes

Currently, Qubole does not support ODBC drivers on GCP.

JDBC Driver Release Notes
JDBC Driver Version 2.3.2

Release Date: 24th February 2020

Qubole has released a new version of the JDBC driver. For more details, see the What’s New and List of Changes sections below.

What’s New

These are the changes in this version:

  • There are no major changes in this version. See List of Changes for more information.
List of Changes
Bug Fixes
  • JDBC-184: The JDBC driver scanned the entire query, including comments, to count ? placeholders and used that count as the parameter count. The driver now ignores comments while scanning the query, so occurrences of ? inside comments are no longer counted as parameters.
  • JDBC-189: The JDBC driver used an older version of the Apache HTTP client, which triggered a warning. Qubole has upgraded the Apache HTTP client from version 4.5.2 to 4.5.11.
JDBC Driver Version 2.3.1

Release Date: 29th January 2020

Qubole has released a new version of the JDBC driver. For more details, see the What’s New and List of Changes sections below.

What’s New

These are the changes in this version:

  • There are no major changes in this version. See List of Changes for more information.
List of Changes
Bug Fixes
  • JDBC-186: In case of IAM-Role based accounts, IAM credentials expire after a set duration of time. Refreshing IAM credentials failed due to a bug in the JDBC driver. This issue is resolved and IAM credentials are now refreshed correctly.
JDBC Driver Version 2.3.0

Release Date: 16th December 2019

Qubole has released a new version of the JDBC driver. For more details, see the What’s New and List of Changes sections below.

What’s New

These are the changes in this version:

  • This version supports using Qubole JDBC driver to connect to Qubole clusters from Tableau Desktop and Server.
  • Qubole has introduced catalog_name as a new connection string parameter in this version to add the catalog name to the query. For more information on connection strings, see Setting the JDBC Connection String.
List of Changes
Enhancements
  • JDBC-154: Earlier, the Qubole JDBC driver made two API calls (the Command API followed by the Results API to fetch the results location) when useS3 was enabled. This enhancement replaces them with a single API call, which reduces overall command latency.

  • JDBC-156: You can now use Qubole JDBC driver to connect to Qubole clusters from Tableau Desktop and Server.

    Qubole has also introduced catalog_name as a new connection string in this version to add the catalog name in the query.

JDBC Driver Version 2.2.0

Release Date: 18th October 2019

Qubole has released a new version of the JDBC driver. For more details, see the What’s New and List of Changes sections below.

What’s New

This is the change in this version:

  • This version of the JDBC driver supports Qubole-on-GCP (Google Cloud Platform) in addition to the other cloud platforms that were already supported. For more information on the features supported by the JDBC driver, see JDBC Driver.
List of Changes
Enhancements
  • JDBC-157: This version of JDBC driver supports Qubole-on-GCP in addition to other cloud platforms.
Bug Fixes
  • JDBC-163: The issue related to useS3 being always true has been resolved.

Availability

Launch Stage and Availability of Features

All the features are classified by the launch stage, availability, and the default state as:

  • Launch stage: (General Availability or Beta)
  • Availability: (Self-serve or via Support)
  • Default state: (Enabled or Disabled)

Note

Unless stated otherwise, features are generally available, available as self-service and enabled by default.

This table describes the two different launch stages of a feature.

Launch Stage Description
Beta It implies that the features are appropriate for limited production use. There is no SLA adherence for a beta feature.
General Availability (GA) It implies that the features can run production workloads with the SLA adherence.

This table describes the three types of the feature availability.

Availability Description
Self-serve It implies that the feature has a user-configurable property.
Via Support It implies that the feature can be enabled by Qubole Support. The feature may have a configurable property to disable it, but only Qubole Support can enable it for the first time.

Gradual Rollout

It implies that the feature is enabled to Qubole users through a controlled roll-out process where Qubole monitors the feature performance using real-time dashboards. If and when an issue is observed, Qubole may suspend, rollback, or fix-and-continue the feature to ensure stable operation of the platform for users. As part of the gradual rollout process, Qubole has designated users and accounts into separate pods, which are targeted for the feature. The pod membership and deployment order is a variable and subject to change at any time by Qubole. The timing of a gradual rollout of the feature being deployed or enabled for a specific user or pod is variable and as such, there is no implied change window for these activities.

This table describes the two types of a feature’s default state.

Default State Description
Enabled It implies that Qubole enables this feature by default.
Disabled It implies that the user must enable the feature using the configurable properties.
Cluster Restart Operation

Certain configurations that you change at the cluster level require a cluster restart before the new configuration takes effect.

How-To (QDS Guides, Tasks, and FAQs)

User Guide

Introduction

Qubole Data Service (QDS) provides a highly integrated set of tools to help your organization analyze data and build reliable data-driven applications in the Cloud. QDS takes care of orchestrating and managing Cloud resources, allowing you to focus on analyzing and using the data.

Use QDS to explore data from various sources, analyze your data, launch and manage compute clusters, run recurring commands, and use notebooks to save, share, and re-run a set of queries on a data source.

Account Settings

This section contains instructions for configuring Qubole Account Settings and related topics. Choose one of the following options:

Configuring Qubole Account Settings for GCP

See Setting up your Qubole Account on GCP

Listing Allowed IP Addresses

Listing allowed IP addresses lets users of an account log in only from specific IP addresses.

You can add the IP addresses you want to allow from the Control Panel.

To allow an IP address, perform the following steps:

  1. On the QDS user interface, navigate to the Control Panel. Click the Whitelist IP tab. If there is no IP address added, a No data is available message appears.

    _images/EmptyWhitelist.png
  2. Click the add icon AddIcon to add an IP address to the allowed list.

    A dialog with the OK and Cancel buttons is displayed. Click OK to add a new IP address. The dialog to add a new IP is displayed. Enter the IP address in the IP CIDR text field as illustrated in the following figure.

    _images/AddNewWhitelistIP.png

    Caution

    Once you add IP addresses to the allowed list, logging in to QDS is possible only from the allowed IP addresses.

    Add a description that can contain a maximum of 255 characters.

    Click Add Entry to save the IP address to the allowed list. After clicking Add Entry, the Whitelist IP tab displays the newly-added IP address as shown in the following figure.

    _images/WhitelistIPAdded.png

    Click the refresh icon RefreshIcon to refresh the list. Repeat step 2 to add another IP address to the list.

In the Action column, click Delete to remove an IP address from the list. Select the checkboxes against multiple allowed IP addresses to delete more than one IP address at a time. After you select an IP address, a Delete Selected button appears as shown in the following figure.

_images/BulkDeleteWhitelistIP.png

Click Delete Selected. A dialog with the OK and Cancel buttons is displayed. Click OK to delete the selected IP addresses. After you click the OK button, the deleted IP addresses are not seen in the Whitelist IP tab as shown in the following figure.

_images/WhitelistIPDeleted.png
Configuring Version Control Systems

Qubole supports GitHub, GitLab, and Bitbucket integration. These integrations let you use a central repository as the single point of entry for all changes to a project, providing the following advantages:

  • The ability to track changes
  • A central repository for code changes
  • An effective way to collaborate

To enable GitHub, GitLab, or Bitbucket integration, create a ticket with Qubole Support.

GitHub Version Control

To configure the version control using GitHub, you must perform the following tasks:

  1. Configuring Version Control Settings
  2. Generating a GitHub Token in the GitHub Profile
  3. Configuring a GitHub Token
Configuring Version Control Settings

You must have Account Update privileges to perform this task.

Configuration

Follow the instructions below to configure Version Control System:

  1. Navigate to Home >> Control Panel >> Account Settings.
  2. On the Account Settings page, scroll down to the Version Control Settings section.
  3. From the Version Control Provider drop-down list, select GitHub.
  4. From the Repository Hosting Type drop-down list, select Service-managed.
  5. For Service-managed, the API Endpoint is auto-populated.
  6. Click Save.

The following figure shows a sample Version Control Settings section.

_images/vcs-github.png

The following figure shows a sample Version Control Settings section with the Self-managed and Bastion node options.

_images/github-self-managed.png
Generating a GitHub Token in the GitHub Profile

Follow the instructions below to get the GitHub token:

  1. To create a GitHub token, see the GitHub Documentation.
  2. After you generate a new token, copy it to configure in the Qubole account.
Configuring a GitHub Token

Follow the instructions below to configure a GitHub Token from the My Accounts page.

  1. Navigate to Control Panel >> My Accounts.
  2. For your account, under GitHub Token column, click Configure.
  3. Add the generated GitHub token and click Save.

The GitHub token is configured per user and per account.

GitLab Version Control

To configure the version control using GitLab, you must perform the following tasks:

  1. Configuring Version Control Settings
  2. Generating a GitLab Token in the GitLab Profile
  3. Configuring a GitLab Token
Configuring Version Control Settings

You must have Account Update privileges to perform this task.

Configuration

Follow the instructions below to configure Version Control System:

  1. Navigate to Home >> Control Panel >> Account Settings.
  2. On the Account Settings page, scroll down to the Version Control Settings section.
  3. From the Version Control Provider drop-down list, select GitLab.
  4. From the Repository Hosting Type drop-down list, select Service-managed.
  5. For Service-managed, the API Endpoint is auto-populated.
  6. Click Save.

The following figure shows a sample Version Control Settings section.

_images/vcs-gitlab.png

The following figure shows a sample Version Control Settings section with the Self-managed and Bastion node options.

_images/gitlab-self-managed.png
Generating a GitLab Token in the GitLab Profile

As a prerequisite, you must obtain a GitLab token. Perform the following steps:

  1. Create a GitLab token by following the GitLab Documentation.
  2. Copy the generated GitLab token to configure it in the Qubole account.
Configuring a GitLab Token

You can configure a GitLab Token from the My Accounts UI.

  1. Navigate to Control Panel >> My Accounts.
  2. For your account, under GitLab Token column, click Configure.
  3. Add the generated GitLab token and click Save.

The GitLab token is configured per user and per account.

Bitbucket Version Control

To configure version control for Airflow using Bitbucket, you must perform the following tasks:

  1. Configuring Version Control Settings
  2. Obtaining Bitbucket Credentials
  3. Configuring Bitbucket
Configuring Version Control Settings

You must have the Account Update privileges to perform this task.

  1. Navigate to Home > Control Panel > Account Settings.
  2. On the Account Settings page, scroll down to the Version Control Settings section.
  3. From the Version Control Provider drop-down list, select Bitbucket.
  4. From the Repository Hosting Type drop-down list, select Service-managed. Currently, only Service-managed is supported.
  5. The API Endpoint is auto-populated for Service-managed.
  6. Click Save.

The following figure shows a sample Version Control Settings section.

_images/vcs-bb.png
Obtaining Bitbucket Credentials

Obtain your Bitbucket username and password for Basic Auth over HTTPS, or create an app-specific password, which can be used with Basic Auth as the username and app-specific password.

For more information, see Bitbucket Documentation.

Configuring Bitbucket

You can configure Bitbucket from the My Accounts UI.

  1. Navigate to Control Panel >> My Accounts.

  2. For your account, under Bitbucket Credentials column, click Configure.

  3. Add the Bitbucket credentials and click Save. You can either use your Bitbucket credentials or Bitbucket App password.

    _images/bb-creds.png

Creating Notification Channels

You can create Notification Channels (such as Email, Slack, PagerDuty, and Webhook) to receive notifications on success or failure of various Qubole products (Scheduler, Quest, and so on). To create Notification Channels, follow the instructions below:

  1. Navigate to the Control Panel page and click Notification Channels.

    _images/notification.png
  2. Click +New. The Notification Channels window is displayed.

    _images/channel.png
  3. Select the channel type and specify the required details.

    Note

    • Email: You can select Email to configure email notifications. Enter the name, email IDs (comma-separated), and description in the Name, Emails, and Description fields, respectively.
    • Slack: You can configure Slack to receive notifications through Slack messages. Enter the name, webhook URL, and description in the Name, Webhook, and Description fields, respectively.
    • PagerDuty: You can configure PagerDuty to receive notifications through PagerDuty. Enter the name, the secret key used to authorize PagerDuty’s event API call, and the description in the Name, Secret, and Description fields, respectively.
    • Webhook: You can select Webhook to configure other services that support webhooks to receive notifications. Enter the name, webhook URL, and description in the Name, Webhook, and Description fields, respectively.
  4. Click Save. You have successfully configured a Notification Channel.

You can click the settings icon at right to perform the following activities:

_images/settings.png
  • Test Now: You can test your channel with this option. After you click Test Now, you will receive a test email at your email address.
  • Edit: You can click Edit to edit your configuration. After you finish editing, click Save.
  • Manage Permissions: Click Manage Permissions to manage user access. After you click Manage Permissions, the Manage Permissions window is displayed. Select the User or Group from the list and add the permissions. To add multiple permissions, click Add New Permission. You can also click the delete icon to delete any entry. Click Save at the end.
_images/permissions.png
  • Disable: You can click Disable to disable any notification channel. After you click Disable, optionally enter an alternate email ID and click Save. After the channel is disabled, notifications about the success or failure of the configured Qubole products are sent to the alternate email ID, along with a note that the primary notification channel is disabled.

Analyze

This section explains how to use the QDS user interface to run commands. It covers the following topics:

About the Workbench Page

The Workbench page is a single-page application for ad hoc analysis of data in a data lake or other connected data source. It supports Hive, Spark, DB query, shell, and Hadoop commands.

The key features of the new page include:

Debugging
Collections
Command Execution Types

There are 3 ways you can execute commands in QDS:

Composing a Hadoop Job

Use the command composer on the Workbench page to compose a Hadoop job.

You can use the query composer for these types of Hadoop job:

Note

Before running a Hadoop job, make sure that the output directory does not already exist.

Hadoop and Presto clusters support Hadoop job queries. See Mapping of Cluster and Command Types for more information.

Compose a Hadoop Custom Jar Query

Perform the following steps to compose a Hadoop jar query:

  1. Navigate to the Workbench page and click + Create New.
  2. Select Hadoop from the drop-down list at the top of the page (near the middle). Custom Jar is selected by default in the Job Type drop-down list, and this is what you want.
  3. In the Path to Jar File field, specify the path of the directory that contains the Hadoop jar file.
  4. In the Arguments text field, specify the main class, generic options, and other JAR arguments (see the example after these steps).
  5. Click Run to execute the query.
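
For example, the Arguments field for a hypothetical word-count JAR might contain something like the following sketch; the main class, the generic option value, and the bucket paths are placeholders rather than values from your environment.

org.example.WordCount -D mapreduce.job.reduces=4 gs://your-bucket/wordcount/input gs://your-bucket/wordcount/output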

You can see the result under the Results tab, and the logs under the Logs tab. For more information on how to download command results and logs, see Downloading Results and Logs.

For information on the REST API, see Submitting a Hadoop Jar Command.

Compose a Hadoop Streaming Query

Perform the following steps to compose a Hadoop streaming query:

  1. Navigate to the Workbench page and click + Create New.
  2. Select Hadoop from the drop-down list at the top of the page (near the middle).
  3. Select Streaming from the Job Type drop-down list.
  4. In the Arguments field, specify the streaming and generic options (see the example after these steps).
  5. Click Run to execute the query.
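
For example, a sketch of the Arguments field for a streaming job that uses a Python mapper and the built-in aggregate reducer might look like the following; the script and bucket paths are placeholders.

-files gs://your-bucket/scripts/mapper.py -mapper mapper.py -reducer aggregate -input gs://your-bucket/streaming/input -output gs://your-bucket/streaming/output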

You can see the result under the Results tab, and the logs under the Logs tab. For more information on how to download command results and logs, see Downloading Results and Logs.

Compose a Hadoop DistCp Command

Perform the following steps to compose a Hadoop DistCp command:

  1. Navigate to the Workbench page and click + Create New.
  2. Select Hadoop from the drop-down list at the top of the page (near the middle).
  3. Select clouddistcp from the Job Type drop-down list.
  4. In the Arguments text field, specify the generic and DistCp options (see the example after these steps).
  5. Click Run to execute the query.
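
For example, to copy data between two Cloud storage locations with 20 mappers, the Arguments field might contain something like the following sketch; the bucket paths are placeholders.

-m 20 gs://source-bucket/data gs://dest-bucket/data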

You can see the result under the Results tab, and the logs under the Logs tab. For more information on how to download command results and logs, see Downloading Results and Logs.

Composing a Hive Query

Use the command composer on the Workbench page to compose a Hive query.

See Hive in Qubole for more information.

Note

Hadoop clusters support Hive queries. See Mapping of Cluster and Command Types for more information.

Perform the following steps to compose a Hive query:

  1. Navigate to the Workbench page and click + Create New.

  2. Select the Hive tab (near the top of the page).

  3. Either:

    1. To use a stored query, select Query Path from the drop-down list near the top right of the page, then specify the Cloud storage path that contains the Hive query file. Click Run to execute the query.

    Or:

    1. Enter your Hive query in the text field (Query Statement is selected by default in the drop-down list near the top right of the page; this is what you want in this case). Click Run to execute the query.

You can use the Status tab in the bottom half of the screen to monitor the progress of your job; this tab also displays useful debugging information if the query does not succeed.

You can see the result under the Results tab, and the logs under the Logs tab. For more information on how to download command results and logs, see Downloading Results and Logs.

Note

The log for a particular Hive query is available at <Default location>/cluster_inst_id/<cmd_id>.log.gz.

Viewing a Detailed Hive Log

A detailed log for each Hive query executed using HiveServer2 or Hive-on-coordinator can be uploaded to a subdirectory in the default location in Cloud storage within a couple of minutes of query completion. Detailed logs are not available by default. Create a ticket with Qubole Support to enable this capability.

Once it’s enabled, you can find the location of the logs in the Logs tab of the Workbench page.

Viewing Multi-line Column Data in Query Results

Qubole supports newline (\n) and carriage return (\r) characters in Hive query results by escaping them in the Hive result set and then un-escaping in the UI; this prevents any problems with the display of multi-line columns. To enable this capability, create a ticket with Qubole Support. Note that once it’s enabled, even a simple SELECT query requires a cluster start.

Downloading Results and Logs

To download command results, click the Download button near the right corner of the Results section of the Workbench page of the QDS UI; see Downloading Results from the Results Tab.

See also Analyze.

About the Result File Size Limit

By default, for each QDS account, the result file size limit is 20 MB. If you want to increase this limit, create a ticket with Qubole Support.

QDS supports downloading results in the these formats:

  • Formats supported when the result file size is within or equal to 20 MB (or the configured file size limit):
    • CSV (comma-separated values)
    • TSV (tab-separated values)
    • RAW
  • Format supported when the result file size is more than the default 20 MB (or over the configured file size limit):
    • Complete Raw Results
Downloading Results from the Results Tab

For all successful commands, the following options are available when the result file size is within 20MB:

  • Download CSV
  • Download TSV
  • Download RAW

Download Complete Raw Results is available only if the result file size is larger than 20 MB; in that case, the other download options are not displayed.

Note

The Include Header option is unavailable with the Download Complete Raw Results option.

Downloading Logs

The query logs are displayed in the Logs tab. The tab also has an Errors and Warnings filter.

_images/logs.png

Click Logs > Download to download logs.

Downloading the Complete Raw Result

You can download a successful query result from the History tab of the Workbench page by selecting the corresponding query. Below are instructions for queries that have result sets larger than the default file size.

Note

Analyze describes how to compose the different types of commands that you can run. See also Downloading Results and Logs.

You can download the complete raw result by clicking the Download icon near the top right corner of the Results pane.

The Cloud storage bucket is displayed with the corresponding directory and files.

Select a file or expand the directory to select files, or download all folders and files.

Note

You can only download a single file at a time.

After making your selection, click Download. Click Close to go back to the main tab.

Merging Individual Files from a Complete Raw Result Download

When you get multiple files as part of a complete raw result download, use a command such as the following to merge them to a single file.

  • cat file_1 file_2 ... file_n | tr \\1 \\t > output.txt
Composing a DB Query

You can query an existing data store using the query composer on the Workbench page.

Prerequisite

You must have an existing data store to query it. You can create a data store from the Explore page.

Note

You can run these commands without bringing up a cluster. See Mapping of Cluster and Command Types for more information.

Perform the following steps to query an existing data store:

  1. Navigate to the Workbench page and click + Create New. Select Db Query from drop-down list near the center of the top of the page.

  2. Query Statement is selected by default from the drop-down list near the right side of the top of the page. Enter the DB query in the text field.

    -or-

    To run a stored query, select Query Path from the drop-down list, then specify the Cloud storage path that contains the DB query file.

  3. From the Select a Data Store drop-down list, select the data store to which the query is to be applied.

  4. Click Run to execute the query.

  5. The query result is displayed in the Results tab and the query logs in the Logs tab. For more information on how to download command results and logs, see Downloading Results and Logs.

Composing a Presto Query

See Presto for more information about using Presto on QDS.

Note

Presto queries run on Presto clusters. See Mapping of Cluster and Command Types for more information. Presto supports querying tables backed by the storage handlers.

QDS uses the Presto Ruby client, which provides better overall performance, processing DDL queries much faster and quickly reporting errors that a Presto cluster generates. For more information, see the Presto Ruby Client in QDS blog.

Perform the following steps to compose a Presto query:

Caution

Run only Hive DDL commands on a Presto cluster. Running Hive DML commands on a Presto cluster is not supported.

  1. Navigate to the Workbench (beta) page. Select Presto from the command composer tabs.

  2. Enter the Presto query in the text field.

  3. Click Run to execute the query.

  4. The query result is displayed in the Results tab and the query logs in the Logs tab. The Logs tab has the Errors and Warnings filter. For more information on how to download command results and logs, see Downloading Results and Logs.

    Note

    For a given Presto query, a new Presto Query Tracker is displayed in the Logs tab when:

    • A cluster instance that ran the query is still up.
    • The query info is still present in the Presto server in that cluster instance. The query information is periodically purged from the server.

    If either of the above conditions is not met, the older Presto Query Tracker is displayed in the Logs tab.

Composing Spark Commands in the Analyze Page

Use the command composer on the Workbench page to compose a Spark command.

See Running Spark Applications and Spark in Qubole for more information. For information about using the REST API, see Submit a Spark Command.

Spark queries run on Spark clusters. See Mapping of Cluster and Command Types for more information.

Qubole Spark Parameters
  • The Qubole parameter spark.sql.qubole.parquet.cacheMetadata allows you to turn caching on or off for Parquet table data. Caching is on by default; Qubole caches data to prevent table-data-access query failures in case of any change in the table’s Cloud storage location. If you want to disable caching of Parquet table data, set spark.sql.qubole.parquet.cacheMetadata to false. You can do this at the Spark cluster or job level, or in a Spark Notebook interpreter.
  • In case of DirectFileOutputCommitter (DFOC) with Spark, if a task fails after writing files partially, the subsequent reattempts might fail with FileAlreadyExistsException (because of the partial files that are left behind). Therefore, the job fails. You can set the spark.hadoop.mapreduce.output.textoutputformat.overwrite and spark.qubole.outputformat.overwriteFileInWrite flags to true to prevent such job failures.
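
As an illustrative sketch, you could set these properties at the job level by adding them to the Spark Submit Command Line Options field; the combination shown below is only an example, not a recommended configuration.

--conf spark.sql.qubole.parquet.cacheMetadata=false --conf spark.hadoop.mapreduce.output.textoutputformat.overwrite=true --conf spark.qubole.outputformat.overwriteFileInWrite=true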
Ways to Compose and Run Spark Applications

You can compose a Spark application using:

Note

You can read a Spark job’s logs, even after the cluster on which it was run has terminated, by means of the offline Spark History Server (SHS). For offline Spark clusters, only event log files that are less than 400 MB are processed in the SHS. This prevents high CPU utilization on the webapp node. For more information, see this blog.

Note

You can use the --packages option to add a list of comma-separated Maven coordinates for external packages that are used by a Spark application composed in any supported language. For example, in the Spark Submit Command Line Options text field, enter --packages com.package.module_2.10:1.2.3.

Note

You can use macros in script files for Spark commands with subtypes scala (Scala), py (Python), R (R), sh (Command), and sql (SQL). You can also use macros in large inline content and large script files for scala (Scala), py (Python), R (R), and sql (SQL). This capability is not enabled for all users by default; create a ticket with Qubole Support to enable it for your QDS account.

Compose a Spark Application in Scala

Perform the following steps to compose a Spark command:

  1. Navigate to the Workbench page and click + Create New. Select the Spark tab near the top of the page. Scala is selected by default:

    _images/ComposeSparkScala-gcp.png
  2. Either:

    1. To use a stored query, select Query Path from the drop-down list near the top right of the page, then specify the Cloud storage path that contains the query file.

    Or:

    1. Enter your query in the text field (Query Statement is selected by default in the drop-down list near the top right of the page; this is what you want in this case).
  3. Optionally enter command options in the Spark Submit Command Line Options field to override the defaults shown in the Spark Default Submit Command Line Options field.

  4. Optionally specify arguments in the Arguments for User Program field.

  5. Click Run to execute the query.

The query result appears under the Results tab, and the query logs under the Logs tab. Note the clickable Application UI link to the Spark application UI under the Logs tab. See also Downloading Results and Logs.

Compose a Spark Application in Python

Perform the following steps to compose a Spark command:

  1. Navigate to the Workbench page and click + Create New. Select the Spark tab near the top of the page.

  2. Select Python from the drop-down menu (Scala is selected by default).

    _images/ComposeSparkPython-gcp.png
  3. Either:

    1. To use a stored query, select Query Path from the drop-down list near the top right of the page, then specify the Cloud storage path that contains the query file.

    Or:

    1. Enter your query in the text field (Query Statement is selected by default in the drop-down list near the top right of the page; this is what you want in this case).
  4. Optionally enter command options in the Spark Submit Command Line Options field to override the defaults shown in the Spark Default Submit Command Line Options field. You can use the --py-files argument to specify remote files in a Cloud storage location, in addition to local files.

  5. Optionally specify arguments in the Arguments for User Program field.

  6. Click Run to execute the query.

The query result appears under the Results tab, and the query logs under the Logs tab. Note the clickable Application UI link to the Spark application UI under the Logs tab. See also Downloading Results and Logs.

Compose a Spark Application using the Command Line

Note

Qubole does not recommend using the Shell command option to run a Spark application via Bash shell commands, because in this case automatic changes (such as increases in the Application Coordinator memory based on the driver memory, and the availability of debug options) do not occur. Such automatic changes do occur when you run a Spark application using the Command Line option.

Perform the following steps to compose a Spark command:

  1. Navigate to the Workbench page and click + Create New. Select the Spark tab near the top of the page.

  2. Select Command Line from the drop-down menu (Scala is selected by default).

    _images/ComposeSparkCmdLine-gcp.png
  3. Click Run to execute the query.

The query result appears under the Results tab, and the query logs under the Logs tab. Note the clickable Application UI link to the Spark application UI under the Logs tab. See also Downloading Results and Logs.

Compose a Spark Application in SQL

Note

You can run Spark commands in SQL with Hive Metastore 2.1. This capability is not enabled for all users by default; create a ticket with Qubole Support to enable it for your QDS account.

Note

You can run Spark SQL commands with large script files and large inline content. This capability is not enabled for all users by default; create a ticket with Qubole Support to enable it for your QDS account.

  1. Navigate to the Workbench page and click + Create New. Select the Spark tab near the top of the page.

  2. Select SQL from the drop-down menu (Scala is selected by default).

    _images/ComposeSparkSQL-gcp.png
  3. Either:

    1. To use a stored query, select Query Path from the drop-down list near the top right of the page, then specify the Cloud storage path that contains the query file.

    Or:

    1. Enter your query in the text field (Query Statement is selected by default in the drop-down list near the top right of the page; this is what you want in this case).
  4. Optionally enter command options in the Spark Submit Command Line Options field to override the defaults shown in the Spark Default Submit Command Line Options field.

  5. Click Run to execute the query.

The query result appears under the Results tab, and the query logs under the Logs tab. Note the clickable Application UI link to the Spark application UI under the Logs tab. See also Downloading Results and Logs.

Compose a Spark Application in R

Perform the following steps to compose a Spark command:

  1. Navigate to the Workbench page and click + Create New. Select the Spark tab near the top of the page.

  2. Select R from the drop-down menu (Scala is selected by default).

    _images/ComposeSparkR-gcp.png
  3. Either:

    1. To use a stored query, select Query Path from the drop-down list near the top right of the page, then specify the Cloud storage path that contains the query file.

    Or:

    1. Enter your query in the text field (Query Statement is selected by default in the drop-down list near the top right of the page; this is what you want in this case).
  4. Optionally enter command options in the Spark Submit Command Line Options field to override the defaults shown in the Spark Default Submit Command Line Options field.

  5. Optionally specify arguments in the Arguments for User Program field.

  6. Click Run to execute the query.

The query result appears under the Results tab, and the query logs under the Logs tab. Note the clickable Application UI link to the Spark application UI under the Logs tab. See also Downloading Results and Logs.

Known Issue

In a cluster using preemptible nodes exclusively, the Spark Application UI may display the state of the application incorrectly, showing the application as running even though the coordinator node, or the node running the driver, has been reclaimed by GCP. The status of the QDS command will be shown correctly on the Workbench page. Qubole does not recommend using preemptible nodes only.

Composing a Shell Command

Use the command composer on the Workbench page to compose a shell command.

See Running a Shell Command for more information.

Note

Hadoop 2 and Spark clusters support shell commands. See Mapping of Cluster and Command Types for more information. Some Cloud platforms do not support all cluster types.

Qubole does not recommend running a Spark application as a Bash command under the Shell command options, because automatic changes, such as an increase in the Application Coordinator memory based on the driver memory and the availability of debug options, do not happen. Such automatic changes occur when you run a Spark application through the Command Line option.

Perform the following steps to compose a shell command:

  1. Navigate to the Workbench page and click + Create New.
  2. Select Shell from the drop-down list near the center of the top of the page.
  3. Shell Script is selected by default from the drop-down list near the right side of the top of the page. You can also specify the location of a script in Cloud storage.
  4. To use a shell command, enter it into the text field. If you are using a script, specify script parameters, if any, in the text field.
  5. In the Optional List of Files text field, optionally list files (separated by a comma) to be copied from Cloud storage to the working directory where the command is run.
  6. In the Optional List of Archives text field, optionally list archive files (separated by a comma) to be uncompressed in the working directory where the command is run.
  7. Click Run to execute the query.

You can see the result under the Results tab, and the logs under the Logs tab. For more information on how to download command results and logs, see Downloading Results and Logs.

Data Discovery
Discover data

The Preview tab of the Workbench page makes it easy to browse and discover data sets by displaying information such as table schemas, sample data, and usage statistics (including most-used columns and most-frequent users).

  • You can view nested data types under the Tables tab.
_images/nestedtypes-no-banner.png
  • The Tables tab orders columns alphabetically and displays partitioned columns on top.
_images/orderedcols-no-banner.png
  • There is a table-preview tab for table schemas, data preview, and usage statistics.
_images/richtablepreviews-no-banner.png
Run and analyze results
  • The Results pane displays results and allows you to search them.
_images/resultspane.png
Collaborate

The Workbench page:

  • Provides sample queries to help you get started.
  • Generates a unique URL when you click any table. You can easily share this URL with other users of the account.
_images/tabperma-no-banner.png

You can also generate a URL for a database and share it with other users of the account.

  • Warns you when displaying commands that belong to a different account:
_images/warning-no-banner.png
  • Allows you to format SQL queries with one click.
  • Allows you to schedule queries.
Command Status

The Status pane provides specific error messages, and actionable next steps, making it faster and easier to debug failed queries. (Currently for Hive and Spark queries only.)

_images/statuspane-no-banner.png

The Status pane:

  • Helps you understand command progress by letting you:
    • Visualize the command flow.
    • View command metadata such as the state of the command and its position in the queue.
  • Helps you recover from errors by:
    • Displaying the exact error message.
    • Indicating faulty configuration values (where applicable), such as displaying the maximum memory for out-of-memory errors.
  • Filters big data application and job URLs (resources) for commands from the Logs and Status tabs.

Data Exploration

Using QDS for Data Exploration

Use the QDS Explore page to search data from various sources. By default, you see the Qubole Hive database with the expanded default database.

Note

Clicking the Qubole logo in the QDS UI displays the Qubole home page.

For access to your Cloud storage, and to add an external data store, pull down the drop-down list next to Qubole Hive.

Note

Only a system administrator can add a data store; other users do not see the Add Data Store option.

Use QDS Explore to:

  • View data in Hive tables
  • Connect to any supported database and view data from its tables
  • Connect to data buckets in Cloud storage and view the data

The following sections explain the functions of the Explore feature:

Understanding a Data Store

It is often useful to import or export data to and from data stores other than Cloud storage. For example, you may want to run a periodic Hive query to summarize a dataset and export the results to a MySQL database; or you may want to import data from a relational database into Hive. You can identify an external data store for such purposes. QDS supports the data stores shown on the Database Type drop-down list on the Add Data Store page (see below).

Adding a Data Store

Note

You must be a system administrator to add a data store.

To add a new data store, navigate to the Explore page and proceed as follows:

  1. Pull down the drop-down list near the top left of the page (it defaults to Qubole Hive) and choose Add Data Store.
  2. Enter a name in the Data Store Name field.
  3. Choose a Database Type.
  4. Enter the Catalog Name. (This field is only visible when the data store needs to be accessible through Presto and Spark clusters, or when you select Snowflake from the Database Type drop-down list.)

Note

This field is disabled by default for Database Types other than Snowflake. Contact Qubole Support to enable this feature for your account.

  5. Enter the Database Name.

  6. Enter the host address of the database server in the Host Address text field.

  7. Enter the port number in the Port text field or accept the default.

  8. Enter the username (to be used on the host) in the Username text field.

  9. Enter the password (for the username on the host) in the Password text field.

  10. Check the Skip validation check box if you do not want QDS to validate the connection immediately.

  11. Check the Use Bastion Node check box if the data store to be created will be in a VNet, and provide the following additional information to establish a connection from QDS to the data store:

    • The Bastion node’s IP address.
    • The port on which the Bastion node can be reached.
    • The username which QDS should use to log in to the Bastion node.
    • The private key for access to the Bastion node.

Note

Cluster nodes must have direct access to the datastore.

  12. Click Save to add the data store.

    Unless you checked Skip validation, QDS attempts to connect to the database. If QDS can connect to the database, the data store is marked activated, and you should see it in the list of data stores in the drop-down list at the top left of the Explore page. A green dot shows that a data store has been activated; a red dot means that it has not.

Editing a Data Store

You can edit a data store, and give it a name if it does not already have one. To edit a data store:

  1. Navigate to the Explore page and pull down the drop-down list that defaults to Qubole Hive. Select the data store that you want to edit.
  2. Click the gear icon near the top right and choose Edit.
  3. Make edits as needed and click Update to save the new values. (Click Reset to Default to revert to default values, or Cancel to retain the current values).
SSL support for QDS access to Postgres Database

You can use SSL-based access to the Postgres database if you configure the database to use SSL. When SSL is used, QDS expects you to open the database ports to everyone, as SSL takes care of securing access from QDS. If the Postgres database is in us-east-1, you can continue to open the database ports to QDS only.

Adding a Data Store

You need a data store if you want to import or export data from or to an external relational database management system (RDBMS). You must be a system administrator to add a data store.

See Data Store for supported data store types, and instructions for adding a data store.

Adding a Secure Connection to an External Data Store

Qubole allows you to support user-specific, secure connections to external data stores such as Redshift and Snowflake. You can share your connections as a template with other users and groups. This feature is disabled by default and can be enabled via Qubole Support.

Add a New Secure Connection

To add a secure connection to an external data store, navigate to the Explore page and follow the instructions:

  1. Pull down the drop-down list that defaults to Qubole Hive and choose Data Connections. Your existing data connections appear under My Connections and connections shared by other users appear under Shared Connections.

    _images/dataconnections.png
  2. Click the +New tab; the following screen appears.

    _images/new.png
  3. Select New Data Connection or New Connection Template.

    • New Data Connection: Use this option to create a data connection and keep it private (default) or share it with other users who can directly use it without providing their own username/password.
    • New Connection Template: Use this option to create a data connection and share it as a template with other users and groups within the same Qubole account. Other users will then see your connection but without the username/password. They can then use their own username/password to create a connection for themselves from this template.
Change Permissions for an Existing Connection

To change the permissions of an existing connection, follow the instructions:

  1. Click the gear icon against the respective connection.

    _images/options.png
  2. Select Manage Preferences. The Manage permissions window appears.

    _images/editoptions.png
  3. Select the user/group and the respective check boxes to provide necessary permissions (All, Create, Read, Update, and Delete).

    Note

    You can click +Add New Permission to provide permissions to multiple groups. You can also delete an entry by clicking the delete icon against the respective entry.

  4. Click Save. You have successfully provided or managed permissions.

Exploring Data in the Cloud

You can upload files to folders in Cloud storage. You configure the storage location from the Account Settings page under Control Panel in the QDS UI.

Select a file to see sample data. A tab with the Rows and Properties subtabs appears. By default, Properties of the file are displayed.

The following figure displays an example of the Properties subtab.

_images/S3props.png
Uploading a File
Prerequisites
Uploading a File Using the QDS Explore Page

Proceed as follows:

  1. In the QDS UI, navigate to the Explore page and choose My GCS from the drop-down list.
  2. Browse to the default storage location for your QDS account.
    • To upload a file:
      • Hover the cursor over the folder, click the gear icon that appears next it, and choose Upload. The file upload dialog appears.
      • Navigate to the location of the file and select the file. Click Upload.
      • You’ll see a confirmation if the upload is successful or an error message if it fails.
      • Check for the file in the location you chose above.

For more information, see Analyze.

Engines

Using Query Engines
Hadoop

Qubole runs applications written in MapReduce, Cascading, Pig, Hive, Scalding, and Spark using Apache Hadoop. Qubole offers two flavors of Hadoop, based on Apache releases commonly referred to as Hadoop 2.

These implementations of Hadoop are compatible with open source APIs and are largely covered by the Apache documentation. Qubole has added optimizations, as well as important capabilities such as autoscaling.

The sections that follow cover Qubole optimizations, and aspects of Hadoop 2 (Hadoop 2.6.x) that are especially important in Qubole clusters.

MapReduce V2 in Qubole

The sections that follow cover MapReduce V2 and the aspects of it in Qubole Hadoop 2 (Hadoop 2.6.x) that are especially important in Qubole clusters.

MapReduce Configuration in Hadoop 2

Qubole’s Hadoop 2 offering is based on Apache Hadoop 2.6.0. Qubole adds optimizations for Cloud object storage access and enhances Hadoop with its autoscaling code. Qubole JARs are published to a Maven repository and can be used seamlessly for developing MapReduce/YARN applications, as highlighted by this POM file.

In Hadoop 2, Resource Manager and ApplicationMaster handle tasks and assign them to nodes in the cluster. Map and Reduce slots are replaced by containers.

In Hadoop 2, slots have been replaced by containers, which is an abstracted part of the worker resources. A container can be of any size within the limit of the Node Manager (worker node). The map and reduce tasks are Java Virtual Machines (JVMs) launched within these containers.

This change means that specifying the container sizes becomes important. For example, a memory-heavy map task requires a larger container than a lighter map task. Moreover, container sizes differ across instance types (for example, an instance with more memory has a larger container size). While Qubole specifies good default container sizes per instance type, there are certain cases in which you may want to change the defaults.

The default Hadoop 2 settings for a cluster are shown on the Edit Cluster page of a Hadoop 2 (Hive) cluster. (Navigate to Clusters and click the edit button next to a Hadoop 2 (Hive) cluster. See Managing Clusters for more information.) A sample is shown in the following figure.

_images/Hadoop2Cluster_Settings.png

In MapReduce, changing a task’s memory requirement requires changing the following parameters:

  • The size of the container in which the map/reduce task is launched.
  • Specifying the maximum memory (-Xmx) to the JVM of the map/reduce task.

The two parameters (mentioned above) are changed for MapReduce Tasks/Application Master as shown below:

  • Map Tasks:
mapreduce.map.memory.mb=2240   # Container size
mapreduce.map.java.opts=-Xmx2016m  # JVM arguments for a Map task
  • Reduce Tasks:
mapreduce.reduce.memory.mb=2240  # Container size
mapreduce.reduce.java.opts=-Xmx2016m  # JVM arguments for a Reduce task
  • MapReduce Application Master:
yarn.app.mapreduce.am.resource.mb=2240  # Container size
yarn.app.mapreduce.am.command-opts=-Xmx2016m  # JVM arguments for an Application Master
Locating Logs
  • The YARN logs (Application Master and container logs) are stored at: <scheme><defloc>/logs/hadoop/<cluster_id>/<cluster_inst_id>/app-logs.
  • The daemon logs for each host are stored at: <scheme><defloc>/logs/hadoop/<cluster_id>/<cluster_inst_id>/<host>.
  • The MapReduce Job History files are stored at: <scheme><defloc>/logs/hadoop/<cluster_id>/<cluster_inst_id>/mr-history.

Where:

  • scheme is the cloud-specific URI scheme: gs:// for GCP.
  • defloc is the default storage location for the QDS account.
  • cluster_id is the cluster ID as shown on the Clusters page of the QDS UI.
  • cluster_inst_id is the cluster instance ID. It is the latest folder under <scheme><defloc>/logs/hadoop/<cluster_id>/ for a running cluster or the last-terminated cluster.

To extract a container’s log files, create a YARN command line similar to the following:

 yarn logs \
-applicationId application_<application ID> \
-logsDir <scheme><qubolelogs-location>/logs/hadoop/<clusterid>/<cluster-instanceID>/app-logs \
-appOwner <application owner>
YARN in Qubole

The sections that follow cover Qubole optimizations, and aspects of Apache Hadoop YARN (Hadoop 2.6.x), which are especially important in Qubole clusters.

Significant Parameters in YARN

Qubole offers a Spark-on-YARN variant, so the YARN parameters apply to both Hadoop 2 and Spark.

The parameters that can be useful in Hadoop 2 (Hive) and Spark configuration are described in the following sub-topics.

All platforms:

See also:

Note

See Composing a Hadoop Job for information on composing a Hadoop job.

Configuring Job History Compression

mapreduce.jobhistory.completed.codec specifies the codec to use to compress the job history files while storing them in a Cloud location. The default value is com.hadoop.compression.lzo.LzopCodec.
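
For example, to switch job history compression to a different codec available on the cluster, you could add a Hadoop override such as the following; the codec shown is only an illustration.

mapreduce.jobhistory.completed.codec=org.apache.hadoop.io.compress.GzipCodec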

Configuring Job Runtime

Use yarn.resourcemanager.app.timeout.minutes to configure how many minutes a YARN application can run. This parameter can prevent a runaway application from keeping the cluster alive unnecessarily.

This is a cluster-level setting; set it in the Override Hadoop Configuration Variables field under the Advanced Configuration tab of the Clusters page in the QDS UI. See Advanced Configuration: Modifying Hadoop Cluster Settings for more information.

The Resource Manager kills a YARN application if it runs longer than the configured timeout.

Setting this parameter to -1 means that the application never times out.
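
For example, to terminate any YARN application that runs for more than 12 hours, you could add an override such as the following; the value is illustrative only.

yarn.resourcemanager.app.timeout.minutes=720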

Enabling Container Packing in Hadoop 2 and Spark

Qubole allows you to pack containers in Hadoop 2 (Hive) and Spark. You must enable this feature; it is disabled by default. When enabled, container packing causes the scheduler to pack containers on a subset of nodes instead of distributing them across all the nodes of the cluster. This increases the probability of some nodes remaining unused; these nodes become eligible for downscaling, reducing your cost.

How Container Packing Works

Packing works by separating nodes into three sets:

  • Nodes with no containers (the Low set)
  • Nodes with memory utilization greater than the threshold (the High set)
  • All other nodes (the Medium set)

When container packing is enabled, YARN schedules each container request in this order: nodes in the Medium set first, nodes in the Low set next, and nodes in the High set last.

Configuring Container Packing

Configure container packing as a Hadoop cluster override in the Override Hadoop Configuration Variables field on the Edit Cluster page. See Managing Clusters for more information. The configuration options are listed below, followed by an example:

  • To enable container packing, set yarn.scheduler.fair.continuous-scheduling-packed=true.

  • In clusters smaller than the configured minimum size, containers are distributed across all nodes. This minimum number of nodes is governed by the following parameter:

    yarn.scheduler.fair.continuous-scheduling-packed.min.nodes=<value>. Its default value is 5.

  • A node’s memory-utilization threshold percentage, above which Qubole schedules containers on another node, is governed by the following parameter:

    yarn.scheduler.fair.continuous-scheduling-packed.high.memory.threshold=<value>. Its default value is 60.

    This parameter also denotes the threshold above which a node moves to the High set from the Medium set.
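
Putting these together, an Override Hadoop Configuration Variables entry that enables packing and adjusts both thresholds might look like the following sketch; the values 10 and 75 are illustrative (the defaults are 5 and 60).

yarn.scheduler.fair.continuous-scheduling-packed=true
yarn.scheduler.fair.continuous-scheduling-packed.min.nodes=10
yarn.scheduler.fair.continuous-scheduling-packed.high.memory.threshold=75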

Understanding YARN Virtual Cores

As of Hadoop 2.4, YARN introduced the concept of vcores (virtual cores). A vcore is a share of host CPU that the YARN Node Manager allocates to available resources.

yarn.scheduler.maximum-allocation-vcores is the maximum allocation for each container request at the Resource Manager, in terms of virtual CPU cores. Requests higher than this value do not take effect and are capped at this value.

The default value for yarn.scheduler.maximum-allocation-vcores in Qubole is set to twice the number of CPUs. This oversubscription assumes that CPUs are not always running a thread, so assigning more cores enables maximum CPU utilization.
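
For example, on an instance type with 8 CPUs, you could cap container requests at the number of physical CPUs instead of the doubled default; this override is illustrative, not a recommendation.

yarn.scheduler.maximum-allocation-vcores=8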

Configuring Direct File Output Committer

In general, the final output of a MapReduce job is written to a location in Cloud storage or HDFS, but is first written into a temporary location. The output data is moved from the temporary location to the final location in the task’s commit phase.

When DirectFileOutputCommitter (DFOC) is enabled, the output data is written directly to the final location. In this case, a commit phase is not required. DFOC is a Qubole-specific parameter that is also supported by other big-data vendors. Qubole supports DFOC on Amazon S3n and S3a, and Azure Blob and Data Lake storage.

Note

For DFOC on a Spark cluster, see DFOC in Spark.

The pros and cons of setting DFOC are:

Pros:

  • Improves performance when data is written to a Cloud location. (DFOC does not have much impact on performance when data is written into a HDFS location, because in HDFS the movement of files from one directory to another directory is very fast.)

Cons:

  • DFOC does not perform well in case of failure: stale data may be left in the final location, and workflows generally have to be designed to delete the final location. For this reason, Qubole does not enable DFOC by default. When DFOC is disabled, the abort phase of the task deletes the data in the temporary directory and a retry takes care of data deletion; no stale data is left in the final location.
Enabling DFOC

DFOC can be set in the MapReduce APIs mapred and mapreduce as follows:

  • DFOC in Mapred API:

    mapred.output.committer.class=org.apache.hadoop.mapred.DirectFileOutputCommitter

  • DFOC in Mapreduce API: mapreduce.use.directfileoutputcommitter=true

To set these parameters for a cluster, navigate to the Clusters section of the QDS UI, choose the cluster, and enter both strings in the Override Hadoop Configuration Variables field under the Advanced Configuration tab. You can also set them at the job level.

Improving Performance of Data Writes

To improve the speed of data writes, set the following Qubole options to true (see the example after this list):

  • mapreduce.use.parallelmergepaths for Hadoop 2 jobs
  • spark.hadoop.mapreduce.use.parallelmergepaths for Spark jobs with Parquet data.
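
Set as plain key=value entries, the two flags look like the following sketch; where you place them (for example, in cluster-level Hadoop overrides for Hadoop jobs, or in the Spark configuration for Spark jobs) depends on your setup.

mapreduce.use.parallelmergepaths=true
spark.hadoop.mapreduce.use.parallelmergepaths=true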
Understanding Workload-based Scaling Limits in YARN-based Clusters

Earlier, Qubole’s YARN-level autoscaling required an admin to monitor and resize the cluster as the number of users on that cluster increased. In addition, there was a need to set a maximum limit on the resources that an individual user or cluster workload can use at a given point in time.

To address this, Qubole provides Workload Scaling Limits, a feature that manages user and workload resource limits during autoscaling. An admin only needs to set a default maximum resource limit for a user or an application; the cluster then does not scale up beyond that limit. This removes the need for the admin to manually monitor the cluster size as the number of users on the cluster increases.

To enable this feature, set mapred.hustler.upscale.cap.fairscheduler.max_limits to true in the Qubole cluster’s Hadoop overrides. For information on adding a Hadoop override through the UI, see Managing Clusters.

For information on adding an Hadoop override through a REST API call, see hadoop-settings.

You can directly use fair-scheduler.xml to configure the workload scaling limits. This feature works in conjunction with FairScheduler (FS) configurations described in this table.

Parameter Description
yarn.scheduler.fair.allocation.file The path to the allocation file. An allocation file is an XML manifest describing queues and queue properties, in addition to certain policy defaults. The file must be in XML format. If a relative path is given, the file is searched for on the classpath (which typically includes Hadoop’s conf directory). It defaults to fair-scheduler.xml.
yarn.scheduler.fair.user-as-default-queue Whether to use the username associated with the allocation as the default queue name when a queue name is not specified. If it is set to false or unset, all jobs share a default queue named default. It defaults to true. If a queue placement policy is given in the allocations file, this property is ignored.
yarn.scheduler.fair.allow-undeclared-pools This parameter defaults to true. When it is set to true, new queues can be created at an application’s submission time, either because they are specified as the application’s queue by the submitter or because they are placed there by the user-as-default-queue property. If this parameter is set to false, any application that would be placed in a queue that is not specified in the allocations file is placed in the default queue instead. If a queue placement policy is given in the allocations file, this property is ignored.
Understanding the Resource Allocation in a FairScheduler

It is recommended to set the root queue’s maxResources to a large value. Otherwise, the default maximum limit (queueMaxResourcesDefault) is treated as the root queue’s maxResources, which prevents the cluster from upscaling beyond that maximum value. This is specifically applicable when certain users’ applications or jobs are submitted to their own queues. If you do not set the root queue’s maxResources, the cluster does not upscale as desired, which ultimately deprives such applications or jobs of cluster resources.

Let us consider a sample FairScheduler configuration as given here.

<allocations>
  <queueMaxResourcesDefault>12000 mb, 2 vcores</queueMaxResourcesDefault>
  <clusterMaxAMShare>0.67</clusterMaxAMShare>
  <!-- Set the root queue's maxResources to a large value if jobs of different users are going to have
       their own queues; otherwise queueMaxResourcesDefault would be considered as the root queue's maxResources. -->
  <queue name="root">
    <maxResources>1000000 mb, 100000 vcores</maxResources>
  </queue>
  <queue name="etl">
    <queue name="prod">
      <maxResources>45000 mb, 5 vcores</maxResources>
    </queue>
    <queue name="dev">
      <maxResources>16000 mb, 3 vcores</maxResources>
    </queue>
  </queue>
</allocations>

In the above FS configuration, the default maximum resources limit for a queue is 12000 mb, 2 vcores. A new user or an application that goes into its own queue cannot consume more resources than this maximum limit. Therefore, autoscaling does not occur beyond the application’s or user’s maximum resources limit.

Admins can have custom queues set for different workloads or users having different maximum limits configured by modifying the fair-scheduler.xml. Thus, the admin can set resource limits for individual workloads as well.

Unsupported Features in Hadoop 3

Hadoop 3 supports most of the features that are supported with Hadoop 2 with a few exceptions that are listed below:

  • Native S3 filesystem (s3n)
  • AWS V2 Signature. In Hadoop 3, AWS V4 Signature is used by default for S3 calls.
Hive

This section explains how to use Hive on a Qubole cluster. It covers the following topics:

Introduction

Hive is an Apache open-source project built on top of Hadoop for querying, summarizing, and analyzing large data sets using a SQL-like interface. It is noted for bringing the familiarity of relational technology to Big Data processing with its Hive Query Language as well as structures and operations comparable to those used with relational databases such as tables, JOINs and partitions.

Apache Hive accepts Hive Query Language (similar to SQL) and converts to Apache Tez jobs. Apache Tez is an application framework that can run complex pipelines of operators to process data. It replaces the MapReduce engine.

Hive’s architecture is optimized for batch processing of large ETL jobs and batch SQL queries on very large data sets. Hive features include:

  • Metastore: The Hive metastore stores the metadata for Hive tables and partitions in a relational database. The metastore provides client access to the information it contains through the metastore service API.
  • Hive Client and HiveServer2: Users submit HQL statements to the Hive Client or HiveServer2 (HS2). These function as a controller and manage the query lifecycle. After a query completes, the results are returned to the user. HS2 is a long running daemon that implements many features to improve speed of planning and optimizing HQL queries. HS2 also supports sessions, which provide features such as temporary tables, a useful feature for ETL Jobs.
  • Checkpointing of intermediate results: Apache Hive and Apache Tez checkpoint intermediate results of some stages. Intermediate results are stored in HDFS. Checkpointing allows fast recovery when tasks fail. Hive can restart tasks from the previous check point.
  • Speculative Execution: It helps to improve speed of queries by redoing work that is lagging due to hardware or networking issues.
  • Fair Scheduler: Fair scheduling is a method of assigning resources to jobs such that all jobs get, on average, an equal share of resources over time. When there is a single job running, that job uses the entire cluster. When other jobs are submitted, task slots that free up are assigned to the new jobs, so that each job gets roughly the same amount of CPU time. Unlike the default Hadoop Scheduler, which forms a queue of jobs, this lets short jobs finish in a reasonable time while not starving long jobs. It is also an easy way to share a cluster between multiple users. Fair sharing can also work with job priorities. The priorities are used as weights to determine the fraction of total compute time that each job gets.
About Qubole Hive

Qubole’s Hive distribution is derived from the Apache Hive versions 0.13, 1.2.0, 2.1.1, 2.3, and 3.1. However, there are a few differences in the functionality. Qubole Hive is a self-managing and self-optimizing implementation of Apache Hive.

Qubole Hive:

  • Runs on your choice of popular public cloud providers
  • Leverages the QDS platform’s AIR (Alerts, Insights, Recommendations) capabilities to help data teams focus on outcome, instead of the platform. For more information on AIR, see auto-completion and Getting Data Model Insights and Recommendations.
  • Has agent technology that augments original Hive with a self-managing and self-optimizing platform
  • Is cloud-optimized for faster workload performance
  • Is easier to integrate with existing data sources and tools
  • Provides best-in-class security

Understanding Hive Versions describes the different versions of Hive supported on QDS.

Understanding the Hive Data Model

Data in QDS Hive is organized as tables and table partitions. These tables and partitions can either be created from data that you already have in Cloud storage, or can be generated as an output of running Hive queries. QDS uses HiveQL to query this data. For a primer on Hive, see the Apache Hive wiki.

The following topics are covered in this section:

Types of Hive Tables

Tables in QDS Hive are backed by data residing either in Cloud storage or in HDFS (Hadoop Distributed File System). The table types available in QDS Hive are:

  • External Tables: These tables are assigned an explicit location by the user. When an external table is dropped, Hive does not delete the data in the location that it points to.
  • Regular Tables: These tables do not have an explicit location attribute; Hive assigns each one a location relative to a default location that is fixed for the account. When a regular table is dropped, the data in the table is deleted.
  • Temporary Tables: QDS Hive allows a third form of table that is deleted automatically once the user’s session ends.
  • Mongo Backed Tables: You can create a Hive table whose underlying data resides in a MongoDB collection. When the table is queried, Qubole dynamically fetches the data from the Mongo database. See Mongo Backed Tables for more information.
External Tables

Given pre-existing data in a bucket in Cloud storage, we can create an external table over the data to begin analyzing it, as in the example that follows.

Example:

CREATE EXTERNAL TABLE
miniwikistats (projcode string, pagename string, pageviews int, bytes int)
PARTITIONED BY(dt string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
LINES TERMINATED BY '\n'
LOCATION 'gs://paid-qubole/default-datasets/miniwikistats/';

This table is used as a reference in subsequent examples. It assumes the following:

  1. The data is entirely in text and contains four fields that are separated using the space character; the rows are separated by the newline character.
  2. The data lives under <scheme>/miniwikistats/, where <scheme> is the Cloud-specific URI and path.
  3. The dataset is partitioned (by hour, for example) so that each hour’s statistics are in a separate subfolder. For example, the statistics for the first hour of 20110101 are in <scheme>/miniwikistats/20110101-01/.

Note

  • The CREATE statement above creates a partitioned table, but it does not populate any partitions in it, so the table is empty (even though this Cloud location has data).
Regular Tables

You might create derived data sets while analyzing the original data sets and might want to keep them for a period. In this case, you can create a regular table in Cloud storage, for instance:

CREATE TABLE q1_miniwikistats
AS
SELECT projcode, sum(pageviews)
FROM miniwikistats
WHERE dt >= '20110101' AND dt <= '20110105'
GROUP BY projcode;
Temporary Tables

You may want to force a table to reside in HDFS. Such tables provide faster throughput, but bear in mind that they are automatically deleted at the end of the session.

You can use either TMP or TEMPORARY when creating temporary tables in QDS. See How can I create a table in HDFS? for a discussion of the differences.

Example:

CREATE TMP TABLE tmp_stats AS
SELECT * FROM miniwikistats
WHERE projcode LIKE 'ab%' AND dt > '20110101';

You can look up the location of all tables created in this manner by using DESCRIBE:

DESCRIBE FORMATTED tmp_stats;
Table Storage
  • You do not need to configure storage for temporary tables (which always reside in HDFS) or external tables (for which you explicitly specify the entire location).

  • Regular tables are handled as follows:

    As part of setting up an account, you can set a default location in your Cloud storage, with the credentials needed to access this location (read/write/list). This location is used to store logs and results, and by default, QDS creates regular tables in a subdirectory of this same location, so you don’t need to supply the credentials again to get access to the tables. If you choose to create external tables in a location that is not accessible via the account’s storage credentials, you can specify the credentials as part of the location URL.

Default Tables

To help you get started, QDS creates some read-only tables for each account. These tables are backed by a public requester-pays bucket on cloud object storage. The tables are as follows:

  • default_qubole_demo_airline_origin_destination
  • default_qubole_memetracker: 96 million memes collected between 2008 and 2009
Hive Connectors
Hive-JDBC Connector

QDS provides a Hive connector for JDBC, so you can run SQL queries to analyze data that resides in JDBC tables.

Optimizations such as Support for PredicatePushDown are also available. You can find sample queries and a POM file in Hive JDBC Storage Handler.

Note

Qubole has deprecated its JDBC Storage Handler. Use the open-source JDBC Storage Handler instead. The ADD JAR statement is mandatory only for Hive versions 1.2.0 and 2.1.1, as Hive versions 2.3 and 3.1.1 (beta) already contain the required jars.

Adding Required Jars
add jar gs://paid-qubole/jars/jdbchandler/mysql-connector-java-5.1.34-bin.jar;
add jar gs://paid-qubole/jars/jdbchandler/qubole-hive-JDBC-0.0.7.jar;
Creating a Table

An external Hive table connected to a JDBC table can be created as follows, allowing reads from and writes to the underlying JDBC table.

Example

The table can be created in two ways:

  • You can explicitly give column mappings along with the table creation statement.

    DROP TABLE HiveTable;
    CREATE EXTERNAL TABLE HiveTable(
      id INT, id_double DOUBLE, names STRING, test INT
    )
    STORED BY 'org.apache.hadoop.hive.jdbc.storagehandler.JdbcStorageHandler'
    TBLPROPERTIES (
      "mapred.jdbc.driver.class"="com.mysql.jdbc.Driver",
      "mapred.jdbc.url"="jdbc:mysql://localhost:3306/rstore",
      "mapred.jdbc.username"="-----",
      "mapred.jdbc.input.table.name"="JDBCTable",
      "mapred.jdbc.output.table.name"="JDBCTable",
      "mapred.jdbc.password"="------"
    );
    
  • You can specify no table mappings; the SerDe class automatically generates the mappings.

    CREATE EXTERNAL TABLE HiveTable
    row format serde 'org.apache.hadoop.hive.jdbc.storagehandler.JdbcSerDe'
    STORED BY 'org.apache.hadoop.hive.jdbc.storagehandler.JdbcStorageHandler'
    TBLPROPERTIES (
      "mapred.jdbc.driver.class"="com.mysql.jdbc.Driver",
      "mapred.jdbc.url"="jdbc:mysql://localhost:3306/rstore",
      "mapred.jdbc.username"="root",
      "mapred.jdbc.input.table.name"="JDBCTable",
      "mapred.jdbc.output.table.name" = "JDBCTable",
      "mapred.jdbc.password"=""
    );
    
Usage

The Hive-JDBC connector supports almost all types of SQL queries. Some examples of supported queries are:

Reading Data

> select * from HiveTable;
> select count(*) from HiveTable;
> select id from HiveTable where id > 50000;
> select names from HiveTable;
> select * from HiveTable where names like 'D%';
> select * FROM HiveTable ORDER BY id DESC;

Joining Tables

> select a.*, b.* from HiveTable_1 a join HiveTable_2 b
>    on (a.id = b.id) where a.id > 90000 and b.id > 97000;

Writing Data

Note

Writing data is supported on Qubole on Azure/OCI/GCP only until the Qubole Hive JDBC Storage Handler is deprecated.

> Insert Into Table HiveTable_1 select * from HiveTable_2;
> Insert Into Table HiveTable_1 select * from HiveTable_2 where id > 50;

Group By Queries

> select id, sum(id_double) as sum_double from HiveTable group by id;
Support for PredicatePushDown

To enable/disable PredicatePushDown, add the following configuration.

set hive.optimize.ppd = true/false
Handling Unsuccessful Tez Queries While Querying JDBC Tables

Note

This applies to Qubole on Azure/OCI/GCP only until the Qubole Hive JDBC Storage Handler is deprecated.

In the Hive JDBC connector, the JDBC Storage handler does not work when Input Splits Grouping is enabled in Hive-on-Tez.

As a result, the following exception message is displayed.

java.io.IOException: InputFormatWrapper can not support RecordReaders that don't return same key & value objects.

HiveInputFormat is enabled by default in Tez to support Splits Grouping.

You can avoid the issue by setting the input format to CombineHiveInputFormat, which disables splits grouping, using this command:

set hive.tez.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
Mongo Backed Tables

QDS allows you to create Hive tables that point to a MongoDB collection. When such a table is queried, QDS launches a MapReduce job that fetches the data from the collection, and Hive does further processing. Qubole’s implementation of the connector is based on the code at https://github.com/mongodb/mongo-hadoop.

The following is an example of an SQL statement that points a Hive table to a MongoDB collection.

CREATE EXTERNAL TABLE mongotest (city string, pop int, state string)
STORED BY "com.mongodb.hadoop.hive.MongoHiveStorageHandler"
WITH serdeproperties ("qbol.mongo.input.format"="true")
tblproperties("mongo.input.uri" = "mongodb://<userid>:<password>@<hostname>.mongolab.com:43207/test.zips");

This points the table mongotest to the zips collection in the test database of a MongoLab-hosted instance. Once this table is created, you can query it like an ordinary Hive table.

To get Mongo tables working, add the following setting:

set mongo.input.split.create_input_splits=false

You can now use queries such as the following:

select state, sum(pop) as pop from mongotest group by state

You can also extract data out of Mongo and store it in the Cloud as a normal Hive table.

To limit the load on the Mongo database, QDS limits the number of mappers that can connect to each database. By default, this number is set to 4; that is, at most 4 simultaneous connections are made per MapReduce job to the Mongo DB. To change this, add the following setting:

set mongo.mapper.count=<n>;

where <n> is the number of mappers you want to allow.

Similarly, to connect MapReduce jobs to the read replicas instead of the coordinator, add the following setting:

set mongo.input.split.allow_read_from_secondaries=true;

MapReduce will now use read replicas whenever possible.

Hive Table Formats

This section describes the table formats available in Hive on a Qubole cluster.

Delimited Tables

Delimited tables store each record as one line of delimited text. A sample DDL follows the feature list below.

Features

The features of delimited tables are:

  • Entire record must be one line in text file
  • Field delimiter configurable through DDL
  • Delimited files may be compressed using gzip, bzip2 or lzo
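
A minimal sketch of a delimited-table DDL follows; the table name, columns, delimiter, and location are placeholders rather than part of the original examples, and compressed files (for example, gzip) in the location are read transparently:

CREATE EXTERNAL TABLE sample_delimited (id int, name string, description string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
LINES TERMINATED BY '\n'
LOCATION '<scheme>/path/to/delimited-data/';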
JSON Tables

The JSON serde is useful for parsing data stored as JSON. The JSON implementation has been borrowed from rcongiu.

Data
{"n_nationkey":"5", "n_name":"ETHIOPIA", "n_regionkey":"0", "n_comment":"ven packages wake quickly. regu" }
{"n_nationkey":"6", "n_name":"FRANCE", "n_regionkey":"3", "n_comment":"refully final requests. regular, ironi" }
{"n_nationkey":"7", "n_name":"GERMANY", "n_regionkey":"3", "n_comment":"l platelets. regular accounts x-ray: unusual, regular acco" }
Features

The features of JSON tables are:

  • Entire JSON document must fit in a single line of the text file.
  • Data stored in the JSON format is read directly.
  • Data is converted to the JSON format on INSERT INTO the table.
  • Arrays and maps are supported.
  • Nested data structures are also supported.
Nested JSON Elements

If your data contains nested JSON elements like this:

{"country":"Switzerland","languages":["German","French","Italian"],"religions":{"catholic":[10,20],"protestant":[40,50]}}

You can declare languages as an array<string> and religions as a map<string,array<int>> like this (the location is a placeholder):

CREATE EXTERNAL TABLE json_nested_test (
     country string,
     languages array<string>,
     religions map<string,array<int>>)
 ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
 STORED AS TEXTFILE
 LOCATION '<scheme>..'
;

<scheme> is the Cloud-specific URI and path; for example, gs:// is the URI scheme for GCP.

You can access a nested element like this:

select religions['catholic'][0] from json_nested_test;

This produces the result:

10
Avro Tables

Qubole supports creating Hive tables against data in Avro format.

Getting Avro schema from a file

If you have an Avro file, you can extract the schema using Avro tools. Download avro-tools-1.7.4.jar and run the following command to produce the schema. This schema goes into the serdeproperties in the DDL statement.

$ java -jar avro-tools-1.7.4.jar getschema episodes.avro
{
  "type" : "record",
  "name" : "episodes",
  "namespace" : "testing.hive.avro.serde",
  "fields" : [ {
    "name" : "title",
    "type" : "string",
    "doc"  : "episode title"
  }, {
    "name" : "air_date",
    "type" : "string",
    "doc"  : "initial date"
  }, {
    "name" : "doctor",
    "type" : "int",
    "doc"  : "main actor playing the Doctor in episode"
     } ]
}
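
The extracted schema can then be embedded in the table DDL through the serdeproperties. The following is a sketch; the table name and location are placeholders, and the schema literal is the one produced above (doc fields omitted):

CREATE EXTERNAL TABLE episodes
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
WITH SERDEPROPERTIES ('avro.schema.literal'='{"type":"record","name":"episodes","namespace":"testing.hive.avro.serde","fields":[{"name":"title","type":"string"},{"name":"air_date","type":"string"},{"name":"doctor","type":"int"}]}')
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '<scheme>/path/to/episodes/';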
Understanding Hive Versions

QDS supports the following versions of Hive:

  • Hive 2.1.1 (works with Tez 0.8.4 and Hadoop 2.6.0)
  • Hive 2.3 (works with Tez 0.8.4 and Hadoop 2.6.0)
Using Different Versions of Hive on QDS

The various ways of running Hive using these versions are described in Understanding Different Ways to Run Hive. Hadoop 2 clusters are also known as Hadoop 2 (Hive) clusters.

Use the different versions of Hive on QDS as described below:

  • Hive 2.1.1: QDS Server: self-serve support; Hive-on-Coordinator: default; HiveServer2: self-serve support; Multi-instance HiveServer2: contact QDS Support to enable.
  • Hive 2.3: QDS Server: default; Hive-on-Coordinator: see Running Hive on the Coordinator Node; HiveServer2: self-serve support; Multi-instance HiveServer2: contact QDS Support to enable.

Note

  • LLAP from the Hive open source is not verified in Qubole’s Hive 2.1.1.
  • The /media/ephemeral0/hive1.2/metastore.properties file has been deleted from Hive 2.3 onwards. Remove the dependency on the metastore.properties file if you use this version. Hive 2.3 uses Java 8 while running on QDS servers. It is also compatible with Java 7.

To configure a version, select it from the Hive Version drop-down list when you configure a Hadoop 2 (Hive) cluster in QDS. Create a ticket with Qubole Support for an account-wide configuration. This enables all clusters in the account to use Hive on Coordinator. For more information, see hive-on-cluster-master.

To use Hive Server 2, enable Hive Server 2 under the Advanced Configuration tab of the Hadoop 2 (Hive) cluster. For more information, see Configuring a HiveServer2 Cluster. You can set the versions through the API as described in hive-server-api-parameter.

To run Hive queries on Hive-on-coordinator, refer to hive-on-cluster-master-how. Hive 2.x queries that are not configured to run on QDS servers run on the coordinator node by default.

If you enable Hive Server 2, Hive queries run through HiveServer2.

Managing Hive Bootstrap

The Hive bootstrap script is run before each query is submitted to QDS. Use it for set-up needed by every Hive query run in the account, for example:

  • Adding jars. For more information, see Adding Custom Jars in Hive.
  • Defining temporary functions
  • Setting Hive parameters
  • MapReduce settings for Hive queries

For example, to use test.py in all sessions, add a bootstrap command similar to this for GCP:

add file gs://<object store>/defloc/scripts/bootstrap/test.py;

Hive bootstrap settings can be defined in two ways:

  • User Bootstrap Settings: As a user of the account, if you want to override the account-level bootstrap settings, enable this option. Enabling this option fetches the bootstrap from your default location. You can override the bootstrap settings for a specific account by using the Bootstrap editor.

    Bootstrap Editor allows you to manually write and edit entries. Settings in the bootstrap editor override the settings in the bootstrap file.

  • Account Bootstrap Settings: Setting account-level bootstrap settings enables all users of that account to use the same Hive bootstrap. The account-level settings can also be set to use a default and custom bootstrap location as described here:

    • Use Default Bootstrap Location: The default Cloud location, DEFLOC/scripts/hive, contains the bootstrap file. If you modify the bootstrap file in Cloud storage, the change affects all users that use this file.

      Bootstrap Editor allows you to manually write and edit entries. Settings in the Bootstrap Editor override the settings in the default bootstrap file for the particular account you are logged in to.

    • Use Custom Bootstrap Location: A cloud location other than the default that contains the Hive bootstrap. This custom bootstrap location is useful when you want to use the same bootstrap in multiple accounts.

The user-level Hive bootstrap is loaded after the account-level Hive bootstrap. If there are duplicate entries in the user-level and account-level bootstraps, the user-level Hive bootstrap takes precedence.

See Hive Bootstrap for more information. set-view-bootstrap-api describes the APIs to set and view a Hive bootstrap.

Using the Hive Bootstrap Tab on Control Panel

To configure a Hive bootstrap script, use Hive Bootstrap in the QDS Control Panel.

Clicking Hive Bootstrap displays a sample default view of the Hive Bootstrap tab, as shown here:

_images/HiveBootStrap.png
Configuring Account Bootstrap Settings

You can configure the Hive bootstrap using either the default or a custom bootstrap location. By default, Use Default Bootstrap Location is selected.

Using Default Bootstrap Location

Upload the bootstrap file to the Cloud location if you have not already done so. The default location for a Hive bootstrap is <default location configured in your account>/scripts/hive.

By default, the BootStrap Editor area is blank; use it to create a bootstrap for the current account. To do this, click BootStrap Editor, enter bootstrap scripts, and click Save. The following figure shows an example of overriding a bootstrap configuration.

_images/HiveBootStrapSave.png

Click Save after adding a new bootstrap script.

Using a Custom Bootstrap Location

Choose Use Custom Bootstrap Location if you want to use a bootstrap from a non-default location, or the same bootstrap for multiple accounts. Enter the path of the bootstrap location. A sample non-default location for the Hive bootstrap is illustrated here:

_images/HiveBootstrapCustomLocation.png

Click the file icon next to the Base Bootstrap Location text box to see the contents of the bootstrap file.

Click Save. Click Cancel to retain the previous bootstrap.

Configuring User Bootstrap Settings

In User Bootstrap Settings, QDS allows a user to override the account-wide bootstrap. By default, the user Hive bootstrap location for the current account is <default location configured in your account>/scripts/hive/<accountID>/<unique ID for the user>/bootstrap.

By default, the BootStrap Editor area is blank; use it to create a bootstrap that overrides the account bootstrap file with settings specific to you. To do this, click BootStrap Editor, enter bootstrap scripts, and click Save.

The following figure shows an example of overriding a user-hive-bootstrap.

_images/HiveUserBootstrap.png
Analyzing Data in Hive Tables

Use the Explore page to analyze Hive data. By default, the page displays the Qubole Hive metastore.

From the Hive metastore, select the Hive table that requires data analysis and click its action icon.

Choose Analyze Data from the list.

The query composer opens in a new tab where you can compose your query.

Click Run to execute a query. The query result is displayed in the Results tab.

Connecting to a Custom Hive Metastore

This section covers the following topics:

Creating a Custom Hive Metastore describes how to create a custom Hive metastore from the very beginning.

You can configure QDS access to the metastore either through a Bastion node, or by whitelisting a private IP address.

  • To configure a Bastion node for QDS access, follow these instructions; or
  • To whitelist a private IP address, contact Qubole Support and provide the address to be whitelisted.

Specifying Your Configuration in the QDS UI

  1. From the QDS main menu, choose Explore.
  2. On the resulting page, pull down the menu to the right of Qubole Hive and choose Connect Custom Metastore.
  3. Fill out the fields as follows:
    • Metastore Database Type: MySQL is the only database that is supported.
    • Metastore Database Name: provide the name of the MySQL database hosting the metastore.
    • Host Address:
      • If you are using a Bastion node, enter the Bastion node’s private IP address.
      • If you are whitelisting an address, enter the public IP address corresponding to the private IP address that you provided to Qubole support.
    • Port: Accept the default (3306).
    • User Name: Enter the name of the superuser or administrator user on the host you identified in the Host Address field.
    • Password: Enter the password for the superuser or administrator user.
    • Enable Cluster Access: Check this box to allow direct access between the QDS cluster and the metastore. Qubole recommends that you use direct access.
  4. If you are not using a Bastion node, leave Bastion Node unchecked and click Save to save your changes; otherwise continue with step 5.
  5. Check the box next to Bastion Node to enable access via your Bastion node.
  6. Enter the public IP address or hostname of the Bastion node.
  7. Enter the user name of the superuser or administrative user on the Bastion node.
  8. Enter the private key corresponding to the Bastion node’s public key.
Configuring Thrift Metastore Server Interface for the Custom Metastore

HiveServer2 (HS2) and other processes communicate with the metastore using the Hive Metastore Service (HMS) through the thrift interface. Once HMS is started on a port, HS2, Presto, and Spark can be configured to talk to the metastore through it, in the form thrift://<host>:<port>. The port used for the Hive Metastore Service in Qubole Hive is 10000.

You can configure a thrift metastore server interface to access Hive metadata from the custom metastore in Hive, Presto, and Spark engines as mentioned below:

  • As a Hive bootstrap, set hive.metastore.uris=thrift://<URI>:10000;.
  • As a Presto cluster override, set hive.metastore.uri=thrift://<URI>:10000.
  • As a Spark cluster override, set spark.hadoop.hive.metastore.uris=thrift://<URI>:10000.

Note

Qubole supports configuring the thrift socket connection timeout according to the required value based on the schema table count. To configure the thrift socket connection timeout, create a Qubole support ticket.

Creating a Custom Hive Metastore

Qubole supports using a custom Hive metastore with your Qubole account. By default, each account comes with the Qubole Hive metastore; if you want to use a custom metastore and do not already have one, create a new metastore as described below. Qubole Hive 2.1.1 and later versions support MySQL versions 5.7 and 8.0.

Note

Migrating Data from Qubole Hive Metastore to a Custom Hive Metastore describes how to migrate the data from the Qubole-managed Hive metastore to the custom-managed Hive metastore. Connecting to a Custom Hive Metastore describes how to connect to a custom metastore.

If you face any intermittent lock or dead lock issues, see Intermittent Lock and Deadlock Issues in Hive Metastore Migration to resolve them.

Create a custom Hive metastore by performing these steps:

  1. Log into MySQL and create the metastore database and tables as shown in the example below.

    > mysql -uroot
    ...
    mysql> CREATE DATABASE <database-name>;
    mysql> USE <database-name>;
    mysql> SOURCE <metastore-schema-script>
    
  2. Create a MySQL user and grant access to the metastore database as illustrated in the example below.

    mysql> CREATE USER 'hiveUser'@'%' IDENTIFIED BY 'hivePassword';
    mysql> REVOKE ALL PRIVILEGES, GRANT OPTION FROM 'hiveUser'@'%';
    mysql> GRANT ALL ON hive_metastore.* TO 'hiveUser'@'%';
    mysql> FLUSH PRIVILEGES;
    
  3. Run the SQL scripts to create default tables for different Hive versions listed below:

  4. To continue configuring the custom Hive metastore, perform the steps described in Connecting to a Custom Hive Metastore.

Upgrading the Current Hive Metastore

To upgrade a Hive metastore at version 2.3 or earlier to version 3.1, use the following scripts as appropriate:

Migrating Data from Qubole Hive Metastore to a Custom Hive Metastore

By default, a QDS account comes with the Qubole-managed Hive metastore.

Qubole provides you an option to switch to a custom Hive metastore as described in Connecting to a Custom Hive Metastore.

Creating a Custom Hive Metastore describes how to create a custom Hive metastore from the very beginning.

The following section provides the steps to migrate data from the Qubole-managed Hive metastore when you decide to switch to a custom Hive metastore.

Prerequisites

Schedule downtime for the data migration from the Qubole-managed Hive metastore to the custom Hive metastore.

Migrating Data from the Qubole-managed Hive Metastore

Perform these steps to migrate data from Qubole-managed Hive metastore to the custom Hive metastore:

Note

If you face any intermittent lock or dead lock issues, see Intermittent Lock and Deadlock Issues in Hive Metastore Migration to resolve them.

  1. Ensure that you have a database instance that is appropriate for the workload.
  2. For security reasons, the RDS instance should be in a private subnet; its bastion node can be in a public subnet.
  3. Create a ticket with Qubole Support requesting for the Qubole Hive metastore data dump.
  4. After receiving the DB dump from Qubole Support, push it to the RDS instance that you identified in step 1.
  5. (Optional) To create a metastore from the very beginning, see Creating a Custom Hive Metastore.
Adding Custom Jars in Hive

When adding custom jars in Hive, it is strongly recommended to avoid overriding the hive.aux.jars.path property in HIVE SETTINGS and HADOOP CLUSTER SETTINGS in the Hadoop (Hive) cluster. Instead, you can add jars in any of these ways:

  • A system admin should create a ticket with Qubole Support to change the settings that allow adding custom jars. Because this is one of Qubole’s security features, the request must come from a system admin of the account.

    After Qubole Support enables the settings, use the add jar statement to add custom jars at a query level or through the Hive bootstrap.

  • Add the custom jar by copying it into the /usr/lib/hive1.2/auxlib directory through the cluster’s node bootstrap. This option is unavailable if you run Hive queries on the Qubole Control Plane; you can add custom jars this way only if you run Hive queries on the coordinator node or HiveServer2.

Using Hive on Tez

This section explains how to configure and use Hive on Tez in a Qubole cluster. It covers the following topics:

Running Hive Queries on Tez

To run Hive queries on Tez, you need to:

Note

While running a Tez query on a JDBC table, you may get an exception that you can debug by using the workaround described in Handling Unsuccessful Tez Queries While Querying JDBC Tables.

Configure and Start a Hadoop (Hive) Cluster

A Hadoop (Hive) cluster is configured by default in QDS. The default cluster should work well for Hive queries on Tez, but if you modify it, make sure the instances you choose for the cluster nodes have plenty of local storage; disk space used for queries is freed up only when the Tez DAG is complete.

The ApplicationMaster also takes up more memory for multi-stage jobs than it needs for similar MapReduce jobs, because in the Tez case it must keep track of all the tasks in the DAG, whereas MapReduce processes one job at a time.

Configure ApplicationMaster Memory

To make sure that the ApplicationMaster has sufficient memory, set the following parameters for the cluster on which you are going to run Tez:

tez.am.resource.memory.mb=<Size in MB>;
tez.am.launch.cmd-opts=-Xmx<Size in MB>m;

To set these parameters in QDS, go to the Control Panel and choose the pencil item next to the cluster you are going to use; then on the Edit Cluster page enter the parameters into the Override Hadoop Configuration Variables field.

Do pre-production testing to determine the best values. Start with the value currently set for MapReduce; that is, the value of yarn.app.mapreduce.am.resource.mb (stored in the Hadoop file mapred-site.xml). You can see the current (default) value in the Recommended Configuration field on the Edit Cluster page. If out-of-memory (OOM) errors occur under a realistic workload with that setting, start bumping up the number as a multiple of yarn.scheduler.minimum-allocation-mb, but do not exceed the value of yarn.scheduler.maximum-allocation-mb.
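
For example, if yarn.app.mapreduce.am.resource.mb were 4096, the override might look like the following sketch; the numbers are illustrative only, with -Xmx set to roughly 80% of the container size:

tez.am.resource.memory.mb=4096
tez.am.launch.cmd-opts=-Xmx3276m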

Enable Offline Job History

To enable offline job history, set the following parameter for the cluster on which you are going to run Tez:

yarn.ahs.leveldb.backup.enabled = true

To set this parameter in QDS, go to the Control Panel and choose the pencil item next to the cluster you are going to use; then on the Edit Cluster page enter the parameter into the Override Hadoop Configuration Variables field.

Start or Restart the Cluster

To start the cluster, click on the arrow to the right of the cluster’s entry on the Clusters page in the QDS Control Panel.

Configure Tez as the Hive Execution Engine

You can configure Tez as the Hive execution engine either globally (for all queries) or for a single query at query time.

To use Tez as the execution engine for all queries, enter the following text into the bootstrap file:

set hive.execution.engine = tez

To use Tez as the execution engine for a single Hive query, use the same command, but enter it before the query itself in the QDS UI.
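
For example, a single query could be submitted like this, reusing the miniwikistats table from the earlier examples (the query itself is illustrative):

set hive.execution.engine=tez;
SELECT projcode, sum(pageviews)
FROM miniwikistats
WHERE dt = '20110101'
GROUP BY projcode;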

To use Tez globally across your QDS account, set it in the account-level Hive bootstrap file. For more information, see Managing Hive Bootstrap and set-view-bootstrap-api.

Configure YARN ATS Version with Tez

You may want to choose YARN ATS v1.5 instead of the default, ATS v1, as v1.5 provides more scalability and reliability. In particular, you may want to switch to v1.5 if you run many concurrent queries using Tez.

ATS v1.5 for Tez is supported only in Hive versions 2.1.1 and 2.3. To switch to ATS v1.5, create a ticket with Qubole Support.

Configuring Custom Tez Shuffle Handler

Qubole supports custom Tez shuffle handler in Hive 3.1.1 (beta), which can speed up the worker nodes’ downscaling process in a Hadoop (Hive) cluster. It is not available by default. Create a ticket with Qubole Support to enable the custom Tez Shuffle handler.

Qubole Hive uses the MapReduce shuffle handler by default. A running Tez application has several DAGs that complete before the application is completed. The shuffle data of a completed DAG is not cleared until the application terminates. As multiple DAGs run sequentially in an application, the shuffle data of completed DAGs hinders cluster downscaling. So, to speed up downscaling and cut down on the cost of running worker nodes, Qubole recommends switching to the custom Tez shuffle handler if you are using Hive 3.1.1 (beta) on a Hadoop (Hive) cluster.

Supported Tez Versions

Tez is the default query execution engine in Hive. The following table describes the supported Tez version in each Hive version.

  • Hive 2.1.1: Tez 0.8.4
  • Hive 2.3: Tez 0.8.4 and Tez 0.9.1. Hive 2.3 support for Tez 0.9.1 is not enabled by default; you can enable it in the Account Features UI in the Control Panel of the Qubole UI. To learn how to enable it, see Managing Account Features. Tez 0.9.1 helps in downscaling the cluster by cleaning up shuffle data aggressively.
  • Hive 3.1.1 (beta): Tez 0.9.1, which helps in downscaling the cluster by cleaning up shuffle data aggressively.
Understanding Considerations to Move Existing MapReduce Jobs in Hive to Tez

You can move existing MapReduce jobs in Hive to Tez.

Tez can be enabled for the entire cluster at the bootstrap level or for individual queries at runtime by setting hive.execution.engine = tez. If administrators configure Tez for the entire cluster then individual queries can be reverted to MapReduce by setting hive.execution.engine = mr at the start of the job.

For more information on queries, see Running Hive Queries on Tez.

There are certain considerations that you need to understand before moving MapReduce jobs in Hive to Tez, which are described in these sub-topics:

Understanding Log Pane

The number of Tasks for each of the Mapper or Reducer Vertices is displayed in the Logs pane. The information is displayed as A (+B, -C) / D where:

  • A is the number of completed tasks
  • B is the number of running tasks
  • C is the number of failed tasks
  • D is the total number of tasks
Understanding Application Memory

Out-of-memory errors may occur when an exceptionally large number of tasks are executed in parallel or too many files are involved in the split computation. Managing the ApplicationMaster configuration can ensure that these types of issues do not occur. This memory is controlled with tez.am.resource.memory.mb, and a good starting point for this value may be yarn.app.mapreduce.am.resource.mb. The memory available to the ApplicationMaster’s JVM is controlled with tez.am.launch.cmd-opts, which is typically set to 80% of tez.am.resource.memory.mb.

Understanding Container Memory

Container limitation issues may occur if the amount of memory required is more than what is available per the allocation policy. If this occurs, Tez throws an error indicating that it is killing the container in response to the container’s demands. The container size is set with hive.tez.container.size and must be a multiple of yarn.scheduler.minimum-allocation-mb. The child Java options are controlled through hive.tez.java.opts and should be set to approximately 80% of hive.tez.container.size.
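
A sketch of these two settings with illustrative values (assuming a container size of 4096 MB that is a multiple of yarn.scheduler.minimum-allocation-mb, with the Java heap at roughly 80% of the container size):

set hive.tez.container.size=4096;
set hive.tez.java.opts=-Xmx3276m;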

Understanding Split Sizing

Split computation occurs in the ApplicationMaster; by default the maximum split size is 1 GB and the minimum split size is 50 MB. You can modify the split sizing policy by modifying tez.grouping.max-size and tez.grouping.min-size. Tez uses HiveInputFormat in conjunction with the grouping settings to ensure that the number of Mappers does not become a bottleneck. This is different from MapReduce, which uses CombineHiveInputFormat by default and can therefore result in fewer Mapper tasks. As a result, it can be misleading to compare the number of Mapper tasks between MapReduce and Tez to gauge performance improvements.
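
A sketch of the grouping settings at the default values mentioned above (50 MB and 1 GB, expressed in bytes):

set tez.grouping.min-size=52428800;
set tez.grouping.max-size=1073741824;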

Enabling Split Pruning

To enable Split Pruning during the split computation, configure the following:

set hive.optimize.index.filter = true;
set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
Understanding Task Parallelism

The parallelism across the reducers is set by affecting the average reducer size in bytes: hive.exec.reducers.bytes.per.reducer is the configuration option, and as its value decreases, more reducers are introduced for load distribution across tasks. The parallelism across the mappers is set by affecting tez.am.grouping.split-waves, which indicates the ratio between the number of tasks per vertex and the number of available containers in the queue. As this value decreases, more parallelism is introduced but fewer resources are allocated to a single job.
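
A sketch of the two settings using the property names from this section; the values are illustrative only and should be tuned for your workload:

set hive.exec.reducers.bytes.per.reducer=268435456;
set tez.am.grouping.split-waves=1.7;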

Understanding Garbage Collection

While often inconsequential, the garbage collection process can lead to increased run time when increasingly complex data types and queries are used. The amount of time taken for garbage collection can be identified through the Tez Application UI or by enabling hive.tez.exec.print.summary. If garbage collection times are higher than acceptable or expected, consider which components of the Hive functionality may be increasing the runtime.

Understanding Map Join

When taking advantage of Map Joins in Hive, keep in mind that the larger and more complex the Hash Table used for the Map Join, the greater the burden on the Garbage Collection process. If the Map Join is necessary to avoid a Shuffle Join or due to performance considerations, then it may be necessary to increase the container size so that additional resources are available for Garbage Collection. If the Map Join is not needed, then consider disabling or decreasing the value of hive.auto.convert.join.noconditionaltask.size to force the query to use a Shuffle Join.

Understanding ORC Insert

When inserting into a table that writes to an ORC file, if there are a large number of columns present, consider reducing hive.exec.orc.default.buffer.size or increasing the container size.

Understanding Partition Insert

During partitioned inserts, performance may suffer if a large number of tasks insert into multiple partitions at the same time. If this is observed, consider enabling hive.optimize.sort.dynamic.partition. Do this only if you are inserting into more than 10 partitions, because it can have a negative impact on performance with a very small number of partitions.
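
For example, before an insert into many partitions at once:

set hive.optimize.sort.dynamic.partition=true;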

Understanding Hive Statistics

You can run the following command to trigger accurate size accounting by the compiler:

ANALYZE TABLE [table_name] COMPUTE STATISTICS for COLUMNS

After executing the above statement, enable hive.stats.fetch.column.stats, which triggers the Hive physical optimizer to use more accurate per-column statistics instead of the uncompressed file size represented by HDFS. After collecting and calculating statistics, consider enabling the cost-based optimizer (CBO) with hive.cbo.enable.
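
Putting these steps together for the miniwikistats table used earlier (a sketch; substitute your own table name):

ANALYZE TABLE miniwikistats COMPUTE STATISTICS FOR COLUMNS;
set hive.stats.fetch.column.stats=true;
set hive.cbo.enable=true;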

Understanding Different Ways to Run Hive

On QDS, you can run Hive in the following different ways:

QDS supports Hive version 2.1.1 on all the above ways of running Hive. For more information on versions, see Understanding Hive Versions.

Pros and Cons of Each Method to Run Hive provides a table that lists pros, cons, and recommended scenario of each method.

Running Hive through QDS Servers

Here is the architecture that depicts how Hive runs through QDS Servers. Pros and Cons of Each Method to Run Hive provides a table that lists pros, cons, and recommended scenario of this method.

_images/HiveviaQDS-Server.png
Running Hive on the Coordinator Node

Here is the architecture that depicts how Hive runs on the cluster’s Coordinator node. Pros and Cons of Each Method to Run Hive provides a table that lists pros, cons, and recommended scenario of this method.

_images/HiveonMaster.png
Running Hive with HiveServer2 on the Coordinator Node

Here is the architecture that depicts how Hive runs on HiveServer2 (HS2). Pros and Cons of Each Method to Run Hive provides a table that lists pros, cons, and recommended scenario of this method.

_images/HS2-Standalone.png
Running Hive with Multi-instance HiveServer2

Here is the architecture that depicts how Hive runs on multi-instance HiveServer2 (HS2). Pros and Cons of Each Method to Run Hive provides a table that lists pros, cons, and recommended scenario of this method.

_images/QuboleHiveMulti-InstanceHS2.png
Pros and Cons of Each Method to Run Hive

The following describes the pros, cons, and recommended scenario for each method of running Hive.

Running Hive through QDS Servers

  • Pros: It is scalable, as the Hadoop 2 (Hive) cluster autoscales based on the number of queries.
  • Cons: If the custom Hive metastore is in a different AWS region, the latency is high.

This method is recommended when you:

  • Are a beginner
  • Handle a lower query traffic
  • Use the Qubole Hive metastore

Running Hive on the Coordinator Node

  • Pros: It secures the data and reduces the latency if you use a custom Hive metastore.
  • Cons: It requires a large-sized coordinator node, and the coordinator node is not scalable.

This method is recommended when you:

  • Handle a low-to-medium query traffic
  • Can afford a high-memory EC2 instance type
  • Use a custom metastore in a different AWS region

Running Hive with HiveServer2 on the Coordinator Node

  • Pros: It secures the data and reduces the latency if you use a custom Hive metastore.
  • Cons: It requires a suitable HS2 memory configuration, and it can be a single point of failure under a higher workload, in which case it is not scalable.

This method is recommended when you:

  • Handle a medium-to-high query traffic
  • Use a custom metastore in a different AWS region
  • Want to use other HS2 features such as metadata caching

Running Hive with Multi-instance HiveServer2

  • Pros: It secures the data and reduces the latency if you use a custom Hive metastore. It is more reliable and handles workloads more scalably.
  • Cons: Maintaining the HS2 cluster is an additional cost.

This method is recommended when you:

  • Handle a high query traffic and higher concurrency
  • Prefer scalability and high availability at the higher cost of maintaining an HS2 cluster
  • Want to use it in a large enterprise
  • Use a custom metastore in a different AWS region
  • Use HS2 features such as metadata caching
Configuring Map Join Options in Hive

Map join is a Hive feature used to speed up Hive queries. It lets a smaller table be loaded into memory so that a join can be performed within a mapper, without a Map/Reduce step. If queries frequently depend on small-table joins, using map joins speeds up their execution. In a map join, the smaller table is loaded in memory and the join is done in the map phase of the MapReduce job. Because no reducers are necessary, map joins are much faster than regular joins.

In Qubole Hive, the map join options are enabled by default and have default values.

Here are the Hive map join options:

  • hive.auto.convert.join: By default, this option is set to true. When it is enabled and a table smaller than 25 MB (hive.mapjoin.smalltable.filesize) is found in a join, the join is converted to a map-based join.
  • hive.auto.convert.join.noconditionaltask: Applies when three or more tables are involved in the join condition. With hive.auto.convert.join, Hive generates three or more map-side joins on the assumption that all the tables are small. With hive.auto.convert.join.noconditionaltask, you can combine these into a single map-side join if the combined size of the n-1 smaller tables is less than 10 MB, the threshold defined by hive.auto.convert.join.noconditionaltask.size (see the sketch below).
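
A sketch of these options with the thresholds described above (values in bytes); they can be set per query or in the Hive bootstrap:

set hive.auto.convert.join=true;
set hive.mapjoin.smalltable.filesize=25000000;
set hive.auto.convert.join.noconditionaltask=true;
set hive.auto.convert.join.noconditionaltask.size=10000000;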

Outer joins are not always converted to map joins, as described below:

  • Full outer joins are never converted to map-side joins.
  • A left-outer join is converted to a map join only if the right table (the table on the right side of the join condition) is less than 25 MB in size.
  • Similarly, a right-outer join is converted to a map join only if the left table is less than 25 MB in size.
Running a Hive Query

This page explains how to run a simple Hive query from the QDS UI. You must have an active QDS account; to create a new account, see Managing Your Accounts.

Step 1: Explore Tables

Navigate to the Workbench page.

Click the Tables tab. You will see a list of databases.

  1. Click on a database to view the list of all the tables in it.
  2. All accounts have access to two pre-configured tables in the default database: default_qubole_airline_origin_destination and default_qubole_memetracker.
  3. To see the list of columns in a specific table, click on the arrow sign to the left of the table name.
Step 2: View Sample Rows

Now, execute a simple query against this table by entering the following text in the query box:

select * from default_qubole_memetracker limit 10

Click Run. Within a few seconds, you should see 10 rows from the table show up in the Results tab.

_images/hive_query_result.png

Figure: View Some Rows

Step 3: Analyze Data

To get the total number of rows in the table corresponding to August 2008, use the following query:

select count(*) from default_qubole_memetracker where month='2008-08'

This query is more complex than the previous one and requires additional resources. Behind the scenes, QDS provisions a Hadoop cluster, which may take a couple of minutes.

Once the cluster is provisioned, you can follow the query’s progress under the Log tab.

When the query completes, you can see the logs and results under the respective tabs.

See also: Composing a Hive Query.

Optimizing Hive Queries

This section describes optimizations related to Hive queries.

Changing SerDe to DelimitedJSONSerDe for Results with Complex Datatypes

Qubole Hive converts all SELECT queries to INSERT-OVERWRITE-DIRECTORY (IOD) format to save results back to a Cloud location.

When writing data to a directory, Apache Hive uses LazySimpleSerDe for serialization (writing) of results/data. But LazySimpleSerDe does not honor a <key,value> structure and ignores keys. Honoring the keys is important for displaying columns with complex datatypes correctly.

In Qubole Hive, setting hive.qubole.directory.serde=org.apache.hadoop.hive.serde2.DelimitedJSONSerDe changes the SerDe to DelimitedJSONSerDe, which honors more complex datatypes such as Maps and Arrays. When set, this configuration lets you view and use query results with complex datatypes correctly.
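
For example, you can set the property before a query, or add it to the Hive bootstrap so it applies to every query in the account:

set hive.qubole.directory.serde=org.apache.hadoop.hive.serde2.DelimitedJSONSerDe;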

Handling Direct Writes of INSERT OVERWRITE Query Results

For INSERT OVERWRITE queries, Qubole Hive allows you to directly write the results to Google Cloud Storage. Apache Hive normally writes data to a temporary location and then moves it to Google Cloud Storage. In Google Cloud Storage, moving data is expensive because it requires copy and delete operations. So, directly writing the INSERT OVERWRITE query results to Google Cloud Storage is an optimization that Qubole Hive offers.

However, there is an issue that you may face while writing INSERT OVERWRITE query results to Google Cloud Storage. While writing to a partitioned and bucketed table using the INSERT OVERWRITE command, multiple reducers may simultaneously write result files to a specific table’s location in Google Cloud Storage. Because it is an INSERT OVERWRITE command, the existing files written before the current job in the table’s location are deleted before the reducer tasks write new result files. In this situation, a scenario can occur where the delete request sent to Google Cloud Storage by one reduce task (say R1) gets throttled by Google Cloud Storage, and by that time another reduce task (say R2) deletes the old files and writes a new file. The delete request sent by reduce task R1 is then processed by Google Cloud Storage and ends up deleting the file written by reduce task R2. To overcome this issue, Qubole provides an enhancement that prevents files from being deleted by reduce tasks. The enhancement is not enabled on the QDS account by default; to enable it on the account, create a ticket with Qubole Support.

When the enhancement is enabled, a prefix which is unique per job is added to the result files. This ensures that only old files which do not have the latest prefix are deleted. Thus it solves the data loss issue which can happen due to Google Cloud Storage throttling when multiple reducers try to simultaneously write query results.

For more information, see Hive Administration and Analyze.

Presto

This section explains how to configure and use Presto on a Qubole cluster. The configuration and usage are also categorized based on the two stable Presto versions.

Introduction

Presto is an open source distributed SQL query engine developed by Facebook. Presto is used for running interactive analytic queries against data sources of all sizes ranging from gigabytes to petabytes.

Presto was designed and written from the ground up for interactive analytics and approaches the speed of commercial data warehouses. Facebook uses Presto for interactive queries against several internal data stores, including its 300PB data warehouse. Over 1,000 Facebook employees use Presto every day to run more than 30,000 queries that in total scan over a petabyte of data per day. Learn more at prestosql.io.

The execution model of Presto is fundamentally different from Hive or MapReduce. Hive translates queries into multiple stages of MapReduce tasks that execute one after the other. Each task reads inputs from disk and writes intermediate output back to disk. In contrast, the Presto engine does not use MapReduce. It employs a custom query and execution engine with operators designed to support SQL semantics. In addition to improved scheduling, processing is in memory and pipelined across the network between stages. This avoids unnecessary I/O and associated latency overhead. The pipelined execution model runs multiple stages at once and streams data from one stage to the next as it becomes available. This significantly reduces end-to-end latency for many types of queries. For more information, see Presto’s architecture.

Note

Presto is supported on AWS, Azure, and GCP Cloud platforms; see QDS Components: Supported Versions and Cloud Platforms.

Sample Use Case

Qubole’s Presto-as-a-Service is primarily intended for Data Analysts who need to translate business questions into SQL queries. Since the questions are often ad-hoc, there is some trial and error involved; arriving at the final results may involve a series of SQL queries. By reducing the response time of these queries, the platform can reduce the time to insight and greatly benefit the business.

The typical use case involves a few 10GB-100TB tables in the Cloud. Tables are generally partitioned by date or other attributes. Analyst queries pick a few partitions at a time, typically span a week to a month of data, and involve WHERE clauses. Queries may involve a JOIN with a smaller table, and contain aggregate functions and GROUP-BY clauses.

Presto as a Service

Qubole provides Presto as a service for fast, inexpensive, and scalable data processing.

Note

For the latest information on QDS support for Presto, see QDS Components: Supported Versions and Cloud Platforms.

Supported Data Formats

Presto supports the following data formats:

  • Hive tables in the Cloud and HDFS.
  • Delimited, CSV, RCFile, JSON, SequenceFile, ORC, Avro, and Parquet. Other file formats are also supported by adding relevant jars to Presto through the Presto Server Bootstrap.
  • Data compressed using GZIP.
Advantages of QDS Presto Clusters
  • You can optimize your clusters by choosing the instance type most suitable to your workload.
  • You can launch clusters in any region or location.
  • QDS provides Cloud-specific optimizations.
  • By default, QDS automatically terminates idle clusters to save cost.
  • QDS starts clusters only when necessary: when a query is run and no Presto cluster is running; otherwise, QDS reuses a cluster that is already running.
  • Autoscaling continuously adjusts the cluster size to the Presto workload.
  • You can configure the amount of cluster memory allocated for Presto.
A Better User Experience
  • Multiple QDS users can submit queries to the same Presto cluster.
  • Query logs and results are always available (use the History tab on the Workbench page of the QDS UI).
  • QDS provides detailed execution metrics for each Presto query.
  • Users can create workflows that combine Hadoop jobs, Hive queries, and Presto queries.
Security

QDS can provide table-level security for Hive tables accessed via Presto; to enable it, set hive.security to sql-standard in catalog/hive.properties. See Understanding Qubole Hive Authorization for more information.

Running a First Presto Query

By default, your account has a cluster named presto on which Presto queries run. You can modify this cluster and create others; see Configuring a Presto Cluster for instructions.

Note

Presto is supported on AWS, Azure, and GCP Cloud platforms; see QDS Components: Supported Versions and Cloud Platforms.

Navigate to the Workbench (beta) page and click the Create button. Select Presto Query from the drop-down list.

You can run a query against a pre-populated table that records the number of flight itineraries in every quarter of 2007.

For example, run the following command:

select quarter, count(*) from default_qubole_airline_origin_destination where year='2007' group by quarter;

If the Presto cluster is not active, the query automatically starts it, and that may take a few minutes. You can watch the progress of the job under the Logs tab; when it completes, you can see the query results under the Results tab.

Congratulations! You have just run your first Presto query on QDS!

For more information, see Composing a Presto Query, Presto FAQs, and the other topics in this Presto section.

Inserting Data

Note

Presto is supported on AWS, Azure, and GCP Cloud platforms; see QDS Components: Supported Versions and Cloud Platforms.

You may want to write results of a query into another Hive table or to a Cloud location. QDS Presto supports inserting data into (and overwriting) Hive tables and Cloud directories, and provides an INSERT command for this purpose. It is currently available only in QDS; Qubole is in the process of contributing it to open-source Presto.

Keep in mind that Hive is a better option for large scale ETL workloads when writing terabytes of data; Presto’s insertion capabilities are better suited for tens of gigabytes.

The INSERT syntax is very similar to Hive’s INSERT syntax.

Let us use default_qubole_airline_origin_destination as the source table in the examples that follow; it contains flight itinerary information.

Cloud Directories

You can write the result of a query directly to Cloud storage in a delimited format; for example:

INSERT INTO directory '<scheme>qubole.com-siva/experiments/quarterly_breakdown'
SELECT origin,
       quarter,
       count(*) AS c
FROM default_qubole_airline_origin_destination
WHERE YEAR='2007'
GROUP BY quarter,
     origin;

<scheme> is the Cloud-specific URI scheme: gs:// for GCP.

Here is a preview of what the result file looks like using cat -v. Fields in the results are ^A (ASCII code \x01) separated.

"DFW"^A1^A334973
"LAX"^A1^A216789
"OXR"^A1^A456
"HNL"^A1^A78271
"IAD"^A1^A115924
"ALB"^A1^A20779
"ORD"^A1^A414078
Simple Hive Tables

The target Hive table can be delimited, CSV, ORC, or RCFile. Qubole does not support inserting into Hive tables using custom input formats and serdes. You can create a target table in delimited format using the following DDL in Hive.

CREATE TABLE quarter_origin (quarter string, origin string, count int)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS TEXTFILE;

Run desc quarter_origin to confirm that the table is visible to Presto. It can take up to 2 minutes for Presto to pick up a newly created table in Hive. Now run the following insert statement as a Presto query.

INSERT INTO TABLE quarter_origin
SELECT quarter,
       origin,
       count(*)
FROM default_qubole_airline_origin_destination
WHERE YEAR='2007'
GROUP BY quarter,
     origin;

You can now run queries against quarter_origin to confirm that the data is in the table.

SELECT *
FROM quarter_origin LIMIT 5;

Similarly, you can overwrite data in the target table by using the following query.

INSERT OVERWRITE TABLE quarter_origin
SELECT quarter,
       origin,
       count(*)
FROM default_qubole_airline_origin_destination
WHERE YEAR='2007'
GROUP BY quarter,
     origin;
Partitioned Hive Tables

You can also partition the target Hive table; for example (run this in Hive):

CREATE TABLE quarter_origin_p (origin string, count int)
PARTITIONED BY (quarter string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS TEXTFILE;

Now you can insert data into this partitioned table in a similar way. The only catch is that the partitioning column must appear at the very end of the select list. In the example below, the column quarter is the partitioning column.

INSERT INTO TABLE quarter_origin_p
SELECT origin,
       count(*),
       quarter
FROM default_qubole_airline_origin_destination
WHERE YEAR='2007'
GROUP BY quarter,
     origin;

Note that the partitioning attribute can also be a constant. You can use overwrite instead of into to erase previous content in partitions.

Configuring the Concurrent Writer Tasks Per Query

Caution

Use this configuration judiciously to prevent overloading the cluster through excessive resource utilization. It is recommended to set a higher value through session properties only for queries that generate large outputs, for example, ETL jobs.

Presto provides a configuration property to define the per-node count of Writer tasks for a query. You can set it at the cluster level and at the session level. By default, INSERT or CREATE TABLE AS SELECT operations create one Writer task per worker node, which can slow down the query if there is a lot of data to be written. In such cases, you can use the task_writer_count session property, setting its value to a power of 2, to increase the number of Writer tasks per node and speed up data writes.

The cluster-level property that you can override in the cluster is task.writer-count. Its value must also be a power of 2.
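
For example, a minimal sketch using the tables from the earlier INSERT examples; the value 4 is illustrative and must be a power of 2. The session property applies to a single session or query, and the cluster-level default can be overridden through the cluster's Presto configuration (config.properties).

SET SESSION task_writer_count = 4;

INSERT OVERWRITE TABLE quarter_origin
SELECT quarter,
       origin,
       count(*)
FROM default_qubole_airline_origin_destination
WHERE YEAR='2007'
GROUP BY quarter,
         origin;

config.properties:
task.writer-count=4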

Accessing Hive Views

A view is a logical table that future queries can refer to. A view does not contain any data; instead, it stores the query that defines it. The stored query is run each time the view is referenced by another query.

Presto can access Hive views but these conditions apply:

  • Hive views are accessible in Presto, but only on a best-effort basis.
  • Presto can access a Hive view only when its statement contains ANSI SQL dialects that Presto supports.
  • Presto cannot access a Hive view when its statement contains non-ANSI SQL dialects.
Troubleshooting Hive Views Failures in Presto

The examples below show a few Hive view failures and their workarounds:

  • A Hive view statement that contains functions not defined in Presto is inaccessible. For example, a Hive view with the statement SELECT QUOTE(MY_STRING_COLUMN) FROM MY_TABLE does not work in Presto because the QUOTE function is not defined. For such statements, add a UDF to Presto that defines the QUOTE function.

  • A Hive view statement that quotes strings with double quotes (") does not work in Presto. For example, SELECT * FROM MY_TABLE WHERE MY_STRING_COLUMN = "EXPECTED_VALUE" does not work in Presto because the string value is in double quotes.

    To make this work in Presto, use single quotes (') to quote string values. For example, use SELECT * FROM MY_TABLE WHERE MY_STRING_COLUMN = 'EXPECTED_VALUE' (string value in single quotes) as the Hive view statement.

  • A Hive view statement that uses syntax Presto does not support does not work. For example, SELECT cast(MY_INT_COLUMN as String) FROM MY_TABLE does not work in Presto.

    To make this work in Presto, use SELECT cast(MY_INT_COLUMN as varchar) FROM MY_TABLE as the Hive view statement.

In general, when a Hive view with a long SQL statement does not work in Presto, it is not apparent from reading the SQL statement which part of it is non-ANSI compliant. For such Hive views, it is recommended to break the Hive view statement into smaller parts and run them as individual Presto queries. This helps in tracing the non-ANSI-compliant part(s) of the statement. After tracing such part(s) in the Hive view statement, you need to:

  1. Convert the non-ANSI-compliant parts of the Hive statement into ANSI-compliant SQL.
  2. Recreate the Hive view with the new statement to make it accessible to Presto.
Presto FAQs
  1. How is Presto different from Hive?
  2. How is Qubole’s Presto different from open-source Presto?
  3. Where do I find Presto logs?
  4. Why are new nodes not being used by my query during upscaling?
  5. Where can I find the Presto Server Bootstrap logs?
  6. How can I optimize the Presto query?
How is Presto different from Hive?

As a user, you should be aware of certain differences between Presto and Hive, even though both can execute SQL-like queries.

Presto:

  • Does not support user-defined functions (UDFs). However, Presto has a large number of built-in functions. Qubole provides additional UDFs, which can be added only before cluster startup; runtime UDF additions, as in Hive, are not supported.
  • Does not reorder JOINs automatically. Ensure that the smaller table is on the right side of the JOIN (see the example after this list).
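
For example, a minimal sketch (the table names are illustrative; orders is assumed to be much larger than airports):

SELECT o.origin,
       a.city,
       count(*) AS c
FROM orders o              -- larger table on the left
JOIN airports a            -- smaller table on the right
  ON o.origin = a.code
GROUP BY o.origin,
         a.city;
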
How is Qubole’s Presto different from open-source Presto?

Note

Presto is supported on AWS, Azure, and GCP Cloud platforms; see QDS Components: Supported Versions and Cloud Platforms.

While Qubole’s Presto offering is heavily based on open-source Presto, there are a few differences. Qubole’s Presto:

  • Supports inserting data into Cloud Object Storage directories
  • Supports INSERT OVERWRITES
  • Supports autoscaling clusters
  • Supports GZIP compression
  • Supports data traffic encryption among the Presto cluster nodes
  • Supports additional connectors such as Kinesis and SerDes such as AVRO and Openx JSON
Where do I find Presto logs?
  • The coordinator cluster node’s logs are located at:
    • DEFLOC/logs/presto/cluster_id/cluster_start_time/master/
    • DEFLOC/logs/presto/cluster_id/cluster_start_time/master/queryinfo/
  • The worker cluster node’s logs are located at: DEFLOC/logs/presto/cluster_id/cluster_start_time/nodeIP/node_start_time/

Where:

  • DEFLOC refers to the default location of an account.

  • cluster_id is the cluster ID.

  • cluster_start_time is the time at which the cluster was started. You can fetch Presto logs from the above location using the approximate start time of the cluster.

    You can also get it by running a Presto command. When you run a Presto command, the log location is reported under the Logs tab.

    For example, you’ll see the path as something like this:

    Log location: gs://mydata.com/trackdata/logs/logs/presto/95907
    Started Query: 20191110_092450_00096_bucas Query Tracker
    Query: 20190810_092450_00096_bucas Progress: 0%
    Query: 20190810_092450_00096_bucas Progress: 0%
    

    95907 is the cluster instance ID; there are sub-directories for the coordinator and worker nodes.

Why are new nodes not being used by my query during upscaling?

New nodes are available only to certain operations (such as TableScans and Partial Aggregations) of queries already in progress when the nodes are added. For more information, see this explanation of how autoscaling works in a Presto cluster.

Where can I find the Presto Server Bootstrap logs?

A GCP user can see the Presto Server Bootstrap logs in /media/ephemeral0/presto/var/log/bootstrap.log.

How can I optimize the Presto query?

The following are some guidelines you can consider adopting to make the most of Presto on Qubole:

  • Check the health of the cluster before submitting a query and ensure that the Presto master is working and that cluster resources such as CPU, memory, and disk are not overutilized (> 95%).
  • Check the storage format of the data you are querying. Prepared data (columnar format, partitioning, statistics, and so on) provides better performance. Contact your Admin for further advice.
  • Ensure that PREDICATES and LIMIT statements are used when querying large datasets, doing large data scans, or joining multiple tables.
  • Consolidate small files into bigger files asynchronously to reduce network overheads.
  • Collect dataset statistics such as file size, rows, and histograms of values to optimize queries with JOIN reordering.
  • Enable runtime filtering to improve the performance of INNER JOIN queries.
  • Enable automatic selection of optimal JOIN distribution type and JOIN order based on table statistics.
  • Use broadcast/replicated JOIN (Map-side JOIN) when build side tables are small.
  • Use aggregations over DISTINCT values to speed up the query execution.
  • Presto Configuration in QDS

  • External Data Source Access

    Accessing Data Stores through Presto Clusters

    Qubole now supports accessing data stores through Presto clusters by adding a catalog parameter while creating a data store using a REST API request. Create a ticket with Qubole Support to enable this feature.

    Create a DbTap and Edit a DbTap describe the catalog parameter. This parameter is supported for MySQL, Postgres, and Redshift.

    Note

    Presto is supported on AWS, Azure, and GCP Cloud platforms; see QDS Components: Supported Versions and Cloud Platforms.

    Connecting to MySQL and JDBC Sources using Presto Clusters

    Note

    Presto is currently supported on AWS, Azure, GCP Cloud platforms; see QDS Components: Supported Versions and Cloud Platforms.

    You can use the MySQL connector to query and create tables in an external MySQL database, and to share data between two different tools such as MySQL and Hive.

    Connecting to MySQL Sources using Presto Clusters

    To connect to a MySQL source using a Presto cluster, configure a MySQL connector by adding a catalog properties file in etc/catalog. You can do this in the QDS UI when adding or editing a Presto cluster. For example, add the following in the Override Presto Configuration text box (see Configuring a Presto Cluster for more information).

    catalog/mysqlcatalog.properties:
    connector.name=mysql
    connection-url=jdbc:mysql://example.net:3306
    connection-user=root
    connection-password=secret
    

    Adding the above properties in the Presto cluster settings creates a new file, mysqlcatalog.properties in etc/catalog when the cluster comes up.

    In addition, add the following in the same text box:

    config.properties:
    datasources=jmx,hive,mysqlcatalog
    

    Note

    With Presto 317, the datasources configuration is not required for configuring connectors; see data-sources-317.

    Now start or restart the cluster to implement the configuration.

    Querying MySQL

    You can query a MySQL database as follows:

    1. The MySQL connector offers a schema for every MySQL database. Run the following command to see the available MySQL databases:

      SHOW SCHEMAS FROM mysqlcatalog;

    2. You can see the tables in a MySQL database by running the SHOW TABLES command. For example, to see the tables in a database named users, run the following command:

      SHOW TABLES FROM mysqlcatalog.users;

    3. To access a table from the MySQL database, run a SELECT query. For example, to access a permanentusers table in the users database, run:

      SELECT * FROM mysqlcatalog.users.permanentusers;
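
    Because the MySQL catalog appears alongside the Hive catalog in Presto, you can also join data across the two sources. The following is a minimal sketch; the Hive table, all column names, and the join keys are illustrative.

      SELECT u.name,
             count(*) AS visits
      FROM hive.default.clickstream c
      JOIN mysqlcatalog.users.permanentusers u
        ON c.user_id = u.id
      GROUP BY u.name;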

    Connecting to JDBC Sources using Presto Clusters

    In data analytics, integrating data from multiple sources is a common problem. This is because dimensional data such as user information resides in a relational database such as MySQL or PostgreSQL, while large semi-structured data such as clickstream data resides in Cloud Object Storage. You can use the Qubole Scheduler to periodically re-import the data into a database, but this only helps when the database does not change very often. When a database changes frequently, you can use Hive storage handlers to plug in other live data sources. A storage handler for databases based on JDBC suits this purpose well: you can create external Hive tables and map them to a database table, and a query against the external Hive table is rerouted to the underlying database table.

    To use a Storage Handler to integrate data from multiple sources, perform the following steps:

    1. Build a Hive storage handler using the code available on GitHub under the Apache License. The code is compatible with Apache Hive and Hadoop 2, and the ReadMe file provides instructions to build the storage handler. Alternatively, you can use the Qubole storage handler JAR available in the public bucket.

    2. After building the storage handler JAR, or using the Qubole storage handler JAR, connect to a database by adding the JAR and creating an external Hive table with specific TBLPROPERTIES. The TBLPROPERTIES clause contains information such as the JDBC driver class to use, the hostname, username, password, and table name. The following code snippet shows how to add the storage handler JAR and create an external Hive table with the required TBLPROPERTIES.

      ADD JAR <scheme>paid-qubole/jars/jdbchandler/qubole-hive-jdbc-handler.jar ;
      DROP TABLE HiveTable;
      CREATE EXTERNAL TABLE HiveTable(
                                       id INT,
                                       id_double DOUBLE,
                                       names STRING,
                                       test INT
                                      )
      STORED BY 'org.apache.hadoop.hive.jdbc.storagehandler.JdbcStorageHandler'
      TBLPROPERTIES (
                     'mapred.jdbc.driver.class'='com.mysql.jdbc.Driver',
                     'mapred.jdbc.url'='jdbc:mysql://localhost:3306/rstore',
                     'mapred.jdbc.username'='-',
                     'mapred.jdbc.input.table.name'='JDBCTable',
                     'mapred.jdbc.output.table.name'='JDBCTable',
                     'mapred.jdbc.password'='-'
                    );
      

    <scheme> is the Cloud-specific URI scheme: gs:// for GCP.
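
    Once the external table exists, queries against it are rerouted to the underlying database table. For example, a minimal sketch using the columns from the DDL above (run it as a Hive query):

      SELECT names, count(*) AS cnt
      FROM HiveTable
      GROUP BY names;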

  • Cluster Management

  • SQL

  • Migrations

  • Presto Best Practices

  • Using the Presto Query Retrying Mechanism

  • Using the Spill to Disk Mechanism

What’s New in Presto
Spark

This section explains how to configure and use Spark on a Qubole cluster. It covers the following topics:

Introduction

Apache Spark is a fast and general compute engine for Hadoop data. Spark provides a simple and expressive programming model that supports a wide range of applications, including ETL, machine learning, stream processing, and graph computation. Spark’s in-memory data model and fast processing makes it particularly suitable for applications such as:

  • Machine Learning and Graph Processing
  • Stream Processing
  • Interactive queries against In-Memory data

Qubole offers only the Spark-on-YARN variant. Hence, the Apache Hadoop YARN parameters that Qubole offers also apply to Spark. For more information on the YARN parameters, see Significant Parameters in YARN and YARN in Qubole.

For supported Spark versions, see QDS Components: Supported Versions and Cloud Platforms and Spark Version Support.

Supported Interfaces for Spark

You can create and run Spark applications from the Analyze page, Workbench page, Notebooks page, and JupyterLab interface.

Note

Spark 3.0 is supported on the Workbench and Notebooks pages, and in the JupyterLab interface.

For more information about creating and running Spark applications, see the following information:

Understanding Spark Cluster Worker Node Memory and Defaults

The memory on a Spark cluster worker node is divided between memory for HDFS, YARN, and other daemons, and memory for the executors of Spark applications. Each worker node runs executors; an executor is a process launched for a Spark application on a worker node. Each executor's memory is the sum of the YARN overhead memory and the JVM heap memory. The JVM heap memory comprises:

  • RDD Cache Memory
  • Shuffle Memory
  • Working Heap/Actual Heap Memory

The following figure illustrates a Spark application’s executor memory layout with its components.

_images/SparkExecutorMemory.png

For example, assume the total memory of a cluster worker node is 25GB and two executors are running on it:

5GB - Memory for HDFS, YARN and other daemons and system processes
10GB - Executor 1

    2GB    - yarn overhead          (spark.yarn.executor.memoryOverhead 2048)
    8GB    - JVM heap size          (spark.executor.memory 8GB):

        4.8GB - RDD cache memory (spark.executor.storage.memoryFraction 0.6)

        1.6GB - Shuffle memory        (spark.executor.shuffle.memoryFraction 0.2)
        1.6GB - Working heap
10GB - Executor 2
Understanding Qubole’s Default Parameters Calculation

Qubole automatically sets spark.executor.memory, spark.yarn.executor.memoryOverhead and spark.executor.cores on the cluster. It is based on the following points:

  • Avoid very small executors. They involve many overheads, and with very small executor memory the real available processing memory is too small; large partitions may result in out-of-memory (OOM) issues.
  • Avoid very large executors, as they make it harder to share resources among applications and cause YARN to keep more resources reserved, which under-utilizes the cluster.
  • Broadly, set the memory between 8GB and 16GB. This is an arbitrary choice governed by the two points above.
  • Pack as many executors as can be assigned to one cluster node.
  • Evenly distribute cores to all executors.
  • If RAM per vCPU is large for some instance type, Qubole's computed executor is correspondingly large, and both the number of executors (--num-executors) and the maximum number of executors (--max-executors) are set to 2; hence, autoscaling is disabled by default. Configuring the maximum number of executors (--max-executors) is preferred when there are many parallel jobs that must share the cluster resources. Similarly, a Spark job can start with an exact number of executors (--num-executors) rather than depending on the default.
Assigning CPU Cores to an Executor

For example, if a node has 4 vCPUs according to its instance type, YARN might report eight cores, depending on the configuration. If you want to run four executors on this node, set spark.executor.cores to 2. This ensures that each executor uses one vCPU and can run two tasks in parallel.
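
A minimal sketch for the Spark Submit Command Line Options text field, matching the four-executor layout above (the value is illustrative):

--conf spark.executor.cores=2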

For more information about the resource allocation, Spark application parameters, and determining resource requirements, see An Introduction to Apache Spark Optimization in Qubole.

Running a Simple Spark Application

This page is intended to guide a new user in running a Spark application on QDS. For more information about composing and running Spark commands from the QDS UI, see Composing Spark Commands in the Analyze Page.

Before You Start

You must have an active QDS account; for instructions on creating one, see Managing Your Accounts. Then sign in to the QDS UI.

Submitting a Spark Scala Application

After signing in, you’ll see the Workbench page.

Proceed as follows to run a Spark command. In this example we’ll be using Scala.

  1. Click + Create New. Select the Spark tab near the top of the page. Scala is selected by default.

  2. Enter your query in the text field (Query Statement is selected by default in the drop-down list near the top right of the page; this is what you want in this case). For example:

    import org.apache.spark._
    object FirstProgram {
      def main(args: Array[String]) {
        val sc = new SparkContext(new SparkConf())
        val result = sc.parallelize(1 to 10).collect()
        result.foreach(println)
      }
    }
    
  3. Click Run to execute the query.

The query result appears under the Results tab, and the query logs under the Logs tab. Note the clickable Application UI link to the Spark application UI under the Logs tab. See also Downloading Results and Logs.

For information on running other types of Spark command, see:

Using Notebooks

To run Spark applications in a Notebook, follow this quick-start guide, Running Spark Applications in Notebooks.

Composing Spark Commands in the Workbench Page

Use the command composer on Workbench to compose a Spark command in different languages.

See Running Spark Applications and Spark in Qubole for more information. For information about using the REST API, see Submit a Spark Command.

Spark queries run on Spark clusters. See Mapping of Cluster and Command Types for more information.

Qubole Spark Parameters

As part of a Spark command, you can use command-line options to set or override Qubole parameters such as the following:

  • The Qubole parameter spark.sql.qubole.parquet.cacheMetadata allows you to turn caching on or off for Parquet table data. Caching is on by default; Qubole caches data to prevent table-data-access query failures in case of any change in the table’s Cloud storage location. If you want to disable caching of Parquet table data, set spark.sql.qubole.parquet.cacheMetadata to false. You can do this at the Spark cluster or job level, or in a Spark Notebook interpreter.
  • In the case of DirectFileOutputCommitter (DFOC) with Spark, if a task fails after partially writing files, subsequent reattempts might fail with FileAlreadyExistsException (because of the partial files left behind), causing the job to fail. To prevent such failures, set the spark.hadoop.mapreduce.output.textoutputformat.overwrite and spark.qubole.outputformat.overwriteFileInWrite flags to true (see the example after this list).
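
For example, a sketch of overriding these parameters at the job level in the Spark Submit Command Line Options text field (all three properties are taken from the list above):

--conf spark.sql.qubole.parquet.cacheMetadata=false
--conf spark.hadoop.mapreduce.output.textoutputformat.overwrite=true
--conf spark.qubole.outputformat.overwriteFileInWrite=true
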
Ways to Compose and Run Spark Applications

You can compose a Spark application using:

Note

You can read a Spark job’s logs, even after the cluster on which it was run has terminated, by means of the offline Spark History Server (SHS). For offline Spark clusters, only event log files that are less than 400 MB are processed in the SHS. This prevents high CPU utilization on the webapp node. For more information, see this blog.

You can use the --packages option to add a list of comma-separated Maven coordinates for external packages that are used by a Spark application composed in any supported language. For example, in the Spark Submit Command Line Options text field, enter --packages com.package.module_2.10:1.2.3.

You can use macros in script files for Spark commands with subtypes scala (Scala), py (Python), R (R), sh (Command), and sql (SQL). You can also use macros in large inline content and large script files for scala (Scala), py (Python), R (R), and sql (SQL). This capability is not enabled for all users by default; create a ticket with Qubole Support to enable it for your QDS account.

About Using Python 2.7 in Spark Jobs

If your cluster is running Python 2.6, you can enable Python 2.7 for a Spark job as follows:

  1. Add the following configuration in the node bootstrap script (node_bootstrap.sh) of the Spark cluster:

    source /usr/lib/hustler/bin/qubole-bash-lib.sh
    qubole-hadoop-use-python2.7
    
  2. To run spark-shell/spark-submit on any node’s shell, run these two commands by adding them in the Spark Submit Command Line Options text field before running spark-shell/spark-submit:

    source /usr/lib/hustler/bin/qubole-bash-lib.sh
    qubole-hadoop-use-python2.7
    
Compose a Spark Application in Scala
  1. Navigate to Workbench and click + New Collection.

  2. Select Spark from the command type drop-down list.

  3. By default, Scala is selected.

  4. Choose the cluster on which you want to run the query. View the health metrics of a cluster before you decide to use it.

  5. Query Statement is selected by default in the drop-down list (upper-right corner of the screen). Enter your query in the text field.

    or

    To run a stored query, select Query Path from the drop-down list, then specify the cloud storage path that contains the query file.

  6. Add macro details (as needed).

  7. Optionally enter command options in the Spark Submit Command Line Options text field to override the default command options.

  8. Optionally specify arguments in the Arguments for User Program text field.

  9. Click Run to execute the query.

Monitor the progress of your job using the Status and Logs panes. You can toggle between the two using a switch. The Status tab also displays useful debugging information if the query does not succeed. For more information on how to download command results and logs, see Get Results. Note the clickable Spark Application UI URL in the Resources tab.

Compose a Spark Application in Python
  1. Navigate to Workbench and click + New Collection.

  2. Select Spark from the command type drop-down list.

  3. By default, Scala is selected. Select Python from the drop-down list.

  4. Choose the cluster on which you want to run the query. View the health metrics of a cluster before you decide to use it.

  5. Query Statement is selected by default in the drop-down list (upper-right corner of the screen). Enter your query in the text field.

    or

    To run a stored query, select Query Path from the drop-down list, then specify the cloud storage path that contains the query file.

  6. Add macro details (as needed).

  7. Optionally enter command options in the Spark Submit Command Line Options text field to override the default command options.

  8. Optionally specify arguments in the Arguments for User Program text field.

  9. Click Run to execute the query.

Monitor the progress of your job using the Status and Logs panes. You can toggle between the two using a switch. The Status tab also displays useful debugging information if the query does not succeed. For more information on how to download command results and logs, see Get Results. Note the clickable Spark Application UI URL in the Resources tab.

Compose a Spark Application using the Command Line

Note

Qubole does not recommend using the Shell command option to run a Spark application via Bash shell commands, because in this case automatic changes (such as increases in the Application Coordinator memory based on the driver memory, and the availability of debug options) do not occur. Such automatic changes do occur when you run a Spark application using the Command Line option.

  1. Navigate to Workbench and click + New Collection.

  2. Select Spark from the command type drop-down list.

  3. By default, Scala is selected. Select Command Line from the drop-down list.

  4. Choose the cluster on which you want to run the query. View the health metrics of a cluster before you decide to use it.

  5. Query Statement is selected by default in the drop-down list (upper-right corner of the screen). Enter your query in the text field.

    or

    To run a stored query, select Query Path from the drop-down list, then specify the cloud storage path that contains the query file.

  6. Add macro details (as needed).

  7. Click Run to execute the query.

Monitor the progress of your job using the Status and Logs panes. You can toggle between the two using a switch. The Status tab also displays useful debugging information if the query does not succeed. For more information on how to download command results and logs, see Get Results. Note the clickable Spark Application UI URL in the Resources tab.

Compose a Spark Application in SQL

Note

You can run Spark commands in SQL with Hive Metastore 2.1. This capability is not enabled for all users by default; create a ticket with Qubole Support to enable it for your QDS account.

You can run Spark SQL commands with large script files and large inline content. This capability is not enabled for all users by default; create a ticket with Qubole Support to enable it for your QDS account.

  1. Navigate to Workbench and click + New Collection.

  2. Select Spark from the command type drop-down list.

  3. By default, Scala is selected. Select SQL from the drop-down list.

  4. Choose the cluster on which you want to run the query. View the health metrics of a cluster before you decide to use it.

  5. Query Statement is selected by default in the drop-down list (upper-right corner of the screen). Enter your query in the text field. Press Ctrl + Space in the command editor to get a list of suggestions.

    or

    To run a stored query, select Query Path from the drop-down list, then specify the cloud storage path that contains the query file.

  6. Add macro details (as needed).

  7. Optionally enter command options in the Spark Submit Command Line Options text field to override the default command options.

  8. Click Run to execute the query.

Monitor the progress of your job using the Status and Logs panes. You can toggle between the two using a switch. The Status tab also displays useful debugging information if the query does not succeed. For more information on how to download command results and logs, see Get Results. Note the clickable Spark Application UI URL in the Resources tab.

Compose a Spark Application in R
  1. Navigate to Workbench and click + New Collection.

  2. Select Spark from the command type drop-down list.

  3. By default, Scala is selected. Select R from the drop-down list.

  4. Choose the cluster on which you want to run the query. View the health metrics of a cluster before you decide to use it.

  5. Query Statement is selected by default in the drop-down list (upper-right corner of the screen). Enter your query in the text field.

    or

    To run a stored query, select Query Path from the drop-down list, then specify the cloud storage path that contains the query file.

  6. Add macro details (as needed).

  7. Optionally enter command options in the Spark Submit Command Line Options text field to override the default command options.

  8. Optionally specify arguments in the Arguments for User Program text field.

  9. Click Run to execute the query.

Monitor the progress of your job using the Status and Logs panes. You can toggle between the two using a switch. The Status tab also displays useful debugging information if the query does not succeed. For more information on how to download command results and logs, see Get Results. Note the clickable Spark Application UI URL in the Resources tab.

Known Issue

The Spark Application UI might display an incorrect state of the application when Spot Instances are used. You can view the accurate status of the Qubole command in the Workbench or Notebooks page.

When the Spark application is running, if the coordinator node or the node that runs the driver is lost, the Spark Application UI may display an incorrect state of the application. The event logs for a running application are persisted periodically from the HDFS location to cloud storage. If the coordinator node is removed due to spot loss, cloud storage may not have the latest application status; as a result, the Spark Application UI may continue to show the application as running.

To prevent this issue, Qubole recommends using an On-Demand master node.

Spark Integration with BigQuery

Google’s BigQuery is a serverless data warehouse for storing and querying massive datasets. Spark on Qubole is integrated with BigQuery, enabling direct reads of data from BigQuery storage into Spark DataFrames. This allows data engineers to explore BigQuery datasets or join data in Google Cloud Storage and BigQuery to perform complex data transformations and queries. For more information about BigQuery, see the Google BigQuery documentation.

Data scientists can look up BigQuery datasets and build machine learning models using Qubole’s Spark and Notebooks. The data is read in Apache Avro format using parallel streams with dynamic data sharding across streams to support low latency reads. The Spark connector for BigQuery eliminates the need to export data from BigQuery to Google Cloud Storage, improving data processing times.
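
As an illustration of the read path, the following is a minimal Scala sketch that loads a BigQuery table into a Spark DataFrame. It assumes a notebook or application where spark is an existing SparkSession and the BigQuery Spark connector is available; the project, dataset, and table names are placeholders.

// Minimal sketch: read a BigQuery table into a Spark DataFrame.
// The table coordinates below are placeholders.
val df = spark.read
  .format("bigquery")
  .option("table", "my-project.my_dataset.my_table")
  .load()

// Explore the data or join it with data in Google Cloud Storage.
df.printSchema()
df.createOrReplaceTempView("bq_table")
spark.sql("SELECT count(*) FROM bq_table").show()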

Viewing BigQuery Datasets in the Qubole UI

Qubole displays BigQuery datasets directly in the Workbench and Notebooks interfaces. This allows data scientists and data engineers to discover BigQuery tables and datasets from within QDS.

Qubole Workbench UI with Data from BigQuery
_images/BigQuery-Workbench.jpg
Qubole Notebooks UI with Data from BigQuery
_images/BigQuery-Notebook.png
Accessing Data Stores through Spark Clusters

Qubole supports accessing data stores through Spark clusters by adding a catalog parameter while creating a data store using a REST API request. Create a DbTap and Edit a DbTap describe the catalog parameter.

For Spark, the catalog parameter supports the following database types for data stores:

  • MySQL on Scala.
  • Redshift on Scala and Python. For more information about Spark Redshift connector, see spark-redshift-connector.

Note

To access data stores, you should have read or update permission on the Data Connections resource. For more information about the resources, see Resources, Actions, and What they Mean. This feature is supported on Spark 2.3.2, or 2.4.0 and later versions. This feature is not enabled for all users by default. Create a ticket with Qubole Support to enable this feature on the QDS account.

To access a data store through a Spark cluster, perform these steps:

  1. Create a ticket with Qubole Support to enable this feature.
  2. Add the catalog parameter in the data store configuration. Create a DbTap and Edit a DbTap describe the catalog parameter.
  3. Access the data store by using its JDBC URL, username, and password.

The QuboleDBTap class and companion object have been copied from com.qubole.QuboleDBTap to org.apache.spark.sql.qubole.QuboleDBTap for Spark 2.0.0 and later versions.

com.qubole.QuboleDBTap is still maintained to keep backward compatibility for all existing versions of Spark. However, Qubole strongly recommends migrating from com.qubole.QuboleDBTap to org.apache.spark.sql.qubole.QuboleDBTap, as support for com.qubole.QuboleDBTap will be removed starting with Spark 2.3.0. From that point, QuboleDBTap and its methods can be used only by importing org.apache.spark.sql.qubole.QuboleDBTap.

The following example shows how to register tables and query information through the API:

import org.apache.spark.sql.qubole.QuboleDBTap // NOTE: If you are using spark 1.6.x, use: import com.qubole.QuboleDBTap.
import org.apache.spark._
import org.apache.spark.sql._
val sqlContext = new  org.apache.spark.sql.hive.HiveContext(sc)
val catalogName = "catalog-name-created-during-create-dbtap" //See step 2 above
val databaseName = "database-name-created-during-create-dbtap" //See step 2 above
val quboleDBTap = QuboleDBTap.get(s"$catalogName",sqlContext)
//list of tables included, supports regex pattern matching
val includes = List()
//list of tables excluded, supports regex pattern matching
val excludes = List()
quboleDBTap.registerTables(s"$databaseName", includes, excludes)

val tableName = "mysql-tablename"
sqlContext.sql(s"select * from `$catalogName.$databaseName.$tableName`").show

//On completion of using the quboleDBTap object
quboleDBTap.unregister()

The following example shows how to create a short-lived DBTap object for a Spark session without using REST APIs as shown in the above example:

import org.apache.spark.sql.qubole.QuboleDBTap // NOTE: If you are using Spark 1.6.x, use: import com.qubole.QuboleDBTap.
import org.apache.spark._
import org.apache.spark.sql._
val sqlContext = new  org.apache.spark.sql.hive.HiveContext(sc)
val catalogName = "any-catalog-name"
val hostName = "<mysql-hostname>"
val databaseType = "mysql"
val jdbcUrl = s"jdbc:$databaseType://$hostName/"
val username = "<username>"
val password = "<password>"
val quboleDBTap = new QuboleDBTap(catalogName, jdbcUrl, username, password, sqlContext)
//list of tables included, supports regex pattern matching
val includes = List()
//list of tables excluded, supports regex pattern matching
val excludes = List()
val databaseName = "<mysql-databasename>"
quboleDBTap.registerTables(s"$databaseName", includes, excludes)

val tableName = "<mysql-tablename>"
sqlContext.sql(s"select * from `$catalogName.$databaseName.$tableName`").show

//On completion of using the quboleDBTap object
quboleDBTap.unregister()
Spark Versions Supportability Matrix

Spark versions support different versions of components related to Spark.

The following table lists the supported components and versions for the Spark 3 and Spark 2.x versions.

Component       | Supported Version for Spark 3 | Supported Version for Spark 2.x
Scala           | 2.12 (default)                | 2.11 (default)
Java            | 8                             | 8
Hive Metastore  | >= 2.3 (default)              | 1.2.1, 2.3, 2.1, 3.1
Python          | 3.x, 2.x                      | 2.x, 3.x
Hadoop          | 2.x                           | 2.x

Note

  • For Spark 3.0, if you are using a self-managed Hive metastore with an older metastore version (Hive 1.2), a few metastore operations from Spark applications might fail. Therefore, you should upgrade the metastore to Hive 2.3 or a later version. The QDS-managed metastore is upgraded by default.
  • Python 2.x will be deprecated soon for Spark 3.x versions.
Spark Best Practices

This topic describes best practices for running Spark jobs.

Spark Configuration Recommendations
  • Set --max-executors (see the example after this list). Ideally, the other parameters do not need to be set, as the defaults are sufficient.
  • Try to avoid setting too many job-level parameters.
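
For example, in the Spark Submit Command Line Options text field (the value 20 is illustrative):

--max-executors 20
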
Ensuring Jobs Get their Fair Share of Resources

To prevent jobs from blocking each other, use the YARN Fair Scheduler to ensure that each job gets its fair share of resources. For example, if each interpreter can use a maximum of 100 executors, and ten users are running one interpreter each, the YARN Fair Scheduler can ensure that each user's job gets about ten executors. As the number of running jobs rises or falls, the number of executors each job gets falls or rises inversely: 100 jobs will get one executor each, while a single job gets all 100 executors.

To configure this, configure notebooks and interpreters for the running Spark cluster as follows. (For information about using notebooks and interpreters, see Running Spark Applications in Notebooks.)

Note

Do not do this if you have enabled interpreter user mode; in that case, QDS configures the Fair Scheduler for you.

  1. Assign an interpreter for each user and note. This ensures that each job is submitted to YARN as a separate application.
  2. Edit the interpreter to use a fairly low value for spark.executor.instances, so that an interpreter that is not in use does not hold on to too many executors.
Specifying Dependent Jars for Spark Jobs

You can specify dependent jars using these two options:

  • In the QDS UI’s Workbench query composer for a Spark command, add a Spark Submit argument such as the following to add jars at the job level:

    --jars gs://bucket/dir/x.jar,gs://bucket/dir2/y.jar --packages "com.datastax.spark:spark-cassandra-connector_2.10:1.4.0-M1"

  • Another option for specifying jars is to download jars to /usr/lib/spark/lib via the node bootstrap script; for example:

    hdfs dfs -get gs://bucket/path/app.jar /usr/lib/spark/lib/
    hdfs dfs -get gs://bucket/path/dep1.jar /usr/lib/spark/lib/
    hdfs dfs -get gs://bucket/path/dep2.jar /usr/lib/spark/lib/
    
Handling Skew in the Join

To handle skew in the join keys, you can specify the hint ` /*+ SKEW ('<table_name>') */ ` for a join that describes the column and the values upon which skew is expected. Based on that information, the engine automatically ensures that the skewed values are handled appropriately.

You can specify the hint in the following formats:

  • Format 1:

    /*+ SKEW('<tableName>') */
    

    This shows that all the columns in a given table are skewed and the value on which they are skewed is not known. With this hint, the Spark optimizer tries to identify the values on which the column involved in the join is skewed. This operation is performed when the Spark optimizer identifies that a column is involved in the join and then it samples the data on the table.

    Example: In a query, suppose there is a table t1 where all columns involved in the join are skewed. But the skew values are unknown. In this case, you can specify the skew hint as ` /*+ SKEW('t1') */ `.

  • Format 2:

    /*+ SKEW ('<tableName>', (<COLUMN-HINT>), (<ANOTHER-COLUMN-HINT>)) */
    

    <COLUMN-HINT> can be either a column name (example, column1) or a column name and list of values on which the column is skewed (example - column1, ('a', 'b', 'c')). The Spark optimizer identifies the skew values from the hint. As a result, the sampling of data is not required.

    Example: Suppose there is a table t1 with 4 columns - c1, c2, c3, c4. Consider that c1 is skewed on value ‘a’ and ‘b’, c2 and c3 are also skewed but the skew values are unknown, and c4 is not a skewed column. In this case, you can specify the hint as ` /*+ SKEW('t1', ('c1', ('a', 'b')), ('c2'), ('c3')) */ `.

Example Query
SELECT /*+ SKEW('t1', ('c1', ('a', 'b')), ('c2'), ('c3')) */ *
FROM
  (SELECT t2.c1 AS temp_col1 FROM t1 JOIN t2 ON t1.c1 = t2.c1) temp_table1
JOIN
  (SELECT t3.c2 AS temp_col2 FROM t1 JOIN t3 ON t1.c2 = t3.c2) temp_table2
WHERE temp_table1.temp_col1 = temp_table2.temp_col2;
Optimizing Query Execution with Adaptive Query Execution

Spark on Qubole supports Adaptive Query Execution on Spark 2.4.3 and later versions, with which query execution is optimized at the runtime based on the runtime statistics.

At runtime, the adaptive execution mode can change a shuffle join to a broadcast join if the size of one table is less than the broadcast threshold. Spark on Qubole adaptive execution also supports handling skew in input data, and optimizes such joins using Qubole's skew join optimization. In general, adaptive execution decreases the effort involved in tuning SQL query parameters, and improves execution performance by selecting a better execution plan and parallelism at runtime.
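
A minimal sketch of enabling this for a single job through the Spark Submit Command Line Options text field, assuming the standard spark.sql.adaptive.enabled property (verify the exact property name for your Spark version):

--conf spark.sql.adaptive.enabled=true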

Configuring the Spark External Shuffle Service

The Spark external shuffle service is an auxiliary service that runs as part of the YARN NodeManager on each worker node in a Spark cluster. When enabled, it maintains the shuffle files generated by all Spark executors that ran on that node.

Spark executors write the shuffle data and manage it. If the Spark external shuffle service is enabled, the shuffle service manages the shuffle data, instead of the executors. This helps in downscaling the executors, because the shuffle data is not lost when the executors are removed. It also helps improve the behavior of the Spark application in case of error because the shuffle data does not need to be re-processed when an executor crashes.

In open-source Spark, Spark job-level autoscaling (also known as Spark Dynamic Allocation) works in tandem with the external shuffle service and the shuffle service is mandatory for autoscaling to work. See Spark Shuffle Behavior for more information. In Spark on Qubole, on the other hand, the external shuffle service is optional and Qubole-based Spark job-level autoscaling works whether or not the shuffle service is enabled. (If the external shuffle service is disabled, the executors are not removed until the shuffle data goes away.)

Qubole provides the Spark external shuffle service in Spark 1.5.1 and later supported Spark versions.

The external shuffle service is enabled by default in Spark 1.6.2 and later versions. To disable it, set spark.shuffle.service.enabled to false.

Spark external shuffle service is not enabled by default in Spark 1.5.1 and Spark 1.6.0. To enable it for one of these versions, configure it as follows:

  • Override Hadoop Configuration Variables

    Before starting a Spark cluster, pass the following Hadoop overrides to start Spark external shuffle service:

    yarn.nodemanager.aux-services=mapreduce_shuffle,spark_shuffle
    yarn.nodemanager.aux-services.spark_shuffle.class=org.apache.spark.network.yarn.YarnShuffleService
    

    See Advanced Configuration: Modifying Hadoop Cluster Settings for setting the Hadoop override configuration variables in the QDS UI.

  • Spark Configuration Variable

    Set the configuration to enable external shuffle service on a Spark application, a Spark cluster or a Spark notebook.

    Enabling External Shuffle Service on a Spark Cluster

    Set the following configuration in the Override Spark Configuration Variables text box of the cluster configuration page:

    spark-defaults.conf:
    
    spark.shuffle.service.enabled    true
    

    See Configuring a Spark Cluster for more information.

    Note

    If you set spark.shuffle.service.enabled to false, then the Spark application does not use the external shuffle service.

    Enabling External Shuffle Service on a Spark Command

    Configure the following setting as a Spark-submit option in the command/query composer while composing a Spark application:

    --conf spark.shuffle.service.enabled=true

    See Composing Spark Commands in the Analyze Page for more information.

    For example, sqlContext.sql("select count(*) from default_qubole_memetracker").collect() generates a lot of shuffle data, so set --conf spark.shuffle.service.enabled=true when running bin/spark-shell.

    Enabling External Shuffle Service on a Spark Notebook

    Add spark.shuffle.service.enabled as an interpreter setting with the value true in a Spark notebook's interpreter. Bind the Spark interpreter settings to the notebook you are using, if they are not bound already. See Running Spark Applications in Notebooks and Understanding Spark Notebooks and Interpreters for more information.

External shuffle service logs are part of the NodeManager logs located at /media/ephemeral0/logs/yarn/yarn-nodemanager*.log. NodeManager logs are present on each worker node in the cluster.

Continuously Running Spark Streaming Applications

You can continuously run Spark streaming applications by setting the following parameters:

  • Set yarn.resourcemanager.app.timeout.minutes=-1 as a Hadoop override at the Spark cluster level.
  • To prevent all Spark streaming applications on a specific cluster from being timed out, set spark.qubole.idle.timeout to -1 as a Spark configuration variable in the Override Spark Configuration Variables text field of the Spark cluster configuration UI page (see the example after this list). See Configuring a Spark Cluster for more information.
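
For example, a sketch of the two overrides, with values taken from the settings above:

Override Hadoop Configuration Variables:

yarn.resourcemanager.app.timeout.minutes=-1

Override Spark Configuration Variables:

spark-defaults.conf:

spark.qubole.idle.timeout    -1
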
Using UDFs in Spark SQL

A UDF (user-defined function) is a way of adding a function to Spark SQL. It operates on distributed DataFrames and works row by row unless it is created as a user-defined aggregation function. Open-source Spark provides two alternative methods:

  • Using Hive functions
  • Using Scala functions

The following example uses Hive functions to add a UDF and use it in Spark SQL.

import org.apache.spark._
import org.apache.spark.sql._
import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

class SimpleUDFExample extends UDF {
  def evaluate(input: Text) : Text = {
    if (input == null) return null
    return new Text("Hello " + input.toString)
  }
}
object sqltest {
  def main(args: Array[String]) {
    val sc = new SparkContext(new SparkConf())
    val sqlContext = new  org.apache.spark.sql.hive.HiveContext(sc)
    sqlContext.sql("create temporary function hello as 'SimpleUDFExample'")
    val result = sqlContext.sql("""
        select hello(name) from products_avro order by month, name, price
       """)
    result.collect.foreach(println)
  }
}

For an example using Scala functions, see UDF Registration.

Introduction to Sparklens

Sparklens is an open-source Spark profiling tool from Qubole that can be used with any Spark application. Sparklens helps in tuning Spark applications by identifying potential opportunities for optimization with respect to driver-side computations, lack of parallelism, skew, and so on. Its built-in scheduler simulator can predict how a given Spark application would run on any number of executors, in a single run.

Sparklens analyzes the given Spark application in a single run, and provides the following information:

  • If the application can run faster with more cores and how to optimize it.
  • If the compute cost can be reduced by running the application with fewer cores, without much increase in wall clock time.
  • The absolute minimum time that the application can take if infinite executors are given.
  • How to run the application below the absolute minimum time.
Using Sparklens

You can analyze your Spark applications with Sparklens by adding extra command-line options to spark-submit or spark-shell:

--packages qubole:sparklens:0.3.1-s_2.11
--conf spark.extraListeners=com.qubole.sparklens.QuboleJobListener

Starting with Spark 2.4.0, you can analyze your Spark applications with Sparklens without passing the --packages option externally. This feature is not enabled for all users by default; create a ticket with Qubole Support to enable it on the QDS account.

After the feature is enabled, you should pass the --conf spark.extraListeners=com.qubole.sparklens.QuboleJobListener command line option to run the sparklens reporting.

The open source code is available at https://github.com/qubole/sparklens.

For more information about Sparklens, see the Sparklens blog.

Configuring a Spark Notebook

This page covers the following topics:

For using the Anaconda Python interpreter, see Using the Anaconda Interpreter.

Configuring Per-User Interpreters for Spark Notebooks

Per-user interpreters provide each Spark Notebook user with a dedicated interpreter, ensuring a fair distribution of cluster resources among running interpreters. This is called user mode; see also Using the User Interpreter Mode for Spark Notebooks.

Advantages of User Mode

User mode provides each user with a dedicated interpreter. Advantages of this include:

  • Each user’s customizations of the interpreter properties are preserved; for example:
    • Configured cluster resources, such as driver memory (spark.driver.memory)
    • The default interpreter type (zeppelin.default.interpreter)
    • Dependencies such as Maven artifacts.
  • Bottlenecks are reduced because cluster resources are shared among running interpreters.
    • Each user gets a dedicated Spark session (Spark versions 2.0.0 and later) or SparkContext.

User mode is best suited to an environment in which several users are likely to be using notebooks to run jobs and applications at any given time. But it does not unnecessarily restrict access to cluster resources when only one or a few jobs are running; see Important Resource Considerations for more discussion.

Qubole recommends that you use user mode because it provides better performance for individual users, and allows the most efficient and cost-effective use of cluster resources.

Enabling User Mode

To enable user mode, proceed as follows.

Note

  • If this is a new QDS account, user mode is enabled by default, so you can skip the steps that follow.
  • All the interpreters that were available in legacy mode continue to be available after you switch the cluster to user mode.
  1. Navigate to the Clusters page in QDS and clone your Spark cluster (select Clone from the drop-down menu next to the cluster at the right of the screen.) Cloning the cluster is not required, but Qubole recommends it.
  2. Select the Edit button for your new Spark cluster.
  3. From the drop-down list next to Zeppelin Interpreter Mode, choose user.
  4. If the cluster is already running, restart it to apply the change.

To switch the cluster back to legacy mode, simply repeat steps 2-4 above, setting Zeppelin Interpreter Mode to legacy instead of user. Any interpreters created in user mode continue to be available; users should make sure the interpreter they want to use is at the top of the list.

You can also set interpreter modes through REST API calls. For more information, see spark_settings.

How User Interpreter Mode Works

When user mode is enabled, an interpreter is created automatically for each user who runs a notebook. Each interpreter is named as follows: user_<user's_email_name>_<user's_email_domain> (user is a literal constant); for example, for a user whose email address is abc@xyz.com, the interpreter name is set to user_abc_xyz. (The email address is also stored in spark.yarn.queue.)

Default properties are set by QDS; users can change the defaults, but there is currently no way for you, as the system administrator, to assign new global defaults.

Users can also create additional interpreters.

Important Resource Considerations
  • Spark Executors: When a user runs a notebook with an interpreter in user mode, the interpreter launches executors as needed, starting with the minimum configured for the interpreter (spark.executor.instances) and scaling up to the configured maximum (spark.dynamicAllocation.maxExecutors). These values vary depending on the instance type, and are derived from the spark-defaults.conf file. You should assess these values, and particularly spark.dynamicAllocation.maxExecutors, in terms of the day-to-day needs of your users and their workflow, keeping the following points in mind:

    • QDS will never launch more than spark.dynamicAllocation.maxExecutors for any interpreter, regardless of how many are running. This means that when only one or a few interpreters are running, cluster resources (that could be employed to launch more executors and speed up jobs) may go unused; so you need to make sure that the default maximum is not set too low.
    • Conversely, because QDS will autoscale the cluster if necessary to meet the demand for executors, you also need to make sure that spark.dynamicAllocation.maxExecutors is not set too high, or you risk paying for computing resources (executors) that are not needed.
    • Once you have determined the best default, you should discourage users from changing it for an individual interpreter without consulting their system administrator.
  • YARN Fair Scheduler: In user mode, QDS configures the YARN Fair Scheduler to allocate executors (with their underlying cluster resources such as memory and processors) among running interpreters, and enables preemption (yarn.scheduler.fair.preemption). These controls come into play when the cluster resources are fully stretched, that is, when the maximum number of nodes are running the maximum number of executors.

    • You do not need to configure the Fair Scheduler manually as described here.
  • Spark cluster coordinator node: Each interpreter takes up memory (2 GB by default) in the Spark driver, which runs on the Spark cluster coordinator node. The load on the coordinator node is likely to be greater in user mode than in legacy mode, because in user mode each user runs a dedicated interpreter instance.

    As a result, you might have to increase the capacity of the coordinator node by choosing a larger instance type, considering the number of interpreters running at any given time.

Loading Dependent Jars

To make required jars accessible to all Zeppelin notebooks, copy the dependent jar file to /usr/lib/spark/lib on all nodes (through node_bootstrap.sh).
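
For example, a node bootstrap snippet along the lines of the earlier job-level example (the bucket path is a placeholder):

hdfs dfs -get gs://bucket/path/dependency.jar /usr/lib/spark/lib/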

Loading Dependent Jars Dynamically in a Notebook

Zeppelin provides a UI option on the Interpreters page to add a dependency. The following figure shows a create-interpreter page with the Dependencies text field.

_images/Dependencies.png

Add the artifact in the format <groupID>:<artifactID>:<version>, or a local path, in the artifact text field. You can exclude artifacts, if any. Click the + icon to add another dependency. Click Save to add the dependency along with any other Spark interpreter changes. The Dependencies UI option is the same when creating and when editing a Spark interpreter.

You can add remote Maven repositories and add dependencies in the configured remote repositories.

You can also use the %dep or %spark.dep interpreter to load jars before starting the Spark interpreter. You must enable the dynamic interpreter in a paragraph and subsequently use the %spark interpreter in a new paragraph.

The following are interpreter examples.

Example 1

The following example is for loading a Maven Artifact.

Paragraph 1

%dep
z.reset()
z.load("com.google.code.facebookapi:facebook-java-api:3.0.4")

OR

%spark.dep
z.reset()
z.load("com.google.code.facebookapi:facebook-java-api:3.0.4")

Paragraph 2

%spark
import com.google.code.facebookapi.FacebookException;
import com.google.code.facebookapi.FacebookWebappHelper;
class Helloworld {

    def main1(args: Array[String]) {
        println("helloworld")
    }
}
Example 2

The following example is for loading a Spark CSV jar.

import org.apache.spark._
import org.apache.spark.sql._

val sparkSession = SparkSession
    .builder()
    .appName("spark-csv")
    .enableHiveSupport()
    .getOrCreate()

import sparkSession.implicits._
val squaresDF = sparkSession.sparkContext.makeRDD(1 to 100).map(i => (i, i * i)).toDF("value", "square")
val location ="s3://bucket/testdata/spark/csv1"
squaresDF.write.mode("overwrite").csv(location)
sparkSession.read.csv(location).collect().foreach(println)
Configuring Spark SQL Command Concurrency

In notebooks, you can run multiple Spark SQL commands in parallel. Enable concurrency by setting zeppelin.spark.concurrentSQL to true. The maximum number of commands that can run concurrently is controlled by zeppelin.spark.sql.maxConcurrency, a positive integer whose default value is 10 (see the example below).
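
For example, a sketch of the interpreter property values (the maximum of 5 is illustrative):

zeppelin.spark.concurrentSQL        true
zeppelin.spark.sql.maxConcurrency   5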

Enabling Python 2.7 in a Notebook

If your cluster is running Python 2.6, you can enable Python 2.7 in a notebook as follows:

  1. Add the following configuration in the node bootstrap script (node_bootstrap.sh) of the Spark cluster:

    source /usr/lib/hustler/bin/qubole-bash-lib.sh
    qubole-hadoop-use-python2.7
    
  2. Navigate to the Interpreter page. Under the Spark interpreter (%spark), set the zeppelin.pyspark.python property to /usr/lib/virtualenv/python27/bin/python.

    After setting the property, restart the Spark interpreter. The default value of this setting is python.

Understanding Spark Notebooks and Interpreters

QDS supports Spark Notebooks; the Spark cluster must be running before you can use them.

To use a Spark notebook, navigate to Notebooks from the main menu of the QDS UI. The topics under Notebooks provide more information about using Notebooks in QDS. Running Spark Applications in Notebooks provides more information on using a Spark Notebook.

Understanding Spark Notebook Interpreters

Note

For information about configuring and using interpreters in user mode, see Configuring a Spark Notebook and Using the User Interpreter Mode for Spark Notebooks.

You can create a Spark interpreter and define custom settings by clicking the Interpreter link near the top right of the page. Notebooks support the spark interpreter among others; spark is a superset of the pyspark, sparksql, and sparkscala interpreters. To see the list of available interpreter types, click Create and then pull down the menu under Interpreters on the resulting page.

Generally, an interpreter name is not editable. You can specify Spark settings, and Spark interpreters started by notebooks use the specified settings; the default values are optimized for each instance type. When you create a new Spark interpreter, some settings are shown by default with a description after you select the interpreter type. You can change these settings as required. One important setting is spark.qubole.idle.timeout, which is the number of minutes after which a Spark context shuts down if no job has run in that Spark context.
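
For example, to have an idle Spark context shut down after 60 minutes, the property would be set as follows (the value 60 is an illustrative assumption, not the QDS default):

spark.qubole.idle.timeout = 60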

Note

If you run only local processes such as Python or R commands, the Spark cluster can still shut down, because Qubole does not interpret these actions as running a Spark job. Moreover, because Spark uses lazy evaluation, only actions, not transformations, trigger a job. For more information on the difference between an action and a transformation, see the Spark for Data Engineers course on https://university.qubole.com.

This figure shows the default settings that get displayed after selecting the Spark interpreter.

_images/notebook-create-settings.png

Qubole has simplified interpreter properties by setting default values for some of them and hiding them from the UI. As a result, a new Spark interpreter shows a smaller set of properties. However, you can always override the values of the hidden properties.

Note

Simplified interpreter properties for existing Spark interpreters are not enabled for all users by default. Create a ticket with Qubole Support to enable this feature on the QDS account.

Running Spark Applications in Notebooks explains interpreters and how to associate interpreters with the notebooks.

Note

You can click the stop button on the Interpreters page to stop a Spark interpreter.

Binding an Interpreter to a Notebook

You must bind a notebook to use a specific interpreter.

  1. On the Notebooks page, click on the Gear icon for interpreter binding.

On the Settings page, the list of interpreters is displayed, as shown in the following figure.

    _images/InterpreterBinding.png

    The first interpreter on the list is the default interpreter.

  2. Click on any interpreter to bind it to the notebook.

  3. Click Save.

Interpreter Operations

From the Interpreters page, you can perform the following operations on a Spark interpreter by using the corresponding buttons in its top-right corner:

  • Edit the interpreter properties.
  • Stop the interpreter.
  • Restart the interpreter.
  • Remove the interpreter.
  • Access the log files.

The following illustration shows a sample Spark interpreter with these options.

_images/interpreter-operations.png
Viewing the Spark Application UI in a Notebook

Qubole supports viewing the Spark Application UI, which shows the list of jobs for paragraphs run in a notebook.

Perform the following steps to view the Spark Application UI:

  1. Navigate to the Interpreter page in a Spark Notebook.
  2. Expand the interpreter and click the spark ui button in its top-right corner, as shown in the following figure.
_images/SparkUIinInterpreter.png

Note

The Spark Application UI opens as a new popup. Disable the browser's pop-up blocker (if it is enabled) to see the Spark Application UI.

When Zeppelin starts or is restarted, the SparkContext is not running by default, so clicking spark ui shows an alert dialog stating that no application is running. Once a Spark application starts, clicking spark ui takes you to the Spark Application UI.

If a Spark application stops abruptly or is stopped explicitly, viewing it in the Spark Application UI redirects you to the last-completed Spark job.

Note

In a cluster using preemptible nodes exclusively, the Spark Application UI may display the state of the application incorrectly, showing the application as running even though the coordinator node, or the node running the driver, has been reclaimed by GCP. The status of the QDS command will be shown correctly on the Workbench page. Qubole does not recommend using preemptible nodes only.

Viewing the Spark UI in Notebook Paragraphs

When you run paragraphs in a notebook, you can watch the progress of the job or jobs generated by each paragraph within the paragraph as shown in the following figure.

_images/SparkApplicationUI1.png

Expand Spark Jobs to view the status of the job. Click on the i info icon to open the Spark Application UI/Job UI.

A sample Spark Application UI is as shown in the following figure.

_images/SparkApplicationUI.png
Using the Angular Interpreter

Qubole supports the Angular interpreter in notebooks. You can use the %angular interpreter with HTML code and JavaScript (JS) to render a custom UI. See Back-end Angular API for more information.

Note

Unlike other types of interpreters, the Angular interpreter does not honor properties or dependencies that you add.

To run HTML code or JS using the Angular interpreter, perform the following steps:

  1. Navigate to Notebooks in the QDS UI and select the active notebook on which you want to run the HTML code or JS.

  2. Add the HTML code or JS with %angular at the beginning of the paragraph. For example, you can add this in a paragraph.

    %angular <paragraph>Hello World</paragraph>
    
  3. Run the paragraph; on successful execution, you see the result. A sample result is shown in the following figure.

    _images/AngularRun.png
Configuring Bootstrap Spark Notebooks and Persistent Spark Interpreters

By default, an interpreter runs only when a command is run in the notebook. To avoid issues that can occur when no interpreter is running, Qubole supports a Spark interpreter (SI) that runs continuously without interruption. Qubole automatically restarts such an SI after a driver crash, a manual stop of the interpreter, or a programmatic stop using sc.stop. Such an interpreter is called a persistent SI.

Note

The persistent Spark interpreter is deprecated in Zeppelin version 0.8.

Qubole supports a bootstrap notebook configuration in an SI. A bootstrap notebook runs before any paragraph that uses the associated SI runs.

Configuring a Spark Bootstrap Notebook

Configure the bootstrap notebook feature by adding zeppelin.interpreter.bootstrap.notebook as an interpreter property with <notebook-id> as its value, where <notebook-id> is the system-generated ID of the notebook in which the bootstrap notebook property is being set. In the UI, the notebook ID is the read-only numerical value in the ID text field when you edit the notebook or view the notebook details.

See Viewing a Notebook Information and Configuring a Notebook for more information. In an open notebook, above all paragraphs, the ID is enclosed in parentheses after the notebook name. For example, if Notebook> NewNote(5376) appears at the top of an open notebook, then 5376 is the ID of the notebook named NewNote. See Tagging a Notebook and the explanation below for more illustrations.

In new notebooks, Qubole has set the value of zeppelin.interpreter.bootstrap.notebook to null.

The following figure shows the zeppelin.interpreter.bootstrap.notebook and zeppelin.interpreter.persistent with the default values in a new notebook’s SI.

_images/BootstrapNotebookProperty.png

The following figure shows the zeppelin.interpreter.bootstrap.notebook and zeppelin.interpreter.persistent with the enabled values in a notebook’s SI.

_images/ConfiguredBootstrapNotebook.png
Configuring a Persistent Spark Interpreter

Note

Since a persistent SI is always running, the cluster with a persistent SI is never idle. Hence, that cluster never terminates automatically.

Configure a persistent SI in a notebook by adding zeppelin.interpreter.persistent as an interpreter property with true as its value. (Refer to the figures above for the property and its value.) In new notebooks, Qubole sets the value of zeppelin.interpreter.persistent in the SI to false.

The bootstrap notebook and persistent SI properties are independent of each other.
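
For example, using the notebook ID 5376 from the example above, the two interpreter properties would be set as follows:

zeppelin.interpreter.bootstrap.notebook = 5376
zeppelin.interpreter.persistent = true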

Understanding Examples

Example 1

Suppose you want to automatically start an SI when the cluster starts and also start the Hive Thrift server under it. Doing this through the node bootstrap and job server would be tedious. Instead, you can use the bootstrap notebook and persistent SI features.

Example 2

When the Spark context (sc) restarts after going down due to an idle timeout or a similar reason, you have to manually restart the Hive server and load a lot of tables into the cache. You can avoid this by configuring the persistent SI property in the notebook.

Example 3

Using the bootstrap notebook and persistent SI features solves issues that can come up in a long-running streaming application written in Scala in an SI. You can write the bootstrap code in a way that handles checkpoint directories correctly.

Understanding Spark Interpreter Status

The Spark interpreter status provides insight into the state of the Spark application.

An interpreter can have the following statuses:

  • opening is displayed when the Spark interpreter is starting up.
  • opened is displayed when the Spark interpreter is up and ready to use.
  • closed is displayed when the Spark interpreter is down.
  • closed (with an additional idle timeout message) is displayed when the Spark interpreter was closed due to an idle timeout or inactivity.
  • accepted is displayed when the Spark interpreter is waiting for resources to be allocated.

For more information, see Understanding Spark Notebooks and Interpreters.

To view the interpreter status, click on the Spark Application widget on the Notebooks UI. The interpreter status is displayed as shown below.

_images/spark-status-notebook-accepted.png
Using the User Interpreter Mode for Spark Notebooks

Qubole supports legacy and user interpreter modes in a Spark cluster. A system administrator can configure the mode at the cluster level via the QDS UI or the REST API.

About User Mode

In user mode, a QDS Spark cluster provides a dedicated interpreter for each user who runs a notebook:

  • The interpreter is named as follows: user_<user's_email_name>_<user's_email_domain> (user_ is a literal constant). For example, for a user whose email address is abc@xyz.com, the interpreter name is set to user_abc_xyz.
  • As the user, you can also create additional interpreters.

Note

Spark interpreters support per-user AWS credentials when started in user mode. This feature is not enabled for all users by default; create a ticket with Qubole Support to enable it. When this feature is enabled, all new clusters are created in user interpreter mode.

Using Your Interpreter
  1. From the main menu of the QDS UI, navigate to Notebooks.
  2. Choose a notebook from the left panel, or choose NEW to create a new one. Make sure the cluster on which the notebook runs is configured for user mode, as follows:
    1. Click the gear icon that appears when you mouse over the notebook name in the left panel, then choose View Details. This shows you the name and ID of the cluster.
    2. Pull down the Clusters menu (near the top right of the screen) and find the cluster.
    3. Mouse over the cluster name and click on the eyeball icon that appears on the right. The resulting page should show Notebook Interpreter Mode set to user. If it doesn’t, you can assign the notebook to another cluster (click the gear icon as in step 2a above and choose Configure Notebook); or your system administrator can configure user mode for this cluster.
  3. Click on the name of the notebook in the left panel to load it.
  4. Click the gear icon next to Interpreters to see the list of available interpreters.
  5. If your interpreter (named as described above) is not at the top of the list, click on it to highlight it, then drag it to the top of the list and click Save.

You are now ready to run your notebook with your interpreter. Remember that the Spark cluster must be up and running, as indicated by a green dot next to the cluster name in the Clusters pull-down list.

Creating Your Own Interpreters

When user mode is configured for the cluster, you can create your own interpreters in addition to the interpreter that is automatically created for you.

To create and use an interpreter:

  1. Choose a notebook and make sure user mode is configured for its cluster, as described in steps 1-3 above.

  2. Click the Interpreters link near the top right. The resulting page shows you the current set of available interpreters.

  3. Click Create to create a new interpreter.

  4. On the resulting page, name the interpreter and choose the type and properties as prompted, then click Save.

    If per-user AWS credentials are enabled, specify your email address in the spark.yarn.queue property to create a user-level interpreter. You cannot modify the settings of non-user-level interpreters.

  5. The new interpreter now appears in the list of interpreters, with the properties you have defined. You can change the properties if you need to by clicking on the edit button on the right.

  6. Click the name of the notebook in the left panel to reload it, then configure the notebook to use your new interpreter as described in steps 4-5 above.

Using another User’s Interpreter

In user mode, interpreters can easily be shared. To use another user’s interpreter, simply drag it to the top of the list as described in steps 4-5 above.

Sharing Variable Settings

When you set a variable in one notebook, that variable will have the same value in all notebooks that use the same interpreter, even if another user is using the interpreter. For more information, see Notebook Interpreter Operations.

Effect of Existing Bindings on Interpreter Modes

When user mode is set for a Spark cluster:

  • When you run a notebook that you own, but that is bound to an interpreter in legacy mode, the notebook runs with that legacy interpreter. This is to ensure backward compatibility.
  • When you run a notebook bound to an interpreter owned by another user, QDS rebinds the notebook to your interpreter and runs it.
Creating a Spark Schedule

Navigate to the Scheduler page and click the New button in the left pane to create a job.

Note

Press Ctrl + / to see the list of available keyboard shortcuts. See Using Keyboard Shortcuts for more information.

See Qubole Scheduler for more information on using the user interface to schedule jobs.

Perform the following steps to create a Spark schedule:

  1. In General Tab:

    • Enter a name in the Schedule Name text field. This field is optional. If it is left blank, a system-generated ID is set as the schedule name.

    • In the Tags text field, optionally add up to six tags to group commands together. Tags help in identifying commands. Each tag can contain a maximum of 20 characters.

    • In the command field, select Spark Command from the drop-down list. In the language drop-down list, which contains Scala, Command Line, Python, SQL, R, and Notebook, Scala is selected by default; select a language from the list. Enter the query in the text field. The following figure illustrates a Spark Scala query.

      _images/sparkschedule.png

      You can schedule a notebook to run only if it’s associated with a cluster. Use the Notebooks page to compose notebook paragraphs.

      For more information, see Running Spark Notebooks in a Schedule.

  2. Add macros, and set parameters and notifications by following the steps as described in Creating a New Schedule.

See Viewing a Schedule and Editing and Cloning a Schedule for more information on viewing and editing a job.

Running Spark Notebooks in a Schedule

Using the Qubole Scheduler, you can create a schedule to run notebooks at periodic intervals without manual intervention. (Use the Notebooks section of the QDS UI to compose notebook paragraphs.)

Proceed as follows:

  1. Navigate to the Scheduler page, and click the +Create button in the left pane to create a schedule.

    Note

    Press Ctrl + / to see the list of available keyboard shortcuts. For more information, see Using Keyboard Shortcuts.

    For more information on using the QDS user interface to schedule jobs, see Qubole Scheduler.

    Important

    You can schedule a notebook to run even when its associated cluster is down.

  2. Create a schedule for running a notebook from the General Tab:

    1. Enter a name in the Schedule Name text field. This field is optional. If it is left blank, a system-generated ID is set as the schedule name.

    2. In the Tags text field, optionally add up to six tags to group commands together. Tags help in identifying commands. Each tag can contain a maximum of 20 characters.

    3. In the command field, select Spark Command from the drop-down list.

    4. Select a cluster from the drop-down list on the right of the page.

    5. Select Notebook from the drop-down list that contains Scala, Python, Command Line, SQL, R, and Notebook.

      A list of notebooks appears; these are the notebooks associated with the cluster you selected.

    6. Choose the notebook that you want to schedule to run.

    7. Optionally enter arguments in the Arguments field and pass the values in the Scheduler’s macros, for example:

    _images/ScheduleNote1.png
  3. Now follow the steps under Creating a New Schedule.

For more information about viewing and editing a schedule, see Viewing a Schedule and Editing and Cloning a Schedule.

Spark Streaming

Spark Streaming allows you to use Spark for stream processing. You write a streaming job the same way as you would write a Map job. At execution time, Spark breaks the input stream into a series of small jobs and runs them in batches. Inputs can come from sources such as HDFS, Kafka, Flume, and others. A typical output destination would be a file system, a database, or a dashboard.

Note

Kafka client jars are available in Qubole Spark as part of the basic package.

Running Spark Streaming on QDS

You can run Spark Streaming jobs on a QDS Spark cluster from the Workbench or Notebooks page of the UI.

For more information, see Composing Spark Commands in the Analyze Page and Running Spark Applications in Notebooks.

Points to note:

  • Qubole recommends Spark on YARN in client mode for Streaming; this is the QDS default.

  • Qubole has a 36-hour time limit on every command run. For streaming applications this limit can be removed. Contact Qubole Support to remove this limit.

  • If your application needs to receive multiple streams of data in parallel, create multiple input DStreams. This will create multiple receivers which will simultaneously receive multiple data streams. (But note the points about sizing and resources that follow.)

  • QDS supports a basic version of open-source streaming dynamic allocation for Spark Streaming applications; it is available only in Spark 2.0 and later versions. Based on the processing time required by the tasks, executors are added or removed, but executors with receivers are never removed. By default, autoscaling is disabled for Spark Streaming applications. Set spark.streaming.dynamicAllocation.enabled=true to enable it.

    Note

    Ensure that spark.dynamicAllocation.enabled is set to false when you set spark.streaming.dynamicAllocation.enabled=true; an error is thrown if both settings are enabled.

    These practices are recommended for better autoscaling:

    • It is best to start with a fairly large cluster and number of executors and scale down if necessary. (An executor maps to a YARN container.)
    • The number of executors should be at least equal to the number of receivers.
    • Set the number of cores per executor such that the executor has some spare capacity over and above what’s needed to run the receiver.
    • The total number of cores must be greater than the number of receivers; otherwise the application will not be able to process the data it receives.
  • Setting spark.streaming.backpressure.enabled to true allows Spark Streaming to control the receiving rate (on the basis of current batch scheduling delays and processing times) so that the system receives data only as fast as it can process it.

  • For the best performance, consider using the Kryo Serializer to convert between serialized and deserialized representations of Spark data. This is not the Spark default, but you can change it explicitly: set the spark.serializer property to org.apache.spark.serializer.KryoSerializer.

  • Developers can reduce memory consumption by un-caching DStreams when they are no longer needed.
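
The following minimal PySpark sketch shows how several of these settings might be applied together when building a streaming application; the socket source on localhost:9999, the 10-second batch interval, and the application name are illustrative assumptions, not QDS defaults.

# Minimal Spark Streaming sketch; the socket source, port, and batch interval are assumptions.
from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext

conf = (SparkConf()
        .setAppName("streaming-sketch")
        # Receive data only as fast as it can be processed.
        .set("spark.streaming.backpressure.enabled", "true")
        # QDS streaming autoscaling; spark.dynamicAllocation.enabled must then be false.
        .set("spark.streaming.dynamicAllocation.enabled", "true")
        .set("spark.dynamicAllocation.enabled", "false")
        # Kryo serialization for better performance.
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer"))

sc = SparkContext(conf=conf)
ssc = StreamingContext(sc, batchDuration=10)  # 10-second micro-batches

lines = ssc.socketTextStream("localhost", 9999)
word_counts = (lines.flatMap(lambda line: line.split())
                    .map(lambda word: (word, 1))
                    .reduceByKey(lambda a, b: a + b))
word_counts.pprint()

ssc.start()
ssc.awaitTermination()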

Spark Structured Streaming

Spark Structured Streaming is a scalable and fault-tolerant stream processing engine built on the Spark SQL engine. Structured Streaming provides fast, scalable, fault-tolerant, end-to-end exactly-once stream processing. You can express your streaming computation the same way you would express a batch computation on static data. You can use the Dataset/DataFrame API in Scala, Java, Python or R to express streaming aggregations, event-time windows, stream-to-batch joins, etc. The Spark SQL engine runs it incrementally and continuously, and updates the final result as streaming data continues to arrive. The computation is executed on the same optimized Spark SQL engine.

Spark Structured Streaming on Qubole:

  • Is supported on Spark 2.2 and later versions.
  • Supports long-running tasks. Unlike batch jobs, Spark Structured Streaming jobs do not have a 36-hour timeout limit.
  • Allows you to monitor the health of the job.
  • Provides end-to-end exactly-once fault-tolerance guarantees through checkpointing.
  • Supports various input data sources, such as Kafka.
  • Supports various data sinks, such as Kafka and Spark tables.
  • Rotates and aggregates Spark logs to prevent hard-disk space issues.
  • Supports Direct Streaming append to Spark tables.
  • Provides optimized performance for stateful streaming queries using RocksDB.
Supported Data Sources and Sinks

Spark on Qubole supports various input data sources and data sinks.

Data Sources
  • Kafka
Data Sinks
  • Kafka
  • Spark tables

Note

Kafka client jars are available in Spark in Qubole as part of the basic package.

For more information about Kafka, see Kafka Integration Guide.

The following sections describe how to run Spark Structured Streaming jobs and monitor their progress. You can also refer to examples for various data sources on the Notebooks page.

Running Spark Structured Streaming on QDS

You can run Spark Structured Streaming jobs on a Qubole Spark cluster from the Workbench and Notebooks pages as with any other Spark application.

You can also run Spark Structured Streaming jobs by using the API. For more information, see Submit a Spark Command.

Note

QDS has a 36-hour time limit on every command run. For streaming applications this limit can be removed. For more information, contact Qubole Support.

Running the Job from the Workbench Page
  1. Navigate to the Workbench page.
  2. Click + Create New.
  3. Select the Spark tab.
  4. Select the Spark language from the drop-down list. Scala is the default.
  5. Select Query Statement or Query Path.
  6. Compose the code and click Run to execute.

For more information on composing a Spark command, see Composing Spark Commands in the Analyze Page.

Running the Job from the Notebooks Page
  1. Navigate to the Notebooks page.
  2. Start your Spark cluster.
  3. Compose your paragraphs and click the Run icon for each of these paragraphs in contextual order.

A sample program is available on the Notebooks page.
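
As an illustration, a minimal Structured Streaming paragraph might look like the following sketch; the Kafka broker address, topic name, and checkpoint location are assumptions, not values provided by QDS.

# Minimal Structured Streaming sketch; broker, topic, and checkpoint path are assumptions.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("structured-streaming-sketch")
         .getOrCreate())

# Read a stream of events from an (assumed) Kafka topic.
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "kafka-broker:9092")
          .option("subscribe", "events")
          .load())

# Maintain a running count per key; this aggregation is what the state store tracks.
counts = (events.selectExpr("CAST(key AS STRING) AS key")
                .groupBy("key")
                .count())

# Write the running counts to the console, checkpointing to an (assumed) GCS path.
query = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .option("checkpointLocation", "gs://your-bucket/checkpoints/event-counts")
         .start())

query.awaitTermination()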

Monitoring the Health of Streaming Jobs

You can monitor the health of the jobs or pipeline for long running ETL tasks to understand the following information:

  • Input and output throughput of the Spark cluster to prevent overflow of incoming data.
  • Latency, which is the time taken to complete a job on the Spark cluster.

When you start a streaming query in a notebook paragraph, the monitoring graph is displayed in the same paragraph.

The following figure shows a sample graph displayed on the Notebooks page.

_images/streaming-query-notebooks.png

You can also monitor streaming queries using the Spark UI from the Workbench or Notebooks page, and from Grafana dashboards.

Monitoring from the Spark UI
  1. Depending on the UI you are using, perform the appropriate steps:

    • From the Notebooks page, click the Spark widget at the top right and then click Spark UI.
    • From the Workbench page, click the Logs or Resources tab, and then click the Spark Application UI hyperlink.

    The Spark UI opens in a separate tab.

  2. In the Spark UI, click the Streaming Query tab.

The following figure shows a sample Spark UI with details of the streaming jobs.

_images/spark-streaming-cluster-webpage.png
Monitoring from the Grafana Dashboard

Note

Grafana dashboard on Qubole is not enabled for all users by default. Create a ticket with Qubole Support to enable this feature on the QDS account.

  1. Navigate to the Clusters page.
  2. Select the required Spark cluster.
  3. Navigate to Overview >> Resources >> Prometheus Metrics.

The Grafana dashboard opens in a separate tab.

The following figure shows a sample Grafana dashboard with the details.

_images/Grafana-Streaming-Dashboard.png
Examples

You can refer to the examples that show streaming from various data sources on the Notebooks page of QDS or from the Discover Qubole Portal.

You can click on the examples listed in the following table and click Import Notebook. Follow the instructions displayed in the Import Notebook pop-up box.

Data Source     Examples
Kafka Source    Kafka Structured Streaming

You can also access the examples from the Notebooks page of QDS.

  1. Log in to https://api.qubole.com/notebooks#home (or any other env URL).
  2. Navigate to Examples >> Streaming.
  3. Depending on the data source, select the appropriate examples from the list.
Limitations
  • The Logs pane displays only the first 1500 lines. To view the complete logs, you must log in to the corresponding cluster.
  • Historical logs, events, and dashboards are not displayed.
Spark Structured Streaming on Qubole in Production
Optimize Performance of Stateful Streaming Jobs

Spark Structured Streaming on Qubole supports RocksDB state store to optimize the performance of stateful structured streaming jobs. This feature is supported on Spark 2.4 and later versions.

You can enable the RocksDB-based state store by setting the following Spark configuration before starting the streaming query: --conf spark.sql.streaming.stateStore.providerClass=org.apache.spark.sql.execution.streaming.state.RocksDbStateStoreProvider.

Also set --conf spark.sql.streaming.stateStore.rocksDb.localDir=<tmp-path>, where <tmp-path> is a path in local storage.

The default state store implementation is memory based, and performance degrades significantly due to JVM GC issues when the number of state keys per executor grows to a few million. In contrast, the RocksDB-based state store can easily scale to 100 million keys per executor.

You cannot change the state store implementation between restarts of the same query. If you want to change the state store, you must use a new checkpoint location.
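
In a notebook or application, the same configuration might be applied when building the Spark session, as in this sketch; the local directory path is an assumed scratch location.

# Sketch: enable the RocksDB state store provider described above.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("stateful-streaming-sketch")
         .config("spark.sql.streaming.stateStore.providerClass",
                 "org.apache.spark.sql.execution.streaming.state.RocksDbStateStoreProvider")
         # Assumed local scratch directory for RocksDB state files.
         .config("spark.sql.streaming.stateStore.rocksDb.localDir", "/tmp/rocksdb-state")
         .getOrCreate())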

Managing Query Engines

See also:

Airflow

Note

See QDS Components: Supported Versions and Cloud Platforms for up-to-date information on Airflow support in QDS.

This section explains how to deploy and use Airflow. It covers the following topics:

Introduction to Airflow in Qubole

Airflow is a platform for programmatically authoring, scheduling, and monitoring workflows. It supports integration with third-party platforms. You can author complex directed acyclic graphs (DAGs) of tasks inside Airflow. It comes packaged with a rich feature set, which is essential to the ETL world. The rich user interface and command-line utilities make it easy to visualize pipelines running in production, monitor progress, and troubleshoot issues as required.

See QDS Components: Supported Versions and Cloud Platforms for up-to-date information on support for Airflow in QDS.

Airflow Principles
  • Dynamic: Airflow pipelines are configured as code (Python). (Pipeline is synonymous with workflow in the ETL world.) Configuring pipelines as code allows dynamic pipeline generation and lets you write code that instantiates pipelines dynamically.
  • Extensible: You can easily define your own Airflow operators and executors, and extend the library to fit the level of abstraction that suits your environment.
  • Scalable: Airflow has a modular architecture and uses a message queue to orchestrate an arbitrary number of workers. Airflow can be scaled to infinity.

Qubole Airflow is derived from Apache (Incubator) Airflow versions 1.10.0 and 1.10.2. Airflow as a service provides the following features:

  • A single-click deployment of Airflow and other required services on a Cloud
  • Cluster and configuration management
  • Linking Airflow with QDS
  • Visualization of Airflow monitoring dashboards

Note

Qubole supports file and Hive table sensors that Airflow can use to programmatically monitor workflows. For more information, see file-partition-sensors and sensor-api-index.

Qubole Operator

Qubole provides an Airflow operator called QuboleOperator. You can use it just like any other Airflow operator. When the operator executes in the workflow, it submits a command to QDS and waits until the command completes. You can execute any valid Qubole command from the QuboleOperator. In addition to the required Airflow parameters such as task_id and dag, other key-value arguments are required to submit a command to QDS. For example, to submit a Hive command to QDS, define the QuboleOperator as shown below:

hive_operator = QuboleOperator(task_id='hive_show_table', command_type='hivecmd', query='show tables',
                               cluster_label='default', fetch_logs=True, dag=dag)

For the supported command types and their required parameters, see the detailed documentation of the QuboleOperator class in the Airflow codebase. See Qubole Operator Examples for DAG examples that use the QuboleOperator in various use cases.
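
The snippet above assumes that a dag object already exists. A minimal sketch of such a DAG definition is shown below; the DAG ID, start date, and schedule are illustrative assumptions.

# Minimal DAG sketch; the DAG ID, start date, and schedule are assumptions.
from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.qubole_operator import QuboleOperator

dag = DAG(
    dag_id='qubole_hive_example',
    start_date=datetime(2020, 1, 1),
    schedule_interval='@daily',
)

hive_operator = QuboleOperator(
    task_id='hive_show_table',
    command_type='hivecmd',
    query='show tables',
    cluster_label='default',
    fetch_logs=True,
    dag=dag,
)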

For more information, see Questions about Airflow.

Configuring an Airflow Cluster

Configure an Airflow cluster as described under Configuring the Cluster.

This page also provides information on the following topics:

Configuring the Cluster

Navigate to the Clusters page. Click New to add a new cluster. Select Airflow as the cluster type. See Managing Clusters for detailed instructions on configuring a QDS cluster. For Airflow, note the following:

  • Airflow Version: The default version is 1.10.2 QDS. AWS also supports version 1.10.0. Airflow version 1.8.2 is deprecated; it is still visible in the cluster UI, but you cannot create a new cluster with it.

  • Python Version: Qubole supports Python versions 2.7, 3.5, and 3.7 on Airflow clusters. Python versions 3.5 and 3.7 are supported with Airflow version 1.8.2 or later. The default Python version is 2.7. However, this field is not visible unless you create a ticket with Qubole Support and have it enabled on the QDS account.

    When you create an Airflow cluster with Python version 3.5, it gets automatically attached to a Package Management environment.

  • Data Store: Select the data store from the drop-down list. Currently, the MySQL and Amazon Aurora-MySQL data stores are supported on Airflow clusters.

  • Fernet Key: Encryption key (32 url-safe base64 encoded bytes) for sensitive information inside the Airflow database, such as user passwords and connections. QDS auto-generates a Fernet key if you do not specify it here.

  • Node Type: An Airflow cluster is actually a single node, so there are no Coordinator or Worker nodes. Select the instance type from the drop-down list.

  • Autoscaling is not supported in Airflow clusters, and, for AWS, only On-Demand clusters are supported.

Under Advanced Configuration, do the tasks described under:

To add more workers in an Airflow cluster, see Configuring a Multi-node Airflow Cluster.

Configuring Airflow Settings

Qubole provides an Airflow Recommended Configuration, as shown in the QDS UI under the Advanced tab. You can override this configuration by adding new values in the Override Airflow Configuration Variables text box. See also Using or Overriding Default Airflow Settings.

Starting an Airflow Cluster

You can start a cluster by clicking the Start button on the Clusters page. See Understanding Cluster Operations for more information.

After starting an Airflow cluster, you can find the Airflow DAGs, logs, and configuration file under /usr/lib/airflow.

Populating a Default or Custom Authentication Token in Airflow

After the Airflow cluster is successfully started, a default QDS connection (qubole_default) is created (if it does not already exist). Its host parameter is set to the Qubole API endpoint for your Cloud, and its password is left empty. The password is the QDS authentication token of a QDS account user; you can choose the default authentication token and populate it using the Airflow Webserver Connection Dashboard.

You can create custom Qubole Airflow connections through the Airflow Webserver Connection Dashboard for different users. You can use them in the Qubole Operator to submit commands in the corresponding accounts.

You can use a custom connection (for example, my_qubole_connection) in the Airflow DAG script by setting the qubole_conn_id parameter in the Qubole Operator. If this parameter is not set, the Qubole Operator uses the qubole_default connection. The following sample code shows how to set the qubole_conn_id parameter.

qubole_task = QuboleOperator(
    task_id='hive_show_table',
    command_type='hivecmd',
    query='show tables',
    qubole_conn_id='my_qubole_connection',  # takes qubole_default as the default connection
    cluster_label='default',
    dag=dag
)
Terminating an Airflow Cluster

An Airflow cluster does not automatically stop when it is left unused. Click the stop button to terminate the cluster. See Understanding Cluster Operations for more information.

Editing an Airflow Cluster

Click the edit button to modify the configuration. See Understanding Cluster Operations for more information. Configuration changes cannot be pushed to a running Airflow cluster.

Using or Overriding Default Airflow Settings

By default, Qubole sets CeleryExecutor as the executor mode. CeleryExecutor allows you to scale the pipeline vertically on the same machine by increasing the number of workers; see also Configuring a Multi-node Airflow Cluster. Celery needs a message broker and a result backend to store state and results.

Qubole ships RabbitMQ pre-installed inside an Airflow cluster, and sets it as the default message broker for Airflow. For the result backend, Qubole uses the configured Airflow data store for storing Celery data. If you want to use your own message broker and backend, you can configure celery.broker_url and celery.celery_result_backend in the Override Airflow Configuration Variables cluster configuration field, as shown in the example below.
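
For example, the overrides might look like the following; the broker and database hosts, user, password, and database name are placeholders, and the exact URL formats depend on your broker and result backend.

celery.broker_url=amqp://<user>:<password>@<broker-host>:5672//
celery.celery_result_backend=db+mysql://<user>:<password>@<db-host>/<database>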

User Level Privileges

In Qubole, Airflow clusters offer these two simple authorization methods:

Users of Airflow Version 1.10.0
Roles
  • User - A user who can view all the tabs except the Admin tabs on the Airflow UI.
  • Admin - An admin can view all tabs. A user with Update access on the cluster is considered an Admin, while users without Update access are considered Users.
Users of Airflow Version 1.10.2QDS

Airflow version 1.10.2QDS comes with Role-Based Access Control (RBAC). RBAC removes the extra work required to create users and manage roles or policies. If you are using Airflow version 1.10.2QDS, by default you have access to an Airflow cluster running in your organization's account, and you are automatically mapped to a default role within that Airflow cluster. You can have various cluster-level permissions on Qubole; based on these permissions, your role is mapped on the Airflow web server.

Roles
  • Admin - Assigned on the Airflow web server to a user who has permission to view, manage, delete, terminate, clone, and update the Airflow cluster.
  • Op - Assigned to a user who has permission only to start the Airflow cluster.
  • User - Assigned to a user who has only read permission on the Airflow cluster.
  • Viewer - Assigned to a user who has none of the permissions mentioned above.

To override the roles, the cluster administrator can create a user on Airflow's web server and assign the desired role, or change the default permissions for the various roles: Administrator, Op, User, Viewer, and Public. In the following example, an administrator assigns User 2, who was mapped to the Op role by default, to the Administrator role.

_images/user-permission.png

Note

To know more about RBAC in Airflow, see Introducing Role-Based Access Control in Apache Airflow.

Configuring a Multi-node Airflow Cluster

Currently, Airflow clusters contain only a single node by default. If you want more workers, you can scale vertically by selecting a larger instance type and adding more workers, using the cluster configuration override parameter celery.celeryd_concurrency. You can do this while the cluster is running; choose Update and Push on the Clusters page to implement the change.
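
For example, to run eight Celery worker processes on the node (the value 8 is an illustrative assumption), you might add the following override:

celery.celeryd_concurrency=8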

To scale horizontally, you can use a workaround to add more workers to the existing cluster.

Create a new user in rabbitmq running on the first cluster; enter a shell command on the Workbench page:

sudo /usr/sbin/rabbitmqctl add_user new_user new_password;
sudo /usr/sbin/rabbitmqctl set_user_tags new_user administrator;
sudo /usr/sbin/rabbitmqctl set_permissions -p / new_user ".*" ".*" ".*"

After running the above shell commands, go to the Clusters page, clone the parent Airflow cluster, and override the broker details for the new cluster as follows:

celery.broker_url=amqp://new_user:new_password@<master_dns_of_first_cluster>//

Once the new cluster is up and running, stop the Airflow scheduler running on the new cluster.

sudo monit stop scheduler

Note the following:

  • The parent Airflow cluster and its cloned cluster must use the same data store and Fernet key.
  • You must sync the DAG files on the new cluster.
  • You must allow inbound TCP requests from the cloned cluster over 5672 and 15672 ports to the parent Airflow cluster.

Qubole plans to add multi-node Airflow cluster support in the future.

Deployments on Airflow Clusters
Through Cloud Storage
Managing DAG Explorer Permissions

To use the DAG Explorer on the Airflow cluster, you require the following permissions:

  • View Files: Provide Read access on the Object Storage and Airflow Cluster.
  • Download Files: Provide Download access on the Object Storage and Read access on the Airflow cluster.
  • Upload Files: Provide Upload access on the Object Storage and Read access on the Airflow cluster. For Airflow version 1.10.2 QDS, you must also have Cluster Admin access, granted by updating permissions on the cluster.
  • Delete Files: Provide Delete access on the Object Storage and Update Permission on the Airflow cluster.

For more information about providing permissions on the Airflow cluster and Object Storage, see Managing Cluster Permissions through the UI and Managing Access Permissions and Roles, respectively.

Uploading a DAG on an Airflow Cluster
Uploading a DAG

You can upload and download Airflow Python DAG files to the account's default storage location, edit them in place, and sync them with Airflow clusters periodically (in the background) from the Airflow cluster page. The files sync automatically and immediately with new clusters. For existing clusters, either restart the cluster or wait for the files to sync, which happens within 5 minutes.

Perform the following steps to upload a DAG:

  1. Navigate to the Clusters page and click the Airflow cluster that you want to work with.
  2. Click Dag Explorer from the left pane.
_images/dag.png

The dag_logs, dags, plugins, and process_logs folders appear.

  3. Click the uploaddag link against the dags folder and select the file you want to upload. Once the upload is complete, you can view the file under the dags folder.

  4. Verify the File Path and the DAG contents in the right pane and click Save.
_images/filepath.png
Deleting a DAG on an Airflow Cluster

For information on how to delete a DAG on an Airflow Cluster, see Deleting a DAG on an Airflow Cluster.

Through CI/CD (GitHub, GitLab, or Bitbucket)
  1. Configure GitHub, GitLab, or Bitbucket as Repository
  2. Create a Configuration File
  3. Deploy DAGs and Plugins using GitHub, GitLab, or Bitbucket as Repository
Configure GitHub, GitLab, or Bitbucket as Repository
Prerequisite

To select Git (GitHub or GitLab) or Bitbucket as the deployment repository for DAGs, you should first configure the Version Control Settings on the Account Settings page. For more information on how to configure GitHub, GitLab, or Bitbucket version control settings, see Configuring Version Control Systems.

Configuration

Follow the instructions below to manage your Airflow DAGs through Git or Bitbucket:

  1. Navigate to Home > Cluster.

  2. On the Clusters page, click Edit for the Airflow cluster to change its deployment repository.

  3. On the cluster details page, select the Advanced Configuration tab.

  4. From the Deployment Source drop-down list (under the AIRFLOW CLUSTER SETTINGS section), select GIT Repository.

    _images/github-repo.png
  5. Enter the repository location in the Repository URL field.

  6. Enter the branch name in the Repository Branch field.

  7. To create a new Airflow cluster or edit an existing one, click Create or Update and Push respectively.

You have successfully configured Git (GitHub or GitLab) or Bitbucket as the deployment repository for Airflow DAGs.

Create a Configuration File

After you configure the repository, you need a configuration file (.yaml) in your repository that defines a set of instructions for the deployment.

  1. Create a configuration file (.yaml) named qubole-airflow.yaml and ensure that it defines all the parameters shown in the example below. The presence of this file allows the Config Validation step to pass during deployment.
  2. Specify the requirements parameter in the configuration file; it is the relative path to the file listing the required packages. During deployment, the packages are installed in the Package Installation step and your Git repository is cloned onto the cluster in the Cloning Repository step.
  3. Specify the scripts parameter, which is used in the Running Scripts step during deployment. scripts is an array of simple commands; they run before the DAGs and plugins are copied.
  4. Specify the dags parameter to provide the relative path to the DAG files. This folder of DAG files is recursively copied to the DAGs folder on the cluster through Copy Dags during the deployment.
  5. Specify the plugins parameter to provide the relative path to the plugin files. This folder of plugin files is recursively copied to the plugins folder on the cluster through Copy Plugin during the deployment.

Note

sync-directory and dags are the mandatory parameters.

Example
steps:
  requirements: tempdir/requirements.txt
  scripts:
    - 'ls'
    - 'python setup.py'
  sync-directory:
    dags: testproject/dags
    plugins: testproject/plugins
Deploy DAGs and Plugins using GitHub, GitLab, or Bitbucket as Repository

After you create the configuration file, follow the instructions below to start deployment from Git:

  1. On the cluster details page, select the Deployment tab in the left pane.

    Important

    To start deployment from GitHub, GitLab, or Bitbucket, the respective Airflow cluster must be up and running.

  2. Click Deploy. A popup window appears to ask you for confirmation.

  3. Click OK. QDS starts deployment.

    _images/airflow-git.png
Upgrading Airflow Clusters
Upgrading Airflow Cluster Version on QDS

Airflow 1.8.2 is supported with MySQL 5.6 or later, and Airflow 1.10.0 and above requires MySQL 5.6.4 or later. If you are not using the MySQL data store on the cluster, ensure that you are using a compatible version before you proceed with the upgrade.

Follow these instructions to upgrade the Airflow cluster version:

  1. Log in to the Airflow cluster.

  2. Execute the following commands to stop the Airflow scheduler and workers.

    sudo monit stop scheduler; sudo monit stop worker;
    
  3. Keep a backup of all the DAGs that are present in the $AIRFLOW_HOME/dags directory.

  4. Check that all workers have completed the tasks they were executing. Run ps -ef | grep airflow and wait until no airflow run commands are running. This ensures that no task is left in an inconsistent state.

  5. Navigate to the Clusters page.

  6. Clone the existing cluster and update the cloned cluster's Airflow version to the desired version.

  7. Log in to the cloned cluster.

  8. Restore the backed-up DAGs under $AIRFLOW_HOME/dags. You have successfully upgraded the Airflow cluster version on QDS.

Note

When upgrading the Airflow cluster version, it is recommended to upgrade through the intermediate versions, one version at a time. For example, to upgrade from version 1.8.2 to 1.10.2, first upgrade to 1.10.0 and then to 1.10.2. Upgrading directly from 1.8.2 to 1.10.2 may cause issues during the database migration and leave the database in an inconsistent state.

Upgrading Long-running Airflow Clusters

If your Airflow cluster has been running for a long time with multiple DAGs, some tables, such as task_instance, will contain a large amount of data. This causes a severe delay in the database migration (much longer than the allowed cluster start time). As a result, cluster startup is aborted abruptly, leaving the database in an inconsistent state. To avoid this, run the database migration manually before you upgrade.

Follow these instructions to upgrade long-running Airflow clusters:

  1. Copy the values of the following settings from your old cluster.

    • core.sql_alchemy_conn
    • celery.result_backend
  2. Take a backup of your DAGs and shut down the old cluster.

  3. Start a new Airflow cluster with the newer version of Airflow and default cluster Datastore.

    Important

    Do not select any external Datastore.

    This is only a temporary cluster to run the migrations. Ensure that the Fernet key is the same as that of the old cluster.

  4. Modify the values of the copied settings (refer to Step 1) in the new cluster’s /usr/lib/airflow/airflow.cfg file.

  5. Run the following commands on the temporary cluster to execute the migration. To avoid session timeout issues, use tmux, or use the Analyze page to run a Shell command on this new cluster.

    source /usr/lib/airflow/airflow/qubole_assembly/scripts/virtualenv.sh activate
    airflow upgradedb
    source /usr/lib/airflow/airflow/qubole_assembly/scripts/virtualenv.sh deactivate
    

    This might take some time based on the size of the database.

  6. After completion, shut down the temporary cluster.

  7. Clone the old Airflow cluster to create one with a newer version of Airflow. Change the airflow version while cloning.

  8. Start the new cluster to get things up and running on the new Airflow version. You have successfully upgraded a long-running Airflow cluster.

Deleting a DAG on an Airflow Cluster

You can delete a DAG on an Airflow Cluster from the Airflow Web Server.

Required Permissions

To delete a DAG on an Airflow cluster, you require Delete access on the Object Storage and Update Permission on the Airflow cluster. For more information on the DAG explorer permissions, see Managing DAG Explorer Permissions.

Deleting a DAG on an Airflow Cluster from Airflow Web Server

Before you delete a DAG, ensure that it is either in the Off state or has no active DAG runs. If the DAG has any active runs pending, mark all tasks under those DAG runs as completed.

Follow these instructions to delete a DAG on an Airflow cluster from the Airflow Web Server:

  1. From the Clusters page, click the Resources drop-down list against the Airflow cluster, and select Airflow Web Server. The Airflow Web Server is displayed as shown in the illustration.
_images/delete-dag1.png
  2. Click the DAGs tab to view the list of DAGs.
  3. Click the delete button under the Links column against the required DAG.
  4. Click OK to confirm.

By default, it takes 5 minutes for a deleted DAG to disappear from the UI. You can modify this time limit by configuring the scheduler.dag_dir_list_interval setting in the airflow.cfg file, as shown below. This time limit does not apply to sub-DAG operators.
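
For example, to make the default 5-minute interval explicit in airflow.cfg (the value is expressed in seconds), the setting would look like this:

[scheduler]
dag_dir_list_interval = 300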

Note

It is recommended not to decrease the time limit value substantially because it might lead to high CPU usage.

Monitoring an Airflow Cluster

You can monitor an Airflow cluster by using the Airflow Web Server and Celery Web Server. The web server URLs are available in the Resources column of a running Airflow cluster.

Qubole supports monit within an Airflow cluster to monitor and automatically start the Webserver, Rabbitmq, and Celery services in case of a failure.

Monitoring through Airflow Web Server

In the Clusters tab, from the cluster Resources of a running Airflow cluster, click Airflow Web Server. The Airflow Web Server is displayed as shown in the following illustration.

_images/AirflowWebServer.png

Click the Qubole Operator DAG as shown in the following figure.

_images/DAGQuboleOperator.png

Click a command in the chart to see the Goto QDS link, as shown in the following figure.

_images/DAGQDSCommandChart.png

Qubole Operator tasks are linked with QDS commands. The following features are available to facilitate linking:

  • Goto QDS: An external link pointing to the corresponding QDS command when you visualize the Qubole Operator tasks of a DAG run in the web server.
  • Filtering Airflow QDS Commands: Any QDS command triggered through the Airflow cluster contains three tags: dag_id, task_id, and run_id. These can be used to filter QDS commands triggered from Airflow at various levels (DAG, task, or a particular execution).
Monitoring through the Celery Dashboard

In the Clusters tab, from the cluster Resources of a running Airflow cluster, click Celery Dashboard to monitor the Celery workers. The Celery server runs on port 5555.

Monitoring through Ganglia Metrics

When Ganglia Metrics is enabled, you can see the Ganglia Metrics URL from the cluster Resources of a running Airflow cluster. The dashboard shows system metrics such as CPU and disk usage.

Monitoring through Logs

You can monitor an Airflow cluster using the following types of logs:

  • Airflow logs: Airflow DAG logs are now moved to /media/ephemeral0/logs/airflow/dags, and a symlink is created to the old location, which is $AIRFLOW_HOME/logs. As a result, the local disk space is not consumed by the logs.
  • Airflow services logs: Logs for services such as scheduler, webserver, Celery, and so on are under /media/ephemeral0/logs/airflow.
  • Airflow logs (remote): All Airflow logs are uploaded to the remote storage location provided in the account. These logs can be found at <default location>/airflow/<cluster-id>/dag_logs/ and <default location>/airflow/<cluster-id>/process_logs/<cluster_inst_id>/.
Monitoring through RabbitMQ Dashboard

You can monitor an Airflow cluster through the RabbitMQ dashboard. To use the RabbitMQ dashboard, run the following shell commands on the Analyze page to create a new user.

sudo /usr/sbin/rabbitmqctl add_user new_user new_password;
sudo /usr/sbin/rabbitmqctl set_user_tags new_user administrator;
sudo /usr/sbin/rabbitmqctl set_permissions -p / new_user ".*" ".*" ".*"

After you create the user, you can log in to the RabbitMQ dashboard with the newly created credentials (username and password) on the login page.

_images/rabbitmq_dashboard.png
Enabling notifications for Airflow

You can receive email alerts about Airflow processes by enabling notifications in the Airflow configuration.

  1. Navigate to the Clusters page.
  2. Click Edit against the required Airflow cluster to edit the configuration.
  3. Click Advanced Configuration on the Edit Cluster Settings page.
  4. Under the AIRFLOW CLUSTER SETTINGS section, add the following variables in the Override Airflow Configuration Variables field:
    • core.alert_via_email=True
    • core.alert_emails=email1, email2, email3...

A sample configuration is shown in the following illustration.

_images/notification-settings.png
  5. Click Update and Push for the changes to take effect immediately, or click Update only for the changes to take effect at the next cluster restart.
Using Default or Custom Failure Handling

Airflow executors submit tasks to Qubole and keep track of them. These executors (task instances) also register heartbeats with the Airflow database periodically. A task instance is marked as a zombie if it fails to register a heartbeat within a configured amount of time.

The Airflow scheduler checks for zombie processes in the system and, if necessary, invokes the failure handler for the task in question. Qubole ships a default failure handler with the Qubole operator; it checks the corresponding Qubole command and kills it if necessary. If the command succeeds, the handler changes the task-instance state to Success.

You can override this behaviour by providing a custom failure-handler in the task definition, as shown in the following example:

def my_failure_handler(context):
    """
    Custom logic to handle command failures.
    """
    # For example, inspect context['task_instance'] and send an alert.

hive_task = QuboleOperator(task_id='hive_show_table',
                           command_type='hivecmd',
                           query='show tables',
                           on_failure_callback=my_failure_handler,
                           dag=dag)
Qubole Operator Examples

For Qubole Operator API information, see Qubole Operator API.

For a real ETL use case using the Qubole Operator, see the Readme.

The following examples illustrate the use of the Qubole Operator.

# Importing Qubole Operator in DAG
from airflow.contrib.operators.qubole_operator import QuboleOperator

# Hive Command - Inline query, Bonus - Attaching command tags & qubole connection id
QuboleOperator(
  task_id='hive_inline',
  command_type='hivecmd',
  query='show tables',
  cluster_label='default',
  tags='aiflow_example_run',  # Attach tags to Qubole command, auto attaches 3 tags - dag_id, task_id, run_id
  qubole_conn_id='qubole_default',  # Connection ID to submit commands inside QDS, if not set **qubole_default** is used
  dag=dag)


# Hive Command - GS Script location, Bonus - Qubole Macros, Email Notifications
QuboleOperator(
    task_id='hive_gs_location',
    command_type="hivecmd",
    script_location="gs://public-qubole/qbol-library/scripts/show_table.hql",
    notify=True, # Sends email on the command completion, either success or failure, notification settings as set in the Qubole account.
    # Escape the macro values.
    macros='[{"date": "\\"\\""}, {"name" : "\\"abc\\""}]', # Applies Qubole Macros to gs script.
    tags=['tag1', 'tag2'],
    dag=dag)


# Hive Command - Add Hive Resources
QuboleOperator(
    task_id='hive_add_jar',
    command_type='hivecmd',
    query='ADD JAR gs://paid-qubole/jars/json-serde/json-serde-1.1.7.jar',
    cluster_label='default',
    dag=dag)


# Jupyter Notebook Command
QuboleOperator(
  task_id='jupyter_cmd',
  command_type="jupytercmd",
  cluster_label='default',
  path=<path/to/jupyternotebook/on/qds>, # Right click on the notebook in Jupyter and click Copy Path to get the path
  arguments='{"name":"hello world"}',
  dag=dag)

# Shell Command - GS Script Location with arguments
QuboleOperator(
    task_id='shell_cmd',
    command_type="shellcmd",
    script_location="gs://public-qubole/qbol-library/scripts/shellx.sh",
    parameters="param1 param2",
    dag=dag)

# Shell Command - Inline query with files to copy in working directory
QuboleOperator(
    task_id='shell_cmd',
    command_type="shellcmd",
    script="hadoop dfs -lsr gs://paid-qubole/",
    files="gs://paid-qubole/ShellDemo/data/excite-small.sh,gs://paid-qubole/ShellDemo/data/excite-big.sh",
    dag=dag)


# Pig Command with gs script location and arguments
QuboleOperator(
    task_id='pig_cmd',
    command_type="pigcmd",
    script_location="gs://public-qubole/qbol-library/scripts/script1-hadoop-gs-small.pig",
    parameters="key1=value1 key2=value2", # Note these are space separated
    dag=dag)


# Hadoop Command - Inline Custom Jar command
QuboleOperator(
    task_id='hadoop_jar_cmd',
    command_type='hadoopcmd',
    sub_command='jar gs://paid-qubole/HadoopAPIExamples/jars/hadoop-0.20.1-dev-streaming.jar -mapper wc -numReduceTasks 0 -input gs://paid-qubole/HadoopAPITests/data/3.tsv -output gs://paid-qubole/HadoopAPITests/data/3_wc',
    cluster_label='default',
    dag=dag)


# DbTap Query
QuboleOperator(
    task_id='db_query',
    command_type='dbtapquerycmd',
    query='show tables',
    db_tap_id="2064",
    dag=dag)

# Db Export Command in Mode 1 - Simple Mode
QuboleOperator(
    task_id='db_export',
    command_type='dbexportcmd',
    mode=1,
    hive_table='default_qubole_airline_origin_destination',
    db_table='exported_airline_origin_destination',
    partition_spec='dt=20110104-02',
    dbtap_id="2064",
    use_customer_cluster="true",
    customer_cluster_label="default",
    dag=dag)

# Db Export Command in Mode 2 - Advanced Mode
QuboleOperator(
    task_id='db_export_mode_2',
    command_type='dbexportcmd',
    mode=2,
    db_table='mydb.mydata',
    dbtap_id="10942",
    export_dir="gs://mybucket/mydata.csv",
    fields_terminated_by="\\0x9",
    use_customer_cluster="true",
    customer_cluster_label="default",
    dag=dag)


# Db Import Command in Mode 1 - Simple Mode
QuboleOperator(
    task_id='db_import',
    command_type='dbimportcmd',
    mode=1,
    hive_table='default_qubole_airline_origin_destination',
    db_table='exported_airline_origin_destination',
    where_clause='id < 10',
    parallelism=2,
    dbtap_id="2064",
    use_customer_cluster="true",
    customer_cluster_label="default",
    dag=dag)

prog = '''
import scala.math.random

import org.apache.spark._

/** Computes an approximation to pi */
object SparkPi {
def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Spark Pi")
    val spark = new SparkContext(conf)
    val slices = if (args.length > 0) args(0).toInt else 2
    val n = math.min(100000L * slices, Int.MaxValue).toInt // avoid overflow
    val count = spark.parallelize(1 until n, slices).map { i =>
    val x = random * 2 - 1
    val y = random * 2 - 1
    if (x*x + y*y < 1) 1 else 0
    }.reduce(_ + _)
    println("Pi is roughly " + 4.0 * count / n)
    spark.stop()
    }
}
'''
# Spark Command - Scala Program
QuboleOperator(
    task_id='spark_cmd',
    command_type="sparkcmd",
    program=prog,
    language='scala',
    arguments='--class SparkPi',
    dag=dag)

# Spark Command - Run a Notebook
QuboleOperator(
     task_id='spark_cmd',
     command_type="sparkcmd",
     note_id="36995",
     qubole_conn_id='qubole_prod',
     arguments='{"name":"hello world"}',
     dag=dag)


# Db Import Command in Mode 2 - Advanced Mode
QuboleOperator(
    task_id = "db_import_mode_2" ,
    command_type = "dbimportcmd" ,
    mode = "2" ,
    extract_query = "select id, dt from mydb.mydata where $CONDITIONS and id < 10",
    boundary_query="select min(id), max(id) from mydata",
    split_column="id",
    dbtap_id = "9531" ,
    hive_table = "myhivedb.mydata" ,
    parallelism = "1",
    db_table = "",  # Not needed functionally in mode 2, but due to a bug it cannot be omitted, so set it to an empty string
    use_customer_cluster="true",
    customer_cluster_label="default",
    dag=dag)
How to Pass the Results of One QuboleOperator As A Parameter to Another Using get_results And xcom_pull

The following example shows how to pass the results of one QuboleOperator as a parameter to another using get_results and xcom_pull. In the example, a QuboleOperator runs a Shell command that prints a file stored on another cluster. A PythonOperator then reads the result of the Shell command and pushes it to XCom. Next, a Hive command uses xcom_pull to fetch that result and run its query.

from airflow import DAG
from datetime import datetime, timedelta
from airflow.operators.python_operator import PythonOperator
from airflow.operators.bash_operator import BashOperator
from airflow.contrib.operators.qubole_operator import QuboleOperator



tod = datetime.now()
d = timedelta(days = 2)

default_args = {
    'owner': 'qubole',
    'depends_on_past': False,
    'start_date': tod - d,
    'retries': 0,
    'schedule_interval': '@once'
}

def push_command(**kwargs):
    ti = kwargs['ti']
    qubole_operator_result = open(qubole_shell_command.get_results(ti), 'r').read()
    ti.xcom_push(key='qubole_shell_command', value=qubole_operator_result)


def print_command(**kwargs):
    ti = kwargs['ti']
    qubole_operator_result = open(qubole_hive_command.get_results(ti), 'r').read()
    print(qubole_operator_result)


with DAG(dag_id="xcom_demo_dag", default_args=default_args, catchup=False) as dag:

    qubole_shell_command = QuboleOperator(
        task_id='qubole_shell_command',
        command_type='shellcmd',
        cluster_label='default',
        script='cat /usr/lib/temp/xcom_demo',
        fetch_logs=True,
        dag=dag)

    push_command = PythonOperator(
        task_id='push_command',
        python_callable=push_command,
        provide_context=True,
        dag=dag)

    print_command = PythonOperator(
        task_id='print_command',
        python_callable=print_command,
        provide_context=True,
        dag=dag)

    qubole_hive_command = QuboleOperator(
        task_id='qubole_hive_command',
        command_type='hivecmd',
        cluster_label='default',
        query="SELECT * FROM default.salesdata WHERE shop_id = {{ task_instance.xcom_pull(key='qubole_shell_command') }}",
        dag=dag)

qubole_shell_command >> push_command >> qubole_hive_command >> print_command
Using the Node Bootstrap on Airflow Clusters

In QDS, all clusters share the same node bootstrap script by default, but for an Airflow cluster running on AWS, Qubole recommends you configure a separate node bootstrap script.

Note

A separate, Airflow-specific node bootstrap script is currently supported only on AWS.

Through the node bootstrap script, you can:

Install Packages on Airflow Cluster

Add this code snippet in the node bootstrap to install packages on the Airflow cluster.

# this activates the virtual environment on which airflow is running, so that we can install packages in it
source ${AIRFLOW_HOME}/airflow/qubole_assembly/scripts/virtualenv.sh activate

pip install <package name>
source ${AIRFLOW_HOME}/airflow/qubole_assembly/scripts/virtualenv.sh deactivate
Automatically Synchronize DAGs from a GitHub Repository

Add this code snippet in the node bootstrap editor to automatically synchronize DAGs from a GitHub repository.

# clone the repo using github access token
git clone https://{access_token}@github.com/username/airflow-dags.git $AIRFLOW_HOME/dags

# prepare command
command="*/5 * * * * cd $AIRFLOW_HOME/dags; git pull"

# register it on cron
crontab -l | { cat; echo "$command"; } | crontab -
Create a User in RabbitMQ to Access it Through Dashboard

If you are using RabbitMQ (installed on the cluster) and want to access its dashboard through QDS, create a user in RabbitMQ, because the default user (guest) cannot access the RabbitMQ dashboard from outside.

Add the following code snippet to the node bootstrap script to add a new user (new_user) in RabbitMQ.

/usr/sbin/rabbitmqctl add_user new_user new_password
/usr/sbin/rabbitmqctl set_user_tags new_user administrator;
/usr/sbin/rabbitmqctl set_permissions -p / new_user ".*" ".*" ".*"

To learn more about troubleshooting Airflow-related issues, see Troubleshooting Airflow Issues.

Qubole Scheduler

The QDS scheduler allows you to configure jobs and specify the intervals at which they run, and provides additional capabilities that make it a powerful tool for automating your workflow. Job characteristics you can specify include:

  • The job type (Hive Query, Hadoop Job, etc.) and query parameters
  • The job start and end date and time, and its time zone and frequency
  • Javascript macros to be included
  • The Hadoop Fair Scheduler queue (Hadoop 2) to be used
  • The number of concurrent jobs to allow
  • Hive and GCP GS dependencies, to ensure that the job runs only when the data it needs is available.

For more information, see the following topics:

Using the Scheduler User Interface

The Qubole Scheduler provides a way to run commands at specific intervals without manual intervention. Navigate to the Scheduler tab to see and modify scheduled jobs and create new ones. See Qubole Scheduler for more information.

Using Keyboard Shortcuts

Press Shift + / (that is, ?) anywhere on the Scheduler page to see the list of keyboard shortcuts.

You can disable/enable the keyboard shortcuts in Control Panel > My Profile. By default the shortcuts are enabled. For more information, see Managing Profile.

Understanding the Qubole Scheduler Concepts

This section describes the concepts associated with the Qubole Scheduler.

Schedule Action

One occurrence of the schedule that runs at a particular time period is called a schedule action. A schedule can have many schedule actions, which run hourly, daily, weekly, or monthly depending on how the schedule is configured. For more information, see List Schedule Actions.

Action

An action is run by the scheduler. An action can belong to any schedule in the account. For more information, see List All Actions.

Nominal Time

It is the time for which the Schedule Action was processed.

Next Materialized Time

It is the time when the next Schedule Action of the schedule is picked up. The Next Materialized Time is calculated when the scheduler next runs the schedule, that is, after the first run of the job; it is not determined at the time of schedule creation.

Note

While editing the Start Time, ensure that the Start Time is earlier than the Next Scheduled Time (Next Materialized Time) but later than the current time.

Created At

It is the time at which the Scheduler picked up the schedule.

The Created At time for any schedule action is later than the Nominal Time, because the Scheduler picks up a scheduled job only after the Nominal Time has passed. For example, if the Nominal Time is 10:00 AM, the Scheduler picks up the job at 10:01 AM. If a schedule action is skipped when Skip Missed Instances is enabled, that schedule action is never picked up because its Created At time is later than its Nominal Time; in that case, the Scheduler runs only the latest schedule action of the schedule.

Skip Missed Instances

When a new schedule is created, the scheduler runs schedule actions from the start time to the current time. For example, if a daily schedule with a start date of Jan 1 2015 is created on May 1 2015, schedule actions are run for Jan 1 2015, Jan 2 2015, and so on. If you do not want the scheduler to run the schedule actions for the months before May, select the check box to skip them in the QDS UI, or set no_catchup to true in a scheduler API call.

Skipping schedule actions is mainly useful when you suspend a schedule and resume it later; in that case, there will be more than one pending schedule action and you might want to skip the earlier ones.
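
For illustration only, setting no_catchup in an API call might look like the following sketch. This is not a verified request: the endpoint URL, header, and every payload field other than no_catchup are assumptions that should be checked against the Scheduler API reference.

import os
import requests

# Unverified sketch: the endpoint and payload field names are assumptions;
# only the no_catchup flag comes from the text above.
url = "https://api.qubole.com/api/v1.2/scheduler"        # assumed endpoint
headers = {
    "X-AUTH-TOKEN": os.environ["QDS_AUTH_TOKEN"],        # QDS account API token
    "Content-Type": "application/json",
}
payload = {
    "command_type": "HiveCommand",                       # illustrative
    "command": {"query": "SHOW TABLES"},                 # illustrative
    "start_time": "2015-01-01T00:00Z",                   # illustrative
    "frequency": 1,                                      # illustrative
    "time_unit": "days",                                 # illustrative
    "no_catchup": True,  # skip schedule actions missed before the current time
}
response = requests.post(url, headers=headers, json=payload)
response.raise_for_status()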

Scheduled Job Rerun Limit

The Qubole Scheduler limits the number of schedule reruns that can be processed concurrently at a given point in time. The default limit is 20. When the concurrent reruns of a schedule exceed the limit, you get this error message.

A maximum of 20 reruns are allowed for scheduled job in your account: #<account_number>.

When this limit is 20 (default value), if there are 20 reruns of a schedule to be processed and 2 of them are completed, then you can add 2 new reruns.

You can increase the rerun limit by creating a ticket with Qubole Support. The new value is applicable to all the jobs of a given QDS account.

Viewing a Schedule

In the Schedule tab, select the listed schedule that you want to see. Alternatively, if you know the ID/name of the schedule that you want to see, you can use Filter. See Filtering a Schedule for more information.

Note

Press Ctrl + / to see the list of available keyboard shortcuts. See Using Keyboard Shortcuts for more information.

There is a rerun limit for schedule reruns to be processed concurrently at a given point of time. Understanding the Qubole Scheduler Concepts provides more information.

After you select a schedule from the list, the schedule details are visible in the Schedule tab as shown in the following figure.

_images/Scheduler1.png

Click the permalink icon PermalinkIcon to see the schedule’s permalink.

In the schedule details, the Schedule section shows the schedule summary and the Runs section shows the instances' details.

On the top of the schedule summary, you can:

  • Click the Clone button for cloning a schedule.
  • Click the Edit button for editing a schedule.
  • Click the Stop button for killing/stopping a schedule.
  • Click the Suspend button for suspending a schedule.

Note

After you stop a schedule, you cannot resume it. However, you can suspend a schedule and resume it later.

If you click the Suspend button, a dialog to suspend with OK and Cancel buttons is displayed.

Click OK to suspend the schedule. On the schedule details page, the suspended state is shown and the Suspend button is replaced with a Resume button.

Note

The default filter shows only active schedules. When a schedule is suspended, it disappears from the list of active schedules. To see the list of suspended schedules, select the status as suspended. See Filtering a Schedule for more information.

You can resume a suspended schedule any time by clicking Resume.

A dialog to resume with OK and Cancel buttons is displayed as shown below.

_images/ResumeJob.png

Select Skip Missed Instances if you want to skip instances that were supposed to have run in the past. This setting defaults to the value chosen when the schedule was created; if it is unselected and you want to skip the missed instances, select it.

Click OK to resume a schedule.

Note

Use the tooltip Help_Tooltip to know more information on each field.

Viewing Schedule Actions

The Runs tab displays the schedule actions of the schedule as illustrated in the following figure.

_images/ViewInstances1.png

Click Rerun to run the instance again. Click Show More to see the earlier schedule actions.

Note

A _SUCCESS file is created in the output folder for each successful schedule action. You can set mapreduce.fileoutputcommitter.marksuccessfuljobs to false to disable creation of the _SUCCESS file, or to true to enable it.

Filtering a Schedule

On the Scheduler page, click the filter icon FilterIcon to limit the search result.

After you click the filter icon FilterIcon, the Filter dialog box appears. On the Filter dialog, you can set the following parameters:

  • Job ID
  • Job Name
  • Status
  • User
  • Group
  • Command Type
  • Cluster Label

After the parameters are selected, click Apply to filter the scheduled jobs.

You can select the Set filters as default check box to save your desired filter parameters. The saved filter parameters remain in effect until you change them.

_images/SchedulerFilter11.png
Creating a New Schedule

Navigate to the Scheduler page and click the Create button in the left pane to create a schedule.

Note

Press Ctrl + / to see the list of available keyboard shortcuts. See Using Keyboard Shortcuts for more information.

The schedule fields are displayed; they include the General tab, a query composer, Macros, Schedule, and Notifications.

Perform the following steps to create a schedule:

Note

There is a rerun limit for schedule reruns to be processed concurrently at a given point of time. Understanding the Qubole Scheduler Concepts provides more information.

Setting Parameters in the General Tab

The General tab is as shown in the following figure.

_images/NewJob-General.png

In General Tab:

  • Enter a name in the Schedule Name* text field. This field is optional. If it is left blank, a system-generated ID is set as the schedule name.

  • In the Tags text field, you can add up to six tags to group commands together. Tags help in identifying commands. Each tag can contain a maximum of 20 characters. This is an optional field. To add a tag, follow these steps:

    1. In the Tags field, add a tag as shown below.

      _images/SchedulerTag.png
    2. After adding a tag, press Enter and you can see the tag being added as shown below.

    _images/SchedulerTag1.png

    Similarly, you can add more tags (up to a total of six). You can add tags when creating a new schedule or add them to an existing schedule by editing it.

Adding a Query in Query Composer

To add a query, perform these steps:

  1. Select a query type from the drop-down list. If there is any sub option for query type, select it.

    _images/NewJob_Query11.png
  2. Select a cluster on which you want to run the query.

  3. Select the number of command retries from the Retry drop-down list. This option is available for almost all command types (except Db Query, Redshift Query, Refresh Table, and Workflow). However, Retry is available at the subcommand level under the Workflow command.

Caution

Configuring retry initiates a blind retry of the command. This may lead to data corruption if the command execution fails and writes partial data. For example, a retry of a failed INSERT INTO query can lead to data corruption.

  4. Select the duration from the Delay (mins.) drop-down list to specify the time interval between the retries when a job fails.
  5. Type the query in the text field.

See Creating a Spark Schedule for more information on how to create a Spark schedule and also schedule running a Spark notebook.

Adding Macros

If you have used macros in the query, click the + button in the Macros field; otherwise, proceed to the next step. After you click the + button, the macro fields are displayed as shown in the following figure.

_images/NewJob_Macros.png

Enter the variable name and value in the corresponding text fields. See Macros in Scheduler for more information. Click + to add another macro. Else, proceed to the next step.

Setting Schedule Parameters

The Schedule field contains Frequency, Time Zone, and Advanced Settings. For more information, see Understanding the Qubole Scheduler Concepts.

The following figure illustrates all parameters in Schedule.

_images/NewJob-Schedule.png

Note

Use the tooltip Help_Tooltip to know more information on each field.

In the Schedule field, set:

  • Frequency: Select the periodicity, a custom value, or a cron expression from the corresponding drop-down list. The drop-down list of frequency options is illustrated in the following figure.

    _images/Frequency.png

    Selecting Cron expression is useful for setting an exact date/time. A sample cron expression is illustrated in the following figure.

    _images/CronExpression.png

    Enter the values in all the cron expression fields.

  • The start time by selecting the year, month, date and time (HH:MM) from the corresponding drop-down lists.

  • The end time by selecting the year, month, date, and time (HH:MM) from the corresponding drop-down lists.

  • Time Zone by selecting the appropriate timezone from the drop-down list.

  • Command Timeout - You can set the command timeout in hours and minutes. Its default value is 36 hours (129600 seconds), and any value that you set must be less than 36 hours. QDS checks the timeout for a command every 60 seconds, so if the timeout is set to 80 seconds, the command is killed at the next check, that is, after 120 seconds. By setting this parameter, you can prevent a command from running for the full 36 hours.

  • Advanced Settings when expanded displays:

    • Fair Scheduler pool: Enter the fairscheduler pool name in the text field.

    • Concurrency: Select the number of concurrent schedules allowed from the Concurrency drop-down list if you do not want the default value.

    • Dependencies: Three options can be set for a schedule (Skip Missed Instances, Wait For GS Files, and Wait For Hive Partitions):

    • Skip Missed Instances: Select Skip Missed Instances if you want to skip instances that were supposed to have run in the past. By default, this option is unselected. When a new schedule is created, the scheduler runs schedule actions from the start time to the current time. For example, if a daily schedule with a start date of Jan 1 2015 is created on May 1 2015, schedule actions are run for Jan 1 2015, Jan 2 2015, and so on. If you do not want the scheduler to run the missed schedule actions for the months before May, select the check box to skip them.

      Skipping missed schedule actions is mainly useful when you suspend a schedule and resume it later; in that case, there will be more than one missed schedule action and you might want to skip the earlier ones.

      For more information, see Understanding the Qubole Scheduler Concepts.

Setting Notifications

Notification is an optional field to be selected if you want to be notified through email about instance failure. Once you select the Send notifications check box, Notification Type, Notification List, and Event are displayed.

Select the Daily digest option for Notification Type to receive daily digests if the schedule periodicity is in minutes or hours. The default notification type is Immediate.

By default, On Failure is selected. Select On Success to be notified about successful schedule actions. You can select both types of events or either one.

Select the Notification Channel from the Notification List field. Notification List displays the list of Notification Channels configured. For more information on how to create a Notification Channel, see Creating Notification Channels.

_images/NewJob-Notification.png

After setting parameters, click Save to add a new schedule after you are done with filling the required details. Click Cancel if you do not want to create a schedule.

Configuring GS Files Data Dependency

GS files’ dependency implies that a schedule runs if the data is available in GS buckets. You can create a schedule to run at a specific date and time, either once or on a repetitive basis if the data exists. You can define repeat intervals such as last 6 hours, last 6 days, last 3 weeks, and last 7 months.

To create a schedule at periodic intervals, Qubole Scheduler requires the following information:

  • Start day or time (parameter: window_start)
  • End day or time (parameter: window_end)
  • Day or time interval that denotes when and how often data is generated (parameter: interval)
  • Nominal time which is the logical start time of an instance

The following table shows how to create data in GS files for the previous day’s data with a daily interval.

Sequence ID Nominal Time Created At Dependency
1 2015-01-01 00:00:00 2015-04-22 10:00:00 gs://abc.com/data/schedule-2014-12-31-00-00-00
2 2015-01-02 00:00:00 2015-04-22 10:15:00 gs://abc.com/data/schedule-2015-01-01-00-00-00
3 2015-01-03 00:00:00 2015-04-22 10:30:00 gs://abc.com/data/schedule-2015-01-02-00-00-00

Nominal Time is the time for which the schedule action is processed, and Created At is the time at which the Scheduler picked up the schedule. For more information, see Understanding the Qubole Scheduler Concepts.

To configure GS files dependency, select the Wait For GS Files option available in Dependencies.

Note

Use the tooltip Help_Tooltip to know more information on each field or check box.

The following steps explain how to set GS File dependency:

  1. Enter the GS location in the format: gs://<bucket>/<folderinGSbucket>/<abc>-%Y-%m-%d-%H-%M-%S. For example: gs://abc.com/data/schedule-2014-12-31-00-00-00.

  2. Window Start and Window End defines the range of interval to wait for. The values are integers in units of time, hour/day/week/month/year.

    Enter the Window Start value. See Hive Datasets as Schedule Dependency for more information on the window start parameter. An instance waits for files for the specified time range, and Window Start specifies the start of this range. For example, a window start value of -1 means one hour/day/week/month/year (depending on the interval) before the nominal time, -2 means two intervals before, and so on.

    Note

    The Qubole Scheduler supports strftime format and unpadded values for specifying months. For example, January can be specified as just 1 and March as just 3.

  3. Enter the Window End value. See Hive Datasets as Schedule Dependency for more information on the window end parameter. An instance waits for files for the specified time range, and Window End specifies the end of this range. For example, if the interval covers 7 days and the window start value is -6, the window end value is 0 (see the sketch at the end of this section).

    The value 0 implies now, -1 implies 1 day ago, and -2 implies 2 days ago. Correspondingly, for hourly/daily/weekly/monthly/yearly interval (frequency), the value 0 denotes now. -1 denotes 1 hour/day/week/month/year ago. -2 denotes 2 hour/day/week/month/year ago and so on.

    Qubole Scheduler supports waiting for data. For example, waiting for 6 weeks of data implies that window_start is -5 and window_end is 0 when the frequency is weekly.

    An example is illustrated in the following figure.

    _images/WaitForgsFiles.png
  4. Configure Timeout in minutes to change the default/previously-set time.

Note

When the data arrival interval and the scheduler interval are different, then the scheduler interval follows its own frequency to process the data. For example, if the data arrival interval is hourly and the scheduler interval is daily, the scheduler waits for an entire day’s data.

Click +Add More to add a second file. Repeat steps 1-3 to enter the file details. Timeout is set only once as it is applicable to all files.

Click +Add More to add the number of files as per the periodicity/frequency of the schedule.
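
To make the window arithmetic concrete, the following Python sketch (not part of QDS; the function name and signature are illustrative) expands a window_start/window_end pair into the GS paths a schedule action would wait for, using the example location above and the seven-day window described in step 3:

from datetime import datetime, timedelta

def gs_dependency_paths(nominal_time, window_start, window_end, interval,
                        pattern="gs://abc.com/data/schedule-%Y-%m-%d-%H-%M-%S"):
    """Expand a dependency window into the concrete GS paths to wait for.

    window_start/window_end follow the convention described above: 0 is the
    nominal time, -1 is one interval earlier, and so on. `interval` is a
    timedelta; the default pattern is the example location used above.
    """
    return [(nominal_time + offset * interval).strftime(pattern)
            for offset in range(window_start, window_end + 1)]

# Waiting for seven days of daily data (window start -6, window end 0):
for path in gs_dependency_paths(datetime(2015, 1, 1), -6, 0, timedelta(days=1)):
    print(path)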

Configuring Hive Tables Data Dependency

To configure hive partitions dependency, select Wait For Hive Partitions option available in Dependencies. See Hive Datasets as Schedule Dependency for more information.

Note

Use the tooltip Help_Tooltip to know more information on each field or check box.

Perform these steps after selecting Wait for Hive Partitions:

  1. After you select Wait for Hive Partitions, the Schema text field is displayed. Click in the text field and a list of available schema in the account is displayed as illustrated in the following figure. Select a schema from the list.

    _images/HivePartitionSchema.png
  2. After selecting a schema, the Table text field is displayed. Select a table that has partitions. The following figure illustrates a table with Hive partitions.

    _images/HivePartitionTable.png
  3. After you select the table, the Table Data settings are displayed as shown in the following figure.

    _images/HivePartitionTableSettings.png

    In Global Settings:

    1. Set the Interval and select an incremental value from the Increment drop-down list. The default value is minutes.

    2. Set the Window Start time.

    3. Set the Window End time.

    4. Select a partition column from the Column drop-down list. The following options are displayed:

      • Set Date Time Mask for the partition: This value is matched with the nominal time format and then the corresponding value is used as a string to check for dependency.

      • Specify dependency on partition column values: This value is used as string to check for dependency.

        Depending on whether you want to set Date Time Mask or specify the dependency, perform the appropriate actions:

        • If you want to set Date Time Mask, select the Specify DateTime Mask for this Partition check box and enter the Date/Time Mask. For example, `%Y-%M` specifies year and month as the dependency value. An example is illustrated in the following figure.

          _images/WaitForHivePartition.png
        • If you want to specify the dependency value, enter values in the Partition Column field.

          Note

          Values of the macros defined in a schedule are not supported for checking dependencies. Therefore, you must not enter these values in the Partition Column field.

          An example is illustrated in the following figure.

          _images/WaitForHivePartition1.png
  4. Configure Timeout in minutes to change the default/previously-set time.

    Large data sets are typically divided into directories. Directories map to partitions in Hive. Currently, partitions in Hive are populated manually using the following command (shown here for the miniwikistats table):

    ALTER TABLE miniwikistats RECOVER PARTITIONS;
    

Note

To add multiple Hive Table dependencies, click +Add New and enter the required information as described above in this topic.

Managing Scheduler Permissions through the UI

QDS supports setting permissions for a specific schedule on the Scheduler UI page, in addition to the Object Policy REST API. For more information on the API, see Set Object Policy for a Scheduler. You can allow or deny a user/group access to the schedule.

To allow/deny permissions to a specific schedule through the UI, perform these steps:

  1. Navigate to the Scheduler page and click the gear icon in the schedule to which you want to restrict permissions.

    _images/SchedulerOperations.png

    You must be a system-admin, the owner of the schedule, or a user with the Manage permission to set permissions or even see Manage Permissions.

  2. Click Manage Permissions. The Manage Permissions dialog is displayed as shown here.

    _images/SchedulerPermissions.png
  3. Select a user/group that you want to allow or deny access to. Some permissions are set by default, and the current permissions are displayed for the selected user/group. To allow an action, select the check box below the schedule policy action; to deny it, clear or leave unselected the check box below the schedule policy action. The different scheduler permissions are:

    • Read - This permission allows/denies a user to view the specific schedule. The UI will not display a schedule for which a user does not have read permission. This implies that all other actions even if granted are ineffective on the UI.
    • Update - This permission allows/denies a user to edit the schedule configuration.
    • Delete - This permission allows/denies a user to delete the schedule.
    • Clone - This permission allows/denies a user to clone the schedule.
    • Manage - This permission allows/denies a user to manage this schedule’s permissions.

    Note

    If you allow a permission to a user who is part of a group that has restricted access, that user is allowed access, and vice versa. For more information, see Understanding the Precedence of Scheduler Permissions.

  4. Click Add New Permission to assign permissions to another user/group. Allow/deny schedule permissions as described in the step 3. Specific schedule permissions for a user and a group are illustrated in this sample.

    _images/SchedulerPermissions1.png

    If you do not want a user/group to view the schedule, deny that user/group the Read permission, as illustrated above.

  5. Click Save after assigning scheduler permissions to the user(s)/group(s).

Understanding the Precedence of Scheduler Permissions

The precedence of scheduler permissions is described below:

  • The schedule owner and system-admin have all permissions that cannot be revoked.

  • Users take precedence over groups.

  • A user who is not assigned any specific permissions inherits them from the group that the user is part of.

  • If the schedule ACL permissions are defined by the user who is the current owner, that user has full access by default, even if an access control is set to deny. QDS honors ownership over object ACLs.

  • If the schedule ACL permissions are not defined by the user who is the current owner, QDS allows that user to perform schedule operations as long as there is no explicit deny permission set for that user. However, if the Read permission is denied to the user, the user cannot see that specific schedule in the Schedule list.

Editing and Cloning a Schedule

You can edit an existing schedule’s settings. To retain the same configuration for another schedule, clone it.

Note

Press Ctrl + / to see the list of available keyboard shortcuts. See Using Keyboard Shortcuts for more information.

Editing a Schedule

Navigate to the Scheduler tab to see the schedules. From the list in the left pane, select a schedule that you want to edit. In the top of the schedule information page, click Edit to modify the schedule.

Note

There is a rerun limit for schedule reruns to be processed concurrently at a given point of time. Understanding the Qubole Scheduler Concepts provides more information.

The Edit Schedule page is displayed and all fields are optional.

Perform the following steps to edit a schedule:

Change the Schedule Name if you want to name it differently.

Creating a New Schedule describes how to edit all schedule options.

Click Save after editing the schedule settings. Click Cancel if you do not want to save the changes.

Cloning a Schedule

Note

There is a rerun limit for schedule reruns to be processed concurrently at a given point of time. Understanding the Qubole Scheduler Concepts provides more information.

Navigate to the Scheduler tab. From the list in the left pane, select the schedule that you want to clone. At the top of the schedule information page, click Clone to retain the same schedule configuration in another schedule. By default, the cloned schedule gets the same name with the words Clone of added before the existing name, as illustrated in the following figure.

_images/CloneJob.png

You can give the schedule a different name. Click Save after cloning the schedule settings; a unique system-generated ID is assigned to the cloned schedule after it is saved. Click Cancel if you do not want to clone the schedule.

Changing the Owner of a Schedule

You can change the owner of a schedule if you have access to manage permissions. By default, the system admin can change the owner of a schedule. Navigate to the Scheduler tab to see the schedules. From the list in the left pane, select a schedule where you want to change the ownership and click the gear icon. Select Change Owner from the drop-down list. The Change Owner window is displayed.

_images/change_scheduler_owner.png

Select the owner from the drop-down list of owners and click Save.

_images/change_owner.png

The ownership of the schedule is changed.

Hive Datasets as Schedule Dependency

This section describes how schedules can be set up to run only if the data is available in Apache Hive tables. Typically, schedules which run Hive commands depend on data in Hive tables.

CREATE EXTERNAL TABLE daily_tick_data (
    date2 string,
    open float,
    close float,
    high float,
    low float,
    volume INT,
    average FLOAT)
PARTITIONED BY (
    stock_exchange STRING,
    stock_symbol STRING,
    year STRING,
    date1 STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '<scheme>/stock_ticker';

date1 is a date with format YYYY-MM-DD

<scheme> is the Cloud-dependent URI and path: for example gs://gs-bucket/default-datasets

The dataset is available from 2012-07-01. For this example, let us assume that the dataset is updated every day at 1AM UTC and jobs are scheduled every day at 2AM UTC.

The following query has to be executed every day:

SELECT stock_symbol, max(high), min(low), sum(volume) FROM daily_tick_data WHERE date1='$yesterday$' GROUP BY stock_symbol

The following sub-topics provide more information:

Partition Column Values

Qubole has to be informed about the new partitions that are added every day.

In the example, the following partitions are added on 2013-01-02:

stock_exchange = nasdaq  stock_symbol = ibm   year = 2013  date1 = 2013-01-01
stock_exchange = nasdaq  stock_symbol = orcl  year = 2013  date1 = 2013-01-01
stock_exchange = nyse    stock_symbol = ibm   year = 2013  date1 = 2013-01-01
stock_exchange = nyse    stock_symbol = orcl  year = 2013  date1 = 2013-01-01

For example, the partition columns can have the following values:

stock_exchange [nasdaq, nyse]
stock_symbol [ibm, orcl]
year %Y
date1 %Y-%m-%d

The above information has to be entered while submitting a job either through the UI or API.

The format of the partition columns, year and date1, does not change from one job to another. These are stored in the Hive metastore and do not need to be specified every time.

The format for date partition columns can be entered through the QDS UI or the API. For more information on Store Table Properties, see Store Table Properties.

See Configuring Hive Tables Data Dependency for more information on setting Hive table data dependency using the QDS UI.

Dataset Interval

In this example, the job runs every day and the dataset is generated every day. It is possible that the job runs at a frequency different from the interval at which the dataset is generated. For example, the following query is run once in seven days while the dataset is generated once a day.

SELECT stock_symbol, max(high), min(low), sum(volume) FROM daily_tick_data WHERE date1>'$sevendaysago$' AND date1 < '$today$' GROUP BY stock_symbol

Qubole needs additional information to schedule this job, as follows:

interval How often the data is generated.
window_start, window_end Defines the range of intervals to wait for. Each is an integer that is a multiple of the interval.

For the purposes of this example, the values for interval, window_start, and window_end are:

interval 1 day
window_start -6 (inclusive of seven days ago)
window_end 0 (inclusive of today)

As with the date formats of the partition columns, the interval at which the dataset is generated does not change often. interval can also be stored in the Hive metastore and need not be specified every time.

For more information, see Configuring GS Files Data Dependency.

Initial Instance

The initial instance specifies the first instance of the data that is available. This is useful when a new dataset is introduced: some jobs at the beginning may not have all data instances available and should not be generated.

Let us understand the dependency of data in GCP GS files and Hive partitions required by the Qubole Scheduler for scheduling jobs. Dependencies are the prerequisites that must be met before a job can run.

Understanding GCP GS Files Dependency

GS files dependency implies that a job runs if the data has arrived in GS buckets. You can schedule a job to run at a specific date and time, either once or on a repetitive basis if the data exists. You can define repeat intervals such as last 6 hours, last 6 days, last 3 weeks, and last 7 months. For more information, see Configuring GS Files Data Dependency.

To schedule jobs at periodic intervals, Qubole Scheduler requires the following information:

  • Start day or time (parameter: window_start)
  • End day or time (parameter: window_end)
  • Day or time interval that denotes when and how often data is generated (parameter: interval)
  • Nominal time which is the logical start time of an instance

The dependency must be defined as: gs://<bucket>/<folderinGSbucket>/<abc>-%Y-%m-%d-%H-%M-%S, for example: gs://abc.com/data/schedule-2014-12-31-00-00-00.

See Time class for more information on date and time placeholders.

The following table shows how to create data in GS files for the previous day’s data with a daily interval.

Sequence ID Nominal Time Created At Dependency
1 2015-01-01 00:00:00 2015-04-22 10:00:00 gs://abc.com/data/schedule-2014-12-31-00…
2 2015-01-02 00:00:00 2015-04-22 10:15:00 gs://abc.com/data/schedule-2015-01-01-00…
3 2015-01-03 00:00:00 2015-04-22 10:30:00 gs://abc.com/data/schedule-2015-01-02-00…

The window_start and window_end parameters are relative to Nominal Time.

Nominal Time is the time for which the Schedule Action was processed and Created At is the time at which the Scheduler picked up the schedule. For more information, see Understanding the Qubole Scheduler Concepts.

Interpreting window_start Parameter Values

The value 0 implies now, -1 implies 1 day ago, and -2 implies 2 days ago.

Similarly, for an hourly/daily/weekly/monthly/yearly interval (frequency), the value 0 denotes now. -1 denotes 1 hour/day/week/month/year ago. -2 denotes 2 hour/day/week/month/year ago and so on.

Interpreting window_end Parameter Values

The Qubole Scheduler supports waiting for data. For example, waiting for 6 weeks of data implies that window_start is -5 and window_end is 0.

Note

When the data arrival interval and the scheduler interval are different, then the scheduler interval follows its own frequency to process the data. For example, if the data arrival interval is hourly and the scheduler interval is daily, the scheduler waits for an entire day’s data.

Data and the scheduler can be in two different timezones. For example:

  {
  window_start => -48
  window_end => -24
  frequency => hourly
  time_zone => America/Los_Angeles
  }

  scheduler_frequency => daily
  time_zone => America/New_York
Understanding Hive Partition Dependency

The Qubole Scheduler allows data units to have Hive partitions. Data in Hive tables can be categorized by Hive partitions such as country or date. The Hive query example on this page contains Hive partitions. The scheduler recognizes the Hive partitions from the corresponding table properties in the Hive metastore. See Partitions for more information.

Timezone can be specified as an optional parameter in a Hive query, with daylight savings on or off.

Hive tables can be partitioned by date and country. Dependency is expressed as %Y/%M/%d/["US", "CAN", "IRE"].

Macros in Scheduler

In the Qubole Scheduler, commands need access to the context of the instance. The scheduler provides access to the context through macros. For more information, see Macros.

New macros can be defined using the Javascript language. The daily_tick_data table is used in the example given below. The example query is:

SELECT stock_symbol, max(high), min(low), sum(volume)
FROM daily_tick_data
WHERE date1 > '$sevendaysago$' AND date1 <= '$yesterday$'
GROUP BY stock_symbol

Macros can be accessed or defined using the Javascript language. Only assignment statements are valid. Loops, function definitions, and all other language constructs are not supported. Assignment statements can use all operators and functions defined for the objects used in the statements. Defined macros can be used in subsequent statements.

Javascript Language and Modules

The following Javascript libraries are available.

Library Description Link to Documentation
Moment.js Provides many date/time related functions. Ensure that the moment.js timezone functionality matches with the timezone used by the scheduler. Qubole uses Moment JS version 2.6.0. Moment.js
Moment-tokens Provides strftime formats Moment-tokens

The macros shown in the query are defined as follows:

Note

Ensure that the moment.js timezone functionality matches with the timezone used by the scheduler.

sevendaysago = Qubole_nominal_time.clone().subtract('days', 7).strftime('%Y-%m-%d')
yesterday = Qubole_nominal_time.clone().subtract('days', 1).strftime('%Y-%m-%d')

The following examples show how to add the timezone function in Moment.js.

India Standard Time Zone

sevendaysago = Qubole_nominal_time.clone().subtract('days', 7).tz('Asia/Kolkata').format('YYYY-MM-DD')
yesterday = Qubole_nominal_time.clone().subtract('days', 1).tz('Asia/Kolkata').format('YYYY-MM-DD')

US Pacific Standard Time Zone

sevendaysago = Qubole_nominal_time.clone().subtract('days', 7).tz('America/Los_Angeles').format('YYYY-MM-DD')
yesterday = Qubole_nominal_time.clone().subtract('days', 1).tz('America/Los_Angeles').format('YYYY-MM-DD')

See Moment JS Timezones for information on other time zones.

System Variables

The system variables are described in the following table.

Qubole_nominal_time A moment object representing the time when this instance is supposed to run.
Qubole_nominal_time_iso The Qubole_nominal_time value in ISO 8601 format.

For more information, see Understanding the Qubole Scheduler Concepts.

See Creating a New Schedule for more information on setting Macros using the Qubole user interface.

Clusters

This section explains how to configure and manage QDS clusters. It covers the following topics:

Introduction to Qubole Clusters

Qubole Data Service (QDS) provides a unified platform for managing different types of compute clusters.

QDS can run queries and programs written with tools such as SQL, MapReduce, Scala, and Python. These run on distributed execution frameworks such as Hadoop and Spark, on multi-node clusters comprising one coordinator node and one or more worker nodes.

Cluster Basics

Each QDS account has pre-configured clusters of different Types (Hadoop, Spark, etc.). You can configure additional clusters. Each cluster can have one or more unique Cluster Labels.

A new account is pre-configured with one cluster of each of the following types:

  1. Spark (labelled as spark)
  2. Hadoop 2 (labelled as hadoop2)
  3. Presto (labelled as presto; currently AWS and Azure only)

Navigate to Control Panel > Clusters in the QDS UI to see the list of clusters.

Note

The clusters are configured but are not active. A red status icon indicates that a cluster is down.

You can configure several clusters of a single cluster type as needed. (Trial accounts are limited to four clusters.)

Cluster Life Cycle Management

See Understanding the QDS Cluster Lifecycle.

Cluster Labels and Command Routing

You must assign at least one unique label to each QDS cluster; you can assign more than one label. Each new QDS account has a default Hadoop cluster with the label default.

Qubole commands are routed to clusters using these rules:

  • If a command includes a cluster label, the command is routed to the cluster with the corresponding label.
  • If no cluster label is included, the command is routed to the first matching cluster; for example:
    • Hive and Hadoop commands are routed to the first matching Hadoop cluster.
    • Spark commands are routed to the first matching Spark cluster.
    • Presto commands are routed to the first matching Presto cluster.
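
The following snippet illustrates these rules using the QuboleOperator parameters shown earlier in this guide; the task IDs are illustrative, and a dag object is assumed to be defined as in the earlier examples.

from airflow.contrib.operators.qubole_operator import QuboleOperator

# Explicit label: this Hive command is routed to the cluster labeled 'hadoop2'.
labeled_cmd = QuboleOperator(
    task_id='hive_on_labeled_cluster',   # illustrative task ID
    command_type='hivecmd',
    query='show tables',
    cluster_label='hadoop2',
    dag=dag)                             # assumes `dag` is defined as in the earlier examples

# No label: QDS routes this Hive command to the first matching Hadoop cluster.
unlabeled_cmd = QuboleOperator(
    task_id='hive_on_default_routing',   # illustrative task ID
    command_type='hivecmd',
    query='show tables',
    dag=dag)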
Understanding Cluster Operations

This section explains cluster operations such as start, terminate, add, clone, and delete a cluster.

How to Start or Terminate a Cluster

The current state of the cluster is indicated by the icon on the extreme left of each row in the cluster table. It is a green circle for a running cluster, a red circle for a terminated cluster and an arrow pointing up or down for clusters that are starting up or terminating, respectively.

  • Click the Start button to start a cluster or restart a stopped cluster.

    A dialog box prompts for confirmation. Click OK to start a cluster. Click Cancel if you do not want to start a cluster.

    Caution

    You must not manually launch instances in the security groups created by Qubole when a cluster is active. Launching instances in the Qubole security groups would result in the incorrect display of the number of nodes in the Clusters UI page.

    Ensure that you provide a persistent security group when you configure outbound communication from cluster nodes to pass through an Internet proxy server. You can configure a persistent security group in the Advanced Configuration tab of that cluster’s UI.

  • Click the Stop button to terminate a running cluster.

    A dialog box prompts for confirmation. Click OK to terminate a cluster. Click Cancel if you do not want to terminate a cluster.

Click the refresh icon RefreshIcon that is on the top-right corner of the Clusters page to refresh the clusters status.

How to Add a Cluster

To add a cluster, click the New button on the Clusters page.

The Create New Cluster page is displayed. Enter the following details to create a new cluster:

  • Cluster Labels: A list of comma-separated labels to uniquely identify the cluster. This is a mandatory field.
  • For more information on the rest of the parameters, see Configuring Clusters and Managing Clusters.

Click Save to create the new cluster.

How to Clone a Cluster

Cloning may be preferable to creating a new cluster in many cases since most of the fields are copied from an existing cluster.

To clone a cluster, click the ellipsis icon listed against the cluster.

Select Clone from the list of options as shown in the following figure.

_images/cloneNdefaultCluster1.png

The Clone a Cluster page is displayed. Enter a new label for the cluster. Make the required modifications and click Save to clone the cluster. The label is the only mandatory field to be changed when you clone a cluster.

Clicking Clone takes you to the Edit Clusters page.

How to Modify a Cluster

To edit a cluster, click the Edit button available in the Action column.

The Edit Cluster page is displayed. The current configuration of the cluster is displayed on this page. You can make the desired modifications to it. See Managing Clusters for more information. Click Save to save the modifications.

How to Push Configuration Changes to a Cluster

Most cluster changes take effect only when a cluster is restarted, but some can be pushed to a running cluster. Changes to the following cluster attributes can be pushed to a running cluster:

  • The maximum size of the cluster
  • The minimum size of the cluster
  • The Fair Scheduler configuration
  • The default Fair Scheduler queue
  • The Hadoop configuration variables

To push configuration changes to a running cluster, click the ellipsis icon next to the cluster.

Select Push from the list of options:

_images/cloneNdefaultCluster1.png

The resulting Edit Cluster page shows all settings of the cluster; pushable fields are marked with a P. QDS also allows you to modify editable fields that are not pushable, but changes to those fields (without a P) take effect only after a cluster restart.

Note

In the case of a Presto cluster, the P icon is marked for the Presto Overrides, but it applies only to the few autoscaling properties listed in Autoscaling properties. If you try to push configuration properties that you had removed, the values of such properties are not refreshed in the running cluster; they continue to be the values used before.

After making changes, you can click one of the two options:

  • Update and Push - To push the changes of all pushable fields to the running cluster. When you click Update and Push, a dialog prompts you for confirmation; click Update and Push again.
  • Update only - To make changes that take effect only after the cluster restarts.
How to Delete a Cluster

Note

Running clusters, and the cluster labeled default, cannot be deleted.

To delete a cluster, click the ellipsis icon next to the cluster.

Select Delete from the list of options as shown in the following figure.

_images/ClusterPush.png

A dialog box prompts for confirmation. Click OK to delete the cluster or Cancel to keep it.

Warning

Once a cluster has been deleted, it cannot be retrieved.

Switching Clusters

Clusters are identified by labels. When you want to switch a workload from one cluster to another, you can reassign the clusters’ labels. To do this, hover the mouse on the 3 vertical dots to the left of the cluster label and drag the label to another cluster.

You can also view the commands that have run on a specific cluster by clicking on the label name.

Understanding the QDS Cluster Lifecycle

This section covers these topics:

Cluster Bringup

You can start a cluster in the following ways:

  • Start a cluster automatically by running a job or query. For example:
    • To run a Hadoop MapReduce job, QDS starts a Hadoop cluster
    • To run a Spark application, QDS starts a Spark cluster
  • You can start a cluster manually by clicking the Start button on the Cluster page (see Understanding Cluster Operations).

Note

Many Hive commands (metadata operations such as show partitions) do not need a cluster. QDS detects such query operations automatically.

On the Clusters page, against a running cluster, click Resources to see the available list as shown in this figure.

_images/ClusterResources2.png

Click Cluster Start Logs to monitor the logs related to a cluster bringup.

You can also use the REST API call to check the cluster state as described in get-cluster-state.
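
As a minimal sketch of such a call with Python requests (the base URL and API version are assumptions; see get-cluster-state for the exact endpoint for your environment):

import os
import requests

base_url = "https://api.qubole.com"                    # assumed QDS API endpoint
cluster = "default"                                    # cluster label or ID
headers = {"X-AUTH-TOKEN": os.environ["QDS_AUTH_TOKEN"]}

# Assumed path; confirm against the get-cluster-state API reference.
resp = requests.get(f"{base_url}/api/v1.3/clusters/{cluster}/state", headers=headers)
resp.raise_for_status()
print(resp.json())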

Cluster Autoscaling

QDS supports autoscaling of cluster nodes; see Autoscaling in Qubole Clusters.

Cluster Termination

Clusters can be manually or automatically terminated as explained below:

  • You can terminate a cluster manually from the Cluster tab UI (see Understanding Cluster Operations) or by running a cluster termination API call. Check the command status and job status in the Usage page to ensure that no command is running before terminating a cluster. For more information, see Command Status and Job Instance Status.
  • QDS keeps a cluster running as long as there are active sessions using the cluster. See Downscaling for more information. QDS auto-terminates clusters under the following conditions:
    • No job, application, or query is running on the cluster. Once this is true, Qubole waits for a grace period of 5 minutes before considering the cluster a candidate for termination.
    • No QDS command is running on a cluster. You can see the command status and job status in the Usage page. For more information, see Command Status and Job Instance Status.
    • No active session is attached to the cluster. An active session is a user session that has recently run queries against the cluster. An active session stays alive for two hours of inactivity and can run for any amount of time as long as commands are running. Use the Sessions tab in Control Panel to create new sessions, terminate sessions, or extend sessions. For more information, see Managing Sessions.
    • For a Spark cluster, spark.qubole.idle.timeout is a Spark interpreter property set for the Spark applications’ timeout/termination. The default value of spark.qubole.idle.timeout is 60 minutes. A Spark cluster cannot terminate until the Spark applications terminate.
Using the Cluster User Interface

Navigate to the Clusters page in the QDS UI to see the list of active and inactive clusters.

Note

Only an administrator can see all the UI options described in this page. A system user can see most of the UI options except a few that are only accessible to the administrator. The options are managed by the roles and groups configuration.

For more information on cluster configuration through the QDS UI, see Managing Clusters.

The icon next to each cluster ID indicates what type of cluster it is. Here is how the cluster types are indicated by the icon:

  • A Hadoop 2 (Hive) cluster has an H2 within the round icon. You can enable HiveServer2 in a Hadoop 2 cluster as described under Configuring a HiveServer2 Cluster.
  • A Spark cluster has an S within the round icon.
  • An Airflow cluster has an A within the round icon.

Active and inactive clusters are listed as two separate categories on the Clusters page. The active clusters have a green icon and inactive clusters have a pale-red icon.

There is another category, Transitioning Clusters, which shows clusters that have just started and are still in a pending state (not yet fully in the UP state), as well as clusters that have just stopped and are still in the process of being terminated.

Next to each cluster, there are four buttons:

  • Resources: Contains the list of resources such as cluster start logs, Spark Job Server (for a Spark cluster), and so on, for a running cluster.
  • Start: Click this button to start the cluster. For a running cluster, this button is replaced by a Stop button.
  • Edit: Click this button to edit a cluster’s configuration.
  • (An ellipsis): This button has a sub-list of options such as editing the cluster’s node bootstrap script, cloning, deleting, and setting the cluster as the default.

Click an active cluster to see the resources and public and private IP addresses of its coordinator and worker nodes.

The Node Bootstrap Logs are also available on the Clusters page of the QDS UI, as part of the Nodes table for a running cluster. On the Clusters page, below a running cluster, the number of nodes in the cluster is displayed next to Nodes. Click the number to see the Nodes table.

Searching and Filtering Clusters

You can use the search box to enter a cluster ID or label to find the closest matching cluster (useful if the list of clusters is too long to be displayed in full).

Click the filter icon on the clusters UI page to filter the cluster by its status or type.

In the filter, select the cluster type or status as required and click Apply.

Understanding the UI of an Active Cluster

Click an active cluster to see cluster-related information such as:

  • Cluster ID and the cluster type
  • The cluster up time
  • The cluster start time
  • Number of nodes
  • Coordinator DNS
  • All its available resources
  • Node details such as instance type, role, public and private DNS, Spot Instance, and the uptime of the node.

When you hover the mouse over the coordinator DNS of the cluster, you see a copy button. Click it to copy the cluster’s coordinator DNS. You can also view the commands that have run on a specific cluster by clicking its label name in the list of active/inactive clusters.

Note

You can reassign a cluster’s label to another cluster by dragging it and dropping it from one cluster to the other; do this on the Clusters page of the QDS UI. Make sure that the clusters are not running, so as not to interfere with active queries.

Viewing Deleted Clusters

Click View Deleted Clusters, at the bottom of the Clusters page, to see the list of deleted clusters.

Understanding a Node Bootstrap Script

Bootstrap scripts allow installation, management, and configuration of tools useful for cluster monitoring and data loading. A node bootstrap script runs on all cluster nodes, including autoscaling nodes, when they come up.

Node bootstrap scripts must be placed in the default location, for example, something similar to:

gs://test-vs/scripts/hadoop/node_bootstrap.sh
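
For instance, assuming you have the gsutil tool configured for your GCP project, you could upload the script to that location with a command like the following (the bucket and path are just the example shown above):

# Upload the node bootstrap script to the default location in Cloud storage
gsutil cp node_bootstrap.sh gs://test-vs/scripts/hadoop/node_bootstrap.sh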

The logs written by the node bootstrap script are saved in node_bootstrap.log in /media/ephemeral0/logs/others.
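
For example, a quick way to inspect this log on a running node (the path is the default log location mentioned above):

# View the last lines of the node bootstrap log on a cluster node
tail -n 100 /media/ephemeral0/logs/others/node_bootstrap.log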

The Node Bootstrap Logs are also available in the cluster UI as part of the Nodes table for a running cluster. In the cluster UI, below a running cluster, the number of nodes in the cluster is displayed next to Nodes. Click the number to see the Nodes table. For more information on Resources, see Using the Cluster User Interface.

Note

Qubole recommends you install or update custom Python libraries after activating Qubole’s virtual environment and installing libraries in it. Qubole’s virtual environment is recommended as it contains many popular Python libraries and has the advantages described in Using Pre-installed Python Libraries from the Qubole VirtualEnv.

You can install or update Python libraries in Qubole’s virtual environment by adding a script to the node bootstrap file as in the following example:

# Activate Qubole's virtual environment (Python 2.7)
source /usr/lib/hustler/bin/qubole-bash-lib.sh
qubole-use-python2.7
# Install the required library into the virtual environment
pip install <library name>

The node bootstrap script is invoked as the root user. It does not have a terminal (TTY or text-only console); note that many programs do not run without a TTY. In Hadoop clusters, the node bootstrap script is invoked on worker nodes after the HDFS daemons have been brought up, but before the MapReduce and YARN daemons have been initialized. On the coordinator node, however, the node bootstrap script is invoked after the ResourceManager is started. This means that Hadoop applications run only after the node bootstrap completes.

The node bootstrap process is executed via code resident on the node. This code is executed only on the first boot cycle, not on reboot.

The cluster launch process waits without limit for the node bootstrap script to complete. Specifically, worker daemons and task-execution daemons (for example, the NodeManager in Hadoop 2) wait for the script to finish executing.

Qubole provides a library of certified bootstrap functions for use in node bootstraps. Qubole recommends using these certified bootstrap functions to avoid compatibility issues with future versions of Qubole software.

Running Node Bootstrap Scripts on a Cluster describes how to run node bootstraps on a cluster and Run Utility Commands in a Cluster describes how to run utility commands to get the node-related information such as seeing if a node is a Worker or Coordinator, or getting the coordinator node’s IP address. You can also see How do I check if a node is a coordinator node or a worker node?.
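
As a hedged illustration of the node-role check described in the linked topics, a node bootstrap snippet along these lines can branch between coordinator and worker nodes (it assumes the nodeinfo helper exposed by qubole-bash-lib.sh, as described in the referenced FAQ):

# Source Qubole's bash helpers and branch on the node's role
source /usr/lib/hustler/bin/qubole-bash-lib.sh
is_master=$(nodeinfo is_master)
if [[ "$is_master" == "1" ]]; then
  echo "Running coordinator-only setup"
  # coordinator-specific steps go here
else
  echo "Running worker-only setup"
  # worker-specific steps go here
fi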

Running Node Bootstrap and Ad hoc Scripts on a Cluster

Qubole allows you to run node bootstrap scripts, and other scripts ad hoc as needed, on cluster nodes. The following topics describe running node bootstrap and ad hoc scripts:

Running Node Bootstrap Scripts on a Cluster

You can edit the default node bootstrap script from the cluster settings page: in the QDS UI, navigate to Clusters and click Edit against a specific cluster. Managing Clusters provides more information.

Note

Qubole recommends installing or updating custom Python libraries after activating Qubole’s virtual environment and installing libraries in it. Qubole’s virtual environment is recommended as it contains many popular Python libraries and has advantages as described in Using Pre-installed Python Libraries from the Qubole VirtualEnv.

You can install and update Python libraries in Qubole’s virtual environment by adding code to the node bootstrap script, as follows:

source /usr/lib/hustler/bin/qubole-bash-lib.sh
qubole-use-python2.7
pip install <library name>

The node bootstrap logs are written to node_bootstrap.log under /media/ephemeral0/logs/others. You can also find them from the QDS UI in the Nodes table for a running cluster: in the Clusters section of the UI, below the active/running cluster, the number of nodes on the cluster is displayed against Nodes; click the number to see the Nodes table. For more information on Resources, see Using the Cluster User Interface.

Understanding a Node Bootstrap Script provides more information.

Examples of Bootstrapping Cluster Nodes with Custom Scripts
Example: Installing R and RHadoop on a cluster
  1. Create a file named node_bootstrap.sh (or another name of your choice) with the following content:
sudo yum -y install R
# Install the R packages needed by RHadoop
echo "install.packages(c(\"rJava\", \"Rcpp\", \"RJSONIO\", \"bitops\", \"digest\",
               \"functional\", \"stringr\", \"plyr\", \"reshape2\", \"dplyr\",
               \"R.methodsS3\", \"caTools\", \"Hmisc\"), repos=\"http://cran.uk.r-project.org\")" > base.R
Rscript base.R
# Use the raw GitHub URL so that wget downloads the tarball rather than an HTML page
wget https://raw.githubusercontent.com/RevolutionAnalytics/rhdfs/master/build/rhdfs_1.0.8.tar.gz
echo "install.packages(\"rhdfs_1.0.8.tar.gz\", repos=NULL, type=\"source\")" > rhdfs.R
Rscript rhdfs.R
wget https://github.com/RevolutionAnalytics/rmr2/releases/download/3.3.1/rmr2_3.3.1.tar.gz
echo "install.packages(\"rmr2_3.3.1.tar.gz\", repos=NULL, type=\"source\")" > rmr.R
Rscript rmr.R
# Download the Hadoop streaming jar used by rmr2
cd /usr/lib/hadoop
wget http://www.java2s.com/Code/JarDownload/hadoop-streaming/hadoop-streaming-1.1.2.jar.zip
unzip hadoop-streaming-1.1.2.jar.zip
  2. Edit the cluster in the QDS UI and enter the name of the bootstrap file in the Node Bootstrap File field, making sure the file itself is placed in the corresponding location in Cloud storage.

The above example installs R, RHadoop, and RHDFS on the cluster nodes. You can now run R commands as well as RHadoop commands. A sample R script that uses RHadoop is shown below.

Sys.setenv("HADOOP_STREAMING"="/usr/lib/hadoop/hadoop-streaming-1.1.2.jar")
library(rmr2)
small.ints = to.dfs(1:1000)
mapreduce(
  input = small.ints,
  map = function(k, v) cbind(v, v^2))
Running Ad hoc Scripts on a Cluster

You may want to execute scripts on a cluster in an ad hoc manner. You can use a REST API to execute a script located in Cloud storage. See Run Adhoc Scripts on a Cluster for information about the API.

The Run-Adhoc Script functionality uses pssh (parallel SSH) to spawn ad hoc scripts on the cluster nodes. It has been tested under the following conditions:

  • It works on clusters that are set up using a proxy tunnel server.
  • Even if the script execution time is longer than the pssh timeout, the script still executes on the node.
Limitations of Running Ad hoc Scripts

If a script is running and you try to execute the same script on the same cluster, the second instance will not run. To work around this, you can tweak the path of the script and then run it as a separate instance of the API, as shown in the sketch below.
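
For example, on GCP you might copy the script to a new path in Cloud storage and then submit that new path in a separate API call (a minimal sketch; the bucket and script names are placeholders):

# Copy the script to a different path so it can run as a separate instance
gsutil cp gs://<bucket>/scripts/my_script.sh gs://<bucket>/scripts/my_script_run2.sh
# Then invoke the Run Adhoc Scripts API again, passing the new gs:// path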

Health Checks for Clusters

This section explains the various health checks configured for the clusters.

Cluster HDFS Disk Utilization

This alert checks the free space allotted to HDFS and sends an alert if the free space is lower than a configurable limit.

Node Disk Utilization

This alert checks the free space allotted to HDFS on each node of the cluster and sends an alert if the free space is lower than a configurable limit.

Simple Hadoop Job Probe

This alert runs a simple end-to-end Hadoop job in the cluster to check the overall health of the cluster.

For more information, see Cluster Administration.

Data Engineering

Data Science

This section explains how to use Notebooks and Dashboards.

Notebooks

Notebooks are becoming increasingly popular among data scientists, who often use them for quick exploration tasks. Once set up, a notebook provides a convenient way to save, share, and re-run a set of queries on a data source – for example, to track changes in the underlying data over time, or to provide different views using different parameters.

Qubole provides notebook user interfaces based on Zeppelin and Jupyter for Zeppelin notebooks and Jupyter notebooks, respectively.

Zeppelin Notebooks

QDS provides a notebook interface based on Apache Zeppelin.

Important

QDS launches a new version of the Notebooks page with usability enhancements. This version of the Notebooks page is not enabled for all users by default. Create a ticket with Qubole Support to enable this feature on the QDS account.

Note

Unless otherwise specified, the term notebook refers to Zeppelin notebooks in the documentation.

The following topics help you understand how to use the notebook interface to create and manage Zeppelin notebooks:

Notebooks V2

QDS launches a new version of the Notebooks page with usability enhancements.

Note

This version of the Notebooks page is not enabled for all users by default. You can enable this feature from the Control Panel >> Account Features page.

For more information about enabling features, see Managing Account Features.

The new version of the Notebooks page has the following enhancements:

  • Numbered paragraphs.
  • A blue highlight on the left of a paragraph indicates that the paragraph is active or selected.
  • The Run button, option icons, and the settings option appear as floating options in a paragraph when you hover the mouse over the paragraph.
  • Line numbers are displayed in paragraphs by default.
  • Titles are bold. You can add a title to a paragraph in the Title box with one click.
  • A Saving status appears while you are working in a paragraph, for better accessibility.
  • Code and the corresponding output are distinguished in a paragraph.
Accessing the new Notebooks page

Perform the following steps to launch the new version of the Notebooks page:

  1. Navigate to the Notebooks page.

  2. Open any existing notebook or create a notebook.

  3. Click on the Gear icon on the top right corner, and select Switch to New UI from the settings menu as shown below.

    _images/switch-new-ui.png
  4. Click Confirm on the pop-up box.

The following GIF shows the new Notebooks UI with the changes.

_images/new-notebooks-ui.gif

The new Notebooks UI is displayed as shown in the following figure.

_images/new-notebooks-ui.png

The following image shows numbered paragraphs, line numbers, the Run button, option icons and the settings option, bold title, and the active paragraph with the blue highlight on the left.

_images/changes-notebooks-ui1.png

The following image shows the code and the corresponding output in a paragraph.

_images/changes-notebooks-ui2.png

The following image shows the Saving status in a paragraph.

_images/changes-notebooks-ui3.png
Managing Notebooks

You can create, edit, or delete a notebook. The following topics explain how to manage notebooks:

Note

A pin is available at the bottom-left of the Notebooks UI. You can use it to show or hide the left sidebar that contains the notebooks list.

Viewing Notebook Information

Click a notebook in the left panel to view its details. The notebook’s ID, type, and associated cluster are displayed. You can resize the left panel/sidebar.

The following figure shows a notebook’s details.

_images/NotebookDetails.png

A notebook that is marked green in the left panel indicates that its assigned cluster is running. Click it in the left panel and the notebook is displayed in the right panel, as shown in the following figure.

_images/EditableNotebook1.png

The notebook shows a green circle against the Connected status and the Interpreters. It also shows the associated cluster’s status (running) and ID, along with the notebook’s ID, in the top-left corner of the notebook.

The following figure provides the different icon options available in a notebook.

_images/NotebookIcons.png

A notebook that is marked red indicates that its assigned cluster is in a down state. Such notebooks are read-only and cannot be used to run a paragraph, but you can still edit the name of a read-only notebook. You can also start the assigned cluster from within the notebook. The following figure shows an example of a read-only notebook whose assigned cluster is not running.

_images/ReadOnlyNotebook.png

Click Start Now to start the assigned cluster. After the assigned cluster is in a running state, use this notebook to run paragraphs.

A notebook that is marked grey indicates that it does not have any assigned cluster. Click it to see more details. To assign a cluster to a notebook, just click the cluster drop-down that is available in the notebook. Alternatively, you can configure the settings as described in Configuring a Notebook.

An unassigned notebook example is shown in the following figure.

_images/UnassignedNotebook.png

Click Refresh List to refresh the notebooks list.

Click the Notebook button to see all notebooks in the account.

Note

If there is a firewall blocking outgoing data packets from a notebook, allow outgoing data traffic on port 443. In general, allow both port 80 and port 443 for secure HTTPS communication.

Viewing Cluster Status

You can view the cluster status by using the cluster widget on the Notebooks page, which displays real-time information about the cluster’s health.

Note

This feature is not enabled for all users by default. Create a ticket with Qubole Support to enable this feature on the QDS account.

On the Notebooks page, click on the Cluster widget. Status of the cluster is displayed as shown in the following figure.

_images/cluster-status-notebook.png

You can also switch the attached cluster by selecting another cluster from the Switch Attached Cluster drop-down list.

Viewing Spark Application Status

You can view the Spark application status by using the widget on the Notebooks page that tracks the status of the Spark application and displays real-time information about the application.

Note

This feature is not enabled for all users by default. Create a ticket with Qubole Support to enable this feature on the QDS account.

On the Notebooks page, click on the Spark Application widget. Status of the Spark application is displayed as shown in the following figure.

_images/spark-status-notebook.png
Using Folders in Notebooks

Qubole supports folders in notebooks as illustrated in the following figure.

_images/NotebookHome.png

Currently, folders are available only for notebooks; you can create folders and organize your notebooks in them. You can also move a notebook by dragging it from one folder and dropping it into another. The side panel that shows the notebook folders is automatically hidden once you are in an active notebook.

All folders are created inside the default location (storage). The folders that are available in the Notebooks UI by default are:

  • Recent - It contains the notebooks that have been created/used recently by the current user with a folder hierarchy as shown below.

    _images/RecentFolder.png
  • My Home - It contains the notebooks of the current user. It is the home directory/location of a Qubole account’s user.

  • Common - It is a special folder where you can create projects, collaborate with your team members, and provide access to different users of a project. A system admin or a user with Folder write access can create projects inside the Common folder and provide access to a set of users. You need admin permission to change other users’ permissions.

    If you are a system admin or Folder admin, you can grant access to other users in an organization.

  • Users - As Qubole is a shared system and there are multiple users within an account, Qubole combines all folders of a Qubole account’s users in this folder. With this, you can easily navigate through other users’ folders. By default, Qubole creates a Notebook folder for each user in user/<your email address>/. You can create multiple folders and sub-folders to organize notebooks. Go to a user’s folder to check out a peer’s notebooks. You can view another user’s notebooks unless that notebook’s owner has explicitly denied access to other users.

  • Examples - It contains a list of sample notebooks with paragraphs in supported languages as shown here.

    _images/NotebookExamples.png

    You can copy the sample notebooks.

  • Tables - It provides access to Hive tables. For more information, see tables-tab.

By default, users of a Qubole account have read access, while system admins and users with Folder write access have full access to all notebooks. You can change permissions for users, but you cannot revoke access for system admins.

Creating a Folder

You can create a folder in the My Home, Common, or Users/other_user_email folders (with the required permissions for the Common and Users folders). In the left panel, click the downward arrow next to the New button and click Folder in the drop-down list. The dialog to create a new folder is displayed, as shown in the following example.

_images/CreateFolder.png

Add a name for the folder; the base folder location is added by default. Change it if you want a different location. You can select the location through the visual location picker provided by Qubole, as shown here.

_images/LocationPicker.png

Clicking the leftward arrow displays the top-level folders as shown here.

_images/LocationPicker1.png
Understanding Folder Operations

You can refresh, rename, move, copy, or delete a notebook folder as described in this list:

  • To refresh a folder, click the gear icon against the notebook folder that you want to refresh. Click Refresh from the drop-down list.

  • To rename a folder, click the gear icon against the notebook folder that you want to rename. Click Rename from the drop-down list. The dialog is displayed as shown in the following figure.

    _images/RenameFolder.png

    Add a new name to the folder and click Rename.

  • To move a folder, click the gear icon against the notebook folder that you want to move. Click Move from the drop-down list. The dialog is displayed as shown in the following figure.

    _images/MoveFolder.png

    Add a path to the folder in Destination or browse to the new location and click Move.

  • To delete a folder, click the gear icon against the notebook folder that you want to delete. Click Delete from the drop-down list.

    A dialog is displayed that asks for confirmation for deleting the folder. Click OK to delete it.

Managing Folder-level Permissions

You can override the folder access that is granted at the account level in the Control Panel. For more information, see Managing Roles.

If you are part of the system-admin group or any group that has manage access on the Folder resource, you can manage permissions.

Note

You can set/override folder-level permissions only on a first-level folder within each default folder, such as Common or Users.

To set the folder-level permission, perform the following steps:

  1. Click the gear/settings icon against a folder. The Manage Permissions for <Foldername> dialog is displayed, as shown in the following figure.

    _images/PermissionFolder.png
  2. Select a user/group from the drop-down list.

  3. You can set the following folder-level permissions for a user or a group:

    • Read: Set it if you want to change a user/group’s read access to this specific folder.
    • Write: Set it if you want a user/group to have write privileges in this specific folder.
    • Manage: Set it if you want a user/group to be able to manage access to this specific folder. The subfolders have the same access.
  4. You can add any number of permissions to the folder by clicking Add Permission.

  5. You can click the delete icon against a permission to delete it.

  6. Click Save to set the folder permissions for the user/group. Click Cancel to go back to the previous tab.

Managing Notebook Permissions

You can set permissions for a notebook here. By default, all users in a Qubole account have read access to a notebook, but you can change this access. You can override the notebook access that is granted at the account level in the Control Panel. If you are part of the system-admin group or any group that has full access on the Notes resource, you can manage permissions. For more information, see Managing Roles.

A system-admin and the owner can manage the permissions of a notebook by default. Perform the following steps to manage a notebook’s permissions:

  1. Click the gear box icon next to notebooks and click Manage Permission.

  2. The dialog to manage permissions for a specific notebook is displayed as shown in the following figure.

    _images/PermissionNotebook.png
  3. You can set the following notebook-level permissions for a user or a group:

    • Read: Set it if you want to change a user/group’s read access to this specific notebook.
    • Update: Set it if you want a user/group to have write privileges for this specific notebook.
    • Delete: Set it if you want a user/group to be able to delete this specific notebook.
    • Manage: Set it if you want a user/group to be able to grant and manage access to other users/groups for accessing this specific notebook.
  4. You can add any number of permissions to the notebook by clicking Add Permission.

  5. You can click the delete icon against a permission to delete it.

  6. Click Save to set the permissions for the user/group. Click Cancel to go back to the previous tab.

Locking a Notebook

You can lock and unlock notebooks. When you lock a notebook, you prevent other users in the account from editing it or performing actions on it such as running a paragraph, clearing output, and showing/hiding code.

Once the notebook is ready to be used, you can unlock it to make it available to all users in the account. This is a useful safeguard in a multi-user Qubole account.

To lock a notebook, click the lock icon, LockIcon.

When you lock a notebook, the icon turns locked as illustrated in the following figure.

_images/LockedNote.png

By locking a notebook, you get exclusive control over it. Other users in the account can view the notebook but cannot edit or delete it until you unlock it. To unlock the notebook, click the lock icon again:

_images/UnlockNote1.png
Creating a Notebook

You can create notebooks for Spark and Presto clusters. You can create a notebook of a particular cluster type (for example, Spark) only if your QDS account has at least one cluster of that type. You cannot modify the type of the notebook after it is assigned to a cluster.

  1. Click the New button that is on the top of the left panel, and click Notebook from the drop-down list. The Create New Notebook dialog box is displayed as shown in the following figure.

    _images/AddNotebook.png
  2. Add a name for the notebook.

  3. Select the type from the drop-down list that shows Spark by default.

  4. If you selected the type of notebook as Spark, then the Language field is displayed. Select the required language from the drop-down list. By default, Scala is selected.

    For Spark notebooks, this field specifies the default supported language for the Spark Interpreter. The default language persists when the notebook is detached from one cluster and attached to another cluster, and when this notebook is imported or exported.

Note

The default language option for Spark notebooks is not available for all users by default. Create a ticket with Qubole Support to enable this feature on the QDS account.

  5. Location shows the current user’s last-visited folder. Qubole provides a visual location picker to set the location. The two illustrations in Creating a Folder show the visual location picker with the default location and top-level folders.

  6. Select a cluster from the Cluster drop-down list to which you want to assign the notebook.

  7. Click Create to add the notebook.

    A unique notebook ID is assigned to the newly created notebook.

Exporting a Notebook

You can export a notebook in the JSON format. To export an existing notebook, click the settings icon to see the list as shown in the following figure.

_images/NotebookActions.png

Click Export. A Save As dialog is displayed; browse to the required folder/directory and save the notebook in JSON format. By default, the file is saved with the parent notebook’s name. You can change the name while saving the notebook.

You can also export a notebook by clicking the settings icon (gear icon) within the notebook and clicking Export from the drop-down list.

Note

You can export the notebook even when the cluster is down.

Downloading a Notebook

You can download a notebook in PDF, PNG, and HTML formats. When downloading the notebook, you can choose to show or hide the notebook code in the downloaded file.

Note

To download a notebook, you must have the create permission to the Commands resource as described in Resources, Actions, and What they Mean.

  1. In the Notebooks page, click on the Settings icon.
  2. Select Download As.
  3. In the Download Notebook As dialog box, select the required format from the drop-down list. By default, PDF is selected.
  4. If you want to see the notebook code in the downloaded file, select the Show Code check box. Click Download.

The following figure shows a sample Download Notebook As dialog box.

_images/download-notebook.png

Note

If a notebook fails to render within 3 minutes, then the download option fails.

You can also download notebooks by using the command API. See Submit a Notebook Convert Command.

Emailing a Notebook

You can email a notebook as an attachment in PDF, PNG, and HTML formats. When emailing the notebook, you can choose to show or hide the notebook code in the attachment.

Note

To email a notebook, you must have the create permission to the Commands resource as described in Resources, Actions, and What they Mean.

  1. In the Notebooks page, click on the Settings icon.
  2. Select Email as attachment.
  3. In the Email Notebook dialog box, select the required format from the drop-down list. By default, PDF is selected.
  4. Enter the email address. If you want to send the attachment to multiple recipients, add comma-separated email addresses.
  5. If you want to see the notebook code in the attachment, select the Show Code check box. Click Send.

The following figure shows a sample Email Notebook dialog box.

_images/email-notebook.png

Note

If a notebook fails to render within 3 minutes, then the email option fails.

You can also email notebooks as attachments by using the command API. See Submit a Notebook Convert Command.

Uploading a Notebook

You can upload Zeppelin (JSON) or Jupyter (ipynb) notebooks to your Qubole account.

  1. Click +New at the top of the left navigation pane. The Create New Notebook dialog box is displayed.
  2. Select the Upload option. The corresponding text fields are as shown in the following figure.
_images/ImportNotebook.png
  3. In File path, click Choose file to browse to the notebook’s location. After you select the notebook (saved as JSON or ipynb), the notebook’s name is automatically populated. You can edit the notebook’s name.

    Location shows the current user’s last-visited folder. Qubole provides a visual location picker to set the location. The two illustrations in Creating a Folder show the visual location picker with the default location and top-level folders.

  4. Select the cluster type from the Type drop-down list, and assign a cluster by selecting it from the Cluster drop-down list.

  5. Click Create to upload the notebook to the Qubole account.

Importing a Notebook

You can import Zeppelin (JSON) or Jupyter (ipynb) notebooks into your Qubole account by using a valid JSON or ipynb URL.

Note

Importing a notebook is only supported from an unauthenticated URL.

You can use any of the following URLs to import the notebooks:

  • Link to a notebook that is saved on any cloud storage.
  • Raw GitHub link to a notebook that is saved on GitHub.
  • Link to the Gist created from the GitHub notebook.

Note

Importing a notebook with a link to GCS is currently not supported due to the additional authentication requirement from Google. For more information, see Accessing public data.

Prerequisite

You must ensure that the following requirements are met before importing a notebook.

  • The object in cloud storage or GitHub is public.
Steps
  1. Click +New at the top of the left navigation pane. The Create New Notebook dialog box is displayed.
  2. Select the Import from URL option. The corresponding text fields are as shown in the following figure.
_images/ImportNotebookURL.png
  3. In File path, enter the location of the notebook as a valid JSON or ipynb URL. After you enter the URL, add a name for the notebook being imported.

    Location shows the current user’s last-visited folder. Qubole provides a visual location picker to set the location. The two illustrations in Creating a Folder show the visual location picker with the default location and top-level folders.

  4. Select the cluster type from the Type drop-down list, and assign a cluster by selecting it from the Cluster drop-down list.

  5. Click Create to import the notebook into the list of notebooks on the Qubole account.

Configuring a Notebook

To configure an existing notebook, click the settings icon to see the list as shown in the following figure.

_images/NotebookActions.png

Click Configure. The Configure Notebook dialog is displayed as shown in the following figure.

_images/ConfigureNotebook.png

You cannot change the cluster type or the location. You can change the name, or change the assigned cluster by selecting another one from the drop-down list. You can change the cluster associated with the notebook only when the notebook has no active commands or active schedules associated with it. You can also change the name by clicking it on the header, as shown here.

_images/NotebookNameChangeHeader.png
Cloning a Notebook

You can clone a notebook if you want the same settings. Select a notebook and click the settings icon to see the list as shown in the following figure.

_images/NotebookActions.png

Click Clone. The Clone Notebook dialog is displayed as shown in the following figure.

_images/CloneNotebook.png

You can change the name of the notebook and its assigned cluster.

Location shows the current user’s last-visited folder. You can click the home icon to select the home directory.

Linking a Notebook

To link an existing notebook to the GitHub profile, click the settings icon to see the list as shown in the following figure.

_images/NotebookActions.png

Click Configure GitHub Link and you can see the dialog to add the repository, branch, and path details. Click Save after adding the details. See GitHub Version Control for Zeppelin Notebooks for more information.

Tagging a Notebook

You can tag a notebook to help trace it in a list of notebooks. There is one tag label at the top-left of a notebook. Hover the mouse over the edit icon and you can see an Edit label, as shown in the following figure.

_images/NoTags1.png

Click the edit icon and type a tag. You can add more than one tag to a notebook. The two actions are illustrated in the following figure.

_images/TaggingNote1.png _images/TaggingNote2.png

After adding tags, click the tick-mark symbol icon to save a tag. A tagged notebook is illustrated in the following figure.

_images/TaggedNote1.png

In the notebook list, click a notebook to see its tag details as illustrated in the following figure.

_images/TaggedNoteDetails.png
Filtering a Specific Notebook from a List

If you want to see a specific notebook from a list of notebooks, click the filter icon in the left pane. The filter is displayed as shown in the following figure.

_images/NotebookFilter.png

Type the notebook ID, name, location, cluster type, cluster label, or a notebook tag to filter it from the list. You can also specify comma-separated notebook IDs to search. Click Apply to apply the filter. Click Cancel if you do not want to use the filter.

Click Refresh List to refresh the notebooks list. Click the filter icon again if you do not want to see the filter fields.

Deleting a Notebook

Before deleting a notebook, you must ensure that the notebook does not have any active commands or active schedules associated with it.

To delete a notebook, click the settings icon to see the list as shown in the following figure.

_images/NotebookActions.png

Click Delete. A dialog is displayed to confirm if you want to delete the notebook. Click OK to delete it or Cancel if you want to retain it.

You can also delete a notebook by clicking the settings icon (gear icon) within the notebook and clicking Delete from the drop-down list.

Using Different Types of Notebook

You can use a notebook only when its associated cluster is up. A sample notebook with its cluster up is as shown in the following figure.

_images/EditableNotebook1.png

Click the notebook in the left panel to view its details. The notebook’s ID, its type, and associated cluster are displayed.

Note

A pin is available at the bottom-left of the Notebooks UI. You can use it to show or hide the left sidebar that contains the notebooks list.

The following figure shows a notebook with its details displayed.

_images/NotebookDetails.png

Using Folders in Notebooks explains how to use folders in the notebook.

You can run paragraphs in a notebook. After running paragraphs, you can export the results that are in a table format to a CSV (comma-separated value), TSV (tab-separated value), or raw format. Use these options by clicking the gear icon available in each paragraph (at the top-right corner). To download results:

  • Click Download Results as CSV to get paragraph results in a CSV format.
  • Click Download Results as TSV to get paragraph results in a TSV format.
  • Click Download Raw Results to get paragraph results in a raw format.

Qubole provides code auto-completion in the paragraphs and the ability to stream outputs/query results. Notebooks also provide improved dependency management.

Currently, only Spark notebooks are supported. See Using a Spark Notebook.

Using a Spark Notebook

Select a Spark notebook from the list of notebooks and ensure that its assigned cluster is up to use it for running queries. See Running Spark Applications in Notebooks and Understanding Spark Notebooks and Interpreters for more information.

See Using the Angular Interpreter for more information.

When you run paragraphs in a notebook, you can watch the progress of the job or jobs generated by each paragraph. The following figure shows a sample paragraph with progress of the jobs.

_images/job-ui-para.png

For more details about the job, click the info icon adjacent to the job status in the paragraph; the Spark Application UI is displayed as shown below.

_images/spark-ui-para.png

You can see the Spark Application UI in a specific notebook paragraph even when the associated cluster is down.

Running Spark Applications in Notebooks

You can run Spark Applications from the Notebooks page of the QDS UI. When running Spark applications in notebooks, you should understand notebooks, how to associate interpreters with notebooks, how to run concurrent commands, and how to set the context.

Log in to QDS with your username and password. Navigate to the Notebooks page.

Understanding Notebooks

You can create any number of new notebooks. Notebook data is regularly synced and persisted in cloud storage (for example, S3). Notebooks are associated with a cluster, so notebooks from cluster A cannot be accessed by cluster B.

See Notebooks for more information on Qubole’s Notebooks. For more information, see:

Associating Interpreters with Notebooks

Spark interpreters are associated with notebooks by default. However, if you want to use a user-defined interpreter, you must associate it with the notebook. Interpreters are started on demand when required by notebooks.

  1. On the Notebooks page, click on the Gear icon.

    _images/ExpActionIcon1.png

    On the Settings page, the list of interpreters is displayed as shown in the following figure.

    _images/notebook-interpreter-association.png

    The first interpreter on the list is the default interpreter.

  2. Click on the required interpreter to associate with the notebook.

  3. Click Save.

Running Concurrent Spark Commands

You can run multiple Spark SQL commands in parallel.

  1. On the Notebooks page, click Interpreters.

  2. For the required interpreter, click on the corresponding edit button.

  3. Set the following properties:

    • zeppelin.spark.concurrentSQL = true
    • zeppelin.spark.sql.maxConcurrency = number of concurrent commands to be run
  4. Create multiple paragraphs with Spark SQL commands and click the Run icon on the left side to run all the paragraphs.

    All the paragraphs run concurrently.

Using SparkContext, HiveContext and SQLContext Objects

Similar to a Spark shell, a SparkContext (sc) object is available in the notebook. In addition, a HiveContext is also available in a notebook if you set the following configuration to true in the interpreter settings:

zeppelin.spark.useHiveContext

If this configuration is not set to true, a SQLContext is available in the notebook instead of a HiveContext.

Running Shell Commands in Notebooks

You can run shell commands either sequentially or concurrently from the Notebooks page of the QDS UI. Shell interpreters are associated with notebooks by default. However, if you want to use a user-defined interpreter, you must associate it with the notebook. Interpreters are started on demand when required by notebooks.

Steps
  1. Navigate to the Notebooks page.

  2. On the Notebooks page, click Interpreters.

  3. For the shell interpreter, click on the corresponding edit button.

  4. Set zeppelin.shell.concurrentCommands = true.

    You can run up to five shell commands concurrently.

  5. Create multiple paragraphs with shell commands and click the Run icon on the left side to run all the paragraphs (see the sketch after these steps).

    All the paragraphs run concurrently.

    The following figure shows sample shell commands run from the Notebooks page.

    _images/shell-cmd.png
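
For illustration, once concurrency is enabled, two shell paragraphs such as the following can run at the same time (a minimal sketch; the commands themselves are arbitrary examples):

%sh
# Paragraph 1: inspect the bootstrap log directory
ls -l /media/ephemeral0/logs/others

%sh
# Paragraph 2: print basic node information; runs concurrently with paragraph 1
hostname
uptime
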
Version Control Systems for Zeppelin Notebooks

Qubole supports GitHub, GitLab, and Bitbucket integration with notebooks. These integrations let you use a central repository as the single point of entry for all changes to a project. They provide the following advantages:

  • Ability to track changes
  • Provides a central repository for code changes
  • Effective way to collaborate

To enable GitHub, GitLab, or Bitbucket integration, create a ticket with Qubole Support.

As a prerequisite, you must configure the version control systems before using the VCS with Zeppelin notebooks.

GitHub Version Control for Zeppelin Notebooks

To configure the version control for Notebooks using GitHub, you must perform the following tasks:

  1. Configure Version Control Settings
  2. Configure a GitHub Token
  3. Link Notebooks to GitHub

After configuring the GitHub repository, you can perform the following tasks to manage the notebook versions:

Configuring a GitHub Token

You can configure a GitHub token for notebooks at the per-user setting level from the My Accounts or Notebooks UI.

  • To configure the GitHub token for notebooks for your account, see Configuring a GitHub Token.

  • To configure the GitHub token from notebooks, perform the following steps:

    1. Navigate to Notebooks and click a notebook.
    2. Click the Manage notebook versions icon that is on the top-right of the notebook. The Versions panel expands as shown in the following figure.

    _images/ConfigGitHubinNote.png
    3. Click Configure now.

    4. In the dialog box, add the generated GitHub token and click Save.

      The GitHub token is now configured for your account.

Linking Notebooks to GitHub

After configuring the GitHub token, you can link the GitHub repository from notebooks.

  1. Obtain the GitHub repository URL.

    1. Navigate to the GitHub profile and click Repositories.

    2. From the list of repositories, click the repository that you want to link.

    3. Copy the URL that is displayed within that repository.

      Alternatively, you can navigate to the GitHub profile and copy the URL from the browser’s address-bar.

      Note

      If you want to add the HTTPS *.git link as the GitHub repository URL, click Clone or Download. A drop-down text box is displayed. Copy the HTTPS URL, or click Use HTTPS (if it exists) to copy the HTTPS URL.

  2. Click the Manage notebook versions icon that is on the top-right of the notebook. The Version button expands as shown in the following figure.

    _images/LinkGitHubVersion1.png
  3. Click the Link Now option.

  4. In the Link Notebook to GitHub dialog box, perform the following actions:

    1. Add the GitHub repository URL in the Repository Web URL text field. Ensure that the GitHub profile token has read permissions for the repository to checkout a commit and write permissions for the repository to push a commit.

    2. Select a branch from the Branch drop-down list.

    3. Add the path of the object file in the Object Path text field.

      A sample is as shown in the following figure.

      _images/LinkNotetoGitHub.png
    4. Click Save.

Pushing Commits to GitHub

After you link notebooks with a GitHub profile, you can push commits to GitHub directly from a notebook associated with a running cluster.

Before you push the commits, ensure that the following requirements are met:

  • The GitHub profile token must have write permissions for the repository to push commits.
  • The associated cluster must be running.
Steps
  1. Click the Manage notebook versions icon that is on the top-right of the notebook. It expands and provides the version details.

  2. Click the Push icon to commit. A dialog opens to push commits. The following figure shows the version details and the Push to GitHub dialog.

    _images/PushtoGitHub1.png
  3. Add a commit message and click Save to push the commit to the GitHub repository. You can use the option force commit to force push over the old commit (irrespective of any conflict).

Note

Qubole does not store commits or revisions of notebooks. However, commits or revisions of notebooks can be fetched from users’ GitHub account whenever required.

Restoring a Commit from GitHub
  1. Click the Manage notebook versions icon that is on the top-right of the notebook. It expands and provides the version details.
  2. Select a version from the list and click Restore to checkout that version.
  3. Click OK to checkout that version in the confirmation dialog box.

Note

Qubole does not store commits or revisions of notebooks. However, commits or revisions of notebooks can be fetched from users’ GitHub account whenever required.

Creating a Pull Request from Notebooks
  1. Open the required notebook.

  2. Click on the Gear icon on the top right corner of the notebook, and select Configure GitHub Link. The Link Notebook to GitHub dialog is displayed.

  3. Click on the Create PR hyperlink.

  4. Proceed with the steps in GitHub to create the PR.

    For more information, see GitHub Documentation.

Resolving Conflicts While Using GitHub

There may be conflicts while pushing/checking out commits in the GitHub versions.

Note

You can use the option force commit to force push over the old commit (irrespective of any conflict).

Perform the following steps to resolve conflicts in commits:

  1. Clone the notebook.
  2. Link the cloned notebook to the same GitHub repo branch and path as the original notebook.
  3. Checkout the latest version of the cloned notebook.
  4. Manually port changes from the original notebook to the cloned notebook.
  5. You can commit the cloned notebook after porting changes.
GitLab Version Control for Zeppelin Notebooks

To configure the version control for Notebooks using GitLab, you must perform the following tasks:

  1. Configure Version Control Settings
  2. Configure a GitLab Token
  3. Link Notebooks to GitLab

After configuring the GitLab Repository, you can perform the following tasks to manage the notebook versions:

Configuring a GitLab Token

You can configure a GitLab token for notebooks at the per-user setting level from the My Accounts or Notebooks UI.

  • To configure the GitLab token for notebooks at your account level, see Configuring a GitLab Token.

  • To configure the GitLab token from notebooks, perform the following steps:

    1. Navigate to Notebooks and click a notebook.

    2. Click Manage notebook versions that is on the top-right of the notebook. The Versions panel expands as shown in the following figure.

      _images/ConfigGitHubinNote.png
    3. Click Configure now.

    4. In the dialog box add the generated GitLab token and click Save.

    The GitLab token is now configured for your account.

Linking Notebooks to GitLab

After configuring the GitLab token, you can link the GitLab repository from notebooks.

  1. Navigate to the GitLab profile and copy the URL from the browser’s address-bar.

    Note

    If you want to add the HTTPS *.git link as the GitLab repository URL, click Clone or Download. A drop-down text box is displayed. Copy the HTTPS URL, or click Use HTTPS (if it exists) to copy the HTTPS URL.

  2. Click Manage notebook versions icon that is on the top-right of the notebook. The Versions panel expands as shown in the following figure.

    _images/ConfigGitLabinNote.png
  3. Click the Link Now option.

  4. In the Link Notebook to GitLab dialog box, perform the following actions:

    1. Add the GitLab repository URL in the Repository Web URL text field. Ensure that the GitLab profile token has read permissions for the repository to checkout a commit and write permissions for the repository to push a commit.

    2. Select a branch from the Branch drop-down list.

    3. Add the path of the object file in the Object Path text field.

      A sample is as shown in the following figure.

      _images/LinkGitLabVersion1.png
    4. Click Save.

Pushing Commits to GitLab

After you link notebooks with a GitLab repository, you can push commits to GitLab directly from a notebook associated with a running cluster.

Before you push the commits, ensure that the following requirements are met:

  • The GitLab profile token must have write permissions for the repository to push commits.
  • The associated cluster must be running.
Steps
  1. Click Manage notebook versions that is on the top-right of the notebook. It expands and provides the version details.

  2. Click the Push icon to commit. A dialog opens to push commits. The following figure shows the version details and the Push to GitLab dialog.

    _images/PushtoGitLab11.png
  3. Add a commit message and click Save to push the commit to the GitLab repository. You can use the option force commit to force push over the old commit (irrespective of any conflict).

Note

Qubole does not store commits or revisions of notebooks. However, commits or revisions of notebooks can be fetched from users’ GitLab account whenever required.

Restoring a Commit from GitLab
  1. Click Manage notebook versions that is on the top-right of the notebook. It expands and provides the version details.

  2. Select a version from the list and click Restore to checkout that version, as shown in the following figure.

    _images/restoreGitLab1.png
  3. Click OK to checkout that version in the confirmation dialog box.

Note

Qubole does not store commits or revisions of notebooks. However, commits or revisions of notebooks can be fetched from users’ GitLab account whenever required.

Creating a Pull Request from Notebooks
  1. Open the required notebook.

  2. Click on the Gear icon on the top right corner of the notebook, and select Configure GitLab Link. The Link Notebook to GitLab dialog is displayed.

  3. Click on the Create PR hyperlink.

  4. Proceed with the steps in GitLab to create the PR.

    For more information, see GitLab Documentation.

Resolving Conflicts While Using GitLab

There may be conflicts while pushing/checking out commits in the GitLab versions.

Note

You can use the option force commit to force push over the old commit (irrespective of any conflict).

Perform the following steps to resolve conflicts in commits:

  1. Clone the notebook.
  2. Link the cloned notebook to the same GitLab repo branch and path as the original notebook.
  3. Checkout the latest version of the cloned notebook.
  4. Manually port changes from the original notebook to the cloned notebook.
  5. You can commit the cloned notebook after porting changes.
Bitbucket Version Control for Zeppelin Notebooks

To configure the version control for Notebooks using Bitbucket, you must perform the following tasks:

  1. Configure Version Control Settings
  2. Configure Bitbucket
  3. Link Notebooks to Bitbucket

After configuring the Bitbucket Repository, you can perform the following tasks to manage the notebook versions:

Configuring Bitbucket

You can configure Bitbucket for notebooks at the per-user setting level from the My Accounts or Notebooks UI.

  • To configure Bitbucket for notebooks at your account level, see Configuring Bitbucket.

  • To configure Bitbucket from notebooks, perform the following steps:

    1. Navigate to Notebooks and click a notebook.

    2. Click Manage notebook versions that is on the top-right of the notebook. The Versions panel expands as shown in the following figure.

      _images/ConfigGitHubinNote.png
    3. Click Configure now.

    4. In the dialog box, add the Bitbucket credentials and click Save. You can either use your Bitbucket credentials or Bitbucket App password.

    Bitbucket is now configured for your account.

Linking Notebooks to Bitbucket

After configuring Bitbucket, you can link the Bitbucket repository from notebooks.

  1. Navigate to the Bitbucket profile and copy the URL from the browser’s address-bar.

  2. Click Manage notebook versions icon that is on the top-right of the notebook. The Versions panel expands as shown in the following figure.

    _images/ConfigGitLabinNote.png
  3. Click the Link Now option.

  4. In the Link Notebook to Bitbucket dialog box, perform the following actions:

    1. Add the Bitbucket repository URL in the Repository Web URL text field. Ensure that the Bitbucket profile has read permissions for the repository to checkout a commit and write permissions for the repository to push a commit.

    2. Select a branch from the Branch drop-down list.

    3. Add the path of the object file in the Object Path text field.

      A sample is as shown in the following figure.

      _images/LinkbbVersion1.png
    4. Click Save.

Pushing Commits to Bitbucket

After you link notebooks with a Bitbucket repository, you can push commits to Bitbucket directly from a notebook associated with a running cluster.

Before you push the commits, ensure that the following requirements are met:

  • The Bitbucket profile must have write permissions for the repository to push commits.
  • The associated cluster must be running.
Steps
  1. Click Manage notebook versions that is on the top-right of the notebook. It expands and provides the version details.

  2. Click the Push icon to commit. A dialog opens to push commits. The following figure shows the version details and the Push to Bitbucket dialog.

    _images/Pushtobb11.png
  3. Add a commit message and click Save to push the commit to the Bitbucket repository. You can use the option force commit to force push over the old commit (irrespective of any conflict).

Note

Qubole does not store commits or revisions of notebooks. However, commits or revisions of notebooks can be fetched from users’ Bitbucket account whenever required.

Restoring a Commit from Bitbucket
  1. Click Manage notebook versions that is on the top-right of the notebook. It expands and provides the version details.

  2. Select a version from the list and click Restore to checkout that version, as shown in the following figure.

    _images/restorebb1.png
  3. Click OK to checkout that version in the confirmation dialog box.

Note

Qubole does not store commits or revisions of notebooks. However, commits or revisions of notebooks can be fetched from users’ Bitbucket account whenever required.

Creating a Pull Request from Notebooks
  1. Open the required notebook.

  2. Click on the Gear icon on the top right corner of the notebook, and select Configure Bitbucket Link. The Link Notebook to Bitbucket dialog is displayed.

  3. Click on the Create PR hyperlink.

  4. Proceed with the steps in Bitbucket to create the PR.

    For more information, see Bitbucket Documentation.

Resolving Conflicts While Using Bitbucket

There may be conflicts while pushing/checking out commits in the Bitbucket versions.

Note

You can use the option force commit to force push over the old commit (irrespective of any conflict).

Perform the following steps to resolve conflicts in commits:

  1. Clone the notebook.
  2. Link the cloned notebook to the same Bitbucket repo branch and path as the original notebook.
  3. Checkout the latest version of the cloned notebook.
  4. Manually port changes from the original notebook to the cloned notebook.
  5. You can commit the cloned notebook after porting changes.
Configuring Interpreters in a Notebook

An interpreter lets a notebook use a specific language or data-processing backend and is denoted by %<interpreter>. Notebooks support Angular, Presto (on AWS and Azure, for Presto clusters), Spark (pyspark, scala, sql, R, and knitr, for Spark clusters), markdown, and shell interpreters. You can create any number of interpreter setting objects.

Interpreters are associated with a notebook; each cluster type provides its own set of interpreters.

From a running notebook, click Interpreters to view the set of Interpreters.

Qubole provides the following set of default Interpreters for Spark notebooks:

  • %spark, %pyspark, %sparkr, %sql, %dep, %knitr, and %r.

    %spark is useful in a Spark notebook.

Qubole notebooks now support expanding and collapsing the different interpreters. The expand/collapse button is as shown here.

_images/ExpandCollapsibleInterpreters.png

For more information on how to use interpreters, see Associating Interpreters with Notebooks. Understanding Spark Notebooks and Interpreters and Configuring a Spark Notebook describe the interpreter settings of a Spark notebook.

Using the Anaconda Interpreter

QDS supports Anaconda as a Python interpreter in Qubole notebooks. One advantage of Anaconda is that it simplifies Python package installation compared to the pip tool.

The Anaconda Python interpreter is part of Qubole AMI. To use it in the notebook, change the zeppelin.pyspark.python interpreter’s value to /usr/lib/anaconda2/bin/python.
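For reference, the interpreter property and value described above read as follows in the interpreter settings:

Property: zeppelin.pyspark.python
Value: /usr/lib/anaconda2/bin/python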

Notebook Interpreter Operations

This section describes the operations that you can perform to manage interpreters.

  1. Go to a specific running Notebook’s page, and click on Interpreters.
  2. Select the required interpreter, and perform any of the following actions by using the corresponding buttons on the top-right corner against the interpreter:
  • Edit the interpreter settings.
  • Restart the interpreter.
  • Remove the interpreter.
  • Stop the interpreter.
  • Access the logs.
  • View the list of jobs for paragraphs run in a notebook.

Note

Some of the operations are available only for Spark interpreters.

The following table lists the operations that can be performed on the supported interpreters.

Interpreters Supported operations
Angular Edit, restart, and remove.
Presto Edit, restart, and remove.
markdown Edit, restart, and remove.
Spark (pyspark, scala, sql, R, and knitr) Edit, stop, restart, remove, accessing logs, and viewing list of jobs for paragraphs run in a notebook.
Shell Edit, restart, and remove.

The following illustration displays sample interpreters with the options.

_images/all-interpreters-options.png
Using Dynamic Input Forms in Notebooks

Qubole supports Apache Zeppelin's dynamic input forms. You can change the input values and rerun the paragraphs as many times as required. You can create forms either by using form templates or programmatically through the language backend, as described in the following sections.

Dynamic forms are supported on Spark notebooks. For more information about parameterized notebook API that uses these dynamic forms to populate parameter values, see Run a Notebook.

Using Form Templates

You can create a text-input form, select form, and a check box form using form templates. You can change the values in the input fields and rerun the paragraph as many times as required.

A simple form can be created as shown in this example.

_images/FormTemplate.png

A simple check box form can be created as shown in this example.

_images/CheckboxForm.png

For more information, see using form templates.
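For illustration, the following is a minimal sketch of Zeppelin's form template syntax in a %sql paragraph; the table and column names are placeholders. ${name=default} renders a text-input form, and ${name=default,option1|option2|...} renders a select form:

%sql
select * from employees
where department = "${department=sales,sales|marketing|engineering}"
  and age > ${min_age=25}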

Using Programming Language

You can create a text-input form or a check box form using the Scala (%spark) and Python (%pyspark) interpreters. As with form templates, you can change the input values and rerun the paragraphs as many times as required.

A simple form can be created as shown in this example.

_images/ProgramFormTemplate.png

A simple check box form can be created as shown in this example.

_images/ProgramCheckboxForm.png

For more information, see creating forms programmatically.
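As a minimal sketch (the form names, default values, and options here are illustrative), a text-input form and a select form can be created programmatically in a %pyspark paragraph as shown below; z.checkbox can be used similarly to create a check box form:

%pyspark
# Text-input form with a default value
year = z.input("year", "2020")

# Select form: each option is a (value, displayed label) pair
region = z.select("region", [("us", "United States"), ("eu", "Europe")])

print(year)
print(region)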

Example

The following example shows how to use the z.angularBind() function to create a universal variable with dynamic forms. You can run this sample code as paragraphs on the Notebooks page.

First Paragraph

%spark

z.angularBind("year",z.input("Year"))

Second Paragraph

%spark
z.show(sqlContext.sql("""
SELECT avg(depth) as avg_depth, max(depth) as max_depth, min(depth) as min_depth
FROM eq
WHERE year = """ + z.angular("year") + """
"""
))

Third Paragraph

%spark
z.show(sqlContext.sql("""
SELECT avg(lat) as avg_lat, avg(lon) as avg_lon
FROM eq
WHERE year = """ + z.angular("year") + """
"""
))

The following figure shows the sample paragraphs that use universal variables using Angular Bind.

_images/angular-bind.png
Parameterizing Notebooks

If you want to run notebook paragraphs with different values, you can parameterize the notebook and then pass the values from the Workbench or Scheduler page in the QDS UI, or via the REST API.

  1. Defining Parameters
  2. Running Parameterized Notebooks
Defining Parameters

Define the parameters that are to be passed to the notebook cells when the notebook is run through the various interfaces.

The following examples show how to define parameters.

Python, Scala, and SQL

The following example shows how to define Python read parameters.

%pyspark
param1 = z.input("param_1")
param2 = z.input("param_2")
print(param1)
print(param2)

The following example shows how to define Scala read parameters.

val param1 = z.input("param_1")
val param2 = z.input("param_2")
println(param1)
println(param2)

The following example shows how to define SQL read parameters.

%sql
select * from employees where emp_id='${param_1}'
Angular Variables

The following example shows how to set Python angular variables.

%pyspark
z.put("AngularVar1",z.input("param_1"))
z.put("AngularVar2",z.input("param_2"))

The following example shows how to set Scala angular variables.

%spark
z.put("AngularVar1", z.input("param_1"))
z.put("AngularVar2", z.input("param_2"))

// below two to pass to %sql
z.angularBind("AngularVar1",z.input("param_1"))
z.angularBind("AngularVar2",z.input("param_2"))

The following example shows how to get Python angular variables.

%pyspark
var_1 = z.get("AngularVar1")
var_2 = z.get("AngularVar2")
print(var_1)
print(var_2)

The following example shows how to get Scala angular variables.

%spark
val var_1 = z.get("AngularVar1")
val var_2 = z.get("AngularVar2")

println(var_1)
println(var_2)

The following example shows how to get SQL angular variables.

%sql
select * from employees where emp_id= '${AngularVar1}'
Running Parameterized Notebooks

See:

Examples

The following illustrations show the parameterized notebook after execution.

_images/parameterizednotebook1.png _images/parameterizednotebook2.png _images/parameterizednotebook3.png _images/parameterizednotebook4.png
Data Visualization in Zeppelin Notebooks

Zeppelin notebooks in the QDS UI support data visualization. You can use Python packages, such as matplotlib and plotly, to represent datasets and dataframes in a visual format. These Python libraries are available as part of QDS package management.

Note

Package management is enabled by default for new accounts. Existing account users must create a ticket with Qubole Support to enable package management.

Prerequisites for Data Visualization

Before using packages for data visualization, you must ensure that the required libraries are installed and the environment is set up appropriately.

Depending on whether the package management feature is enabled on your QDS account, perform the appropriate action: if package management is enabled, create an environment as described below; otherwise, install the libraries through the cluster node bootstrap script.

Creating an Environment

You should create an environment, attach it to the cluster, and add the required packages for data visualization.

  1. From the Home menu, navigate to Control Panel >> Environments.

  2. Click New.

  3. In the Create New Environment dialog box, enter the name and the description in the respective fields.

  4. Select the appropriate Python and R versions in the respective drop-down lists, and click Create.

  5. Select the newly created environment. On the top right corner, from the Cluster drop-down list, select a cluster to attach it to the environment.

    The following figure shows a sample environment that is created for package management.

    _images/sample-env.png
  6. Click See list of pre-installed packages link to view the list of pre-installed packages.

  7. If you want to add more packages or a different version of a pre-installed package, perform the following steps:

    1. Click +Add.
    2. In the Add Packages dialog box, from the Source drop-down list, select the required source package.
    3. Enter the name and version (optional) of the packages and click Add.

    The following figure shows a sample Add Packages dialog box.

    _images/sample-add-pkg.png

For more information, see Using the Default Package Management UI.

Installing the Libraries
  1. From the Home menu, navigate to Clusters. Select the required cluster to view the settings.

  2. Verify if the appropriate Python version is set as the default value for the cluster.

  3. If you want to change the default Python version, then add the following code to the Cluster node bootstrap script:

    source /usr/lib/hustler/bin/qubole-bash-lib.sh
    make-python<version>-system-default
    

The following example shows how to set Python 2.7 as the default version for the cluster.

source /usr/lib/hustler/bin/qubole-bash-lib.sh
make-python2.7-system-default
  4. Add the following code to the Cluster node bootstrap script to install the libraries (see the combined bootstrap sketch after these steps).

    pip install <library name>
    

    The following example shows how to install Pandas and Plotly libraries.

    pip install pandas
    pip install 'plotly<=2.0'
    
  5. Navigate to Notebooks >> Interpreters. In the Interpreter settings, set zeppelin.default.interpreter to pyspark.
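Putting these steps together, a node bootstrap script for this section might look like the following sketch; the Python version and the library list are examples and should be adjusted as needed:

source /usr/lib/hustler/bin/qubole-bash-lib.sh
make-python2.7-system-default

# Install the visualization libraries used in the examples below
pip install pandas
pip install 'plotly<=2.0'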

Using matplotlib

Matplotlib is a multi-platform data visualization library, which you can use to graphically represent your datasets.

Perform the following steps to generate matplotlib visuals:

  1. Navigate to the Notebooks page.

  2. Enter the matplotlib code in the paragraph and click the Run button.

    Note

    If the language of the notebook is not pyspark, then you must use %pyspark as the first line in each paragraph.

    The following example shows a sample code.

    import matplotlib
    import numpy as np
    import matplotlib.pyplot as plt
    # Example data
    people = ('Tom', 'Dick', 'Harry', 'Slim', 'Jim')
    np.random.seed(1234)
    y_pos = np.arange(len(people))
    performance = 3 + 10 * np.random.rand(len(people))
    error = np.random.rand(len(people))
    
    plt.barh(y_pos, performance, xerr=error, align='center', alpha=0.4)
    plt.yticks(y_pos, people)
    plt.xlabel('Performance')
    plt.title('How fast do you want to go today?')
    
    z.showplot(plt)
    

The z.showplot() function in the sample code is a Qubole specific function that is used to plot the graphs.

The respective graph is displayed in the Notebooks page as shown in the following figure.

_images/mat-sample-output.png
Using Plotly

Plotly is a data visualization library, which you can use to create graphs and dashboards.

Perform the following steps to generate plotly visuals:

  1. Navigate to the Notebooks page.

  2. Enter the plotly code in the paragraph and click the Run button.

    Note

    If the language of the notebook is not pyspark, then you must use %pyspark as the first line in each paragraph.

    The following example shows a sample code.

    import plotly
    import plotly.graph_objs as go
    
    # Create random data with numpy
    import numpy as np
    
    def plot(plot_dic, height=1000, width=1000, **kwargs):
        kwargs['output_type'] = 'div'
        plot_str = plotly.offline.plot(plot_dic, **kwargs)
        print('%%angular <div style="height: %ipx; width: %spx"> %s </div>' % (height, width, plot_str))
    
    
    N = 100
    random_x = np.linspace(0, 1, N)
    random_y0 = np.random.randn(N)+5
    random_y1 = np.random.randn(N)
    random_y2 = np.random.randn(N)-5
    
    
    trace0 = go.Scatter(
        x = random_x,
        y = random_y0,
        mode = 'markers',
        name = 'markers'
    )
    trace1 = go.Scatter(
        x = random_x,
        y = random_y1,
        mode = 'lines+markers',
        name = 'lines+markers'
    )
    trace2 = go.Scatter(
        x = random_x,
        y = random_y2,
        mode = 'lines',
        name = 'lines'
    )
    
    
    layout = dict(
      title = 'Line and Scatter Plots',
      xaxis = dict(title='X Axis'),
      yaxis = dict(title='Y Axis'),
      showlegend = False,
      height = 800
    )
    
    data1 = [trace0, trace1, trace2]
    fig1 = dict( data=data1, layout=layout )
    plot(fig1,  show_link=True)
    

The respective interactive graph is displayed in the Notebooks page.

For information on using Notebooks with Spark, see:

To create, configure, run, clone, delete or bind a notebook through a REST API call, see notebook-api.

Jupyter Notebooks

Qubole provides JupyterLab interface, which is the next generation user interface for Jupyter. Jupyter notebooks are supported on Spark 2.2 and later versions.

Note

JupyterLab interface is a Beta feature, and is not enabled for all users by default. You can enable this feature from the Control Panel >> Account Features page. For more information about enabling features, see Managing Account Features.

The following topics help you understand how to use the JupyterLab notebook interface to create and manage Jupyter notebooks:

Accessing JupyterLab Interface with R59

You can access the JupyterLab interface, which is the next generation user interface for Jupyter, to create and manage Jupyter .ipynb notebooks.

Navigate to Notebooks >> Jupyter to access the JupyterLab interface.

The following shows a sample JupyterLab interface.

_images/jupyter-home2.png

The Menu Bar is at the top of the interface. The default menus are:

  • File − Actions related to notebooks.
  • Edit − Actions related to editing notebooks and other activities.
  • View − Actions that alter the appearance of JupyterLab interface.
  • Run − Actions for running code in notebooks.
  • Kernel − Actions for managing kernels, which are separate processes for running code.
  • Tabs − A list of the open documents and activities in the dock panel.
  • Spark - Links to Resource Manager and Livy.
  • Settings − Common settings and an advanced settings editor.
  • Help − Example notebooks, and a list of JupyterLab and kernel help links.

The Left Sidebar has the following tabs:

  • File Browser: Displays the working directory and folders, with buttons for starting a new launcher, adding a folder, uploading files, and refreshing the file list.
  • Running Kernels: Displays running kernels.
  • Scheduler: Provides options to create a schedule, view schedules, and view the run history of scheduled and API-based executions of Jupyter notebooks.
  • VCS: Provides access to the GitHub, GitLab, or Bitbucket version control systems.
  • Command Palette: Lists all the supported commands. You can search for any command by using the search option.
  • Notebook Tools: Provides options to add and remove tags from individual cells, and to view tags in the active cell and the Jupyter notebook.
  • Open Tabs: Lists all the open windows.
  • Table of Contents: Displays the table of contents of a selected notebook.
  • Object Storage Explorer: Displays the content of your cloud storage.
  • Table Explorer: Displays the list of schemas, tables in a given schema, and table metadata.
  • Example Notebooks: Lists the available sample notebooks to help you get started.

The Main Work Area shows the launcher with the supported kernels option for Jupyter notebooks. The notebooks are opened in the Main Work Area.

Accessing JupyterLab Interface

You can access the JupyterLab interface, which is the next generation user interface for Jupyter, to create and manage Jupyter .ipynb notebooks.

  1. From the Home page, navigate to Notebooks >> Jupyter.

  2. Select a Spark cluster from the drop-down list, and click Open. Spark clusters with versions 2.2 and later are displayed in the drop-down list.

    If the selected cluster is running, then the JupyterLab interface is launched.

    If the selected cluster is not running, the cluster is started when you click Open. After the cluster is started, the JupyterLab interface is launched.

Alternatively, you can also access the JupyterLab interface from the Clusters page. Click on Resources for a Spark cluster, and select Jupyter from the menu.

After the interface is launched, you can see the associated cluster on the top right corner. To change the associated cluster perform the following steps:

  1. Click on the down arrow and select the required cluster.
  2. If you select a cluster that is not running, then initial UI of the JupyterLab interface is opened.
  3. Select the required cluster, and click Open.

The following labeled JupyterLab interface provides an overview of the layout and menu options that help you create and manage Jupyter notebooks.

_images/jupyter-home1.png

The Menu Bar is at the top of the interface. The default menus are:

  • File: Actions related to notebooks.
  • Edit: Actions related to editing notebooks and other activities.
  • View: Actions that alter the appearance of JupyterLab interface.
  • Run: Actions for running code in notebooks.
  • Kernel: Actions for managing kernels, which are separate processes for running code.
  • Tabs: A list of the open documents and activities in the dock panel.
  • Settings: Common settings and an advanced settings editor.
  • Help: Example notebooks, and a list of JupyterLab and kernel help links.

The Left Sidebar has the following options:

  • File Browser: Displays the working directory and folders, with buttons for starting a new launcher, adding a folder, uploading files, and refreshing the file list.
  • Running: Displays running kernels.
  • Command Palette: Lists all the supported commands. You can search for any command by using the search option.
  • Notebook Tools: Lists all the tools for the Jupyter notebooks.
  • Open Tabs: Lists all the open windows.
  • Table of Contents: Displays the table of contents of a selected notebook.
  • Example Notebooks: Lists the available sample notebooks to help you get started.

The Main Work Area shows the launcher with the supported kernels option for Jupyter notebooks. The notebooks are opened in the Main Work Area.

Access Control in Jupyter Notebooks

You can set access control for Jupyter notebooks at the account level and at the object level.

Note

This feature is available in the latest version of Jupyter Notebooks. If you are not using the latest version of Jupyter notebooks, you can enable this feature from the Control Panel >> Account Features page. For more information about enabling features, see Managing Account Features. For custom roles, add the following permissions before enabling this feature:

  • CREATE and READ (allow) access on the Jupyter Notebook resource.
  • READ (allow) access on the Folder resource.

As an account admin, you can set account level permissions in the Manage Roles UI. You can allow or deny Jupyter Notebook resource for a role and specify the policy actions such as create, read, update and delete. For more information, see Managing Roles.

As the owner of a Jupyter notebook, you can override the permissions for folders and notebooks to restrict access and visibility to specific users or groups. For folders, you can set permissions on your working directory, and on the first-level folders in the Common folder if you have the required permissions for those folders. You can set permissions on the notebooks that you own.

The following list shows the resources, the policy actions, and their descriptions.

  • Jupyter Notebook
    • create: Create and manage Jupyter notebooks.
    • read: Read any Jupyter notebook.
    • update: Update any Jupyter notebook.
    • delete: Delete any Jupyter notebook.
  • Folder
    • read: Create resources in your working directory. Read other resources.
    • write: Create and read resources in your working directory, the Common folder, and other users' folders.
    • manage: Manage folder permissions.
Folders in JupyterLab Interface

When you launch the JupyterLab interface for the first time, a folder with your email address is created as your working directory. By default, the File Browser option is selected from the left menu.

Currently, there is no access control list for the folders. You can view, edit, and manage the notebooks of other users as well.

You can perform the following actions from the File Browser view:

  • Create a new folder: Right-click on the left panel and select New Folder. A folder with untitled(n) name is created. You can rename the folder by using the Rename option from the context menu of the folder.

  • Manage folders: Right-click on the required folder to open the context menu, and select the appropriate option from the menu. You can perform operations such as rename, cut, paste, copy path, and delete a folder.

    Note

    You can delete only empty folders.

  • View all folders in the account: Click on the folder icon next to your email address.

  • View notebooks in other folders: Double-click on the required folder, or right-click on it and select Open from the context menu.

  • Upload Jupyter notebooks from the local storage: Click on the upload files button above your working directory name. Select the notebooks from the local storage, and click Choose.

    Note

    You can upload Jupyter notebooks up to 25 MB in size.

Creating Jupyter Notebooks

You can create Jupyter notebooks with PySpark (Python), Spark (Scala), SparkR, and Python kernels from the JupyterLab interface.

  1. Perform one of the following steps to create a Jupyter notebook.

    • From the Launcher, click on PySpark, Spark, SparkR, or Python to create a Jupyter notebook with PySpark, Spark, SparkR, or Python Kernels, respectively.

      Note

      Python kernel does not use the distributed processing capabilities of Spark when executed on a Spark cluster.

    • Navigate to the File >> New menu and select Notebook. The New Notebook dialog is displayed as shown below.

      _images/create-jp-nb.png
  2. Enter a name for the Jupyter notebook.

  3. Select the appropriate Kernel from the drop-down list.

  4. Click Create.

The newly created Jupyter notebook opens in the main work area as shown in the following figure.

_images/launcher.png

The new Jupyter notebook has the following UI options:

  • Associated cluster on the top-right corner. To change the associated cluster perform the following steps:
    1. Click on the down arrow and select the required cluster. If you select a cluster that is not running, then initial UI of the JupyterLab interface is opened.
    2. Select the required cluster, and click Open.
  • Associated Kernel. The empty circle indicates an idle kernel. The circle with the cross bar indicates a disconnected kernel, and a filled circle indicates a busy kernel.
  • The widget shown as down arrow displays the Spark application status.
  • Various buttons in the toolbar to perform operations on that notebook.
  • Context menus that are displayed with a right-click on the UI elements. In the Main Work Area, you can perform cell-level and notebook-level operations by using the context menu of the Main Work Area. For details, see Context Menu for Main Work Area.

The following magics help you build and run code in Jupyter notebooks (see the example after this list):

  • %%help shows the supported magics.
  • %%markdown for markdown, or select Markdown from the Cell type drop-down list.
  • %%sql for SQL on Spark.
  • %%bash or %%sh for shell.
  • %%local for local execution in the kernel.
  • %%configure for configuring Spark settings.
  • %matplot plt instead of %%local for matplotlib plots.
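For example, a cell that uses the %%sql magic might look like the following sketch; it assumes the sample table default_qubole_airline_origin_destination (used in the visualization examples later in this guide) is available:

%%sql
select origin, count(*) as flights
from default_qubole_airline_origin_destination
group by origin
order by flights desc
limit 10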
Running Jupyter Notebooks

After creating a Jupyter notebook, you can either run the required cells by using the Run option for a cell or run all cells by using the Run All Cells option. You can use these options from the toolbar of the notebook or from the context menu of the cells.

For more information, see Other Options.

Running a Jupyter Notebook from Another Jupyter Notebook

You can run a Jupyter notebook from another Jupyter notebook that has the same type of kernel. For example, you can run a Jupyter notebook with Spark kernel from another Jupyter notebook with Spark kernel.

  1. From the left Sidebar, select and right-click on the Jupyter notebook that has to be run from another notebook.

  2. From the context menu, select Copy Path.

  3. Open the Jupyter notebook from which you want to run another notebook.

  4. Enter the %run magic as shown below:

    %run <path of the notebook that has to be run>
    
  5. Click Run.

The notebook run is synchronous. The cell that initiates the notebook run waits for the called notebook to complete before proceeding.

The following image shows the Jupyter notebook n2.ipynb that has to be run from another notebook.

_images/src-run-magic.png

The following image shows how to run a Jupyter notebook n2.ipynb from another notebook.

_images/run-with-magic.png
Scheduling Jupyter Notebooks

From the JupyterLab interface or the Scheduler UI, you can create a schedule to run Jupyter notebooks at periodic intervals without manual intervention. After scheduling, you can view the list of schedules, the run history, and the output of the scheduled runs.

Note

This feature is available in the latest version of Jupyter Notebooks. Contact Qubole Support to migrate to the latest version of Jupyter Notebooks.

Note

You must have access to create the Jupyter Notebook command to schedule Jupyter notebooks.

Creating a Schedule
  1. Depending on whether you want to create a schedule from the JupyterLab interface or Scheduler UI, perform the appropriate actions:

    • From JupyterLab Interface
      1. Select and open the required Jupyter notebook.
      2. If you want to pass parameters from the Scheduler arguments, then designate a cell as a parameter cell. Select the appropriate cell, right-click to open the context menu, and select Set as parameters cell.
      3. Click on the Scheduler icon on the left sidebar tab or on the top tool bar of the Jupyter notebook.
      4. Click the + icon in the Scheduler context menu. The Scheduler UI opens in a separate tab.
    • Navigate to the Scheduler UI, and click the +Create button in the left pane.
  2. Enter a name in the Schedule Name text field.

  3. In the command field, select Jupyter Notebook from the drop-down list.

  4. Select the required Jupyter notebook from the Select Jupyter Notebook drop-down list.

  5. Select the required cluster from the drop-down list. Only Spark clusters running Spark 2.2 and later versions are supported.

  6. Optionally, enter the arguments and their values in a valid JSON format in the Arguments field (see the sample arguments after these steps).

    If you designated a cell as a parameter cell in the Jupyter notebook, then the schedule parameters get injected after the designated cell.

    If the Jupyter notebook does not contain any designated parameter cell, then the schedule parameters get injected into the first cell or the cell after the %%configure magic, if the %%configure magic is used.

  7. To add details in the Macros, Schedule, and Advanced settings sections, see Creating a New Schedule.

  8. Click Save.
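For illustration, the Arguments field accepts a JSON object of parameter names and values; the names and values below are placeholders:

{"param_1": "2020", "param_2": "US"}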

The following figure shows a sample Scheduler UI for a Jupyter notebook.

_images/schedule-jupy-nb.png

The Scheduler runs the scheduled Jupyter notebooks at the specified schedule. The schedule run is viewable when at least one cell in the Jupyter notebook is executed. If the schedule run fails, verify the command logs.

If you click on the Command ID, the output of the Jupyter notebook is displayed in a separate tab. If the associated cluster is down or the Jupyter interface is not accessible, then the Command Logs are displayed in a separate tab. The output of the notebook is read-only.

Viewing Schedules and Run History

For a Jupyter notebook, you can view the list of schedules and run history.

  1. Open a Jupyter notebook from the left sidebar.

  2. Click on the Scheduler icon either from the left sidebar tab or from the top toolbar of the Jupyter notebook. The left sidebar displays the Schedule(s) and Run History tabs as shown below.

    _images/view-jupy-schedule.png
  3. To view the active schedules, click the Schedule(s) tab. Double-clicking on an active schedule opens the Scheduler UI in a separate tab.

  4. To view the output of a notebook, click Run History and double-click on the required run. Output of the notebook is displayed in a separate tab as shown below:

    _images/jupy-notebook-result.png

    Hover on the schedules in the Run History to view details.

Managing Jupyter Notebooks

After creating a Jupyter notebook, you can perform various tasks to manage Jupyter notebooks from the JupyterLab interface.

Setting Access Control for Jupyter Notebooks

You can specify the access control for your Jupyter notebooks.

Note

This feature is available in the latest version of Jupyter Notebooks. Contact Qubole Support to migrate to the latest version of Jupyter Notebooks.

  1. Select and right-click on the required Jupyter notebook.

  2. From the context menu, select Manage Permissions. The Manage Permissions dialog is displayed as shown below.

    _images/file-level-permissions.png
  3. Select the users or groups, and select the appropriate permissions.

  4. Click Save.

Viewing Jupyter Notebooks
  • You can open and view multiple Jupyter notebooks. Each Jupyter notebook opens as a separate tab in the Main Work Area.
  • You can drag and reposition the notebooks anywhere in the Main Work Area.
  • To view a single document in the Main Work Area, navigate to View and select Single-Document Mode.
  • You can also create a new view for the output for a cell as a separate tab by right-clicking on the cell and selecting Create New View for Output. You can drag and reposition the Output View window in the Main Work Area.
Renaming Jupyter Notebooks

You can rename the Jupyter notebooks either from the Jupyter notebook tab in the Main Work Area or from the File Browser.

Perform one of the following steps to rename a Jupyter notebook:

  • Rename from the Main Work Area.

    1. In the center panel, right-click on the title of the notebook and select Rename Notebook.

    2. From the Rename File pop-up window, enter a new name and click Rename.

      The following animated GIF shows how to rename a Jupyter notebook from the Main Work Area.

      _images/rename1.gif
  • Rename from the File Browser.

    1. From the File Browser pane on the left panel, right-click on the required notebook, select Rename.
    2. Enter a new name.

The following animated GIF shows how to rename a Jupyter notebook from the File Browser pane.

_images/rename2.gif
Copying Example Notebooks

You can copy the example notebooks and paste them in your working directory.

  1. Click on the Example Notebooks icon on the left tool bar.
  2. In the left sidebar, right-click on the required example notebook, and click Copy.
  3. Navigate to your working directory and open the required folder.
  4. Right-click in the left sidebar, and click Paste.
Moving Jupyter Notebooks to a Folder

When you create a Jupyter notebook, it is created in your current working folder. You can move a Jupyter notebook from one folder to another by dragging it from the source folder and dropping it into the destination folder.

Other Options

The context menu of a Jupyter notebook in the left panel provides various options that you can use for a notebook.

Depending on the action you want to perform, select the appropriate option from the menu as shown below.

_images/manage-notebook-jupy.png
Cell Operations

You can perform various cell-level operations by using the options displayed in the context menu of a cell.

  1. Right-click on the required cell.
  2. Depending on the action you want to perform, select the appropriate option from the menu as shown below.
_images/cell-level-menu.png

You can perform various cell-level operations for the entire notebook by using the options displayed in the context menu of the main work area.

  1. Right-click anywhere on the main work area.
  2. Depending on the action you want to perform, select the appropriate option from the menu as shown below.
_images/notebook-level.png
Using Shortcuts

You can use certain shortcuts to perform a few operations:

  • Press Tab in the cells to view and use autocomplete suggestions for Spark and PySpark notebooks.
  • Press Shift+Tab to view docstring help in PySpark notebooks
Version Control Systems for Jupyter Notebooks

Qubole supports GitHub, GitLab, and Bitbucket integration with Jupyter notebooks. These integrations let you use a central repository as the single point of entry for all changes to a project. They provide the following advantages:

  • Ability to track changes
  • Provides a central repository for code changes
  • Effective way to collaborate

To enable GitHub, GitLab, or Bitbucket integration, create a ticket with Qubole Support.

As a prerequisite, you must configure the version control systems before using the VCS with Jupyter notebooks.

GitHub Version Control for Jupyter Notebooks

To use the version control for Jupyter Notebooks using GitHub, you must perform the following tasks:

  1. Configure Version Control Settings
  2. Configure a GitHub Token
  3. Link Jupyter Notebooks to GitHub

After configuring the GitHub repository, you can perform the following tasks to manage the notebook versions:

Configuring a GitHub Token

You can configure a GitHub token for Jupyter notebooks at the per-user and per-account setting levels from the My Accounts page or the JupyterLab interface.

  • To configure the GitHub token for Jupyter notebooks for your account, see Configuring a GitHub Token.

  • To configure the GitHub token from the Jupyter notebooks, perform the following steps:

    1. Navigate to Notebooks >> Jupyter and open a Jupyter notebook.

    2. From the left sidebar, click on the Github Versions icon as shown in the following figure.

      _images/ConfigGitHubinNote-jupy.png
    3. Click Configure now.

    4. In the dialog box add the generated GitHub token and click Save.

      The GitHub token is now configured for your account.

Linking Jupyter Notebooks to GitHub

After configuring the GitHub token, you can link the Jupyter notebooks to GitHub.

  1. Obtain the GitHub repository URL.

    1. Navigate to the GitHub profile and click Repositories.

    2. From the list of repositories, click the repository that you want to link.

    3. Copy the URL that is displayed within that repository.

      Alternatively, you can navigate to the GitHub profile and copy the URL from the browser’s address-bar.

      Note

      If you want to add HTTPS *.git link as the GitHub repository URL, click Clone or Download. A drop-down text box is displayed. Copy the HTTPS URL or click Use HTTP (if it exists) to copy the HTTP URL.

  2. Navigate to Notebooks >> Jupyter and open a Jupyter notebook.

  3. From the left sidebar, click on the GitHub Versions icon as shown in the following figure.

    _images/link-github-jp.png
  4. Click the Link Now option.

  5. In the Link Notebook to GitHub dialog box, perform the following actions:

    1. Add the GitHub repository URL in the Repository Web URL text field. Ensure that the GitHub profile token has read permissions for the repository to checkout a commit and write permissions for the repository to push a commit.

    2. Select a branch from the Branch drop-down list.

    3. Add an object path file in the Object Path text field.

    4. If you want to strip the outputs from the notebooks before committing to GitHub, select the Strip Output checkbox.

      A sample is as shown in the following figure.

      _images/link-github-config-jp.png
    5. Click Save.

Pushing Commits to GitHub

After you link a notebook with a GitHub profile, you can push commits to GitHub directly from the notebook.

Steps
  1. Open the required Jupyter notebook and save the changes.
  2. From the left sidebar, click on the GitHub Versions icon.
  3. Click the Push icon to commit. A dialog opens to push commits.
  4. Add a commit message and click Save to push the commit to the GitHub repository. You can use the option force commit to force push over the old commit (irrespective of any conflict).

Note

Qubole does not store commits or revisions of notebooks. However, commits or revisions of notebooks can be fetched from users’ GitHub account whenever required.

Viewing and Comparing the Jupyter Notebook Versions

You can view a particular version of the Jupyter notebook by using the View option in the GITHUB VERSIONS sidebar as shown below.

_images/view-in-github.png

You can compare a version of the Jupyter notebook with the previous version, or compare a version with the current changes, by using the Compare option in the GITHUB VERSIONS sidebar as shown below.

The Compare icon on top of the left sidebar compares the current notebook with the head of the branch. The Compare hyperlink in the left sidebar compares the given version with the previous version.

The following image shows a sample comparison of Jupyter notebook versions.

_images/compare-in-github.png
Restoring a Commit from GitHub
  1. Open the required Jupyter notebook.
  2. From the left sidebar, click on the GitHub Versions icon.
  3. Select a version from the list and click Restore to checkout that version.
  4. Click OK to checkout that version in the confirmation dialog box.

Note

Qubole does not store commits or revisions of notebooks. However, commits or revisions of notebooks can be fetched from users’ GitHub account whenever required.

Creating a Pull Request from Jupyter Notebooks
  1. Open the required Jupyter notebook.

  2. From the left side bar, click on the GitHub Versions icon.

  3. Click on the Gear icon in the GITHUB VERSIONS pane. The Link Notebook to GitHub dialog is displayed.

  4. Click on the Create PR hyperlink.

  5. Proceed with the steps in GitHub to create the PR.

    For more information, see GitHub Documentation.

Resolving Conflicts While Using GitHub

There may be conflicts while pushing/checking out commits in the GitHub versions.

Note

You can use the option force commit to force push over the old commit (irrespective of any conflict).

Perform the following steps to resolve conflicts in commits:

  1. Clone the notebook.
  2. Link the cloned notebook to the same GitHub repo branch and path as the original notebook.
  3. Checkout the latest version of the cloned notebook.
  4. Manually port changes from the original notebook to the cloned notebook.
  5. You can commit the cloned notebook after porting changes.
GitLab Version Control for Jupyter Notebooks

To use the version control for Notebooks using GitLab, you must perform the following tasks:

  1. Configure Version Control Settings
  2. Configure a GitLab Token
  3. Link Jupyter Notebooks to GitLab

After configuring the GitLab Repository, you can perform the following tasks to manage the notebook versions:

Configuring a GitLab Token

You can configure a GitLab token for Jupyter notebooks at the per-user and per-account setting levels from the My Accounts page or the JupyterLab interface.

  • To configure the GitLab token for Jupyter notebooks for your account, see Configuring a GitLab Token.

  • To configure the GitLab token from Jupyter notebooks, perform the following steps:

    1. Navigate to Notebooks >> Jupyter and open a Jupyter notebook.

    2. From the left sidebar, click on the GitLab Versions icon as shown in the following figure.

      _images/config-gitlab.png
    3. Click Configure now.

    4. In the dialog box add the generated GitLab token and click Save.

      The GitLab token is now configured for your account.

Linking Jupyter Notebooks to GitLab

After configuring the GitLab token, you can link Jupyter notebooks to GitLab.

  1. Navigate to the GitLab profile and copy the URL from the browser’s address-bar.

    Note

    If you want to add HTTPS *.git link as the GitLab repository URL, click Clone or Download. A drop-down text box is displayed. Copy the HTTPS URL or click Use HTTP (if it exists) to copy the HTTP URL.

  2. Navigate to Notebooks >> Jupyter and open a Jupyter notebook.

  3. From the left sidebar, click on the GitLab Versions icon as shown in the following figure.

    _images/link-gitlab-jp.png
  4. Click the Link Now option.

  5. In the Link Notebook to GitLab dialog box, perform the following actions:

    1. Add the GitLab repository URL in the Repository Web URL text field. Ensure that the GitLab profile token has read permissions for the repository to checkout a commit and write permissions for the repository to push a commit.

    2. Select a branch from the Branch drop-down list.

    3. Add an object path file in the Object Path text field.

    4. If you want to strip the outputs from the notebooks before committing to GitLab, select the Strip Output checkbox.

      A sample is as shown in the following figure.

      _images/link-gl-jp.png
    5. Click Save.

Pushing Commits to Linked GitLab

After you link a notebook with GitLab, you can push commits to GitLab directly from the Jupyter notebook.

Steps
  1. Open the required Jupyter notebook and save the changes.

  2. From the left sidebar, click on the GitLab Versions icon.

  3. Click the cloud icon to push the commit. A dialog opens to push commits. The following figure shows the version details and the Push to GitLab dialog.

    _images/PushtoGitLab1.png
  4. Add a commit message and click Save to push the commit to the GitLab repository. You can use the option force commit to force push over the old commit (irrespective of any conflict).

Note

Qubole does not store commits or revisions of notebooks. However, commits or revisions of notebooks can be fetched from users’ GitLab account whenever required.

Viewing and Comparing the Jupyter Notebook Versions

You can view a particular version of the Jupyter notebook by using the View option in the GITLAB VERSIONS sidebar as shown below.

_images/view-in-gitlab.png

You can compare a version of the Jupyter notebook with the previous version, or compare a version with the current changes, by using the Compare option in the GITLAB VERSIONS sidebar as shown below.

The Compare icon on top of the left sidebar compares the current notebook with the head of the branch. The Compare hyperlink in the left sidebar compares the given version with the previous version.

The following image shows a sample comparison of Jupyter notebook versions.

_images/compare-in-gitlab.png
Restoring a Commit from GitLab
  1. Open the required Jupyter notebook.
  2. From the left sidebar, click on the GitLab Versions icon.
  3. Select a version from the list and click Restore to checkout that version.
  4. Click OK to checkout that version in the confirmation dialog box.

Note

Qubole does not store commits or revisions of notebooks. However, commits or revisions of notebooks can be fetched from users’ GitLab account whenever required.

Creating a Pull Request from Jupyter Notebooks
  1. Open the required Jupyter notebook.

  2. From the left side bar, click on the GitLab Versions icon.

  3. Click on the Gear icon in the GITLAB VERSIONS pane. The Link Notebook to GitLab dialog is displayed.

  4. Click on the Create PR hyperlink.

  5. Proceed with the steps in GitLab to create the PR.

    For more information, see GitLab Documentation.

Resolving Conflicts While Using GitLab

There may be conflicts while pushing/checking out commits in the GitLab versions.

Note

You can use the option force commit to force push over the old commit (irrespective of any conflict).

Perform the following steps to resolve conflicts in commits:

  1. Clone the notebook.
  2. Link the cloned notebook to the same GitLab repo branch and path as the original notebook.
  3. Checkout the latest version of the cloned notebook.
  4. Manually port changes from the original notebook to the cloned notebook.
  5. You can commit the cloned notebook after porting changes.
Bitbucket Version Control for Jupyter Notebooks

To use the version control for Jupyter Notebooks using Bitbucket, you must perform the following tasks:

  1. Configure Version Control Settings
  2. Configure Bitbucket
  3. Link Jupyter Notebooks to Bitbucket

After configuring the Bitbucket Repository, you can perform the following tasks to manage the Jupyter notebook versions:

Configuring Bitbucket

You can configure Bitbucket for Jupyter notebooks at the per-user and per-account setting levels from the My Accounts page or the JupyterLab interface.

  • To configure Bitbucket for Jupyter notebooks at your account level, see Configuring Bitbucket.

  • To configure Bitbucket from Jupyter notebooks, perform the following steps:

    1. Navigate to Notebooks >> Jupyter and open a Jupyter notebook.

    2. From the left side bar, click on the icon for the Bitbucket versions as shown in the following figure.

      _images/config-bb-jp.png
    3. Click Configure now.

    4. In the dialog box, add the Bitbucket credentials and click Save. You can use either your Bitbucket credentials or a Bitbucket app password.

    Bitbucket is now configured for your account.

Linking Jupyter Notebooks to Bitbucket

After configuring Bitbucket, you can link the Jupyter notebooks to Bitbucket.

  1. Navigate to the Bitbucket profile and copy the URL from the browser’s address-bar.

  2. Navigate to Notebooks >> Jupyter and open a Jupyter notebook.

  3. From the left side bar, click on the Bitbucket Versions icon as shown in the following figure.

    _images/link-bb-jp.png
  4. Click the Link Now option.

  5. In the Link Notebook to Bitbucket dialog box, perform the following actions:

    1. Add the Bitbucket repository URL in the Repository Web URL text field. Ensure that the Bitbucket profile has read permissions for the repository to checkout a commit and write permissions for the repository to push a commit.

    2. Select a branch from the Branch drop-down list.

    3. Add an object path file in the Object Path text field.

    4. If you want to strip the outputs from the notebooks before committing to Bitbucket, select the Strip Output checkbox.

      A sample is as shown in the following figure.

      _images/link-bb-jp-repo.png
    5. Click Save.

Pushing Commits to Bitbucket

After you link a notebook with Bitbucket, you can push commits to Bitbucket directly from the notebook.

Steps
  1. Open the required Jupyter notebook and save the changes.

  2. From the left side bar, click on the BitBucket Versions icon.

  3. Click the Push icon to commit. A dialog opens to push commits. The following figure shows the version details and the Push to Bitbucket dialog.

    _images/Pushtobb1.png
  4. Add a commit message and click Save to push the commit to the Bitbucket repository. You can use the option force commit to force push over the old commit (irrespective of any conflict).

Note

Qubole does not store commits or revisions of notebooks. However, commits or revisions of notebooks can be fetched from users’ Bitbucket account whenever required.

Viewing and Comparing the Jupyter Notebook Versions

You can view a particular version of the Jupyter notebook by using the View option in the BITBUCKET VERSIONS sidebar as shown below.

_images/view-in-bb.png

You can compare a version of the Jupyter notebook with the previous version, or compare a version with the current changes, by using the Compare option in the BITBUCKET VERSIONS sidebar as shown below.

The Compare icon on top of the left sidebar compares the current notebook with the head of the branch. The Compare hyperlink in the left sidebar compares the given version with the previous version.

The following image shows a sample comparison of Jupyter notebook versions.

_images/compare-in-bb.png
Restoring a Commit from Bitbucket
  1. Open the required Jupyter notebook.
  2. From the left side bar, click on the BitBucket Versions icon.
  3. Select a version from the list and click Restore to checkout that version.
  4. Click OK to checkout that version in the confirmation dialog box.

Note

Qubole does not store commits or revisions of notebooks. However, commits or revisions of notebooks can be fetched from users’ Bitbucket account whenever required.

Creating a Pull Request from Jupyter Notebooks
  1. Open the required Jupyter notebook.

  2. From the left side bar, click on the BitBucket Versions icon.

  3. Click on the Gear icon in the BITBUCKET VERSIONS pane. The Link Notebook to Bitbucket dialog is displayed.

  4. Click on the Create PR hyperlink.

  5. Proceed with the steps in BitBucket to create the PR.

    For more information, see BitBucket Documentation.

Resolving Conflicts While Using Bitbucket

There may be conflicts while pushing/checking out commits in the Bitbucket versions.

Note

You can use the option force commit to force push over the old commit (irrespective of any conflict).

Perform the following steps to resolve conflicts in commits:

  1. Clone the notebook.
  2. Link the cloned notebook to the same Bitbucket repo branch and path as the original notebook.
  3. Checkout the latest version of the cloned notebook.
  4. Manually port changes from the original notebook to the cloned notebook.
  5. You can commit the cloned notebook after porting changes.
Data Visualization in Jupyter Notebooks

Jupyter Notebooks provide a data visualization framework called Qviz that enables you to visualize dataframes with improved charting options and Python plots on the Spark driver. Qviz provides a display() function that enables you to plot charts, such as table chart, pie chart, line chart, and area chart for the following data types:

  • Spark dataframes
  • pandas dataframes
  • SQL (%%sql) magic

You can also create visualization for custom plots directly on the Spark driver by using the supported Python libraries.

Note

The display() function is supported only on PySpark kernels.

The Qviz framework provides various options to visualize data and customize the charts.

Visualizing Spark Dataframes

You can visualize a Spark dataframe in Jupyter notebooks by using the display(<dataframe-name>) function.

Note

The display() function is supported only on PySpark kernels. The Qviz framework supports 1000 rows and 100 columns.

For example, suppose you have a Spark dataframe sdf that selects data from the table default_qubole_airline_origin_destination. You can visualize the content of this Spark dataframe by using the display(sdf) function as shown below:

sdf = spark.sql("select * from default_qubole_airline_origin_destination limit 10")
display(sdf)

By default, the dataframe is visualized as a table.

The following illustration shows the sample visualization chart of display(sdf).

_images/spark-df.png

You can click on the other chart options in the Qviz framework to view other visualization types and customize the chart by using the Plot Builder option. For more information, see Using Qviz Options.

Visualizing pandas dataframes

You can visualize a pandas dataframe in Jupyter notebooks by using the display(<dataframe-name>) function.

Note

The display() function is supported only on PySpark kernels. The Qviz framework supports 1000 rows and 100 columns.

For example, suppose you have a pandas dataframe df that reads a .csv file. You can visualize the content of this pandas dataframe by using the display(df) function.
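The following is a minimal sketch; the .csv path is a placeholder:

import pandas as pd

df = pd.read_csv("/path/to/sample.csv")  # placeholder path
display(df)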

By default, the dataframe is visualized as a table.

The following image shows the sample visualization chart of display(df).

_images/panda-df.png

You can click on the other chart options in the Qviz framework to view other visualization types and customize the chart by using the Plot Builder option. For more information, see Using Qviz Options.

The following image shows the Area chart of the above mentioned pandas dataframe.

_images/panda-df1.png
Visualizing SQL

When you execute a SQL query with the SQL magic %%sql, Jupyter notebooks initiate the Qviz framework and the query data is displayed in table format. You can select any other visualization type to view the data.

Note

The display() function is supported only on PySpark, Spark, and SparkR kernels. The Qviz framework supports 1000 rows and 100 columns.

Example:

%%sql
select origin, quarter, count(*)/1000000 count from default_qubole_airline_origin_destination
  where quarter is not NULL group by origin, quarter order by count desc limit 100

The following image shows a sample SQL visualization in Line chart.

_images/sql-qviz.png

You can click on the other chart options in the Qviz framework to view other visualization types and customize the chart by using the Plot Builder option. For more information, see Using Qviz Options.

Visualizing Using Python Plotting Libraries

You can visualize Python plots on the Spark driver by using the display() function.

The following Python libraries are supported:

  • plotly
  • matplotlib
  • seaborn
  • altair
  • pygal
  • leather

Note

The display() function is supported only on PySpark kernels.

Using plotly
import plotly.express as px
data_canada = px.data.gapminder().query("country == 'Canada'")
fig = px.bar(data_canada, x='year', y='pop')
display(fig)

The following image shows the visualization of the plotly plot.

_images/plotly.png
Using matplotlib
import pandas as pd
import matplotlib.pyplot as plt
plt.switch_backend('agg')


sdf = spark.sql("select * from default_qubole_airline_origin_destination limit 10")
data = sdf.toPandas()

data['distance'] = pd.to_numeric(data['distance'], errors='coerce')
data.plot(kind='bar', x='dest', y='distance', color='blue')

display(plt)

The following image shows the visualization of the matplotlib plot.

_images/matplotlib.png
Using seaborn
import numpy as np
import matplotlib.pyplot as plt
plt.switch_backend('agg')
import seaborn as sns
print(sns)
data = np.random.normal(0, 1, 3)
plt.figure(figsize=(9, 2))
sns.boxplot(x=data);

display(plt)

The following image shows the visualization of the seaborn plot.

_images/seaborn.png
Using altair
import altair as alt
import pandas as pd

source = pd.DataFrame({
    'a': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I'],
    'b': [28, 55, 43, 91, 81, 53, 19, 87, 52]
})

plt = alt.Chart(source).mark_bar().encode(
    x='a',
    y='b'
)

display(plt)

The following image shows the visualization of the altair plot.

_images/altair.png
Using pygal
import pygal

bar_chart = pygal.Bar()
bar_chart.add('Fibonacci', [0, 1, 1, 2, 3, 5, 8])

display(bar_chart)

The following image shows the visualization of the pygal plot.

_images/pygal.png
Using leather
import random
import leather
dot_data = [(random.randint(0, 250), random.randint(0, 250)) for i in range(100)]
def colorizer(d):
    return 'rgb(%i, %i, %i)' % (d.x, d.y, 150)
chart = leather.Chart('Colorized dots')
chart.add_dots(dot_data, fill_color=colorizer)
display(chart)

The following image shows the visualization of the leather plot.

_images/leather.png

For other plot types, refer to the PlotExamplesPySpark.ipynb in the Example Notebooks of the Jupyter notebooks.

Using Qviz Options

The Qviz framework provides various UI elements that enable you to select among the different visualization types and to customize the visualization by using the Plot Builder option. These options are available for visualizations that are created with dataframes (Spark and pandas), and with SQL magic.

Visualization Types

Qviz framework provides the following visualization types:

  • Table
  • Pie Chart
  • Line Chart
  • Area Chart
  • Bar Chart
  • Bubble Chart
  • Scatter Chart
  • Horizontal Bar chart
  • Donut Pie Chart

The Qviz framework displays a toolbar on the top-right corner of the chart with options such as downloading the plot as a PNG file, zooming in, and zooming out. These options vary depending on the chart type. When viewing the visualization as a table, you can also use pagination to view the data across pages.

Click on the required visualization type to plot the required charts.

The following illustration shows a sample Table visualization type.

_images/qviz-options.png

The following illustration shows a Bar chart with the highlighted toolbar.

_images/bar-chart.png

The following illustration shows a sample Pie chart.

_images/pie-chart.png

The following illustration shows a sample Line chart.

_images/line-chart.png

The following illustration shows an Area chart.

_images/area-chart.png
Customizing Visualization Types

You can use the Plot Builder option of the Qviz framework to customize a visualization. The Plot Builder option provides the following capabilities:

  • Change plot types.
  • Plot multiple traces (up to 4) by using drag and drop.
  • Select the traces for the X and Y axes.
  • Use aggregate functions such as sum, min, max, avg, and count.
  • Change the axis scale to a log scale.
  • Use configuration options such as Normalized, Stacked, and Grouped for Area and Bar charts.
  • Preview the plot.

The following illustration shows a sample Plot Builder dialog.

_images/plot-builder.png
Exploring Data in Jupyter Notebooks

You can explore data through the Object Storage Explorer and Table Explorer options on the left toolbar of the JupyterLab interface.

From the Object Storage Explorer sidebar, you can view the folders in the storage, and drag and drop object storage objects, such as buckets, folders, and files, into a cell in a Jupyter notebook.

_images/cloud-explorer.png

Depending on your permissions, you can upload data to and download data from the cloud storage by clicking the Gear icon as shown below.

_images/cloud-explorer-menu.png

From the Table Explorer sidebar, you can view the list of schemas, the tables in a given schema, and table metadata as shown below. You can also drag and drop table explorer objects, such as tables and columns, into a cell in a Jupyter notebook.

_images/table-explorer.png
Interpreter Modes for Jupyter Notebooks

Jupyter Notebooks on QDS support two Interpreter modes: Scoped mode and Isolated mode.

Scoped mode

This is the default mode. In this mode, there is one Spark application for each user, and all the notebooks executed by a user share the same Spark application. To prevent namespace conflicts between notebooks, a new interpreter group is created per kernel. Each interpreter group has its own REPL or spark-shell. The driver is created with the default 2 GB memory. When the user disconnects the kernel, the Spark application is not terminated immediately; it is terminated after a 60-minute idle timeout.

In this mode, you cannot modify the Spark configuration; the Spark application uses the default configuration set at the cluster level. All the notebooks in the Scoped mode run serially, and cells within a notebook run serially.

Isolated mode

In this mode, there is one Spark application per notebook of a user; a new Spark application is started for each notebook. Multiple notebooks can run in parallel, and cells within a notebook run serially.

The notebook's first cell should be the %%configure cell magic.

For example:

%%configure -f
{ }

You can modify the Spark configuration in the Isolated mode. To modify the Spark configuration, use the %%configure cell magic and pass the parameters listed in the Spark Configuration Parameters table.

For example:

%%configure -f
{ "driverMemory" : "2g" , "conf" : { "spark.jars.packages":"graphframes:graphframes:0.7.0-spark2.4-s_2.11" } }

When you use the %%configure magic to specify any Spark configuration, a new Spark application is created in the Isolated mode with the specified configuration for your notebook.
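
As a minimal sketch (assuming a PySpark kernel and the %%configure example above), you can read a property back from the new Spark application to confirm that the configuration took effect:

# Minimal sketch, assuming a PySpark kernel: read back a property that was
# set through the "conf" map in the %%configure cell above.
print(spark.sparkContext.getConf().get("spark.jars.packages"))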

If you want to change the default mode to isolated mode, contact Qubole Support.

Switching the Interpreter Modes for a Notebook

To switch from the Scoped mode to the Isolated mode for a notebook, use the %%configure magic. For more details, see Isolated mode.

To switch from the Isolated mode to the Scoped mode, delete the cell that has the %%configure magic from the notebook, and restart the kernel.

Configuring Spark Settings for Jupyter Notebooks

By default, the cluster-wide Spark configurations are used for Jupyter notebooks. You can specify the required Spark settings to configure the Spark application for a Jupyter notebook by using the %%configure magic.

You should specify the required configuration at the beginning of the notebook, before you run your first Spark-bound code cell.

If you want to specify the required configuration after running a Spark-bound command, use the -f option with the %%configure magic. If you use the -f option, all the progress made in the previous Spark jobs is lost.

The following sample codes show how to specify Spark configurations.

%%configure -f
{"executorMemory": "3072M", "executorCores": 4, "numExecutors":10}
%%configure -f
{ "driverMemory" : "20G", "conf" : { "spark.sql.files.ignoreMissingFiles": "true",
"spark.jars.packages": "graphframes:graphframes:0.7.0-spark2.4-s_2.11"}}

Note

The Spark drivers are created on the cluster worker nodes by default for better distribution of load and better usage of cluster resources. If you want to execute the Spark driver on the coordinator node, contact Qubole Support.

The following table lists the Spark configuration parameters with their values.

Parameters Description Values
jars Jars to be used in the session List of string
pyFiles Python files to be used in the session List of string
files Files to be used in the session List of string
driverMemory Amount of memory to be used for the driver process string
driverCores Number of cores to be used for the driver process int
executorMemory Amount of memory to be used for the executor process string
executorCores Number of cores to be used for the executor process int
numExecutors Number of executors to be launched for the session int
archives Archives to be used in the session List of string
queue Name of the YARN queue string
name Name of the session (name must be in lower case) string
conf Spark configuration properties (you can specify all other Spark configurations) Map of key=val
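
For example, the following cell is a minimal sketch that combines several of the parameters listed above; the values are illustrative only.

%%configure -f
{ "driverCores": 2, "numExecutors": 4, "executorMemory": "4G", "conf": { "spark.sql.shuffle.partitions": "200" } }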
Viewing Spark Application Status

You can view the status of a Spark Application that is created for the notebook in the status widget on the notebook panel. The widget also displays links to the Spark UI, Driver Logs, and Kernel Log. Additionally, you can view the progress of the Spark job when you run the code.

When you create a Jupyter notebook, the Spark application is not created. When you run any Spark bound command, the Spark application is created and started.
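
For example, running any Spark-bound cell, such as the following minimal sketch, creates and starts the Spark application:

# Minimal sketch: any Spark-bound command, such as this one, triggers the
# creation of the Spark application for the notebook.
spark.range(10).count()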

Viewing Spark Application Status

Click the down arrow next to the kernel in the right corner.

The widget displays the status of the Spark application as shown below.

_images/spark-app-status.gif

When the Spark Application Status is ready, you can access these logs to identify errors for debugging and troubleshooting. These links open as separate tabs in the JupyterLab interface.

Viewing Spark Job Progress

When you run code in the Jupyter notebook, you can see the progress of the Spark job at each cell.

You can view the details of that Spark job by clicking on the View Details hyperlink.

The following animated GIF shows a sample Spark Job progress.

_images/spark-job-progress.gif
Converting Zeppelin Notebooks to Jupyter Notebooks

You can upload the Zeppelin notebooks on the JupyterLab interface and convert them to Jupyter notebooks. The incompatible code elements are converted to markdown cells. The Z.* functions are not supported in Jupyter notebooks.

Note

You can upload Zeppelin notebooks that are smaller than 1 MB. If you want to upload notebooks larger than 1 MB, clear the outputs and try again.

Steps
  1. Navigate to the Notebooks page.

  2. Open the Zeppelin notebook that has to be converted.

  3. Click the Settings icon on the top right corner and select Export. The notebook is downloaded to your local storage.

  4. Navigate to the Jupyter page.

  5. From the File menu, select Upload Zeppelin Notebooks as shown below

    _images/migrate-notebook1.png
  6. Read the confirmation message, and click Upload.

    _images/migrate-notebook2.png
  7. Select the Zeppelin notebooks to be converted from the local storage. You can select multiple Zeppelin notebooks.

  8. Click Choose.

  9. After the upload operation is completed, click Dismiss.

    _images/migrate-notebook3.png

The uploaded notebook appears in the File Browser pane. Any spaces in the name of the Zeppelin notebook are removed.

Adding Packages to Jupyter Notebooks

The default environment is attached to the Spark cluster for the Jupyter notebooks. You might have to install some additional packages to execute Spark jobs in the Jupyter notebooks.

See Package Management for more information.

When required additional packages are missing, an error message with a link to Environments is displayed as shown below.

_images/pkg-mgmt1.png

You can either click the Environments link in the error message or navigate to Settings and select Environments.

The Environments page opens in a separate tab as shown below.

_images/pkg-mgmt2.png

Follow the steps mentioned in Adding a Python or R Package.

Changing the JupyterLab Interface Theme

By default, the theme of the JupyterLab interface is light.

To change the theme, navigate to the Settings menu and select JupyterLab Theme >> JupyterLab Dark.

The following figure shows a sample JupyterLab interface with the Dark theme.

_images/jupy-theme-gcp.png
Known Limitations

You should be aware of the following known limitations of Jupyter notebooks:

  • There is a 60-minute timeout for Spark applications to start, that is, to transition to the RUNNING state. For example, if all resources in the cluster are in use and a notebook is run that creates a new Spark application, the application must transition to the RUNNING state within 60 minutes. If it does not, the timeout expires, the Spark application is killed, and a timeout error is displayed in the notebook.
  • The following limitations are related to the %%sql magics, which use autovizwidgets to render various chart types.
    • Slow rendering - These charts are designed for interactive use only. When you click a chart type, the front end communicates with the backend through the same websocket that is used for cell executions; the backend generates the code, which is transferred back to the UI and rendered. This whole process is slow, especially if other cells are executing in the notebook.
    • Re-rendering - When a notebook is saved after execution, the rendered chart/table is not saved along with the notebook contents. Therefore, when a user re-opens the notebook later or shares it with others, these charts/tables are not rendered again; instead, an Error displaying widget: model not found message is displayed.

Currently, Jupyter notebooks are supported only on Spark 2.2 and later versions.

Dashboards

This section explains how to use Dashboards, which allow QDS users to share notebooks with other users who do not have to be signed up for a full QDS account. The following topics provide more information:

Introduction to Dashboards

Dashboards provide a means for a QDS account user (a dashboard publisher) to share a notebook with other users of the QDS account, and also with users (dashboard-only consumers) who are not otherwise members of the QDS account.

To add a dashboard-only user, see Adding a Dashboard-Only User below.

A dashboard is a report view (read-and-execute-only) of a notebook, and has the following characteristics:

  • Requires a running cluster, as with notebooks.
  • Must be created by the associated notebook owner, who must be a member of a QDS account.
  • Currently available from the QDS UI only (no API at present).
  • Includes the entire notebook (can’t be just a portion of it).
  • Includes only one notebook (can’t combine multiple notebooks or portions).
  • Is separate from the underlying notebook (the associated notebook owner and other users of the notebook can continue to develop the notebook over time without affecting the dashboard).
  • Can be used by any number of consumers concurrently.

Dashboards are organized in the QDS UI in the same way as notebooks, using the following main folders:

  • My Home
  • Common
  • Users

You can create any number of sub-folders under any of these folders. Associated notebook owners can make dashboards available to consumers by placing them in the Common folder.

Publishing Dashboards

Publishing a dashboard involves creating and sharing a report view (read-and-execute-only) of a notebook. The resulting dashboard can be run by users (consumers) who do not have to be owners of the QDS account. Once you have shared the dashboard, you can continue to develop the underlying notebook without affecting the dashboard. Conversely, consumers can change the dashboard parameter values, and re-run the dashboard and save the results, without affecting the notebook.

Note

Ensure that you have permissions to the Folder resource as described in Resources, Actions, and What they Mean.

If you as the associated notebook owner make changes to the notebook that you want to share, you can republish the dashboard, under either the same or a different name. If you use a different name, the original dashboard remains intact; otherwise it is overwritten.

If you delete the notebook, the dashboard will continue to exist, but it is best to delete it unless you have some good reason for keeping it, as it will quickly become out of date.

If you move the notebook to a new cluster, you will need to make any changes to the interpreters that are needed to make the dashboard run correctly, attach the interpreters to the notebook, and republish the dashboard.

Publishing a Dashboard

You must be an owner of a QDS account or a user with the Create permission to publish a dashboard.

Note

Ensure that you have permissions to the Folder resource as described in Resources, Actions, and What they Mean.

Proceed as follows:

  1. From the main menu of the QDS UI, navigate to Notebooks.

  2. Click on a notebook to open it.

  3. Click the Dashboards icon near the top right of the screen and you can see Create Dashboard as shown here.

    _images/DashboardIconClick.png
  4. Click Create Dashboard or the + sign. A dialog appears as shown here.

    _images/CreateDashboard.png
  5. Provide a name for the dashboard and choose a location from the dropdown (browse to any location using the location picker). The location can be:

    • My Home (your own folder)
    • Users (choose a user from the dropdown)
    • Common

    Note

    To create a dashboard in a folder other than My Home, you must have write permission to that folder.

  6. Optionally add a description.

  7. Select Schedule Dashboard if you want to refresh the dashboards at regular intervals. The Interval field gets enabled once you select Schedule Dashboard as shown here.

    _images/CreateDashboard1.png

    The default time interval displayed is based on the Maximum Instances per day set for a specific account in the account’s limits. To change it, select any value from the Interval drop-down list.

Note

For a scheduled Spark dashboard, the associated interpreters are stopped after the completion of the scheduled run.

  8. If you selected Schedule Dashboard and if you want to subscribe to emails that contain the dashboard report at every dashboard refresh, perform the following steps:

    1. Select the Send as Email option.
    2. Select the appropriate format from the Attachment Format drop-down list. By default, HTML is selected.
    3. Enter the email address and click Save.
  9. The new dashboard appears on the right panel of the screen as shown here.

    _images/CreatedDashboards.png

You can just click the dashboard or navigate to Dashboards to see the dashboard that you have published as shown here.

_images/DashboardwithSuccessfulResult.png
Understanding the Different Dashboard Modes

By default, the dashboard is in Read-only mode (compute is DOWN) or in Editing mode (compute is UP).

The associated notebook owner can see the dashboard in the Editing mode if the compute is UP. Pull down the Editing mode drop-down list to see Enable Interactive Mode, which takes you into a personalized view. The Interactive Mode lets you create a personalized view in which you (or any user) can change the paragraph parameters (if any) and see a different result from the actual dashboard.

You (or any user) must have the Execute permission to enter the Interactive Mode.

Managing Dashboard Permissions

Only the associated notebook owner or a user with the Manage permission can set the Dashboard permissions described below.

To set permissions, click the gear icon and you can see the various options as shown here:

_images/DashboardActions.png

Choose Manage permissions, then choose a user or group from the drop-down. By default, each user who is a system-admin of a QDS account, and each group comprising such members, has all capabilities (read, update, delete, execute, and manage); keep in mind that these relate to the dashboard, not to the underlying notebook. You can adjust the permissions as you see fit. Here is an example of how permissions are set for two different users.

_images/ManageDBPermissions.png

You can assign permission to a user or a group. Click Add Permission to assign permission to another user/group. When selecting a user/group in the drop-down list, a user is displayed as <username> (<email-address>) and a group is displayed as just the <group name>.

Configuring Dashboards

You (as the associated notebook owner) or a user who has the Update permission can configure dashboards. Click the gear icon against the dashboard in the left panel or in the top-left corner of the dashboard. From the options displayed, click Configure Dashboard. The Configure Dashboard dialog is displayed as shown here.

_images/CreateDashboard.png

You can edit the periodic-refresh interval and any other fields except the source. You can also edit the dashboard name by clicking its name in the header as shown here.

_images/DBNameHeaderEdit.png

You can optionally subscribe to emails that contain the dashboard report at every dashboard refresh by selecting the Send as Email option as shown below. The report can be in one of the following formats: PDF, PNG, or HTML.

_images/subscribe-email.png
Changing the Dashboard Color Theme

A QDS user with Update permission can change the theme of the dashboard. The half black-and-white circle icon in the top-left of the dashboard lets you do that. The icon is as illustrated here.

_images/DBThemeChange.png

Click the theme icon to see the different themes. Choose the one that you want to use, and the new theme is immediately applied to the dashboard.

Creating and Managing Dashboard Folders

You can create only single-level folders in the Dashboards UI to organize the dashboards.

In the left-side panel of Dashboards, there is a new-folder icon at the top. Click it to create a new folder. The dialog to create a folder is displayed as shown here.

_images/DashboardFolder.png

Add a name for the folder. By default, the location is Users/<user-emailaddress>. Change it to a different location if you want a non-default location.

After you create the folder, you can refresh, rename, move, or delete it. To do so, click the gear icon against that folder and select the required operation. The dashboard folder operations are described below:

  • To refresh a folder, click the gear icon against the dashboard folder that you want to refresh. Click Refresh from the drop-down list.

  • To rename a folder, click the gear icon against the dashboard folder that you want to rename. Click Rename from the drop-down list. The dialog is displayed as shown here.

    _images/RenameDBFolder.png

    Add a new name to the folder and click Rename.

  • To move a folder, click the gear icon against the dashboard folder that you want to move. Click Move from the drop-down list. The dialog is displayed as shown here.

    _images/MoveDBFolder.png

    Add a path to the folder in Destination or browse to the new location and click Move.

  • To delete a folder, click the gear icon against the dashboard folder that you want to delete. Click Delete from the drop-down list.

    A dialog is displayed that asks for confirmation for deleting the folder. Click OK to delete it.

Adding a Dashboard-Only User

To add a dashboard-only user (a user who is limited to using dashboards as a consumer), proceed as follows:

  1. In the QDS UI, navigate to the Control Panel and choose Manage Users.
  2. On the resulting screen, click on the person icon near the top right.
  3. On the resulting screen, enter the user’s email address in the User Email(s) field.
  4. Click in the Groups field and choose dashboard-user from the dropdown.

You will see a message that an invitation has been sent to that user. Once the user has received the invitation and followed the instructions in it, he or she can use dashboards as described here.

Using Dashboards as a Consumer

A dashboard-only consumer is someone who is not a full member of a QDS account, but is allowed to use dashboards. Full QDS-account members can also use dashboards without restrictions.

As a dashboard-only consumer you can do the following with the dashboards in your folder:

  • Run a dashboard as published (the base view); OR
  • Change the values of the dashboard parameters, if any (for example, change the date or customer set) and run the dashboard with those values.

You can also see and run dashboards in other users’ folders (under Users), or in the Common folder, if you have been granted permission by the associated notebook owner.

As a dashboard-only user you cannot:

  • Create a dashboard.
  • Publish a dashboard.
Using a Dashboard

Proceed as follows:

  1. From the main menu of the QDS UI, navigate to Dashboards.

  2. Choose a dashboard.

  3. If necessary, click Enable Interactive Mode. This starts the cluster that runs the underlying notebook.

    • In interactive mode, you are insulated from changes by other users of the dashboard, or by the associated notebook owner. The dashboard will reflect only the changes that you make.
    • In non-interactive mode, the dashboard will reflect any changes made by other users (to the dashboard or to the underlying notebook) while you are working.
  4. Run the dashboard. Either:

    • Run it as is (the base view); OR
    • Change the value of some or all parameters (if any) and run the dashboard.

    You can export the output as a PDF.

Viewing the Source Notebook from Dashboard

As a notebook owner or a consumer, you can view the source notebook that is associated with the dashboard.

Steps
  1. From the Home menu, click Dashboards. The Dashboards page is displayed.

  2. On the left navigation pane, identify the dashboard for which you want to view the source notebook.

  3. Right-click on the required dashboard or click on the Settings icon next to the dashboard, and select View Notebook from the menu as shown below.

    _images/view-notebook.png

The source notebook is opened in the Notebooks page, in a separate tab.

Viewing All Dashboards

If you have the Update permission, you can view all the dashboards associated with the notebook.

Steps
  1. From the Home menu, click Notebooks. The Notebooks page is displayed.

  2. On the left navigation pane, select the required notebook.

  3. On the top right corner, click on the Manage notebook dashboard(s) icon as shown below.

    _images/manage-dashboard.png

The associated dashboards are displayed on the Dashboards pane on the right side as shown below.

_images/dashboard-window.png
Searching Dashboards

You can search and view a list of scheduled or non-scheduled dashboards from the list of dashboards.

  1. Navigate to the Dashboards UI.

  2. Click on the search icon on the left pane. The filter fields are displayed as shown in the following figure.

    _images/dash-filter.png
  3. From the Scheduled filter, select the appropriate value.

  4. Specify any other filters if required.

  5. Click Apply.

Emailing a Dashboard

You can email a dashboard as an attachment in PDF, PNG, and HTML formats. You can also subscribe to emails that contain the dashboard report at every dashboard refresh for scheduled dashboards.

Note

Subscribing to emails that contain the dashboard report at every dashboard refresh is not enabled for all users by default. Create a ticket with Qubole Support to enable this feature on the QDS account.

Note

To email a dashboard, you must have the create permission to the Commands resource as described in Resources, Actions, and What they Mean.

  1. In the Dashboard page, click on the gear icon as shown in the following figure.

    _images/dash-settings.png
  2. Select Email as attachment.

  3. In the Email Dashboard dialog box, select the required format from the drop-down list. By default, PDF is selected.

  4. Enter the email address. If you want to send the attachment to multiple recipients, then add comma separated email addresses.

  5. Click Send.

The following figure shows a sample Email Dashboard dialog box.

_images/email-dashboard.png

Note

If a dashboard fails to render within 3 minutes, then the email option fails.

You can also email dashboards as attachments by using the command API. See Submit a Notebook Convert Command.

Subscribing to Dashboard Report

For scheduled dashboards, you can optionally subscribe to emails that contain the dashboard report at every dashboard refresh. The report can be in one of the following formats: PDF, PNG, or HTML.

  1. On the Dashboards page, select the gear icon against the dashboard in the left panel or in the top-left corner of the dashboard.

  2. Select Configure Dashboard from the menu. The Configure Dashboard dialog box is displayed as shown below.

    _images/subscribe-email.png
  3. Ensure that the Schedule Dashboard option is selected.

  4. Select the Send as Email option.

  5. Select the appropriate format from the Attachment Format drop-down list. By default, HTML is selected.

  6. Enter the email address and click Save.

Downloading a Dashboard

You can download a dashboard in PDF, PNG, and HTML formats.

Note

To download a dashboard, you must have the create permission to the Commands resource as described in Resources, Actions, and What they Mean.

  1. In the Dashboards page, click on the Settings icon as shown in the following figure.

    _images/dash-settings1.png
  2. Select Download As.

  3. In the Download Dashboard As dialog box, select the required format from the drop-down list. By default, PDF is selected.

  4. Click Download.

The following figure shows a sample Download Dashboard As dialog box.

_images/download-dashboard.png

Note

If a dashboard fails to render within 3 minutes, then the download option fails.

You can also download dashboards by using the command API. See Submit a Notebook Convert Command

Package Management

The Package Management feature of Qubole enables you to create and manage environments.

Note

Package Management is a Beta feature. Package management is enabled by default for new accounts. Existing account users must create a ticket with Qubole Support to enable package management.

Package Management provides the Environments UI with the following capabilities:

  • Create an environment with Python and R version selection.
  • An environment loaded with default Anaconda packages.
  • An environment loaded with the CRAN package repo, which you can use to install any R packages.
  • Distributed installation of packages in a running Spark application or Airflow workflow.

The Environments API reference provides a list of APIs for creating, editing, cloning, and viewing an environment, and for attaching a cluster to an environment.

Starting from R59, Qubole provides a new Package Management UI, Environments, with new features.

Note

The new Package management UI is not enabled by default. Create a ticket with Qubole Support to enable the new Package Management UI.

Using the new Package Management UI

Qubole has redesigned the Package Management UI, called Environments, with certain new features.

Note

The new Package management UI is not enabled by default. Create a ticket with Qubole Support to enable the new Package Management UI.

Navigate to Control Panel >> Environments page to launch the new Package Management UI.

You can manage Python and R packages in Spark applications. QDS automatically attaches an environment with Python version 3.7 to an Airflow 1.8.2 cluster.

Note

Package management with Python 3.5 or 3.7 is supported on Airflow clusters.

From the new Package Management UI, you can perform the following tasks:

Creating an Environment
  1. Navigate to the new Environments page.

  2. Click on the +New button on the left navigation pane. The create environment dialog appears as shown below.

    _images/create-env-new.png
  3. Enter a name and description for the environment in the respective fields.

  4. Select the appropriate Python Version and R Version from the drop-down lists.

  5. Click Add. A new environment is created and displayed as shown below.

    _images/sample-env1.png

A newly created environment by default contains the Anaconda distribution of R and Python packages and a list of pre-installed Python and R packages. For more information about viewing packages, see Viewing Packages.

Attaching a Cluster to an Environment

You can attach environments to Spark and Airflow clusters. For Spark clusters, a Conda virtual environment is created for Python and R environments.

You can attach an environment only to a cluster that is down. You can attach only one cluster to an environment.

Click on the Cluster drop-down list on the top-right corner of the Environments page, and select the appropriate cluster.

_images/cluster-attach.png

If you want to detach a cluster, select Detach Cluster from the Cluster drop-down list.

In the Spark cluster, Python and R Conda environments are located in /usr/lib/envs/ (existing package management) or in /usr/lib/environs/ (new package management). The spark.pyspark.python configuration in /usr/lib/spark/conf/spark-defaults.conf points to the Python version installed in the Conda virtual environment for a Spark cluster.
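
As a minimal sketch (assuming a PySpark notebook or paragraph running on a cluster attached to an environment), you can confirm which Python interpreter the driver is using:

# Minimal sketch: print the Python interpreter and version used by the driver,
# which should point into the Conda virtual environment described above.
import sys
print(sys.executable)
print(sys.version)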

In a Spark notebook associated with a cluster attached to the package management environment, set these configuration parameters in its interpreter settings to point to the virtual environment:

  • Set zeppelin.R.cmd to cluster_env_default_r
  • Set zeppelin.pyspark.python to cluster_env_default_py

You can also attach an environment with Python 3.7 to a cluster running Airflow 1.8.2; any previous environment must be detached from the cluster first. For more information, see Configuring an Airflow Cluster.

If you detach a Python 3.7 environment from an Airflow 1.8.2 cluster, you must attach another Python 3.7 environment or the cluster will not start.

Adding Python Packages

You can add Python packages either from Conda or PyPI.

  1. In the environment, select the Python tab, and click + Add Python Package tab.

    The Add Package(s) dialog appears as shown below.

    _images/add-py-pkg.png
  2. Select Conda or PyPI as the Package repo.

  3. Select the input mode as Simple or Advanced.

    The Simple mode is the default input mode. Add the name of the package in the Name field. As you type the name, an autocomplete list appears and you can select the package name. The version is an optional field and it can be incremental as shown below. If you mention just the package name, the latest version of the package is installed.

    _images/add-py-simple-new.png

    Note

    If you upgrade or downgrade a Python package, the changed version is reflected only after you restart the Spark interpreter. Interpreter Operations lists the restart and other Spark interpreter operations.

    In the Advanced mode, the autocomplete list does not appear. Enter the package name. You can add multiple package names as a comma-separated list. You can also mention a specific version of the package, for example, numpy==1.1. To downgrade, just mention the version number to which you want to downgrade. If you mention just the package name, the latest version of the package is installed.

    _images/add-py-adv-new.png
  4. Click Add. The packages are marked as Pending Packages.

  5. Start the cluster for the installation to complete. After the installation is complete, the packages are listed in the Name column, with User shown in the Installed By column. If the cluster is not attached to the environment, attach a cluster and start the cluster to complete the installation.

Uploading Egg or Wheel Packages

You can add Egg or Wheel packages in the Python Conda environment.

Note

This feature is not enabled by default. Create a ticket with Qubole Support to enable this feature.

  1. In the environment, select the Python tab.

  2. Click +Upload Egg/Wheel.

    The Upload Package dialog appears as shown below.

    _images/egg-wheel.png
  3. Select Egg or Wheel from the Upload Type drop-down list.

  4. Perform one of the following steps:

    • Depending on whether the package is on object storage or local storage, enter the object storage path or the local storage path.
    • Click to browse for the file, or drag and drop the file.
  5. Click Upload. The packages are marked as Pending Packages.

  6. Start the cluster for the installation to complete. After the installation is complete, the packages are listed in the Name column, with User Package Dependency shown in the Installed By column. If the cluster is not attached to the environment, attach a cluster and start the cluster to complete the installation.

Adding R Packages

You can add R packages either from Conda or CRAN.

  1. In the environment, select the R tab, and click + Add R Package tab.

    The Add Package(s) dialog appears as shown below.

    _images/add-r-pkg.png
  2. Select Conda or CRAN as the Package repo.

  3. Select the input mode as Simple or Advanced.

    The Simple mode is the default input mode. Add the name of the package in the Name field. As you type the name, an autocomplete list appears and you can select the package name. The version is an optional field and it can be incremental as shown below. If you mention just the package name, the latest version of the package is installed.

    _images/add-r-pkg-simple.png

    In the Advanced mode, the autocomplete list does not appear. Enter the package name. You can add multiple package names as a comma-separated list. You can also mention a specific version of the package, for example, r.dbi==2.3.2. To downgrade, just mention the version number to which you want to downgrade. If you mention just the package name, the latest version of the package is installed.

    _images/add-r-pkg-adv.png
  4. Click Add. The packages are marked as Pending Packages.

  5. Start the cluster for the installation to complete. After the installation is complete, the packages are listed in the Name column, with User shown in the Installed By column. If the cluster is not attached to the environment, attach a cluster and start the cluster to complete the installation.

Viewing Packages

You can view the user installed packages, system installed packages, and user package dependencies.

  1. Select Python or R tab.

  2. Click the filter icon in the Installed By column as shown below.

    _images/view-pkg.png
  3. Select the type of packages you want to view, and click OK.

    The list of packages is displayed as shown below.

    _images/view-pkg-list.png
Modifying Channels

You can modify the channels for Conda, PyPI, and CRAN packages to add custom channels, and install the packages from these custom channels.

Note

This feature is not enabled by default. Create a ticket with Qubole Support to enable this feature.

  1. Select Python or R tab.

  2. Click Modify Channels.

    The Modify Channels dialog is displayed as shown below.

    _images/modify-channels.png
  3. Select Conda, PyPI, or CRAN from the Channel Type drop-down list.

  4. Enter the names of the new channels in the Channel Priority List.

    The leftmost channel has the highest priority and the rightmost channel has the lowest priority.

  5. Click Add.

Viewing Activity History

You can view the activity history for each environment.

  1. Select Python or R tab.

  2. Click View Activity History.

    The Activity History is displayed as shown below.

    _images/activity-history.png
  3. Expand the package to view details.

  4. To restore an environment to a previous successful state, click Restore.

  5. To view logs, click Logs.

Updating the Packages

You can update the installed packages by performing the following steps:

  1. Select Python or R tab.

  2. Select the required package from the table as shown below.

    _images/update-pkg.png
  3. Click Update Selected Packages. The Update Package(s) dialog appears as shown below.

    _images/update-pkg1.png
  4. Select the appropriate Package Repo and enter the version details.

  5. Click Update.

Editing an Environment

You can edit the name and description of an environment.

  1. From the left navigation pane, hover the mouse on the required environment.

  2. Click on Gear (settings icon) as shown below.

    _images/EnvironSettings.png
  3. Click Edit from the menu.

    The edit environment dialog appears as shown below.

    _images/edit-env-newui.png
  4. Edit the values and click Save.

Cloning an Environment

When you want to use the same environment on a different cluster, clone it and attach it to that cluster. You can attach an environment to only one cluster.

  1. From the left navigation pane, hover the mouse on the required environment.

  2. Click on Gear (settings icon) as shown below.

    _images/EnvironSettings.png
  3. Click Clone from the menu.

    The clone environment dialog appears as shown below.

    _images/clone-pkg-newui.png

    By default, the suffix -clone is added to the name in the Name field, that is, <environment name>-clone. You can retain that name or change it. You can also change the description. You cannot change application versions.

  4. Click Clone.

Using the Default Package Management UI

You can use the Environments page in the Control Panel of the QDS UI to manage Python and R packages in Spark applications. In addition, QDS automatically attaches an environment with Python version 3.7 to an Airflow 1.8.2 cluster.

Note

Package management with Python 3.5 or 3.7 is supported on Airflow clusters.

The following table lists the supported Python and R versions in the existing and new package management.

Users Package Management Supported Python Versions Supported R Versions
Existing users Existing package management 2.7 and 3.5 3.3
Existing users New package management 2.7 and 3.7 3.5
New users New package management 2.7 and 3.7 3.5

Use the Environments tab for:

Creating an Environment

Navigate to the Environments page in the Control Panel and choose New to create a new environment. The following dialog appears:

To create an environment, perform these steps:

  1. Name the environment.
  2. Provide a description for the environment.
  3. Select the Python Version and R Version from the drop-down menus in the dialog that pops up. See QDS Components: Supported Versions and Cloud Platforms for more information about versions.
_images/Environment.png
  4. Click Create. A new environment is created and displayed:
_images/NewEnv.png

A newly created environment by default contains the Anaconda distribution of R and Python packages and a list of pre-installed Python and R packages. Click See list of pre-installed packages. See also Viewing the List of Pre-installed Python and R Packages.

You can also edit or clone an environment, as described under Editing an Environment and Cloning an Environment.

Attaching a Cluster to an Environment

Click Edit against Cluster Attached to attach an environment to a cluster. After you click Edit, you can see a drop-down list of available Spark clusters or Airflow clusters, for example:

_images/AttachClustertoEnv.png

You can attach an environment only to a cluster that is down. You can attach only one cluster to an environment.

Select the cluster that you want to attach to the environment and click Attach Cluster.

You can attach environments to Spark clusters. A Conda virtual environment gets created for Python and R environments. In the Spark cluster, Python and R Conda environments are located in /usr/lib/envs/ (existing package management) or in /usr/lib/environs/ (new package management). The spark.pyspark.python configuration in /usr/lib/spark/conf/spark-defaults.conf points to the Python version installed in the Conda virtual environment for a Spark cluster.

In a Spark notebook associated with a cluster attached to the package management environment, configure these in its interpreter settings to point to the virtual environment:

  • Set zeppelin.R.cmd to cluster_env_default_r
  • Set zeppelin.pyspark.python to cluster_env_default_py

You can also attach an environment with Python 3.7 to a cluster running Airflow 1.8.2; any previous environment must be detached from the cluster first. For more information, see Configuring an Airflow Cluster.

To detach a cluster from an environment, click the Delete icon next to the cluster ID. The cluster must be down. If you detach a Python 3.7 environment from an Airflow 1.8.2 cluster, you must attach another Python 3.7 environment or the cluster will not start.

Adding a Python or R Package

A newly created environment contains the Anaconda distribution of R and Python packages by default. An environment also supports the conda-forge channel which supports more packages. Questions about Package Management provides answers to questions related to adding packages.

Note

You can install Python packages either from Conda or PyPI from the Python Package Repo drop-down list for faster installation of packages. This feature is available by default for users of the new accounts. Users of the older accounts should contact Qubole Support to enable this feature. If you have restrictive egress rules on your cluster, you must allow the following repositories to use the full package management:

  • By default: https://repo.continuum.io/, https://conda.anaconda.org/, https://pypi.org/
  • For CRAN packages: https://cran.r-project.org/, http://cran.cnr.berkeley.edu/ (or allow redirection from http://cran.us.r-project.org) , http://cran.us.r-project.org/

To add Python or R packages, click Add against Packages in a specific environment. The Add Packages dialog appears as shown here for the new accounts.

_images/AddPackages1.png

The Add Packages dialog appears as shown here for the old accounts.

_images/AddPackages.png

Perform these steps:

  1. For new accounts, Conda Environment shows Python and Python Package Repo shows Conda by default. You can change Python Package Repo to PyPI based on packages to be installed. Select the appropriate packages.

    For old accounts, by default, Conda Environment shows Python. You can choose R Packages as the source from the list to install an R package.

  2. Adding source supports two input modes: Simple and Advanced. The Simple mode is the default input mode; add the name of the package in the Name field.

    As you type the name, an autocomplete list appears from which you can select the package name. The version is optional and can be incremental, as shown here.

    _images/SimpleModePackage.png _images/SimpleModePackage2.png

    If you just mention the package name, then the latest version of the package is installed.

    Note

    If you upgrade or downgrade a Python package, the changed version is reflected only after you restart the Spark interpreter. Interpreter Operations lists the restart and other Spark interpreter operations.

    If you choose the Advanced mode, it shows suggestions and as you start typing the package name, you can see the autocomplete list as shown here.

    _images/AutoCompletePackage.png

    In the Advanced mode, you can add multiple names of packages as a comma-separated list. You can also mention a specific version of the package, for example, numpy==1.1. For downgrading, you can just mention the version number to which you want to downgrade. If you just mention the package name, then the latest version of the package is installed.

    Qubole supports adding an R package from the CRAN package repo. This feature enhancement is not available by default. Create a ticket with Qubole Support to enable this feature on a QDS account. Qubole allows you to add an R package from the CRAN package repo only in the Advanced Mode. To add an R package from the CRAN package repo, follow these steps:

    1. Click Add Package.
    2. Select R Packages as the Source.
    3. In the CRAN Packages field, you can enter comma-separated R package names. You can also simultaneously install packages from the Conda Packages. The Conda Packages as well as CRAN Packages text fields accept a comma-separated list of packages.

    Here is an example of the UI dialog to add R Packages in the Advanced Mode.

    _images/CranPackage-R.png

    After adding Python or R package, click Add. The status of the package is shown as Pending. You must start the cluster for the installation to complete. After the installation is complete, the status is shown as Installed.

    The following figure shows a sample package with packages in Pending state.

    _images/Package-Initial.png

Note

After you install a package, you cannot remove it from your environment. However, you can delete failed packages by clicking the corresponding Delete icon.

Editing an Environment

You can edit an existing environment. In the left-navigation bar, a Gear (settings) icon appears when you hover the mouse over a specific environment. Click the icon to see these options.

_images/EnvironSettings.png

Click Edit and you can see the dialog as shown here.

_images/EditEnv.png

You can edit the name and description of an environment. After changing the name and/or description, click Edit. You can click Cancel if you do not want to edit the environment.

Cloning an Environment

When you want to use the same environment on a different cluster, clone it and attach it to that cluster. (An environment can be attached to only one cluster.) In the left-navigation bar, a Gear (settings) icon appears when you hover the mouse over a specific environment. Click the icon to see these options.

_images/EnvironSettings.png

Click Clone and you can see the dialog as shown here.

_images/CloneEnv.png

By default, the suffix -clone is added to the name in the Name field, that is, <environment name>-clone. You can retain that name or change it. You can also change the description. You cannot change application versions. After making the changes, click Clone. You can click Cancel if you do not want to clone the environment.

Managing Permissions of an Environment

Here, you can set permissions for an environment. By default, all users in a Qubole account have read access on the environment, but you can change that access. You can override the environment access that is granted at the account level in the Control Panel. If you are part of the system-admin group or any group that has full access to the Environments and Packages resource, then you can manage permissions. For more information, see Managing Roles.

A system-admin and the owner can manage the permissions of an environment by default. Perform the following steps to manage an environment's permissions:

  1. Click the gear icon next to the environment and click Manage Permissions from the list of options (as displayed here).

    _images/EnvironSettings.png
  2. The dialog to manage permissions for a specific environment is displayed as shown in the following figure.

    _images/ManagePerm-PM.png
  3. You can set the following environment-level permissions for a user or a group:

    • Read: Set it if you want to change a user/group’s read access to this specific environment.
    • Update: Set it if you want a user/group to have write privileges for this specific environment.
    • Delete: Set it if you want a user/group to be able to delete this specific environment.
    • Manage: Set it if you want a user/group to grant and manage access to other users/groups for accessing this specific environment.
  4. You can add any number of permissions to the environment by clicking Add Permission.

  5. You can click the delete icon against a permission to delete it.

  6. Click Save to set permissions for the user/group. Click Cancel to go back to the previous tab.

Deleting an Environment

You can delete an environment. In the left-navigation bar, a Gear (settings) icon appears when you hover the mouse over a specific environment. Click the icon to see these options.

_images/EnvironSettings.png

Click Delete to remove the environment.

Migrating Existing Interpreters to use the Package Management

Even after attaching a Spark cluster to an environment, existing Spark interpreters in the notebook keep using the system/virtualenv Python and system R. To use the environment, change Python and R interpreter property values in the existing interpreter to use Anaconda-specific Python and R. Change these interpreter property values:

  • Set zeppelin.R.cmd to cluster_env_default_r.
  • Set zeppelin.pyspark.python to cluster_env_default_py.

The interpreter automatically restarts after its properties change.

However, a new Spark cluster (not a cloned cluster) that is attached to an environment contains the default Spark interpreter set to the Anaconda-specific Python and R, that is, cluster_env_default_py and cluster_env_default_r. Similarly, a new interpreter on an existing cluster uses the Anaconda-specific Python and R.

Note

After a cluster is detached from an environment, the Spark interpreter (existing or new) falls back to system/virtualenv Python and system R.

Viewing the List of Pre-installed Python and R Packages

A newly created environment contains Anaconda distribution of R and Python packages by default and a list of pre-installed Python and R packages. Click See list of pre-installed packages. For more information, see Using the Default Package Management UI. By default, the list displays Python Packages. Click the R Packages tab to see that list. The pre-installed packages are separately listed in:

List of Pre-installed Python Packages

Packages are available in the environment at /usr/lib/envs/<env-ID-and-version-details>.
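
The listing below follows the format produced by conda list. As a minimal sketch (assuming conda is available on the cluster node, and using the sample environment path from the listing as the prefix), you could generate a similar listing from a notebook cell:

# Minimal sketch, assuming conda is available on the cluster node.
# The prefix below is the sample path from the listing that follows;
# substitute your own <env-ID-and-version-details>.
import subprocess
env_path = "/usr/lib/environs/e-a-2019.03-py-2.7.15"
print(subprocess.check_output(["conda", "list", "-p", env_path]).decode())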

# packages in environment at /usr/lib/environs/e-a-2019.03-py-2.7.15:
#
# Name                    Version                   Build  Channel
_ipyw_jlab_nb_ext_conf    0.1.0                    py27_0
alabaster                 0.7.12                   py27_0
anaconda-client           1.7.2                    py27_0
anaconda-navigator        1.9.7                    py27_0
anaconda-project          0.8.2                    py27_0
asn1crypto                0.24.0                   py27_0
astroid                   1.6.5                    py27_0
astropy                   2.0.9            py27hdd07704_0
atomicwrites              1.3.0                    py27_1
attrs                     19.1.0                   py27_1
babel                     2.6.0                    py27_0
backports                 1.0                      py27_1
backports.functools_lru_cache 1.5                      py27_1
backports.os              0.1.1                    py27_0
backports.shutil_get_terminal_size 1.0.0                    py27_2
backports_abc             0.5              py27h7b3c97b_0
beautifulsoup4            4.7.1                    py27_1
bitarray                  0.8.3            py27h14c3975_0
bkcharts                  0.2              py27h241ae91_0
blas                      1.0                         mkl
bleach                    3.1.0                    py27_0
blosc                     1.15.0               hd408876_0
bokeh                     1.0.4                    py27_0
boto                      2.49.0                   py27_0
bottleneck                1.2.1            py27h035aef0_1
bzip2                     1.0.6                h14c3975_5
ca-certificates           2019.1.23                     0
cairo                     1.14.12              h8948797_3
cdecimal                  2.3              py27h14c3975_3
certifi                   2019.3.9                 py27_0
cffi                      1.12.2           py27h2e261b9_1
chardet                   3.0.4                    py27_1
click                     7.0                      py27_0
cloudpickle               0.8.0                    py27_0
clyent                    1.2.2                    py27_1
colorama                  0.4.1                    py27_0
conda-verify              3.1.1                    py27_0
configparser              3.7.3                    py27_1
contextlib2               0.5.5            py27hbf4c468_0
cryptography              2.6.1            py27h1ba5d50_0
curl                      7.64.0               hbc83047_2
cycler                    0.10.0           py27hc7354d3_0
cython                    0.29.6           py27he6710b0_0
cytoolz                   0.9.0.1          py27h14c3975_1
dask                      1.1.4                    py27_1
dask-core                 1.1.4                    py27_1
dbus                      1.13.6               h746ee38_0
decorator                 4.4.0                    py27_1
defusedxml                0.5.0                    py27_1
distributed               1.26.0                   py27_1
docutils                  0.14             py27hae222c1_0
entrypoints               0.3                      py27_0
enum34                    1.1.6                    py27_1
et_xmlfile                1.0.1            py27h75840f5_0
expat                     2.2.6                he6710b0_0
fastcache                 1.0.2            py27h14c3975_2
filelock                  3.0.10                   py27_0
flask                     1.0.2                    py27_1
fontconfig                2.13.0               h9420a91_0
freetype                  2.9.1                h8a8886c_1
fribidi                   1.0.5                h7b6447c_0
funcsigs                  1.0.2            py27h83f16ab_0
functools32               3.2.3.2                  py27_1
future                    0.17.1                   py27_0
futures                   3.2.0                    py27_0
get_terminal_size         1.0.0                haa9412d_0
gevent                    1.4.0            py27h7b6447c_0
glib                      2.56.2               hd408876_0
glob2                     0.6                      py27_1
gmp                       6.1.2                h6c8ec71_1
gmpy2                     2.0.8            py27h10f8cd9_2
graphite2                 1.3.13               h23475e2_0
greenlet                  0.4.15           py27h7b6447c_0
grin                      1.2.1                    py27_4
gst-plugins-base          1.14.0               hbbd80ab_1
gstreamer                 1.14.0               hb453b48_1
h5py                      2.9.0            py27h7918eee_0
harfbuzz                  1.8.8                hffaf4a1_0
hdf5                      1.10.4               hb1b8bf9_0
heapdict                  1.0.0                    py27_2
html5lib                  1.0.1                    py27_0
icu                       58.2                 h9c2bf20_1
idna                      2.8                      py27_0
imageio                   2.5.0                    py27_0
imagesize                 1.1.0                    py27_0
importlib_metadata        0.8                      py27_0
intel-openmp              2019.3                      199
ipaddress                 1.0.22                   py27_0
ipykernel                 4.10.0                   py27_0
ipython                   5.8.0                    py27_0
ipython_genutils          0.2.0            py27h89fb69b_0
ipywidgets                7.4.2                    py27_0
isort                     4.3.16                   py27_0
itsdangerous              1.1.0                    py27_0
jbig                      2.1                  hdba287a_0
jdcal                     1.4                      py27_0
jedi                      0.13.3                   py27_0
jinja2                    2.10                     py27_0
jpeg                      9b                   h024ee3a_2
jsonschema                3.0.1                    py27_0
jupyter                   1.0.0                    py27_7
jupyter_client            5.2.4                    py27_0
jupyter_console           5.2.0                    py27_1
jupyter_core              4.4.0                    py27_0
jupyterlab                0.33.11                  py27_0
jupyterlab_launcher       0.11.2           py27h28b3542_0
kiwisolver                1.0.1            py27hf484d3e_0
krb5                      1.16.1               h173b8e3_7
lazy-object-proxy         1.3.1            py27h14c3975_2
libarchive                3.3.3                h5d8350f_5
libcurl                   7.64.0               h20c2e04_2
libedit                   3.1.20181209         hc058e9b_0
libffi                    3.2.1                hd88cf55_4
libgcc-ng                 8.2.0                hdf63c60_1
libgfortran-ng            7.3.0                hdf63c60_0
liblief                   0.9.0                h7725739_2
libpng                    1.6.36               hbc83047_0
libsodium                 1.0.16               h1bed415_0
libssh2                   1.8.0                h1ba5d50_4
libstdcxx-ng              8.2.0                hdf63c60_1
libtiff                   4.0.10               h2733197_2
libtool                   2.4.6                h7b6447c_5
libuuid                   1.0.3                h1bed415_2
libxcb                    1.13                 h1bed415_1
libxml2                   2.9.9                he19cac6_0
libxslt                   1.1.33               h7d1a2b0_0
linecache2                1.0.0                    py27_0
llvmlite                  0.28.0           py27hd408876_0
locket                    0.2.0            py27h73929a2_1
lxml                      4.3.2            py27hefd8a0e_0
lz4-c                     1.8.1.2              h14c3975_0
lzo                       2.10                 h49e0be7_2
markupsafe                1.1.1            py27h7b6447c_0
matplotlib                2.2.3            py27hb69df0a_0
mccabe                    0.6.1                    py27_1
mistune                   0.8.4            py27h7b6447c_0
mkl                       2019.3                      199
mkl-service               1.1.2            py27he904b0f_5
mkl_fft                   1.0.10           py27ha843d7b_0
mkl_random                1.0.2            py27hd81dba3_0
more-itertools            5.0.0                    py27_0
mpc                       1.1.0                h10f8cd9_1
mpfr                      4.0.1                hdf1c602_3
mpmath                    1.1.0                    py27_0
msgpack-python            0.6.1            py27hfd86e86_1
multipledispatch          0.6.0                    py27_0
navigator-updater         0.2.1                    py27_0
nbconvert                 5.4.1                    py27_3
nbformat                  4.4.0            py27hed7f2b2_0
ncurses                   6.1                  he6710b0_1
networkx                  2.2                      py27_1
nltk                      3.4                      py27_1
nose                      1.3.7                    py27_2
notebook                  5.7.8                    py27_0
numba                     0.43.1           py27h962f231_0
numexpr                   2.6.9            py27h9e4a6bb_0
numpy                     1.16.2           py27h7e9f1db_0
numpy-base                1.16.2           py27hde5b4d6_0
numpydoc                  0.8.0                    py27_0
olefile                   0.46                     py27_0
openpyxl                  2.6.1                    py27_1
openssl                   1.1.1b               h7b6447c_1
packaging                 19.0                     py27_0
pandas                    0.24.2           py27he6710b0_0
pandoc                    2.2.3.2                       0
pandocfilters             1.4.2                    py27_1
pango                     1.42.4               h049681c_0
parso                     0.3.4                    py27_0
partd                     0.3.10                   py27_1
patchelf                  0.9                  he6710b0_3
path.py                   11.5.0                   py27_0
pathlib2                  2.3.3                    py27_0
patsy                     0.5.1                    py27_0
pcre                      8.43                 he6710b0_0
pep8                      1.7.1                    py27_0
pexpect                   4.6.0                    py27_0
pickleshare               0.7.5                    py27_0
pillow                    5.4.1            py27h34e0f95_0
pip                       19.0.3                   py27_0
pixman                    0.38.0               h7b6447c_0
pkginfo                   1.5.0.1                  py27_0
pluggy                    0.9.0                    py27_0
ply                       3.11                     py27_0
prometheus_client         0.6.0                    py27_0
prompt_toolkit            1.0.15           py27h1b593e1_0
psutil                    5.6.1            py27h7b6447c_0
ptyprocess                0.6.0                    py27_0
py                        1.8.0                    py27_0
py-lief                   0.9.0            py27h7725739_2
pycairo                   1.18.0           py27h2a1e443_0
pycodestyle               2.5.0                    py27_0
pycosat                   0.6.3            py27h14c3975_0
pycparser                 2.19                     py27_0
pycrypto                  2.6.1            py27h14c3975_9
pycurl                    7.43.0.2         py27h1ba5d50_0
pyflakes                  2.1.1                    py27_0
pygments                  2.3.1                    py27_0
pylint                    1.9.2                    py27_0
pyodbc                    4.0.26           py27he6710b0_0
pyopenssl                 19.0.0                   py27_0
pyparsing                 2.3.1                    py27_0
pyqt                      5.9.2            py27h05f1152_2
pyrsistent                0.14.11          py27h7b6447c_0
pysocks                   1.6.8                    py27_0
pytables                  3.5.1            py27h71ec239_0
pytest                    4.3.1                    py27_0
python                    2.7.15               h9bab390_6
python-dateutil           2.8.0                    py27_0
python-libarchive-c       2.8                      py27_6
pytz                      2018.9                   py27_0
pywavelets                1.0.2            py27hdd07704_0
pyyaml                    5.1              py27h7b6447c_0
pyzmq                     18.0.0           py27he6710b0_0
qt                        5.9.7                h5867ecd_1
qtawesome                 0.5.7                    py27_1
qtconsole                 4.4.3                    py27_0
qtpy                      1.7.0                    py27_1
readline                  7.0                  h7b6447c_5
requests                  2.21.0                   py27_0
rope                      0.12.0                   py27_0
ruamel_yaml               0.15.46          py27h14c3975_0
scandir                   1.10.0           py27h7b6447c_0
scikit-image              0.14.2           py27he6710b0_0
scikit-learn              0.20.3           py27hd81dba3_0
scipy                     1.2.1            py27h7c811a0_0
seaborn                   0.9.0                    py27_0
send2trash                1.5.0                    py27_0
setuptools                40.8.0                   py27_0
simplegeneric             0.8.1                    py27_2
singledispatch            3.4.0.3          py27h9bcb476_0
sip                       4.19.8           py27hf484d3e_0
six                       1.12.0                   py27_0
snappy                    1.1.7                hbae5bb6_3
snowballstemmer           1.2.1            py27h44e2768_0
sortedcollections         1.1.2                    py27_0
sortedcontainers          2.1.0                    py27_0
soupsieve                 1.8                      py27_0
sphinx                    1.8.5                    py27_0
sphinxcontrib             1.0                      py27_1
sphinxcontrib-websupport  1.1.0                    py27_1
spyder                    3.3.3                    py27_0
spyder-kernels            0.4.2                    py27_0
sqlalchemy                1.3.1            py27h7b6447c_0
sqlite                    3.27.2               h7b6447c_0
ssl_match_hostname        3.7.0.1                  py27_0
statsmodels               0.9.0            py27h035aef0_0
subprocess32              3.5.3            py27h7b6447c_0
sympy                     1.3                      py27_0
tblib                     1.3.2            py27h51fe5ba_0
terminado                 0.8.1                    py27_1
testpath                  0.4.2                    py27_0
tk                        8.6.8                hbc83047_0
toolz                     0.9.0                    py27_0
tornado                   5.1.1            py27h7b6447c_0
tqdm                      4.31.1                   py27_1
traceback2                1.4.0                    py27_0
traitlets                 4.3.2            py27hd6ce930_0
typing                    3.6.6                    py27_0
unicodecsv                0.14.1           py27h5062da9_0
unittest2                 1.1.0                    py27_0
unixodbc                  2.3.7                h14c3975_0
urllib3                   1.24.1                   py27_0
wcwidth                   0.1.7            py27h9e3e1ab_0
webencodings              0.5.1                    py27_1
werkzeug                  0.14.1                   py27_0
wheel                     0.33.1                   py27_0
widgetsnbextension        3.4.2                    py27_0
wrapt                     1.11.1           py27h7b6447c_0
wurlitzer                 1.0.2                    py27_0
xlrd                      1.2.0                    py27_0
xlsxwriter                1.1.5                    py27_0
xlwt                      1.3.0            py27h3d85d97_0
xz                        5.2.4                h14c3975_4
yaml                      0.1.7                had09818_2
zeromq                    4.3.1                he6710b0_3
zict                      0.1.4                    py27_0
zipp                      0.3.3                    py27_1
zlib                      1.2.11               h7b6447c_3
zstd                      1.3.7                h0b5b093_0
List of Pre-installed R Packages

Packages are available in the environment at /usr/lib/envs/<env-ID-and-version-details>.
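
For example, assuming the conda tool is available on the cluster node, a minimal shell sketch for inspecting such an environment is shown below; the environment path is the placeholder from above and must be replaced with the actual directory on the node.

    # List the packages installed in a pre-built environment.
    # Assumes conda is on the PATH; the path below is a placeholder.
    conda list -p /usr/lib/envs/<env-ID-and-version-details>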

# packages in environment at /usr/lib/envs/e-a-2019.03-r-3.5.1:
#
# Name                    Version                   Build  Channel
_ipyw_jlab_nb_ext_conf    0.1.0                    py37_0
_r-mutex                  1.0.0               anacondar_1
alabaster                 0.7.12                   py37_0
anaconda-client           1.7.2                    py37_0
anaconda-navigator        1.9.7                    py37_0
anaconda-project          0.8.2                    py37_0
asn1crypto                0.24.0                   py37_0
astroid                   2.2.5                    py37_0
astropy                   3.1.2            py37h7b6447c_0
atomicwrites              1.3.0                    py37_1
attrs                     19.1.0                   py37_1
babel                     2.6.0                    py37_0
backcall                  0.1.0                    py37_0
backports                 1.0                      py37_1
backports.os              0.1.1                    py37_0
backports.shutil_get_terminal_size 1.0.0                    py37_2
beautifulsoup4            4.7.1                    py37_1
binutils_impl_linux-64    2.31.1               h6176602_1
binutils_linux-64         2.31.1               h6176602_6
bitarray                  0.8.3            py37h14c3975_0
bkcharts                  0.2                      py37_0
blas                      1.0                         mkl
bleach                    3.1.0                    py37_0
blosc                     1.15.0               hd408876_0
bokeh                     1.0.4                    py37_0
boto                      2.49.0                   py37_0
bottleneck                1.2.1            py37h035aef0_1
bwidget                   1.9.11                        1
bzip2                     1.0.6                h14c3975_5
ca-certificates           2019.1.23                     0
cairo                     1.14.12              h8948797_3
certifi                   2019.3.9                 py37_0
cffi                      1.12.2           py37h2e261b9_1
chardet                   3.0.4                    py37_1
click                     7.0                      py37_0
cloudpickle               0.8.0                    py37_0
clyent                    1.2.2                    py37_1
colorama                  0.4.1                    py37_0
conda-verify              3.1.1                    py37_0
contextlib2               0.5.5                    py37_0
cryptography              2.6.1            py37h1ba5d50_0
curl                      7.64.0               hbc83047_2
cycler                    0.10.0                   py37_0
cython                    0.29.6           py37he6710b0_0
cytoolz                   0.9.0.1          py37h14c3975_1
dask                      1.1.4                    py37_1
dask-core                 1.1.4                    py37_1
dbus                      1.13.6               h746ee38_0
decorator                 4.4.0                    py37_1
defusedxml                0.5.0                    py37_1
distributed               1.26.0                   py37_1
docutils                  0.14                     py37_0
entrypoints               0.3                      py37_0
et_xmlfile                1.0.1                    py37_0
expat                     2.2.6                he6710b0_0
fastcache                 1.0.2            py37h14c3975_2
filelock                  3.0.10                   py37_0
flask                     1.0.2                    py37_1
fontconfig                2.13.0               h9420a91_0
freetype                  2.9.1                h8a8886c_1
fribidi                   1.0.5                h7b6447c_0
future                    0.17.1                   py37_0
gcc_impl_linux-64         7.3.0                habb00fd_1
gcc_linux-64              7.3.0                h553295d_6
get_terminal_size         1.0.0                haa9412d_0
gevent                    1.4.0            py37h7b6447c_0
gfortran_impl_linux-64    7.3.0                hdf63c60_1
gfortran_linux-64         7.3.0                h553295d_6
glib                      2.56.2               hd408876_0
glob2                     0.6                      py37_1
gmp                       6.1.2                h6c8ec71_1
gmpy2                     2.0.8            py37h10f8cd9_2
graphite2                 1.3.13               h23475e2_0
greenlet                  0.4.15           py37h7b6447c_0
gsl                       2.4                  h14c3975_4
gst-plugins-base          1.14.0               hbbd80ab_1
gstreamer                 1.14.0               hb453b48_1
gxx_impl_linux-64         7.3.0                hdf63c60_1
gxx_linux-64              7.3.0                h553295d_6
h5py                      2.9.0            py37h7918eee_0
harfbuzz                  1.8.8                hffaf4a1_0
hdf5                      1.10.4               hb1b8bf9_0
heapdict                  1.0.0                    py37_2
html5lib                  1.0.1                    py37_0
icu                       58.2                 h9c2bf20_1
idna                      2.8                      py37_0
imageio                   2.5.0                    py37_0
imagesize                 1.1.0                    py37_0
importlib_metadata        0.8                      py37_0
intel-openmp              2019.3                      199
ipykernel                 5.1.0            py37h39e3cac_0
ipython                   7.4.0            py37h39e3cac_0
ipython_genutils          0.2.0                    py37_0
ipywidgets                7.4.2                    py37_0
isort                     4.3.16                   py37_0
itsdangerous              1.1.0                    py37_0
jbig                      2.1                  hdba287a_0
jdcal                     1.4                      py37_0
jedi                      0.13.3                   py37_0
jeepney                   0.4                      py37_0
jinja2                    2.10                     py37_0
jpeg                      9b                   h024ee3a_2
jsonschema                3.0.1                    py37_0
jupyter                   1.0.0                    py37_7
jupyter_client            5.2.4                    py37_0
jupyter_console           6.0.0                    py37_0
jupyter_core              4.4.0                    py37_0
jupyterlab                0.35.4           py37hf63ae98_0
jupyterlab_server         0.2.0                    py37_0
keyring                   18.0.0                   py37_0
kiwisolver                1.0.1            py37hf484d3e_0
krb5                      1.16.1               h173b8e3_7
lazy-object-proxy         1.3.1            py37h14c3975_2
libarchive                3.3.3                h5d8350f_5
libcurl                   7.64.0               h20c2e04_2
libedit                   3.1.20181209         hc058e9b_0
libffi                    3.2.1                hd88cf55_4
libgcc-ng                 8.2.0                hdf63c60_1
libgfortran-ng            7.3.0                hdf63c60_0
liblief                   0.9.0                h7725739_2
libpng                    1.6.36               hbc83047_0
libsodium                 1.0.16               h1bed415_0
libssh2                   1.8.0                h1ba5d50_4
libstdcxx-ng              8.2.0                hdf63c60_1
libtiff                   4.0.10               h2733197_2
libtool                   2.4.6                h7b6447c_5
libuuid                   1.0.3                h1bed415_2
libxcb                    1.13                 h1bed415_1
libxml2                   2.9.9                he19cac6_0
libxslt                   1.1.33               h7d1a2b0_0
llvmlite                  0.28.0           py37hd408876_0
locket                    0.2.0                    py37_1
lxml                      4.3.2            py37hefd8a0e_0
lz4-c                     1.8.1.2              h14c3975_0
lzo                       2.10                 h49e0be7_2
make                      4.2.1                h1bed415_1
markupsafe                1.1.1            py37h7b6447c_0
matplotlib                3.0.3            py37h5429711_0
mccabe                    0.6.1                    py37_1
mistune                   0.8.4            py37h7b6447c_0
mkl                       2019.3                      199
mkl-service               1.1.2            py37he904b0f_5
mkl_fft                   1.0.10           py37ha843d7b_0
mkl_random                1.0.2            py37hd81dba3_0
more-itertools            6.0.0                    py37_0
mpc                       1.1.0                h10f8cd9_1
mpfr                      4.0.1                hdf1c602_3
mpmath                    1.1.0                    py37_0
msgpack-python            0.6.1            py37hfd86e86_1
multipledispatch          0.6.0                    py37_0
navigator-updater         0.2.1                    py37_0
nbconvert                 5.4.1                    py37_3
nbformat                  4.4.0                    py37_0
ncurses                   6.1                  he6710b0_1
networkx                  2.2                      py37_1
nltk                      3.4                      py37_1
nose                      1.3.7                    py37_2
notebook                  5.7.8                    py37_0
numba                     0.43.1           py37h962f231_0
numexpr                   2.6.9            py37h9e4a6bb_0
numpy                     1.16.2           py37h7e9f1db_0
numpy-base                1.16.2           py37hde5b4d6_0
numpydoc                  0.8.0                    py37_0
olefile                   0.46                     py37_0
openpyxl                  2.6.1                    py37_1
openssl                   1.1.1b               h7b6447c_1
packaging                 19.0                     py37_0
pandas                    0.24.2           py37he6710b0_0
pandoc                    2.2.3.2                       0
pandocfilters             1.4.2                    py37_1
pango                     1.42.4               h049681c_0
parso                     0.3.4                    py37_0
partd                     0.3.10                   py37_1
patchelf                  0.9                  he6710b0_3
path.py                   11.5.0                   py37_0
pathlib2                  2.3.3                    py37_0
patsy                     0.5.1                    py37_0
pcre                      8.43                 he6710b0_0
pep8                      1.7.1                    py37_0
pexpect                   4.6.0                    py37_0
pickleshare               0.7.5                    py37_0
pillow                    5.4.1            py37h34e0f95_0
pip                       19.0.3                   py37_0
pixman                    0.38.0               h7b6447c_0
pkginfo                   1.5.0.1                  py37_0
pluggy                    0.9.0                    py37_0
ply                       3.11                     py37_0
prometheus_client         0.6.0                    py37_0
prompt_toolkit            2.0.9                    py37_0
psutil                    5.6.1            py37h7b6447c_0
ptyprocess                0.6.0                    py37_0
py                        1.8.0                    py37_0
py-lief                   0.9.0            py37h7725739_2
pycodestyle               2.5.0                    py37_0
pycosat                   0.6.3            py37h14c3975_0
pycparser                 2.19                     py37_0
pycrypto                  2.6.1            py37h14c3975_9
pycurl                    7.43.0.2         py37h1ba5d50_0
pyflakes                  2.1.1                    py37_0
pygments                  2.3.1                    py37_0
pylint                    2.3.1                    py37_0
pyodbc                    4.0.26           py37he6710b0_0
pyopenssl                 19.0.0                   py37_0
pyparsing                 2.3.1                    py37_0
pyqt                      5.9.2            py37h05f1152_2
pyrsistent                0.14.11          py37h7b6447c_0
pysocks                   1.6.8                    py37_0
pytables                  3.5.1            py37h71ec239_0
pytest                    4.3.1                    py37_0
pytest-arraydiff          0.3              py37h39e3cac_0
pytest-astropy            0.5.0                    py37_0
pytest-doctestplus        0.3.0                    py37_0
pytest-openfiles          0.3.2                    py37_0
pytest-remotedata         0.3.1                    py37_0
python                    3.7.3                h0371630_0
python-dateutil           2.8.0                    py37_0
python-libarchive-c       2.8                      py37_6
pytz                      2018.9                   py37_0
pywavelets                1.0.2            py37hdd07704_0
pyyaml                    5.1              py37h7b6447c_0
pyzmq                     17.1.2           py37h14c3975_0
qt                        5.9.7                h5867ecd_1
qtawesome                 0.5.7                    py37_1
qtconsole                 4.4.3                    py37_0
qtpy                      1.7.0                    py37_1
r                         3.5.1                    r351_0
r-abind                   1.4_5            r351h6115d3f_0
r-assertthat              0.2.0            r351h6115d3f_0
r-backports               1.1.2            r351h96ca727_0
r-base                    3.5.1                h1e0a451_2
r-base64enc               0.1_3            r351h96ca727_4
r-bh                      1.66.0_1         r351h6115d3f_0
r-bindr                   0.1.1            r351h6115d3f_0
r-bindrcpp                0.2.2            r351h29659fb_0
r-boot                    1.3_20           r351hf348343_0
r-broom                   0.5.0            r351h6115d3f_0
r-callr                   2.0.4            r351h6115d3f_0
r-caret                   6.0_80           r351h96ca727_0
r-cellranger              1.1.0            r351h6115d3f_0
r-class                   7.3_14           r351hd10c6a6_4
r-cli                     1.0.0            r351h6115d3f_1
r-clipr                   0.4.1            r351h6115d3f_0
r-cluster                 2.0.7_1          r351hac1494b_0
r-codetools               0.2_15           r351h6115d3f_0
r-colorspace              1.3_2            r351h96ca727_0
r-crayon                  1.3.4            r351h6115d3f_0
r-curl                    3.2              r351hadc6856_1
r-cvst                    0.2_2            r351h6115d3f_0
r-data.table              1.11.4           r351h96ca727_0
r-dbi                     1.0.0            r351h6115d3f_0
r-dbplyr                  1.2.2            r351hf348343_0
r-ddalpha                 1.3.4            r351h80f5a37_0
r-deoptimr                1.0_8            r351h6115d3f_0
r-devtools                1.13.6           r351h6115d3f_0
r-dichromat               2.0_0            r351h6115d3f_4
r-digest                  0.6.15           r351h96ca727_0
r-dimred                  0.1.0            r351h6115d3f_0
r-dplyr                   0.7.6            r351h29659fb_0
r-drr                     0.0.3            r351h6115d3f_0
r-essentials              3.5.1                    r351_0
r-evaluate                0.11             r351h6115d3f_0
r-fansi                   0.2.3            r351h96ca727_0
r-forcats                 0.3.0            r351h6115d3f_0
r-foreach                 1.4.4            r351h6115d3f_0
r-foreign                 0.8_71           r351h96ca727_0
r-formatr                 1.5              r351h6115d3f_0
r-geometry                0.3_6            r351h96ca727_0
r-ggplot2                 3.0.0            r351h6115d3f_0
r-git2r                   0.23.0           r351h96ca727_1
r-glmnet                  2.0_16           r351ha65eedd_0
r-glue                    1.3.0            r351h96ca727_0
r-gower                   0.1.2            r351h96ca727_0
r-gtable                  0.2.0            r351h6115d3f_0
r-haven                   1.1.2            r351h29659fb_0
r-hexbin                  1.27.2           r351ha65eedd_0
r-highr                   0.7              r351h6115d3f_0
r-hms                     0.4.2            r351h6115d3f_0
r-htmltools               0.3.6            r351h29659fb_0
r-htmlwidgets             1.2              r351h6115d3f_0
r-httpuv                  1.4.5            r351h29659fb_0
r-httr                    1.3.1            r351h6115d3f_1
r-ipred                   0.9_6            r351h96ca727_0
r-irdisplay               0.5.0            r351h6115d3f_0
r-irkernel                0.8.12                   r351_0
r-iterators               1.0.10           r351h6115d3f_0
r-jsonlite                1.5              r351h96ca727_0
r-kernlab                 0.9_26           r351h80f5a37_0
r-kernsmooth              2.23_15          r351hac1494b_4
r-knitr                   1.20             r351h6115d3f_0
r-labeling                0.3              r351h6115d3f_4
r-later                   0.7.3            r351h29659fb_0
r-lattice                 0.20_35          r351h96ca727_0
r-lava                    1.6.2            r351h6115d3f_0
r-lazyeval                0.2.1            r351h96ca727_0
r-lubridate               1.7.4            r351h29659fb_0
r-magic                   1.5_8            r351h6115d3f_0
r-magrittr                1.5              r351h6115d3f_4
r-maps                    3.3.0            r351h96ca727_0
r-markdown                0.8              r351h96ca727_0
r-mass                    7.3_50           r351h96ca727_0
r-matrix                  1.2_14           r351h96ca727_0
r-memoise                 1.1.0            r351h6115d3f_0
r-mgcv                    1.8_24           r351h96ca727_0
r-mime                    0.5              r351h96ca727_0
r-modelmetrics            1.1.0            r351h29659fb_0
r-modelr                  0.1.2            r351h6115d3f_0
r-munsell                 0.5.0            r351h6115d3f_0
r-nlme                    3.1_137          r351ha65eedd_0
r-nnet                    7.3_12           r351h96ca727_0
r-numderiv                2016.8_1         r351h6115d3f_0
r-openssl                 1.0.2            r351h96ca727_1
r-pbdzmq                  0.3_3            r351h29659fb_0
r-pillar                  1.3.0            r351h6115d3f_0
r-pkgconfig               2.0.1            r351h6115d3f_0
r-plogr                   0.2.0            r351h6115d3f_0
r-pls                     2.6_0            r351h6115d3f_0
r-plyr                    1.8.4            r351h29659fb_0
r-praise                  1.0.0            r351h6115d3f_4
r-processx                3.1.0            r351h29659fb_0
r-prodlim                 2018.04.18       r351h29659fb_0
r-promises                1.0.1            r351h29659fb_0
r-purrr                   0.2.5            r351h96ca727_0
r-quantmod                0.4_13           r351h6115d3f_0
r-r6                      2.2.2            r351h6115d3f_0
r-randomforest            4.6_14           r351ha65eedd_0
r-rbokeh                  0.6.3                    r351_0
r-rcolorbrewer            1.1_2            r351h6115d3f_0
r-rcpp                    0.12.18          r351h29659fb_0
r-rcpproll                0.3.0            r351h29659fb_0
r-readr                   1.1.1            r351h29659fb_0
r-readxl                  1.1.0            r351h29659fb_0
r-recipes                 0.1.3            r351h6115d3f_0
r-recommended             3.5.1                    r351_0
r-rematch                 1.0.1            r351h6115d3f_0
r-repr                    0.15.0           r351h6115d3f_0
r-reprex                  0.2.0            r351h6115d3f_0
r-reshape2                1.4.3            r351h29659fb_0
r-rlang                   0.2.1            r351h96ca727_0
r-rmarkdown               1.10             r351h6115d3f_0
r-robustbase              0.93_2           r351ha65eedd_0
r-rpart                   4.1_13           r351hd10c6a6_0
r-rprojroot               1.3_2            r351h6115d3f_0
r-rstudioapi              0.7              r351h6115d3f_0
r-rvest                   0.3.2            r351h6115d3f_0
r-scales                  0.5.0            r351h29659fb_0
r-selectr                 0.4_1            r351h6115d3f_0
r-sfsmisc                 1.1_2            r351h6115d3f_0
r-shiny                   1.1.0            r351h6115d3f_0
r-sourcetools             0.1.7            r351h29659fb_0
r-spatial                 7.3_11           r351hd10c6a6_4
r-squarem                 2017.10_1        r351h6115d3f_0
r-stringi                 1.2.4            r351h29659fb_0
r-stringr                 1.3.1            r351h6115d3f_0
r-survival                2.42_6           r351h96ca727_0
r-testthat                2.0.0            r351h29659fb_0
r-tibble                  1.4.2            r351h96ca727_0
r-tidyr                   0.8.1            r351h29659fb_0
r-tidyselect              0.2.4            r351h29659fb_0
r-tidyverse               1.2.1            r351h6115d3f_0
r-timedate                3043.102         r351h6115d3f_0
r-tinytex                 0.6              r351h6115d3f_0
r-ttr                     0.23_3           r351ha65eedd_0
r-utf8                    1.1.4            r351h96ca727_0
r-uuid                    0.1_2            r351h96ca727_4
r-viridislite             0.3.0            r351h6115d3f_0
r-whisker                 0.3_2            r351hf348343_4
r-withr                   2.1.2            r351h6115d3f_0
r-xfun                    0.3              r351h6115d3f_0
r-xml2                    1.2.0            r351h29659fb_0
r-xtable                  1.8_2            r351h6115d3f_0
r-xts                     0.11_0           r351h96ca727_0
r-yaml                    2.2.0            r351h96ca727_0
r-zoo                     1.8_3            r351h96ca727_0
readline                  7.0                  h7b6447c_5
requests                  2.21.0                   py37_0
rope                      0.12.0                   py37_0
ruamel_yaml               0.15.46          py37h14c3975_0
scikit-image              0.14.2           py37he6710b0_0
scikit-learn              0.20.3           py37hd81dba3_0
scipy                     1.2.1            py37h7c811a0_0
seaborn                   0.9.0                    py37_0
secretstorage             3.1.1                    py37_0
send2trash                1.5.0                    py37_0
setuptools                40.8.0                   py37_0
simplegeneric             0.8.1                    py37_2
singledispatch            3.4.0.3                  py37_0
sip                       4.19.8           py37hf484d3e_0
six                       1.12.0                   py37_0
snappy                    1.1.7                hbae5bb6_3
snowballstemmer           1.2.1                    py37_0
sortedcollections         1.1.2                    py37_0
sortedcontainers          2.1.0                    py37_0
soupsieve                 1.8                      py37_0
sphinx                    1.8.5                    py37_0
sphinxcontrib             1.0                      py37_1
sphinxcontrib-websupport  1.1.0                    py37_1
spyder                    3.3.3                    py37_0
spyder-kernels            0.4.2                    py37_0
sqlalchemy                1.3.1            py37h7b6447c_0
sqlite                    3.27.2               h7b6447c_0
statsmodels               0.9.0            py37h035aef0_0
sympy                     1.3                      py37_0
tblib                     1.3.2                    py37_0
terminado                 0.8.1                    py37_1
testpath                  0.4.2                    py37_0
tk                        8.6.8                hbc83047_0
tktable                   2.10                 h14c3975_0
toolz                     0.9.0                    py37_0
tornado                   6.0.2            py37h7b6447c_0
tqdm                      4.31.1                   py37_1
traitlets                 4.3.2                    py37_0
unicodecsv                0.14.1                   py37_0
unixodbc                  2.3.7                h14c3975_0
urllib3                   1.24.1                   py37_0
wcwidth                   0.1.7                    py37_0
webencodings              0.5.1                    py37_1
werkzeug                  0.14.1                   py37_0
wheel                     0.33.1                   py37_0
widgetsnbextension        3.4.2                    py37_0
wrapt                     1.11.1           py37h7b6447c_0
wurlitzer                 1.0.2                    py37_0
xlrd                      1.2.0                    py37_0
xlsxwriter                1.1.5                    py37_0
xlwt                      1.3.0                    py37_0
xz                        5.2.4                h14c3975_4
yaml                      0.1.7                had09818_2
zeromq                    4.2.5                hf484d3e_1
zict                      0.1.4                    py37_0
zipp                      0.3.3                    py37_1
zlib                      1.2.11               h7b6447c_3
zstd                      1.3.7                h0b5b093_0

Macros

Macros can be used with commands and job schedules on QDS.

Macros can be accessed or defined using the Javascript language. Only assignment statements are valid; loops, function definitions, and all other language constructs are not supported. Assignment statements can use all operators and functions defined for the objects used in the statements, and defined macros can be used in subsequent statements. For example, an assignment such as today = moment().format('YYYY-MM-DD') defines a macro named today that later statements, and the command itself, can reference.

Javascript Language and Modules

The following Javascript libraries are available.

  • Moment.js: Provides many date/time-related functions. Ensure that the Moment.js timezone functionality matches the timezone used by the scheduler. Qubole uses Moment.js version 2.6.0. See the Moment.js documentation.
  • Moment-tokens: Provides strftime formats. See the Moment-tokens documentation.
Macros in Scheduler

Macros in Scheduler describes the macros available in the scheduler in detail.

Macros in QDS UI

Currently, Qubole supports macros when creating jobs and schedules on the Scheduler page. For more details, see Creating a New Schedule.

Qubole Product Suite

Managing Sessions

How Sessions Work

Hive allows you to embed code (Python scripts, shell scripts, Java functions) in SQL queries. This is a way to add functionality that is not natively present in HiveQL. See this Hive page for more information and examples.

Qubole simulates this functionality by associating every command in QDS with a session. Sessions allow you to create temporary data sets, configure parameters which can be used to tune query behavior, and add your code as scripts to QDS to run transformations within HiveQL. These parameters, datasets, and user-defined transformations are active only for the session. A session’s duration is configurable; the default is two hours.

To ensure that scripts are accessible to Qubole’s Hive clusters, upload them to your Cloud storage. A script can be in any Cloud location that is readable using the keys associated with the account, but Qubole recommends that you place scripts in the default location’s scripts folder. Once you have uploaded a script, add it to your session by means of an add file … command.
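
As a minimal sketch, assuming a GCP account whose default location is a Cloud Storage bucket, you could upload a script with the gsutil tool and then attach it in the session; the bucket and script names below are placeholders.

    # Upload a user-defined script to the default location's scripts folder
    # (placeholder bucket and file names).
    gsutil cp my_transform.py gs://<default-location-bucket>/scripts/my_transform.py

    # In the Hive session, attach the uploaded script before using it in a query:
    #   add file gs://<default-location-bucket>/scripts/my_transform.py;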

Creating and Managing Sessions

Navigate to the Sessions page in the Control Panel.

To create a session, click the Add icon Add1Icon at the top right corner of the Sessions page.

A dialog appears:

_images/CreateSession.png

Select the cluster on which you want to create a session from the drop-down list and click Create Session. A session is created; for example:

_images/Sessions.png

The Sessions page contains:

  • Id: The session ID.

  • Cluster Id: The ID of the cluster on which the session is created.

  • Start Time: The start time of the session.

  • Duration: The default duration is two hours. You can change the number of hours to any value between 1 and 6.

  • Commands: The number of commands that have been run. To see the commands, click the number. This takes you to the Compose tab of the Workbench page.

  • Action: Click the down arrow in the Action column to see a list of actions:

    _images/SessionAction.png

    You can perform the following operations on an existing session:

    • View Commands: Select this to see the commands that are running in the session; the Session Details dialog appears. Click the down arrow in the Action column:

      _images/ViewCommand.png

      You can choose to go to the Workbench page, or to Remove the command.

    • Change Duration: Select this to change the session duration. The Set Duration dialog appears:

      _images/ChangeDuration.png

      The default is two hours. To change it, enter a new value between 1 and 6, and click OK. Click Cancel to restore the previous setting.

    • Deactivate: Select this to deactivate a session. You can re-activate a deactivated session within two hours: click the down arrow in the Action column and click Activate.

    • Delete: Select this to delete the session.

Mapping of Cluster and Command Types

The following table shows which types of commands and queries run on each type of Qubole cluster.

Caution

Run only Hive DDL commands on a Presto cluster. Running Hive DML commands on a Presto cluster is not supported.

Command Type     Airflow Cluster   Hadoop-2 Cluster   Presto Cluster           Spark Cluster
DB Query         Not Applicable    Not Applicable     Not Applicable           Not Applicable
Hadoop command   No                Yes                No                       No
Hive query       No                Yes                Only Hive DDL commands   No
Presto query     No                No                 Yes                      No
Shell command    Yes               Yes                No                       Yes
Spark command    No                No                 No                       Yes

Note

DB Query runs without a cluster.

See also How-To (QDS Guides, Tasks, and FAQs).

Administration Guide

This guide explains how to configure and manage Qubole accounts and clusters. It also explains core concepts such as autoscaling.

Introduction

This guide contains information on administering the Qubole Data Service (QDS). It is intended for system administrators and users managing QDS.

Cluster Administration

This section covers topics related to cluster administration.

Managing Clusters

QDS pre-configures a set of clusters to run the jobs and queries you initiate from the Workbench page.

You can use the QDS UI to modify these clusters, and to create new ones and change them. This page provides detailed instructions.

QDS starts clusters when they are needed and resizes them according to the query workload.

Note

  • By default, QDS shuts down a cluster after two hours of inactivity (that is, when no command has been running during the past two hours). This default is configurable; set Idle Cluster Timeout as explained below.
  • A cluster's lifespan depends on the sessions in a user account; the cluster runs as long as there is an active cluster session. By default, a session stays alive through two hours of inactivity, and remains active for as long as commands are running. Use the Sessions tab in the Control Panel to create new sessions, terminate sessions, or extend sessions. For more information, see Managing Sessions.
Other Useful Pages
To Get Started
  • To see what clusters are active, and for information about each cluster, navigate to the Clusters page.
  • To add a new cluster, click New on the Clusters page; to modify an existing cluster, click the Edit button next to that cluster.

Now follow instructions under Modifying Cluster Settings for GCP below.

Modifying Cluster Settings for GCP

Under the Configuration tab, you can add or modify:

  • Cluster Labels: A cluster can have one or more labels separated by commas. You can make a cluster the default cluster by including the label default.

  • Hive, Spark, or Presto Version: For the engine type you have chosen, choose a supported version from the drop-down list.

  • Node Types: Choose the Google Cloud machine types for your QDS cluster nodes. These are virtual machines (VMs), also known as instances.

    • Coordinator Node Type: You can change the Coordinator node type from the default by selecting a different type from the drop-down list.
    • Worker Node Type: You can change the worker node type from the default by selecting a different type from the drop-down list.
  • Use Multiple Worker Node Types (Hadoop 2 and Spark clusters): See Configuring Heterogeneous Worker Nodes in the QDS UI.

  • Minimum Worker Nodes: Enter the minimum number of worker nodes if you want to change it (the default is 1). See Autoscaling in Qubole Clusters for information about how QDS uses the minimum and maximum node count to keep your cluster running at maximum efficiency.

  • Maximum Worker Nodes: Enter the maximum number of worker nodes if you want to change it (the default is 1).

  • Disk Storage Settings: The next five configuration settings determine whether your cluster uses local SSD storage disks, persistent disks (either standard storage or SSD), or a combination of the two.

    • Local Disk Count (375GB SSD): Choose any number up to 8 of these per GCP project. The limit of 8 reflects the cap on total local SSD storage per GCP project; storage limits can vary between accounts, so check your GCP account details to see how many local SSD disks you can use.
    • Persistent Disk Volume Count: Google Cloud persistent disks provide storage for HDFS and other data created and used by jobs in progress. The default for this field is zero; change it if you want to add storage in addition to the small boot disk Google Cloud provides by default. For more information, see Google Cloud persistent disks.
    • Persistent Disk Type: You can choose SSDs (solid state disks) or standard disks.
    • Persistent Disk Size: Enter the size in gigabytes (GB) of each persistent volume to be added.
    • Enable Persistent Disk Upscaling (Hive and Spark only): Check this box if you are adding persistent disks and want to allow QDS to increase disk storage dynamically if capacity is running low.
  • Node Bootstrap File: You can append the name of a node bootstrap script to the default path.

  • Disable Automatic Cluster Termination: Check this box if you do not want QDS to terminate idle clusters automatically. Qubole recommends that you leave this box unchecked.

  • Idle Cluster Timeout: Optionally specify how long (in hours) QDS should wait to terminate an idle cluster. The default is 2 hours; to change it, enter a number between 0 and 6. This will override the timeout set at the account level.

Note

See aggressive-downscaling-azure for information about an additional set of capabilities that currently require a Qubole Support ticket.

Under the Composition tab you can add or modify:

  • Coordinator and Minimum Worker Nodes: Choose preemptible or standard (non-preemptible) instances for the core nodes in your QDS cluster.
  • Autoscaling Worker Nodes: Choose preemptible or standard (non-preemptible) instances for the non-core nodes in your QDS cluster. These are nodes that QDS adds or removes depending on the workload.
  • Preemptible Nodes (%): The percentage of non-core nodes that are to be preemptible. The remainder will be standard GCP instances.
  • Fallback to On-demand Nodes: If you check this box, and QDS is unable to upscale your cluster with as many preemptible instances as you have specified, QDS will make up the shortfall with standard (non-preemptible) nodes. See Autoscaling in Qubole Clusters for a full discussion.
  • Use Qubole Placement Policy: If this box is checked, QDS will make a best effort to place one copy of each data block on a stable (non-preemptible) node. Qubole recommends that you leave this box checked.
  • Cool-down Period: Choose how long, in minutes, QDS should wait before terminating an idle node. See Cool-Down Period.

Under the Advanced Configuration tab you can add or modify:

  • Region: Click on the drop-down list to choose the geographical location.
  • Zone: Click on the drop-down list to choose the zone within the geographical location.
  • Network: Choose a GCP network from the drop-down menu or accept the default.
  • Subnetwork: Choose an IPv4 address from the drop-down menu or accept the default.
  • Coordinator Static IP: Optionally provide a static IP address for the cluster Coordinator node.
  • Bastion Node: Optionally provide the public IP address of a Bastion node to be used for access to private subnets.
  • Custom Tags: You can create tags to be applied to the GCP virtual machines.
  • Override Hadoop Configuration Variables: For a Hadoop (Hive) cluster, enter Hadoop variables here if you want to override the defaults (Recommended Configuration) that Qubole uses. See also Advanced Configuration: Modifying Hadoop Cluster Settings.
  • Fair Scheduler Configuration: For a Hadoop or Spark cluster, enter Hadoop Fair Scheduler values if you want to override the defaults that Qubole uses.
  • Default Fair Scheduler Queue: Specify the default Fair Scheduler queue (used if no queue is specified when the job is submitted).
  • Override Spark Configuration: For a Spark cluster, enter Spark settings here if you want to override the defaults (Recommended Configuration) that Qubole uses. See also Advanced Configuration: Modifying Hadoop Cluster Settings.
  • Python and R Version: For a Spark cluster, choose the versions from the drop-down menu.
  • HIVE SETTINGS: See Configuring a HiveServer2 Cluster.
  • MONITORING: See Advanced configuration: Modifying Cluster Monitoring Settings.
  • Customer SSH Public Key: The public key from an SSH public-private key pair, used to log in to QDS cluster nodes (see the example below).

Qubole Public Key cannot be changed. QDS uses this key to gain access to the cluster and run commands on it.
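
As a minimal sketch of preparing the Customer SSH Public Key, you can generate a key pair locally with ssh-keygen and paste the contents of the public key file into the field; the file name used here is only an example.

    # Generate an SSH key pair; keep the private key on your machine.
    ssh-keygen -t rsa -b 4096 -f ~/.ssh/qds_cluster_key -C "qds-cluster-access"

    # Display the public key so it can be copied into the Customer SSH Public Key field.
    cat ~/.ssh/qds_cluster_key.pub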

Note

You can improve the security of a cluster by authorizing Qubole to generate a unique SSH key every time a cluster is started; create a ticket with Qubole Support to enable this capability. A unique SSH key is generated by default in the https://in.qubole.com, https://us.qubole.com, and https://wellness.qubole.com environments.

Once the unique SSH key capability is enabled, QDS uses the unique SSH key to interact with the cluster. If you want to use this capability to control QDS access to a Bastion node communicating with a cluster running on a private subnet, Qubole Support provides an account-level key that you must authorize on the Bastion node. To get the account-level key, use this API or navigate to the cluster's Advanced Configuration; the account-level SSH key is displayed in EC2 Settings, as described in modify-ec2-settings.

When you are satisfied with your changes, click Create or Update.

Advanced Configuration: Modifying Hadoop Cluster Settings

Under HADOOP CLUSTER SETTINGS, you can:

  • Specify the Default Fair Scheduler Queue, which is used if no queue is specified at job submission.
  • Override Hadoop Configuration Variables for the Worker Node Type specified in the Cluster Settings section. The settings shown in the Recommended Configuration field are used unless you override them.
  • Set Fair Scheduler Configuration values to override the default values.

Hadoop-specific Options describes these options in more detail.

Note

Recommissioning can be enabled on Hadoop clusters as an Override Hadoop Configuration Variable. See Enable Recommissioning for more information.

See Enabling Container Packing in Hadoop 2 and Spark for more information on how to downscale more effectively in a Hadoop 2 cluster.

If the cluster type is Spark, Spark configuration is set in the Recommended Spark Configuration, in addition to the Hadoop Cluster Settings, as described in Configuring a Spark Cluster.

Advanced configuration: Modifying Cluster Monitoring Settings

Under MONITORING:

  • Enable Ganglia Monitoring: Ganglia monitoring is enabled automatically if you use Datadog; otherwise it is disabled by default. For more information on Ganglia monitoring, see Performance Monitoring with Ganglia.
Applying Your Changes

Under each tab on the Clusters page, there is a right pane (Summary) from which you can:

  • Review and edit your changes.
  • Click Create to create a new cluster.
  • Click Update to change an existing cluster’s configuration.

If you are not satisfied with your changes:

  • Click Previous to go back to the previous tab.
  • Click Cancel to leave settings unchanged.
Configuring Clusters

Every new account is configured with some clusters by default; these are sufficient to run small test workloads. This section and Managing Clusters explain how to modify these default clusters and how to add new ones and modify them:

Cluster Settings Page

To use the QDS UI to add or modify a cluster, choose Clusters from the drop-down list on the QDS main menu, then choose New and the cluster type and click on Create, or choose Edit to change the configuration of an existing cluster.

See Managing Clusters for more information.

The following sections explain the different options available on the Cluster Settings page.

General Cluster Configuration

Many of the cluster configuration options are common across different types of clusters; the most important categories are covered first.

Cluster Labels

As explained in Cluster Labels, each cluster has one or more labels that are used to route Qubole commands. In the first form entry, you can assign one or more comma-separated labels to a cluster.

Cluster Type

QDS supports the following cluster types:

  • Airflow (not configured by default).

    Airflow is a platform for programmatically authoring, scheduling, and monitoring workflows. It supports integration with third-party platforms. You can author complex directed acyclic graphs (DAGs) of tasks inside Airflow. It comes packaged with a rich feature set, which is essential to the ETL world. The rich user interface and command-line utilities make it easy to visualize pipelines running in production, monitor progress, and troubleshoot issues as required. To know more about Qubole Airflow, see Airflow.

  • Hadoop 2 (one Hadoop 2 cluster is configured by default in all cases).

    Hadoop 2 clusters run a version of the Hadoop API compatible with Apache Hadoop 2.6. Hadoop 2 clusters use the Apache YARN cluster manager and are tuned for running MapReduce and other applications.

  • Spark (one Spark cluster is configured by default in all cases).

    Spark clusters allow you to run applications based on supported Apache Spark versions. Spark has a fast in-memory processing engine that is ideally suited for iterative applications like machine learning. Qubole’s offering integrates Spark with the YARN cluster manager.

Cluster Size and Instance Types

From a performance standpoint, this is one of the most critical sets of parameters:

  • Set the Minimum and Maximum Worker Nodes for a cluster (in addition to one fixed Coordinator node).

Note

All Qubole clusters autoscale up and down automatically within the minimum and maximum range set in this section.

  • Coordinator and Worker Node Type

    Select the Worker Node Type according to the characteristics of the application. A memory-intensive application would benefit from memory-rich nodes (such as r3 node types in AWS, or E2-64 V3 in Azure), while a CPU-intensive application would benefit from instances with higher compute power (such as the c3 types in AWS or E2-64 V3 in Azure).

The Coordinator Node Type is usually determined by the size of the cluster. For smaller clusters and workloads, small instances suffice. But for extremely large clusters (or for running a large number of concurrent applications), Qubole recommends large-memory machines.

QDS uses Linux instances as cluster nodes.

Node Bootstrap File

This field provides the location of a Bash script used for installing custom software packages on cluster nodes.

Advanced applications often require custom software to be installed as a prerequisite. A Hadoop Mapper Python script may require access to SciPy/NumPy, for example, and this is often best arranged by simply installing these packages (using yum for example) by means of the node bootstrap script. See Understanding a Node Bootstrap Script for more information.

The account’s storage credentials are used to read the script, which runs with root privileges on both the Coordinator and worker nodes; on worker nodes, it runs before any task is launched on behalf of the application.

Note

QDS does not check the exit status of the script. If software installation fails and it is unsafe to run user applications in this case, you should shut the machine down from the bootstrap script.

Qubole recommends installing or updating custom Python libraries after activating Qubole’s Virtual Environment and installing libraries in it.

See Running Node Bootstrap and Ad hoc Scripts on a Cluster for more information on running node bootstrap scripts.
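
A minimal node bootstrap sketch along these lines is shown below; the package names are examples only, and the virtual environment path is a placeholder that depends on your cluster image.

    #!/bin/bash
    # Example node bootstrap script: runs as root on the Coordinator and worker nodes.

    # Install a system package with yum (example package only).
    yum install -y gcc

    # Install Python libraries into Qubole's Virtual Environment, as recommended.
    # Replace the placeholder with the actual virtual environment path on the node.
    source <path-to-qubole-virtualenv>/bin/activate
    pip install numpy scipy

    # QDS does not check the script's exit status; if a required installation fails
    # and it is unsafe to run user applications, shut the node down from this script.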

Other Settings

See Managing Clusters.

GCP Settings

To use the QDS UI to add or modify a GCP cluster, choose Clusters from the drop-down list on the QDS main menu, then choose New and the cluster type and click on Create, or choose Edit to change the configuration of an existing cluster. Configure the cluster as described under Modifying Cluster Settings for GCP.

Understanding the Soft-enforced Cluster Permissions

QDS supports soft enforcement of cluster permissions at the object level. On the Manage Permissions dialog of a specific cluster, when you select one permission, additional cluster permissions are automatically selected. You can still disable those additional permissions in the UI before saving.

Qubole highly recommends accepting the enforced permissions. For example, the Read permission is enforced with the Start permission. If you decide to uncheck the Read permission in the UI, Qubole warns you that the product experience will not be optimal: Unchecking these permissions will disable certain capabilities and might lead to errors.

The permissions that are enforced on the cluster are provided in this table.

  Permission   Enforced Permissions
  Read         Not applicable
  Start        Read
  Update       Read and Start
  Clone        Read, Manage, and Update
  Terminate    Read, Manage, Update, and Clone
  Delete       Read, Manage, Update, Clone, and Terminate

The different cluster permissions are:

  • Read - This permission allows/denies a user permission to view the specific cluster. The UI does not display a cluster for which a user does not have Read permission, which means that all other permissions, even if granted, are ineffective in the UI.
  • Update - This permission allows/denies a user permission to edit the cluster configuration. When no update permission is set, it also means that the cluster cannot be cloned from the UI.
  • Delete - This permission allows/denies a user permission to delete the cluster.
  • Start - This permission allows/denies a user permission to start the cluster. A user who is denied the Start permission on a cluster cannot run commands on that cluster, even if that user otherwise has permission to run commands. This behavior is not enabled by default; you can enable it by creating a ticket with Qubole Support.
  • Terminate - This permission allows/denies a user permission to terminate the cluster.
  • Clone - This permission allows/denies a user permission to clone the cluster. The Update permission must be granted to clone a cluster.
  • Manage - This permission allows/denies a user permission to manage this cluster’s permissions.

Note

For important information about how permissions interact, see Understanding the Precedence of Cluster Permissions.

Managing Cluster Permissions through the UI

QDS supports setting permissions for a specific cluster in the Clusters UI page in addition to the Object Policy REST API. For more information on the API, see Set Object Policy for a Cluster. You can allow/deny the cluster’s access to a user/group.

This feature is not enabled by default. Create a ticket with Qubole Support to get this feature enabled for the QDS account.

Note

Understanding the Soft-enforced Cluster Permissions provides the list of cluster permissions that would be enforced with one cluster permission.

To allow/deny permissions to a specific cluster through the UI, perform these steps:

  1. Navigate to the Clusters page and click or hover over the ellipsis next to the specific cluster whose permissions you want to restrict. You must be a system-admin, the owner of the cluster, or a user who has the Manage permission to set permissions or even see Manage Permissions. Clicking or hovering over the ellipsis displays this drop-down list for a running cluster.

    _images/cloneNdefaultCluster.png
  2. Click Manage Permissions (it also appears in the ellipsis drop-down list for an inactive cluster). The Manage Permissions dialog is displayed as shown here.

    _images/ClusterPermission.png
  3. Select a user/group that you want to allow or deny access to. Some permissions are set by default, and the current permissions are displayed for the selected user/group. To allow a permission, select the check box below the cluster policy action; to deny it, clear the check box or leave it unselected. The different cluster permissions are:

    • Read - This permission allows/denies a user to view the specific cluster. The UI does not display a cluster for which a user does not have Read permission, which means that all other permissions, even if granted, are ineffective in the UI.
    • Update - This permission allows/denies a user to edit the cluster configuration. When no update permission is set, it also means that the cluster cannot be cloned from the UI.
    • Delete - This permission allows/denies a user to delete the cluster.
    • Start - This permission allows/denies a user to start the cluster. A user who is denied the Start permission on a cluster cannot run commands on that cluster, even if that user otherwise has permission to run commands. This behavior is not enabled by default; you can enable it by creating a ticket with Qubole Support.
    • Clone - This permission allows/denies a user to clone the cluster. The Update permission must be granted to clone a cluster.
    • Manage - This permission allows/denies a user to manage this cluster’s permissions.

    Note

    If you allow a permission to a user who belongs to a group with restricted access, that user is allowed access, and vice versa. For more information, see Understanding the Precedence of Cluster Permissions.

  4. Click Add New Permission to assign permissions to another user/group. Allow/deny cluster permissions as described in step 3. Specific cluster permissions for a user and a group are illustrated in this sample.

    _images/ClusterPermission1.png
  5. Click Save after assigning cluster permissions to the user(s)/group(s).

Understanding the Precedence of Cluster Permissions

The precedence of cluster permissions is as follows:

  • The cluster owner and system-admin have all permissions, and these cannot be revoked.

  • Users take precedence over groups.

  • A user who is not assigned any specific permissions inherits them from the group that the user belongs to.

  • If the cluster ACL permissions are defined by the user who is the current owner, that user has full access by default, even if an access control is set to deny; QDS honors ownership over object ACLs.

  • If the cluster ACL permissions are not defined by the user who is the current owner, QDS allows that user to perform cluster operations as long as no explicit deny permission is set for that user. However, if the Read permission is denied, the user cannot see that specific cluster in the Clusters list. The Read permission is denied as shown in this example.

    _images/ClusterPermsn2.png
Autoscaling in Qubole Clusters
What Autoscaling Is

Autoscaling is a mechanism built in to QDS that adds and removes cluster nodes while the cluster is running, to keep just the right number of nodes running to handle the workload. Autoscaling automatically adds resources when computing or storage demand increases, while keeping the number of nodes at the minimum needed to meet your processing needs efficiently.

How Autoscaling Works

When you configure a cluster, you choose the minimum and maximum number of nodes the cluster will contain (Minimum Worker Nodes and Maximum Worker Nodes, respectively). While the cluster is running, QDS continuously monitors the progress of tasks and the number of active nodes to make sure that:

  • Tasks are being performed efficiently (so as to complete within an amount of time that is set by a configurable default).
  • No more nodes are active than are needed to complete the tasks.

If the first criterion is at risk, and adding nodes will correct the problem, QDS adds as many nodes as are needed, up to the Maximum Worker Nodes. This is called upscaling.

If the second criterion is not being met, QDS removes idle nodes, down to the Minimum Worker Nodes. This is called downscaling, or decommissioning.

The topics that follow provide details:

See also Autoscaling in Presto Clusters.

Types of Nodes

Autoscaling operates only on the nodes that comprise the difference between the Minimum Worker Nodes and Maximum Worker Nodes (the values you specified in the QDS Cluster UI when you configured the cluster), and affects worker nodes only; these are referred to as autoscaling nodes.

The Coordinator Node, and the nodes comprising the Minimum Worker Nodes, are the stable core of the cluster; they normally remain running as long as the cluster itself is running; these are called core nodes.

Preemptible VMs and On-demand Instances on GCP

Preemptible VM instances on GCP have the following characteristics:

  • The cost for preemptible VMs is much lower than the cost for on-demand instances.
  • The price is fixed, based on the instance type, and does not fluctuate. For details about the pricing of preemptible instances, see Google Compute Engine Pricing in the GCP documentation.
  • GCP Compute Engine can terminate (preempt) your preemptible VMs at any time if it needs to use them for other tasks.
  • Compute Engine always terminates preemptible VMs after they run for 24 hours, if not sooner. But certain actions reset the 24-hour counter for a preemptible VM, for instance, stopping and restarting the instance.

For more information, see Preemptible VM-based Autoscaling in Google Cloud Platform below.

Upscaling
Launching a Cluster

QDS launches clusters automatically when applications need them. If the application needs a cluster that is not running, QDS launches it with the minimum number of nodes, and scales up as needed toward the maximum.

Upscaling Criteria

QDS bases upscaling decisions on:

  • The rate of progress of the jobs that are running.
  • Whether faster throughput can be achieved by adding nodes.

Assuming the cluster is running fewer than the configured maximum number of nodes, QDS activates more nodes if, and only if, the configured SLA (Service Level Agreement) will not be met at the current rate of progress, and adding the nodes will improve the rate.

Even if the SLA is not being met, QDS does not add nodes if the workload cannot be distributed more efficiently across more nodes. For example, if three tasks are distributed across three nodes, and progress is slow because the tasks are large and resource-intensive, adding more nodes will not help because the tasks cannot be broken down any further. On the other hand, if tasks are waiting to start because the existing nodes do not have the capacity to run them, then QDS will add nodes.

Note

In a heterogeneous cluster, upscaling can cause the actual number of nodes running in the cluster to exceed the configured Maximum Worker Nodes. See Why is my cluster scaling beyond the configured maximum number of nodes?.

Disk Upscaling on Hadoop MRv2 GCP Clusters

Disk upscaling dynamically adds volumes to GCP VMs that are approaching the limits of their storage capacity. You can enable disk upscaling on Hadoop MRv2 clusters, including clusters running Spark and Tez jobs as well as those running MapReduce jobs.

When you enable disk upscaling for a node, you also specify:

  • The maximum number of disks that QDS can add to a node (Maximum Data Disk Count on the Clusters page in the QDS UI).
  • The minimum percentage of storage that must be available on the node (Free Space Threshold %). When available storage drops below this percentage, QDS adds one or more disks until free space is at or above the minimum percentage, or the node has reached its Maximum Data Disk Count. The default is 25%.
  • The absolute amount of storage that must be available on the node, in gigabytes (Absolute Free Space Threshold). When available storage drops below this amount, QDS adds one or more disks until free space is at or above the minimum amount, or the node has reached its Maximum Data Disk Count. The default is 100 GB.
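
For example (an illustrative scenario): on a node with 400 GB of total storage, a Free Space Threshold of 25% corresponds to 100 GB of free space. If free space drops below that level, QDS adds disks until free space is back at or above the threshold, or the node reaches its Maximum Data Disk Count.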

In addition, QDS monitors the rate at which running Hadoop jobs are using up storage, and from this computes when more storage will be needed.

QDS autoscaling adds storage but does not remove it directly, because this involves reducing the filesystem size, a risky operation. The storage is removed when the node is decommissioned.

Reducers-based Upscaling on Hadoop MRv2 Clusters

Hadoop MRv2 clusters can upscale on the basis of the number of Reducers. This configuration is disabled by default. Enable it by setting mapred.reducer.autoscale.factor=1 as a Hadoop override.

Downscaling

QDS bases downscaling decisions on the following factors.

Downscaling Criteria

A node is a candidate for decommissioning only if:

  • The cluster is larger than its configured minimum size.

  • No tasks are running.

  • The node is not storing any shuffle data (data from Map tasks for Reduce tasks that are still running).

  • Enough cluster storage will be left after shutting down the node to hold the data that must be kept (including HDFS replicas).

    Note

    This storage consideration does not apply to Presto clusters.

See also Aggressive Downscaling.

Note

In Hadoop MRv2, you can control the maximum number of nodes that can be downscaled simultaneously by setting mapred.hustler.downscaling.nodes.max.request to the maximum you want; the default is 500.

Downscaling Exception for Hadoop 2 and Spark Clusters: Hadoop 2 and Spark clusters do not downscale to a single worker node once they have been upscaled. When Minimum Worker Nodes is set to 1, the cluster starts with a single worker node, but once upscaled, it never downscales to fewer than two worker nodes. This is because decommissioning slows down greatly if there is only one usable node left for HDFS, so nodes doing no work may be left running, waiting to be decommissioned. You can override this behavior by setting mapred.allow.single.worker.node to true and restarting the cluster.

Container Packing in Hadoop 2 and Spark

QDS allows you to pack YARN containers on Hadoop MRv2 (Hadoop 2) and Spark clusters.

Container packing is enabled by default for GCP clusters.

Container packing causes the scheduler to pack containers on a subset of nodes instead of distributing them across all the nodes of the cluster. This increases the probability of some nodes remaining unused; these nodes become eligible for downscaling, reducing your cost.

Packing works by separating nodes into three sets:

  • Nodes with no containers (the Low set)
  • Nodes with memory utilization greater than the threshold (the High set)
  • All other nodes (the Medium set)

YARN schedules each container request in this order: nodes in the Medium set first, nodes in the Low set next, nodes in the High set last. For more information, see Enabling Container Packing in Hadoop 2 and Spark.

Graceful Shutdown of a Node

If all of the downscaling criteria are met, QDS starts decommissioning the node. QDS ensures a graceful shutdown by:

  • Waiting for all tasks to complete.

  • Ensuring that the node does not accept any new tasks.

  • Transferring HDFS block replicas to other nodes.

    Note

    Data transfer is not needed in Presto clusters.

Recommissioning a Node

If more jobs enter the pipeline while a node is being decommissioned, and the remaining nodes cannot handle them, the node is recommissioned – the decommissioning process stops and the node is reactivated as a member of the cluster and starts accepting tasks again.

Recommissioning takes precedence over launching new instances: when handling an upscaling request, QDS launches new nodes only if the need cannot be met by recommissioning nodes that are being decommissioned.

Recommissioning is preferable to starting a new instance because:

  • It is more efficient, avoiding bootstrapping a new node.
  • It is cheaper than provisioning a new node.
Dynamic Downscaling

Dynamic downscaling is triggered when you reduce the maximum size of a cluster while it’s running. The subsections that follow explain how it works. First you need to understand what happens when you decrease (or increase) the size of a running cluster.

Effects of Changing Worker Nodes Variables while the Cluster is Running: You can change the Minimum Worker Nodes and Maximum Worker Nodes while the cluster is running. You do this via the Cluster Settings screen in the QDS UI, just as you would if the cluster were down.

To force the change to take effect dynamically (while the cluster is running) you must push it, as described here. Exactly what happens then depends on the current state of the cluster and configuration settings. Here are the details for both variables.

Minimum Worker Nodes: An increase or reduction in the minimum count takes effect dynamically by default. (On a Hadoop MRv2 cluster, this happens because mapred.refresh.min.cluster.size is set to true by default. Similarly, on a Presto cluster, the configuration reloader mechanism detects the change.)

Maximum Worker Nodes:

  • An increase in the maximum count takes effect dynamically.
  • A reduction in the maximum count produces the following behavior:
    • If the current cluster size is smaller than the new maximum, the change takes effect dynamically. For example, if the maximum is 15, 10 nodes are currently running and you reduce the maximum count to 12, 12 will be the maximum from now on.
    • If the current cluster size is greater than the new maximum, QDS begins reducing the cluster to the new maximum, and subsequent upscaling will not exceed the new maximum. In this case, the default behavior for reducing the number of running nodes is dynamic downscaling.

How Dynamic Downscaling Works:

If you decrease the Maximum Worker Nodes while the cluster is running, and more than the new maximum number of nodes are actually running, then QDS begins dynamic downscaling.

If dynamic downscaling is triggered, QDS selects the nodes that are:

  • closest to completing their tasks
  • (in the case of Hadoop MRv2 clusters) closest to the time limit for their containers

Once selected, these nodes stop accepting new jobs and QDS shuts them down gracefully until the cluster is at its new maximum size (or the maximum needed for the current workload, whichever is smaller).

Note

In a Spark cluster, a node selected for dynamic downscaling may not be removed immediately in some cases – for example, if a Notebook or other long-running Spark application has executors running on the node, or if the node is storing shuffle data locally.

Aggressive Downscaling

Aggressive Downscaling refers to a set of QDS capabilities that are enabled by default for GCP clusters. See Aggressive Downscaling for more information.

Shutting Down an Idle Cluster

By default, QDS shuts the cluster down completely if both of the following are true:

  • There have been no jobs in the cluster over a configurable period.
  • At least one node is close to its hourly boundary (not applicable if Aggressive Downscaling is enabled) and no tasks are running on it.

You can change this behavior by disabling automatic cluster termination, but Qubole recommends that you leave it enabled – inadvertently allowing an idle cluster to keep running can become an expensive mistake.

Preemptible VM-based Autoscaling in Google Cloud Platform

For clusters on GCP, you can create and run preemptible VMs for a much lower price than you would pay for on-demand instances.

In the QDS UI, you can configure a percentage of your instances to be preemptible. You do this via the Composition tab in either the New Cluster or Edit Cluster screen. In the Summary section, click edit next to Composition. The number you put in the Preemptible Nodes (%) field specifies the maximum percentage of autoscaling nodes that QDS can launch as preemptible VMs:

_images/GCPSetPercentPreemptibleVMs.png

Qubole recommends using one of the following approaches to combining on-demand instances with preemptible VMs:

  • Use on-demand instances for your core nodes and a combination of on-demand instances and preemptible VMs for the autoscaling nodes.
  • Use preemptible VMs for both core nodes and autoscaling nodes.

Normally, the core nodes in a cluster are run on stable on-demand instances, except where an unexpected termination of the entire cluster is considered worth risking in order to obtain lower costs. Autoscaling nodes, on the other hand, can be preemptible without a risk that the cluster could be unexpectedly terminated. For more information on preemptible instances, see Preemptible VM Instances in the GCP documentation.

Rebalancing

Using preemptible VMs on GCP significantly reduces your cost, but fluctuations in the market may mean that QDS cannot always obtain as many preemptible instances as your cluster specification calls for. (QDS tries to obtain the preemptible instances for a configurable number of minutes before giving up.)

For example, suppose your cluster needs to scale up by four additional nodes, but only two preemptible instances that meet your requirements (out of the maximum of four you specified) are available. In this case, QDS will launch the two preemptible instances, and (by default) make up the shortfall by also launching two on-demand instances, meaning that you will be paying more than you had hoped in the case of those two instances. (You can change this default behavior in the QDS UI on the Add Cluster and Cluster Settings pages, by un-checking Fallback to on demand under the Cluster Composition tab.)

Whenever the cluster is running a greater proportion of on-demand instances than you have specified, QDS works to remedy the situation by monitoring the preemptible market, and replacing the on-demand nodes with preemptible instances as soon as suitable instances become available. This is called Rebalancing.

Note

Rebalancing is supported in Hadoop MRv2 and Spark clusters only.

How Autoscaling Works in Practice
Hadoop MRv2

Here’s how autoscaling works on Hadoop MRv2 (Hadoop 2 (Hive)) clusters:

  • Each node in the cluster reports its launch time to the ResourceManager, which keeps track of how long each node has been running.
  • YARN ApplicationMasters request YARN resources (containers) for each Mapper and Reducer task. If the cluster does not have enough resources to meet these requests, the requests remain pending.
  • On the basis of a pre-configured threshold for completing tasks (for example, two minutes), and the number of pending requests, ApplicationMasters create special autoscaling container requests.
  • The ResourceManager sums the ApplicationMasters’ autoscaling container requests, and on that basis adds more nodes (up to the configured Maximum Worker Nodes).
  • Whenever a node approaches its hourly boundary, the ResourceManager checks to see if any task or shuffle process is running on this node. If not, the ResourceManager decommissions the node.

Note

  • You can improve autoscaling efficiency by enabling container packing.
  • You can control the maximum number of nodes that can be downscaled simultaneously by setting mapred.hustler.downscaling.nodes.max.request to the maximum you want; the default is 500.
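
For example, to limit simultaneous downscaling to 100 nodes, you could set the following as a Hadoop override for the cluster (the value 100 is illustrative):

mapred.hustler.downscaling.nodes.max.request=100
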
Presto

Autoscaling in Presto Clusters explains how autoscaling works in Presto.

Spark

Here’s how autoscaling works on a Spark cluster:

  • You can configure Spark autoscaling at the cluster level and at the job level.
  • Spark applications consume YARN resources (containers); QDS monitors container usage and launches new nodes (up to the configured Maximum Worker Nodes) as needed.
  • If you have enabled job-level autoscaling, QDS monitors the running jobs and their rate of progress, and launches new executors as needed (and hence new nodes if necessary).
  • As jobs complete, QDS selects candidates for downscaling and initiates Graceful Shutdown of those nodes that meet the criteria.

For a detailed discussion and instructions, see Autoscaling in Spark.

Note

You can improve autoscaling efficiency by enabling container packing.

Tez on Hadoop 2 (Hive) Clusters

Here’s how autoscaling works on a Hadoop 2 (Hive) cluster where Tez is the execution engine:

Note

Tez is not supported on all Cloud platforms.

  • Each node in the cluster reports its launch time to the ResourceManager, which keeps track of how long each node has been running.
  • YARN ApplicationMasters request YARN resources (containers) for each Mapper and Reducer task. If the cluster does not have enough resources to meet these requests, the requests remain pending.
  • ApplicationMasters monitor the progress of the DAG (on the Mapper nodes) and calculate how long it will take to finish their tasks at the current rate.
  • On the basis of a pre-configured threshold for completing tasks (for example, two minutes), and the number of pending requests, ApplicationMasters create special autoscaling container requests.
  • The ResourceManager sums the ApplicationMasters’ autoscaling container requests, and on that basis adds more nodes (up to the configured Maximum Worker Nodes).
  • Whenever a node approaches its hourly boundary, the ResourceManager checks to see if any task or shuffle process is running on this node. If not, the ResourceManager decommissions the node.
For More Information

For more information about configuring and managing QDS clusters, see:

Using GCP Preemptible Instances in Qubole Clusters

Google Cloud Platform (GCP) offers two types of instances that are suitable for use as Qubole cluster nodes: on-demand instances and preemptible instances.

On-demand instances: These are normal compute instances. An on-demand instance is likely to remain available for the life of the cluster, helping to ensure that the cluster performs its work smoothly and reliably. The disadvantage of on-demand instances is cost: a cluster composed entirely or mostly of on-demand nodes may be many times more expensive than a similarly-sized cluster composed partly of preemptible nodes.

Preemptible instances: Preemptible instances are short-lived, up to 24 hours, and Compute Engine can terminate these instances at any time. But preemptible instances are much cheaper than on-demand instances, so if your applications can withstand possible instance terminations, then preemptible instances can reduce your Compute Engine costs significantly.

To compare prices between on-demand instances and preemptible instances, see Google Compute Engine Pricing in the GCP documentation.

Cluster Composition Choices

You can choose to create a cluster in any of the following configurations:

  • On-demand nodes only
  • Preemptible nodes only
  • A mix of preemptible and on-demand nodes

For most purposes, the third option is the best because it provides a balance between cost and stability.

The remainder of this section focuses on the settings and mechanisms Qubole provides to help safeguard the overall functioning of a cluster that includes preemptible nodes.

How You Configure Preemptible Instances into a Qubole Cluster

The items to configure when creating a new cluster in GCP are these:

  • Coordinator and Minimum Worker Nodes: Choose whether Coordinator and minimum worker nodes should be on-demand or preemptible. These nodes are essential to the functioning of the cluster, so you will normally choose on-demand nodes for them.
  • Auto-scaling Worker Nodes: Choose whether the auto-scaling nodes should be on-demand or preemptible. Auto-scaling worker nodes are not as essential to the functioning of the cluster, so you might consider preemptible nodes for them.
  • Preemptible Nodes Percentage: In a mixed cluster, this specifies the maximum percentage of autoscaling nodes that can be preemptible instances. Autoscaling nodes are those that comprise the difference between the Minimum Worker Nodes and the Maximum Worker Nodes. Qubole adds and removes these nodes according to the cluster workload. In a preemptible-only cluster, this is always set to 100.
  • Fallback to On-demand Nodes: You should normally choose Fallback to On-demand Nodes (check the box). This option causes Qubole to launch on-demand instances if it cannot obtain enough preemptible instances when adding nodes during autoscaling. This means that the cluster could possibly at times consist entirely of on-demand nodes, if no preemptible nodes are available. But unless cost is all-important, this is a sensible option to choose because it allows the cluster to do its work even if preemptible nodes are not available.
  • Use Qubole Placement Policy: If selected, this setting causes Qubole to make a best effort to store one replica of each HDFS block on a stable node (normally an on-demand node, except in the case of a preemptible-only cluster). Qubole recommends you select this option to prevent job failures that could occur if all replicas were lost as a result of Compute Engine reclaiming many preemptible instances at once.
  • Cool-Down Period: The time period in minutes that Qubole allows to elapse before removing nodes designated as not needed, based on the cluster’s current workload.
Configuring a Mixed Cluster (Normal and Preemptible Nodes)

You configure a mixed cluster by doing all of the following:

  • Setting Coordinator and Minimum Worker Nodes to On-demand nodes.
  • Setting Auto-scaling Worker Nodes to Preemptible nodes.
  • Setting the Preemptible Nodes Percentage to a number less than 100.

This configures a cluster in which the core nodes (the Coordinator Node and the nodes comprising the Minimum Worker Nodes) are On-Demand instances, and a percentage of the autoscaling nodes are preemptible instances as specified by the Preemptible Nodes Percentage.

For example, if the Minimum Worker Nodes is 2 and the Maximum Worker Nodes is 10, and you set the Preemptible Nodes Percentage to 50, the resulting cluster will have, at any given time:

  • A minimum of 3 nodes: the Coordinator Node plus the Minimum Worker Nodes, all of them on-demand instances (the core nodes).
  • (Usually) a maximum of 11 nodes, of which up to 4 (50% of the difference between 2 and 10) will be preemptible instances, and the remainder on-demand instances. (The cluster size can occasionally rise above the maximum for brief periods while the cluster is autoscaling.)

Qubole also falls back to on-demand nodes when the Coordinator and Minimum Worker Nodes composition is set to preemptible nodes and preemptible instances are not available.

How Qubole Manages Preemptible Nodes While the Cluster is Running

Qubole’s primary goal in managing cluster resources is productivity, making sure that the work you need to do gets done as efficiently and reliably as possible, and at the lowest cost that is consistent with that goal.

Qubole uses the following mechanisms to help ensure maximum productivity in running clusters that deploy preemptible instances:

  • The Fallback to On-demand Nodes option described above.
  • The Qubole Placement Policy described above.
Aggressive Downscaling

Aggressive Downscaling refers to a set of QDS capabilities that allow idle clusters and cluster nodes to be shut down as quickly and efficiently as possible. It comprises the capabilities described in the sub-sections below.

Read these sub-sections in conjunction with the Downscaling section of Autoscaling in Qubole Clusters. See also Understanding the QDS Cluster Lifecycle.

Faster Cluster Termination

QDS waits for a configurable period after the last command executes before terminating a cluster. This period is referred to as the Idle Cluster Timeout in the QDS UI. By default this is configurable in multiples of one hour; Aggressive Downscaling allows you to configure it in increments of a minute. You can configure this value at both the account level and the cluster level. If you set it at the cluster level, that value overrides the account-level value, which defaults to two hours. You can change a cluster’s Idle Cluster Timeout setting without restarting the cluster.

Note

QDS monitors the cluster every 5 minutes to see if it is eligible for shutdown. This can mean that a cluster is idle longer than the timeout you set. For example, if you set the Idle Cluster Timeout to five minutes, and QDS checks the cluster four minutes after the last command has completed, QDS will not shut down the cluster. If no further commands have executed by the next checkpoint, five minutes later, QDS will shut the cluster down. In this case the cluster has been idle nine minutes in all.

Exception for Spark Notebooks

Spark notebook interpreters have a separate timeout parameter (spark.qubole.idle.timeout) that defaults to one hour. A cluster will not shut down if an interpreter is running, so you should reduce the value of spark.qubole.idle.timeout if it’s greater than the Idle Cluster Timeout.

Faster Node Termination

The Downscaling section of Autoscaling in Qubole Clusters explains the conditions under which QDS decommissions a node and removes it from a running cluster. By default, these conditions include the concept of an hour boundary: if a node meets all other downscaling criteria, it becomes eligible for shutdown as it approaches an hourly increment of up-time. Aggressive Downscaling does away with this criterion: after you enable Aggressive Downscaling and restart the cluster, its nodes will be decommissioned as soon as they meet all of the other downscaling criteria.

Cool-Down Period

Faster node termination could cause the cluster size to fluctuate too rapidly, so that nodes spend a disproportionate amount of time booting and shutting down, and users may have to wait unnecessarily for new nodes to start and run their commands. The Cool Down Period is designed to prevent this; it allows you to configure how long QDS waits before terminating a cluster node after it becomes idle.

When a node enters its Cool Down Period, QDS initiates graceful shutdown on that node, allowing the node to be either recommissioned or shut down, depending on the cluster workload.

The default value is 10 minutes for Hadoop (Hive) and Spark clusters, and 5 minutes for Presto clusters. The minimum value you should set in all cases is 5 minutes; a lower value may be less than the time it takes to decommission a node, making the setting ineffective.

Note

For Presto clusters, the Cool Down Period does not apply to individual nodes, but to the cluster as a whole: QDS starts to decommission Presto nodes only if it determines that the cluster has been underutilized throughout the Cool Down Period.

Configuring the Cool-Down Period

To change a cluster’s Cool Down Period to something other than the default, navigate to the Configuration tab of the Clusters page in the QDS UI, and set the value to 5 minutes or longer.

Note

If you set the Idle Cluster Timeout to a lower value than the Cool Down Period, the Idle Cluster Timeout takes precedence.

An Overview of Heterogeneous Nodes in Clusters

QDS supports heterogeneous Spark and Hadoop 2 clusters; this means that the worker nodes comprising the cluster can be of different instance types.

Configuring Heterogeneous Worker Nodes in the QDS UI

Managing Clusters describes how to edit the cluster configuration through the QDS UI.

Select Use Multiple Worker Node Types to configure heterogeneous worker nodes. The UI displays worker node type and weight.

Select the worker node type; its predetermined weight value is populated automatically.

The default node weight is calculated as (memory of the node type / memory of the primary worker type).
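
For example (illustrative numbers): if the primary worker type has 64 GB of memory and a secondary worker node type has 128 GB, the secondary type's default weight is 128 / 64 = 2.0, so provisioning one such node satisfies a request for two primary-type worker nodes.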

Note

You must carefully pick instance types that have similar CPU and memory capacity. Choosing instance types with significantly different CPU and memory capacity may lead to degraded performance and increased query failures, as the weakest instance configuration becomes the bottleneck during query execution.

You can edit the worker node’s weight. Override the default weight if you want to base it on the number of CPUs, cost, or any other parameter.

The order of preference among worker nodes is set to the order in which worker node types are selected.

Click Add worker node type to add another worker node type. You can select a maximum of 10 worker node types.

Note

In a heterogeneous cluster, upscaling can cause the actual number of nodes running in the cluster to exceed the configured Maximum Worker Nodes. See Why is my cluster scaling beyond the configured maximum number of nodes?.

Using Heterogeneous Nodes in Hadoop and Spark Clusters

For a heterogeneous cluster:

  • The QDS UI displays supported instance types with weights based on instance memory. You can use the scrollable counter to change the weight as needed.
  • The first instance type must be the same as the cluster’s worker instance type and have a weight of 1.0. This is the primary instance type. Make sure that the first instance type is the primary instance type if you are using Qubole’s APIs to create a heterogeneous cluster.
  • QDS will try the rest of the instance types whenever it needs to provision nodes and when nodes from the earlier list are unavailable. The number of instances requested is decided by the weight.
Selecting different instance types using the QDS UI

See Configuring Heterogeneous Worker Nodes in the QDS UI for information about configuring heterogeneous nodes.

Setting up a Bastion Node on a GCP Cluster

Follow the instructions on this page to create Qubole clusters with bastion nodes on GCP. A bastion host is a special-purpose GCP instance that provides SSH access from Qubole’s NAT gateway into your VPC and acts as a proxy to GCP instances running within your VPC.

Step 1: Creating the bastion node

Create a VM instance on the Google Cloud Console with the following specifications. This will serve as the bastion node.

  1. Select a region and a zone. They must match the region and zone of your cluster. This example uses us-east1 as the region and us-east1-b as the zone.
  2. Select either Centos 7 or Red Hat Enterprise Linux 7 as the operating system.
  3. Add a network tag to this host. This will be used to assign firewall rules and control traffic in and out of the bastion. You can use any valid GCP network tag name. In this example, the network tag is gcp-bastion.

Note

  • Assign a static IP address to the Network Interface to avoid problems when restarting the instance.
  • Avoid using a preemptible instance as the bastion node.

The VM will look similar to this:

_images/BastionGCP1.png
Step 2: Setting up firewall rules

From VPC Network -> Firewall Rules on the Google Cloud Console, add the following firewall rules.

  1. Allow ssh traffic on TCP:22 from Qubole’s NAT IP (34.73.1.130/32) to the bastion using the network tag created in step 1c above.
  2. Allow access on TCP:7000 from the cluster’s region’s IP address range to the bastion using the network tag created in step 1c above. The IP address range for a given region can be obtained by navigating to VPC Network > VPC Networks on the Google Cloud Console.

Once created, your firewall rules will look similar to this:

_images/BastionGCP2.png
Step 3: Configuring the bastion node

In this example, ssh is set up using the username bastion-user on the bastion node. You can set it up with a username of your choosing.

  1. Copy the ssh key for your cluster from the Account SSH key field in the Edit Cluster Settings > Advanced Configuration tab of the QDS UI.

  2. Add the ssh key from step 3a as an authorized user on the bastion node by ssh-ing into the bastion node and running the following commands on a shell as a root user.

    useradd bastion-user -p ''
    mkdir -p /home/bastion-user/.ssh
    chown -R bastion-user:bastion-user /home/bastion-user/.ssh
    
  3. Add the ssh key obtained from step 3a as an authorized key by opening /home/bastion-user/.ssh/authorized_keys in an editor of your choice and pasting the key into the file.
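
    Alternatively, you can append the key from the command line instead of using an editor. This is a sketch only; replace the placeholder with the key copied in step 3a, and keep the file readable only by bastion-user:

    echo "<paste-account-ssh-public-key-here>" >> /home/bastion-user/.ssh/authorized_keys
    chown bastion-user:bastion-user /home/bastion-user/.ssh/authorized_keys
    chmod 600 /home/bastion-user/.ssh/authorized_keys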

  4. Run the following commands on the shell to complete the setup.

    sudo bash -c 'echo "GatewayPorts yes" >> /etc/ssh/sshd_config'
    sudo service sshd restart
    
Step 4: Configuring your cluster

After configuring the bastion node, bring up the cluster in the Advanced Configuration tab of the QDS UI. The cluster settings should look similar to this:

_images/BastionGCP3.png
Step 5: Verifying your setup

Once your cluster is up, perform the following steps to verify the setup. This will confirm that the cluster is running successfully with a bastion node.

  1. Verify that port 7000 is open on the bastion node by running the following command on the bastion node:

    sudo netstat -nlp | grep 7000
    
  2. Verify that port 10000 is open on the coordinator node by running the following command on the coordinator node:

    sudo netstat -nlp | grep 10000
    
Performance Monitoring with Ganglia

Ganglia is a scalable, distributed system designed to monitor clusters and grids while minimizing its impact on their performance. When you enable Ganglia monitoring on a cluster, you can view the performance of the cluster as a whole as well as inspect the performance of individual node instances. You can also view various Hadoop metrics for each node instance.

How to Enable Ganglia Monitoring

Perform the following steps to enable Ganglia Monitoring:

  1. Sign in to your Qubole account.
  2. Navigate to Control Panel; the Clusters tab is displayed by default. Click the edit button for the cluster for which you want to enable Ganglia monitoring.
  3. On the Edit Cluster page, select Enable Ganglia Monitoring in the Cluster Settings section. The setting is applied when the cluster is restarted.
How to View Ganglia Metrics

Navigate to https://<your_platform>.qubole.com/ganglia-metrics-<cluster_id> to see the Ganglia metrics for a specific cluster; for example, for an AWS cluster with ID 18, go to https://api.qubole.com/ganglia-metrics-18.

Collecting Cluster Metrics

When Ganglia monitoring is enabled on a cluster, you can also collect the cluster metrics using the Cluster Metrics API.

Cluster and Node Termination Causes

Clusters terminate for various reasons, not necessarily manual shutdown or a preset automatic termination. Cluster nodes can be terminated during autoscaling or a health check. The following lists the different causes of cluster and node termination:

Cluster Termination Causes

These are some of the reasons clusters terminate:

  • INACTIVITY: This is the reason displayed when a cluster is terminated due to inactivity. It is governed by the Idle Cluster Timeout, which is configurable in hours and/or minutes. If no jobs are running on a cluster or a cluster node reaches its hourly boundary, Qubole identifies the cluster as inactive and terminates it. For more information, see Shutting Down an Idle Cluster and Aggressive Downscaling.
  • HEALTH_CHECK_FAILED: This is the reason generally displayed when QDS discovers an unhealthy cluster. For example, if there are no nodes in a running cluster, or the ResourceManager does not exist, QDS identifies the cluster as unhealthy and terminates it.
Node Termination Causes

These are some of the reasons due to which cluster nodes terminate:

  • User Initiated: This is the reason displayed when Qubole terminates the cluster node as part of autoscaling or automated cluster lifecycle management. It is also the reason displayed when a user manually terminates the cluster node.
  • Cluster Instance Terminated: This is the reason displayed when nodes are terminated during the cluster termination.
  • Cloud Provider Initiated: This is the reason displayed when nodes are terminated due to the Spot (AWS, Azure) or Preemptible VM (GCP) node interruption.
  • NA: This is the reason displayed when Qubole could not capture the cause for the cluster node termination.
  • Service Initiated: This is the reason displayed when there is a Spot node loss (initiated by the Cloud Provider).
  • Server.InternalError: This is the reason displayed when there is an internal server error due to which the cluster gets terminated. An error at the Cloud Provider’s end causes this error.
  • HEALTH_CHECK_FAILED: This is the reason displayed generally when there are unhealthy cluster nodes.

For more information, see Clusters.

Engines Administration

This section describes the administration of different query engines.

Hive Administration

This section contains information on administering the Hive service. It is intended for system administrators and users managing Hive in QDS.

Configuring a HiveServer2 Cluster

QDS supports HiveServer2 on Hadoop clusters.

Enable HiveServer2 in the QDS UI, in the Hive Settings section under the Advanced Configuration tab of the Clusters page.

You can configure HS2 clusters to use private IP addresses for communication between the coordinator node and worker nodes at the cluster level by passing hive.hs2.cluster.use.private.ip=true as an override in the cluster’s Advanced Configuration > HIVE SETTINGS > Override Hive Configuration. To enable it at the account level, create a ticket with Qubole Support.

Enabling HiveServer2

All queries run on a HiveServer2-enabled cluster use HiveServer2. If you want to enable HiveServer2 only for a specific query, without enabling it on a cluster, add this to the query:

set hive.use.hs2=true;

When HiveServer2 is enabled, all Hive queries are executed on the cluster coordinator node, including queries running DDL statements such as ALTER TABLE RECOVER PARTITIONS.

Note

Once Qubole has enabled Hive Authorization in your account:

  • QDS sets hive.security.authorization.enabled to true, and adds it to Hive’s Restricted List. This prevents users from bypassing Hive authorization when they run a query.
  • If you want to change the setting of hive.security.authorization.enabled at the cluster level, you can do so in the QDS UI: set it in the Override Hive Configuration field in the Hive Settings section under the Advanced Configuration tab of a Hadoop (Hive) cluster, then restart the cluster.
  • To change the setting at the account level, create a Qubole support ticket.
Verifying that Hive Queries are using HiveServer2

To verify that queries are being directed to HiveServer2, look in the query logs for statements similar to the following:

2016-11-08 04:19:22,485 INFO hivecli.py:412 - getStandaloneCmd - Using HS2
Connecting to jdbc:hive2://<master-dns>:10003
Connected to: Apache Hive (version 2.1.0)
Driver: Hive JDBC (version 2.1.0)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Understanding Memory Allocation to HiveServer2

When HiveServer2 is enabled, QDS reserves memory for it when the cluster starts, allocating approximately 25-30% (depending on the memory given to other daemons on the cluster) of the memory obtained from the YARN ResourceManager to HiveServer2. This affects how many concurrent queries users can run on the cluster: QDS configures the concurrency limit in the node bootstrap script, allocating 500 MB per query.
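
For example (an illustrative calculation): if HiveServer2 is allocated about 15 GB of the memory obtained from the YARN ResourceManager, the 500 MB-per-query limit allows roughly 30 concurrent queries.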

HiveServer2 is deployed on the coordinator node, so you should configure a powerful instance type for it; the total RAM size must be at least 15 GB.

Create a ticket with Qubole Support for help if you are not sure about the optimal configuration.

Note

You can enable HiveServer2, with additional settings, through a Hadoop API call as described in hive-server-api-parameter.

Understanding Hive Metadata Caching

Hive Metadata Caching, supported on Hadoop clusters, currently reduces the split computation time for ORC files by caching the metadata required for split computation in Redis running on the coordinator node. It is very useful when the data contains many ORC files. Qubole plans to extend this feature to Parquet files in the near future. Configure it in the QDS UI, in the Hive Settings section under the Advanced Configuration tab of the Clusters page:

_images/HiveServer2.png

If you do not see the Enable Hive Metadata Cache option for a Hadoop cluster, create a ticket with Qubole Support to enable it for your QDS account.

Note

As a prerequisite, enable Hive on coordinator or Hive on HiveServer2 before enabling Hive Metadata Cache.

Once metadata caching is enabled on a QDS account, it is enabled by default only on new Hadoop (Hive) clusters; it remains disabled on existing clusters.

Enabling metadata caching installs a Redis server on the cluster. You can turn caching on and off at the query level by setting hive.qubole.metadata.cache to true or false. You can also add this setting in the Hive bootstrap script.
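
For example, to turn the cache on for a single query, prepend a set statement to the query, following the same pattern as other query-level Hive settings:

set hive.qubole.metadata.cache=true;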

You can also enable Hive Metadata caching through a REST API call as described in hive-server-api-parameter.

Setting Time-To-Live in the JVMs for DNS Lookups on a Running Cluster

Qubole supports configuring the Time-To-Live (TTL) for DNS lookups in the JVMs of a running cluster (except Airflow and Presto clusters). This feature is not enabled by default; create a ticket with Qubole Support to enable it. The recommended TTL value is 60 seconds.

Understanding the Hive Metastore Server

Qubole uses the Hive Metastore Server (HMS), also known as the thrift metastore server, to respond to Hive queries.

Starting and Stopping HMS

The startup script of HMS is stored in the following location: /usr/lib/hive/bin/thrift-metastore.

Use this command to start HMS: sudo /usr/lib/hive/bin/thrift-metastore server start.

Use this command to stop HMS: sudo /usr/lib/hive/bin/thrift-metastore server stop.
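
For example, to restart HMS after a configuration change, run the stop command followed by the start command shown above:

sudo /usr/lib/hive/bin/thrift-metastore server stop
sudo /usr/lib/hive/bin/thrift-metastore server start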

Configuring the HMS Memory

Create a ticket with Qubole Support to increase the metastore’s maximum heap memory at the cluster level. (If you get the metastore’s memory increased at the account-level through Qubole Support, then it applies to all clusters.)

Understanding Qubole Hive Authorization

Hive authorization is one of the methods to authorize users for various accesses and privileges. Qubole provides SQL Standard-based authorization with some additional controls and differences from the open source. See SQL Standard Based Hive Authorization for more information.

Qubole’s Hive authorization is aimed at providing Qubole Hive users the ability to control granular access to Hive tables and columns. It is also aimed at providing granular control over the type of privileges a Hive user can have over a Hive table.

Warning

Hive notebooks are in the beta phase. Because of potential security concerns, you can experiment with Hive notebooks but should not use them in production.

Understanding Privileges for Users and Roles

Privileges are granted to users and user-roles. A user can be assigned more than one role. These are the default roles available in Hive:

  • public - By default, all users are assigned the public role.

  • admin - Only a few users are assigned the admin role, which has all privileges. An admin can assign/unassign the admin role to/from a user.

    An admin can:

    • Create a role
    • Drop a role
    • Show roles
    • Show Principals
    • Use the dfs, add, delete, compile, and reset commands. However, Qubole Hive authorization also allows regular users to use the add and delete commands, which is a variation from open source Hive. See Differences from the Open Source Hive for more information.
    • Add or drop functions and macros

When you run a Hive query command, Qubole Hive checks the privileges granted to you with the current role.

Required Privileges for Performing Hive Operations

These are the required privileges for performing Hive operations:

  • SELECT privilege: It provides read access to an object (table).
  • INSERT privilege: It provides the ability to add data to an object (table).
  • UPDATE privilege: It provides the ability to run UPDATE queries on an object (table).
  • DELETE privilege: It provides the ability to delete data in an object (table).
  • ALL privilege: It provides all of the above privileges.
Enabling Qubole Hive Authorization

Hive Authorization is not enabled by default. To enable Hive Authorization in a QDS account, create a Qubole support ticket.

Using Qubole Hive Authorization describes how to use the Qubole Hive authorization.

Note

Once Qubole has enabled Hive Authorization in your account:

  • QDS sets hive.security.authorization.enabled to true, and adds it to Hive’s Restricted List. This prevents users from bypassing Hive authorization when they run a query.
  • If you want to change the setting of hive.security.authorization.enabled at the cluster level, you can do so in the QDS UI: set it in the Override Hive Configuration field in the Hive Settings section under the Advanced Configuration tab of a Hadoop (Hive) cluster, then restart the cluster.
  • To change the setting at the account level, create a Qubole support ticket.
Differences from the Open Source Hive

Qubole Hive Authorization has the following differences from the open source Hive:

  • Qubole has enabled the add and delete commands for users, unlike open source Hive, where commands such as dfs, add, delete, compile, and reset are disabled.

  • Qubole has disabled filesystem-level checks. Open source Hive does filesystem-level checks to see if the user has READ, WRITE, and OWNERSHIP of the location hierarchy.

    Qubole has disabled the filesystem-level check for Cloud Object Storage due to following reasons:

    • The permissions do not translate well into READ, WRITE, or OWNERSHIP for Cloud Object Storage as they do for HDFS.
    • The permission checks occur for the entire location hierarchy: for a directory, Hive recursively checks each file in that directory for permissions, and if a directory does not exist, Hive recursively checks for permissions one level up until it reaches an existing directory. In Cloud Object Storage, this behavior would mean a large number of Cloud Object Storage calls, leading to high command latency. By default, hive.authz.disable.fs.check is set to true. To revert to the open source Hive behavior, set hive.authz.disable.fs.check to false.
Known Issues in the Qubole Hive Authorization

The following are the known issues in the Qubole Hive Authorization in Qubole Hive 2.1:

  • Explain Queries do not check for the SELECT privilege.
  • Grant role with the admin option is not working with the IAM Roles authorization.
Using Qubole Hive Authorization

Understanding Qubole Hive Authorization describes Hive authorization, privileges, and known issues. Hive Authorization is not enabled in QDS by default. To enable it for your account, create a Qubole Support ticket.

Once Qubole has enabled Hive Authorization in your account, QDS sets hive.security.authorization.enabled to true, and adds it to Hive’s Restricted List. This prevents users from bypassing Hive authorization when they run a query. If you later want to change the setting of hive.security.authorization.enabled at the cluster level, you can do so in the QDS UI: set it in the Override Hive Configuration field in the Hive Settings section under the Advanced Configuration tab of a Hadoop (Hive) cluster, then restart the cluster. To change the setting at the account level, create a Qubole support ticket.

  • To use Hive tables, use <username>@<emaildomain.com> as the login username; for example, if your username is user1, log in as user1@xyz.com. The default password is empty.
  • QDS Hive has two users, user and admin, as in open-source Hive, and two default roles, public and admin.
  • The admin user can create custom roles in addition to the default roles (for example, a role called finance).
  • As an admin, you can grant these roles to users.
  • The admin can also grant privileges to users as described in Understanding Privileges for Users and Roles. For example, you can grant the SELECT and INSERT privileges on the default_qubole_memtracker table to the finance role (see the sketch after this list).
  • You can set hive.qubole.authz.strict.show.tables=true as a Hadoop override on the Cluster page of the QDS UI, to allow users to see only tables they have SELECT access to when they run SHOW TABLES.
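
The following statements sketch the flow described above: an admin creates the finance role, grants it to a user, and grants the role privileges on a table. This is an illustrative sketch only; the role, user, and table names are placeholders, and you should confirm the exact GRANT syntax against the SQL Standard Based Hive Authorization documentation for your Hive version.

-- Run as a user with the admin role.
CREATE ROLE finance;

-- Make a user a member of the role (the user name is a placeholder).
GRANT finance TO USER user1;

-- Grant table-level privileges to the role.
GRANT SELECT, INSERT ON TABLE default_qubole_memtracker TO ROLE finance;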

For more information, see the Hive documentation.

Presto Administration

This section explains the topics related to the Presto administration.

  • Presto Configuration in QDS

    Configuring a Presto Cluster

    A single Qubole account can run multiple clusters. By default, Qubole provides a Presto cluster, along with Hadoop and Spark clusters, for each account.

    The following topics explain Presto custom configuration and the presto catalog properties:

    Note

    QDS provides the Presto Ruby client for better overall performance, processing DDL queries much faster and quickly reporting errors that a Presto cluster generates. For more information, see this blog.

    To view or edit a Presto cluster’s configuration, navigate to the Clusters page and select the cluster with the label presto.

    Click the edit icon in the Action column against a Presto cluster to edit the configuration.

    Note

    Presto queries are memory-intensive. Choose instance types with ample memory for both the coordinator and worker nodes.

    Presto versions 0.208 and 317 are the two supported stable versions.

    See QDS Components: Supported Versions and Cloud Platforms for the latest version information.

    Note

    Qubole can automatically terminate a Presto cluster with an invalid configuration. This capability is available for Beta access; Create a ticket with Qubole Support to enable it for your account.

    Check the logs in /usr/lib/presto/logs/server.log if there is a cluster failure or configuration error. See Presto FAQs for more information about Presto logs.

    On AWS, Azure, or GCP, select Enable Rubix to enable RubiX. See Configuring RubiX in Presto and Spark Clusters for more information.

    See Managing Clusters for more information on cluster configuration options that are common to all cluster types.

    Avoiding Stale Caches

    The cache parameters are useful to tweak if you expect data to change rapidly.

    For example, if a Hive table adds a new partition, it may take Presto 20 minutes to discover it. If you plan on changing existing files in the Cloud, you may want to make fileinfo expiration more aggressive. If you expect new files to land in a partition rapidly, you may want to reduce or disable the dirinfo cache.

    Understanding the Presto Engine Configuration

    The cluster settings page has a text box labelled Override Presto Configuration which you can use to customize a Presto cluster. An entry in this box can have multiple sections; each section should have a section title, which serves as the relative pathname of the configuration file in the etc directory of Presto, followed by the configuration. You can configure JVM settings, common Presto settings and connector settings from here. You can learn more about these sections in Presto’s official documentation. Here is an example custom configuration:

    jvm.config:
    -Xmx10g
    
    config.properties:
    ascm.enabled=false
    
    catalog/hive.properties:
    hadoop.cache.data.enabled=false
    

    Some important parameters for each configuration are covered in the following sections.

    jvm.config

    These settings are used when launching the Presto server JVM. They are populated automatically and generally do not require custom values.

    Each entry below lists an example, the default value, and a description.

    • -Xmx - Example: -Xmx10g; Default: 70% of instance memory. Sets the JVM heap size; for example, -Xmx10g allocates 10 GB for the JVM heap.
    • -XX:+ExitOnOutOfMemoryError - Example: false; Default: true. Because an OutOfMemoryError typically leaves the JVM in an inconsistent state, Qubole forcibly terminates the process when it occurs by setting this JVM property.
    • -Djdk.nio.maxCachedBufferSize - Example: 2097900; Default: 2097152. Limits the amount of native memory used for NIO buffers, which prevents an increase in the non-heap memory usage of the JVM process. Its value is set in bytes.
    Presto Configuration Properties

    The config.properties settings are described in the following sections.

    Understanding the Autoscaling Properties

    Note

    For a Presto cluster, the P icon appears against the Presto Overrides, but the push operation applies only to the few autoscaling properties listed below; it is not applicable to other properties. If you try to push other configuration properties (including ones you have removed), their values are not refreshed in the running cluster, which continues to use the previously applied values.

    Each entry below lists examples, the default value, whether the property can be pushed into a running cluster, and a description.

    • ascm.enabled - Examples: true, false; Default: true; Pushable: No. Enables autoscaling.
    • ascm.upscaling.enabled - Examples: true, false; Default: true; Pushable: Yes. Enables upscaling.
    • ascm.downscaling.enabled - Examples: true, false; Default: true; Pushable: Yes. Enables downscaling.
    • ascm.bds.target-latency - Examples: 1m, 50s; Default: 1m; Pushable: Yes. The target latency for jobs. Increasing it makes autoscaling less aggressive.
    • ascm.bds.interval - Examples: 10s, 1m; Default: 10s; Pushable: No. The periodic interval after which reports are gathered and processed to determine the cluster's optimal size.
    • ascm.completition.base.percentage - Examples: 1, 3; Default: 2; Pushable: Yes. The percentage of the starting and ending phases of query execution during which Qubole does not consider query metrics in autoscaling decisions. With the default value of 2, metrics gathered before 2% and after 98% of query completion are not considered in autoscaling decisions.
    • ascm.downscaling.trigger.under-utilization-interval - Examples: 5m, 45s; Default: 5m; Pushable: No. The time interval during which all cycles of report processing must suggest scaling down before the cluster is actually scaled down. For example, when this interval is set to 5m, the scaling logic initiates downscaling only if all reports during a 5-minute interval suggest that the cluster is under-utilized. This safeguards against temporary blips that would otherwise cause downscaling.
    • ascm.downscaling.group-size - Examples: 5, 8; Default: 5; Pushable: Yes. Downscaling happens in steps; this value is the number of nodes removed per downscaling cycle.
    • ascm.sizer.min-cluster-size - Examples: 2, 3; Default: 1; Pushable: Yes. The minimum cluster size, that is, the minimum number of cluster nodes. It is also available as a UI option on the Presto cluster UI.
    • ascm.sizer.max-cluster-size - Examples: 3, 6; Default: 2; Pushable: Yes. The maximum cluster size, that is, the maximum number of cluster nodes. It is also available as a UI option on the Presto cluster UI.
    • ascm.upscaling.trigger.over-utilization-interval - Examples: 4m, 50s; Default: value of ascm.bds.interval; Pushable: No. The time interval during which all cycles of report processing must suggest scaling up before the cluster is actually scaled up.
    • ascm.upscaling.group-size - Examples: 9, 10; Default: Infinite; Pushable: Yes. Upscaling happens in steps; this value is the number of nodes added per upscaling cycle (capped by the maximum size set for the cluster).
    • query-manager.required-workers - Examples: 4, 6; Default: NA; Pushable: No. The number of worker nodes that must be present in the cluster before a query is scheduled to run. A query is scheduled only after the configured query-manager.required-workers-max-wait timeout. This is supported only in Presto 0.193 and later versions. For more information, see Configuring the Required Number of Worker Nodes.
    • query-manager.required-workers-max-wait - Examples: 7m, 9m; Default: 5m; Pushable: No. The maximum time a query can wait before being scheduled on the cluster if the required number of worker nodes set for query-manager.required-workers could not be provisioned. For more information, see Configuring the Required Number of Worker Nodes.
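
    For example, a less aggressive autoscaling setup could be pushed as a Presto override similar to the sketch below; the property names come from the list above, but the values are purely illustrative:

    config.properties:
    ascm.bds.target-latency=2m
    ascm.downscaling.trigger.under-utilization-interval=10m
    ascm.downscaling.group-size=2
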
    Understanding the Query Execution Properties

    Note

    For information on disabling the reserved pool, see Disabling Reserved Pool.

    These query execution properties are applicable to Presto 0.208 and later versions.

    Each entry below lists examples, the default value, and a description.

    • query.max-concurrent-queries - Examples: 2000; Default: 1000. The number of queries that can run in parallel.
    • query.max-execution-time - Examples: 20d, 45h; Default: 100d. The time limit on query execution time; it considers only the time spent in the query execution phase. The default value is 100 days. This parameter can be set in any of these time units:

      • Nanoseconds, denoted by ns
      • Microseconds, denoted by us
      • Milliseconds, denoted by ms
      • Seconds, denoted by s
      • Minutes, denoted by m
      • Hours, denoted by h
      • Days, denoted by d

      Its equivalent session property is query_max_execution_time, which can also be specified in any of the time units given above.

    • query.max-memory-per-node - Examples: 10GB, 20GB; Default: 30% of heap memory. The maximum amount of user memory that a query may use on a machine. User memory is memory controllable by a user based on the query; for example, memory for performing aggregations, JOINs, sorting, and so on is allocated from user memory, as the amount required depends on the number of groups, JOIN keys, or values to be sorted.
    • query.max-memory - Examples: 80GB, 20TB; Default: 100TB. The maximum memory that a query can take, aggregated across all nodes. To decrease or modify the default value, add it as a Presto override or set the query_max_memory session property.
    • query.schedule-split-batch-size - Examples: 1000, 10000; Default: 1000. The number of splits scheduled at once.
    • query.max-queued-queries - Examples: 6000; Default: 5000. The number of queries that can be queued. See Queue Configuration for more information on advanced queuing configuration options.
    • optimizer.optimize-single-distinct - Examples: false; Default: true. Enables the single distinct optimization, which tries to replace multiple DISTINCT clauses with a single GROUP BY clause; this can substantially speed up query execution. It is only supported in Presto 0.193 and earlier versions.
    • qubole-max-raw-input-data-size - Examples: 1TB, 5GB; Default: NA. Set this property to limit the total bytes scanned by queries executed on a given cluster. Queries that exceed this limit fail with the RAW_INPUT_DATASIZE_READ_LIMIT_EXCEEDED exception, which ensures that rogue queries do not run for a very long time. qubole_max_raw_input_datasize is the equivalent session property.
    • query.max-total-memory-per-node - Examples: 10GB, 21GB; Default: 30% of heap memory. The maximum amount of user and system memory that a query may use on a machine. A user cannot control system memory; it is used during query execution by readers, writers, buffers for exchanging data between nodes, and so on. The value of query.max-total-memory-per-node must be greater than or equal to query.max-memory-per-node.
    • memory.heap-headroom-per-node - Examples: 10GB; Default: 20% of heap memory. The amount of JVM heap memory set aside as headroom/buffer for allocations that are not tracked by Presto in the user or system memory pools. The above default memory pool configuration for Presto 0.208 results in 30% of the heap for the reserved pool, 20% heap headroom for untracked memory allocations, and the remaining 50% of the heap for the general pool.
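
    As an illustration, the session-level equivalents mentioned above can be applied per query; the values and the table name below are placeholders:

    SET SESSION query_max_execution_time = '2h';
    SET SESSION query_max_memory = '50GB';
    SET SESSION qubole_max_raw_input_datasize = '500GB';
    SELECT count(*) FROM my_schema.my_table;   -- my_schema.my_table is a placeholder
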
    Understanding the Task Management Properties
    Each entry below lists examples, the default value, and a description.

    • task.max-worker-threads - Examples: 10, 20; Default: 4 * cores. The maximum number of worker threads per JVM.
    • task.writer-count - The value must be a power of 2; Default: 1. The number of concurrent writer tasks per worker per query when inserting data through INSERT or CREATE TABLE AS SELECT operations. You can set this property to make data writes faster. The equivalent session property is task_writer_count, and its value must also be a power of 2. For more information, see Configuring the Concurrent Writer Tasks Per Query.

    Caution

    Use this configuration judiciously to avoid overloading the cluster with excessive resource utilization. It is recommended to set a higher value through the session property only for queries that generate large outputs, for example, ETL jobs.
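
    Following the caution above, a higher writer count is best applied through the session property for a specific large write; the value 4 and the table names in this sketch are illustrative, and the value must be a power of 2:

    SET SESSION task_writer_count = 4;
    INSERT INTO target_table SELECT * FROM source_table;   -- table names are placeholders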

    Understanding the Timestamp Conversion Properties
    • client-session-time-zone - Example: Asia/Kolkata; Default: NA. The timestamp fields in the output are automatically converted into the timezone specified by this property. It is helpful when you are in a different timezone than the Presto server; without this configuration, the timestamp fields in the output are displayed in the server timezone.
    Understanding the Query Retry Mechanism Properties
    • retry.autoRetry - Examples: true, false; Default: true. Enables the Presto query retry mechanism at the cluster level.
    • retrier.max-wait-time-local-memory-exceeded - Examples: 2m, 2s; Default: 5m. The maximum time to wait for new nodes to join the cluster before Presto gives up on retrying, when a query has failed with the LocalMemoryExceeded error. Its value is configured in seconds or minutes, for example, 2s or 2m. If a new node does not join the cluster within this time period, Qubole returns the original query failure response.
    • retrier.max-wait-time-node-loss - Examples: 2m, 2s; Default: 3m. The maximum time to wait for new nodes to join the cluster before Presto gives up on retrying, when a query has failed due to Spot node loss. Its value is configured in seconds or minutes, for example, 2s or 2m. If a new node does not join the cluster within this configured time period, the failed query is retried on the smaller-sized cluster.
    • retry.nodeLostErrors - Default: "REMOTE_HOST_GONE","TOO_MANY_REQUESTS_FAILED","PAGE_TRANSPORT_TIMEOUT". A comma-separated list of Presto errors (in string form) that signify node loss.
    Using the Catalog Configuration

    A Presto catalog consists of schemas and refers to a data source through a connector. Qubole provides a simplified way to add a catalog: define its properties through the Presto overrides on the Presto cluster, using the syntax below.

    catalog/<catalog-name>.properties:
    <catalog property 1>
    <catalog property 2>
    .
    .
    .
    <catalog property n>
    
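    As an illustration, a MySQL catalog could be added through the same override mechanism; the connection details are placeholders, and the property names are the standard open-source MySQL connector properties rather than anything Qubole-specific:

    catalog/mysql.properties:
    connector.name=mysql
    connection-url=jdbc:mysql://<host>:3306
    connection-user=<username>
    connection-password=<password>
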
    catalog/hive.properties

    Qubole provides table-level security for Hive tables accessed through Presto. See Understanding Qubole Hive Authorization for more information.

    The common Hive catalog properties are described below; each entry lists examples, the default value, the supported Presto versions, and a description.

    • hive.metastore-timeout - Examples: 3m, 1h; Default: 3m; Supported in Presto versions 0.208 and 317. The timeout for Hive metastore calls, that is, how long a request waits to fetch data from the metastore before timing out.
    • hive.metastore-cache-ttl - Examples: 5m, 20m; Default: 20m; Supported in Presto versions 0.208 and 317. A data entry's life duration in the metastore cache before it is evicted. The metastore cache holds tables, partitions, databases, and so on fetched from the Hive metastore. Configuring Thrift Metastore Server Interface for the Custom Metastore describes how to configure the Hive Thrift Metastore Interface.
    • hive.metastore-cache-ttl-bulk - Examples: 20m, 1d; Default: NA; Supported in Presto 0.208 and older versions (not supported in Presto 317 and later). When you have a query that you need to run on hive.information_schema.columns, set this option as a Presto override, for example, hive.metastore-cache-ttl-bulk=24h. Enabling this option caches table entries for the configured duration when the table info is fetched (in bulk) from the metastore, which makes fetching tables and columns through JDBC drivers faster.
    • hive.metastore-refresh-interval - Examples: 10m, 20m; Default: 100m; Supported in Presto versions 0.208 and 317. The time interval for refreshing the metastore cache; after each interval expires, the metastore cache is refreshed. So, if you see stale results for a query, running the same query again fetches results without the stale data (assuming the interval has expired). Suppose instead that you disable this parameter or set it to a value higher than hive.metastore-cache-ttl, and you run the query after the entry is evicted from the metastore cache: the executed query brings the evicted entry back into the cache, but retrieving information from the metastore takes more time than reading from the cache. You can avoid seeing stale results by setting this parameter to a value lower than hive.metastore-cache-ttl. If you run a query after the refresh interval expires, the query quickly returns the cached entry and starts a background cache refresh. So, to get cached entries with a higher TTL and faster cache refreshes, set the value of hive.metastore-cache-ttl higher than hive.metastore-refresh-interval.
    • hive.security - Examples: allow-all, sql-standard; Default: allow-all; Supported in Presto versions 0.208 and 317. sql-standard enables Hive authorization. See Understanding Qubole Hive Authorization for more information.
    • hive.skip-corrupt-records - Examples: true, false; Default: false; Supported in Presto versions 0.208 and 317 (available in Presto 0.180 and later). Skips corrupt records in input formats other than orc, parquet, and rcfile. You can also set it as a session property, as hive.skip_corrupt_records=true, when the active cluster does not have this configuration globally enabled. Note that the behavior for a corrupted file is non-deterministic: Presto might read part of the file before hitting corrupt data, in which case the QDS record reader returns whatever it has read until that point and skips the rest of the file.
    • hive.information-schema-presto-view-only - Examples: true, false; Default: true; Supported in Presto versions 0.208 and 317. Enabled by default, so the information schema includes only the Presto views and not the Hive views. When set to false, the information schema includes both the Presto and Hive views.
    • hive.metastore.thrift.impersonation.enabled - Examples: true, false; Default: false; Supported in Presto version 317. Adds impersonation support for calls to the Hive metastore, allowing Presto to impersonate the user who runs the query when accessing the Hive metastore.
    • hive.max-partitions-per-scan - Examples: 100000, 150000; Default: 100000; Supported in Presto versions 0.208 and 317. The maximum number of partitions for a single table scan.
    • hive.max-execution-partitions-per-scan - Examples: 180000, 150000; Default: the configured value of hive.max-partitions-per-scan; Supported in Presto versions 0.208 and 317. You can use this property along with a relaxed limit on hive.max-partitions-per-scan when dynamic partition pruning is expected to reduce the number of partitions scanned at runtime. Note that using this runtime limit can cause Presto to scan data from hive.max-execution-partitions-per-scan partitions per table scan before it finds that it has breached the limit and fails the query.
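
    For example, a few of these properties could be combined into a single Hive catalog override; the values shown are illustrative only:

    catalog/hive.properties:
    hive.metastore-timeout=5m
    hive.skip-corrupt-records=true
    hive.information-schema-presto-view-only=false
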
    Using the Qubole Presto Server Bootstrap

    The Qubole Presto Server Bootstrap is an alternative to the Node Bootstrap Script for installing external jars, such as presto-udfs, before the Presto server is started. The Presto server comes up before the node bootstrap process is completed, so installing external jars (for example, Presto UDFs) through the node bootstrap requires an explicit restart of the Presto daemons. This can be problematic because the server may have already started running a task, and restarting the Presto daemons can then cause query failures. Hence, the Qubole Presto Server Bootstrap is better suited for such changes.

    The Qubole Presto Server Bootstrap is only supported in Presto 0.180 and later versions.

    Warning

    Use the Qubole Presto Server Bootstrap only if you want to execute some script before starting the Presto server. Any script that is part of this bootstrap increases the time taken to bring up the Presto server. Hence, the time taken by the Presto server to accept a query also increases. If there is no dependency in the current cluster node bootstrap script which requires restart of the Presto daemon to pick changes, then it is recommended to use cluster’s node bootstrap only.

    There are two ways to define the Qubole Presto Server Bootstrap:

    • bootstrap.properties - You can add the bootstrap script in it.
    • bootstrap-file-path - It is the location of the Presto Server Bootstrap file in the cloud object storage that contains the bootstrap. Specifying a bootstrap-file-path is recommended when the script is too long.

    To configure the Qubole Presto Server Bootstrap for a given cluster, follow any one of these steps:

    • Through the cluster UI, add it in Advanced Configuration > PRESTO SETTINGS > Override Presto Configuration.
    • Through the REST API, add it using the custom_config parameter under presto_settings. For more information, see presto_settings.

    Caution

    Qubole Presto Server Bootstrap eliminates the need to restart the Presto daemons as such. Ensure that any explicit commands to restart or exit the Presto server are not included in the bootstrap script. The Presto server is brought up only after the Server Bootstrap is successfully executed. So it is important to verify that there are no errors in the bootstrap script. In addition, if any script or part of the script is migrated/copied from the existing cluster node bootstrap, then remove that bootstrap script or modify it appropriately to avoid the same script from running twice.

    Example of a Bootstrap Script Specified in the bootstrap.properties
    bootstrap.properties:
    mkdir /usr/lib/presto/plugin/udfs
    hadoop dfs -get <scheme>bucket/udfs_custom.jar /usr/lib/presto/plugin/udfs/
    
    Example of Specifying a Qubole Presto Server Bootstrap Location
    bootstrap-file-path:
    gs://bucket/existing-node-bootstrap-file.sh
    

    The existing-node-bootstrap-file.sh can contain the script that is shown in Example of a Bootstrap Script Specified in the bootstrap.properties. You can view the content of the existing-node-bootstrap-file.sh as follows:

    $ hadoop fs -cat <scheme>my-bucket/boostraps/existing-node-bootstrap-file.sh
    mkdir /usr/lib/presto/plugin/udfs
    hadoop dfs -get gs://bucket/udfs_custom.jar /usr/lib/presto/plugin/udfs/
    $
    
    Using Presto UDFs as a Bootstrap Script

    Presto on Qubole provides UDFs as external jars, presto-udfs. You can add them through a Presto Server bootstrap under Advanced Configuration > PRESTO SETTINGS of the Presto cluster UI. You can pick one of the following UDFs (based on Presto version) and pass them as overrides in the Override Presto Configuration text box:

    Note

    The Presto jars below are in the AWS S3 storage location.

    • UDFs for Presto version 0.208

      bootstrap.properties:
      mkdir /usr/lib/presto/plugin/udfs
      hadoop dfs -get s3://paid-qubole/presto-udfs/udfs-2.0.3.jar /usr/lib/presto/plugin/udfs/
      
    • UDFs for Presto version 317

      bootstrap.properties:
      mkdir /usr/lib/presto/plugin/udfs
      hadoop dfs -get s3://paid-qubole/presto-udfs/udfs-3.0.0.jar /usr/lib/presto/plugin/udfs/
      
    Presto Server Bootstrap Logs

    The Presto server bootstrap logs are in /media/ephemeral0/presto/var/log/bootstrap.log.

  • External Data Source Access

  • Cluster Management

    Autoscaling in Presto Clusters

    Here’s how autoscaling works on a Presto cluster:

    • The Presto Server (running on the coordinator node) keeps track of the launch time of each node.
    • At regular intervals (10 seconds by default) the Presto Server takes a snapshot of the state of running queries, compares it with the previous snapshot, and estimates the time required to finish the queries. If this time exceeds a threshold value (set to one minute by default and configurable through ascm.bds.target-latency), the Presto Server adds more nodes to the cluster. For more information on ascm.bds.target-latency and other autoscaling properties, see Presto Configuration Properties.
    • If QDS determines that the cluster is running more nodes than it needs to complete the running queries within the threshold value, it begins to decommission the excess nodes.

    Note

    Because all processing in Presto is in-memory and no intermediate data is written to HDFS, the HDFS-related decommissioning tasks are not needed in a Presto cluster.

    After new nodes are added, you may notice that they sometimes are not being used by the queries already in progress. This is because new nodes are used by queries in progress only for certain operations such as TableScans and Partial Aggregations. You can run EXPLAIN (TYPE DISTRIBUTED) (see EXPLAIN) to see which of a running query’s operations can use the new nodes: look for operations that are part of Fragments and appear as [SOURCE].
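
    For example, the following statement prints the distributed plan so that you can look for [SOURCE] fragments; the table and filter are placeholders:

    EXPLAIN (TYPE DISTRIBUTED) SELECT count(*) FROM my_schema.my_table WHERE col1 = 'value';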

    If no query in progress requires any of these operations, the new nodes remain unused initially. But all new queries started after the nodes are added can make use of the new nodes (irrespective of the types of operation in those queries).

    Configuring the Required Number of Worker Nodes

    Note

    This capability is supported only in Presto 0.193 and later versions.

    You can configure query-manager.required-workers as a cluster override to set the number of worker nodes that must be running before a query can be scheduled to run. This allows you to reduce the minimum size of Presto clusters to one without causing queries to fail because of limited resources. (While nodes are being requested from the Cloud provider and added to the cluster, queries are queued on Presto’s coordinator node. These queries are shown as Waiting for resources in the Presto web UI.)

    QDS waits for a maximum time of query-manager.required-workers-max-wait (default 5 minutes) for the configured number of nodes to be provisioned. Queries which do not require multiple worker nodes (for example, queries on JMX, system, and information schema connectors, or queries such as SELECT 1 and SHOW CATALOGS) are executed immediately. The cluster downscales to the minimum configured size when there are no active queries.
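
    A cluster-level sketch of these settings as a Presto override might look like the following; the values are illustrative:

    config.properties:
    query-manager.required-workers=5
    query-manager.required-workers-max-wait=7m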

    Qubole allows overriding the cluster-level properties, query-manager.required-workers-max-wait and query-manager.required-workers at query-level through the corresponding session properties, required_workers_max_wait and required_workers.

    Let us consider this example.

    SET SESSION required_workers=5;
    SET SESSION required_workers_max_wait='3m';
    select * from foo;
    

    This ensures that the query is not scheduled until at least 5 nodes are in the cluster or until 3 minutes have elapsed.

    The number of worker nodes that autoscaling brings up is capped by the lower of the cluster's maximum size and the resource group's maxNodeLimit (if it has been configured).

    This feature is useful for upscaling the cluster to handle scheduled ETLs and reporting jobs whose resource requirements are well known.

    Note

    These are autoscaling nodes and adhere to the existing cluster configuration for pre-emptible nodes.

    Controlling the Nodes’ Downscaling Velocity

    During each configured ascm.decision.interval (default 10s), the Presto autoscaling service removes ascm.downscaling.group-size nodes (default 5) if it has calculated the cluster's optimal size to be less than its current size continuously for the configured Cool down period. This results in a downscaling profile where no nodes are removed during the Cool down period, after which nodes are removed very aggressively until the cluster reaches its optimal size.

    This figure illustrates the downscaling profile of cluster nodes.

    _images/Downscaling-profile.png

    To control the nodes' downscaling velocity, Qubole provides a Presto cluster configuration override, ascm.downscaling.staggered=true. When you set this override on the cluster, the Cool down period (default 5 minutes) is reset every time a downscaling action is triggered. The autoscaling service does not trigger the next downscaling action until it has calculated the optimal size of the cluster to be less than the current cluster size continuously for the configured Cool down period. This results in a more gradual downscaling profile, where ascm.downscaling.group-size nodes are removed in each Cool down period until the cluster reaches its optimal size.
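
    A minimal override sketch for enabling staggered downscaling is shown below, assuming the ascm.* properties are set through config.properties as in the earlier examples; the group size value is illustrative:

    config.properties:
    ascm.downscaling.staggered=true
    ascm.downscaling.group-size=2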

    For better understanding, let us consider these two examples.

    Example 1: Consider a cluster without ascm.downscaling.staggered enabled.

    The configured Cool down period is 10m. The current cluster size is 12 and optimal size is 2 with ascm.downscaling.group-size=2.

    In this case, no nodes are removed for 10 minutes, that is, while the Cool down period lasts. After that, 2 nodes are removed every 10 seconds until the cluster size is 2.

    The total time taken to get to optimal size is (cool down period + ((current - optimal)/group_size) * 10s) = 10 minutes and 50 seconds.

    Example 2: Consider a cluster with ascm.downscaling.staggered enabled.

    The configured Cool down period is 2m. The current cluster size is 12 and optimal size is 2 with ascm.downscaling.group-size=2.

    In this case, 2 nodes are removed every 2 minutes until the cluster size is 2.

    The total time taken to get to the optimal size is ((current - optimal)/group_size) * cool down period = 10 minutes.

    In addition, Presto also supports resource groups based dynamic cluster sizing at the cluster and account levels as described in Resource Groups based Dynamic Cluster Sizing in Presto.

    Decommissioning a Worker Node

    Qubole allows you to gracefully decommission a worker node during autoscaling through the cluster’s coordinator node. If you find a problematic worker node, then you can manually remove it using the cluster API as described in Remove a Node from a Cluster.

    Spot Rebalancing in Presto

    Note

    <short-lived compute instances> refers to preemptible VMs in Qubole-on-GCP.

    Spot Rebalancing is supported in Presto. This helps in scenarios when the <short-lived compute instances> ratio of a running cluster falls short of the configured spot ratio due to unavailability or frequent terminations of spot nodes. The Spot rebalancer ensures that the cluster proactively recovers from this shortfall and it brings the cluster to a state where its <short-lived compute instances> ratio is as close as possible to its configured value.

    By default, after every 30 minutes, Qubole inspects the <short-lived compute instances> ratio of the cluster and attempts a rebalancing if the <short-lived compute instances> ratio falls short of the configured <short-lived compute instances> ratio. The time period for the <short-lived compute instances> ratio inspection is configurable using the ascm.node-rebalancer-cooldown-period parameter.

    An example of using this configuration is setting ascm.node-rebalancer-cooldown-period=1h in the Presto cluster overrides. If this example setting is used, Qubole inspects for a skewed <short-lived compute instances> ratio every hour instead of 30 minutes.

    Note

    Using very small values for ascm.node-rebalancer-cooldown-period can lead to instability in the cluster's state. This feature applies only when the aggressive downscaling feature is enabled in the Qubole account.

    For more information, see Aggressive Downscaling.

    Spot Rebalancing Advanced Configuration Properties

    These are the two advanced configuration properties:

    • ascm.sizer.max-cluster-size-buffer-percentage: While rebalancing a running cluster, Qubole tries to gracefully replace the additional running On-Demand nodes. In that process, the cluster may have to add some nodes beyond its maximum size. This configuration controls how far beyond the cluster's maximum size the cluster can grow while rebalancing, expressed as a percentage. The default value is 10.

      For example, with ascm.sizer.max-cluster-size-buffer-percentage=20, the cluster size does not exceed the maximum cluster size by more than 20% while rebalancing.

    • ascm.node-rebalancer-max-extra-stable-nodes.percentage: This configuration property decides the amount of skew allowed in the spot ratio of the running nodes in the cluster. If the skew percentage exceeds this property's value, Qubole attempts to rebalance the cluster nodes to conform to the configured spot ratio. The default value is 10.

      For example, consider ascm.node-rebalancer-max-extra-stable-nodes.percentage=15, which means that the cluster nodes are rebalanced only if the skew in the spot ratio of running nodes exceeds 15%.
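
    A combined override sketch for the rebalancing properties described above, assuming they are set through config.properties like the other ascm.* settings; the values are illustrative:

    config.properties:
    ascm.node-rebalancer-cooldown-period=1h
    ascm.sizer.max-cluster-size-buffer-percentage=20
    ascm.node-rebalancer-max-extra-stable-nodes.percentage=15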

    Resource Groups based Dynamic Cluster Sizing in Presto

    Qubole has introduced dynamic sizing of Presto clusters based on resource groups. Users are assigned to Presto resource groups, and each resource group has a configurable limit on the maximum number of nodes up to which it can independently scale the cluster. The maximum cluster size is calculated dynamically based on the active resource groups and the scaling limits.

    For information about autoscaling in Presto, see Autoscaling in Presto Clusters.

    If there is no limit on how much a single user can autoscale the cluster, then:

    • A single user can autoscale a cluster to its maximum cluster size
      • Maximum cluster size is often configured considering user concurrency
    • Admins create multiple clusters with different maximum sizes for different groups of users, and as a result:
      • Managing multiple clusters becomes an exhausting task, as configuration changes such as network changes and Presto features need to be applied to each cluster, followed by a rolling deploy.
      • Cost increases, because each cluster runs at a minimum size, so the minimum nodes have to be paid for even when clusters are idle.

    Dynamic sizing of clusters resolves the above mentioned issues. The feature is supported in Presto 0.208 and later versions.

    New Resource Group Property for User Scaling Limit

    maxNodeLimit: It denotes the maximum number of nodes this group can request from the cluster. It may be specified as an absolute number of nodes (for example, 10) or as a percentage (for example, 10%) of the cluster's maximum size. It defaults to 20% and is an optional configuration.

    Configuration Properties for Dynamic Scaling Limits

    These are the configuration properties for dynamic scaling limits that you can set under etc/resource-groups.properties.

    • resource-groups.user-scaling-limits-enabled: A boolean value that enables resource groups based autoscaling. It defaults to false.
    • resource-groups.active-user-buffer-period: The time period for which a resource group is considered active after its last query finishes. Specify it as a duration (for example, 5s). It defaults to 10 minutes (10m).

    With the dynamic scaling limits feature, an active user can scale the cluster up to only a certain limit. The maximum cluster size is then derived from the currently active users. When users from multiple resource groups are active, the maximum number of nodes that the cluster can autoscale to is the combined total of the individual maximum node limits. The maximum cluster size is never greater than the Maximum Worker Nodes in the cluster settings.

    Enable User Limits on Autoscaling through Resource Groups

    To enable user limits on autoscaling through resource groups at the account level, create a ticket with Qubole Support.

    To enable user limits on autoscaling through resource groups at the cluster level, add the following in the Presto cluster Overrides:

    resource-groups.properties:
    resource-groups.configuration-manager=file
    resource-groups.config-file=etc/default_resource_groups.json
    resource-groups.user-scaling-limits-enabled=true
    

    Once the feature is enabled, resource groups defined in the default_resource_groups.json file are used. The default resource groups json file used is:

    {
     "rootGroups": [
       {
         "name": "${USER}",
         "maxNodeLimit": "20%"
       }
     ],
     "selectors": [
       {
         "group": "${USER}"
       }
     ],
     "cpuQuotaPeriod": "1h"
    }
    
    Analyzing the default JSON File

    According to the default JSON file, every new user gets assigned to a new resource group. For example, user-1 is assigned to a resource group named user1, which is generated by expanding the resource group template ${USER}, each with a maximum node limit 20%.

    Let us consider a case of a cluster with Maximum Worker Nodes = 10 (from cluster settings).

    Note

    For additional examples, see Resource Groups Autoscaling Examples.

    In such a cluster configuration, the impact of the number of active users on the maximum possible size of the cluster is captured below (where U1 = User 1, U2 = User 2, and so on).

    • Active users U1: possible maximum size 2 (20% of 10 nodes)
    • Active users U1, U2: possible maximum size 4 (2 nodes each for U1 and U2)
    • Active users U1, U2, U3: possible maximum size 6 (2 nodes each for U1, U2, and U3)
    • Active users U1, U2, U3, U4, U5, U6: possible maximum size 10 (even though the combined potential cluster size is 12, it is limited by the maximum cluster size of 10)
    Recommendations for Consolidating Multiple Clusters

    These are the recommendations:

    • Admins should do this configuration in steps; that is, an admin should consolidate a few clusters at a time instead of configuring everything together.
    • Admins should start with replicating a multi-cluster setup in Resource Groups as illustrated in the above example. This would be the least disruptive to end users.
    • An admin must select a larger coordinator node in consolidated clusters as it handles the load of multiple coordinator nodes of a multi-cluster setup.
    Resource Groups Autoscaling Examples

    This section provides a few examples of setting up user limits on autoscaling through Resource Groups in Presto. Resource Groups based Dynamic Cluster Sizing in Presto describes the dynamic cluster sizing in detail.

    Let us consider an organization with the following structure.

    _images/RGOrgStructure.png

    For the above organization, the Presto configuration would be as below.

    _images/RGClusterOverride.png

    Let us consider another organization with user-level limits that has the following structure.

    _images/RGuserlimits.png

    The Presto configuration is the same that is used in the above example except that the resource group configuration would differ as mentioned below.

    resource-groups.properties:
    resource-groups.configuration-manager=file
    resource-groups.config-file=etc/resource_groups.json
    resource-groups.user-scaling-limits-enabled=true
    
    resource_groups.json:
    {
      "rootGroups": [
        {
          "name": "BU1",
          "maxNodeLimit": "50%",
          "subGroups": [
            {
              "name": "Team1",
              "maxNodeLimit": "60%",
              "subGroups": [
                {
                  "name": "${USER}",
                  "maxNodeLimit": "40%"
                }
              ]
            },
            {
              "name": "Team2",
              "maxNodeLimit": "50%",
              "subGroups": [
                {
                  "name": "${USER}",
                  "maxNodeLimit": "80%"
                }
              ]
            }
          ]
        },
        {
          "name": "BU2",
          "maxNodeLimit": "80%",
          "subGroups": [
            {
              "name": "Team1",
              "maxNodeLimit": "20%"
            },
            {
              "name": "Team2",
              "maxNodeLimit": "80%",
              "subGroups": [
                {
                  "name": "${USER}",
                  "maxNodeLimit": "20%"
                }
              ]
            },
            {
              "name": "Team3",
              "maxNodeLimit": "30%"
            }
          ]
        },
        {
          "name": "BU3",
          "maxNodeLimit": "16%"
        }
      ],
      "selectors": [
        {
          "user": "User[1-3]",
          "group": "BU1.Team1.${USER}"
        },
        {
          "user": "User[4-5]",
          "group": "BU1.Team2.${USER}"
        },
            {
          "user": "*@team1.com",
          "group": "BU2.Team1"
        },
        {
          "user": "User[6-9]",
          "group": "BU2.Team2.${USER}"
        },
        {
          "user": "*@team3.com",
          "group": "BU2.Team3"
        },
        {
          "user": "User9",
          "group": "BU3"
        }
      ],
      "cpuQuotaPeriod": "1h"
    }
    
  • SQL

    JOIN Optimizations

    Presto on Qubole (version 0.208 and later) has the ability to do stats-based determination of the JOIN distribution type (between BROADCAST and PARTITIONED) and JOIN reordering by the following methods:

    Using the Hive Metastore

    When table statistics are present in the Hive metastore, Presto’s cost-based-optimizer tries to optimize a query plan by choosing the right type of the JOIN implementation on the basis of memory, CPU, and network cost for every JOIN node in the plan. As the schema evolves, statistics must be generated, maintained, and updated for correct estimates. To address the collection and maintenance of the statistics, Qubole provides an Automatic Stats Collection framework.

    For using the automatic JOIN type determination, you can set:

    • join-distribution-type=AUTOMATIC in config.properties at the cluster level
    • join_distribution_type=AUTOMATIC in session properties

    Note

    This feature cannot be used if the property distributed-join is already set in the session or cluster config.properties.

    For using the automatic JOIN reordering, you can set:

    • optimizer.join-reordering-strategy=AUTOMATIC in config.properties at the cluster level
    • join_reordering_strategy=AUTOMATIC in session properties
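
    For example, to enable automatic JOIN type determination and reordering for a single query, the session properties could be set as in this sketch; the table and column names are placeholders:

    SET SESSION join_distribution_type = 'AUTOMATIC';
    SET SESSION join_reordering_strategy = 'AUTOMATIC';
    SELECT count(*) FROM orders o JOIN lineitem l ON o.orderkey = l.orderkey;
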
    Using Table Size

    Qubole has also introduced the notion of estimating table statistics on the basis of the table’s size on the storage layer. It is useful in cases where statistics for tables are not available. If Presto on Qubole is unable to find table statistics, it can fetch the size of the table on the storage layer and estimate the size and the number of rows in the table. This estimate can currently be used to determine the JOIN distribution type and reordering of tables in a multi-JOIN scenario.

    Note

    This feature is part of Gradual Rollout and it is only available for Hive tables.

    For enabling table size based stats to determine join distribution type, you can set:

    • At the cluster level under Advanced Configuration > PRESTO SETTINGS > Override Presto Configuration in the cluster’s UI.

      config.properties:
      join-distribution-type=AUTOMATIC
      enable-file-size-stats-join-type=true
      
    • In a query, as session properties

      join_distribution_type=AUTOMATIC
      enable_file_size_stats_join_type=true
      

    For enabling table size-based stats to determine JOIN order, you can set:

    • At the cluster level under Advanced Configuration > PRESTO SETTINGS > Override Presto Configuration in the cluster’s UI.

      config.properties:
      optimizer.join-reordering-strategy=AUTOMATIC
      join-distribution-type=AUTOMATIC
      enable-file-size-stats-join-reorder=true
      
    • In a query as session properties

      join_reordering_strategy=AUTOMATIC
      join_distribution_type=AUTOMATIC
      enable_file_size_stats_join_reorder=true
      
    Configuring Timeout for Fetching Hive Table Size

    hive.table-size-listing-timeout is the property that you can use to set the timeout for listing Hive table sizes. The determination of a table size on the storage layer is performed through a listing on the table’s storage location. The listing operation is bound by a timeout to avoid any significant delays in the query execution time. This configuration controls the timeout for the listing operation.

    An example usage of this configuration is adding hive.table-size-listing-timeout=2s to the Hive catalog properties. It would mean that the listing operation on a table’s storage location is bound to complete in 2 seconds. If it does not finish within the timeout, the table is considered to be very large.
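
    A minimal catalog override sketch for this timeout, reusing the 2-second value from the example above:

    catalog/hive.properties:
    hive.table-size-listing-timeout=2s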

    Metrics for JOINS in a Query

    The distribution type of JOINS in a query is visible in the Presto query info under the joinDistributionStats key name.

    Strict Mode for Presto Queries

    The Qubole Data Service (QDS) platform orchestrates thousands of clusters in the cloud for its customers on a daily basis. Cluster admins' experience shows that even with Qubole's accurate workload-aware autoscaling, there is always a ceiling on the cluster operation budget (hence the need to set a maximum number of nodes for the cluster). A laxly written SQL statement can waste a lot of resources and, under resource contention, affect other workloads too.

    While handling massive data workloads, issues that arise typically include:

    • Scanning a large amount of data from the entire table
    • Having a massive CROSS JOIN between two large tables without CONDITIONS
    • Sorting millions of ROWs without LIMITS or reduced scope

    The above issues not only result in a poor user experience but also inflate the cloud cost significantly, which is often discovered only in hindsight.

    To overcome these issues, Qubole provides a feature known as Presto Strict Mode, which, once enabled, restricts users from executing certain queries.

    It supports three types of restrictions as mentioned here:

    • MANDATORY_PARTITION_CONSTRAINT: It restricts queries on a partitioned table to have a PREDICATE on at least one of the partition columns.

      Example:

      SELECT * FROM <TABLE_NAME> - This query fails with an error:

      Table scan on partitioned table: <TABLE_NAME> without filter or constraint.

      Whereas SELECT * FROM <TABLE_NAME> WHERE <predicate on partition> gets executed successfully.

      Qubole has fixed the MANDATORY_PARTITION_CONSTRAINT rule of Strict Mode in Presto 0.208 to allow queries, which use a predicate expression on any partitioned column while scanning a partitioned table.

    • DISALLOW_CROSS_JOIN: It restricts queries with a CROSS JOIN and thus such queries fail.

      Example:

      SELECT * FROM <TABLE_1> CROSS JOIN <TABLE_2> - This query fails with an error:

      Cross joins are not allowed when strict mode is enabled.

    • LIMITED_SORT: Queries are allowed to sort only a limited number of output rows.

      Example:

      SELECT * FROM <TABLE_1> ORDER BY <COL_2> - This query fails with an error:

      Sorting without limit clause is not allowed when strict mode is enabled.

    Over time, Qubole plans to extend this list of restrictions by adding more such constraints based on users' feedback.

    Configuring Presto Strict Mode

    To enable Presto Strict Mode at the cluster level, set qubole-strict-mode-restrictions in etc/config.properties to a semicolon-separated list of restrictions.

    Example:

    config.properties:
    qubole-strict-mode-restrictions= MANDATORY_PARTITION_CONSTRAINT;LIMITED_SORT
    

    This configuration fails queries that lack partition constraints or that perform an unlimited SORT operation.

    Values supported for qubole-strict-mode-restrictions are:

    • NONE
    • MANDATORY_PARTITION_CONSTRAINT
    • DISALLOW_CROSS_JOIN
    • LIMITED_SORT

    You can add any combination of the above values as a semicolon-separated list. But if you set NONE as a value, then other restrictions are not applied.

    Note

    If a query violates the Presto strict mode conditions but strict mode is not enabled, Qubole displays warnings in that query's query info.

    To enable Presto Strict Mode at the account level, create a ticket with Qubole Support.

  • Migrations

    Migrating to Presto from Hive

    To migrate from Hive to Presto, you need to use the SQL syntax and semantics that Presto supports as it uses ANSI SQL syntax and semantics. ANSI SQL has many differences from HiveQL, which is a query language that Hive uses.

    During migration from Hive, you can use Qubole’s Presto UDFs open-source repository that contains Hive functions’ implementations.

    For more information, see the open-source Presto documentation for migrating from Hive.

Presto Best Practices

This section describes some best practices for Presto queries.

ORC Format

Qubole recommends that you use the ORC file format; ORC outperforms text format considerably. For example, suppose you have a table nation in delimited form, partitioned on column p. You can create the ORC version using this DDL as a Hive query.

DROP table if exists nation_orc;
CREATE table nation_orc like nation;
ALTER table nation_orc set fileformat orc;
ALTER table nation_orc set tblproperties ("orc.compress"="SNAPPY");
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
INSERT INTO table nation_orc partition(p) SELECT * FROM nation;

If the nation table is not partitioned, replace the last 3 lines with the following:

INSERT INTO table nation_orc SELECT * FROM nation;

You can run queries against the newly generated table in Presto, and you should see a big difference in performance.

Sorting

ORC format supports skipping reading portions of files if the data is sorted (or mostly sorted) on the filtering columns. For example, if you expect that queries are often going to filter data on column n_name, you can include a SORT BY clause when using Hive to insert data into the ORC table; for example:

INSERT INTO table nation_orc partition(p) SELECT * FROM nation SORT BY n_name;

This helps with queries such as the following:

SELECT count(*) FROM nation_orc WHERE n_name='AUSTRALIA';
Specify JOIN Ordering

Presto does automatic JOIN re-ordering only when the feature is enabled. For more information, see Specifying JOIN Reordering. Otherwise, you need to make sure that smaller tables appear on the right side of the JOIN keyword. For example, if table A is larger than table B, write a JOIN query as follows:

SELECT count(*) FROM A JOIN B on (A.a=B.b)

A bad JOIN command can slow down a query as the hash table is created on the bigger table, and if that table does not fit into memory, it can cause out-of-memory (OOM) exceptions.

Specifying JOIN Reordering

Presto supports JOIN Reordering based on table statistics. It enables Presto to pick the optimal order for joining tables, and it only works with INNER JOINs. This configuration is supported only in Presto 0.180 and later versions. Presto 0.208 has the open-source version of JOIN Reordering.

Note

As a prerequisite, before using JOIN Reordering, ensure that table statistics have been collected for all tables in the query.

Enable the JOIN Reordering feature in Presto 0.208 version by setting the reordering strategy and the number of reordered joins, which are described here:

  • optimizer.join-reordering-strategy: It accepts a string value and the accepted values are:

    • AUTOMATIC: It enumerates possible orders and uses statistics-based cost estimation to find the order with the lowest cost. If the statistics are unavailable or computing the cost fails, ELIMINATE_CROSS_JOINS is used.
    • ELIMINATE_CROSS_JOINS: It is the default strategy; it reorders JOINs only to eliminate cross JOINs and otherwise maintains the original query order, trying to keep the original table order.
    • NONE: It maintains the order in which the tables are listed in the query.

    The equivalent session property is join_reordering_strategy.

  • optimizer.max-reordered-joins: It is the maximum number of joins that can be reordered at a time when optimizer.join-reordering-strategy is set to a cost-based value. Its default value is 9.

    Warning

    You should be cautious while increasing this property’s value as it can result in performance issues. The number of possible JOIN orders increases with the number of relations.
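
For example, both settings can be combined in a cluster-level Presto override similar to the sketch below; the value 5 is illustrative:

config.properties:
optimizer.join-reordering-strategy=AUTOMATIC
optimizer.max-reordered-joins=5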

Enabling Dynamic Filter

Qubole supports the Dynamic Filter feature. It is a join optimization to improve performance of JOIN queries. It has been introduced to optimize Hash JOINs in Presto which can lead to significant speedup in relevant cases.

It is not enabled by default. Enable the Dynamic Filter feature as a session-level property using one of these commands based on the Presto version:

  • Set session dynamic_filtering = true in Presto 0.208 and earlier versions (earliest supported version is 0.180).
  • Set session enable_dynamic_filtering = true in Presto 317.

Enable the Dynamic Filter feature as a Presto override in the Presto cluster using one of these commands based on the Presto version:

  • Set experimental.dynamic-filtering-enabled=true in Presto 0.208 and earlier versions (earliest supported version is 0.180). It requires a cluster restart for the configuration to be effective.
  • Set experimental.enable-dynamic-filtering=true in Presto 317. It requires a cluster restart for the configuration to be effective.

Note

Qubole has introduced a feature to enable dynamic partition pruning for join queries on partitioned columns in Hive tables at account level. It is part of Gradual Rollout.

Qubole has added a configuration property, hive.max-execution-partitions-per-scan, to limit the maximum number of partitions that a table scan is allowed to read during query execution. hive.max-partitions-per-scan limits the number of partitions per table scan during the planning stage, before query execution begins.

Qubole has extended the dynamic filter optimization to semi-join to take advantage of a selective build side in queries with the IN clause.

Example: SELECT COUNT(*) from store_sales where ss_sold_date_sk IN (SELECT s_closed_date_sk from store);

Reducing Data Scanned on the Probe Side

Dynamic filters are pushed down to ORC and Parquet readers to reduce data scanned on the probe side for partitioned as well as non-partitioned tables. Dynamic filters pushed down to ORC and Parquet readers are more effective in filtering data when it is ordered by JOIN keys.

Example: In the following query, ordering store_sales_sorted by ss_sold_date_sk during the ingestion immensely improves the effectiveness of dynamic filtering.

SELECT COUNT(*) from store_sales_sorted ss, store s where ss.ss_sold_date_sk = s.s_closed_date_sk;

Avoiding Stale Caches

It’s useful to tweak the cache parameters if you expect data to change rapidly. See catalog/hive.properties for more information.

For example, if a Hive table adds a new partition, it takes Presto 20 minutes to discover it. If you plan on changing existing files in the Cloud, you may want to make fileinfo expiration more aggressive. If you expect new files to land in a partition rapidly, you may want to reduce or disable the dirinfo cache.
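
For example, a more aggressive metastore cache setup might look like the following Hive catalog override; the values are illustrative, and the properties are described under catalog/hive.properties:

catalog/hive.properties:
hive.metastore-cache-ttl=5m
hive.metastore-refresh-interval=2m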

Compressing Data Writes Through CTAS and INSERT Queries

Data writes can be compressed only when the target format is HiveIgnoreKeyTextOutputFormat. As INSERT OVERWRITE/INTO DIRECTORY uses HiveIgnoreKeyTextOutputFormat, the data written through it can also be compressed by setting the session-level property and codec. All SELECT queries with LIMIT > 1000 are converted into INSERT OVERWRITE/INTO DIRECTORY.

Presto returns the number of files written during a INSERT OVERWRITE DIRECTORY (IOD) query execution in QueryInfo. The Presto client in Qubole Control Plane later uses this information to wait for the returned number of files at the IOD location to be displayed. It fixes the eventual consistency issues while reading query results through the QDS UI.

The INSERT OVERWRITE DIRECTORY command accepts a custom delimiter, which must be an ASCII value. You can specify the ASCII values using double quotes, for example, "," or as a binary literal such as X'AA'.

Here is the syntax to specify a custom delimiter.

insert overwrite directory 's3://sample/defloc/presto_query_result/1/' DELIMITED BY <Custom delimiter> SELECT * FROM default.sample_table;

The <Custom delimiter> parameter does not accept multiple characters or non-ASCII characters as the parameter value.

Configuring Data Writes Compression in Presto

To compress data written from CTAS and INSERT queries to cloud directories, set hive.compression-codec in the Override Presto Configuration field under the Clusters > Advanced Configuration UI page. Set the compression codec under catalog/hive.properties as illustrated below.

catalog/hive.properties:
hive.compression-codec=GZIP

QDS supports the following compression codecs:

  • GZIP (default codec)
  • NONE (used when no compression is required)
  • SNAPPY

See Understanding the Presto Engine Configuration for more information on how to override the Presto configuration. For more information on the Hive connector, see Hive Connector.

When the codec is set, data writes from a successful execution of a CTAS/INSERT Presto query are compressed as per the compression-codec set and stored in the cloud.

To see the file content, navigate to Explore in the QDS UI and select the file under the My GCS tab.

Ignoring Corrupt Records in a Presto Query

Presto has added a new Hive connector configuration, hive.skip-corrupt-records to skip corrupt records in input formats other than orc, parquet and rcfile. It is set to false by default on a Presto cluster. Set hive.skip-corrupt-records=true for all queries on a Presto cluster to ignore corrupt records. This configuration is supported only in Presto 0.180 and later versions.

You can also set it as a session property as hive.skip_corrupt_records=true in a session when the active cluster does not have this configuration globally enabled.
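
For example, to ignore corrupt records for a single session, the property could be set as in this sketch; the table name is a placeholder:

SET SESSION hive.skip_corrupt_records = true;
SELECT count(*) FROM my_schema.my_table;   -- my_schema.my_table is a placeholder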

Note

The behavior for the corrupted file is non-deterministic, that is Presto might read some part of the file before hitting corrupt data and in such a case, the QDS record-reader returns whatever it read until this point and skips the rest of the file.

Using the Custom Event Listener

Event listeners are invoked on each query creation, completion, and split completion. An event listener enables the development of query performance and analysis plugins.

At a given point of time, only a single event listener can be active in a Presto cluster.

Perform these steps to install an event listener in the Presto cluster:

  1. Create an event listener. You can use this Presto event listener as a template.

  2. Build a JAR file and upload it to the cloud object store. For example, let us use s3://presto/plugins/event-listener.jar as the cloud object storage location.

  3. Download event-listener.jar on the Presto cluster using the Presto Server Bootstrap. You can add the Presto bootstrap properties as Presto overrides in the Presto cluster to download the JAR file. For downloading event-listener.jar, pass the following bootstrap properties as Presto overrides through the Override Presto Configuration UI option in the cluster’s Advanced Configuration tab.

    bootstrap.properties:
    
    mkdir /usr/lib/presto/plugin/event-listener
    cd /usr/lib/presto/plugin/event-listener
    hadoop fs -get s3://presto/plugins/event-listener.jar
    
  4. Configure Presto to use the event-listener through the Override Presto Configuration UI option in the cluster’s Advanced Configuration tab as shown below.

    event-listener.properties:
    event-listener.name=my-event-listener
    
  5. Restart the Presto cluster.

Using the Presto Query Retrying Mechanism

Qubole has added a query retry mechanism to handle query failures (if possible). It is useful in cases when Qubole adds nodes to the cluster during autoscaling or after a preemptible instance loss (that is when the cluster composition contains preemptible instances). The new query retry mechanism:

  • Retries a query that failed with the LocalMemoryExceeded error when new nodes are added to the cluster or are in the process of being added to the cluster.
  • Retries a query that failed with an error due to worker node loss.

In the above two scenarios, there is a waiting time period to let new nodes join the cluster before Presto retries the failed query. To avoid an endless waiting time period, Qubole has added appropriate timeouts. Qubole has also ensured that any actions performed by the failed query’s partial execution are rolled back before retrying the failed query.

Uses of the Query Retry Mechanism

The query retry mechanism is useful in these two cases:

  • When a query triggers upscaling but fails with the LocalMemoryExceeded error as it is run on a smaller-size cluster. The retry mechanism ensures that the failed query is automatically retried on that upscaled cluster.
  • When a preemptible node loss happens during the query execution. The retry mechanism ensures that the failed query is automatically retried when new nodes join the cluster (when there is a preemptible node loss, Qubole automatically adds new nodes to stabilize the cluster after it receives a preemptible termination notification. Hence, immediately after the preemptible node loss, a new node joins the cluster).
Disabling the Query Retry Mechanism

This feature is enabled by default. You can disable it at the cluster or session level by using the corresponding properties:

  • At the cluster level: Override retry.autoRetry=false in the Presto cluster overrides. On the Presto Cluster UI, you can override a cluster property under Advanced Configuration > PRESTO SETTINGS > Override Presto Configuration. This property is enabled by default.

  • At the session level: Set auto_retry=false in the specific query’s session, as shown in the sketch after the note below. This property is enabled by default.

    Note

    The session property is more useful as an option to disable the retry feature at query level when autoRetry is enabled at the cluster level.
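Here is a minimal sketch of disabling the retry for a single query’s session, following the same set session convention used elsewhere in this documentation:

set session auto_retry=false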

Configuring the Query Retry Mechanism

You can configure these parameters (a sketch follows this list):

  • retrier.max-wait-time-local-memory-exceeded: The maximum time Presto waits for new nodes to join the cluster before it gives up on retrying a query that failed with the LocalMemoryExceeded error. Its value is configured in seconds or minutes; for example, 2s or 2m. Its default value is 5m. If a new node does not join the cluster within this time period, Qubole returns the original query failure response.
  • retrier.max-wait-time-node-loss: The maximum time Presto waits for new nodes to join the cluster before it gives up on retrying a query that failed due to preemptible node loss. Its value is configured in seconds or minutes; for example, 2s or 2m. Its default value is 3m. If a new node does not join the cluster within this configured time period, the failed query is retried on the smaller-sized cluster.
  • retry.nodeLostErrors: It is a comma-separated list of Presto errors (in a string form) that signify the node loss. The default value of this property is "REMOTE_HOST_GONE","TOO_MANY_REQUESTS_FAILED","PAGE_TRANSPORT_TIMEOUT".
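Here is a sketch of overriding the retry timeouts. This assumes the properties are passed under the config.properties header in the Presto cluster overrides, and the values shown are only illustrative:

config.properties:
retrier.max-wait-time-local-memory-exceeded=3m
retrier.max-wait-time-node-loss=2m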
Understanding the Query Retry Mechanism

The query retries can occur multiple times. By default, three retries can occur if all conditions are met. The conditions on which the retries happen are:

  • The error is retryable. Currently, LocalMemoryExceeded and node loss errors: REMOTE_HOST_GONE, TOO_MANY_REQUESTS_FAILED, and PAGE_TRANSPORT_TIMEOUT are considered retryable. This list of node loss errors is configurable using the retry.nodeLostErrors property.
  • INSERT OVERWRITE DIRECTORY, INSERT OVERWRITE TABLE, and CREATE TABLE AS SELECT (CTAS) queries are considered retryable. SELECT queries that do not return data before they fail are also retryable.
  • The actions of a failed query are rolled back successfully. If the rollback fails or if Qubole times out waiting, then it does not retry it.
  • A failed query has a chance to succeed if retried:
    • For the LocalMemoryExceeded error: The query has a chance to succeed if the current number of workers is greater than the number of workers handling the Aggregation stage. If Qubole times out waiting to get to this state, it does not retry.
    • For the node loss errors: The query has a chance to succeed if the current number of workers is greater than or equal to the number of workers that the query ran on earlier. If Qubole times out waiting to get to this state, it goes ahead with the retry as the query may still pass in the smaller-sized cluster.
Using the Spill to Disk Mechanism

Presto supports offloading intermediate operation results to disk for memory-intensive operations. This is called the Spill to Disk mechanism. It enables execution of queries that would otherwise fail because their memory requirements exceed the maximum memory-per-node limit (defined by query.max-memory-per-node). It is a best-effort mechanism that increases the chances of success for queries with high memory requirements, but it does not guarantee that all memory-intensive queries succeed. For more information, see Spill to Disk.

Note

Qubole recommends using the Spill to Disk mechanism from Presto 0.208. For more information, see:

Enabling Spill to Disk Mechanism on a Presto Cluster

You can enable the Spill to Disk mechanism for a Presto cluster through the cluster configuration overrides as illustrated below.

config.properties:
experimental.spiller-spill-path=<path to the directory that will be used to write the spilled data>
experimental.spill-enabled=true
experimental.max-spill-per-node=250GB
experimental.query-max-spill-per-node=100GB
Enabling Spill to Disk Mechanism on a Session

To enable the Spill to Disk mechanism at the query level, use the session property as mentioned here.

set session spill_enabled =true

Note

Spill to Disk works only with local disks on worker nodes and so, it does not work with the cloud object storage (for example, S3).

You must set the directory used to write the spilled data if you want to enable the Spill to Disk mechanism at the cluster level or session level. To set this location, use the cluster-level configuration property as mentioned here:

config.properties:
experimental.spiller-spill-path=<path to the directory that will be used to write the spilled data>

For more info on the cluster-level configuration, see Enabling Spill to Disk Mechanism on a Presto Cluster.

Spill Path on the Local Disk

Starting with Presto version 0.208, there is a default value for the spill path, that is, the location on the disk where intermediate operation results are offloaded. The default location used for spilling is /media/ephemeral0/presto/spill_dir on the worker nodes. The default directory allows you to easily enable the spill-to-disk configuration on a session, or enable it at the cluster level for all queries by passing it as a Presto override.

Configuring the Maximum Spill Per Node

You should configure the experimental.max-spill-per-node property (size for maximum spill per node) by considering the free disk space on /media/ephemeral0.

Here is a sample command to check the disk space on /media/ephemeral0 along with its output.

[root@ip-<ip-address> ~]# df -ah
Filesystem      Size  Used Avail Use% Mounted on
proc               0     0     0    - /proc
sysfs              0     0     0    - /sys
/dev/xvda1       59G   34G   26G  58% /
devtmpfs         61G  3.9M   61G   1% /dev
devpts             0     0     0    - /dev/pts
tmpfs            61G     0   61G   0% /dev/shm
none               0     0     0    - /proc/sys/fs/binfmt_misc
/dev/xvdaa      296G   71M  293G   1% /media/ephemeral0
/dev/xvdp       197G   68M  195G   1% /media/ebs3
/dev/xvdo       197G   64M  195G   1% /media/ebs2
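Based on the output above (roughly 293 GB available on /media/ephemeral0), you might, for example, cap the spill sizes somewhat below the free space; the values here are only illustrative:

config.properties:
experimental.max-spill-per-node=250GB
experimental.query-max-spill-per-node=100GB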

See also:

RubiX Administration

This section explains the topics related to the RubiX administration.

Introduction

RubiX is a lightweight data caching framework that can be used by any Big Data system that uses the Hadoop filesystem interface. RubiX is designed to work with cloud storage systems such as AWS S3 and Azure Blob Storage, and can be extended through plugins to support any engine that uses the Hadoop filesystem interface to access data in any cloud object storage.

Note

Rubix is only supported on Presto clusters that use GCP.

For more information, see these blogs:

Open-Source RubiX

Qubole has open-sourced RubiX. The documentation is here:

Using RubiX in the QDS UI

For instructions on configuring and using RubiX via the QDS UI, see:

Configuring RubiX in Presto and Spark Clusters

To use RubiX in an existing Presto or Spark cluster, perform these steps:

  1. Navigate to the Clusters UI page.

  2. Click Edit against the Presto or Spark cluster on which you want to turn on RubiX.

  3. Go to the cluster’s Advanced Configuration tab.

  4. Select Enable Rubix that is above PRESTO SETTINGS on a Presto cluster. (On a Spark cluster, Enable Rubix is above SPARK SETTINGS.)

    The Enable Rubix checkbox is above PRESTO SETTINGS in a Presto cluster’s Advanced Configuration tab as illustrated below.

    _images/RubiXcheckbox.png

After you select Enable Rubix, QDS automatically configures RubiX to cache data in the cluster. To turn off RubiX on the cluster, unselect the Enable RubiX checkbox.

Note

Rubix is only supported on Presto clusters that use GCP.

Understanding the RubiX Configuration

Qubole has open-sourced RubiX. For more information on configuration, see RubiX Cache Manager.

Achieving Best Query Performance Using RubiX with Presto

Qubole has developed scheduling optimizations in Presto 0.208 and later versions to take advantage of the data cached on worker nodes with RubiX. This blog post provides an explanation. This capability is part of Gradual Rollout.

To enable this capability, perform these steps:

  1. Qubole strongly recommends using SSD (Solid State Drive) or NVMe (Non-Volatile Memory Express) disks in the cluster with RubiX for optimum performance. For the list of EC2 instance types provisioned with local NVMe or SSD storage volumes, see instance store.

    Using EBS (Elastic Block Storage) disks with RubiX is discouraged as they may cause performance degradation.

  2. Add node-scheduler.optimized-local-scheduling=true under the config.properties header in the Override Presto Configuration under the Advanced Configuration tab of the Presto cluster UI. Click Update after adding it as an override. This configuration is only required with Presto version 0.208 as optimized scheduler is used by default from Presto version 317. Configuring a Presto Cluster describes the Override Presto Configuration.

    For more information on cluster configuration options that are common to all cluster types, see Managing Clusters.

  3. Restart the cluster to apply the configuration.

RubiX Metrics

RubiX allows you to get metrics through the Bookkeeper Cache. In the case of Presto, you can also get the metrics through JMX queries.

Note

Rubix is only supported on Presto clusters that use GCP.

Cache Metrics

For a list of metrics related to cache interactions, refer to Cache Metrics.

JMX Metrics

Only for RubiX in Presto, you can monitor the RubiX system through JMX queries.

Here is a sample Presto Rubix JMX query.

SELECT Node, CachedReads,
ROUND(ExtraReadFromRemote,2) AS ExtraReadFromRemote,
ROUND(HitRate,2) AS HitRate,
ROUND(MissRate,2) AS MissRate,
ROUND(NonLocalDataRead,2) AS NonLocalDataRead,
NonLocalReads,
ROUND(ReadFromCache,2) AS ReadFromCache,
ROUND(ReadFromRemote, 2) AS ReadFromRemote,
RemoteReads
FROM jmx.current."rubix:name=stats";
Viewing RubiX Metrics for Qubole Clusters

Note

Rubix is only supported on Presto clusters that use GCP.

You can monitor the Rubix metrics for the supported Qubole clusters from the Grafana Dashboard.

Note

RubiX dashboard is not enabled for all users by default. Create a ticket with Qubole Support to enable this feature on the QDS account.

Steps
  1. Navigate to the Clusters page.
  2. Select the required cluster.
  3. Navigate to Overview >> Resources, and select Prometheus Metrics. The Grafana dashboard opens in a separate tab.
  4. Click on Home on the top left corner, and select RubiX Dashboard.

The following figure shows a sample RubiX Dashboard with the details.

_images/rubix-dashboard.png
Spark Administration

This section explains the topics related to the Spark administration.

Configuring a Spark Cluster

By default, each account has a Spark cluster; this cluster is used automatically for Spark jobs and applications. You can add a new Spark cluster and edit the configuration of the default Spark cluster on the Clusters page. QDS clusters are configured with reasonable defaults.

Adding a New Spark Cluster
  1. Navigate to the Clusters page.

  2. Click the New button near the top left of the page.

  3. On the Create New Cluster page, choose Spark and click Next.

  4. Specify a label for a new cluster in the Cluster Labels field.

  5. Select the version from the Spark Version drop-down list.

    In the drop-down list, Spark 2.x-latest means the latest open-source maintenance version of 2.x. When a new maintenance version is released, Qubole Spark versions are automatically upgraded to that version. So if 2.2-latest currently points to 2.2.0, then when 2.2.1 is released, QDS Spark clusters running 2.2-latest will automatically start using 2.2.1 on a cluster restart. See QDS Components: Supported Versions and Cloud Platforms for more information about Spark versions in QDS.

  6. Select legacy or user as the Notebook Interpreter Mode from the drop-down list.

    For information on Notebook Interpreter Mode on a Spark cluster, see Using the User Interpreter Mode for Spark Notebooks.

  7. Select the coordinator node type and worker node type from the appropriate dropdown list.

    Note

    Qubole provides an option to disallow creation of Spark clusters with low-memory instances (memory < 8 GB). This option is not available for all users by default; create a ticket with Qubole Support to enable it. With this option enabled, an existing cluster that uses a low-memory instance fails.

    Cluster autoscaling is enabled by default on Qubole Spark clusters. The default value of Maximum Worker Nodes is increased from 2 to 10.

  8. Enter other configuration details in the Composition and Advanced Configuration tabs and click Create.

The newly created cluster is displayed in the Clusters page.

Editing the Cluster Configuration
  1. Navigate to the Clusters page.

  2. Click the Edit button next to the cluster.

  3. Edit the required configuration.

    If you changed the Spark version, then restart the cluster for the changes to take effect.

  4. Click Update to save the configuration.

Note

After you modify any cluster configuration, you must restart the cluster for the changes to take effect.

Note

There is a known issue for Spark 2.2.0 in Qubole Spark: Avro write fails with org.apache.spark.SparkException: Task failed while writing rows. This is a known issue in the open-source code. As a workaround, append the following to your node bootstrap script:

rm -rf /usr/lib/spark/assembly/target/scala-2.11/jars/spark-avro_2.11-3.2.0.jar
/usr/lib/hadoop2/bin/hadoop fs -get s3://paid-qubole/spark/jars/spark-avro/spark-avro_2.11

See Managing Clusters for instructions on changing other cluster settings.

Viewing a Package Management Environment on the Spark Cluster UI

When this feature is enabled and you create a new Spark cluster, a package environment is created and attached to the cluster by default. This feature is not enabled by default; create a ticket with Qubole Support to enable it on the QDS account.

You can attach a package management environment to an existing Spark cluster. For more information, see Using the Default Package Management UI.

Once an environment is attached to the cluster, you can see the ENVIRONMENT SETTINGS in the Spark cluster’s Advanced Configuration. Here is an environment attached to the Spark cluster.

_images/EnvironmentSettings.png

The default environment gets a list of pre-installed Python and R packages. To see the environment list, navigate to the Control Panel > Environments.

Configuring Heterogeneous Nodes in Spark Clusters

An Overview of Heterogeneous Nodes in Clusters explains how to configure heterogeneous nodes in Hadoop 2 and Spark clusters.

Overriding the Spark Default Configuration

Qubole provides a default configuration based on the Worker Node Type. The settings are used by Spark programs running in the cluster whether they are run from the UI, an API, or an SDK.

The figure below shows the default configuration.

_images/spark-defaults.png

Note: Use the tooltip to get help on a field or check box.

_images/Help_Tooltip.png

To change or override the default configuration, provide the configuration values in the Override Spark Configuration Variables text box. Enter the configuration variables as follows:

In the first line, enter spark-defaults.conf:. In subsequent lines, enter one <key> <value> pair per line; for example:

spark-defaults.conf:
spark.executor.cores 2
spark.executor.memory 10G

To apply the new settings, restart the cluster.

To handle different types of workloads (for example, memory-intensive versus compute-intensive) you can add clusters and configure each appropriately.

Setting Time-To-Live in the JVMs for DNS Lookups on a Running Cluster

Qubole now supports configuring Time-To-Live (TTL) in the JVMs for DNS lookups in a running cluster (except Airflow and Presto clusters). This feature is not enabled by default; create a ticket with Qubole Support to enable it on the QDS account. The recommended TTL value is 60, and its unit is seconds.

Autoscaling in Spark

Each Spark cluster contains a configured maximum and minimum number of nodes. A cluster starts with the minimum number of nodes and can scale up to maximum. Later, it can scale back to the minimum, depending on the cluster workload. This topic explains how a Spark cluster and job applications autoscale, and discusses various settings to fine-tune autoscaling. See Autoscaling in Qubole Clusters for a broader discussion.

Advantages of Autoscaling

Autoscaling clusters provides the following benefits:

  • Adds nodes when the load is high
  • Contributes to good cost management as the cluster capacity is dynamically scaled up and down as required

Autoscaling Spark jobs provides the following benefits:

  • Decides on the optimum number of executors required for a Spark job based on the load
Understanding Spark Autoscaling Properties

The following list describes the autoscaling properties of Spark.

Note

Qubole supports open-source dynamic allocation properties in Spark 1.6.1 and later versions.

  • spark.qubole.autoscaling.enabled (default: true): Enables autoscaling. Not applicable to Spark 1.6.1 and later versions.
  • spark.dynamicAllocation.enabled (default: true): Enables autoscaling. Only applicable to Spark 1.6.1 and later versions.
  • spark.dynamicAllocation.maxExecutors (default: the value of spark.executor.instances, if not set): The maximum number of executors to be used. Its Spark submit option is --max-executors.
  • spark.executor.instances (default: 2, if not set): The minimum number of executors. Its Spark submit option is --num-executors.
  • spark.qubole.autoscaling.stagetime (default: 2 * 60 * 1000 milliseconds): If expectedRuntimeOfStage is greater than this value, the number of executors is increased.
  • spark.qubole.autoscaling.memorythreshold (default: 0.75): If memory used by the executors is greater than this value, the number of executors is increased.
  • spark.qubole.autoscaling.memory.downscaleCachedExecutors (default: true): Executors with cached data are also downscaled by default. Set its value to false if you do not want downscaling in the presence of cached data. Not applicable to Spark 1.6.1 and later versions.
  • spark.dynamicAllocation.cachedExecutorIdleTimeout (default: Infinity): Timeout in seconds. If an executor with cached data has been idle for more than this configured timeout, it gets removed. Applicable only to Spark 1.6.1, 1.6.2, and later versions.

Note

The spark.qubole.max.executors parameter is deprecated, however, it continues to work. If you specify both spark.qubole.max.executors and spark.dynamicAllocation.maxExecutors parameters, then spark.dynamicAllocation.maxExecutors overrides spark.qubole.max.executors.
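These autoscaling properties can be overridden in the cluster’s Override Spark Configuration Variables field using the spark-defaults.conf format shown under Overriding the Spark Default Configuration; the following is a minimal sketch with illustrative values:

spark-defaults.conf:
spark.dynamicAllocation.maxExecutors 20
spark.executor.instances 2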

Spark Configuration Recommendations

These are a few points to remember related to Spark cluster and job configuration in general:

  • Set --max-executors. The other parameters generally do not need to be set, as the defaults are sufficient.
  • --num-executors or spark.executor.instances acts as the minimum number of executors, with a default value of 2. The minimum number of executors does not mean that the Spark application waits for that many executors to launch before it starts; the minimum applies only to autoscaling. For example, during application start-up:
    1. If YARN is unable to schedule resources for --num-executors or spark.executor.instances, the Spark application starts with as many executors as YARN can schedule.
    2. Once --num-executors or spark.dynamicAllocation.minExecutors executors are allocated, the application never goes below that number.
  • Try to avoid setting too many job-level parameters.

Note

--max-executors is the Spark submit option for spark.dynamicAllocation.maxExecutors and --num-executors is the Spark submit option for spark.executor.instances.

In Spark, autoscaling can be done at both the cluster level and the job level. See the following topics for more information:

Spark on Qubole’s capabilities include fine-grained downscaling, downscaling of cached executors after idle timeout, and support for open-source dynamic allocation configurations.

Autoscaling in Spark Clusters

A Spark cluster spins up with the configured minimum number of nodes and can scale up to the maximum, depending on the load. Once the load drops, the cluster scales back down towards the minimum.

Qubole runs Spark on YARN: each Spark application is submitted as a YARN application. By default, Spark uses a static allocation of resources. That is, when you submit a job, exact resource requirements are specified. The application requests containers and YARN allocates the containers.

Here is an example of a Spark 2.0.0 cluster:

Property Name Property Value
minimum nodes 2
maximum nodes 10
node type (Choose a large instance type; for example 8 cores, 30G memory)
spark.dynamicAllocation.enabled true
yarn.nodemanager.resource.memory 26680 MB
spark.yarn.executor.memoryOverhead 1024 MB
spark.executor.memory 12 GB
spark.executor.cores 4

If a job with a minimum number of executors set to 4 is submitted to the cluster, YARN schedules two containers in the first worker node and the other two containers in the second worker node. The ApplicationMaster takes up an additional container.

Here is the logic to find the number of executors per node from the above example of a Spark 2.0.0 cluster.

Total memory = 30 GB
yarn.nodemanager.resource.memory = 26680 MB
If the number of executors per node = 2

Total resource memory = number of executors per node * (spark.executor.memory + spark.yarn.executor.memoryOverhead)
That is 2 * (12 GB + 1 GB) = 26 GB

Which is equivalent to the value of yarn.nodemanager.resource.memory

Here is the logic to check whether the number of cores per executor is correct from the above example of a Spark 2.0.0 cluster.

Total number of cores = 8
If spark.executor.cores = 4 and the number of executors per node = 2

Total number of cores = spark.executor.cores * number of executors per node

In the above table, spark.executor.cores = 4 and number of executors per node = 2
Hence, total number of cores = 4 * 2
Thus, the total number of cores = 8

Now, if you submit a new job to the same cluster in parallel, YARN does not have enough resources to run it, and this triggers Qubole’s YARN-level autoscaling: YARN figures out that two more nodes are required for the new job to run and requests the two nodes. These nodes are added to the current cluster, for a total of four nodes.

When the job completes, YARN recovers the resources. If the added nodes are idle and there is no active job, the cluster scales back to the minimum number of nodes.

Note

A node is available for downscaling under these conditions.

Autoscaling within a Spark Job

A Spark job uses a set of resources based on the number of executors. These executors are long-running Java Virtual Machines (JVMs) that are up during a Spark job’s lifetime. Statically determining the number of executors required by a Spark application may not get the best results. When you use the autoscaling feature within a Spark application, QDS monitors job progress at runtime and decides the optimum number of executors using SLA-based autoscaling.

By default, autoscaling within a Spark Job is enabled, with the following parameter set to true:

spark.qubole.autoscaling.enabled=true in Spark 1.6.0 and earlier versions

or

spark.dynamicAllocation.enabled=true in Spark 1.6.1 and later versions (including all versions supported on Azure and Oracle OCI).

Note

These settings become active only when you configure spark.dynamicAllocation.maxExecutors.

When the first Spark job is submitted, the Spark cluster starts with the configured minimum of two nodes (two large instances as worker nodes). In the configuration described above, each node can run two executors.

Depending on the job progress, or when new jobs are submitted, the Spark job-level autoscaler decides to add or release executors at runtime. The cluster starts with four executors (running on two large instances) and can autoscale up to 20 executors (running on ten large instances). It downscales back towards the minimum of four executors if the workload declines.

Changing from Qubole Dynamic Allocation Strategy

Qubole supports the open-source dynamic allocation strategy in addition to Qubole’s own dynamic allocation strategy, which is the default, that is, spark.dynamicAllocation.strategy=org.apache.spark.dynamicallocation.QuboleAllocationStrategy.

To change from the Qubole dynamic allocation strategy to the open-source dynamic allocation strategy, set spark.dynamicAllocation.strategy=org.apache.spark.dynamicallocation.DefaultAllocationStrategy. You can then use all open-source dynamic allocation configurations as-is, such as spark.dynamicAllocation.maxExecutors, spark.dynamicAllocation.minExecutors, and spark.dynamicAllocation.initialExecutors.
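Here is a sketch of switching to the open-source strategy as a Spark configuration override; the executor counts are illustrative:

spark-defaults.conf:
spark.dynamicAllocation.strategy org.apache.spark.dynamicallocation.DefaultAllocationStrategy
spark.dynamicAllocation.minExecutors 2
spark.dynamicAllocation.initialExecutors 4
spark.dynamicAllocation.maxExecutors 20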

Autoscaling Examples

The following section describes different scenarios of autoscaling in Spark.

Autoscaling Nodes Running in a Single Cluster

For Spark clusters, autoscaling is enabled by default. QDS increases the number of nodes, up to the cluster’s maximum size, if multiple big jobs are submitted to the cluster.

Conversely, QDS reduces the number of nodes, down to the cluster’s minimum size, as the workload declines.

Upscaling a Single Memory Intensive Spark Job

You can set a limit on the executor memory a job can use by setting spark.executor.memory.

For example, in the cluster described above, if the executor memory is configured to be 25G and the worker nodes have 30GB of memory, only one executor can run on one node. The first Spark job starts with two executors (because the minimum number of nodes is set to two in this example).

The cluster can autoscale to a maximum of ten executors (because the maximum number of nodes is set to ten).

Running Many Jobs on a Single Cluster

You can set a limit on the maximum number of executors a job can use by setting the property spark.dynamicAllocation.maxExecutors. This configuration is usually preferred when there are many jobs in parallel and sharing the cluster resources becomes a necessity.

If the cluster resources are being fully used, new jobs either upscale the cluster if it is not yet at its maximum size, or wait until current jobs complete.

Autoscaling Executors in a Spark Job

By default, autoscaling of executors is enabled in a Spark job. The number of executors increases up to the maximum if the Spark job is long-running or memory-intensive.

Configuring Autoscaling Parameters for a Spark Job Stage Runtime

You can set a threshold for the job’s expected stage runtime by setting the property, spark.qubole.autoscaling.stagetime. Executors are added to the Spark job if the expected stage runtime is greater than the spark.qubole.autoscaling.stagetime value.

Note

The expected stage runtime is calculated only after the first task’s completion.

Adding Executors in a Single Spark Job with Memory-intensive Executors

You can set a memory threshold by setting the property spark.qubole.autoscaling.memorythreshold, which is used by the autoscaling memory algorithm. Executors are added to the Spark job if the memory used by the executors exceeds spark.qubole.autoscaling.memorythreshold.
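Both thresholds can be tuned as Spark configuration overrides. Here is a sketch with illustrative values; the stage-time value is assumed to be in milliseconds, matching its default of 2 * 60 * 1000:

spark-defaults.conf:
spark.qubole.autoscaling.stagetime 240000
spark.qubole.autoscaling.memorythreshold 0.80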

Encrypting and Authenticating Spark Data in Transit

Spark supports encryption and authentication of in-transit data. Authentication is done via shared secret; encryption uses the Simple Authentication and Security Layer (SASL). For more information, see this Spark page.

To enable encryption and authentication for a Spark cluster, proceed as follows:

  1. From the main menu navigate to the Clusters page.
  2. Choose Edit for the Spark cluster on which you want to enable encryption and authentication.
  3. In the Hadoop Cluster Settings section, add the following to the Override Hadoop Configuration Variables field:
spark.authenticate=true
  4. In the Spark Cluster Settings section, add the following to the Override Spark Configuration field:

    spark.authenticate=true
    spark.authenticate.enableSaslEncryption=true
    spark.network.sasl.serverAlwaysEncrypt=true
    
  5. If the cluster is running, restart it to apply these new settings.

All in-transit data will now be encrypted for all Spark jobs running on this cluster.

Understanding Authorization of Hive Objects in Spark

Spark on Qubole supports SQL Standard authorization of Hive objects in Spark 2.0 and later versions. With this feature, Spark honors the privileges and roles set in Hive as per Understanding Qubole Hive Authorization and offers Hive table data security through granular access to table data.

For more information on Hive authorization and privileges, see Understanding Qubole Hive Authorization. This feature is available for beta access. To enable it on a Qubole account, create a ticket with Qubole Support.

Spark on Qubole supports table-level security in all supported languages. This means that any Spark command accessing Hive objects, whether in SQL, Scala, PySpark, or Spark R, honors authorization.

For details on how to configure Hive Thrift Metastore Interface as a Spark cluster override, see Configuring Thrift Metastore Server Interface for the Custom Metastore.

Prerequisites for Enabling Authorization of Hive Objects in Spark

Authorization of Hive Objects is enabled on a QDS account with this prerequisite:

  • Per-user interpreter mode is enabled on all active Spark clusters. For more information on the user interpreter mode, see Using the User Interpreter Mode for Spark Notebooks. The legacy interpreter mode gets disabled after Hive authorization is enabled on the QDS account.
Running Hive Admin Commands Through SparkSQL

Starting with Spark 2.4, Spark on Qubole enables you to run Hive Admin commands through SparkSQL. A user with appropriate privileges can run the following commands:

  • Set role
  • Grant privilege (SELECT, INSERT, DELETE, UPDATE or ALL)
  • Revoke privilege (SELECT, INSERT, DELETE, UPDATE or ALL)
  • Grant role
  • Revoke role
  • Show Grant
  • Show current roles
  • Show roles
  • Show role grant
  • Show principals for role.

The syntax of the Hive Admin commands in Spark is the same as that of the Hive authorization commands. For more information about the syntax, see SQL Standard Based Hive Authorization.
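For example, a user with appropriate privileges might run commands such as the following from SparkSQL; the table and role names here are hypothetical:

SET ROLE admin;
GRANT SELECT ON TABLE table1 TO ROLE analyst;
SHOW GRANT ON TABLE table1;
REVOKE SELECT ON TABLE table1 FROM ROLE analyst;
SHOW CURRENT ROLES;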

Limitations of Hive Admin Commands in Spark
  • Show Grant command: Currently, the ALL case is not supported. The supported forms of the Show Grant command are as follows:

      SHOW GRANT USER user1 on TABLE table1;
      SHOW GRANT on TABLE table1;

    Example of unsupported cases:

      SHOW GRANT USER user1 on ALL;
      SHOW GRANT ON ALL;

  • Set Role command: None and setting of multiple roles at once are not supported.

    Example of unsupported cases

    SET ROLE NONE;
    SET ROLE role1, role2;
    
Known Issues in Authorization of Hive Objects in Spark

These are known issues only in Spark 2.0.0:

  • CREATE DATABASE does not pass the owner information. A temporary workaround would be to create databases using Hive.
  • In CREATE TABLE commands, permissions are not given to the owner of the table; hence, any query the owner runs on the newly created table fails due to an authorization failure. A temporary workaround would be to create tables using Hive.
  • SHOW COLUMNS does not honor authorization and any user can perform that query on a table.

This is a known issue only in Spark 2.1.0:

  • ANALYZE TABLE does not honor authorization and any user can perform that query on a table.
Understanding the Spark Job Server

Qubole provides a Spark Job Server that enables sharing of Resilient Distributed Datasets (RDDs) in a Spark application among multiple Spark jobs. This enables use cases where you spin up a Spark application, run a job to load the RDDs, then use those RDDs for low-latency data access across multiple query jobs. For example, you can cache multiple data tables in memory, then run Spark SQL queries against those cached datasets for interactive ad-hoc analysis.

Besides this, you can also use the Job Server to reduce end-to-end latencies of small unrelated Spark jobs. In our tests, we noticed that using the Job Server brought end-to-end latencies of very small Spark jobs down from more than a minute to less than 10 seconds. The major reason for this performance improvement is that in case of the Job Server, you already have a Spark application running to which you submit the SQL query or Scala/Python snippet. On the other hand, without the Job Server, each SQL query or Scala/Python snippet submitted to Qubole’s API would start its own application. This happens because the API was designed to run standalone applications.

The following section describes how you can interact with the Spark Job Server using Qubole’s Python SDK. Spark Job Server support has been added in SDK version 1.8.0. So, you must update the SDK to that version or to a later version.

Qubole’s Spark Job Server is backed by Apache Zeppelin. The main abstraction in the Spark Job Server API is an app. It is used to store the configuration for the Spark application. In Zeppelin terms, an app is a combination of a notebook plus an interpreter.

How to use the Spark Job Server

Note

Ensure that you upgrade qds-sdk-py to the latest version before creating an app.

Create a new Spark APP and test it as shown in this example.

cd venvs/qds-sdk-py/qds-sdk-py/bin

//Upgrade QDS Python SDK
pip install --upgrade qds-sdk

//Listing clusters
qds.py --token=API-TOKEN --url=https://api.qubole.com/api --vv cluster list

//Creating an APP
qds.py --token=API-TOKEN --url=https://api.qubole.com/api --vv app create --name spark1-app --config spark.executor.memory=3g

//Listing Apps
qds.py --token=API-TOKEN --url=https://api.qubole.com/api --vv app list

//Testing SQL
qds.py --token=API-TOKEN --url=https://api.qubole.com/api --vv sparkcmd run --sql 'SELECT count(*) FROM default_qubole_memetracker' --cluster-label spark --app-id 343

Example: Running a Scala Job for Calculating the Pi Value

The following example shows how to split a Scala job into two jobs (p1.scala and p2.scala). The Spark Job Server uses the result from the p1.scala job to print the value of Pi as part of the second job, p2.scala.

//Run this job and the Spark job server loads RDDs that are used for low-latency data access across multiple query jobs.
p1.scala
import scala.math.random
import org.apache.spark._
val slices = 6
val n = 100000 * slices
val count = sc.parallelize(1 to n, slices).map { i =>
val x = random * 2 - 1
val y = random * 2 - 1
if (x*x + y*y < 1) 1 else 0
}.reduce(_ + _)
//Run this job and the Job server will substitute the values from the RDDs loaded after executing the ``p1.scala`` job.
p2.scala
println("Pi is roughly " + 4.0 * count / n)

Call the two jobs as shown below.

//Calling p1.scala
qds.py --token=API-TOKEN --url=https://api.qubole.com/api --vv sparkcmd run --script_location=scala-scripts/p1.scala
--cluster-label spark --app-id 346
//Calling p2.scala
qds.py --token=API-TOKEN --url=https://api.qubole.com/api --vv sparkcmd run --script_location=scala-scripts/p2.scala
--cluster-label spark --app-id 346

When an app is created, it is in the DOWN state and is not associated with any cluster. So, it can be run on any cluster. An app’s state changes to UP when you submit a command to it and specify a cluster label on which to run the command. As long as the app is in UP state, it remains associated with the cluster on which it was started. You can submit a command to an app by specifying the globally unique app ID (creating an app returns the unique app ID) and the cluster label where the app is running or yet to be run. The following command is an example.

$ qds.py sparkcmd run --script_location=some-spark-snippet.py --cluster-label spark --app-id 3

When a command is run on an app, the state of the cluster and app can vary as mentioned below:

  • The state of the cluster and app can both be DOWN. In this case, Qubole starts the cluster and later starts the app on this running cluster, and submits the snippet to the app.
  • When the cluster is running and only the app is DOWN, Qubole starts the app on this running cluster and submits the snippet to the app.
  • When the cluster and app are both UP, Qubole submits the snippet to the app.
  • When the app is UP but on a different cluster, an error message is displayed.

You can continue to submit multiple commands to the app and get results quickly. For example, the following command can also be submitted.

$ qds.py sparkcmd run --sql 'select count(*) from default_qubole_memetracker' --cluster-label spark --app-id 3

When you are done with submitting commands, you can mark the app as DOWN using the following command:

$ qds.py app stop 3

The app will get restarted when you submit another command to it.

When a cluster is terminated, all apps associated with it are automatically marked as DOWN.

You can list all the apps in an account using the following command:

$ qds.py app list

Response

[
    {
        "status": "DOWN",
        "kind": "spark",
        "note_id": null,
        "name": "app-with-beefy-executors",
        "interpreter_id": null,
        "created_at": "2015-10-30T23:42:15Z",
        "qbol_user_id": 1157,
        "cluster_label": null,
        "config": "{\"spark.executor.memory\":\"20g\",\"spark.executor.cores\":\"4\"}",
        "id": 3
    },
    {
        "status": "UP",
        "kind": "spark",
        "note_id": "2B4S9FQKS1446057961459",
        "name": "concurrent-sql",
        "interpreter_id": "2B5NE7CKT1446057961437",
        "created_at": "2015-10-31T18:45:05Z",
        "qbol_user_id": 1157,
        "cluster_label": "spark",
        "config": "{\"zeppelin.spark.concurrentSQL\":\"true\"}",
        "id": 5
    }
]

You can view a particular app using the following command.

$ qds.py app show 3

Response

{
    "status": "DOWN",
    "kind": "spark",
    "note_id": null,
    "name": "app-with-beefy-executors",
    "interpreter_id": null,
    "created_at": "2015-10-30T23:42:15Z",
    "qbol_user_id": 1157,
    "cluster_label": null,
    "config": "{\"spark.executor.memory\":\"20g\",\"spark.executor.cores\":\"4\"}",
    "id": 3
}

You can list, stop, and delete a Spark App as shown below.

//List the Spark Apps
qds.py --token=API-TOKEN --url=https://api.qubole.com/api --vv app list

//Stop a Spark App by specifying its ID
qds.py --token=API-TOKEN --url=https://api.qubole.com/api --vv app stop 343

//Delete a Spark App by specifying its ID
qds.py --token=API-TOKEN --url=https://api.qubole.com/api --vv app delete 343
Performing ELB Log Analysis

Here is an example, which shows how you can use the Spark Job Server to do ELB log analysis:

  1. Create a new app, which allows concurrent execution of SQL queries.

    $ qds.py app create --name elb-log-analysis-demo --config zeppelin.spark.concurrentSQL=true
    
  2. Submit elb-parsing-definitions.py script to this app ID.

    $ qds.py sparkcmd run --script_location=elb-parsing-definitions.py --cluster-label spark --app-id 3
    
  3. Submit the elb-log-location.py script to this app ID. This specifies the ELB log location and registers the cached data as a temporary table. You can execute this step multiple times to cache different data locations in memory as different tables.

    $ qds.py sparkcmd run --script_location=elb-log-location.py --cluster-label spark --app-id 3
    
  4. Now that there is data in memory, it can be analyzed by running the following queries:

    $ qds.py sparkcmd run --sql 'select ssl_protocol, count(*) as cnt from logs group by ssl_protocol order by cnt desc' --cluster-label spark --app-id 3
    
    $ qds.py sparkcmd run --sql 'select * from logs where backend_status_code="500"' --cluster-label spark --app-id 3
    
    $ qds.py sparkcmd run --sql 'select * from logs order by backend_processing_time desc limit 10' --cluster-label spark --app-id 3
    

All these queries would return quickly because they use in-memory data.

But it is important to note that you can run any other unrelated query or program and even that would also return quickly because it would execute against an already running app. So, for example, you can run the following command:

$ qds.py sparkcmd run --sql 'select count(*) from default_qubole_memetracker' --cluster-label spark --app-id 3

The above query would return quickly as well.

For more information about Spark, see Spark.

For more information on using query engines, see Engines.

Identity and Access Management

Managing Your Accounts

You can have multiple accounts with one default account. Choose My Accounts in the Control Panel to add and manage accounts.

My Accounts is displayed as shown in the following figure.

_images/MyAccountsGCP.png

Switch to a different account if you prefer to use that account.

You can also click the accounts drop-down list against Action to switch to a different account or to clone an account.

See the following sub-topics for more information:

Configuring a Github Token

Under GitHub Token, click Configure to set a GitHub profile token in the current Qubole account. You can configure a GitHub token only in the current Qubole account or switch to the account in which you want to configure the GitHub token. See generate-git-token for more information.

After clicking Configure, a dialog to set the GitHub token is displayed. Add the token and click Save. Once a GitHub profile is configured, it is considered as a per account/per user setting. You can link notebooks to that GitHub profile after configuring the token. See GitHub Version Control for Zeppelin Notebooks for more information.

QDS encrypts GitHub tokens.

Switching Accounts

Do one of the following to switch to a different account:

  • Click on the drop-down next to the name of the current account near the top right of each page:

    _images/SwitchAccount.png

    Choose the account you want to switch to.

  • Navigate to the My Accounts page and choose the account you want to switch to.

Using API Tokens

Only the API token of the default account can be reset. API tokens are used to authenticate with the API. An API token is per user, per account; this implies that a user who is part of 10 accounts has 10 API tokens, and a user with a single account has one API token.

An API token can be used to schedule jobs using external schedulers such as cron, but it is not required when jobs are scheduled using the Qubole scheduler. Jobs are shown as run by the user whose API token is used. If you want a single user to own all scheduled jobs, create a common user to run them.

Managing Profile

To edit your profile and change the password, click My Profile in the Control Panel page. The My Profile page is displayed as shown in the following figure.

_images/MyProfile.png

Modify the name in the Full Name text field. To change the password as well, click Change Password; otherwise, click Save.

When you click Change Password, you get additional text fields to change the password, as shown in the following figure.

_images/MyProfileCPwd.png

Fill in the Current Password and New Password text fields. Make sure your password meets the requirements described below.

Confirm the new password in the Confirm Password text field. Click Save to save your name and password changes.

Handling QDS Account Password Expiration and Renewal

After 6 months, your password expires and a new password is required. QDS sends periodic alerts about password expiration, prompting you to renew the password.

If you forget to change the password before its expiration and try to log in, Qubole redirects you to the Reset Password page.

Qubole has added a password strength indicator on the Sign Up, Change Password, and Forgot Password pages. The password must contain:

  • A minimum of 8 characters
  • At least one uppercase letter
  • At least one special character (all special characters are accepted)
  • At least one number

Here is an example of the password strength indicator on the Change Password page.

_images/PasswordStrength.png

Here is an example of the password strength on the User Activation page.

_images/UserActivation.png
Managing Roles

Watch video.

Manage roles from the Manage Roles page in the QDS Control Panel.

A role consists of one or more policies; a policy specifies a resource and the actions that can (or cannot) be performed on it; see Resources, Actions, and What they Mean. Assign roles to groups on the Manage Groups page. When inviting new users to join the account (via the Manage Users page), you can assign them to any existing group; the roles assigned to that group control what the user can do.

The sections that follow provide more information:

System-defined Roles

QDS creates these roles for every account:

  • system-admin - Provides full access to all QDS resources, via a group with the same name. When you sign up for a free trial of QDS, you are assigned to the System-admin group by default.
  • system-user - Allows users who belong to the group with this name to perform a more restricted set of actions, as shown on the Manage Roles page. This set of capabilities includes read access to every QDS resource, and should be sufficient for most QDS users who are not system administrators.
User-defined Roles

In addition to these system-defined roles, you can create roles to control access to specific resources. Assign a role to a group to control the actions of the users in that group. You can remove or modify any role you create.

When creating a new role, you may find it easiest to clone an existing role, then modify it by adding, removing, or changing policies. (You can’t directly modify the System-admin or System-user roles.)

Resources, Actions, and What they Mean

This section lists QDS resources (Account, Commands, etc.), the actions (create, read, update, etc.) that can be controlled on each (Access allowed or denied), and the effect of allowing or denying each action.

Points to Note
  • The default state of access for all actions on all resources is deny: if you don’t include a resource in a policy, all forms of access to that resource are denied by that policy; similarly, if you include a resource, only the actions you specifically allow are allowed by that policy for that resource.

  • If policies within a role conflict, explicitly or implicitly, the most restrictive policy takes precedence. This is useful for creating exceptions.

    For example, if a role includes a policy that specifies allow, Cluster, all, and another policy that specifies deny, Cluster, terminate, users belonging to a group that has that role will be able to perform all cluster tasks except stopping a cluster. Similarly, if you wanted to create a role that confers broad, but not unlimited capabilities, you could start out by cloning the system-admin role, and then explicitly deny access to certain resources, or certain actions on certain resources. For example, a role containing the two policies allow, All, all and deny, Account, all allows members of the group that has that role to do everything in QDS except see and manage account settings.

  • By contrast, when two or more roles that contain conflicting policies are assigned to the same group, the least restrictive role applies. For example, if Role1 denies create access to commands and Role2 allows it, and both roles are assigned to Group-DS, Role2’s policy takes precedence, allowing users in Group-DS to create and run commands.

Resources and Actions
  • All:

    • all
      • allow: allows all actions on all the resources listed below, providing complete control over the QDS resources in this account. By default, the person who originally signed up for the QDS account has this capability, by virtue of belonging to the system-admin group.
      • deny: denies all access to QDS. (But it’s better to simply disable the user from the Manage Users page.)
    • create - allows or prevents the create action on all QDS resources that support it.
    • read - allows or denies read-only access to all QDS resources.
    • update - allows or prevents the update action on all QDS resources that support it.
    • delete - allows or prevents the delete action on all QDS resources that support it.
  • Account:

    These capabilities are normally reserved for the system-admin role. The system-user role allows only read access to account settings.

    • all - allows or prevents all the actions that can be performed on the Account Settings and My Accounts pages in the QDS Control Panel.
    • read - allows or prevents read-only access to the account settings for this account.
    • update - allows or prevents all the actions that can be performed on this account on the Account Settings page.
    • auth_token - allows or prevents the visibility of API tokens and the ability to reset them. API tokens provide access to the APIs; users must have an API token to access the APIs.
  • App:

    These capabilities relate to the actions that can be performed on Spark applications by means of the Spark application server.

    • all - allows or prevents all the actions that can be performed on Spark applications.
    • create - allows or prevents creating Spark applications. The system-user role allows creating applications.
    • read - allows or denies read-only access to Spark applications.
    • start - allows or prevents starting Spark applications.
    • stop - allows or prevents stopping Spark applications.
    • delete - allows or prevents deleting Spark applications.
  • Clusters:

    Cluster configuration is normally reserved for the system-admin role. The system-user role allows viewing and starting clusters, but not creating, modifying, stopping or deleting them.

    • all - allows or prevents all the actions that can be performed on the Clusters page in the QDS Control Panel.
    • create - allows or prevents creating new clusters.
    • read - allows or prevents viewing configured clusters on the Clusters page.
    • update - allows or prevents modifying existing clusters.
    • delete - allows or prevents deleting clusters.
    • start - allows or prevents starting clusters manually. The system-user role allows this. (Many commands, such as Hadoop and Spark commands, can start a cluster automatically; that happens independently of this setting. Users who are allowed to run commands will have clusters started for them as needed.)
    • terminate - allows or prevents stopping clusters manually. (By default, idle clusters are shut down automatically.)
    • clone - allows or prevents cloning clusters.
  • Commands:

    You can allow or deny the following actions on all types of command supported by QDS (as shown on the Workbench page), or on one or more types you specify here (via the field labelled Select a command type).

    • all - allows or prevents all the actions that can be performed on the command type(s) you specify, or on all command types. For example, a policy specifying Allow, Commands, Hive Query, all allows one to create, read, update, and delete Hive queries.
    • create - allows or prevents creating commands of the specified type(s), or all types. For example, a policy specifying Allow, Commands, create (but not specifying any command type) allows one to create any type of command or query; this policy is part of the system-user role.
    • read - allows or prevents read-only access to the specified command type(s), or all types. For example, a policy specifying Allow, Commands, Hive Query, read allows one to see Hive queries that have been run, and their output, but not to run, modify, or delete them.
    • update - allows or prevents modifying commands of the specified type(s), or all command types. For example, a policy specifying Allow, Commands, update allows a user to kill a command or query issued by another user.
    • delete - deprecated; has no effect.
  • Data Preview:

    You can use the Data Preview resource to restrict the visibility of sensitive data on the Qubole UI.

    • read - allows or prevents read-only access to data in the Qubole user interface on the Explore, Analyze, and Notebooks screens.
  • Data Connections:

    Apart from read, Data Connection actions are normally reserved for the system-admin role.

    • all - allows or prevents all the actions that can be performed on data stores.
    • create - allows or prevents creating data stores.
    • read - allows or prevents read-only access to data stores. The system-user role allows only this type of access.
    • update - allows or prevents updating data stores.
    • delete - allows or prevents deleting data stores.
  • Data Connection Templates:

    Apart from read, Data Connection Template actions are normally reserved for the system-admin role.

    • all - allows or prevents all the actions that can be performed on data connection templates.
    • create - allows or prevents creating data connection templates.
    • read - allows or prevents read-only access to data connection templates. The system-user role allows only this type of access.
    • update - allows or prevents updating data connection templates.
    • delete - allows or prevents deleting data connection templates.
  • Environments and Packages:

    Apart from read, Environments and Packages are normally reserved for the system-admin role.

    • all - allows or prevents all the actions that can be performed on Package Management environments.
    • read - allows or prevents read-only access to Package Management environments.
    • update - allows or prevents write access to Package Management environments.
    • delete - allows or prevents deleting a Package Management environment.
    • manage - allows or prevents editing Access Control Lists (ACLs) on Package Management environments. This capability allows one to extend or prevent other users’ access to the Package Management environment, and is granted to the system-admin role by default.
  • Folder:

    These actions pertain to notebook folders and Notebook Dashboards, and apart from read, are normally reserved for the system-admin role.

    • all - allows or prevents all the actions that can be performed on notebook folders.
    • read - allows or prevents read-only access to notebook folders.
    • write - allows or prevents write access to notebook folders.
    • manage - allows or prevents editing Access Control Lists (ACLs) on notebook folders. This capability allows one to extend or prevent other users’ access to the folders, and is granted to the system-admin role by default.
  • Groups:

    Apart from read, actions related to managing groups are normally reserved for the system-admin role.

    • all - allows or prevents all the actions that can be performed on groups.
    • create - allows or prevents creating a group.
    • read - allows or prevents read-only access to groups.
    • update - allows or prevents modifying a group.
    • delete - allows or prevents deleting a group.
  • Notes:

    The system-admin role is allowed all Notes actions.

    • all - allows or prevents all the actions that can be performed on Notebooks.
    • create - allows or prevents creating notes and notebooks.
    • read - allows or prevents read-only access to notebooks.
    • update - allows or prevents modifying notes and notebooks.
    • delete - allows or prevents deleting notes and notebooks.
  • Jupyter Notebook:

    The system-admin role is allowed all Jupyter Notebook actions.

    • all - allows or prevents all the actions that can be performed on Jupyter Notebooks.
    • create - allows or prevents creating Jupyter notebooks.
    • read - allows or prevents read-only access to Jupyter notebooks.
    • update - allows or prevents modifying Jupyter notebooks.
    • delete - allows or prevents deleting Jupyter notebooks.
  • Notebook Dashboards

    The system-admin role and the owner of the associated notebook are allowed all actions on a dashboard.

    • all - allows or prevents all the actions that can be performed on Dashboards.
    • create - allows or prevents creating dashboards.
    • read - allows or prevents read-only access to dashboards.
    • update - allows or prevents modifying dashboards.
    • delete - allows or prevents deleting dashboards.
    • execute - allows or prevents executing dashboards.
  • Object Storage

    These capabilities relate to the actions you can perform on your QDS account’s Cloud storage, using the QDS Explore page (see the AWS or Azure instructions) or from the Analyze or Notebooks page.

    • all - allows or prevents all the actions that can be performed on the storage. By default, the system-admin role has this policy.
    • read - allows or prevents read-only access to the storage. By default, the system-user role has this policy.
    • upload - allows or prevents uploading a file.
    • download - allows or prevents downloading a file.
  • Quest

    The system-admin role is allowed all Quest actions.

    • all - allows or prevents all the actions that can be performed from the Quest UI.
    • create - allows or prevents creating streaming pipelines.
    • read - allows or prevents read-only access to the streaming pipelines.
    • update - allows or prevents modifying streaming pipelines.
    • delete - allows or prevents deleting streaming pipelines.
  • Roles:

    Apart from read, actions related to managing roles, as described on this page, are normally reserved for the system-admin role.

    • all - allows or prevents all the actions described on this page.
    • create - allows or prevents adding policies and roles.
    • read - allows or prevents read-only access to roles.
    • update - allows or prevents modifying roles.
    • delete - allows or prevents deleting roles.
  • Scheduler:

    • all - allows or prevents all the actions that can be performed on the Qubole Scheduler.
    • create - allows or prevents creating a new job in the scheduler. The system-user role allows this.
    • read - allows or prevents read-only access to the scheduler.
    • update - allows or prevents modifying jobs in the scheduler.
    • delete - allows or prevents deleting jobs in the scheduler.
    • clone - allows or prevents cloning jobs in the scheduler. The system-user role allows this.
  • Schedule Instance:

    The system-user role allows these actions:

    • all - allows or prevents killing and re-running scheduled jobs.
    • kill - allows or prevents killing scheduled jobs.
    • rerun - allows or prevents re-running scheduled jobs.
  • Users:

    Apart from read, actions related to managing users are normally reserved for the system-admin role.

    • all - allows or prevents viewing and managing account users.
    • read - allows or prevents viewing account users.
    • manage - allows or prevents managing account users.
Adding, Removing, and Modifying Roles
Creating a Role

Click the add icon ADDIcon4 to create a new role. The Create a New Role page is displayed as shown in the following figure.

_images/CreateRole.png

Enter a name in the Role Name text field. Add a new policy to the role. In the Policies field, select either Allow or Deny access.

Select a resource from the Resource drop-down list, and select one or more actions (all, create, read, or update) from the Action drop-down list.

Qubole supports create, update, delete, start, terminate, and clone actions that a role can perform on a cluster as shown in the following figure.

_images/ClusterActionRoles.png

Qubole supports selecting the command type when adding or editing a policy for a role. To set a command type, select the Commands resource; a new text field for the command type appears, as shown in the following figure.

_images/RoleCommandType.png

Clicking in the text field displays the list of command types. You can add more than one command type in the text field. You can also select the create, read, update, and delete actions from the Actions field.

Click Add Policy to add a new policy for the new role. You can add more than one policy to a role.

For example, to create a policy to allow only Hive and Presto queries, perform the following steps:

Note

Presto is not currently supported on all Cloud platforms; see QDS Components: Supported Versions and Cloud Platforms.

  1. By default, Allow is selected in Access.

  2. Select Commands in Resources and select Hive Query and Presto Query from the command type text field. The following figure shows selecting more than one command type.

    _images/AddRoleCommandType.png
  3. Select actions in the Actions text field.

  4. Click Add Policy. The policy details are displayed as shown in the following figure.

    _images/RoleCommandTypeAllow.png

As another example, consider denying access only to the Hive and Presto command types. Perform the following steps:

  1. In Access, keep the default selection, Allow.
  2. In Resources, select Commands. This would allow access to all command types.
  3. Select actions in the Actions text field.
  4. Click Add Policy.
  5. In Access, select Deny (Allow is selected by default).
  6. In Resources, select Commands.
  7. In the command-type text field, select Hive Query and Presto Query.
  8. Select actions in the Actions text field.
  9. Click Add Policy.

This combination of policies denies access only to the Hive and Presto command types. (You create two policies when you want to deny specific command types while allowing the rest.)

The following figure shows an example of adding a policy to a role.

_images/AddPolicy.png

After adding the role name and policies, click Create Role. Click Cancel to discard the role.

Cloning a Role

In the Action column of a role, click the down arrow and select Clone to clone that role. The Create a New Role page is displayed with the role name prefixed with clone - and the policies of the parent role, as shown in the following figure.

_images/CloneRole.png
Modifying a Role

In the Action column, click the downward arrow and select Modify to edit a role.

_images/ModifyRoles.png

Make the changes and click Update. Click Cancel to retain the previous settings.

Managing Groups

Watch video.

QDS creates the following groups by default for every account: system-admin, system-user, dashboard-user, and everyone. These groups have roles associated with them, which provide access control to all the users of that group.

The following list describes each default group, the role associated with it, its users, the actions those users can perform, and the actions that can be performed on the group itself.

  • system-admin (associated role: system-admin)

    • Users: any user who creates the account is added to the system-admin group by default.
    • User actions: manage users (add or delete users), manage roles (add or delete roles), create a new group, and clone a group.
    • Actions on the group: this group cannot be deleted or renamed; it can be cloned.
  • system-user (associated role: system-user)

    • Users: any user who is invited to join the account through the QDS UI is automatically added to the system-user group if no group is specified when inviting the user.
    • User actions: a restricted set of actions based on the assigned roles.
    • Actions on the group: this group cannot be deleted or renamed; it can be cloned.
  • everyone (associated role: none by default)

    • Users: all users of an account are added to the everyone group by default.
    • User actions: a restricted set of actions based on the assigned roles. A user with the appropriate privileges can add, modify, or delete any role associated with this group.
    • Actions on the group: this group cannot be deleted or renamed; it can be cloned; the users of this group cannot be modified.
  • dashboard-user (associated role: dashboard-user)

    • Users: any user who is invited to join the account with this group through the QDS UI.
    • User actions: a restricted set of actions based on the assigned roles.
    • Actions on the group: this group cannot be deleted or renamed; it can be cloned.

In the Control Panel page, click Manage Groups to create and modify user groups. The Manage Groups page is displayed as shown in the following figure.

_images/ManageGroups.png

Click the downward arrow in the Action column to modify or see users in a group. Click Modify and the Manage Group Members and Group Roles page is displayed as shown in the following figure.

_images/ModifyGroups.png

Add/remove users in the Select Group Members field and add/remove roles in the Select Group Roles field. Click Update after making the changes.

Add a group by clicking the add icon ADDIcon3. The Create a New Group page is displayed as shown in the following figure.

_images/CreateGroup.png

Enter a name in the Group Name text field. Add users in the Select Group Members field and roles in the Select Group Roles field. Click Create Group. Click Cancel to discard the group.

Cloning a Group

In the Action column of a specific group, click the downward arrow and select Clone for cloning that group. The Create a New Group page is displayed with the group name prefixed with clone - and the group members and group roles of the parent group.

Managing Preferences

To edit your account preferences, click My Preferences on the Control Panel page. The My Preferences page appears:

_images/MyPreferencesgcp.png

The Keyboard shortcuts option in the General section allows you to enable or disable keyboard shortcuts.

The Composer Auto-suggestions option in the Analyze Interface section allows you to enable or disable Composer auto-suggestions.

Managing Users

Watch video.

Use this tab to add and manage users in an account.

You can add users using any of the methods described in the topics that follow:

Understanding How Qubole Adds a New User

Qubole adds you to a QDS account in either of the following ways:

  • When you sign up as a new user:
    1. Qubole sends you an email with a confirmation link.
    2. After you click the confirmation link, Qubole activates you as a new user and activates your account as well.
  • When you invite a user to a QDS account:
    1. Qubole sends the owner of that QDS account an email to approve the new user.
    2. The owner of that QDS account or system-admin approves the new user in the Manage Users > Pending Users tab.
Managing Users through the Control Panel

To manage users, choose Manage Users in the Control Panel:

_images/ManageUsers.png
  • Click Action to modify or disable a user’s access.

  • Click the user icon InviteUser to invite a new user.

    The following page appears:

    _images/InviteUser.png

    Enter the user’s email address and select a group from the available Groups. Click Send to send the invitation, or Cancel to change or discard it.

Creating Service Users for APIs

Use Qubole’s Service user type to create users for API use only. A Service user is validated automatically and added to the account immediately. Any user with user management permissions can create a Service user and update its authentication token. Note, however, that Service users cannot log in through the UI and can be used only through APIs. Additionally, once created as a Service user, such a user cannot be added as a regular user in any other account.

To enable this feature for your account, create a ticket with Qubole Support.

For more information, see Account Settings.
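
Once a Service user's authentication token is available, it can be used to call the QDS REST API directly. The following is a minimal, hedged sketch: the api.qubole.com endpoint and the v1.2 commands path follow Qubole's common REST API conventions, but the correct endpoint for your environment is listed in the REST API Reference, and the token value is a placeholder.

# Minimal sketch: list recent commands with a Service user's API token.
# Replace the endpoint with the one for your QDS environment and the token
# with the Service user's token (both are placeholders here).
export AUTH_TOKEN="<service-user-api-token>"
curl -s -H "X-AUTH-TOKEN: ${AUTH_TOKEN}" \
     -H "Accept: application/json" \
     "https://api.qubole.com/api/v1.2/commands"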

How QDS Improves Computing Efficiency

Qubole provides advanced capabilities that automatically optimize computing efficiency and help save you money. These include automating cluster sizing and lifecycle management so as to match computing resources with actual usage. Qubole also provides automatic and intelligent purchasing options across different tiers of computing resources.

Cluster Lifecycle Management

QDS automatically shuts down a cluster when there are no more active jobs running on it. You can always shut down the cluster manually, but in practice Qubole has found that the vast majority of clusters are auto-terminated. Automated cluster lifecycle management protects you against accidentally leaving an idle cluster running, saving unnecessary compute charges.

For more information about lifecycle management, see Cluster Termination under Introduction to Qubole Clusters.

Autoscaling

QDS dynamically adds or removes cluster nodes to match the cluster workload; this is called autoscaling. Autoscaling is particularly effective for workloads that are less predictable and that involve many users running jobs on the cluster concurrently. For such workloads, Qubole has seen autoscaling provide substantial savings compared with a statically sized cluster (see the TubeMogul case study).

Autoscaling is available for Hadoop (Hive), Presto, and Spark clusters.

Using Active Directory and SAML Single SignOn on Google Cloud Platform

If your organization already uses Active Directory or SAML Single SignOn to manage user authorization, you can continue using your existing identity management system with Qubole on GCP.

Qubole supports Active Directory and SAML single sign-on authorization through Google’s Cloud Identity platform. For details on configuring SAML for GCP environments, see Using your existing identity management system with Google Cloud Platform in the Google Cloud Platform documentation.

Managing Access Permissions and Roles

QDS allows system administrators to create custom roles to control access to specific resources. A role has a list of policies that determine which QDS capabilities a user has access to. See Manage Roles for more information.

To add, modify, and delete roles, choose Manage Roles from the Control Panel.

Create a Role

Creating a Role explains the steps to add a new role.

Adding a Policy to a Role

Perform the following steps to set a policy with access and resource permissions for a role. You must have system administrator (system-admin) privileges.

  1. Navigate to Manage Roles on the Control Panel page. Click the + button to create a new role, or select the Modify action from the drop-down list in the Action column of an existing role that you want to edit.

    Creating a Role explains how to add a new role. Modifying a Role explains how to modify the access permissions of a role. You can modify the policies of an existing role or set a new policy for a new role.

  2. Set the following policy elements:

    1. Access: Denotes the type of access. Select Allow or Deny.

    2. Resources: You can select All from the drop-down list, or any specific resource such as Clusters or App. By default, the drop-down list shows the Account resource. The following figure shows the available resources for which you can add access policies for a role.

      _images/Role-Resources.png
    3. Depending on the resources, you can set policy actions in the Action text field. Click in the text field to see available actions for that resource, such as create and update. Select the action that you want to set for the role.

    4. Click Add Policy to add the policy for the role.

Deleting a Role

You can delete only custom roles; system-defined roles cannot be deleted. See role-types for more information.

To remove a custom role, click the down-arrow in the Action column on the Manage Roles page, and select Delete.

Enabling Data Encryption in QDS

QDS supports the encryption of data in Google Cloud Storage using Cloud Key Management Service (KMS) encryption keys. For information and instructions on encrypting data at rest in cloud storage, see Using customer-managed encryption keys in the GCP documentation. Note that the Compute Service Account for your GCP project must have access to the KMS encryption key used to encrypt your data.
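
For example, you might grant the Compute Service Account access to a customer-managed key with a gcloud command along the following lines. This is a hedged sketch: the key, keyring, location, service-account address, and role are placeholders, so verify the exact role and names against the GCP documentation referenced above.

# Hypothetical example: allow the project's Compute Service Account to use a
# customer-managed KMS key (all names below are placeholders).
gcloud kms keys add-iam-policy-binding my-key \
    --keyring=my-keyring \
    --location=us-central1 \
    --member="serviceAccount:compute-sa@my-project.iam.gserviceaccount.com" \
    --role="roles/cloudkms.cryptoKeyEncrypterDecrypter"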

Managing Account Features

The Account Features tab provides a self-service interface for you to enable or disable certain Qubole features for your account before Qubole provides them through Gradual rollout.

Navigate to Control Panel > Account Features tab. On the Account Features page, you can see the list of features, feature description, last status, and roll-out options.

Note

To enable or disable account features, you must have the account update permission. For more information on managing roles and permissions, see Managing Roles.

The following section explains the different options available on the Account Features UI.

Feature Name

The Feature Name column lists the names and short descriptions of the features. To know more about any feature, click the respective feature name.

Last Status

The Last Status column displays the date and time of the previous change and the email address of the user who made it. If Qubole made the change, Qubole is displayed instead of an email address.

Rollout

This is a toggle button to enable or disable the respective feature for your account.

_images/feature_optin.png
Labels
Beta

This label indicates that the respective feature is in Beta and is appropriate for limited production use before it becomes generally available.

Note

The Gradual rollout features are excluded from the Account Features page. After a feature is enabled by Qubole for all the accounts, you can no longer enable or disable that feature on the Account Features page.

Cluster

This label indicates that the respective feature may require a cluster restart for any change to take effect.

QDS Components: Supported Versions and Cloud Platforms

Supported Versions

The following table shows the currently supported open source versions of QDS components and the Cloud platforms on which they run.

Important

Any other software available on the cluster is subject to change or removal. If you require a specific version of other software, Qubole strongly recommends installing it on the cluster through the node bootstrap.

QDS Component Currently Supported Versions Supported on (Cloud Platforms)
Airflow 1.10.0, 1.10.2QDS, 1.10.9QDS (beta) AWS, Azure, Oracle OCI, GCP
Cascading All AWS, Azure, Oracle OCI
Hadoop 2 2.6.0 AWS, Azure, Oracle OCI, GCP
Hive 1.2 (deprecated), 2.1.1, 2.3, 3.1.1 (beta) AWS
1.2 (deprecated), 2.1.1, 2.3 Azure, Oracle OCI
2.1.1, 2.3 GCP
Java
  • Hadoop 2 and Spark: Java 1.7 is the default version, but 1.8 can be enabled through the node bootstrap. (AWS, Azure, Oracle OCI)
  • Presto: only Java 1.8 is supported. (AWS, Azure)
  • GCP: only Java 1.8 is supported.
MapReduce 0.20.1 AWS
2.6.0 AWS, Azure, Oracle OCI, GCP
Pig 0.11, 0.15, 0.17 (beta) AWS
0.11 Azure, Oracle OCI
Presto 0.193 (deprecated), 0.208, and 317 AWS
0.193 (deprecated), 0.208, and 317 Azure
0.208, 317 GCP
Python 2.6, 2.7, and 3.5 (Airflow supports only 2.7 and later; see Can I use Python 2.7 for Hadoop tasks? for more information) AWS
2.7 and 3.5 Azure, Oracle OCI
3.6.x and 3.7 GCP
R 3.3.3 AWS, Oracle OCI
3.3.2 Azure
3.5.2 GCP
RubiX 0.2.11 AWS, Azure
Scala 2.10 for Spark versions older than 2.0.0 AWS
2.11 for Spark 2.0.0 and later AWS, Azure, Oracle OCI, GCP
Spark (See also Spark Version Support) 1.6.2, 2.0.2, 2.1.1, 2.2.0, 2.2.1, 2.3.1, 2.4.0, 2.4.3 AWS
2.0 (deprecated), 2.1.0 (deprecated), 2.1 (deprecated), 2.2.1, 2.3.2, 2.4.3 Azure, Oracle OCI
2.3.2, 2.4, 2.4.3 GCP
Sqoop 1.4.7 AWS
1.4.6 AWS, Azure, Oracle OCI
Tez 0.7, 0.8.4, 0.9.1 AWS
0.7, 0.8.4 Azure, Oracle OCI
0.8.4 GCP
Zeppelin (notebooks) 0.6.2, 0.8.0 AWS, Azure, Oracle OCI, GCP
JupyterLab (notebooks) 1.2.3 (beta) AWS, Azure, GCP
Query Engine Version Lifecycle Phases

A query engine’s version lifecycle phases are briefly explained in the following table.

  Beta Supported Deprecated Expired
Production use No Production SLAs Available Available but not recommended Not available; should upgrade
Incident Support Available Available Available Not available; should upgrade
Security Updates Available Available Not available; should upgrade Not available; should upgrade
Bug fixes Available Available Not available; should upgrade Not available; should upgrade
Feature requests Will be considered Will be considered Not available; should upgrade Not available; should upgrade
Visibility Visible in UI Visible in UI Visible in UI Not listed in UI; cannot start new clusters with expired versions of software
Upgrade Customer initiated Customer initiated Customer initiated Qubole will initiate automatic upgrade upon restart

For more information on the deprecated and end-of-life timelines of engine versions, see Deprecated Versions.

Deprecated Versions

The following table shows deprecated versions and corresponding timelines.

Engines on QDS Engine Versions Deprecation Timeline Expiry Timeline
Airflow 1.7 May 2019 October 2019
1.8.2 October 2019 ETA: R60
Hadoop 1 October 2018 April 2019
2.8 May 2019 November 2019
Hive 0.13 Not supported for new customers. For legacy customers, support ended in February 2019. February 2019
1.2 May 2020 ETA: R60
Presto 0.180 October 2019 May 2020
0.193 February 2020 ETA: R60
Spark 1.3.1, 1.4.0, 1.4.1, 1.5.0 2017 ETA: R60
1.5.1, 1.6.0, 1.6.1, 2.0.0, 2.1.0 November 2018 ETA: R60
Spark Version Support

QDS supports two major versions of Spark, 1.x and 2.x. As of April 2017, Qubole began phasing out support for older versions. Those versions are marked deprecated in the drop-down list of available versions on the Clusters page of the QDS UI. You can still launch clusters running a version marked deprecated, but:

  • No new features or bug fixes will be applied to this version.
  • The version will no longer be eligible for Qubole Customer Support; tickets will not be addressed.

Note

In the Spark Version drop-down list on the Clusters page of the QDS UI, latest means the latest open-source maintenance version. If you choose latest, Qubole Spark is automatically upgraded when a new maintenance version is released. For example, if 2.x latest currently points to Spark 2.x.y, then when 2.x.(y+1) is released, QDS clusters running 2.x latest automatically start using 2.x.(y+1) when the cluster is restarted.

Engine Versions Deprecation and Expiration FactSheet

Hive Version 1.2 Deprecation

Hive 1.2 has been live on QDS since 2015. The lifecycle of a supported version on a data platform usually lasts 24 months. Accordingly, Qubole deprecated Hive 1.2 as of May 2020.

Open source Hive has stopped supporting Hive 1.2 in the following respects:

  • No new features are ported into Hive version 1.2
  • No security patches are available

Qubole has added support for Hive versions 2.1.1 and 2.3.

With the end of support of Hive 1.2 (deprecated), Qubole urges its users to move to higher Hive versions (preferably Hive 2.3) as appropriate.

New Features supported in Hive Version 2.3

Features that are supported in Hive version 2.3 are:

  • Vectorized Query Execution
  • Predicate Pushdown for Parquet
  • Dynamic Runtime Filtering
  • Better scalability with Application TimelineServer 1.5
  • Multi-Instance HiveServer2
  • CBO Improvements
  • Support for Apache Ranger & AWS Glue
  • Native JDBC Storage Handler
  • Support for GROUPING function
  • HiveServer2 UI
  • Significantly better performance with fewer resources

To migrate from Hive version 1.2 to Hive version 2.3, perform the following steps:

  1. Clone Hadoop (Hive) clusters configured with Hive version 1.2.
  2. Change the Hive version to 2.3 in such clusters.
  3. Test accounts with the new Hive version.

To know more, see Understanding Hive Versions.

FAQs around Hive 1.2 Deprecation
  1. Would Hive 1.2 queries run in later Hive versions supported by Qubole?

    Qubole-introduced features will continue to work as in Hive 1.2; however, open-source features will behave as in Apache Hive 2.3. You are responsible for testing query compatibility when moving from Hive 1.2 (deprecated) to higher versions.

  2. Does Hive metastore DB Schema require an upgrade?

    Qubole will take care of upgrading the schema of Hive metastore databases that it manages, and recommends that customers upgrade any custom metastore database themselves.

  3. Should I upgrade the custom thrift server (if in use)?

    Yes. The Hive 2.1.1/2.3 client does not support the Hive 1.2 HMS, so you should upgrade the custom thrift server.

  4. Does it impact a Spark/Presto user, who is using custom Hive 1.2 Metastore Schema or Thrift Server?

    No. Spark and Presto users who use a custom Hive 1.2 metastore schema or thrift server are not impacted.

  5. What is exactly impacted by Hive 1.2 deprecation?

    Hive 1.2 deprecation only impacts Hadoop (Hive) clusters. Qubole does not intend to deprecate use of Hive 1.2 schema in custom Hive metastores.

  6. I am currently using Hive on MapReduce; should I move to Tez?

    Though Hive on MapReduce is supported in Qubole Hive 2.1.1 / 2.3, it is deprecated in Open Source Hive. Qubole strongly recommends moving to Tez.

QDS Administration How-to Topics

The following topics describe how to perform QDS administration tasks:

Enable Recommissioning

Recommissioning allows clusters to reactivate nodes that are in the process of being decommissioned if the workload requires it.

Recommissioning on clusters is not enabled by default. It can be enabled as a Hadoop override setting:

mapred.hustler.recommissioning.enabled=true

To disable the feature, set mapred.hustler.recommissioning.enabled=false.

Set Object-level Permissions in Qubole

Qubole allows you to set cluster-level permissions through the UI as described in Managing Cluster Permissions through the UI.

Qubole allows you to set object-level permissions for a notebook through the Notebooks UI as described in Managing Notebook Permissions and Managing Folder-level Permissions.

At the QDS account level, Qubole allows you to control access to objects/resources by creating roles that are set in Control Panel > Manage Roles. For more information, see Resources, Actions, and What they Mean and Managing Roles.

Run Utility Commands in a Cluster

You can run the nodeinfo command on a cluster node to get information about it.

Syntax:
nodeinfo [is_master | master_ip | master_public_dns]

nodeinfo is_master reports whether the node is a coordinator (1) or a worker (0).

nodeinfo master_ip reports the IP address of the coordinator node.

nodeinfo master_public_dns reports the public DNS address of the coordinator node.

Add $(nodeinfo master_public_dns) to a node bootstrap script or any shell script to get the public DNS hostname of the Coordinator node.

See Understanding a Node Bootstrap Script for more information.
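
For example, a node bootstrap script might use nodeinfo to run a step only on the coordinator node and to record the coordinator's public DNS name. This is a minimal sketch; the coordinator-only step is a placeholder for your own logic.

#!/bin/bash
# nodeinfo is_master prints 1 on the coordinator node and 0 on worker nodes.
if [[ "$(nodeinfo is_master)" == "1" ]]; then
    echo "Running coordinator-only setup"
    # <coordinator-only setup goes here>
fi

# Record the coordinator's public DNS name for use later in the script.
MASTER_DNS=$(nodeinfo master_public_dns)
echo "Coordinator public DNS: ${MASTER_DNS}"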

You can also use the QDS GUI to see the public and private IP addresses of worker and Coordinator nodes: navigate to the Control Panel > Clusters and click on the number in the Nodes column of a running cluster.

On a cluster node, run the nodeinfo command to check whether the node is a worker or the coordinator, or to get the coordinator node’s IP address and public DNS address. Here is an example of running the nodeinfo command on an AWS cluster node:

[ec2-user@ip-10-111-11-11 /]$ nodeinfo is_master
1
[ec2-user@ip-10-111-11-11 /]$ nodeinfo master_ip
10.111.11.11
[ec2-user@ip-10-111-11-11 /]$ nodeinfo master_public_dns
ec2-54-54-544-544.compute-1.amazonaws.com
[ec2-user@ip-10-111-11-11 /]$

You can also see How do I check if a node is a coordinator node or a worker node?.

Shade Java Classes Using Maven

While working on projects with multiple dependencies, you may sometimes run into conflicting dependencies. For example, consider the following scenario with two projects, where:

  • Project A depends on Project B and C.2.0.jar
  • Project B depends on C.3.0.jar

Usually, Project A downloads C.2.0.jar and Project B downloads C.3.0.jar, and each project adds the jar to its classpath. A problem occurs if C.2.0.jar and C.3.0.jar are incompatible: because Project A depends on Project B, it ends up with both the C.2.0 and C.3.0 jars on its classpath, and the incompatibility between the two jars causes errors.

The error arises only because of Project A’s dependency on Project B, which requires C.3.0.jar. In such a case, Maven allows a project to shade certain packages/classes, that is, to relocate those packages to a different namespace and bundle them with the project.

Let us understand Maven shading with a real-world example.

A Hadoop 2 project depends on guava-11.0 and wants to integrate with Qubole’s autoscaling module (ASCM). ASCM depends on guava-16.0, which is incompatible with guava-11.0. Hence, running Hadoop 2 with ASCM results in conflicting jars, which causes errors.

So ASCM modifies its build process to shade the guava-16.0 classes and bundle them into its own package (ascm-1.0.jar) under a modified namespace.

The modified POM file looks like the following XML.

<build>
  <plugins>
    <plugin>
      ...
    </plugin>
    <plugin>
      ...
    </plugin>

    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-shade-plugin</artifactId>
      <version>2.3</version>
      <executions>
        <execution>
          <phase>package</phase>
          <goals>
            <goal>shade</goal>
          </goals>
          <configuration>
            <artifactSet>
              <excludes>
                <exclude>classworlds:classworlds</exclude>
                <exclude>javax.servlet:*</exclude>
                <exclude>commons-logging:commons-logging</exclude>
                <exclude>com.fasterxml.jackson.core:*</exclude>
                <exclude>logkit:*</exclude>
                <exclude>avalon-framework:*</exclude>
                <exclude>org.apache.maven:lib:tests</exclude>
                <exclude>log4j:log4j</exclude>
              </excludes>
            </artifactSet>
            <relocations>
              <relocation>
                <pattern>com.google</pattern>
                <shadedPattern>com.qubole.shaded.google</shadedPattern>
              </relocation>
            </relocations>
          </configuration>
        </execution>
      </executions>
    </plugin>

    <plugin>
      ...
    </plugin>
  </plugins>
</build>
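
After the shade plugin is configured, building the project produces the shaded artifact. As a quick check (a hedged sketch that reuses the ascm-1.0.jar name from the example above), you can list the jar contents to confirm that the Guava classes were relocated to the com.qubole.shaded.google namespace:

# Build the project; the shade plugin runs during the package phase.
mvn clean package

# Confirm that the Guava classes now live under the relocated namespace.
jar tf target/ascm-1.0.jar | grep com/qubole/shaded/google | head
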
Set up a New Account with QDS

The GCP Quick Start Guide describes how to sign up on QDS. Once you sign up, you can access the QDS user interface (UI) to change your account settings and select the authentication type.

FAQs

The topics that follow provide answers to commonly asked questions:

New User FAQs

Which processing engines does Qubole support?

Qubole supports best-of-breed data processing engines and frameworks for end-to-end data processing. With Qubole’s platform-based approach, you can easily add new open source big data engines and frameworks to ensure platform longevity. By default, Qubole supports the following engines:

_images/engines.png

Click here to learn more.

How do I identify the right engine for SQL-based queries?

All SQL big data engines deal with large data sets and have a distributed computing architecture that provides scalability. These engines offer different degrees of support for speedy processing, interactivity, fault tolerance, and type of workload. The most common types of workloads are batch processing, streaming, and interactive queries.

Engines can be differentiated based on the type of workload, SLA Adherence (fault tolerance), and processing speed. Use the decision tree below to arrive at a SQL engine recommendation.

_images/workflow.png
Should I use Presto or Hive?

While Presto may be the better choice for most scenarios, do not discount Hive; some use cases are too demanding for Presto.

As Presto has a limitation on the maximum amount of memory each task can store, it fails if the query requires a significant amount of memory. While this error handling logic (or a lack thereof) is acceptable for interactive queries, it is not suitable for daily/weekly reports that must run reliably. Hive may be a better alternative for such tasks.

Hive Presto
Optimized for batch processing of large ETL jobs and batch SQL queries on huge data sets. Used for running interactive analytic queries against data sources of all sizes ranging from gigabytes to petabytes.
Mature SQL – ANSI SQL. Less mature SQL (still ANSI compliant).
Easily extensible. Some extensibility, but limited compared to Hive.
Optimized for query throughput. Optimized for latency.
Needs more resources per query. Resource-efficient.
Suitable for large fact-to-fact joins. Optimized for star schema joins (1 large fact table and many smaller dimension tables).
Suitable for large data aggregations. Interactive queries and quick data exploration.
Rich ecosystem (plenty of resources online) Less rich ecosystem (but now improving with big users such as Facebook, Netflix).
Should I use the Notebooks or the Workbench page?

Here are some tips to help you to choose between Notebooks and Workbench:

Jupyter Notebooks Workbench
Use Jupyter Notebooks to perform interactive data analysis and seamlessly build AI/ML models using Spark or custom Python packages. Generally used by data scientists. Use Workbench to perform ad hoc analysis of data in a data lake or other connected data source. Generally used by data analysts and data engineers.
Supports only the Spark engine. Supports multiple engines such as Hive, Spark, Presto, etc.
Provides visualization capabilities. No inbuilt visualization capabilities. You must download result sets in the data visualization tool of your choice.
Allows offline edits (without attaching a cluster). Not applicable.
Does Qubole provide any clusters by default?

By default, every Qubole account provides 4 clusters that support Airflow, Spark, Hadoop, and Presto. You will find a default label attached to one of the clusters. Unless you specify a cluster, the REST API queries are directed to the default cluster.

To change the default cluster, drag the default label and drop it in front of the cluster that you want to make the new default.

For more information, see Clusters.

How can I get access to additional resources?

Qubole implements a role-based access control (RBAC) system to secure access to your data and environment. Every user is assigned to one or more groups, and each group is assigned one or more roles. Every role specifies a set of policies that define access to features and resources.

_images/newuser-roles.png

Your Qubole platform administrator manages RBAC for your environment. Contact them for any additional permissions you need.

How can I get access to additional features?

Qubole lets you enable and disable features at the account level through a self-service interface. Navigate to the Control Panel > Account Features page to view the list of features, feature descriptions, last status, and roll-out options. For more information, see Managing Account Features.

When should I use Tags?

Add tags to a command or notebook to make it easily identifiable and searchable from the commands list. You can also add a tag as a filter value while searching commands.

_images/tag.png
When should I use Macros?

Use macros to execute commands with a dynamic variable such as the date, time, and so on. To do this, define the variable name and value that you wish to substitute in the command.

Macros are defined using JavaScript.

Note

Only assignment statements are valid. Loops, function definitions, and all other JavaScript constructs are not supported. Assignment statements can use all operators and functions defined for the objects used in the statements.

For more information, see Macros.

When should I use the Query Path?

Use the Query Path to run a query stored in your cloud storage. Qubole reads the query from the path and displays the sample result.

General Questions

The topics that follow provide answers to questions commonly asked about Qubole:

What is the pricing model for Qubole?

See the Qubole Billing Guide for detailed pricing information. See also Qubole Data Platform Pricing.

We also offer custom plans; for these please contact us.

How does Qubole access data in my Cloud object store?

QDS accesses data in your Cloud storage account using credentials you configure when setting up your QDS account. In addition, QDS accesses data in the following ways:

  • For Hive queries and Hadoop jobs, QDS runs a Hadoop cluster on instances that Qubole rents for you. The Hadoop cluster reads and processes the data, and writes the results back to your storage buckets.
  • When you browse or download results from Qubole’s website (UI or the API), Qubole servers read the results from your object store and provide them to you.
How do I control access to a specific object in QDS?

QDS allows you to control access to a particular cluster or notebook through REST APIs. You can also use the Notebooks page in the QDS UI to control access to a notebook.

For more information on managing access to a cluster, see APIs for Qubole on Google Cloud Platform.

For more information on managing access to a notebook, see Managing Notebook Permissions and Managing Folder-level Permissions.

How do I install custom Python libraries through the node bootstrap?

Qubole recommends installing and updating custom Python libraries inside Qubole’s virtual environment after activating it. Qubole’s virtual environment is recommended because it contains many popular Python libraries and has the advantages described in Using Pre-installed Python Libraries from the Qubole VirtualEnv.

You can install and update Python libraries in Qubole’s virtual environment by adding the following script to the node bootstrap.

source /usr/lib/hustler/bin/qubole-bash-lib.sh
qubole-use-python2.7
pip install <library name>
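
For example, a node bootstrap snippet that installs a few specific libraries into the Qubole virtual environment might look like the following sketch; the library names and versions are illustrative only.

# Activate the Qubole virtual environment, then install the required libraries.
source /usr/lib/hustler/bin/qubole-bash-lib.sh
qubole-use-python2.7

# Pin versions where reproducibility matters (illustrative examples).
pip install nltk==3.4.5
pip install requests --upgrade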

For more information on the node bootstrap, see Understanding a Node Bootstrap Script and Running Node Bootstrap and Ad hoc Scripts on a Cluster.

Using Pre-installed Python Libraries from the Qubole VirtualEnv

Qubole also recommends reviewing the list of Python libraries installed in the Qubole virtualenv to determine whether the virtualenv already meets your needs. The advantage is that you do not pay the cost of installing a library that is already present, and the first query runs faster when most of the required packages are already installed in the Qubole virtualenv. The pre-installed libraries in the virtualenv are:

airflow (1.7.0)
alembic (0.8.6)
amqp (1.4.9)
anyjson (0.3.3)
appdirs (1.4.3)
argparse (1.2.1)
awscli (1.11.70)
Babel (1.3)
backports.ssl-match-hostname (3.5.0.1)
beautifulsoup4 (4.5.1)
billiard (3.3.0.23)
boto (2.40.0)
boto3 (1.3.1)
botocore (1.4.93)
bs4 (0.0.1)
celery (3.1.23)
certifi (2016.2.28)
cffi (1.4.2)
chartkick (0.4.2)
Cheetah (2.4.1)
click (6.6)
colorama (0.3.7)
configobj (4.6.0)
croniter (0.3.12)
cryptography (1.7.1)
Cython (0.20.1)
datadog (0.12.0)
decorator (3.3.2)
dill (0.2.5)
Django (1.6.4)
django-extensions (0.9)
docutils (0.13.1)
enum34 (1.1.6)
Flask (0.10.1)
Flask-Admin (1.4.0)
Flask-Cache (0.13.1)
Flask-Login (0.2.11)
Flask-WTF (0.12)
flower (0.9.1)
future (0.15.2)
futures (3.0.5)
gunicorn (19.3.0)
idna (2.2)
inflection (0.3.1)
iniparse (0.3.1)
ipaddress (1.0.17)
itsdangerous (0.24)
Jinja2 (2.8)
jmespath (0.9.0)
kombu (3.0.35)
lxml (2.3)
Mako (1.0.4)
Markdown (2.6.6)
MarkupSafe (0.23)
mpi4py (1.3.1)
mrjob (0.3.5)
MySQL-python (1.2.5)
ndg-httpsclient (0.4.1)
networkx (1.8.1)
nltk (2.0.4)
numexpr (2.6.1)
numpy (1.11.1rc1)
ordereddict (1.1)
packaging (16.8)
pandas (0.18.1)
paramiko (1.7.7.1)
PIL (1.1.7)
pip (1.4.1)
psutil (4.3.1)
publicsuffix (1.0.4)
pyasn1 (0.1.2)
pycparser (2.14)
pycrypto (2.5)
pycurl (7.19.0)
pydot (1.0.2)
Pygments (2.1.3)
pygpgme (0.1)
pyOpenSSL (16.2.0)
pyparsing (2.2.0)
python-dateutil (2.5.3)
python-editor (1.0.1)
python-gflags (2.0)
python-magic (0.4.11)
pytz (2016.4)
PyYAML (3.10)
qds-sdk (1.9.6)
rdbtools (0.1.5, /usr/lib/virtualenv/python27/src/rdbtools)
recordclass (0.4.1)
redis (2.7.6)
requests (2.10.0)
rsa (3.4.2)
s3cmd (1.5.2)
s3transfer (0.1.10)
scikit-image (0.9.3)
scipy (0.13.3)
setproctitle (1.1.10)
setuptools (23.0.0)
simplejson (2.3.3)
six (1.10.0)
SocksiPy-branch (1.1)
spotman-client (0.2.0)
SQLAlchemy (1.1.0b1)
thrift (0.9.3)
tornado (4.2)
ujson (1.33)
urlgrabber (3.9.1)
urllib3 (1.16)
Werkzeug (0.11.10)
wheel (0.24.0)
workerpool (0.9.2)
wsgiref (0.1.2)
WTForms (2.1)
How do I renew the QDS account password after it expires and what is the password policy?

Handling QDS Account Password Expiration and Renewal describes how to renew your password.

See also Managing Profile.

What are the browser requirements for QDS?

QDS is supported on the latest versions of Google Chrome and Mozilla Firefox. Safari is not officially supported, but most QDS functions work reasonably well on it.

Why is my Spark application not using Package Management?

If the Spark application is using the existing Spark interpreters, it will use the system Python and R versions. To make your Spark application use Package Management, you must migrate the interpreter property values as described in Migrating Existing Interpreters to use the Package Management.

How do I prevent other users of my QDS account from seeing my commands?

To prevent other users from seeing your commands, go to Control Panel > Manage Roles in the QDS UI, and assign those users a policy for the Commands resource with only the create permission. This allows each user to create commands but prevents them from seeing other users’ commands. For more information, see Resources, Actions, and What they Mean.

Questions about Airflow

The topics that follow provide answers to questions commonly asked about Airflow:

Do I need to provide access to Qubole while registering Airflow datastore in QDS?

No, QDS does not need access to the Airflow datastore.

What does the error "Data store was created successfully but it could not be activated" mean?

This error appears when you register the data store on the Explore UI page. It usually appears because QDS does not have access to the data store. You can safely ignore this error and associate the data store with the Airflow cluster. This is a known limitation that Qubole plans to address in the near future.

How do I put the AUTH_TOKEN into the Qubole Default connection?
  1. From the Airflow page, go to Administration > Connections > qubole_default.
_images/QuboleDefault1.png
  2. Add the API token (AUTH_TOKEN) in the Password text box. (The password is the QDS authentication token for a QDS account user. See Managing Your Accounts for more information on API tokens.) Based on the schedule specified in the DAG, the next run should pick up the token automatically. Try loading the DAG with a different name and schedule to check the behavior.
_images/AuthTool1.png

Note

If you want a task to use a Qubole connection other than qubole_default, create a new connection from the Connections tab and provide its name as the qubole_conn_id value in the task parameters.

Is there any button to run a DAG on Airflow?

There is no button to run a DAG in the Qubole UI, but the Airflow 1.8.2 web server UI provides one.

Can I create a configuration to externally trigger an Airflow DAG?

No, but you can trigger DAGs from the QDS UI using the shell command airflow trigger_dag <DAG>....
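
For example, running the following Shell command on the Airflow cluster through the QDS UI triggers the example DAG mentioned below; this is a hedged sketch, so replace the DAG ID with your own as needed.

# Trigger a run of the qubole_example_operator DAG from a QDS Shell command.
airflow trigger_dag qubole_example_operator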

If there is no connection password, the qubole_example_operator DAG will fail when it is triggered.

See Airflow for more information about using Airflow via the QDS UI.

Why must I reenter the database password/AUTH-token at a cluster restart?

When the data store is set to default, the connection authorization password (the AUTH token) is stored directly in the default data store on the Airflow cluster. When you restart the cluster, you must re-add the password (AUTH token), because the database is deleted while the cluster is offline and the stored password is erased with it.

Questions on Airflow Service Issues

Here is a list of FAQs that are related to Airflow service issues with corresponding solutions.

  1. Which logs do I look up for Airflow cluster startup issues?

    Refer to the Airflow services logs, which are written while the cluster starts up.

  2. Where can I find Airflow services logs?

    The Airflow services are the Scheduler, Webserver, Celery, and RabbitMQ. The service logs are available at /media/ephemeral0/logs/airflow on the cluster node. Since Airflow runs on a single-node cluster, all logs are accessible on that node. These logs are helpful in troubleshooting cluster bring-up and scheduling issues.

  3. What is $AIRFLOW_HOME?

    $AIRFLOW_HOME is a location that contains all configuration files, DAGs, plugins, and task logs. It is an environment variable set to /usr/lib/airflow for all machine users.

  4. Where can I find Airflow Configuration files?

    The configuration file is present at $AIRFLOW_HOME/airflow.cfg.

  5. Where can I find Airflow DAGs?

    DAG files are available in the $AIRFLOW_HOME/dags folder.

  6. Where can I find Airflow task logs?

    Task logs are available in $AIRFLOW_HOME/logs.

  7. Where can I find Airflow plugins?

    Plugins are available in $AIRFLOW_HOME/plugins.

  8. How do I restart Airflow Services?

    You can start, stop, or restart an Airflow service; the commands for each service are given below:

    • Run sudo monit <action> scheduler for the Airflow Scheduler.
    • Run sudo monit <action> webserver for the Airflow Webserver.
    • Run sudo monit <action> worker for the Celery workers. A stop operation gracefully shuts down the existing workers. A start operation adds the number of workers specified in the configuration. A restart operation gracefully shuts down the existing workers and then adds the configured number of workers.
    • Run sudo monit <action> rabbitmq for RabbitMQ.
  9. How do I invoke Airflow CLI commands within the node?

    Airflow is installed inside a virtual environment at the location specified in the environment variable AIRFLOW_VIRTUALENV_LOC. First, activate the virtual environment using the following script:

    source ${AIRFLOW_HOME}/airflow/qubole_assembly/scripts/virtualenv.sh activate
    

    After you activate the virtual environment, run the Airflow command.
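
    For example (a hedged sketch; the available subcommands depend on the Airflow version installed on the cluster):

    # List the DAGs known to this Airflow installation.
    airflow list_dags

    # Show the installed Airflow version.
    airflow version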

Deleting a DAG on an Airflow Cluster

You can delete a DAG on an Airflow Cluster from the Airflow Web Server.

Before you delete a DAG, ensure that it is either in the Off state or has no active DAG runs. If the DAG has any active runs pending, mark all tasks under those DAG runs as completed.

  1. From the Clusters page, click the Resources drop-down list for the Airflow cluster and select Airflow Web Server. The Airflow Web Server is displayed as shown in the illustration.
_images/delete-dag.png
  2. Click the DAGs tab to view the list of DAGs.
  3. Click the delete button under the Links column for the required DAG.
  4. Click OK to confirm.

By default, it takes 5 minutes for the deleted DAG to disappear from the UI. You can modify this time limit by configuring the scheduler.dag_dir_list_interval setting in the airflow.cfg file. This time limit does not apply to sub-DAG operators.
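
For example, the setting lives in the [scheduler] section of airflow.cfg; the value shown below assumes the default of 300 seconds (5 minutes) and is only an illustration.

[scheduler]
# Interval (in seconds) at which the DAGs folder is rescanned; this controls
# how quickly a deleted DAG disappears from the UI.
dag_dir_list_interval = 300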

Note

It is recommended not to decrease the time limit value substantially because it might lead to high CPU usage.

Questions about Hive

The topics that follow provide answers to common questions about Hive in Qubole:

What version of Hive does Qubole provide?

See QDS Components: Supported Versions and Cloud Platforms.

How can I create a Hive table to access data in object storage?

To analyze data in object storage using Hive, define a Hive table over the object store directories. This can be done with a Hive DDL statement.

Use the Explore page to explore data in object storage and define Hive tables over it. See Exploring Data in the Cloud for more information.

For MapReduce jobs you can input directories through command line options.

What is the difference between an external table and a managed table?

The main difference is that when you drop an external table, the underlying data files stay intact. This is because the user is expected to manage the data files and directories. With a managed table, the underlying directories and data get wiped out when the table is dropped.

How different is a Qubole Hive Session from the Open Source Hive Session?

A Qubole Hive session spans all of a user’s activity in QDS: when a user logs in, the commands that user executed earlier (shown in the query history) are part of the same Qubole Hive session.

In open-source Hive, each Hive CLI represents a Hive session. Session-level commands and properties are not effective across open-source Hive sessions.

The differences in how commands behave in a Qubole Hive session and an open-source Hive session are described below.

Note

When you run set hive.on.master and set hive.use.hs2 in the corresponding Qubole Hive session, the two commands are not saved automatically.

Session-Level Commands:

set <key>=<value>;
add FILES <filepath> <filepath>*;
add JARS <filepath> <filepath>*;
add ARCHIVES <filepath> <filepath>*;

Qubole Hive Session: all four commands are effective throughout the entire Qubole Hive session, and the commands are automatically saved.

Open-source Hive Session: all four commands are valid only in the corresponding Hive session.
How can I create a table in HDFS?

A CREATE TABLE statement in QDS creates a managed table in Cloud storage. To create a table in HDFS to hold intermediate data, use CREATE TMP TABLE or CREATE TEMPORARY TABLE. Remember that HDFS in QDS is ephemeral and the data is destroyed when the cluster is shut down; use HDFS only for intermediate outputs.

You can use either TMP or TEMPORARY when creating temporary tables in QDS. TEMPORARY tables are specific to a single command run through the Analyze (or Workbench) UI or the REST API. You cannot use a temporary table created in one command in a different command (with a different command ID). How different is a Qubole Hive Session from the Open Source Hive Session? describes the differences.

CREATE TMP TABLE is Qubole’s custom extension and is not part of Apache Hive. The differences are as follows:

Note

Qubole does not support TMP tables starting with Hive 3.1.1 (beta); Qubole recommends using TEMPORARY tables instead. You can create TMP tables only in the default database.

CREATE TMP TABLE (implemented by Qubole; supported only by QDS):

  • Metadata: stored in the Hive metastore.
  • Table storage: HDFS.
  • Life of table: QDS user session.
  • Table clean-up: when the QDS cluster is terminated or the QDS user session ends.
  • Advantages: can be shared across clusters, users, and multiple query records (because the metadata is in the Hive metastore).
  • Disadvantages: heavier clean-up (traversing the metastore); more disk capacity needed in HDFS because clean-up is less frequent.
  • Recommended if the temporary table is expected to live across multiple QDS query history-records.

CREATE TEMPORARY TABLE (implemented by open-source Hive; see this document and the OSS Hive Wiki for details):

  • Metadata: lives only in memory.
  • Table storage: HDFS.
  • Life of table: Hive user session.
  • Table clean-up: when the Hive user session ends.
  • Advantages: short-lived, quicker clean-up.
  • Disadvantages: available only in the Hive user session; does not support indexes, partitions, and so on.
  • Recommended if the temporary table is needed only in one query history-record.

A query history-record is a single row under the History tab of the QDS Workbench page.

What file formats does Qubole’s Hive support out of the box?

Qubole’s Hive supports:

  • Text files (compressed or uncompressed) containing delimited, CSV or JSON data.
  • Binary data files stored in RCFile and SequenceFile formats, containing data serialized as binary JSON, Avro, ProtoBuf, and other binary formats.

Custom File Formats (InputFormats) and Deserialization libraries (SerDe) can be added to Qubole. Please contact Qubole Support if you have a query in this regard.

What is the default InputFormat used by Qubole’s Hive?

Qubole Hive uses CombineHiveInputFormat by default and treats the underlying files as text files. During table definition, users can indicate that the files are of an alternative format (such as SequenceFile or RCFile).

Does Qubole remember my tables even when my cluster goes away?

For every account in Qubole, a persistent MySQL-backed Hive metastore is created automatically. All table definitions are stored in this metastore, so table metadata is available across cluster restarts.

(Note that TMP tables are an exception, as they are automatically deleted at the end of a user session.)

Can I use Excel/Tableau/BI tools on top of Qubole’s Hive tables?

Qubole provides an ODBC driver that you can download and install on a Microsoft Windows server. This allows Excel, Tableau and other BI tools to talk to Qubole’s Hive service.

Can I plug in my own UDFs and SerDes?

Yes, you can use your own user-defined functions (UDFs) and Serializers/Deserializers (SerDes). However, because Qubole is a multi-tenant service, it must approve such requests on a per-account basis. Create a ticket with Qubole Support if there is a data format that is not supported or if you want to use custom UDFs.

How do I handle the exception displayed with a Tez Query while querying a JDBC table?

You may see the following exception message when querying a JDBC table using Tez.

java.io.IOException: InputFormatWrapper can not support RecordReaders that don't return same key & value objects.

The exception occurs because the JDBC storage handler does not work when input splits grouping is enabled in Hive-on-Tez. HiveInputFormat is enabled by default in Tez to support splits grouping.

You can avoid the issue by setting the input format to CombineHiveInputFormat with the following command, which disables splits grouping:

set hive.tez.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
What are the unsupported features in Qubole Hive?

Qubole Hive does not support:

  • ACID transactions. However, Hive 3.1.1 (beta) supports ACID transactions.
  • The LOAD file command
  • LLAP

Questions about QDS Clusters

The topics that follow provide answers to questions commonly asked about QDS clusters:

How long does a Qubole Hadoop Cluster take to come up?

It can take up to a few minutes for a cluster to come up.

In whose account are clusters launched?

QDS launches clusters in your Cloud account, using your storage and compute credentials.

When are clusters brought up and shut down?

QDS clusters are ephemeral, though of course the data persists in your Cloud storage.

When you submit a query that requires a cluster, QDS brings up the cluster automatically if a suitable one is not already running, and uses it for subsequent queries, if any. Once the cluster has been idle for some time, QDS brings it down.

You can also bring up and shut down a cluster manually.

When are clusters autoscaled?

See Autoscaling in Qubole Clusters.

Should I have one large autoscaling cluster or multiple smaller clusters?

This is a common question from QDS users. Note that all clusters under an account share the Hive metastore and the storage credentials used to access data in Cloud storage, so the same set of commands and queries can be run regardless of the cluster configuration. Some trade-offs are listed below:

  • Sharing a single large autoscaling cluster allows efficient usage of compute resources. For example:

    • An application, App1, provisions a cluster but finishes with 30 minutes left on the clock before the cluster reaches the billing-hour mark and is terminated.
    • Another application, App2, now needs to be run. If App2 uses the same cluster, it can use paid-for compute resources that would otherwise go to waste.

    A careful Fair Scheduler configuration on a shared cluster can provide responsive behavior even when there are multiple users.

  • Multiple clusters allow configurations optimized for different workloads

    This is part of the reason why QDS has different types of clusters for different engines (such as Hadoop and Spark). Memory-intensive applications benefit from high-memory instances, while compute-intensive applications benefit from a different instance type.

    Using multiple clusters is better if, for example, different data sets reside in different regions. It is better in that case to run multiple clusters, each closer to where the data resides.

  • Multiple clusters lead to higher isolation

    Although QDS uses frameworks such as the Hadoop Fair Scheduler to arbitrate access to a common cluster, contention can be avoided with the use of multiple clusters. This is an issue if there are production jobs with stringent SLAs. Running them on a separate cluster is always safe but expensive.

  • Efficiency gains from a shared cluster depend on the type of job

    For example, a small job that runs every 15 minutes does not see much gain by sharing compute resources with larger bursty jobs. As such jobs are also often SLA-driven, it is better to run them on a different cluster.

Why is my cluster scaling beyond the configured maximum number of nodes?

When you use multiple worker node types to configure a heterogeneous cluster, autoscaling can cause the actual number of nodes running in the cluster to exceed the configured Maximum Worker Nodes. This is because the goal of autoscaling is to ensure that the cluster’s overall capacity meets (but does not exceed) the needs of the current workload. In a homogeneous cluster, in which worker nodes are of only one instance type, capacity is simply the result of the number of nodes times the memory and cores of the configured instance type. But in a heterogeneous cluster, a given capacity can be achieved by more than one mix of instance types, including some mixes in which the total number of nodes exceeds the configured Maximum Worker Nodes. But the cluster will never exceed the configured maximum capacity, which QDS computes from the capacity of the primary instance type times the worker-node maximum you configure.

The QDS UI uses the term normalized nodes to show the number of nodes that would be running if they were all of the primary instance type. The number of normalized nodes running will never exceed the configured Maximum Worker Nodes.

Will HDFS be affected by cluster autoscaling?

No. Qubole always removes nodes from HDFS gracefully before terminating them. This allows HDFS to safely replicate data to surviving nodes of a cluster and as a result data stored in HDFS is not lost. See Downscaling.

However, HDFS lasts only for the lifetime of a cluster and its contents are lost when the cluster is terminated. Hence we recommend using HDFS only as a temporary data store for intermediate data output by jobs and queries (for example, MapReduce shuffle data).

Are files from a Hadoop archive extracted to a specific folder by default?

When you extract files from a Hadoop archive, by default, the files are extracted to a folder with a name that matches the Hadoop archive’s name exactly. For example, files from a user/zoo/test.tar archive are extracted into a user/zoo/test.tar folder.

How do I check if a node is a coordinator node or a worker node?

Use either of these two methods:

  • Run the nodeinfo command as described in Run Utility Commands in a Cluster.

  • Add the following code in the node bootstrap script:

    #!/bin/bash

    source /usr/lib/hustler/bin/qubole-bash-lib.sh
    is_master=`nodeinfo is_master`
    if [[ "$is_master" == "1" ]]; then
        # your coordinator-node code goes here
    else
        # your worker-node code goes here
    fi
    
Does Qubole store any data?

See Does Qubole store any of my data on its machines?.

Can I submit Hive Commands to a Spark Cluster and is it supported?

You can submit Hive commands to a Spark cluster and expect them to run correctly. However, Qubole does not recommend this scenario, in which multiple engines run on the same cluster.

In this specific case, Hive (using Tez or MapReduce execution engine) and Spark target different use cases. So the Hadoop cluster configurations optimized for each use case could be unrelated to each other.

QDS does not block this behavior, out of consideration for convenience and experimentation. For example, you may submit Hive commands for ad hoc purposes such as DDL or schema changes.

Questions about GCP

The topics that follow provide answers to questions commonly asked about using QDS with Google Cloud Platform (GCP):

What GCP regions and zones are supported by Qubole?
GCP Regions and Zones

The following table shows supported GCP regions and zones:

Region Zones Location
asia-east1 a,b,c Changhua County, Taiwan
asia-east2 a,b,c Hong Kong
asia-northeast1 a,b,c Tokyo, Japan
asia-northeast2 a,b,c Osaka, Japan
asia-south1 a,b,c Mumbai, India
asia-southeast1 a,b,c Jurong West, Singapore
australia-southeast1 a,b,c Sydney, Australia
europe-north1 a,b,c Hamina, Finland
europe-west1 b,c,d St. Ghislain, Belgium
europe-west2 a,b,c London, England, UK
europe-west3 a,b,c Frankfurt, Germany
europe-west4 a,b,c Eemshaven, Netherlands
europe-west6 a,b,c Zürich, Switzerland
northamerica-northeast1 a,b,c Montréal, Québec, Canada
southamerica-east1 a,b,c São Paulo, Brazil
us-central1 a,b,c,f Council Bluffs, Iowa, USA
us-east1 b,c,d Moncks Corner, South Carolina, USA
us-east4 a,b,c Ashburn, Northern Virginia, USA
us-west1 a,b,c The Dalles, Oregon, USA
us-west2 a,b,c Los Angeles, California, USA

To verify currently available GCP regions and zones, see the QDS UI for cluster properties: Clusters > Create New Cluster > Advanced Configuration > GCP SETTINGS.

_images/gcp-supported-regions.png

Questions about Security

The topics that follow provide answers to common questions about Qubole security:

Where do my Hive queries/Hadoop jobs run?
  • All Hadoop commands run on a Hadoop cluster. If no cluster is already running, a Hadoop command brings one up automatically.
  • Hive commands run on QDS machines, launching a Hadoop cluster only to run MapReduce jobs; metadata queries such as “show table” and “recover partitions” don’t bring up a Hadoop cluster.
Does Qubole store any of my data on its machines?

Qubole doesn’t store any of our users’ data. QDS clusters use your credentials for your Cloud storage account, and this is where the results of your jobs and queries are stored. Your credentials are encrypted in our database.

Our HDFS cache, which is ephemeral and used only while running queries, supports encrypted storage. We may temporarily cache some query results, but these are destroyed when the node is decommissioned.

Questions about Package Management

The topics that follow provide answers to common questions about the Qubole Package Management feature:

How does Qubole Package Management pick a library version? Is it dependent on the Conda version or does it pick the latest version?

QDS Package Management (PM) only does a conda install <package>. If this fails, then it does a pip install <package>. If you specify a version of the package while adding it, PM installs that version. If you do not specify a version, Conda uses the default and conda-forge channels (in that priority order, with channel_priority set to flexible) to resolve and download the package version that is compatible with the Conda environment.

Starting from R59, you can customize the channels. For more information, see Modifying Channels.

Note

PM uses the R channel to install the R packages.

Using the Default Package Management UI describes how to use the Package Management to create an environment and add or remove packages using the UI. Add Packages to the Package Management Environment describes how to add packages in a PM environment using the API.

Does the Qubole Package Management install dependencies of a package?

QDS Package Management installs all dependencies of a package. If it cannot install the dependencies, installation of that package fails.

Does the Qubole Package Management upgrade the underlying dependent libraries if they are already installed?

QDS Package Management (PM) only installs the dependent libraries during the package’s installation. PM does not upgrade the dependent libraries automatically.

How do I install packages that are not available in a Package Management environment?

If you want to use a specific package which is unavailable in a Package Management environment, create a ticket with Qubole Support to get an alternative solution to install the missing packages.

See also: Getting Started and Qubole Product Suite.

Connectivity Options

Qubole provides a JDBC driver and supports integration with Business Intelligence tools. The following topics describe the driver and the integration tools that Qubole supports:

Drivers

Qubole provides a custom JDBC driver. Qubole recommends using this JDBC driver on QDS, rather than the open-source versions.

The following topics provide download and installation information:

JDBC Driver

Qubole provides its own JDBC driver for Hive, Presto, and Spark. The Qubole JDBC jar can also be added as a Maven dependency; see the example POM.xml with a dependency on the JDBC jar. Add the repositories, group ID, and artifact ID as mentioned in that POM file. Change the version as required, but Qubole recommends using the latest version.

Benefits of the Qubole JDBC Driver

The Qubole driver provides several advantages over the open-source JDBC drivers, as described below:

  • Queries are displayed in the QDS Analyze/Workbench page for a historical view.
  • Users are authenticated through the Qubole account API Token.
  • The Qubole JDBC driver supports cluster lifecycle management (CLCM), that is, cluster start and stop. The cluster does not have to be running all the time: idle clusters are terminated by the Qubole Control Plane, and queries executed through the Qubole JDBC driver bring the cluster up.

The topics below describe how to install and configure the JDBC driver before using it:

Downloading the JDBC Driver

The latest Qubole JDBC driver versions are listed here.

Note

The Qubole JDBC driver supports JDBC API version 4.1, which QDS currently uses.

JDBC Driver Version Release Date Downloadable JAR Location Release Notes
Driver version 2.3.2 24th Feb 2020 https://s3.amazonaws.com/paid-qubole/jdbc/qds-jdbc-2.3.2.jar JDBC Driver Version 2.3.2
Driver version 2.3.1 29th Jan 2020 https://s3.amazonaws.com/paid-qubole/jdbc/qds-jdbc-2.3.1.jar For version 2.3.1 and older, see JDBC Driver Release Notes
Driver version 2.3.0 16th Dec 2019 https://s3.amazonaws.com/paid-qubole/jdbc/qds-jdbc-2.3.0.jar
Driver version 2.2.0 18th Oct 2019 https://s3.amazonaws.com/paid-qubole/jdbc/qds-jdbc-2.2.0.jar

For instructions on using the driver to connect to a QDS cluster, see Connecting to Qubole through the JDBC Driver.

Verifying the JDBC Driver Version

The Downloading the JDBC Driver page always lists the most-recently released version of the JDBC driver. You can get the version from the driver INFO logs. You can also get the version number by using the qds-jdbc-<driver version>.jar file.

Verifying the JDBC Driver Version using the qds-jdbc-<driver version>.jar

To verify on Windows:

  1. Launch the Command Prompt.
  2. Go to the folder that contains qds-jdbc-<driver version>.jar.
  3. Run the following commands:
    1. jar -xvf qds-jdbc-<driver version>.jar META-INF/MANIFEST.MF
    2. more "META-INF/MANIFEST.MF" | findstr "Implementation-Version"

To verify on Linux and Mac:

  1. Launch the Terminal.

  2. Go to the folder that contains qds-jdbc-<driver version>.jar.

  3. Run the following commands:

    jar -xvf qds-jdbc-<driver version>.jar META-INF/MANIFEST.MF
    grep Implementation-Version META-INF/MANIFEST.MF
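
If you prefer to check the version programmatically, the sketch below reads the Implementation-Version attribute from the jar's manifest using the standard java.util.jar API; the jar file name is a placeholder for your downloaded qds-jdbc-<driver version>.jar.

import java.util.jar.JarFile;
import java.util.jar.Manifest;

public class JdbcDriverVersion {
    public static void main(String[] args) throws Exception {
        // Path to the downloaded driver jar (placeholder; adjust as needed).
        try (JarFile jar = new JarFile("qds-jdbc-2.3.2.jar")) {
            Manifest manifest = jar.getManifest();
            // Implementation-Version is the same attribute the commands above look for.
            String version = manifest.getMainAttributes().getValue("Implementation-Version");
            System.out.println("Implementation-Version: " + version);
        }
    }
}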

Setting the JDBC Connection String

You must set the JDBC connection string for Hive, Presto, and Spark queries on the JDBC driver 2.3.2 and older versions.

Setting the Connection String for Hive and Presto Queries (AWS and Azure)

Use the following syntax to set the JDBC connection string for Hive and Presto queries.

jdbc:qubole://<hive/presto/sqlcommand/spark>/<Cluster-Label>[/<database>][?propertyName1=propertyValue1[;propertyName2=propertyValue2]...]

In the connection string, <hive/presto/spark> (command type) and the cluster label are mandatory; database name and property name/value are optional.

Note

If you do not specify a database, then in the query, specify either the database or fully-qualified table names.

An example of a connection string for Hive query is mentioned below (applicable to JDBC driver 2.3.2 and older versions).

jdbc:qubole://hive/default/tpcds_orc_500?endpoint=https://api.qubole.com;chunk_size=86

In the above example, https://api.qubole.com is one of the QDS endpoints on AWS. For a list of supported endpoints, see Supported Qubole Endpoints on Google Cloud Platform (GCP).

Connection String Properties for JDBC Driver

Note

Parameters marked in bold below are mandatory. Others are optional and have default values.

Property Name Property Description
password

You can set the account API token as the password, as in password=<API token>. See Managing Your Accounts for how to get the API token from the Control Panel UI of the account.

Warning

Qubole strongly recommends not putting the password in the JDBC connection string, because the client tool that uses the string to connect to Qubole may expose it. As a safer alternative, use the interface that the client tool provides to enter the password.

endpoint The endpoint property is required for all endpoints except https://api.qubole.com. You must specify the API endpoint for other QDS-on-AWS endpoints and for other Cloud providers. For the list, see Supported Qubole Endpoints on Google Cloud Platform (GCP).
chunk_size The chunk size in MB, used when streaming large results from Cloud storage. The default value is 100 MB. Reduce the default value if you face out-of-memory (OOM) issues.
catalog_name Add this property and enter the catalog’s name as its value.
skip_parsing Set this property to true to allow the driver to skip parsing the query and directly send it to QDS.
stream_results

It enables the Presto FastStreaming feature, which streams results directly from Cloud Object Storage in the JDBC driver. The streaming behavior can improve BI tool performance because results are displayed as soon as they are available in Cloud Object Storage. Presto FastStreaming for the JDBC driver is supported in Presto versions 0.193 and 0.208, and is applicable only to QDS-on-AWS and Qubole-on-GCP. Because streaming cannot be used with Presto Smart Query Retry, the Presto FastStreaming feature automatically disables Presto Smart Query Retry.

Create a ticket with Qubole Support to enable the Presto FastStreaming feature on the account.

useS3 Set this property to make the JDBC driver bypass the QDS Control Plane and download results directly from Cloud Object Storage. It is set to true by default. By default, for each QDS account, the result file size limit is 20 MB. If the result set size is more than 20 MB, the driver downloads results directly from Cloud Object Storage regardless of the useS3 property's value. If you want to increase this limit, create a ticket with Qubole Support.
Additional Properties (Optional)

In addition, you can:

Setting the Connection String for Spark Queries

Use the following syntax to set the JDBC connection string for Spark queries.

jdbc:qubole://spark/<Cluster-Label>/<app-id>[/<database>][?propertyName1=propertyValue1[;propertyName2=propertyValue2]...]

For example:

jdbc:qubole://spark/spark-cluster/85/my-sql?endpoint=https://us.qubole.com;chunk_size=86;password=<API token>;useS3=true

Warning

Qubole strongly recommends not putting the password in the JDBC connection string, because the client tool that uses the string to connect to Qubole may expose it. As a safer alternative, use the interface that the client tool provides to enter the password.

Note

Create an App with the configuration parameter zeppelin.spark.maxResult=<A VERY BIG VALUE>. The driver can return only the configured maximum number of result rows.

In the connection string, spark (command type) and the cluster label are mandatory; database name and property name/value are optional.

Note

If you do not specify a database, then in the query, specify either the database or fully-qualified table names.

Specifying app-id is mandatory. An app is the main abstraction in the Spark Job Server API. It is used to store the configuration for a Spark application. Creating an app returns an app-id. (You can get the app id using GET API http://api.qubole.com/api/v1.2/apps). See Understanding the Spark Job Server for more information.
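
For illustration, here is a minimal sketch that fetches the list of apps over HTTP in order to find an app-id. It assumes the X-AUTH-TOKEN header used by the QDS REST API; the endpoint and the QUBOLE_API_TOKEN environment variable are placeholders for your own QDS environment and API token.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ListApps {
    public static void main(String[] args) throws Exception {
        // Placeholder environment variable holding your account API token.
        String apiToken = System.getenv("QUBOLE_API_TOKEN");
        HttpClient client = HttpClient.newHttpClient();
        // Replace the endpoint with the one for your QDS environment.
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://gcp.qubole.com/api/v1.2/apps"))
                .header("X-AUTH-TOKEN", apiToken)
                .header("Accept", "application/json")
                .GET()
                .build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        // The JSON response lists the apps and their ids.
        System.out.println(response.body());
    }
}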

Connecting to Qubole through the JDBC Driver

To connect to Qubole through JDBC, perform the following steps:

Note

In JDBC and ODBC driver configurations, https://api.qubole.com is the default endpoint.

  • Register the driver by specifying the driver class name: com.qubole.jdbc.jdbc41.core.QDriver.
  • The URL is the connection string for the query. See Setting the JDBC Connection String for more information.
  • You can leave the username empty.
  • Use the account API token as the password. See Managing Your Accounts on how to get the API token from the Control Panel UI of the account.

Here is a Java code sample for connecting to Qubole.
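
The following is a minimal sketch rather than a complete application; it assumes the com.qubole.jdbc.jdbc41.core.QDriver class and the connection-string format described above. The cluster label (default), schema (tpcds_orc_500), endpoint, and the QUBOLE_API_TOKEN environment variable are placeholders; replace them with your own values.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class QuboleJdbcExample {
    public static void main(String[] args) throws Exception {
        // Register the Qubole JDBC driver class.
        Class.forName("com.qubole.jdbc.jdbc41.core.QDriver");

        // Connection string: command type, cluster label, optional database and properties.
        String url = "jdbc:qubole://hive/default/tpcds_orc_500?endpoint=https://gcp.qubole.com";

        // Leave the username empty and pass the account API token as the password,
        // rather than embedding it in the connection string (see the Warning above).
        String apiToken = System.getenv("QUBOLE_API_TOKEN"); // placeholder environment variable

        try (Connection conn = DriverManager.getConnection(url, "", apiToken);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SHOW TABLES")) {
            while (rs.next()) {
                System.out.println(rs.getString(1));
            }
        }
    }
}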

Qubole allows you to bypass the QDS Control Plane to fetch results from the Cloud Object Storage by setting useS3=true.

Enabling Logging

Use the logs to debug a JDBC application issue. By default, JDBC logging is disabled. Enable it to get the logs.

To enable JDBC logging, add the following properties to the JDBC connection string:

  • LogLevel: It can have one of these values: 1, 2, 3, 4, 5, 6, 7, where

    • 1 implies LOG_OFF that disables all logging.
    • 2 implies LOG_FATAL that logs very severe error events that might lead the driver to abort.
    • 3 implies LOG_ERROR that logs error events that might still allow the driver to continue running.
    • 4 implies LOG_WARNING that logs potentially harmful situations.
    • 5 implies LOG_INFO that logs general information that describes the driver’s progress.
    • 6 implies LOG_DEBUG that logs detailed information that is useful for debugging the driver.
    • 7 implies LOG_TRACE that logs more detailed information than the LOG_DEBUG level.
  • LogPath: It is the log file location to which you must have write access.

    For example, this is a log path: jdbc:qubole://hive/default?LogLevel=6;LogPath=C:\\Users\\User\\Desktop

Note

Increasing the verbosity of LogLevel creates a larger log file. For example, LogLevel=6 for a long-running session can generate a 2GB-size log file.

Logs are printed in a file named QuboleJDBC_Driver.log.

Enabling the Proxy Connection

You can enable the proxy connection if you want to use a proxy server for connections. To enable it, add the following properties to the connection string.

proxy=true
proxyhost=<proxy server ip>
proxyport=<proxy server port value> (has to be an integer value)
proxyusername= <username for proxy authentication> (optional)
proxypassword= <password for proxy authentication> (optional)
proxytype=<http or socks> (optional - default value is http)

An example proxy connection would be:

jdbc:qubole://hive/hadoop2?proxy=true;proxyhost=10.0.0.1;proxyport=8080;proxyusername=qubole;proxypassword=qubole

Tools

Qubole has partnered with multiple BI solution vendors, data warehousing providers, and ETL solution providers, such as Looker and Tableau Software, to provide first-class integrations with these tools. This section covers partner integration with the following tools:

Looker

This integration guide describes how to configure Looker for use with the Qubole Data Service. Looker integrates with Qubole Presto using the Qubole JDBC driver, as shown in the diagram below.

_images/LookerQuboleIntegration.png
Configuring Looker with QDS
Configure Looker

Follow the instructions below to configure Looker:

  1. After your Looker instance or server is ready, log in to your Looker server.

  2. Click Admin and the General Settings page appears.

    _images/LookerAdmin.png
  3. From the navigation pane on the left, click Connections under the Database section. The Connections page appears.

    _images/DatabaseConnections.png
  4. On the Connections page, click the New Connection button at the top-left of the page. The Connection Settings page appears.

    _images/NewDBConnection.png
    • For the new connection:

      1. Enter a name for the connection in the Name field.

      2. Select the appropriate Qubole Presto cluster version from the Dialect dropdown list.

        Note

        To know the Qubole Presto cluster version, navigate to Clusters page and click Edit against the Presto cluster that you want to use. You can now check the Presto Version in the Configurations tab.

      3. Skip the Serverless Qubole check box, as Serverless Qubole is currently unavailable.

      4. Enter the Qubole cluster label name in the Cluster Label field.

      5. Enter hive in the Database field.

      6. Enter the API token in the API key field. You can find the API Token from QDS (Control Panel > My Accounts > click Show under the API Token column and copy it). To know more, see Managing Your Accounts.

      7. Enter the database or schema name from Qubole in the Schema field (for example, tpcds_orc_1000).

        Note

        In JDBC and ODBC driver configurations, https://api.qubole.com is the default endpoint.

      8. If the Qubole account is on some other Qubole environment or URL, enter the Additional Params with endpoint=<Qubole URL>. The endpoint represents the Qubole environment that you will connect to.

        _images/AdditionalParams.png

        Note

        To save the tables that Looker fetches from Qubole, select the Persistent Derived Tables check box in Looker and fill in the Temp Database.

        _images/PersistentDerivedTables.png
  5. Click Test These Settings to verify the connection. A Testing… message appears while Looker tries to run a query on Qubole to verify the connection.

You have successfully completed the configuration and a success message appears.

_images/ConnectionSuccess.png
Troubleshooting Qubole-Looker Integration Issues

Symptom: Connection refused

_images/CommonConnectionError.png

Cause:

  • API Token is wrong
  • User account is not on the given environment and you have not added endpoint=<Qubole_URL> in the Additional Params field.

Remedy:

  • Verify the correctness of the API token.
  • Add the Qubole environment in the Additional Params field. For more information, see step 4 of Configure Looker.
Using Qubole through Looker

After the connection between Qubole and Looker is established, you can use it to create dashboards in Looker. First view the tables in the schema available in Qubole, and then create a dashboard from one of the tables in the schema.

  1. View Tables in the Schema
  2. Create a Dashboard on Looker
View Tables in the Schema

Follow the instructions below to view tables in the schema:

  1. On Looker, click Admin. In the Admin tab, click Connections under Database, which is on the left side.

  2. For the connection that you created in Configure Looker, click the settings (gear) icon.

    _images/ConnectionSettings.png
  3. From the Settings menu, select Explore. This runs queries on Qubole's Presto cluster to get the table information for the schema (provided in step 4 of Configure Looker). Expect some delay in loading the Explore page, depending on the number of queries to run; if the cluster is not running, additional time is required to start it.

    _images/QueryExecution.png
  4. Click the tables that you want to use in Looker and it redirects you to the Explore page for the selected table.

Create a Dashboard on Looker

Follow the instructions below to create a dashboard on Looker:

  1. Select the required field on the left side of the scroll bar in the Panel and click the Filter tag beside it. In the SQL sub-section under Data in Looker, you can see the SQL generated from the selected filters.

    _images/LookerDashboard.png
  2. Click Run. It starts running a query on Qubole and it can be seen on the Analyze page of Qubole.

    _images/PrestoQueryonAnalyze.png

You have successfully created a Dashboard on Looker. Under the Visualizations and Data sections on Looker UI, the results are populated for the selected columns.

Tableau

This guide describes the steps to configure Tableau for use with the Qubole Data Service.

To define the connectivity between Tableau and Qubole, you must specify the Qubole API Token (which Qubole uses to authenticate access), the cluster name (cluster label), and the endpoint (the Qubole platform in the cloud provider where the customer has its Qubole account). The first time a query, dashboard, or report is run, Qubole authenticates the Tableau user and starts the big data engine (Presto or Hive) that Tableau requires, if it is not already running. After that, Tableau sends SQL commands through the ODBC/JDBC driver, which Qubole passes to the right cluster. Qubole manages and runs the cluster with the right number of nodes and only for the required time, saving users up to 40% in infrastructure cost.

_images/TableauQuboleIntegration.png
Support Matrix

Tableau users connect to Qubole using Qubole's ODBC/JDBC drivers and use the Hive or Presto engines to analyze their data. The Qubole-Tableau integration support matrix is given below:

Qubole Engine Tableau Version Recommended Qubole Driver Recommended Qubole Connector for Tableau
Presto 2019.1, 2019.2, 2019.3 Qubole JDBC Driver Custom Qubole Presto Connector (JDBC)
Presto 2019.4, 2020.1, 2020.2 Qubole ODBC Driver Inbuilt Qubole Presto Connector (ODBC)
Presto 2020.3 or later Qubole JDBC Driver Inbuilt Qubole Presto Connector (JDBC)
Hive 2019.4 or later Qubole JDBC Driver Qubole Hive Connector (JDBC)

Note

If you are on a Tableau version prior to 2019.1 or using Spark as the query engine, contact Qubole Support.

Using Qubole Presto Connector

There are two types of Qubole Presto Connectors available for Tableau: Custom Qubole Presto Connector (JDBC) and Inbuilt Qubole Presto Connector (ODBC). Custom Qubole Presto Connector (JDBC) builds on top of JDBC and works with Tableau versions 2019.1 or later. Tableau version 2019.4 (or later) comes with an inbuilt Qubole Presto Connector based on ODBC.

Note

Qubole ODBC driver does not support Google Cloud Platform (GCP) and therefore only the Custom Qubole Presto Connector (JDBC) can be used on GCP.

Follow the instructions below to connect to QDS using the Qubole Presto connector:

Custom Qubole Presto Connector (JDBC) - For Tableau Versions 2019.1, 2019.2, and 2019.3

Follow the instructions below to connect to QDS using the Custom Qubole Presto Connector (JDBC):

A. Download Qubole JDBC Driver
  1. Download the latest Qubole JDBC driver (version 2.3 or later). For more information, see JDBC Driver.
  2. Move the driver (jar) file to the following location:
    • For Windows: C:\Program Files\Tableau\Drivers
    • For MAC: ~/Library/Tableau/Drivers
    • For Linux: /opt/tableau/tableau_driver/jdbc
B. Download Custom Qubole Presto Connector (JDBC):

Click here to download the Custom Qubole Presto Connector (JDBC).

C. Plug in the Custom Qubole Presto Connector (JDBC):

Follow the instructions below to plug in the Custom Qubole Presto Connector (JDBC):

  1. After downloading the Custom Qubole Presto Connector (JDBC), run the following command to extract the plugin Qubole-Tableau-Jdbc-Plugin.zip:

    $ unzip Qubole-Tableau-Jdbc-Plugin.zip
    
  2. Create a directory for Tableau connectors in the following location:

    For Tableau Desktop:

    MacOS: /connector

    Windows: C:\connector

    For Tableau Server:

    Linux: /connector

    Windows: C:\connector

    Note

    The location of the connector must be in the root directory. On macOS Catalina version 10.15 or later, you cannot write to the root directory (https://support.apple.com/en-in/HT210650); in that case, create the connector directory in any location where the user has sufficient access to read and execute files.

  3. Copy the qubole_jdbc directory containing your connector's manifest.xml file into the newly created directory.

  4. Run Tableau Desktop with the -DConnectPluginsPath command-line argument pointing to your connector directory, as shown below:

    For MAC:

    /Applications/Tableau\ Desktop\ 2019.1.app/Contents/MacOS/Tableau -DConnectPluginsPath=/connector/
    

    For Windows:

    cd c:\Program Files\Tableau\Tableau 2019.1\bin
    tableau.exe -DConnectPluginsPath=C:\connector
    
  5. For Tableau Server: Run the Tableau Server Manager (TSM) with the following option to make it available for publishing:

    For Linux/Windows:

    tsm configuration set -k native_api.connect_plugins_path -v <path_to_connector> --force-keys
    tsm pending-changes  apply
    
    This operation will perform a server restart. Are you sure you wish to continue?
    (y/n): y
    Starting deployments asynchronous job.
    

You have successfully plugged in the connector.

Note

  • If you specify an incorrect <path-to-directory> that does not contain the plugin code, you will see the following errors:

    For Tableau Desktop: The Custom Qubole Presto Connector with the name “Qubole Presto” will not appear in the list of connectors when you start Tableau.

    For Tableau Server: Once you publish a report from Tableau Desktop, a prompt appears with the error: “Tableau doesn’t recognize the data source type qubole_jdbc”.

D. Connect to Qubole Data Service (QDS):

Follow the instructions below to connect to QDS:

  1. After opening Tableau, navigate to the Connect menu. Under To a Server, select More….

  2. Click Qubole Presto (JDBC) and the Qubole Presto (JDBC) dialog box appears.

    _images/qubole_jdbc.png
  3. Enter the EndPoint, API Token, Catalog, and Cluster Label in their respective fields.

    _images/qubole_presto_jdbc.png

    Note

    • API Token: You can find the Auth Token or Password from QDS (Control Panel > My Accounts > click Show under the API Token column and copy it). To know more, see Managing Your Accounts. If you want to add additional JDBC parameters in the EndPoint field, use semicolon (;) to separate them (Example: https://api.qubole.com;LogLevel=6;LogPath=/tmp).
    • EndPoint: Mention the EndPoint based on the region where you have the Qubole account. For more information, see Supported Qubole Endpoints on Google Cloud Platform (GCP). The EndPoint doesn’t require suffixes such as /api/xxx. For example, the EndPoint can be https://in.qubole.com.
  4. Select the Read uncommitted data check box to enable streaming results.

  5. Click Sign In to fire a query on the Qubole portal.

You have successfully connected to Qubole Data Service (QDS) via Qubole Presto Connector (JDBC).

Note

  • Remove all the .tdc files (if any) from Documents/My\ Tableau\ Repository/Datasources.

  • You should take the following precautions with Custom Qubole Presto Connector (JDBC):

    • Open other sessions of Tableau from the terminal to use the Custom Qubole Presto Connector (JDBC).
    • If you open new workbooks from the Tableau UI, the subsequent sessions don’t display the Custom Qubole Presto Connector (JDBC). To avoid this, launch new sessions through the command-line parameters as instructed above.
Inbuilt Qubole Presto Connector (JDBC) - For Tableau Version 2020.3 or Later

Follow the instructions to configure Tableau using Inbuilt Qubole Presto Connector (JDBC):

A. Download Qubole JDBC Driver
  1. Download the latest Qubole JDBC driver. For more information, see JDBC Driver.
  2. Move the driver (jar) file to the following location:
    • For Windows: C:\Program Files\Tableau\Drivers
    • For Mac: ~/Library/Tableau/Drivers
    • For Linux: /opt/tableau/tableau_driver/jdbc
B. Connect to Qubole Data Service (QDS)

Follow the instructions below to connect to QDS:

  1. After opening Tableau, navigate to the Connect menu. Under To a Server, select More….

  2. Click Qubole Presto and the Qubole Presto dialog box appears.

    _images/tab-menu.png
  3. Enter the End Point, Catalog, Cluster Label, and Password in their respective fields.

    _images/qub_presto.png

    Note

    • You can find the Password from QDS (Control Panel > My Accounts > click Show under the API Token column and copy it). To know more, see Managing Your Accounts.
    • EndPoint: Mention the EndPoint based on the region where you have the Qubole account. For more information, see Supported Qubole Endpoints on Google Cloud Platform (GCP). The EndPoint doesn’t require suffixes such as /api/xxx. For example, the EndPoint can be https://in.qubole.com.
    • To add additional JDBC parameters, append them to the EndPoint field value and use a semicolon (;) to separate them (Example: https://api.qubole.com;qds_bypass=true;LogLevel=6;LogPath=/tmp).
  4. Select the Read uncommitted data check box to enable streaming results.

  5. Click Sign In to fire a query on the Qubole portal.

You have successfully connected to Qubole Data Service (QDS) via Inbuilt Qubole Presto Connector (JDBC).

Using Qubole Hive Connector - for Tableau 2019.4 or later

Follow the instructions below to connect to QDS using the Custom Qubole Hive Connector (JDBC):

A. Download and Install Qubole Hive Connector and Qubole JDBC Driver

Follow the instructions from Tableau Gallery.

B. Connect to Qubole Data Service (QDS):

Follow the instructions below to connect to QDS:

  1. After opening Tableau, navigate to the Connect menu. Under To a Server, select More….

  2. Click Qubole Hive by Qubole and the Qubole Hive by Qubole dialog box appears.

    _images/qubolehivebyqubole.png
  3. Enter the EndPoint, Cluster Label, and Password in their respective fields.

    _images/qubole_hive.png

    Note

    • Password/API Token: You can find the Auth Token or Password from QDS (Control Panel > My Accounts > click Show under the API Token column and copy it). To know more, see Managing Your Accounts.
    • EndPoint: Mention the EndPoint based on the region where you have the Qubole account. For more information, see Supported Qubole Endpoints on Google Cloud Platform (GCP). The EndPoint doesn’t require suffixes such as /api/xxx. For example, the EndPoint can be https://in.qubole.com.
    • To add additional JDBC parameters, append them to the EndPoint field value and use a semicolon (;) to separate them (Example: https://api.qubole.com;qds_bypass=true;LogLevel=6;LogPath=/tmp).
  4. Click Sign In to connect to QDS.

You have successfully connected to Qubole Data Service (QDS) via Qubole Hive Connector (JDBC).

Apache Superset

Apache Superset is a business intelligence tool. Apache Superset uses Qubole SQLAlchemy Toolkit to connect to Qubole Data Service (QDS). Currently, you can use Apache Superset with Qubole Presto and Hive as the backend engines.

_images/apache.png
Prerequisites
  • Java 8 or later
  • Python 3.x
  • Apache Superset 0.35.2 or later

Follow the instructions below to connect to Qubole Data Service (QDS) from Apache Superset:

Configuring Apache Superset with Qubole Data Service (QDS)

Follow the instructions below to configure Apache Superset with Qubole:

  1. Install Apache Superset. For more information on Apache Superset installation, see Apache Superset Installation.

  2. Install and configure SQLAlchemy-Qubole. For more information, see Qubole SQLAlchemy Toolkit.

  3. Restart Apache Superset.

  4. Log in to Apache Superset from a browser. The Dashboards page appears.

    _images/dashboards.png
  5. From the top menu, select Sources > Databases. The Databases page appears.

  6. Click the + icon at the right pane to add a database. The Add Database page appears.

    _images/database.png
  7. Enter a name for the database in the Database field.

  8. Enter SQLAlchemy URI (Example of SQLAlchemy URI for Qubole Presto: qubole+presto://presto/presto_cluster?endpoint=https://api.qubole.com;password=<API-TOKEN>;catalog_name=hive). To know more, see Connecting to QDS Using SQLAlchemy-Qubole Package.

  9. Click Test Connection to verify the connection. If the connection is successfully established, the following success message appears.

    _images/success_msg.png
  10. Click Save.

You have successfully connected to Qubole Data Service (QDS) from Apache Superset.

Note

For more information on Apache Superset use cases and examples, see Apache Superset Tutorials.

Qubole SQLAlchemy Toolkit

SQLAlchemy is a Python SQL toolkit and Object Relational Mapper (ORM) that gives developers the full flexibility and power of SQL. The primary purpose of this integration guide is to showcase a working dialect for Qubole Presto and Hive that can be used in any Python code to connect to Qubole. Some popular BI tools that use Python (and therefore benefit from this integration) include Apache Superset and redash-integration-index. Furthermore, it can also be used as an ORM, which is one of the general use cases of SQLAlchemy.

Qubole SQLAlchemy runs on top of the Qubole JDBC driver as a dialect that bridges QDS Presto/Hive and SQLAlchemy applications.

_images/sqlalchemy.png
Prerequisite
  • Java 8 or later
  • Python 3.x

Follow the instructions below:

Installing SQLAlchemy

Follow the instructions below to install the sqlalchemy-qubole package:

  1. Download the Qubole JDBC driver (version 2.3 or later). For more information, see Downloading the JDBC Driver.

  2. Set the environment variable QUBOLE_JDBC_JAR_PATH pointing to JDBC JAR location with the absolute path (example: export QUBOLE_JDBC_JAR_PATH=/Users/testuser/qubolejdbc/qds-jdbc-2.3.0.jar).

  3. Install sqlalchemy-qubole package. The package is available on PyPI.

    $ pip install sqlalchemy-qubole
    

    Note

    Ensure that pip points to Python 3. You can also use pip3, since this package supports Python 3.x.

After successfully installing the sqlalchemy-qubole package, you can use it in your Python code to run Qubole Presto and Hive queries.

Connecting to QDS Using SQLAlchemy-Qubole Package

Here is an example to connect to QDS and submit a Presto query:

# Qubole presto
from sqlalchemy import create_engine
engine = create_engine('qubole+presto://presto/presto_cluster?endpoint=https://api.qubole.com;password=********;catalog_name=hive')

with engine.connect() as con:
    rs = con.execute('SHOW TABLES')
    for row in rs:
        print(row)

In the create_engine method, enter an SQLAlchemy URI to connect to QDS Presto or Hive. Below are the different Qubole dialects supported by SQLAlchemy:

  • Presto Dialect: qubole+presto://presto/<cluster-label>?endpoint=<env>;password=<API-TOKEN>;catalog_name=hive (Qubole Dialect points to Presto by default.)
  • Hive Dialect: qubole+hive://hive/<cluster-label>?endpoint=<env>;password=<API-TOKEN>

Note

While providing the dialects, replace <cluster-label>, <env> (Qubole environment), and <API-TOKEN> with their respective values.

You can add additional JDBC parameters and separate them with a semicolon (;). For more information on JDBC connection properties, see Setting the JDBC Connection String.

Important

If you don’t have a Qubole account and want to use the above tools, click here to create a Qubole Free Trial account. You can also upgrade your account to Qubole Enterprise Edition which provides actionable Alerts, Insights, and Recommendations to optimize reliability, performance, and costs. To upgrade your account to QDS Enterprise Edition, click here. To set up and configure your Qubole account, see aws-quick-start-guide.

REST API Reference

The following topics describe Qubole REST APIs. You can also use these functions interactively via a GUI; see the User Guide and the Administration Guide.

Overview

The Qubole Data Service (QDS) is accessible via REST APIs.

To write and test applications that use these APIs, you can use any HTTP client in any programming language to interact with Qubole Data Service. The detailed syntax of the calls is described in subsequent sections.

You can also use the QDS Python SDK or the QDS Java SDK to interact with the Qubole Data Service. The Python SDK also provides an easy-to-use command-line interface.

Note

Qubole now supports only HTTPS. All HTTP requests are now redirected to HTTPS. This is aimed at better security for Qubole users.

Access URL

All URLs referenced in the documentation have the following base:

https://gcp.qubole.com/api/${V}/

where ${V} refers to the version of the API. Valid values of ${V} for the GCP APIs are currently v2 for the Account and Cluster APIs, and v1.2 for all other GCP APIs.

API Versions Supported on QDS-on-GCP

The APIs for Qubole on Google Cloud Platform are supported as follows.

  • Only the Cluster API and the Account API are supported on v2. The documentation for these APIs is based on v2.
  • All other APIs are only supported on v1.2. The documentation for these APIs only describes the syntax and sample API calls with respect to v1.2.

cURL is a useful tool for testing out Qubole REST API calls from the command line.

The Qubole API is served over HTTPS and Qubole redirects all HTTP access to HTTPS. Qubole now supports only HTTPS access, which is aimed at better security for Qubole users. When using HTTPS, ensure that the certificates on the client machine are up-to-date.

Authentication

API calls must be authenticated with a Qubole API Token.

Navigate to Control Panel in the QDS UI and click the My Accounts tab on the left pane. Click Show for the account and copy the API token that is displayed.

Set the value of this API token to the AUTH_TOKEN environment variable when running the API examples via curl.
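
The API examples in this reference use curl, but (as noted in the Overview) any HTTP client can be used. Here is a minimal Java sketch of an authenticated call, assuming the AUTH_TOKEN environment variable described above; it calls the get_users endpoint documented under the Account API.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class GetUsers {
    public static void main(String[] args) throws Exception {
        // API token taken from the AUTH_TOKEN environment variable set above.
        String authToken = System.getenv("AUTH_TOKEN");
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://gcp.qubole.com/api/v1.2/accounts/get_users"))
                .header("X-AUTH-TOKEN", authToken)
                .header("Accept", "application/json")
                .GET()
                .build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode());
        System.out.println(response.body());
    }
}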

API Types

The APIs for Qubole on Google Cloud Platform are divided into the following categories:

Account API

These APIs let you create a new account, edit and delete an existing account, view users, and set Hive bootstrap. See Account API for more information.

Apps API

These APIs allow you to create, list, and delete a Spark Job Server App. For more information, see Apps API.

Cluster API

These APIs allow you to create a new cluster, and edit, restart, and delete existing clusters. These APIs also allow you to view the list of clusters in a Qubole account, and add and remove nodes in some clusters. See Cluster API for more information.

Command API

The Command APIs let you submit queries and commands, check the status of commands, retrieve results and logs, or cancel commands. See Command API for more information. The Qubole Data Service currently supports these command types:

Custom Metastore API

These APIs allow you to connect to a custom metastore, and to edit and view the metastore details. For more information, see Custom Metastore API.

DbTap API

A DbTap identifies an external end point for import/export of data from QDS, such as a MySQL instance. The DbTap APIs let you create, view, edit or delete a DbTap. For more information, see DbTap API.

Folder API

This API is mainly used to create and manage Notebook/Dashboard folders. For more information, see Folder API.

Groups API

These APIs allow you to create groups, add/delete users in a group, and assign/unassign roles to a group in a Qubole account. For more information, see Groups API.

Hive Metadata API

These APIs provide a set of read-only views that describe your Hive tables and the metadata. For more information, see Hive Metadata API.

Notebook API

These APIs allow you to create a notebook, clone, configure, run, import, and delete a notebook. For more information, see Notebook API.

Object Policy API

These APIs allow you to create object-level access policies for notebooks and clusters. For more information, see Object Policy API.

Reports API

These APIs let you view aggregated statistical and operational data for your commands. For more information, see Reports API.

Roles API

These APIs allow you to create and delete a role to perform a set of actions in a Qubole account. For more information, see Roles API.

Scheduler API

The Scheduler APIs let you schedule any command or workflow to run at regular intervals. For more information, see Scheduler API.

Sensor API

These APIs allow you to create file and partition sensors to monitor file and Hive partition availability. For more information, see Sensor API.

Users API

These APIs let you view users, invite a user to a Qubole account, and enable/disable users in a Qubole account. For more information, see Users API.

Supported Qubole Endpoints on Google Cloud Platform (GCP)

Qubole provides QDS on the Google Cloud Platform (GCP) through the endpoints shown in this table. APIs for QDS-on-GCP are supported on either api/v2 (Account API and Cluster API) or api/v1.2 (all other APIs), so ${V} = v2 or v1.2 as appropriate.

Endpoint QDS Environment Security Certification
https://gcp.qubole.com/api/${V}/ Use this URL to access the general environment of QDS on the GCP. SOC2 and ISO 27001 compliant
https://gcp-eu.qubole.com/api/${V}/ Use this URL to access the European environment of QDS on GCP. SOC2 and ISO 27001 compliant

Note

Contact Qubole Support if you are unsure which deployment of GCP to use.

APIs for Qubole on Google Cloud Platform

The following topics describe Qubole REST APIs. You can also use these functions interactively via a GUI; see the User Guide and the Administration Guide.

Account API v1.2

View Users of a QDS Account
GET /api/v1.2/accounts/get_users

Use this API to view users of a QDS account.

Required Role

The following users can make this API call:

  • Users who belong to the system-user or system-admin group.
  • Users who belong to a group associated with a role that allows viewing an account’s details. See Managing Groups and Managing Roles for more information.
curl -X GET -H "X-AUTH-TOKEN:<AUTH-TOKEN>" -H "Content-Type: application/json" -H "Accept: application/json" \
"https://gcp.qubole.com/api/v1.2/accounts/get_users"
View Pending Users of a QDS account
GET /api/v1.2/accounts/get_pending_users

Use this API to view pending users of a QDS account, who are awaiting approval. You must have administration privileges to see the list of pending users for a QDS account.

Required Role

The following users can make this API call:

  • Users who belong to the system-admin group.
  • Users who belong to a group associated with a role that allows viewing an account’s details. See Managing Groups and Managing Roles for more information.
curl -X GET -H "X-AUTH-TOKEN:<AUTH-TOKEN>" -H "Content-Type: application/json" -H "Accept: application/json" \
"https://gcp.qubole.com/api/v1.2/accounts/get_pending_users"
View User Emails
GET /api/v1.2/accounts/get_user_emails

Use this API to get the users’ emails associated with the current QDS account.

Required Role

The following users can make this API call:

  • Users who belong to the system-user or system-admin group.
  • Users who belong to a group associated with a role that allows viewing an account’s details. See Managing Groups and Managing Roles for more information.
curl -X GET -H "X-AUTH-TOKEN:<AUTH-TOKEN>" -H "Content-Type: application/json" -H "Accept: application/json" \
"https://gcp.qubole.com/api/v1.2/accounts/get_user_emails"
Suspend and Resume a QDS Account

You can suspend and resume a QDS account through a REST API call.

Suspend a QDS Account
PUT /api/v1.2/accounts/suspend

You can temporarily suspend an account and block access for non-system-admin users. Only system-admins can work on a suspended account.

Required Role

The following users can make this API call:

  • Users who belong to the system-admin group.
  • Users who belong to a group associated with a role that allows suspending an account. See Managing Groups and Managing Roles for more information.
Request API Syntax

Here is the syntax for an API request.

curl -X PUT -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" \
-d '{"id":"<account_id>"}' \ "https://gcp.qubole.com/api/v1.2/accounts/suspend"
Sample Request

Here is a sample request to suspend the account with ID 3002.

curl -X PUT -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" \
-d '{"id":"3002"}' \ "https://gcp.qubole.com/api/v1.2/accounts/suspend"
Resume a QDS Account
PUT /api/v1.2/accounts/resume

A system-admin can resume the suspended account and thus restore access to other users.

Required Role

The following users can make this API call:

  • Users who belong to the system-admin group.
  • Users who belong to a group associated with a role that allows resuming a suspended account. See Managing Groups and Managing Roles for more information.
Request API Syntax

Here is the syntax for an API request.

curl -X PUT -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" \
-d '{"id":"<account_id>"}' \ "https://gcp.qubole.com/api/v1.2/accounts/resume"
Sample Request

Here is a sample request to resume the suspended account with ID 3002.

curl -X PUT -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" \
-d '{"id":"3002"}' \ "https://gcp.qubole.com/api/v1.2/accounts/resume"
Reset Authentication Token
PUT /api/v1.2/accounts/token

Use this API to reset your authentication token. Users with relevant permissions can also use this API to reset the authentication tokens of other users in the account.

Required Roles

Users with the following permissions can use this API to reset the authentication tokens of other users:

  • Users who belong to the system-admin group.
  • A user who has the manage users permission. See Managing Roles for more information.

Note

Users can reset their own authentication token if they have the authentication token permission on accounts.

Parameters

You must pass at least one of the parameters (emails or groups) in the API body. If you pass both parameters, a union of the users present in the emails parameter and the users present in the groups parameter is considered.

Parameter Description
emails

Provide the email ID of the user whose authentication token you want to reset. Use comma-separated values to provide multiple email IDs.

Note

When resetting multiple tokens using the emails parameter, no user token is reset if any of the email IDs passed are invalid or not part of the account.

groups

Provide the name of the group whose authentication token you want to reset. Use comma-separated values to provide multiple groups.

Note

When used for a group, this resets the authentication tokens of all the users present in that group. In case of any errors, no user token is reset.

If your authentication token is successfully reset (by you or by another user), you will receive a confirmation email.

Request API Syntax
curl -i -X PUT -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" \
-d {"emails": "<emails>", "groups": "<groups>"}' \ "https://gcp.qubole.com/api/v1.2/accounts/token"
Sample Requests
Sample Request Using both the Parameters
curl -i -X PUT -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" \
-H 'cache-control: no-cache' -d '{"emails": "test_1@domain.com,test_2@domain.com", "groups": "group_1,group_2"}' \
"https://gcp.qubole.com/api/v1.2/accounts/token"
Sample Request Using only the emails Parameter
curl -i -X PUT -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" \
-H 'cache-control: no-cache' -d '{"emails": "test_1@domain.com"}' \ "https://gcp.qubole.com/api/v1.2/accounts/token"
Sample Request Using only the groups Parameter
curl -i -X PUT -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" \
-H 'cache-control: no-cache' -d '{"groups": "group_1}' \ "https://gcp.qubole.com/api/v1.2/accounts/token"
Sample Response
{
   "message": "Tokens updated successfully"
}
Brand Logo and Documentation
PUT /api/v1.2/accounts/branding

Use this API to configure the logo and documentation links for resellers within the account.

Note

This feature is not enabled by default. To enable it on the QDS account, create a ticket with Qubole Support.

Customise your Account describes how to add the logo through the UI.

Required Role

The following users can make this API call:

  • Users who belong to the system-user or system-admin group.
  • Users who belong to a group associated with a role that allows cloning an account. See Managing Groups and Managing Roles for more information.
Parameters

Note

Parameters marked in bold below are mandatory. Others are optional and have default values.

Parameter Description
account_id

The account ID of the Qubole account that you want to brand. Specify the following branding sub-options:

  • logo:
    • logo_uri: A publicly accessible URI of a logo image in png/gif/svg/jpg format. The image size must be less than 100 KB, and the image must have pixel dimensions of 120px x 48px. This sub-option is mandatory for branding the logo.
  • link: You can add a documentation link. It has these two sub options:
    • link_url: Specify the documentation URL. It is mandatory for adding a documentation link.
    • link_label: Add a label to describe the documentation URL.
Request API Syntax

Here is the syntax for an API request.

curl -X PUT -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json"\
-d '{"account_id":"<account ID>","logo":{"logo_uri":"<image URI>"}, "link":{"link_url":"<doc-link>",
"link_label":"<documentation-label>"} }'  \ "https://gcp.qubole.com/api/v1.2/accounts/branding"
Sample Requests

Here is an example to brand logo and add documentation links.

curl -X PUT -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" \
-d '{"account_id":"24","logo":{"logo_uri":"https://www.xyz.com/images/logo.jpg"},
"link":{"link_url":"https://www.xyz.com/documentation", "link_label":"Documentation"} }' \
"https://gcp.qubole.com/api/v1.2/accounts/branding"

Here is an example to only brand the logo.

curl -X PUT -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" \
-d '{"account_id":"24","logo":{"logo_uri":"https://www.xyz.com/images/logo.jpg"}}' \
"https://gcp.qubole.com/api/v1.2/accounts/branding"

Here is an example to only add documentation. You must brand the logo before adding the documentation.

curl -X PUT -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" \
-d '{"account_id":"24", "link":{"link_url":"https://www.xyz.com/documentation", "link_label":"Documentation"} }' \
"https://gcp.qubole.com/api/v1.2/accounts/branding"
Disable a QDS Account

Use this API to disable a specific Qubole account. A user with access to multiple Qubole accounts can still access the other accounts; only the disabled account becomes inaccessible. To re-enable a disabled account, create a ticket with Qubole Support. This feature is currently supported only through the API.

Required Role

The following users can make this API call:

  • Users who belong to the system-admin group.
Parameters

Note

Parameters marked in bold below are mandatory. Others are optional and have default values.

Parameter Description
id Account ID of the account that you want to disable.
Request API Syntax
curl -X PUT -H "X-AUTH-TOKEN:<AUTH-TOKEN>" -H "Content-Type: application/json" -H "Accept: application/json" \
-d '{"id":"<Account-Id>"}' \ "https://gcp.qubole.com/api/v1.2/accounts/disable"
Sample Request

Here is a sample API request to disable a Qubole account with ID 23.

curl -X PUT -H "X-AUTH-TOKEN:<AUTH-TOKEN>" -H "Content-Type: application/json" -H "Accept: application/json"
-d '{"id":"23"}' \ "https://gcp.qubole.com/api/v1.2/accounts/disable"
View the Account-level Public SSH Key
GET /api/v1.2/accounts/ssh_key

Use this API to view the account-level public SSH key set in a QDS account. Qubole expects this key to be present on the bastion node if the Unique SSH Key feature is enabled; otherwise, QDS cannot connect to the bastion host.

The public SSH key is used while configuring the bastion host in a Virtual Private Cloud (VPC). Qubole communicates with the cluster that is on a private subnet in the VPC through the bastion host. This happens only if the Unique SSH Key feature is enabled on the QDS account. If it is not enabled, then the default Qubole public key is used to SSH into the bastion host. For more information, see clusters-in-vpc.

You must add this public SSH key to the bastion host by appending it to ~/.ssh/authorized_keys for the ec2-user. For more information, see clusters-in-vpc.
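For example, here is a minimal sketch of adding the key on the bastion host, assuming the public key returned by this API has been copied into the placeholder shell variable QUBOLE_PUBLIC_KEY:

# Run on the bastion host as the ec2-user; appends the Qubole public key.
echo "$QUBOLE_PUBLIC_KEY" >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys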

Parameters

None

Request API Syntax
curl -X GET -H "X-AUTH-TOKEN:$AUTH_TOKEN" -H "Content-Type:application/json" -H "Accept: application/json" \
"https://gcp.qubole.com/api/v1.2/accounts/ssh_key"
Sample API Request

Here is a sample request to see the account-level public SSH key of a QDS account in the https://gcp.qubole.com environment.

curl -X GET -H "X-AUTH-TOKEN:$AUTH_TOKEN" -H "Content-Type:application/json" -H "Accept: application/json" \
"https://gcp.qubole.com/api/v1.2/accounts/ssh_key"

Account API v2.0

Create a QDS Account
POST /api/v2/accounts/

Use this API to create a QDS account.

Required Role

The following users can make this API call:

  • Users who belong to the system-admin group.
  • Users who belong to a group associated with a role that allows creating an account. See Managing Groups and Managing Roles for more information.
Parameters

Note

Parameters marked in bold below are mandatory. Others are optional and have default values.

Parameter Description
name Name of the new account.
sub_account_creation

Determines which plan the new account uses. By default, it is set to false. Possible values are true and false with the following implications:

  • Setting the value to true implies that the new account uses the parent account’s plan.
  • The default value of false implies that the new account uses Qubole’s free trial-period plan.
idle_cluster_timeout_in_secs

After enabling the aggressive downscaling feature on the QDS account, the Cluster Idle Timeout can be configured in seconds. Its minimum configurable value is 300 seconds; the default remains 2 hours (that is, 120 minutes or 7200 seconds).

Note

This feature is only available on request. Contact the account team to enable this feature on the QDS account.

idle_session_timeout If there is no activity on the UI, Qubole logs you out after a specified interval. The default value is 1440 minutes (24 hours). Use this option to configure the time interval in minutes if you do not want the default idle session timeout. The maximum value is 10080 minutes (7 days).
Request API Syntax
curl -i -X POST -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" \
-d '{"account": {"name":"new_account"}' \ "https://gcp.qubole.com/api/v2/accounts
Sample Response
{
  "account_id": “$AccountId”,
  "authentication_token": "$Auth_token",
  "authentication_token_updated_at": "2019-02-19T10:26:31Z",
  "capabilities": 0,
  "created_at": "2019-02-19T10:26:31Z",
  "disabled": false,
  "disabled_at": null,
  "disabled_by": null,
  "id": 23,
  "is_admin": true,
  "is_default": false,
  "is_token_encrypted": true,
  "setting_id": null,
  "setting_type": null,
  "updated_at": "2019-02-19T10:26:31Z",
  "user_id": 12,
  "user_type": "regular",
  "qubole_service_account_email":"$qubole_service_account_email"
}
Edit a QDS Account
PUT /api/v2/accounts/

Use this API to edit an existing QDS account’s parameters.

Required Role

The following users can make this API call:

  • Users who belong to the system-admin group.
  • Users who belong to a group associated with a role that allows editing an account. See Managing Groups and Managing Roles for more information.
Parameters

Note

Parameters marked in bold below are mandatory. Others are optional and have default values.

Parameter Description
access_mode

Denotes the access type you want to use to create an account. The possible values are automated (default) and manual.

Refer to Setting up your Qubole Account on GCP for details. If the access mode is manual, the parameters instance_service_account and compute_service_account are mandatory.

instance_service_account

Instance service account, created in the customer’s project, that Qubole uses to read datasets that the customer chooses to analyze and to write results to the default location.

The instance_service_account is only required when access_mode is manual. When access_mode is automated, instance_service_account will be ignored.

compute_service_account

Compute service account, created in the customer’s project, that Qubole uses to bring up clusters.

The compute_service_account is only required when access_mode is manual. When access_mode is automated, compute_service_account will be ignored.

project_id The GCP project associated with the Qubole account. Every Qubole account can be linked to only one project.
defloc Cloud storage location where Qubole stores results and logs.
data_buckets Cloud storage locations where Qubole can access data. This is a comma-separated string with a maximum limit of 5 data buckets.
Request API Syntax
curl -X PUT -H "X-AUTH-TOKEN:<AUTH-TOKEN>" -H "Content-Type: application/json" -H "Accept: application/json" \
-d '{"<an account parameter" : "<new value>", ......}' \ "https://gcp.qubole.com/api/v2/accounts/"
Sample Request (automated flow)
curl -X PUT -H "X-AUTH-TOKEN:<AUTH-TOKEN>" -H "Content-Type: application/json" -H "Accept: application/json" \
-d '{
       "account":
       {
          "access_mode": "automated",
          "project_id": "$project_id",
          "defloc": "$defloc",
          "data_buckets": "data_bucket1, data_bucket2"
       }
    }' \ "https://gcp.qubole.com/api/v2/accounts/"
Sample Request (manual flow)
curl -X PUT -H "X-AUTH-TOKEN:<AUTH-TOKEN>" -H "Content-Type: application/json" -H "Accept: application/json" \
-d '{
       "account":
       {
          "access_mode": "manual",
          "compute_service_account": "$compute_service_account",
          "instance_service_account": "$instance_service_account",
          "project_id": "$project_id",
          "defloc": "$defloc",
          "data_buckets": "data_bucket1, data_bucket2"
       }
   }' \ "https://gcp.qubole.com/api/v2/accounts/"
Sample Response (same for both automated and manual flow)
{
  "allow_qubole_access": true,
  "allowed_email_domain": "gmail.com,qubole.com,qu-test.com",
  "compute_validated": true,
  "defloc": "$DEFOC",
  "id": 5675,
  "idle_cluster_timeout": 5,
  "idle_cluster_timeout_in_secs": 10555,
  "idle_session_timeout": 25,
  "name": "Control Panel ahsy",
  "persistent_security_groups": "default",
  "private_ssh_key": "$PRIVATE_SSH_KEY",
  "public_ssh_key": "$PUBLIC_SSH_KEY",
  "storage_validated": true,
}
Clone a QDS Account
POST /api/v2/accounts/clone

Use this API to clone a QDS account. You can choose to clone the users from the parent account; by default, they are not cloned.

Required Role

The following users can make this API call:

  • Users who belong to the system-admin group.
  • Users who belong to a group associated with a role that allows cloning an account. See Managing Groups and Managing Roles for more information.
Parameters

Note

Parameters marked in bold below are mandatory. Others are optional and have default values.

Parameter Description
id Account ID of the account that is to be cloned.
name Name of the new cloned account. Provide a name to the account.
clone_qbol_users Set this parameter to true when you want to associate the parent account’s users with the cloned account as well. If it is not set, only the user who clones the account is associated with the cloned account.

Note

By default, clusters are cloned.

Request API Syntax
curl -X POST -H "X-AUTH-TOKEN:<AUTH-TOKEN>" -H "Content-Type: application/json" -H "Accept: application/json" \
-d '{"id":"<Account-Id>", "name":"<account-name>", "clone_qbol_users":"false"}' \ "https://gcp.qubole.com/api/v2/accounts/clone"
Sample Request
curl -X GET -H "X-AUTH-TOKEN:<AUTH-TOKEN>" -H "Content-Type: application/json" -H "Accept: application/json"
"https://gcp.qubole.com/api/v2/accounts/"
Sample Response
{
  "name": "example_account_name",
  "state": "processing_create",
  "authentication_token": "$AUTH-TOKEN",
  "account_id": "123",
  "status": "success"
}
View Information for a QDS Account
GET /api/v2/accounts/

Use this API to view the details of the current account.

Required Role

The following users can make this API call:

  • Users who belong to the system-admin or system-user group.
  • Users who belong to a group associated with a role that allows viewing an account’s details. See Managing Groups and Managing Roles for more information.
Parameters

None.

Request API Syntax
curl -X GET -H "X-AUTH-TOKEN:<AUTH-TOKEN>" -H "Content-Type: application/json" -H "Accept: application/json" \
"https://gcp.qubole.com/api/v2/accounts/"
Sample Response
{
  "id": “$account_id”,
  "access_mode": "$access_mode",
  "project_id": "$project_id",
  "storage_location": "$storage_location",
  "qubole_service_account": "$qubole_service_account",
  "instance_service_account": "$instance_service_account",
  "compute_service_account": "$compute_service_account",
  "compute_validated": "$compute_validation_status",
  "storage_validated": "$storage_validation_status",
  "validation_status": "$validation_status",
  "validation_error_message": "$validation_error_message", (Only if validation_status is completed and validation has failed)
  "data_buckets": "$data_buckets",
  "data_buckets_validation_status": "$data_buckets_validation_status",
  "data_buckets_error_message": "$data_buckets_error_message", (only if data buckets validation has failed)
  "name": "$name",
  "idle_session_timeout": "$idle_session_timeout",
  "authorized_ssh_key": "$authorized_ssh_key"
}
Set and View a Hive Bootstrap in a QDS Account using API version 2

Use these APIs to set and view a Hive bootstrap in a QDS account. Refer to Managing Hive Bootstrap for more information on how to set a bootstrap using the bootstrap editor.

Required Role

The following users can make this API call:

  • Users who belong to the system-user/system-admin groups.
  • Users who belong to a group associated with a role that allows editing an account. See Managing Groups and Managing Roles for more information.
Set a Hive Bootstrap
PUT /api/v2/accounts/bootstrap

Use this API to set a Hive bootstrap in a QDS account.

Parameters

Note

Parameters marked in bold below are mandatory. Others are optional and have default values.

Parameter Description
script Add Hive bootstrap content using this parameter.
is_user_bootstrap Use this parameter to personalize the user-level Hive bootstrap. Set it to true to modify it at the user level.

Note

The user-level Hive bootstrap is loaded after the account-level Hive bootstrap. In case of duplicate entries in the user-level and account-level bootstraps, the user-level Hive bootstrap takes effect.

Request API Syntax
curl -X PUT -H "X-AUTH-TOKEN:<AUTH-TOKEN>" -H "Content-Type: application/json" -H "Accept: application/json" \
-d '{"script":"<new-bootstrap>", "is_user_bootstrap":"true"}' \ "https://gcp.qubole.com/api/v2/accounts/bootstrap"
Sample API Request

Here is a sample API request to set a Hive bootstrap.

curl -X PUT -H "X-AUTH-TOKEN:<AUTH-TOKEN>" -H "Content-Type: application/json" -H "Accept: application/json" \
-d '{ "script": "add jar s3://qubole/jars/loc1/stringutils-1.0-SNAPSHOT.jar;
                add jar s3://qubole/jars/loc1/udftest-1.0-SNAPSHOT.jar;
                create temporary function udftest as '\''com.qubole.hive.udftest.UDFTest'\'';"
    }' \ "https://gcp.qubole.com/api/v2/accounts/bootstrap"
View a Bootstrap
GET /api/v2/accounts/bootstrap

Use this API to view the Hive bootstrap set in a QDS account, and the personalized Hive bootstrap if one is set at the user level.

Parameters

Note

Parameters marked in bold below are mandatory. Others are optional and have default values.

Parameter Description
is_user_bootstrap Set this parameter to true to fetch the Hive bootstrap set at the user level. Use it to see only the personalized Hive bootstrap (if available).
Required Role

The following users can make this API call:

  • Users who belong to the system-user/system-admin groups.
  • Users who belong to a group associated with a role that allows viewing an account’s details. See Managing Groups and Managing Roles for more information.
Request API Syntax
curl -X GET -H "X-AUTH-TOKEN:<AUTH-TOKEN>" -H "Content-Type: application/json" -H "Accept: application/json" \
-d '{"is_user_bootstrap":"true"}' \ "https://gcp.qubole.com/api/v2/accounts/bootstrap"
Allow an IP Address using the New API version

Allowing IP addresses lets users of an account log in only from certain (IPv4 or IPv6) addresses.

Note

Send an email request to help@qubole.com to enable allowing IP addresses for an account. Once enabled, users of the account can log in only from an allowed address.

Required Role

To make this API call you must:

  • Belong to the system-user or system-admin group.
  • Belong to a group associated with a role that allows editing an account. See Managing Groups and Managing Roles for more information.
Add an Allowed IP Address
POST /api/v2/accounts/whitelist_ip
Parameter
Parameter Description
ip_cidr IP address to be allowed, in IPv4 or IPv6 format.
Example

Request:

curl -X POST -H "X-AUTH-TOKEN: $X_AUTH_TOKEN" -H "Content-Type:application/json" -H "Accept: application/json" \
-d '{"ip_cidr" : "103.252.24.87"}' \ "https://gcp.qubole.com/api/v2/accounts/whitelist_ip"

Response:

{"status":{"status_code":200,"message":"IP whitelisted successfully."}}
List Allowed IP Addresses
GET /api/v2/accounts/whitelist_ip
Example

Request:

curl -X GET -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Content-Type:application/json" -H "Accept: application/json" \
"https://gcp.qubole.com/api/v2/accounts/whitelist_ip"

Response:

An array of hashes containing account ID and IP address info; for example:

{"account_whitelisted_ips":[{"account_id":1,"created_at":"2017-01-17T19:06:56Z","id":1,"ip_cidr":"103.252.24.92","updated_at":"2017-01-17T19:06:56Z"},{"account_id":1,"created_at":"2017-01-17T19:07:20Z","id":2,"ip_cidr":"103.252.24.91","updated_at":"2017-01-17T19:07:20Z"}]}
Delete One or More Allowed IP Addresses
DELETE /api/v2/accounts/whitelist_ip/<id>

where <id> is the ID of the allowed IP address. To delete multiple addresses, use a comma-separated list of IDs.

Example

Request:

curl -X DELETE -H "X-AUTH-TOKEN:  $X_AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" \
"https://gcp.qubole.com/api/v2/accounts/whitelist_ip/1,2"

Response

{"status":{"status_code":200,"message":"Deleted"}}
Refresh the Account-level SSH Key Pair using API version 2
PUT api/v2/accounts/ssh_key

Use this API to create a new SSH key pair and use the public key to configure it on the bastion host of the VPC.

You must configure the public key of the SSH key pair on the bastion host of a VPC. Qubole communicates with the cluster that is configured on a private subnet of a VPC through the bastion host.

If the Unique SSH Key feature is not enabled, then the default Qubole public key is used to SSH into the bastion host. For more information, see clusters-in-vpc.

Important

The account-level unique SSH key feature is enabled by default on https://gcp.qubole.com and https://in.qubole.com. If you have the QDS account on any other QDS environment, then create a ticket with Qubole Support to get it enabled.

Whenever the public SSH key is rotated, ensure that you replace the public SSH key on the bastion host with the rotated public SSH key. For more information on adding the unique public SSH key, see clusters-in-vpc.

The rotation policy is per cluster instance: every time a cluster starts, a new SSH key pair is generated, so Qubole automatically rotates the keys on each cluster start. This happens only if the Unique SSH Key feature is enabled on the QDS account. If it is not enabled, then the default Qubole public key is added on the cluster nodes.

Caution

The previously generated SSH key pair is overwritten by the newly generated SSH key pair.

Parameters

None

Request API Syntax
curl -X PUT -H "X-AUTH-TOKEN:$AUTH_TOKEN" -H "Content-Type:application/json" -H "Accept: application/json" \
"https://gcp.qubole.com/api/v2/accounts/ssh_key"
Sample API Request

Here is a sample request to refresh the account-level SSH key pair of a QDS account in the https://gcp.qubole.com environment.

curl -X PUT -H "X-AUTH-TOKEN:$AUTH_TOKEN" -H "Content-Type:application/json" -H "Accept: application/json" \
"https://gcp.qubole.com/api/v2/accounts/ssh_key"

Apps API

This API currently supports only creating, listing, and deleting Spark Job Server apps. For more information on the Spark Job Server, see Understanding the Spark Job Server. The APIs are described in these topics:

Create an App
POST /api/v1.2/apps

Use this API to create a Spark Job Server app. The Spark Job Server allows you to start a long-running Spark application and fire many code snippets at that Spark application. The benefits of using a Job Server are:

  • Code snippets do not incur startup cost, making the application run faster.
  • One code snippet can depend on the result of the previous code snippet, and thus the application runs incrementally.

For more information on Spark Job Server, see Understanding the Spark Job Server.

Required Role

The following users can make this API call:

  • Users who belong to the system-admin or system-user group.
  • Users who belong to a group associated with a role that allows creating an account. See Managing Groups and Managing Roles for more information.
Parameters

Note

Parameters marked in bold below are mandatory. Others are optional and have default values.

Parameter Description
name Name of the Spark Job Server App.
config Denotes the Spark job configuration settings and their values. It accepts an array of Spark job configuration key-value pairs.
kind This parameter denotes the type of application. Currently, Qubole supports only Spark applications.
Request API Syntax
curl -i -X POST -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" \
-d '{ "name": "<App Name>", "config":[ {"<Job Config>": "<value>"}, {"<Job Config>": "<value>"}] "kind": "spark"}' \
    "https://gcp.qubole.com/api/v1.2/apps"
Sample Request

Here is a sample request API to create a new Spark Job Server App.

curl -i -X POST -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" \
 -d '{ "name": "sparkjob_app", "config": {"spark.executor.memory": "3g"}, "kind": "spark"}' \
    "https://gcp.qubole.com/api/v1.2/apps"
List Apps
GET /api/v1.2/apps/

Use this API to get the list of Spark Job Server Apps.

Required Role

The following users can make this API call:

  • Users who belong to the system-admin or system-user group.
  • Users who belong to a group associated with a role that allows creating an account. See Managing Groups and Managing Roles for more information.
Parameters

None

Request API Syntax

Here is the API Syntax.

curl -i -X GET -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Accept: application/json" -H "Content-type: application/json" \
"https://gcp.qubole.com/api/v1.2/apps"
Delete an App
DELETE /api/v1.2/apps/<app-ID>

Use this API to delete a Spark Job Server app by specifying its unique app-ID. Creating an app returns a unique app ID. For more information, see Understanding the Spark Job Server.

Required Role

The following users can make this API call:

  • Users who belong to the system-admin or system-user group.
  • Users who belong to a group associated with a role that allows creating an account. See Managing Groups and Managing Roles for more information.
Parameters

None

Request API Syntax

Here is the API Syntax.

curl -i -X GET -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Accept: application/json" -H "Content-type: application/json" \
"https://gcp.qubole.com/api/v1.2/apps/<app-ID>"
Sample API Request

Here is an example to delete a Spark Job Server app with 300 as its ID.

curl -i -X GET -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Accept: application/json" -H "Content-type: application/json" \
"https://gcp.qubole.com/api/v1.2/apps/300"

Cluster API

Create a Cluster on Google Cloud Platform
POST /api/v2/clusters/

Use this API to create a new cluster when you are using Qubole on GCP. You create a cluster for a workload that has to run in parallel with your pre-existing workloads.

You might want to run workloads across different geographical locations or there could be other reasons for creating a new cluster.

Required Role

The following users can make this API call:

  • Users who belong to the system-admin or system-user group.
  • Users who belong to a group associated with a role that allows creating a cluster. See Managing Groups and Managing Roles for more information.
Parameters

Note

Parameters marked in bold below are mandatory. Others are optional and have default values.

Parameter Description
cloud_config Contains the cloud-specific configuration for the cluster: compute_config, storage_config, location, network_config, and cluster_composition (described below).
cluster_info Contains the configurations of a cluster.
engine_config Contains the configurations of the cluster engine type.
monitoring Contains the cluster monitoring configuration.
internal Contains the security settings for the cluster.
cloud_config
Parameter Description
compute_config Defines the GCP account compute credentials for the cluster.
storage_config Defines the GCP account storage credentials for the cluster.
location Sets the GCP geographical location.
network_config Defines the network configuration for the cluster.
cluster_composition Defines the mixture of on-demand instances and preemptible instances for the cluster.
compute_config
Parameter Description
use_account_compute_creds Determines whether to use account compute credentials. By default, it is set to false. Set it to true to use account compute credentials.
customer_project_id The project ID, unique across GCP.
storage_config
Parameter Description
customer_project_id The project ID, unique across GCP.
disk_type  
disk_size_in_gb  
disk_count  
disk_upscaling_config  
location
Parameter Description
region A Google-defined geographical location where you can run your GCP resources.
zone A subdivision of a GCP region, identified by letter a, b, c, etc.
network_config
Parameter Description
network The Google VPC network.
subnet The name of the subnet.
master_static_ip The static IP address to be attached to the cluster’s coordinator node.
bastion_node_public_dns The bastion host public DNS name if a private subnet is provided for the cluster in a VPC. Do not specify this value for a public subnet.
bastion_node_port The port of the bastion node. The default value is 22. You can specify a non-default port if you want to access a cluster that is in a VPC with a private subnet.
bastion_node_user The bastion node user, which is ec2-user by default. You can specify a non-default user using this option.
master_elastic_ip The elastic IP address for attaching to the cluster coordinator. For more information, see this documentation.
cluster_composition
Parameter Description
master Whether the coordinator node is preemptible or not.
min_nodes Specifies what percentage of minimum required nodes can be preemptible instances.
autoscaling_nodes Specifies what percentage of autoscaling nodes can be preemptible instances.
cluster_info
Parameter Description
master_instance_type Defines the coordinator node type.
slave_instance_type Defines the worker node type.
node_base_cooldown_period

With the aggressive downscaling feature enabled on the QDS account, this is the cool down period set in minutes for nodes on a Hadoop 2 or Spark cluster. The default value is 15 minutes.

Note

The aggressive downscaling feature is only available on request.

node_volatile_cooldown_period

With the aggressive downscaling feature enabled on the QDS account, this is the cool down period set in minutes for preemptible nodes on a Hadoop 2 or Spark cluster. The default value is 15 minutes.

Note

The aggressive downscaling feature is only available on request.

label Labels that identify the cluster. At least one label must be provided when creating a cluster.
min_nodes The minimum number of worker nodes. The default value is 1.
max_nodes The maximum number of nodes up to which the cluster can be autoscaled. The default value is 2.
idle_cluster_timeout_in_secs

After enabling the aggressive downscaling feature on the QDS account, the Cluster Idle Timeout can be configured in seconds. Its minimum configurable value is 300 seconds; the default remains 2 hours (that is, 120 minutes or 7200 seconds).

Note

This feature is only available on a request. Create a ticket with Qubole Support to enable this feature on the QDS account.

cluster_name The name of the cluster.
node_bootstrap A file that is executed on every node of the cluster at boot time. Use this to customize the cluster nodes by setting up environment variables, installing the required packages, and so on. The default value is node_bootstrap.sh.
disallow_cluster_termination Prevents auto-termination of the cluster after a prolonged period of disuse. The default value is false.
force_tunnel  
customer_ssh_key SSH key to use to log in to the instances. The default value is none. (Note: This parameter is not visible to non-admin users.) The SSH key must be in the OpenSSH format and not in the PEM/PKCS format.
env_settings  
datadisk  
root_volume_size Defines the size of the root volume of cluster instances. The supported range for the root volume size is 90 - 2047. An example usage would be "rootdisk" => {"size" => 500}.
engine_config
Parameter Description
flavour Denotes the type of cluster. The supported values are: hadoop2 and spark.
hadoop_settings Contains the Hadoop-specific configuration for the cluster (see the hadoop_settings parameters below).
hive_settings Contains the Hive-specific configuration for the cluster (see the hive_settings parameters below).
hadoop_settings
Parameter Description
custom_hadoop_config The custom Hadoop configuration overrides. The default value is blank.
use_qubole_placement_policy Use Qubole Block Placement policy for clusters with preemptible nodes.
is_ha  
fairscheduler_settings The fair scheduler configuration options.
hive_settings
Parameter Description
hive_version Set to 2.1.1.
pig_version The default version of Pig is 0.11. Pig 0.15 and Pig 0.17 (beta) are the other supported versions. Pig 0.17 (beta) is only supported with Hive 2.1.1.
pig_execution_engine  
overrides The custom configuration overrides. The default value is blank.
is_metadata_cache_enabled  
execution_engine  
monitoring
Parameter Description
ganglia Whether to enable Ganglia monitoring for the cluster. The default value is false.
datadog  
airflow_settings

The following table contains engine_config for an Airflow cluster.

Note

Parameters marked in bold below are mandatory. Others are optional and have default values.

Parameter Description
dbtap_id ID of the data store inside QDS. Set it to -1 if you are using the local MySQL instance as the data store.
fernet_key Encryption key for sensitive information inside the Airflow database, for example, user passwords and connections. It must be 32 url-safe base64-encoded bytes.
type Engine type. It is airflow for an Airflow cluster.
version The default version is 1.10.0 (stable version). The other supported stable versions are 1.8.2 and 1.10.2. All the Airflow versions are compatible with MySQL 5.6 or higher.
airflow_python_version Supported versions are 3.5 (supported using package management) and 2.7. To know more, see Configuring an Airflow Cluster.
overrides

Airflow configuration to override the default settings. Use the following syntax for overrides:

<section>.<property>=<value>\n<section>.<property>=<value>...
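For example, a hypothetical override string (the property names shown are common Airflow settings used for illustration, not values prescribed by Qubole) could be:

core.dag_concurrency=32\nwebserver.workers=2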

internal
Parameter Description
zeppelin_interpreter_mode The default mode is legacy. Set it to user mode if you want the user-level cluster-resource management on notebooks. See Configuring a Spark Notebook for more information.
image_uri_overrides  
spark_s3_package_name  
zeppelin_s3_package_name  
Request API Syntax
curl -X POST -H "X-AUTH-TOKEN:$X_AUTH_TOKEN" -H "Content-Type:application/json" -H "Accept: application/json" \
-d '{
     "cloud_config": {
             "compute_config": {
                     "use_account_compute_creds": true,
                     "customer_project_id": "dev-acm-cust-project-1"
             },
             "storage_config": {
                     "customer_project_id": "dev-acm-cust-project-1",
                     "disk_type": null,
                     "disk_size_in_gb": 100,
                     "disk_count": 0,
                     "disk_upscaling_config": null
             },
             "location": {
                     "region": "us-east1",
                     "zone": "us-east1-b"
             },
             "network_config": {
                     "network": "projects/dev-acm-cust-project-1/global/networks/default",
                     "subnet": "projects/dev-acm-cust-project-1/regions/us-east1/subnetworks/default",
                     "master_static_ip": null,
                     "bastion_node_public_dns": null,
                     "bastion_node_port": null,
                     "bastion_node_user": null,
                     "master_elastic_ip": null
             },
             "cluster_composition": {
                     "master": {
                             "preemptible": false
                     },
                     "min_nodes": {
                             "preemptible": false,
                             "percentage": 0
                     },
                     "autoscaling_nodes": {
                             "preemptible": true,
                             "percentage": 50
                     }
             }
     },
     "cluster_info": {
             "master_instance_type": "n1-standard-4",
             "slave_instance_type": "n1-standard-4",
             "node_base_cooldown_period": null,
             "label": ["gcp-cluster-2"],
             "min_nodes": 1,
             "max_nodes": 1,
             "idle_cluster_timeout_in_secs": null,
             "cluster_name": "gcpqbol_acc44_cl176",
             "node_bootstrap": "node_bootstrap.sh",
             "disallow_cluster_termination": false,
             "force_tunnel": false,
             "customer_ssh_key": null,
             "child_hs2_cluster_id": null,
             "parent_cluster_id": null,
             "env_settings": {},
             "datadisk": {
                     "encryption": false
             },
             "slave_request_type": "ondemand",
             "spot_settings": {}
     },
     "engine_config": {
             "flavour": "hadoop2",
             "hadoop_settings": {
                     "custom_hadoop_config": null,
                     "use_qubole_placement_policy": true,
                     "is_ha": null,
                     "fairscheduler_settings": {
                             "default_pool": null
                     }
             },
             "hive_settings": {
                     "is_hs2": false,
                     "hive_version": "2.1.1",
                     "pig_version": "0.11",
                     "pig_execution_engine": "mr",
                     "overrides": null,
                     "is_metadata_cache_enabled": true,
                     "execution_engine": "tez",
                     "hs2_thrift_port": null
             }
     },
     "monitoring": {
             "ganglia": false,
             "datadog": {
                     "datadog_api_token": null,
                     "datadog_app_token": null
             }
     },
     "internal": {
             "zeppelin_interpreter_mode": null,
             "image_uri_overrides": null,
             "spark_s3_package_name": null,
             "zeppelin_s3_package_name": null
     }
 }' \ "https://gcp.qubole.com/api/v2/clusters"
Sample API Request
curl -X POST -H "X-AUTH-TOKEN:$X_AUTH_TOKEN" -H "Content-Type:application/json" -H "Accept: application/json"
-d '{
       "cloud_config" : {
         "provider" : "gcp"
         "compute_config" : {
                       "compute_validated": False,
                       "use_account_compute_creds": False,
                       "compute_client_id": "<your client ID>",
                       "compute_client_secret": "<your client secret key>",
                       "compute_tenant_id": "<your tenant ID>",
                       "compute_subscription_id": "<your subscription ID>"
                 },
         "location": {
                       "location": "centralus"
                 },
         "network_config" : {
                       "vnet_name" : "<vpc name>",
                           "subnet_name": "<subnet name>",
                           "vnet_resource_group_name": "<vnet resource group name>",
                           "persistent_security_groups": "<persistent security group>",
                 },
         "storage_config" : {
                       "storage_access_key": "<your storage access key>",
                       "storage_account_name": "<your storage account name>",
                       "disk_storage_account_name": "<your disk storage account name>",
                       "disk_storage_account_resource_group_name": "<your disk storage account resource group name>"
           "data_disk_count":4,
           "data_disk_size":300 GB
                 }
       },
       "cluster_info": {
            "master_instance_type": "Standard_A6",
            "slave_instance_type": "Standard_A6",
            "label": ["gcp"],
            "min_nodes": 1,
            "max_nodes": 2,
            "cluster_name": "GCP1",
            "node_bootstrap": "node_bootstrap.sh",
            },
       "engine_config": {
            "flavour": "hadoop2",
              "hadoop_settings": {
                  "custom_hadoop_config": "mapred.tasktracker.map.tasks.maximum=3",
              }
             },
       "monitoring": {
              "ganglia": true,
             }
       }' \ "https://gcp.qubole.com/api/v2/clusters"
Update a Cluster on Google Cloud Platform
PUT /api/v2/clusters/<cluster-id/cluster-label>

Use this API to update a cluster that is on Google Cloud Platform.

Required Role

The following users can make this API call:

  • Users who belong to the system-admin or system-user group.
  • Users who belong to a group associated with a role that allows creating a cluster. See Managing Groups and Managing Roles for more information.
Parameters

Parameters describes the list of parameters of a cluster on Google Cloud Platform. You can change the name of the cluster.

While updating a cluster, all parameters are optional.

Request API Syntax

Request API Syntax describes the complete syntax for creating a cluster. In the Update API payload, add only the configuration of the existing cluster that needs modification.

Sample API Requests

Here is a sample API request to update a cluster with 1223 as its ID.

curl -X PUT -H "X-AUTH-TOKEN:$X_AUTH_TOKEN" -H "Content-Type:application/json" -H "Accept: application/json" \
-d '{
 "cluster_info": {
          "min_nodes": 1,
          "max_nodes": 6,
          }
 }' \ "https://gcp.qubole.com/api/v2/clusters/1223"

Here is a sample API request to add Presto overrides.

curl -X PUT -H "X-AUTH-TOKEN:$AUTH-TOKEN" -H "Content-Type:application/json" -H "Accept: application/json" \
-d '{"engine_config":{
         "flavour":"presto",
         "presto_settings":{
               "custom_presto_config":"jvm.config:\n-Xmx16G \nconfig.properties:\ndatasources=jmx,hive,sqlservercatalog\nascm.enabled=false\ncatalog/sqlservercatalog.properties:\nconnector.name=sqlserver\nconnection-url=jdbc:sqlserver://xxx.xx.xx.xx:xxxx;databaseName=HadoopData\nconnection-user=username\nconnection-password=password",
               "presto_version":"0.180"
         }
         }
     }' \ "https://gcp.qubole.com/api/v2/clusters/cluster-id"
Clone a Cluster on Google Cloud Platform
POST /api/v2/clusters/<cluster ID>/clone

Use this API to clone a cluster.

Required Role

The following users can make this API call:

  • Users who belong to the system-admin or system-user group.
  • Users who belong to a group associated with a role that allows creating a cluster. See Managing Groups and Managing Roles for more information.
Parameters

Parameters describes the list of parameters of a cluster on GCP. You can change the name of the cluster.

Request API Syntax

Request API Syntax describes the complete syntax for creating a GCP cluster. In the Clone API payload, add only the configuration of the existing cluster that needs modification.

Sample API Request

Here is a sample API request to clone a cluster with 1223 as its ID.

curl -X POST -H "X-AUTH-TOKEN:$X_AUTH_TOKEN" -H "Content-Type:application/json" -H "Accept: application/json" \
-d '{
 "cluster_info": {
          "label": ["gcp-clone"],
          "min_nodes": 1,
          "max_nodes": 4,
          "cluster_name": "GCP1-clone",
          "node_bootstrap": "node_bootstrap.sh",
          }
 }' \ "https://gcp.qubole.com/api/v2/clusters/1223/clone"
View a Cluster on Google Cloud Platform
GET /api/v2/clusters/<cluster ID>

Use this API to view a specific cluster’s information/configuration.

Required Role

The following users can make this API call:

  • Users who belong to the system-admin or system-user group.
  • Users who belong to a group associated with a role that allows creating a cluster. See Managing Groups and Managing Roles for more information.
Parameters

None

Request API Syntax
curl -X GET -H "X-AUTH-TOKEN:$X_AUTH_TOKEN" -H "Content-Type:application/json" -H "Accept: application/json" \
"https://gcp.qubole.com/api/v2/clusters/<cluster-ID>"
Sample API Request

Here is a sample request to view the cluster configuration that has 302 as its ID.

curl -X GET -H "X-AUTH-TOKEN:$X_AUTH_TOKEN" -H "Content-Type:application/json" -H "Accept: application/json" \
"https://gcp.qubole.com/api/v2/clusters/302"
List Clusters on Google Cloud Platform
GET /api/v2/clusters/

Use this API to list all clusters.

Parameters

None

Required Role

The following users can make this API call:

  • Users who belong to the system-admin or system-user group.
  • Users who belong to a group associated with a role that allows creating a cluster. See Managing Groups and Managing Roles for more information.
Request API Syntax
curl -X GET -H "X-AUTH-TOKEN:$X_AUTH_TOKEN" -H "Content-Type:application/json" -H "Accept: application/json" \
"https://gcp.qubole.com/api/v2/clusters/"
Sample API Request

The sample request is identical to the request API syntax above, as there are no variables in the API.

Start or Terminate a Cluster on Google Cloud Platform
PUT /api/v2/clusters/<cluster-id/cluster-label>/state

Use this API to start or terminate a cluster.

Required Role

The following users can make this API call:

  • Users who belong to the system-admin or system-user group.
  • Users who belong to a group associated with a role that allows start/stop operations on a cluster. See Managing Groups and Managing Roles for more information.
Parameters

Note

Parameters marked in bold below are mandatory. Others are optional and have default values.

Parameter Description
state Specifies the action to perform. Its valid values are start to start a cluster and terminate to terminate it. Starting an already running cluster or terminating an already terminated cluster has no effect.
Request API Syntax

Here is the request API syntax to start/terminate a cluster.

curl -X PUT -H "X-AUTH-TOKEN:$X_AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" \
-d '{"state":"<start/terminate>"}' \ "https://gcp.qubole.com/api/v2/clusters/<cluster ID/cluster label>/state"
Sample API Requests

Here are sample requests to start and terminate a cluster with 200 as its ID.

Start a Cluster
curl -X PUT -H "X-AUTH-TOKEN:$X_AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" \
-d '{"state":"start"}' \ "https://gcp.qubole.com/api/v2/clusters/200/state"
Terminate a Running Cluster
curl -X PUT -H "X-AUTH-TOKEN:$X_AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" \
-d '{"state":"terminate"}' \ "https://gcp.qubole.com/api/v2/clusters/200/state"
Check Cluster Status on Google Cloud Platform
GET /api/v2/clusters/<cluster ID/cluster-label>/state

Use this API to get the current state of a cluster.

Required Role

The following users can make this API call:

  • Users who belong to the system-admin or system-user group.
  • Users who belong to a group associated with a role that allows viewing a cluster configuration. See Managing Groups and Managing Roles for more information.
Request API Syntax

Here is the API request syntax to know the current status of a cluster.

curl -X GET -H "X-AUTH-TOKEN:$X_AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" \
"https://gcp.qubole.com/api/v2/clusters/<cluster ID/cluster label>/state"
Sample API Request

Here is the sample API request to know the current status of a cluster that has 250 as its ID.

curl -X GET -H "X-AUTH-TOKEN:$X_AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" \
"https://gcp.qubole.com/api/v2/clusters/250/state"
Delete a Cluster on Google Cloud Platform
DELETE /api/v2/clusters/<cluster ID>

Use this API to delete a specific cluster.

Required Role

The following users can make this API call:

  • Users who belong to the system-admin or system-user group.
  • Users who belong to a group associated with a role that allows creating a cluster. See Managing Groups and Managing Roles for more information.
Parameters

None

Request API Syntax
curl -X DELETE -H "X-AUTH-TOKEN:$X_AUTH_TOKEN" -H "Content-Type:application/json" -H "Accept: application/json" \
"https://gcp.qubole.com/api/v2/clusters/<cluster-ID>"
Sample API Request

Here is a sample request to delete the cluster that has 301 as its ID.

curl -X DELETE -H "X-AUTH-TOKEN:$X_AUTH_TOKEN" -H "Content-Type:application/json" -H "Accept: application/json" \
"https://gcp.qubole.com/api/v2/clusters/301"
Remove a Node from a Cluster
DELETE /api/v2/clusters/(string: id_or_label)/nodes

Use this API to remove a worker node from a cluster. This action starts the operation asynchronously. This API is supported only on Hadoop 2 and Presto clusters.

In Hadoop 2/Presto clusters, nodes are removed from the cluster gracefully, which means that a node is removed only after the tasks/jobs running on that particular node are completed. A node is removed only if the current cluster size is greater than the configured minimum cluster size in Hadoop 2/Presto clusters.

The operation can be monitored using the command ID in the response through the command status API.

Note

A cluster must be running to remove a node. The Remove Node API does not check for the maximum size of the cluster. Currently, this function is supported only on Hadoop 2 and Presto clusters.

Required Role

The following users can make this API call:

  • Users who belong to the system-admin group.
  • Users who belong to a group associated with a role that allows removing nodes from an existing cluster. See Managing Groups and Managing Roles for more information.
Response

The response contains a JSON object representing the command ID of the remove node operation. All the attributes mentioned here are returned (except when otherwise specified or redundant).

Parameters

Note

Parameters marked in bold below are mandatory. Others are optional and have default values.

Parameter Description
private_dns Private DNS of the worker node.
Examples

Example 1 shows an example to remove nodes and Example 2 shows an example to remove nodes by passing the private DNS of the worker node.

Example 1

To remove nodes from a cluster with 277344 as its ID.

curl -X DELETE \
-H "X-AUTH-TOKEN:$AUTH_TOKEN" \
-H "Content-Type:application/json" \
https://gcp.qubole.com/api/v2/clusters/277344/nodes
Example 2

To remove nodes from a cluster by specifying the private DNS of the worker node.

curl -i -X DELETE -H "X-AUTH-TOKEN:$AUTH_TOKEN" -H "Content-Type:application/json" \
https://gcp.qubole.com/api/v2/clusters/<cluster-id>/nodes?private_dns=<private_dns>

Command API

Command Object

Many Command APIs, such as those for submitting, canceling, or viewing the status of a command, return a (JSON-encoded) command object. The fields of this object are detailed below:

Field Description
id The ID of the command.
status

The status of the command can be one of the following:

  • waiting: queued in QDS but not started processing yet
  • running: being processed
  • cancelling: the command is being cancelled in response to a user request
  • cancelled: command is complete and was cancelled by the user
  • error: command is complete and failed
  • done: command is complete and was successful
command_type The type of the command.
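For example, this command object is what is returned when you check a command's status; a minimal sketch, assuming the command status endpoint GET /api/v1.2/commands/<command-ID>, where <command-ID> is the ID returned when the command was submitted:

curl -i -X GET -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" \
"https://gcp.qubole.com/api/v1.2/commands/<command-ID>"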
Cancel a Command
PUT /api/v1.2/commands/(int: id)

This API is used to cancel a command.

Required Role

The following users can make this API call:

  • Users who belong to the system-admin or system-user group.
  • Users who belong to a group associated with a role that allows submitting a command. See Managing Groups and Managing Roles for more information.
Parameters

Note

Parameters marked in bold below are mandatory. Others are optional and have default values.

Parameter Description
status Set it to kill to cancel the command.
Example
curl  -i -X PUT -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" \
-d '{"status":"kill"}' \ "https://gcp.qubole.com/api/v1.2/commands/${QUERYID}"
Submit a DB Query Command
POST /api/v1.2/commands/

Use this API to submit a DB query.

Required Role

The following users can make this API call:

  • Users who belong to the system-admin or system-user group.
  • Users who belong to a group associated with a role that allows submitting a command. See Managing Groups and Managing Roles for more information.
Parameters

Note

Parameters marked in bold below are mandatory. Others are optional and have default values.

Parameter Description
query Specify the DB Tap query to run.
db_tap_id Specify the DB Tap id of the target database to run the query on.
command_type DbTapQueryCommand
macros Expressions to evaluate macros used in the DB Tap Query command. Refer to Macros in Scheduler for more details.
name Add a name to the command that is useful while filtering commands from the command history. It does not accept & (ampersand), < (lesser than), > (greater than), ” (double quotes), and ‘ (single quote) special characters, and HTML tags as well. It can contain a maximum of 255 characters.
tags Add a tag to a command so that it is easily identifiable and searchable from the commands list in the Commands History. Add a tag as a filter value while searching commands. It can contain a maximum of 255 characters. A comma-separated list of tags can be associated with a single command. While adding a tag value, enclose it in square brackets. For example, {"tags":["<tag-value>"]}.
macros Denotes the macros that are valid assignment statements containing the variables and its expression as: macros: [{"<variable>":<variable-expression>}, {..}]. You can add more than one variable. For more information, see Macros.
timeout It is a timeout for command execution that you can set in seconds. Its default value is 129600 seconds (36 hours). QDS checks the timeout for a command every 60 seconds, so if the timeout is set to 80 seconds, the command gets killed at the next check, that is, after 120 seconds. By setting this parameter, you can prevent the command from running for the full 36 hours.
script_location Denotes the cloud storage location for the query.
Example

Goal: Show tables

curl -i -X POST -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" \
-d '{"command_type": "DbTapQueryCommand", "db_tap_id":"1",  "query":"show tables"}' \ "https://gcp.qubole.com/api/v1.2/commands"

Response

HTTP/1.1 200 OK
Content-Type: application/json; charset=utf-8

 {
  "timeout":null,
  "qlog":null,
  "status":"waiting",
  "meta_data":{
   "results_resource":"commands/000000/results",
   "logs_resource":"commands/000000/logs"
  },
  "account_id":00,
  "user_id":00,
  "command_source":"API",
  "command":{
   "db_tap_id":1,
   "query":"show tables",
   "md_cmd":null
  },
  "pool":null,
  "can_notify":false,
  "end_time":null,
  "command_type":"DbTapQueryCommand",
  "label":"default",
  "pid":null,
  "progress":0,
  "num_result_dir":-1,
  "created_at":"2014-12-24T09:01:27Z",
  "submit_time":1419411687,
  "name":null,
  "start_time":null,
  "template":"generic",
  "resolved_macros":null,
  "path":"/tmp/2014-12-24/234/000000",
  "id":000000,
  "qbol_session_id":null
 }
Submit a Hive Command
POST /api/v1.2/commands/

This API is used to submit a Hive query.

Required Role

The following users can make this API call:

  • Users who belong to the system-admin or system-user group.
  • Users who belong to a group associated with a role that allows submitting a command. See Managing Groups and Managing Roles for more information.
Parameters

Note

Parameters marked in bold below are mandatory. Others are optional and have default values.

Parameter Description
query Specify Hive query to run. Either query or script_location is required.
script_location Specify a Google Cloud Storage path where the hive query to run is stored. Either query or script_location is required.
command_type HiveCommand
label Specify the cluster label on which this command is to be run.
retry Denotes the number of retries for a job. Valid values of retry are 1, 2, and 3.
retry_delay Denotes the time interval between the retries when a job fails.
macros Expressions to evaluate macros used in the hive command. Refer to Macros in Scheduler for more details.
sample_size Size of sample in bytes on which to run the query for test mode.
maximum_progress Value of progress for constrained run. The valid float value is between 0 and 1.
maximum_run_time Constrained run maximum runtime in seconds
minimum_run_time Constrained run minimum runtime in seconds
approx_aggregations Convert count distinct to count approx. Valid values are bool or NULL
name Add a name to the command that is useful while filtering commands from the command history. It does not accept & (ampersand), < (lesser than), > (greater than), ” (double quotes), and ‘ (single quote) special characters, and HTML tags as well. It can contain a maximum of 255 characters.
pool Use this parameter to specify the Fairscheduler pool name for the command to use.
tags Add a tag to a command so that it is easily identifiable and searchable from the commands list in the Commands History. Add a tag as a filter value while searching commands. It can contain a maximum of 255 characters. A comma-separated list of tags can be associated with a single command. While adding a tag value, enclose it in square brackets. For example, {"tags":["<tag-value>"]}.
macros Denotes the macros that are valid assignment statements containing the variables and its expression as: macros: [{"<variable>":<variable-expression>}, {..}]. You can add more than one variable. For more information, see Macros.
timeout It is a timeout for command execution that you can set in seconds. Its default value is 129600 seconds (36 hours). QDS checks the timeout for a command every 60 seconds, so if the timeout is set to 80 seconds, the command gets killed at the next check, that is, after 120 seconds. By setting this parameter, you can prevent the command from running for the full 36 hours.

Note

The log for a particular Hive query is available at <Default location>/cluster_inst_id/<cmd_id>.log.gz.

Examples

Goal: Show tables

curl  -i -X POST -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" \
-d '{
      "query":"show tables;", "command_type": "HiveCommand"
    }' \
"https://gcp.qubole.com/api/v1.2/commands"

Response:

HTTP/1.1 200 OK
Content-Type: application/json; charset=utf-8

 {
   "command": {
     "approx_mode": false,
     "approx_aggregations": false,
     "query": "show tables",
     "sample": false
   },
   "qbol_session_id": 0000,
   "created_at": "2012-10-11T16:01:09Z",
   "user_id": 00,
   "status": "waiting",
   "command_type": "HiveCommand",
   "id": 3850,
   "progress": 0,
   "meta_data": {
     "results_resource": "commands\/3850\/results",
     "logs_resource": "commands\/3850\/logs"
   }
 }

Goal: Create an External Table from data on S3

export QUERY="create external table miniwikistats (projcode string, pagename string, pageviews int, bytes int) partitioned by(dt string) row format delimited fields terminated by \t lines terminated by \n location s3n://paid-qubole/default-datasets/miniwikistats/"

curl -X POST -H "X-AUTH-TOKEN:$AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" \
-d "{
      \"query\":\"$QUERY\", \"command_type\":\"HiveCommand\"
   }" \
"https://gcp.qubole.com/api/v1.2/commands/"

Response:

HTTP/1.1 200 OK
Content-Type: application/json; charset=utf-8

 {
   "command": {
     "approx_mode": false,
     "approx_aggregations": false,
     "query": "create external table miniwikistats (projcode string, pagename string, pageviews int, bytes i) partitioned by(dt string) row format delimited fields terminated by ' ' lines terminated by '\n' location 's3n:\/\/paid-qubole\/default-datasets\/miniwikistats\/'",
     "sample": false
   },
   "qbol_session_id": 0000,
   "created_at": "2012-10-11T16:44:53Z",
   "user_id": 00,
   "status": "error",
   "command_type": "HiveCommand",
   "id": 3851,
   "progress": 100,
   "meta_data": {
     "results_resource": "commands\/3851\/results",
     "logs_resource": "commands\/3851\/logs"
   }
 }

Goal: Count the number of rows in the table

export QUERY="select count(*) as num_rows from miniwikistats;"

curl -X POST -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" \
-d "{
     \"query\":\"$QUERY\", \"command_type\": \"HiveCommand\"
    }" \
"https://gcp.qubole.com/api/v1.2/commands/"

Response:

HTTP/1.1 200 OK
Content-Type: application/json; charset=utf-8

 {
   "command": {
     "approx_mode": false,
     "approx_aggregations": false,
     "query": "select count(*) as num_rows from miniwikistats;",
     "sample": false
   },
   "qbol_session_id": 0000,
   "created_at": "2012-10-11T16:54:57Z",
   "user_id": 00,
   "status": "waiting",
   "command_type": "HiveCommand",
   "id": 3852,
   "progress": 0,
   "meta_data": {
     "results_resource": "commands\/3852\/results",
     "logs_resource": "commands\/3852\/logs"
   }
 }

Goal: Run a query stored in an S3 file location

Contents of file in S3

select count(*) from miniwikistats

Payload

{
  "script_location":"<S3 Path>", "command_type": "HiveCommand"
}

Request

curl -X POST -H "X-AUTH-TOKEN: $AUTH_TOKEN"  -H "Content-Type: application/json" -H "Accept: application/json" \
-d @payload "https://gcp.qubole.com/api/v1.2/commands/"

Goal: Run a parameterized query stored in an S3 file location

Contents of file in S3

select count(*) from miniwikistats where dt = '$formatted_date$'

Payload

{
    "script_location":"<S3 Path>",
    "macros":[{"date":"moment('2011-01-11T00:00:00+00:00')"},{"formatted_date":"date.clone().format('YYYY-MM-DD')"}],
    "command_type": "HiveCommand"
}

Request

curl -X POST -H "X-AUTH-TOKEN: $AUTH_TOKEN"  -H "Content-Type: application/json" -H "Accept: application/json" \
-d @payload "https://gcp.qubole.com/api/v1.2/commands/"

Take a note of the query ID (in this case 3852). It is used in later examples.

export QUERYID=3852
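
If you want to check on a command before fetching its logs or results, you can poll its status. The following is a minimal sketch, assuming the standard view-command endpoint GET /api/v1.2/commands/<id>, which returns the same command object shown in the responses above:

curl -i -X GET -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" \
"https://gcp.qubole.com/api/v1.2/commands/${QUERYID}"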

Goal: Submitting a Hive Query to a Specific Cluster

curl  -X POST -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" \
-d '{"query":"show tables;", "label":"HadoopCluster", "command_type": "HiveCommand"}' \
"https://gcp.qubole.com/api/v1.2/commands"
Submit a Hadoop CloudDistCp Command
POST /api/v1.2/commands/

Hadoop DistCp is the tool used for copying large amounts of data across clusters. Ensure that the output directory is new and does not already exist before running a Hadoop job.

Required Role

The following users can make this API call:

  • Users who belong to the system-admin or system-user group.
  • Users who belong to a group associated with a role that allows submitting a command. See Managing Groups and Managing Roles for more information.
Parameters

Note

Parameters marked in bold below are mandatory. Others are optional and have default values.

Parameter Description
command_type HadoopCommand
sub_command clouddistcp
sub_command_args
[hadoop-generic-options] [clouddistcp-arg1] [clouddistcp-arg2] ...
src

Location of the data on HDFS or Google Cloud Storage location, to copy.

Important

CloudDistCp does not support bucket names with the underscore (_) character.

dest

Destination path for the copied data on HDFS or Google Cloud Storage location.

Important

CloudDistCp does not support bucket names with the underscore (_) character.

label Specify the cluster label on which this command is to be run.
name Add a name to the command that is useful while filtering commands from the command history. It does not accept the special characters & (ampersand), < (less than), > (greater than), " (double quote), and ' (single quote), or HTML tags. It can contain a maximum of 255 characters.
tags Add a tag to a command so that it is easily identifiable and searchable from the commands list in the Commands History. Add a tag as a filter value while searching commands. It can contain a maximum of 255 characters. A comma-separated list of tags can be associated with a single command. While adding a tag value, enclose it in square brackets. For example, {"tags":["<tag-value>"]}.
macros Denotes the macros that are valid assignment statements containing the variables and its expression as: macros: [{"<variable>":<variable-expression>}, {..}]. You can add more than one variable. For more information, see Macros.
srcPattern

A regular expression that filters the copy operation to a data subset at the src. If you specify neither srcPattern nor groupBy, all data from src is copied to dest.

If the regular expression contains special characters such as an asterisk (*), either the regular expression or the entire args string must be enclosed in single quotes (‘).

groupBy

A regular expression that causes CloudDistCp to concatenate files that match the expression. For example, you could use this option to combine log files written in one hour into a single file. The concatenated filename is the value matched by the regular expression for the grouping.

Parentheses indicate how files should be grouped, with all of the items that match the parenthetical statement being combined into a single output file. If the regular expression does not include a parenthetical statement, the cluster fails on the CloudDistCp step and returns an error.

If the regular expression argument contains special characters, such as an asterisk (*), either the regular expression or the entire args string must be enclosed in single quotes (‘).

When groupBy is specified, only files that match the specified pattern are copied. You must not specify groupBy and srcPattern at the same time.

targetSize

The size, in mebibytes (MiB), of the files to create based on the groupBy option. This value must be an integer. When it is set, CloudDistCp attempts to match this size; the actual size of the copied files may be larger or smaller than this value. Jobs are aggregated based on the size of the data file, hence, it is possible that the target file size will match the source data file size.

If the files concatenated by groupBy are larger than the value of targetSize, they are broken up into part files, and named sequentially with a numeric value appended to the end. For example, a file concatenated into file.gz would be broken into parts as: file0.gz, file1.gz, and so on.

outputCodec It specifies the compression codec to use for the copied files. It can take the values gzip, gz, lzo, snappy, or none. You can use this option, for example, to convert input files compressed with Gzip into output files with LZO compression, or to uncompress the files as part of the copy operation. If you choose an output codec, the filename is appended with the appropriate extension (for example, for gz and gzip, the extension is .gz). If you do not specify a value for outputCodec, the files are copied over with no change in the compression.
CloudServerSideEncryption It ensures that the target data is transferred using SSL and automatically encrypted in Google Cloud Storage using a server-side key. When retrieving data using CloudDistCp, the objects are automatically decrypted. If you try to copy an unencrypted object to an encryption-required storage bucket, the operation fails.
deleteOnSuccess If the copy operation is successful, this option makes CloudDistCp delete copied files from the source location. It is useful if you are copying output files, such as log files, from one location to another as a scheduled task, and you do not want to copy the same files twice.
disableMultipartUpload It disables the use of multipart upload.
encryptionKey If SSE-KMS or SSE-C is specified as the algorithm, use this parameter to specify the key with which the data is encrypted. If the algorithm is SSE-KMS, the key is optional because the default KMS key is used. If the algorithm is SSE-C, you must specify the key or the job fails.
filesPerMapper The number of files placed in each map task.
multipartUploadChunkSize The multipart upload part size, in MiB. CloudDistCp uses multipart upload when writing to cloud storage by default; the default chunk size is 16 MiB.
numberFiles It prepends output files with sequential numbers. The count starts at 0 unless a different value is specified by startingIndex.
startingIndex It is used with numberFiles to specify the first number in the sequence.
outputManifest It creates a text file, compressed with Gzip, that contains a list of all files copied by CloudDistCp.
previousManifest It reads a manifest file that was created during a previous call to CloudDistCp using the outputManifest. When previousManifest is set, CloudDistCp excludes the files listed in the manifest from the copy operation. If outputManifest is specified along with previousManifest, files listed in the previous manifest also appear in the new manifest file, even though the files are not copied.
copyFromManifest It reverses the previousManifest behavior to cause CloudDistCp to use the specified manifest file as a list of files to copy, instead of a list of files to exclude from copying.
CloudEndpoint It specifies the endpoint to use when uploading a file. This option sets the endpoint for both the source and destination.
CloudSSEAlgorithm It is used for encryption. If you do not specify it but CloudServerSideEncryption is enabled, then AES256 algorithm is used by default. Valid values are AES256, SSE-KMS, and SSE-C.
srcCloudEndpoint It is a cloud storage endpoint to specify as the source path.
timeout It is the command execution timeout in seconds. Its default value is 129600 seconds (36 hours). QDS checks the timeout for a command every 60 seconds, so a command with a timeout of 80 seconds is killed at the next check, that is, after 120 seconds. Set this parameter to prevent a command from running for the full 36 hours.
tmpDir It is the location (path) where files are stored temporarily when they are copied from the cloud object storage to the cluster. The default value is hdfs:///tmp.
Example
curl  -i -X POST -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" \
-d '{"sub_command": "clouddistcp", "sub_command_args": "--src cloud://paid-qubole/kaggle_data/HeritageHealthPrize/ --dest /datasets/HeritageHealthPrize", "command_type": "HadoopCommand"}' \
"https://gcp.qubole.com/api/v1.2/commands"

Sample Response

{
    "id": 18167,
    "meta_data": {
        "logs_resource": "commands/18167/logs",
        "results_resource": "commands/18167/results"
    },
    "command": {
        "sub_command": "clouddistcp",
        "sub_command_args": "--src cloud://paid-qubole/kaggle_data/HeritageHealthPrize/ --dest /datasets/HeritageHealthPrize"
    },
    "command_type": "HadoopCommand",
    "created_at": "2013-03-14T09:34:15Z",
    "path": "/tmp/2013-03-14/53/18167",
    "progress": 0,
    "qbol_session_id": 3525,
    "qlog": null,
    "resolved_macros": null,
    "status": "waiting"
}
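
The aggregation options described above can be combined in sub_command_args. The following is a hedged sketch, in which the bucket, paths, cluster label, and regular expression are illustrative placeholders; it groups log files by their name prefix and concatenates them into roughly 256 MiB gzip-compressed output files:

curl  -i -X POST -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" \
-d '{"sub_command": "clouddistcp", "sub_command_args": "--src cloud://example-bucket/logs/ --dest cloud://example-bucket/logs-aggregated/ --groupBy .*/(\\w+)-.*\\.log --targetSize 256 --outputCodec gz", "command_type": "HadoopCommand", "label":"HadoopCluster"}' \
"https://gcp.qubole.com/api/v1.2/commands"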
Submit a Hadoop Jar Command
POST /api/v1.2/commands/

This API is used to submit a Hadoop Jar command. Ensure that the output directory is new and does not exist before running a Hadoop job.

For developing applications, see use-cascading-with-qds.

Required Role

The following users can make this API call:

  • Users who belong to the system-admin or system-user group.
  • Users who belong to a group associated with a role that allows submitting a command. See Managing Groups and Managing Roles for more information.
Parameters

Note

Parameters marked in bold below are mandatory. Others are optional and have default values.

Parameter Description
command_type HadoopCommand
sub_command jar
sub_command_args
storage_path_to_jar [main_class] [hadoop-generic-options] [arg1] [arg2] ...
label Specify the cluster label on which this command is to be run.
retry Denotes the number of retries for a job. Valid values of retry are 1, 2, and 3.
retry_delay Denotes the time interval between the retries when a job fails.
name Add a name to the command that is useful while filtering commands from the command history. It does not accept the special characters & (ampersand), < (less than), > (greater than), " (double quote), and ' (single quote), or HTML tags. It can contain a maximum of 255 characters.
pool Use this parameter to specify the Fairscheduler pool name for the command to use.
tags Add a tag to a command so that it is easily identifiable and searchable from the commands list in the Commands History. Add a tag as a filter value while searching commands. It can contain a maximum of 255 characters. A comma-separated list of tags can be associated with a single command. While adding a tag value, enclose it in square brackets. For example, {"tags":["<tag-value>"]}.
macros Denotes the macros that are valid assignment statements containing the variables and its expression as: macros: [{"<variable>":<variable-expression>}, {..}]. You can add more than one variable. For more information, see Macros.
timeout It is the command execution timeout in seconds. Its default value is 129600 seconds (36 hours). QDS checks the timeout for a command every 60 seconds, so a command with a timeout of 80 seconds is killed at the next check, that is, after 120 seconds. Set this parameter to prevent a command from running for the full 36 hours.
Examples

The example given below runs a Hadoop Streaming job. The streaming jar is stored on Google Cloud Storage and the application just runs a map-only job running the Unix utility wc against the input dataset.

Hadoop Streaming Job
export OUTPUT_LOC=<cloud_storage output location>;

curl  -i -X POST -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json"  \
-d '{"sub_command": "jar", "sub_command_args": "s3://paid-qubole/HadoopAPIExamples/jars/hadoop-0.20.1-dev-streaming.jar -mapper wc -numReduceTasks 0 -input s3://paid-qubole/HadoopAPITests/data/3.tsv -output s3://paid-qubole/HadoopAPITests/data/3_wc", "command_type": "HadoopCommand"}' \
"https://gcp.qubole.com/api/v1.2/commands"

Sample Response

{
  "id":4246,
  "meta_data":
   {
      "results_resource":"commands/4246/results",
      "logs_resource":"commands/4246/logs"
   },
   "command":{"sub_command_args":"s3n://paid-qubole/HadoopAPITests/jars/hadoop-0.20.1-dev-streaming.jar -mapper wc -numReduceTasks 0 -input s3://paid-qubole/datasets/data1_30days/20100101/EU/3.tsv -output s3n://paid-qubole/tmp/wcl_3","sub_command":"jar"},
   "progress":0,
   "status":"waiting",
   "command_type":"HadoopCommand",
   "qbol_session_id":1629,
   "created_at":"2012-10-16T11:29:36Z",
   "user_id":9
}
Hadoop Jar Gutenberg Job
curl  -i -X POST -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" \
-d '{"sub_command": "jar", "sub_command_args": "s3://paid-qubole/HadoopAPIExamples/jars/hadoop-0.20.1-dev-streaming.jar -files s3n://paid-qubole/HadoopAPIExamples/WordCountPython/mapper.py,s3n://paid-qubole/HadoopAPIExamples/WordCountPython/reducer.py -mapper mapper.py -reducer reducer.py -numReduceTasks 1 -input s3n://paid-qubole/default-datasets/gutenberg -output s3://paid-qubole/default-datasets/grun119_1",
"command_type": "HadoopCommand"}' \
"https://gcp.qubole.com/api/v1.2/commands"
Hadoop Streaming Job with a Cluster Label
curl  -i -X POST -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" \
-d '{"sub_command": "streaming", "sub_command_args": "-files s3://paid-qubole/HadoopAPIExamples/WordCountPython/mapper.py,s3://paid-qubole/HadoopAPIExamples/WordCountPython/reducer.py -mapper mapper.py -reducer reducer.py -numReduceTasks 1 -input s3://paid-qubole/default-*/guten* -output s3://paid-qubole/default-datasets/output4",
"command_type": "HadoopCommand", "label":"HadoopCluster"}' \
"https://gcp.qubole.com/api/v1.2/commands"
Hadoop Streaming Job without a Cluster Label

Note: When a job is run without a cluster label, the default cluster runs the command.

curl  -i -X POST -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" \
-d '{"sub_command": "streaming", "sub_command_args": "-files s3://paid-qubole/HadoopAPIExamples/WordCountPython/mapper.py,s3://paid-qubole/HadoopAPIExamples/WordCountPython/reducer.py -mapper mapper.py -reducer reducer.py -numReduceTasks 1 -input s3://paid-qubole/default-*/guten* -output s3://paid-qubole/default-datasets/output4",
"command_type": "HadoopCommand"}' \
"https://gcp.qubole.com/api/v1.2/commands"
Submit a Presto Command
POST /api/v1.2/commands/

This API is used to submit a Presto command.

Note

Presto is not currently supported on all Cloud platforms; see QDS Components: Supported Versions and Cloud Platforms. Presto supports querying tables backed by the storage handlers.

Required Role

The following users can make this API call:

  • Users who belong to the system-admin or system-user group.
  • Users who belong to a group associated with a role that allows submitting a command. See Managing Groups and Managing Roles for more information.
Parameters

Note

Parameters marked in bold below are mandatory. Others are optional and have default values.

Parameter Description
query Specify the Presto query to run. Either the query or the script_location is required.
script_location Specify an S3 path where the Presto query to run is stored. Either the query or the script_location is required. The storage credentials stored in the account are used to retrieve the script file.
command_type Presto command
label Specify the cluster label on which this command is to be run.
retry

Denotes the number of retries for a job. Valid values of retry are 1, 2, and 3.

Caution

Configuring retries will just do a blind retry of a Presto query. This may lead to data corruption for non-Insert Overwrite Directory (IOD) queries.

name Add a name to the command that is useful while filtering commands from the command history. It does not accept the special characters & (ampersand), < (less than), > (greater than), " (double quote), and ' (single quote), or HTML tags. It can contain a maximum of 255 characters.
tags Add a tag to a command so that it is easily identifiable and searchable from the commands list in the Commands History. Add a tag as a filter value while searching commands. It can contain a maximum of 255 characters. A comma-separated list of tags can be associated with a single command. While adding a tag value, enclose it in square brackets. For example, {"tags":["<tag-value>"]}.
macros Denotes the macros that are valid assignment statements containing the variables and its expression as: macros: [{"<variable>":<variable-expression>}, {..}]. You can add more than one variable. For more information, see Macros.
timeout It is the command execution timeout in seconds. Its default value is 129600 seconds (36 hours). QDS checks the timeout for a command every 60 seconds, so a command with a timeout of 80 seconds is killed at the next check, that is, after 120 seconds. Set this parameter to prevent a command from running for the full 36 hours.

presto-system-metrics describes the list of metrics that can be seen on the Datadog monitoring service. It also describes abnormalities and the actions that you can take to handle them.

Examples
Show tables
curl  -i -X POST -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" \
-d '{"query":"show tables;", "command_type": "PrestoCommand"}' \
"https://gcp.qubole.com/api/v1.2/commands"

Note

For a given Presto query, a new Presto Query Tracker is displayed in the Logs tab when:

  • A cluster instance that ran the query is still up.
  • The query info is still present in the Presto server in that cluster instance. The query information is periodically purged from the server.

If either of the above conditions is not met, the older Presto Query Tracker is displayed in the Logs tab.

Sample Response

{ "command":
  { "query": "show tables", },
  "qbol_session_id": 0000,
  "created_at": "2014-01-21T16:01:09Z",
  "user_id": 00,
  "status": "waiting",
  "command_type": "PrestoCommand",
  "id": 4850,
  "progress": 0,
  "meta_data": {
    "results_resource": "commands/4850/results",
    "logs_resource": "commands/4850/logs"
  }
}
Count Number of Rows
export QUERY="select count(*) as num_rows from miniwikistats;" curl -X POST -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" \
-d '{\"query\":\"$QUERY\", \"command_type\" : \"PrestoCommand\" }' \ "https://gcp.qubole.com/api/v1.2/commands/"

Sample Response

{
  "command": { "query": "select count(*) as num_rows from miniwikistats;", },
  "qbol_session_id": 0000,
  "created_at": "2014-10-11T16:54:57Z",
  "user_id": 00,
  "status": "waiting",
  "command_type": "PrestoCommand",
  "id": 4852,
  "progress": 0,
  "meta_data": {
    "results_resource": "commands/4852/results",
    "logs_resource": "commands/4852/logs"
  }
}
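
As with Hive commands, a Presto query can also be read from a script stored in cloud storage by passing script_location instead of query. The following is a minimal sketch; the script path and cluster label are placeholders:

curl  -i -X POST -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" \
-d '{"script_location":"<S3 Path>", "command_type": "PrestoCommand", "label":"presto-cluster"}' \
"https://gcp.qubole.com/api/v1.2/commands"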
Submit a Refresh Table Command
POST /api/v1.2/commands/

This API refreshes a Hive table only. It is mainly useful when data is written extensively to a Hive partition or directory and the Hive table must be refreshed regularly.

Required Role

The following users can make this API call:

  • Users who belong to the system-admin or system-user group.
  • Users who belong to a group associated with a role that allows submitting a command. See Managing Groups and Managing Roles for more information.
Parameters

Note

Parameters marked in bold below are mandatory. Others are optional and have default values.

Parameter Description
db_name Database name that contains the Hive table, which is to be refreshed.
hive_table Name of the Hive table that is to be refreshed.
loader_stable Checks whether a Hive directory or partition is fully loaded.
loader_stable_mult The time, in minutes, to wait before a directory is considered loaded.
template The s3import template is used to refresh tables. It differentiates the refresh table command from other Hive commands, which use the generic template.
name Add a name to the command that is useful while filtering commands from the command history. It does not accept the special characters & (ampersand), < (less than), > (greater than), " (double quote), and ' (single quote), or HTML tags. It can contain a maximum of 255 characters.
tags Add a tag to a command so that it is easily identifiable and searchable from the commands list in the Commands History. Add a tag as a filter value while searching commands. It can contain a maximum of 255 characters. A comma-separated list of tags can be associated with a single command. While adding a tag value, enclose it in square brackets. For example, {"tags":["<tag-value>"]}.
macros Denotes the macros that are valid assignment statements containing the variables and its expression as: macros: [{"<variable>":<variable-expression>}, {..}]. You can add more than one variable. For more information, see Macros.
timeout It is the command execution timeout in seconds. Its default value is 129600 seconds (36 hours). QDS checks the timeout for a command every 60 seconds, so a command with a timeout of 80 seconds is killed at the next check, that is, after 120 seconds. Set this parameter to prevent a command from running for the full 36 hours.
Request
curl -i -X POST -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" \
-d ' {"db_name":"default", "hive_table":"default_qubole_memetracker", "loader_stable":"1",
"loader_stable_mult":"Minutes", "template":"s3import"}' \
"https://gcp.qubole.com/api/v1.2/commands"
Sample Response
HTTP/1.1 200 OK
Cache-Control: max-age=0, private, must-revalidate
Content-Type: application/json; charset=utf-8
Date: Mon, 16 Nov 2015 18:49:20 GMT
ETag: "b2a6a723e8b7e931ff87e44feacc9a2f"
Server: nginx/1.6.2 + Phusion Passenger 4.0.53
Set-Cookie: _tapp_session=bd0070f3344489b9a306c8c072cdc71c; path=/; HttpOnly
Set-Cookie: qbol_user_id=1574; path=/
Status: 200 OK
X-Powered-By: Phusion Passenger 4.0.53
X-Rack-Cache: invalidate, pass
X-Request-Id: 91b57e2a21319a4b51fb354378869fb0
X-Runtime: 0.422908
X-UA-Compatible: IE=Edge,chrome=1
Content-Length: 865
Connection: keep-alive

{"status":"waiting","qbol_session_id":null,"progress":0,"uid":2081,"account_id":632,"end_time":null,
"start_time":null,"command_type":"HiveCommand","command":{"sample":false,"approx_aggregations":false,"md_cmd":null,
"query":"use default ; alter table default_qubole_memetracker recover partitions;","approx_mode":false,
"loader_table_name":"default.default_qubole_memetracker","retry":0,"script_location":null,"loader_stable":1},
"created_at":"2015-11-16T18:49:20Z","num_result_dir":0,"submit_time":1447699760,"pid":null,"can_notify":false,
"qlog":null,"resolved_macros":null,"label":"default","user_id":1574,"saved_query_mutable_id":null,
"command_source":"API","name":null,"pool":null,"timeout":null,"template":"s3import",
"path":"/tmp/2015-11-16/632/402620","id":402620,"meta_data":{"results_resource":"commands/402620/results",
"logs_resource":"commands/402620/logs"}}
Submit a Shell Command
POST /api/v1.2/commands/

Use this API to submit a shell command. Only Bash commands are supported.

Required Role

The following users can make this API call:

  • Users who belong to the system-admin or system-user group.
  • Users who belong to a group associated with a role that allows submitting a command. See Managing Groups and Managing Roles for more information.
Parameters

Note

Parameters marked in bold below are mandatory. Others are optional and have default values.

Parameter Description
inline Inline script to run as a Shell command. Either inline or script_location is required.
script_location Specify a cloud storage path where the shell script to run is stored. Either inline or script_location is required. Storage credentials stored in the account are used to retrieve the script.
command_type ShellCommand
files List of files in a cloud storage bucket. Format : file1,file2. These files are copied to the working directory where the command is executed.
archive List of archives in a storage bucket. Format : archive1,archive2. These are unarchived in the working directory where the command is executed. Each archive is extracted into a folder named after the archive file, including its extension. For example, if the archive file is s3://<bucket>/abc.tar, it is extracted into <working directory>/abc.tar. To refer to a file src/a.py inside the archive, use abc.tar/src/a.py.
macros Expressions to evaluate macros used in the shell command. Refer to Macros in Scheduler for more details.
label Specify the cluster label on which this command is to be run.
can_notify Sends an email on command completion.
name Add a name to the command that is useful while filtering commands from the command history. It does not accept the special characters & (ampersand), < (less than), > (greater than), " (double quote), and ' (single quote), or HTML tags. It can contain a maximum of 255 characters.
pool Use this parameter to specify the Fairscheduler pool name for the command to use.
tags Add a tag to a command so that it is easily identifiable and searchable from the commands list in the Commands History. Add a tag as a filter value while searching commands. It can contain a maximum of 255 characters. A comma-separated list of tags can be associated with a single command. While adding a tag value, enclose it in square brackets. For example, {"tags":["<tag-value>"]}.
macros Denotes the macros that are valid assignment statements containing the variables and its expression as: macros: [{"<variable>":<variable-expression>}, {..}]. You can add more than one variable. For more information, see Macros.
timeout It is the command execution timeout in seconds. Its default value is 129600 seconds (36 hours). QDS checks the timeout for a command every 60 seconds, so a command with a timeout of 80 seconds is killed at the next check, that is, after 120 seconds. Set this parameter to prevent a command from running for the full 36 hours.
Examples

Goal: Inline script

curl  -X POST -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" \
-d '{
        "inline":"hadoop dfs -lsr s3://paid-qubole/;", "command_type":"ShellCommand"
        }' \
"https://gcp.qubole.com/api/v1.2/commands"

Response:

HTTP/1.1 200 OK
Content-Type: application/json; charset=utf-8
{
    "qlog":null,
     "created_at":"2015-01-12T11:50:21Z",
     "status":"waiting",
     "meta_data":{
         "results_resource":"commands/36/results",
         "logs_resource":"commands/36/logs"
     },
     "account_id":"1",
     "user_id":1,
     "pool":null,
     "submit_time":1421063421,
     "progress":0,
     "template":"generic",
     "pid":null,
     "resolved_macros":null,
     "label":"default",
     "timeout":null,
     "can_notify":false,
     "qbol_session_id":7,
     "command_source":"API",
     "name":null,
     "num_result_dir":-1,
     "end_time":null,
     "start_time":null,
     "path":"/tmp/2015-01-12/1/36",
     "id":36,
     "command_type":"ShellCommand",
     "command":{
         "files":null,
         "parameters":null,
         "script_location":null,
         "inline":"hadoop dfs -lsr s3://paid-qubole/;",
         "archives":null
     }
 }

Goal: Script_location

curl  -X POST -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" \
-d '{
        "script_location":"s3://paid-qubole/ShellDemo/data/excite-small.sh;", "command_type":"ShellCommand"
        }' \
"https://gcp.qubole.com/api/v1.2/commands"

Goal: Running shell commands using Files

curl -X POST -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" \
-d '{
        "inline":"hadoop dfs -lsr s3://paid-qubole/;", "files":"s3://paid-qubole/ShellDemo/data/excite-small.sh,s3://paid-qubole/ShellDemo/data/excite-big.sh;", "command_type":"ShellCommand"
        }' \
"https://gcp.qubole.com/api/v1.2/commands"

Goal: Running shell commands using Archives

curl -X POST -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" \
-d '{
        "inline":"hadoop dfs -lsr s3://paid-qubole/;", "archives":"s3://paid-qubole/ShellDemo/data/excite-small.gz,s3://paid-qubole/ShellDemo/data/excite-big.gz;", "command_type":"ShellCommand"
        }' \
"https://gcp.qubole.com/api/v1.2/commands"

Goal: Using Macros in a shell command

curl -X POST -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" \
-d '{
        "inline" : "hadoop dfs -lsr s3://$location$/;", "command_type" : "ShellCommand",
        "macros" : [{"location" : "\"paid-qubole\""}]}' \
"https://gcp.qubole.com/api/v1.2/commands"

Note how the double quotes are escaped in the macros value in the above command.

Goal: Submit a Shell Script

curl -X POST -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" \
-d '{"parameters" : "5454 5454", "command_type" : "ShellCommand"}' \
 "https://gcp.qubole.com/api/v1.2/commands"
Submit a Spark Command
POST /api/v1.2/commands/

Use this API to submit a Spark command.

Required Role

The following users can make this API call:

  • Users who belong to the system-admin or system-user group.
  • Users who belong to a group associated with a role that allows submitting a command. See Managing Groups and Managing Roles for more information.
Parameters

Note

Parameters marked in bold below are mandatory. Others are optional and have default values.

Parameter Description
program Provide the complete Spark Program in Scala, SQL, Command, R, or Python.
language

Specify the language of the program. The supported values are scala (Scala), sql (SQL), command_line (Command), R (R), or py (Python). Required only when a program is used.

Note

Values are case-sensitive.

script_location Specify a cloud storage path where the Spark query (Scala, Python, SQL, R, or Command Line) script is stored. Storage credentials stored in the account are used to retrieve the script file.
arguments Specify the spark-submit command line arguments here.
user_program_arguments Specify the arguments that the user program takes in.
cmdline Alternatively, you can provide the spark-submit command line itself. If you use this option, you cannot use any other parameters mentioned here; all required information is captured in the command line itself.
command_type SparkCommand
label Specify the cluster label on which this command is to be run.
retry Denotes the number of retries for a job. Valid values of retry are 1, 2, and 3.
retry_delay Denotes the time interval between the retries when a job fails.
app_id ID of an app, which is a main abstraction of the Spark Job Server API. An app is used to store the configuration for a Spark application. See Understanding the Spark Job Server for more information.
name Add a name to the command that is useful while filtering commands from the command history. It does not accept the special characters & (ampersand), < (less than), > (greater than), " (double quote), and ' (single quote), or HTML tags. It can contain a maximum of 255 characters.
tags Add a tag to a command so that it is easily identifiable and searchable from the commands list in the Commands History. Add a tag as a filter value while searching commands. It can contain a maximum of 255 characters. A comma-separated list of tags can be associated with a single command. While adding a tag value, enclose it in square brackets. For example, {"tags":["<tag-value>"]}.
macros Denotes the macros that are valid assignment statements containing the variables and its expression as: macros: [{"<variable>":<variable-expression>}, {..}]. You can add more than one variable. For more information, see Macros.
pool Use this parameter to specify the Fairscheduler pool name for the command to use.
timeout It is the command execution timeout in seconds. Its default value is 129600 seconds (36 hours). QDS checks the timeout for a command every 60 seconds, so a command with a timeout of 80 seconds is killed at the next check, that is, after 120 seconds. Set this parameter to prevent a command from running for the full 36 hours.

Note

  • You can run Spark commands with a large script file and large inline content.
  • You can use macros in script files for the Spark commands with subtypes scala (Scala), py (Python), R (R), command_line (Command), and sql (SQL). You can also use macros in large inline contents and large script files for scala (Scala), py (Python), R (R), and sql (SQL).

These features are not enabled for all users by default. Create a ticket with Qubole Support to enable these features on the QDS account.

Note

If you are submitting Scala code that contains multiple lines, you must escape every new line with the escape character \n.
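
For example, a small multi-line Scala program submitted directly with curl would look like the following hedged sketch, with every line break written as \n inside the JSON string; the class name and cluster label are illustrative only:

curl -i -X POST -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" \
-d '{"program":"object Hello {\n  def main(args: Array[String]) {\n    println(\"hello from QDS\")\n  }\n}", "language":"scala", "arguments":"--class Hello", "command_type":"SparkCommand", "label":"spark"}' \
"https://gcp.qubole.com/api/v1.2/commands"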

Examples

The examples are written in Python and use pycurl. Using curl directly is possible but awkward, as the program text needs escaping and JSON does not support literal new lines. To avoid confusion, these Python API examples are provided; they can be used directly.

Alternatively, you can use qds-sdk-py directly.

Example Python API Framework
import sys
import pycurl
import json
c= pycurl.Curl()
url="https://gcp.qubole.com/api/v1.2/commands"
auth_token = <provide auth token here>
c.setopt(pycurl.URL, url)
c.setopt(pycurl.HTTPHEADER, ["X-AUTH-TOKEN: "+ auth_token, "Content-Type:application/json", "Accept: application/json"])
c.setopt(pycurl.POST,1)

(After this, select any of the following examples depending on the requirement.)

The above code snippet can be used to make API calls. The following examples use it as their base and show various use cases.

Example to Submit Spark Scala Program
prog = '''
import scala.math.random

import org.apache.spark._

/** Computes an approximation to pi */
object SparkPi {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Spark Pi")
    val spark = new SparkContext(conf)
    val slices = if (args.length > 0) args(0).toInt else 2
    val n = math.min(100000L * slices, Int.MaxValue).toInt // avoid overflow
    val count = spark.parallelize(1 until n, slices).map { i =>
      val x = random * 2 - 1
      val y = random * 2 - 1
      if (x*x + y*y < 1) 1 else 0
    }.reduce(_ + _)
    println("Pi is roughly " + 4.0 * count / n)
    spark.stop()
  }
}
'''
data=json.dumps({"program":prog,"language":"scala","arguments":"--class SparkPi", "command_type":"SparkCommand"})

c.setopt(pycurl.POSTFIELDS, data)
c.perform()

To submit a snippet to the Spark Job Server app, use the following data payload instead of the above data.

data=json.dumps({"program":prog,"language":"scala","arguments":"--class SparkPi", "command_type":"SparkCommand",
"label"="spark","app_id"="3"})

Where app_id = Spark Job Server app ID. See Understanding the Spark Job Server for more information.

Example to Submit Spark Python Program

Here is the Spark Pi example in Python.

prog = '''
import sys
from random import random
from operator import add

from pyspark import SparkContext


if __name__ == "__main__":
    """
        Usage: pi [partitions]
    """
    sc = SparkContext(appName="PythonPi")
    partitions = int(sys.argv[1]) if len(sys.argv) > 1 else 2
    n = 100000 * partitions

    def f(_):
        x = random() * 2 - 1
        y = random() * 2 - 1
        return 1 if x ** 2 + y ** 2 < 1 else 0

    count = sc.parallelize(xrange(1, n + 1), partitions).map(f).reduce(add)
    print "Pi is roughly %f" % (4.0 * count / n)

    sc.stop()
'''
data=json.dumps({"program":prog,"language":"python","command_type":"SparkCommand"})

c.setopt(pycurl.POSTFIELDS, data)
c.perform()
Example to Add Spark Submit Options

Add arguments in JSON body to supply spark-submit options. You can pass remote files in a cloud storage location in addition to the local files as values to the --py-files argument.

data=json.dumps(
{"program":prog,
"language":"python", "arguments": "--num-executors 10 --max-executors 10 --executor-memory 5G --executor-cores 2",
"command_type":"SparkCommand"})
Example to Add Arguments to User Program

Add user_program_arguments in the JSON body. Here is a sample program that takes in a user argument (the number of rows to generate).

prog='''spark.range(args(0).toInt).collect.foreach (println)'''

data=json.dumps(
{"program":prog,
"language":"scala",
"user_program_arguments": "10",
"command_type":"SparkCommand",})

c.setopt(pycurl.POSTFIELDS, data)
c.perform()
Example to Use Command Line Parameter

For power users, Qubole provides the ability to supply the spark-submit command line directly. This is explained in detail here.

Note

It is not recommended to run a Spark application as a Bash command under the Shell command options, because automatic adjustments, such as increasing the Application Coordinator memory based on the driver memory and enabling debug options, do not happen. Such automatic adjustments occur when you run a Spark application through the Command Line option.

In this case, you must compile the program (for Scala), create a jar, upload the file to cloud storage, and invoke the command line. Note that Qubole's deployment of Spark is available in the /usr/lib/spark directory:

/usr/lib/spark/bin/spark-submit [options] <app jar in cloud storage | python file> [app options]

Here is an example.

/usr/lib/spark/bin/spark-submit --class <classname>  --max-executors 100 --num-executors 15 --driver-memory 10g
--executor-memory 3g --executor-cores 5 <jar_path_in-storage> <arguments>

Here is a REST API example to submit a Spark command in the command-line language.

curl  -i -X POST -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" \
-d '{"cmdline":"/usr/lib/spark/bin/spark-submit --class org.apache.spark.examples.SparkPi
     --master yarn-client /usr/lib/spark/spark-examples-*", "language":"command_line", "command_type":"SparkCommand",
     "label":"sparkcluster"}' \
      "https://gcp.qubole.com/api/v1.2/commands"
Example to Submit Spark Command in SQL

You can submit a Spark Command in SQL. Here is an example to submit a Spark Command in SQL.

curl  -i -X POST -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" \
-d '{
      "sql":"select * from default_qubole_memetracker limit 10;",
      "language":"sql","command_type":"SparkCommand", "label":"spark"
    }' \
"https://gcp.qubole.com/api/v1.2/commands"

When submitting a Spark command in SQL, you can specify the location of a SparkSQL script in the script_location parameter as shown in the following example.

curl  -i -X POST -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" \
-d '{"script_location":"<S3 Path>", "language":"sql", "command_type":"SparkCommand", "label":"<cluster-label>"
    }' \
"https://gcp.qubole.com/api/v1.2/commands"
Example to Submit a Spark Command in SQL to a Spark Job Server App

You can submit a Spark command in SQL to an existing Spark Job Server app.

curl  -i -X POST -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" \
-d '{
      "sql":"select * from default_qubole_memetracker limit 10;",
      "language":"sql","command_type":"SparkCommand", "label":"spark","app_id":"3"
    }' \
"https://gcp.qubole.com/api/v1.2/commands"

Where app_id = Spark Job Server app ID. See Understanding the Spark Job Server for more information.

Known Issue

The Spark Application UI may display the state of the application incorrectly when preemptible instances are used.

When GCP reclaims the instance the coordinator node is running on, the Spark Application UI may still show the application is running. You can see the actual status of the Qubole command on the Workbench or Notebooks page.

To avoid this issue, use an on-demand instance for the coordinator node.

View the Command History
GET /api/v1.2/commands/

Use this API to view command history.

Required Role

The following users can make this API call:

  • Users who belong to the system-admin or system-user group.
  • Users who belong to a group associated with a role that allows viewing the command history. See Managing Groups and Managing Roles for more information.
View Queries by User ID
Resource URI commands/
Request Type GET
Supporting Versions v1, v2.0
Return Value This request returns a JSON object containing the Command objects with all their attributes as described above. Additionally, the returned JSON object contains the next_page, previous_page, and per_page parameters.
Parameters

Note

Parameters marked in bold below are mandatory. Others are optional and have default values.

Parameter Description
page The page of the command history to retrieve. Its value is an integer.
per_page The number of commands to be retrieved per page. Its value is an integer and the maximum value is 100. Retrieves the next 100 commands based on the last command ID for a given QDS account.
all_users By default, it is set to 0. Set it to 1 to get the command history of all users.
include_query_properties By default, this parameter is set to false. Setting it to true displays query properties such as tags and query history comments.
start_date The date from which you want the command history (inclusive). The API default is 30 days before the end date. This parameter also supports timestamp in the UTC timezone (YYYY-MM-DDTHH:MM:SSZ) format.
end_date The date until which you want the command history (inclusive). The API default is today. This parameter also supports timestamp in the UTC timezone (YYYY-MM-DDTHH:MM:SSZ) format.
command_type The type of the command. Enter a single or multiple (comma-separated) values.
status

The status of the command. It can be one of the following:

  • waiting: denotes that the command is queued (in QDS) but has not started processing yet
  • running: denotes that the command is being processed
  • cancelling: denotes that the command is being cancelled in response to a user request
  • cancelled: denotes that the command is complete but was cancelled by the user
  • error: denotes that the command is complete but failed
  • done: denotes that the command is complete and was successful
command_source The source of creation of the command, for example, UI, API, and/or Scheduler. Enter a single or multiple (comma-separated) values.
name Use the name of the command to filter commands from the command history. The special characters & (ampersand), < (less than), > (greater than), " (double quote), and ' (single quote), and HTML tags are not accepted. It can contain a maximum of 255 characters.
Sample API Requests

Examples:

  • To get the last 10 commands for the current user:
curl -i -X GET -H "X-AUTH-TOKEN:$AUTH_TOKEN" -H "Content-Type:application/json" -H "Accept: application/json" \
"https://gcp.qubole.com/api/v1.2/commands"
  • (Pagination) To get results 10-12 for current user (4th page with 3 results per page):
curl -i -X GET -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" \
-d '{"page":"4", "per_page":"3"}' "https://gcp.qubole.com/api/v1.2/commands"

or

curl -i -H "X-AUTH-TOKEN: $AUTH_TOKEN" "https://gcp.qubole.com/api/v1.2/commands?page=4&per_page=3"

To search by the command type and name (returns last 10 commands with the specified name and specified command type):

curl -i -X GET -H "X-AUTH-TOKEN:$AUTH_TOKEN" -H "Content-Type:application/json" -H "Accepts:application/json" \
-d '{"command_type":"ShellCommand,HiveCommand", "name":"named_command"}' \
"https://gcp.qubole.com/api/v1.2/commands"
Sample API Response

The sample response for the paginated call will be:

{
   "paging_info":{"previous_page":3,"next_page":5,"per_page":3},
   "commands": [{<standard command object as described in create a command>}, ..]
}
View Hadoop Jobs Spawned By a Command
GET /api/v1.2/commands/(int: command_id)/jobs

Use this API to retrieve the details of the Hadoop jobs spawned on the cluster by a command (command_id). This information is only available for commands that have completed.

Required Role

The following users can make this API call:

  • Users who belong to the system-admin or system-user group.
  • Users who belong to a group associated with a role that allows viewing Hadoop jobs. See Managing Groups and Managing Roles for more information.
Response

The response is an array of JSON objects with the following details.

Job Fields
Field Description
job_stats The field displays various details of the job, namely its counters, start time, finish time, and so on. It displays values only when the cluster is active. Otherwise, for inactive or terminated clusters, the field is empty, as in job_stats:{}.
url The JobTracker URL for the job.
job_id The job ID.
error_message A message indicating an error when retrieving the job details. It is not present if there is no error.
http_error_code The HTTP error code if any. It is present only if there was an HTTP error retrieving job details.
Example

Goal

To view the status and counters of the Hadoop jobs spawned by command 1234:

curl  -i -X GET -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" \
"https://gcp.qubole.com/api/v1.2/commands/1234/jobs"

Response

[
{
    "job_stats": {
        "finished_at": "Mon Feb 16 07:54:56 UTC 2015",
        "user": "foo@bar.com",
        "pool_name": "foo@bar.com",
        "status": "SUCCEEDED",
        "job_name": "110163-ShellCommand",
        "reduce": {
            "num_tasks": "0",
            "pending_tasks": "0",
            "complete_tasks": "0",
            "complete_percent": "100.0",
            "killed_task_attemps": "0",
            "running_tasks": "0",
            "failed_task_attempts": "0",
            "killed_tasks": "0"
        },
        "counters": {
            "org.apache.hadoop.mapred.JobInProgress$Counter": [
                {
                    "name": "Total time spent by all maps",
                    "total_value": "4,545",
                    "reduce_value": "0",
                    "map_value": "0"
                },
                {
                    "name": "Total time spent by all reduces waiting after reserving slots (ms)",
                    "total_value": "0",
                    "reduce_value": "0",
                    "map_value": "0"
                },
                {
                    "name": "Total time spent by all maps waiting after reserving slots (ms)",
                    "total_value": "0",
                    "reduce_value": "0",
                    "map_value": "0"
                },
                {
                    "name": "Launched map tasks",
                    "total_value": "1",
                    "reduce_value": "0",
                    "map_value": "0"
                },
                {
                    "name": "Total time spent by all reduces",
                    "total_value": "0",
                    "reduce_value": "0",
                    "map_value": "0"
                }
            ],
            "com.qubole.ShellLauncher$MapCounter": [
                {
                    "name": "HasActualJobStartedYet",
                    "total_value": "1",
                    "reduce_value": "0",
                    "map_value": "1"
                },
                {
                    "name": "ShellExitCode",
                    "total_value": "0",
                    "reduce_value": "0",
                    "map_value": "0"
                }
            ],
            "org.apache.hadoop.mapred.Task$Counter": [
                {
                    "name": "Map input records",
                    "total_value": "1",
                    "reduce_value": "0",
                    "map_value": "1"
                },
                {
                    "name": "Total physical memory in bytes",
                    "total_value": "89,743,360",
                    "reduce_value": "0",
                    "map_value": "89,743,360"
                },
                {
                    "name": "Spilled Records",
                    "total_value": "0",
                    "reduce_value": "0",
                    "map_value": "0"
                },
                {
                    "name": "MAP_TASK_WALLCLOCK",
                    "total_value": "2,402",
                    "reduce_value": "0",
                    "map_value": "2,402"
                },
                {
                    "name": "Total cumulative CPU milliseconds",
                    "total_value": "530",
                    "reduce_value": "0",
                    "map_value": "530"
                },
                {
                    "name": "Map input bytes",
                    "total_value": "1",
                    "reduce_value": "0",
                    "map_value": "1"
                },
                {
                    "name": "Total virtual memory in bytes",
                    "total_value": "1,122,668,544",
                    "reduce_value": "0",
                    "map_value": "1,122,668,544"
                },
                {
                    "name": "Map output records",
                    "total_value": "0",
                    "reduce_value": "0",
                    "map_value": "0"
                }
            ]
        },
        "started_at": "Mon Feb 16 07:54:50 UTC 2015",
        "map": {
            "num_tasks": "1",
            "pending_tasks": "0",
            "complete_tasks": "1",
            "complete_percent": "100.0",
            "killed_task_attemps": "0",
            "running_tasks": "0",
            "failed_task_attempts": "0",
            "killed_tasks": "0"
        }
    },
    "url": "https://gcp.qubole.com.net/qpal/handle_proxy?query=http%3A%2F%2F2-54-161-105-44.compute-1.gcp.com%3A50030%2Fjobdetails_json.jsp%3Fjobid%3Djob_11.201502160742_0001",
    "job_id": "job_11.201502160742_0001"
},
{
    "job_stats": {
        "finished_at": "Mon Feb 16 08:02:47 UTC 2015",
        "user": "foo@bar.com",
        "pool_name": "foo@bar.com",
        "status": "SUCCEEDED",
        "job_name": "TeraGen",
        "reduce": {
            "num_tasks": "0",
            "pending_tasks": "0",
            "complete_tasks": "0",
            "complete_percent": "100.0",
            "killed_task_attemps": "0",
            "running_tasks": "0",
            "failed_task_attempts": "0",
            "killed_tasks": "0"
        },
        "counters": {
            "org.apache.hadoop.mapred.JobInProgress$Counter": [
                {
                    "name": "Total time spent by all maps",
                    "total_value": "6,293,376",
                    "reduce_value": "0",
                    "map_value": "0"
                },
                {
                    "name": "Total time spent by all reduces waiting after reserving slots (ms)",
                    "total_value": "0",
                    "reduce_value": "0",
                    "map_value": "0"
                },
                {
                    "name": "Total time spent by all maps waiting after reserving slots (ms)",
                    "total_value": "0",
                    "reduce_value": "0",
                    "map_value": "0"
                },
                {
                    "name": "Launched map tasks",
                    "total_value": "46",
                    "reduce_value": "0",
                    "map_value": "0"
                },
                {
                    "name": "Total time spent by all reduces",
                    "total_value": "0",
                    "reduce_value": "0",
                    "map_value": "0"
                }
            ],
            "FileSystemCounters": [
                {
                    "name": "HDFS_FILES_CREATED",
                    "total_value": "40",
                    "reduce_value": "0",
                    "map_value": "40"
                },
                {
                    "name": "HDFS_BYTES_WRITTEN",
                    "total_value": "100,000,000,000",
                    "reduce_value": "0",
                    "map_value": "100,000,000,000"
                }
            ],
            "org.apache.hadoop.mapred.Task$Counter": [
                {
                    "name": "Map input records",
                    "total_value": "1,000,000,000",
                    "reduce_value": "0",
                    "map_value": "1,000,000,000"
                },
                {
                    "name": "Total physical memory in bytes",
                    "total_value": "5,167,636,480",
                    "reduce_value": "0",
                    "map_value": "5,167,636,480"
                },
                {
                    "name": "Spilled Records",
                    "total_value": "0",
                    "reduce_value": "0",
                    "map_value": "0"
                },
                {
                    "name": "MAP_TASK_WALLCLOCK",
                    "total_value": "6,066,445",
                    "reduce_value": "0",
                    "map_value": "6,066,445"
                },
                {
                    "name": "Total cumulative CPU milliseconds",
                    "total_value": "3,237,210",
                    "reduce_value": "0",
                    "map_value": "3,237,210"
                },
                {
                    "name": "Map input bytes",
                    "total_value": "1,000,000,000",
                    "reduce_value": "0",
                    "map_value": "1,000,000,000"
                },
                {
                    "name": "Total virtual memory in bytes",
                    "total_value": "66,841,980,928",
                    "reduce_value": "0",
                    "map_value": "66,841,980,928"
                },
                {
                    "name": "Map output records",
                    "total_value": "1,000,000,000",
                    "reduce_value": "0",
                    "map_value": "1,000,000,000"
                }
            ]
        },
        "started_at": "Mon Feb 16 07:55:11 UTC 2015",
        "map": {
            "num_tasks": "40",
            "pending_tasks": "0",
            "complete_tasks": "40",
            "complete_percent": "100.0",
            "killed_task_attemps": "6",
            "running_tasks": "0",
            "failed_task_attempts": "0",
            "killed_tasks": "0"
        }
    },
    "url": "https://gcp.qubole.com.net/qpal/handle_proxy?query=http%3A%2F%2F54-161-105-44.compute-1.gcp.com%3A50030%2Fjobdetails_json.jsp%3Fjobid%3Djob_11.201502160742_0003",
    "job_id": "job_11.201502160742_0003"
}
]
View Command Logs
GET /api/v1.2/commands/(int: command_id)/logs

Retrieves the log (i.e. stderr) of the command (command_id).

Required Role

The following users can make this API call:

  • Users who belong to the system-admin or system-user group.
  • Users who belong to a group associated with a role that allows viewing commands’ logs. See Managing Groups and Managing Roles for more information.
Response

The response is raw text containing the log of the command.

For Workflow commands, the sequence_number parameter enables downloading of the logs of a workflow subcommand.
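
For example, to fetch the logs of the third subcommand of a workflow command, assuming sequence_number is passed as a query parameter, a hedged sketch would be:

curl -i -X GET -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Accept: text/plain" \
"https://gcp.qubole.com/api/v1.2/commands/${QUERYID}/logs?sequence_number=3"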

Example

Goal

To view the logs of a command, for example QUERYID=1234:

curl  -i -X GET -H "X-AUTH-TOKEN: $AUTH_TOKEN" \
-H "Content-Type: application/json" \
-H "Accept: text/plain" \
"https://gcp.qubole.com/api/v1.2/commands/${QUERYID}/logs"

Response

Total MapReduce jobs = 1
Getting Hadoop cluster information ...
Cluster not found - provisioning cluster machines ...
Waiting for Hadoop to come up ...
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
......
......
......
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 1
2012-10-11 16:57:30,346 Stage-1 map = 0%,  reduce = 0%
2012-10-11 16:57:59,725 Stage-1 map = 1%,  reduce = 0%
2012-10-11 16:58:02,784 Stage-1 map = 8%,  reduce = 0%
......
......
......
2012-10-11 16:59:46,147 Stage-1 map = 99%,  reduce = 0%, Cumulative CPU 73.01 sec
2012-10-11 16:59:47,159 Stage-1 map = 99%,  reduce = 0%, Cumulative CPU 73.01 sec
2012-10-11 16:59:48,172 Stage-1 map = 99%,  reduce = 0%, Cumulative CPU 73.01 sec
2012-10-11 16:59:49,183 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 73.01 sec
2012-10-11 16:59:50,195 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 73.01 sec
2012-10-11 16:59:51,207 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 73.01 sec
2012-10-11 16:59:52,218 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 73.01 sec
.....
.....
.....
2012-10-11 17:00:06,655 Stage-1 map = 100%,  reduce = 17%, Cumulative CPU 115.15 sec
2012-10-11 17:00:07,668 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 115.15 sec
2012-10-11 17:00:08,684 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 119.52 sec
MapReduce Total cumulative CPU time: 1 minutes 59 seconds 520 msec
Ended Job = job_14.201210111655_0001
1 Rows loaded to s3n://paid-qubole/......
MapReduce Jobs Launched:
Job 0: Map: 2  Reduce: 1   Cumulative CPU: 119.52 sec   HDFS Read: 0 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 1 minutes 59 seconds 520 msec
OK
Time taken: 303.329 seconds
View Command Results
GET /api/v1.2/commands/(int: command_id)/results

This retrieves results for a completed command (command_id).

Required Role

The following users can make this API call:

  • Users who belong to the system-admin or system-user group.
  • Users who belong to a group associated with a role that allows viewing command results. See Managing Groups and Managing Roles for more information.
Parameters

Note

Presto is not currently supported on all Cloud platforms; see QDS Components: Supported Versions and Cloud Platforms.

Parameter Description
raw By default, it is set to false. Set it to true to see the result as is, without converting the ^A delimiters into tabs. This works well for Presto query results. However, for a Hive command with fewer than 1000 result rows, the delimiters are still converted to tabs. (A sketch that uses this parameter appears after the examples below.)
include_headers By default, it is set to false. Set it to true to include column headers in the results.
Response

When the command results in the cloud storage directory location are smaller than 20 MB and consist of fewer than 700 files, the result is returned inline in the JSON response. When the results are larger than 20 MB or the number of files exceeds 700, the cloud storage directory location that contains the result files is returned instead.

Status Code 422: Command is not done. Results are unavailable.

For Workflow commands, the sequence_number parameter enables downloading of the results of a workflow subcommand.
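For example, assuming the results of the workflow's second subcommand are wanted, the parameter can be appended as a query string (the sequence_number value 2 is purely illustrative):

curl -X GET -H "X-AUTH-TOKEN: $AUTH_TOKEN" \
-H "Accept: application/json" \
"https://gcp.qubole.com/api/v1.2/commands/${QUERYID}/results?sequence_number=2"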

Example

Goal

To view the results of a command; for example, QUERYID=1234

curl -X GET -H "X-AUTH-TOKEN: $AUTH_TOKEN" \
-H "Content-Type: application/json" \
-H "Accept: application/json" \
"https://gcp.qubole.com/api/v1.2/commands/${QUERYID}/results"

Response

The following is the response, if the result is inlined:

HTTP/1.1 200 OK
Content-Type: application/json; charset=utf-8

{"inline":true, "results":"1\t240\r\n2\t300"}

The following is the response, if the result is NOT inlined:

HTTP/1.1 200 OK
Content-Type: application/json; charset=utf-8

{"inline":false, "result_location":[ "An array of cloud storage paths. Directories end with '/' in end" ]}

The following is the response when the result set contains too many files to process inline; the results field carries a message and result_location shows the cloud storage path of the result files to download.

HTTP/1.1 200 OK
Content-Type: application/json; charset=utf-8

{ "inline": true, "results": "Too many files to process - download from ... ", "result_location": ["s3://mybucket/results/results..."] }
Example to include headers in the response

Goal

To view headers in the response; for example, QUERYID=183560526

curl -X GET -H "X-AUTH-TOKEN: $AUTH_TOKEN" \
-H "Content-Type: application/json" \
-H "Accept: application/json" \
 "https://gcp.qubole.com/api/v1.2/commands/183560526/results?include_headers=true"

Response

The following is the response:

HTTP/1.1 200 OK
Content-Type: application/json; charset=utf-8

{
 "inline": true,
 "results": "itinid\tmktid\tseqnum\tcoupons\tyear\tquarter\torigin\toriginaptind\torigincitynum\torigincountry\
 toriginstatefips\toriginstate\toriginstatename\toriginwac\tdest\tdestaptind\tdestcitynum\tdestcountry\tdeststatefips\
 tdeststate\tdeststatename\tdestwac\tbreak\tcoupontype\ttkcarrier\topcarrier\trpcarrier\tpassengers\tfareclass\tdistance\
 tdistancegroup\tgateway\titingeotype\tcoupongeotype\r\n\"ItinID\"\t\"MktID\"\t\"SeqNum\"\t\"Coupons\"\t\"Year\"\
 t\"Quarter\"\t\"Origin\"\t\"OriginAptInd\"\t\"OriginCityNum\"\t\"OriginCountry\"\t\"OriginStateFips\"\t\"OriginState\
 "\t\"OriginStateName\"\t\"OriginWac\"\t\"Dest\"\t\"DestAptInd\"\t\"DestCityNum\"\t\"DestCountry\"\t\"DestStateFips\"\t\
 "DestState\"\t\"DestStateName\"\t\"DestWac\"\t\"Break\"\t\"CouponType\"\t\"TkCarrier\"\t\"OpCarrier\"\t\"RPCarrier\"\t\
 "Passengers\"\t\"FareClass\"\t\"Distance\"\t\"DistanceGroup\"\t\"Gateway\"\t\"ItinGeoType\"\t\"CouponGeoType\"\r\n\
 "200734005923\"\t\"200737154697\"\t2\t4\t2007\t3\t\"LGA\"\t2\t63760\t\"US\"\t\"36\"\t\"NY\"\t\"New York\"\t22\t\"BOS\"\t0\
 t12200\t\"US\"\t\"25\"\t\"MA\"\t\"Massachusetts\"\t13\t\"X\"\t\"A\"\t\"US\"\t\"US\"\t\"ZW\"\t1.00\t\"X\"\t185.00\t1\t0.00\t2\t2\r\n"
}
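Similarly, the raw parameter described earlier can be passed as a query string to fetch the results without delimiter conversion. This is a sketch reusing QUERYID from the first example:

curl -X GET -H "X-AUTH-TOKEN: $AUTH_TOKEN" \
-H "Accept: application/json" \
"https://gcp.qubole.com/api/v1.2/commands/${QUERYID}/results?raw=true"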
Example to download the results file from the notebook/dashboard convert command

Goal

To download the results file in PDF format from the notebook/dashboard convert command.

curl -X GET -H "X-AUTH-TOKEN: $AUTH_TOKEN" -d 'fileFormat=pdf' \
"https://gcp.qubole.com/api/v1.2/commands/${QUERYID}/results" > note.pdf

Note

The above command downloads the report to the note.pdf file. You can change the fileFormat value and the file extension of the results file to html or png, as required.
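For instance, a variant of the same call that requests a PNG instead (the fileFormat value and the output filename are illustrative):

curl -X GET -H "X-AUTH-TOKEN: $AUTH_TOKEN" -d 'fileFormat=png' \
"https://gcp.qubole.com/api/v1.2/commands/${QUERYID}/results" > note.png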

View Command Status
GET /api/v1.2/commands/(int: command_id)

Use this API to check the status of any command. A user can check the status of any query in the account.

Required Role

The following users can make this API call:

  • Users who belong to the system-admin or system-user group.
  • Users who belong to a group associated with a role that allows viewing command status. See Managing Groups and Managing Roles for more information.
Example
curl -i -X GET -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" \
"https://gcp.qubole.com/api/v1.2/commands/${QUERYID}"

Sample response:

{
  "command": {
    "approx_mode": false,
    "approx_aggregations": false,
    "query": "select count(*) as num_rows from miniwikistats;",
    "sample": false
  },
  "qbol_session_id": 0000,
  "created_at": "2012-10-11T16:54:57Z",
  "user_id": 00,
  "status": "done",
  "command_type": "HiveCommand",
  "id": 3852,
  "progress": 100,
  "meta_data": {
    "results_resource": "commands\/3852\/results",
    "logs_resource": "commands\/3852\/logs"
  },
  "email": "email@example.com",
  "throttled": true
}

Note

If the maximum concurrency of the account has been reached, the API response contains an additional throttled field set to true. The status of the command remains in the waiting state.

When checking the status of a Hadoop command, the JSON response object contains an additional field, job_url, whose value is the URL of the JobTracker page for this job. Detailed information about the job, such as the number of mappers and reducers, current status, counters, and so on, can be retrieved using this URL.

A command’s status can have one of the following values:

  • cancelled
  • done
  • error
  • running
  • waiting
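As an illustration, a minimal polling loop built on this endpoint might look like the following sketch. It assumes the jq tool is available for parsing the JSON response and simply waits until the command reaches a terminal state:

# Poll the command status every 10 seconds until it finishes.
while true; do
  STATUS=$(curl -s -X GET -H "X-AUTH-TOKEN: $AUTH_TOKEN" \
    "https://gcp.qubole.com/api/v1.2/commands/${QUERYID}" | jq -r '.status')
  echo "Current status: $STATUS"
  case "$STATUS" in
    done|error|cancelled) break ;;   # terminal states
  esac
  sleep 10
done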
View Command Error Logs
GET /api/v1.2/commands/<Command-ID>/error_logs

Use this API to view a failed command’s error logs. This API is currently supported only for Presto queries and Spark Scala/Spark Command-Line commands.

Required Role

The following users can make this API call:

  • Users who belong to the system-admin or system-user group.
  • Users who belong to a group associated with a role that allows viewing commands’ logs. See Managing Groups and Managing Roles for more information.
Request API Syntax
curl -X GET -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" \
"https://gcp.qubole.com/api/v1.2/commands/<Command-ID>/error_logs"
Sample API Request

Here is a sample request of a failed Spark Scala command with 1200 as its ID.

curl -X GET -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" \
"https://gcp.qubole.com/api/v1.2/commands/1200/error_logs"

Here is the sample response.

{
  "error_log": "org.apache.spark.sql.catalyst.analysis.NoSuchTableException: Table or view 'products_orc_202' not found in database 'default';"
}
View Command Status with Results
GET /api/v1.2/commands/<command-ID>/status_with_results

Use this API to view the command status and its result.

For a running command, the response returns status and progress fields. For a completed command, the response returns results and path fields.

The response contains either the results inline (TAB separated) or a cloud storage directory location that contains the actual result files. The response returns the packaged, inline results when the command output is smaller than 20 MB and consists of fewer than 700 files.

Required Role

The following users can make this API call:

  • Users who belong to the system-admin or system-user group.
  • Users who belong to a group associated with a role that allows viewing command status. See Managing Groups and Managing Roles for more information.
Parameters

Note

Parameters marked in bold below are mandatory. Others are optional and have default values. Presto is not currently supported on all Cloud platforms; see QDS Components: Supported Versions and Cloud Platforms.

Parameter Description
raw By default, it is set to false. Set it to true to see the result as is, without converting the ^A delimiters into tabs. This works well for Presto query results. However, for a Hive command with fewer than 1000 result rows, the delimiters are still converted to tabs. (A sketch that uses this parameter appears after the sample responses below.)
Request API Syntax

Here is the request API syntax.

curl -i -X GET -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" \
"https://gcp.qubole.com/api/v1.2/commands/${QUERYID}/status_with_results"
Sample API Requests

Here is a sample request to view results of a command with 546935 as its ID.

curl -i -X GET -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" \
"https://gcp.qubole.com/api/v1.2/commands/546935/status_with_results"
Response
{
  "id": "546935",
  "status": "done",
  "progress": 100,
  "error": null,
  "inline": true,
  "results": "default_qubole_airline_origin_destination\r\ndefault_qubole_memetracker\r\n",
  "qlog": "{\"QBOL-QUERY-SCHEMA\":{\"-1\":[{\"ColumnType\":\"string\",\"ColumnName\":\"tab_name\"}]}}",
  "path": "/tmp/2016-04-21/4483/546935"
}

Here is a sample request to view results of a command with 573178 as its ID.

curl -i -X GET -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" \
"https://gcp.qubole.com/api/v1.2/commands/573178/status_with_results"
Response
{
  "id": "573178",
  "status": "waiting",
  "progress": 0
}
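The raw parameter described above can be appended to this endpoint in the same way as for the results API; here is a sketch reusing the first command ID:

curl -i -X GET -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" \
"https://gcp.qubole.com/api/v1.2/commands/546935/status_with_results?raw=true"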

Custom Metastore API

Qubole supports connecting to a custom metastore through the Explore UI if you do not want to use the Qubole Hive metastore. For more information, see Connecting to a Custom Hive Metastore. If any custom metastore is connected to QDS, you can view it through the REST API that is given below.

Note

This feature is not enabled by default. To enable it, create a ticket with Qubole Support.

Connect to a Custom Metastore
POST /api/v1.2/custom_metastores

Use this API to connect a custom metastore to QDS. Currently, QDS supports connecting only to MySQL metastores. For information on how to connect to a metastore through the QDS UI, see Connecting to a Custom Hive Metastore.

Note

This feature is not enabled by default. To enable it, create a ticket with Qubole Support.

Required Role

The following users can make this API call:

  • Users who belong to the system-user or system-admin group.
  • Users who belong to a group associated with a role that allows submitting a command. See Managing Groups and Managing Roles for more information.
Parameters

Note

Parameters marked in bold below are mandatory. Others are optional and have default values.

Parameter Description
db_name It is the name of the metastore to be connected to QDS.
db_host It is the IP address or hostname of the metastore.
db_user It is the user name to log in to the metastore.
db_passwd It is the password to log in to the metastore.
db_port It is the TCP port to connect on. If it is not specified, then the default port 3306 is used.
qubole_managed It is set to false by default, as you cannot use the Qubole-managed Hive metastore and a custom metastore at the same time.
enable_cluster_access Set it to true if the cluster has direct access to the custom metastore.
use_bastion_node Set this parameter to true when the metastore that you want to connect to is within a VPC and is accessed through the bastion node. It is set to false by default.
bastion_node_public_dns It is the public DNS address of the bastion node through which you can connect to the metastore. It becomes mandatory when use_bastion_node is set to true.
bastion_node_user It is the user ID that is used to log in to the bastion node through QDS. It becomes mandatory when use_bastion_node is set to true.
bastion_node_private_key It is the private key of the bastion node. It becomes mandatory when use_bastion_node is set to true.
Request API Syntax

Here is the API syntax for the REST API call to connect to a metastore that is not within a VPC.

curl -X POST -H "X-AUTH-TOKEN:<AUTH-TOKEN>" -H "Content-Type: application/json" -H "Accept: application/json" \
-d '{"db_name" : "<Metastore Name>", "db_host": "<n.n.n.n/hostname>", "db_port":"<Port>", "db_user":"user_name", "db_passwd":"<Password>"}' \
"https://gcp.qubole.com/api/v1.2/custom_metastores/"

Here is the API syntax where you want to connect to a metastore that is in a VPC through the Bastion Node.

curl -X POST -H "X-AUTH-TOKEN:<AUTH-TOKEN>" -H "Content-Type: application/json" -H "Accept: application/json" \
-d '{"db_name" : "<Metastore Name>", "db_host": "<n.n.n.n/hostname>", "db_port":"<Port>", "db_user":"user_name", "db_passwd":"<Password>",
    "use_bastion_node":"true", "bastion_node_public_dns" : "<bastion node Public DNS>", "bastion_node_user" : "<bastion node user name>",
    "bastion_node_private_key" : "<Bastion Node Private Key>"}' \ "https://gcp.qubole.com/api/v1.2/custom_metastores/"
Sample Requests

Here is a sample API call to connect to a custom metastore that is not in a VPC.

curl -X POST -H "X-AUTH-TOKEN:<AUTH-TOKEN>" -H "Content-Type: application/json" -H "Accept: application/json" \
-d '{"db_name" : "HiveMetaStoreA", "db_host": "10.10.10.1", "db_user":"EC2-User", "db_passwd":"Met@1_P@ss"}' \
"https://gcp.qubole.com/api/v1.2/custom_metastores/"

Here is a sample request API to connect to a metastore that is within a VPC.

curl -X POST -H "X-AUTH-TOKEN:<AUTH-TOKEN>" -H "Content-Type: application/json" -H "Accept: application/json" \
-d '{"db_name" : "HiveMetaStoreA", "db_host": "10.10.10.1", "db_user":"EC2-User", "db_passwd":"Met@1_P@ss",
    "use_bastion_node":"true", "bastion_node_public_dns" : "10.10.10.8", "bastion_node_user" : "BastionAdmin",
    "bastion_node_private_key" : "<Bastion Node Private Key>"}' \
    "https://gcp.qubole.com/api/v1.2/custom_metastores/"
View the Custom Metastores
GET /api/v1.2/custom_metastores

Use this API to see the custom metastores that are connected to QDS.

For information on how to connect/edit to a metastore through the QDS UI, see Connecting to a Custom Hive Metastore.

Note

This feature is not enabled by default. To enable it, create a ticket with Qubole Support.

Required Role

The following users can make this API call:

  • Users who belong to the system-user or system-admin group.
  • Users who belong to a group associated with a role that allows submitting a command. See Managing Groups and Managing Roles for more information.
Parameters

None

Request API Syntax

Here is the syntax of the REST API call; since there are no parameters in the data payload, it also serves as an example.

curl -X GET -H "X-AUTH-TOKEN:<AUTH-TOKEN>" -H "Content-Type: application/json" -H "Accept: application/json" \
"https://gcp.qubole.com/api/v1.2/custom_metastores/"

Dashboard API

This section explains how to create a dashboard and list dashboards by using the REST APIs. For more information on the Dashboards UI, see Dashboards.

Create a Dashboard
POST /api/v1.2/notebook_dashboard

Use this API to create a dashboard. To know how to create a dashboard using the Notebooks UI, see Publishing Dashboards.

Required Role

The following users can make this API call:

  • Users who belong to the system-admin or system-user group.
  • Users who belong to a group associated with a role that allows submitting a command. See Managing Groups and Managing Roles for more information.
Parameters

Note

Parameters marked in bold below are mandatory. Others are optional and have default values.

Parameter Description
name Name of the dashboard.
note_id Id of the notebook that has to be associated with the dashboard.
is_scheduled Specifies if the dashboard has to be scheduled. Possible values are true or false.
frequency Specifies how often the schedule should run. Input is an integer.
time_unit Denotes the time unit for the frequency. Possible values are months, weeks, days, hours or minutes.
location Location of the folder. Users/current_user_email_id is the default location.
Request API Syntax
curl -X POST -H 'X-AUTH-TOKEN: <AUTH TOKEN>' -H 'Content-Type: application/json' -H 'Accept: application/json' \
-d '{"note_id":<notebook-id>, "is_scheduled":"<true or false>", "frequency": <number>, "time_unit":"<months, weeks, days, hours or minutes>",
 "name":"Name", "location":"<folder-location>"}' \
"https://gcp.qubole.com/api/v1.2/notebook_dashboard"
Sample API Request
curl -X POST -H 'X-AUTH-TOKEN: <AUTH TOKEN>' -H 'Content-Type: application/json' -H 'Accept: application/json' \
-d '{"note_id":97480, "is_scheduled":"true", "frequency": 1440, "time_unit":"minutes", "name":"Sample", "location":"Users/email"}' \
"https://gcp.qubole.com/api/v1.2/notebook_dashboard"
List Dashboards
GET /api/v1.2/notebooks/<note_id>/list_dashboards

Use this API to list dashboards for a notebook. To view all the dashboards associated with the notebook using the Notebooks UI, see Viewing All Dashboards.

Required Role

The following users can make this API call:

  • Users who belong to the system-admin or system-user group.
  • Users who belong to a group associated with a role that allows submitting a command. See Managing Groups and Managing Roles for more information.
Parameters

Note

Parameters marked in bold below are mandatory. Others are optional and have default values.

Parameter Description
note_id Id of the notebook for which the dashboards have to be listed.
Request API Syntax
curl -X GET -H 'X-AUTH-TOKEN: <AUTH TOKEN>' -H 'Content-Type: application/json' -H 'Accept: application/json' \
"https://gcp.qubole.com/api/v1.2/notebooks/<note_id>/list_dashboards"
Sample API Request
curl -X GET -H 'X-AUTH-TOKEN: <AUTH TOKEN>' -H 'Content-Type: application/json' -H 'Accept: application/json' \
"https://gcp.qubole.com/api/v1.2/notebooks/97480/list_dashboards"

DbTap API

It is often useful to import or export data to and from data stores other than Cloud storage. For example, you may want to run a periodic Hive query to summarize a dataset and export the results to a MySQL database; or you may want to import data from a relational database into Hive. You can identify an external data store for such purposes.

For more information, see Understanding a Data Store. The following APIs help you to create, edit, view, and delete data stores.

Create a DbTap
POST /api/v1.2/db_taps/

Use this API to create a data store (formerly known as DbTap). Adding a Data Store describes how to add a data store in the Explore UI page.

Resource URI db_taps/
Request Type POST
Supporting Versions v2.0
Return Value Json object representing the newly created DbTap.
Required Role

The following users can make this API call:

  • Users who belong to the system-admin group.
  • Users who belong to a group associated with a role that allows creating a DbTap. See Managing Groups and Managing Roles for more information.
Parameters

Note

Parameters marked in bold below are mandatory. Others are optional and have default values. Presto is not currently supported on all Cloud platforms; see QDS Components: Supported Versions and Cloud Platforms.

Parameter Description
catalog_name

This parameter is mandatory to make the data store accessible through Presto and Spark clusters. The catalog_name can contain lowercase letters or numerals. Its value can be NULL if you do not want to export the data source to Presto or Spark clusters. This parameter is supported only for MySQL, Postgres, Snowflake, and Redshift data stores through Presto/Spark.

Note

Apart from adding the catalog name, create a ticket with Qubole Support to enable this feature.

name Use this parameter to add a name to the data store. If you do not add a name, then Qubole adds the name as a combination of db_host and db_name.
db_name Database Name. This is the data store name that is created in the QDS.
db_host IP address or hostname of the data store.
db_user User name to login to the data store.
db_passwd Password to login to the data store.
port TCP Port to connect on. If not specified, the default port for the datastore type is used.
db_type Type of database. Valid values are mysql, vertica, mongo, postgresql, and redshift. The default value is mysql.
db_location Location of database. Valid values are us-east-1, us-west-2, ap-southeast-1, eu-west-1, and on-premise.
gateway_ip IP address or hostname of the gateway machine.
gateway_port The default port is 22. Specify a non-default port to connect to the gateway machine.
gateway_username User name to login to the gateway machine.
gateway_private_key Private key for the aforementioned user to login to the gateway machine. If you add the private key, you must add the associated public key to the bastion node as described in clusters-in-vpc.
skip_validation Set it to true to skip the data store validation when you create or edit a data store. It is set to false by default.
Note
  • Gateway parameters (gateway_ip, gateway_username, gateway_private_key) can be specified only if db_location is ‘on-premise’.
  • If you do not want to use a gateway machine to access your data store, you need not specify any of the gateway parameters.
  • Though the gateway parameters are optional, if any one of the gateway parameters is specified then all three must be specified.
  • Port 22 must be open on the gateway machine and it must have access to the data store. (A gateway-based sample payload appears after the example below.)
Example
payload:
{
  "db_name":"doc_example",
  "db_host":"localhost",
  "db_user":"doc_writer",
  "db_passwd":"",
  "db_type":"mysql",
  "db_location":"us-east-1"
}
curl -i -X POST -H "Content-Type: application/json" -H "Accept: application/json" -H "X-AUTH-TOKEN: $AUTH_TOKEN"
-d @payload https://gcp.qubole.com/api/v1.2/db_taps/
{
  "account_id": 10,
  "active": true,
  "db_user": "doc_writer",
  "user_id": 1,
  "db_passwd": "",
  "db_name": "doc_example",
  "created_at": "2013-03-15T10:02:42Z",
  "db_host": "localhost",
  "db_location":"us-east-1",
  "db_type":"mysql"
  "id": 3,
  "port": null
}

Take a note of the DbTap id (in this case 3). It is used in later examples.

export DBTAPID=3
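
For an on-premise data store reached through a gateway machine, the payload would additionally carry the gateway parameters described in the note above. The following is only a sketch; the host addresses, user name, and key placeholder are illustrative:

payload:
{
  "db_name":"onprem_example",
  "db_host":"10.0.0.12",
  "db_user":"doc_writer",
  "db_passwd":"",
  "db_type":"mysql",
  "db_location":"on-premise",
  "gateway_ip":"203.0.113.10",
  "gateway_username":"gateway_user",
  "gateway_private_key":"<contents of the private key>"
}
curl -i -X POST -H "Content-Type: application/json" -H "Accept: application/json" -H "X-AUTH-TOKEN: $AUTH_TOKEN" \
-d @payload https://gcp.qubole.com/api/v1.2/db_taps/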
Delete a DbTap
DELETE /api/v1.2/db_taps/(int: dbtap_id)/

Use this API to delete a data store. See Understanding a Data Store for more information.

Required Role

The following users can make this API call:

  • Users who belong to the system-admin group.
  • Users who belong to a group associated with a role that allows deleting a DbTap. See Managing Groups and Managing Roles for more information.
Response

The response contains a JSON object with the status of the operation.

Example
Sample Request

Goal: delete a data store having ID 123

curl -i -X DELETE -H "Content-Type: application/json" \
    -H "Accept: application/json" -H "X-AUTH-TOKEN:$X_AUTH_TOKEN"  \
    https://gcp.qubole.com/api/v1.2/db_taps/123
Sample Response
HTTP/1.1 200 OK
Content-Type: application/json; charset=utf-8

{"status":"deleted"}
Edit a DbTap
PUT /api/v1.2/db_taps/<db-tap-id>

Use this API to edit a data store.

Resource URI db_taps/id
Request Type PUT
Supporting Versions v2.0
Return Value Json object representing the updated DbTap.
Required Role

The following users can make this API call:

  • Users who belong to the system-admin group.
  • Users who belong to a group associated with a role that allows editing a DbTap. See Managing Groups and Managing Roles for more information.
Parameters

Note

Parameters marked in bold below are mandatory. Others are optional and have default values. Presto is not currently supported on all Cloud platforms; see QDS Components: Supported Versions and Cloud Platforms.

Parameter Description
catalog_name

This parameter is mandatory to make the data store accessible through Presto and Spark clusters. The catalog_name can contain lowercase letters and numerals. Its value can be NULL if you do not want to export the data source to Presto or Spark clusters. This parameter is supported only for MySQL, Postgres, Snowflake, and Redshift data stores through Presto/Spark.

Note

Apart from adding the catalog name, create a ticket with Qubole Support to enable this feature.

name Use this parameter to add a name to the data store. If you do not add a name, then Qubole adds the name as a combination of db_host and db_name.
db_name Database Name. This is the data store name that is created in the QDS.
db_host IP address or hostname of the data store.
db_user User name to login to the data store.
db_passwd Password to login to the data store.
port TCP Port to connect on.
db_type Type of database. Valid values are mysql, vertica, mongo, postgresql, and redshift. The default value is mysql.
db_location Location of database. Valid values are us-east-1, us-west-2, ap-southeast-1, eu-west-1, and on-premise.
gateway_ip IP address or hostname of the gateway machine.
gateway_port The default port is 22. Specify a non-default port to connect to the gateway machine.
gateway_username User name to login to the gateway machine.
gateway_private_key Private key for the aforementioned user to login to the gateway machine. If you add the private key, you must add the associated public key to the bastion node as described in clusters-in-vpc.
skip_validation Set it to true to skip the data store validation when you create or edit a data store. It is set to false by default.
Example
curl -i -X PUT -H "Content-Type: application/json" -H "Accept: application/json" -H "X-AUTH-TOKEN: $AUTH_TOKEN" \
-d '{"port":3306}'  \ https://gcp.qubole.com/api/v1.2/db_taps/${DBTAPID}
Sample Response
{
  "account_id": 00000,
  "active": true,
  "db_user": "doc_writer",
  "user_id": 1,
  "db_passwd": "",
  "db_name": "doc_example",
  "created_at": "2013-03-15T10:02:42Z",
  "db_type":"mysql",
  "db_location":"us-east-1"
  "db_host": "localhost",
  "id": 3,
  "port": 3306
}
List DbTaps
GET /api/v1.2/db_taps/

Use this API to list data stores.

Resource URI db_taps
Request Type GET
Supporting Versions v2.0
Return Value JSON array of DbTaps.
Parameters

Parameter Description
page Use this parameter to specify the page number of the list of data stores to retrieve.
per_page Use this parameter to specify the number of results to be retrieved per page. Its default value is 10. When its value is out of bounds, the error Page number: <page> is out of bounds. is displayed. (A paginated request sketch appears after the sample response below.)
Required Role

The following users can make this API call:

  • Users who belong to the system-admin or system-user group.
  • Users who belong to a group associated with a role that allows viewing all DbTaps. See Managing Groups and Managing Roles for more information.
Example
curl -i -X GET -H "Content-Type: application/json" -H "Accept: application/json" -H "X-AUTH-TOKEN: $AUTH_TOKEN" \
https://gcp.qubole.com/api/v1.2/db_taps/
Sample Response
{
  "paging_info": {
    "next_page": null,
    "per_page": 10,
    "previous_page": null
  },
  "db_taps": [
    {
      "active": true,
      "account_id": 10,
      "user_id": 1,
      "db_user": "root",
      "db_passwd": "",
      "db_location":"us-east-1",
      "db_type":"mysql",
      "db_name": "jenkins",
      "created_at": "2012-12-27T17:20:32Z",
      "db_host": "localhost",
      "port": null,
      "id": 1
    },
    {
      "active": true,
      "account_id": 10,
      "user_id": 1,
      "db_user": "doc_writer",
      "db_passwd": "",
      "db_name": "doc_example",
      "created_at": "2013-03-15T10:02:42Z",
      "db_location":"us-east-1",
      "db_type":"postgresql",
      "db_host": "localhost",
      "port": null,
      "id": 3
    }
  ]
}
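
Using the page and per_page parameters described above, a paginated variant of the same request might look like the following sketch (the values 2 and 5 are illustrative):

curl -i -X GET -H "Content-Type: application/json" -H "Accept: application/json" -H "X-AUTH-TOKEN: $AUTH_TOKEN" \
"https://gcp.qubole.com/api/v1.2/db_taps/?page=2&per_page=5"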

Note

The DbTaps.py SDK calls the above mentioned API internally.

List Tables in a DbTap
GET /api/v1.2/db_taps/<db-tap-id>/tables

Use this API to get a list of all tables in a data store.

Resource URI db_taps/id/tables
Request Type GET
Supporting Versions v2.0
Return Value JSON Array of tables in a DbTap.
Required Role

The following users can make this API call:

  • Users who belong to the system-admin or system-user group.
  • Users who belong to a group associated with a role that allows viewing a specific DbTap’s details. See Managing Groups and Managing Roles for more information.
Sample Request
curl -i -X GET -H "Content-Type: application/json" -H "Accept: application/json" -H "X-AUTH-TOKEN: $AUTH_TOKEN" \
https://gcp.qubole.com/api/v1.2/db_taps/${DBTAPID}/tables
Sample Response
["CUSTOMER","LINEITEM","NATION","ORDERS","PART","PARTSUPP","REGION","SUPPLIER"]
View a DbTap
GET /api/v1.2/db_taps/<db-tap-id>

Use this API to view a data store.

Resource URI db_taps/id
Request Type GET
Supporting Versions v2.0
Return Value Json object representing the DbTap.
Required Role

The following users can make this API call:

  • Users who belong to the system-admin or system-user group.
  • Users who belong to a group associated with a role that allows viewing a specific DbTap’s details. See Managing Groups and Managing Roles for more information.
Example
curl -i -X GET -H "Content-Type: application/json" -H "Accept: application/json" -H "X-AUTH-TOKEN: $AUTH_TOKEN" \
https://gcp.qubole.com/api/v1.2/db_taps/${DBTAPID}
Sample Response
{
  "account_id": 00000,
  "active": true,
  "db_user": "doc_writer",
  "user_id": 1,
  "db_location":"us-east-1",
  "db_type":"mysql",
  "db_passwd": "",
  "db_name": "doc_example",
  "created_at": "2013-03-15T10:02:42Z",
  "db_host": "localhost",
  "id": 3,
  "port": null
}

Folder API

This section explains how to create, edit, move, rename, and delete a folder for notebooks and dashboards through REST API calls. For more information, see Using Folders in Notebooks and Dashboards.

Create a Folder
POST /api/v1.2/folders/

Use this API to create a notebook or dashboard folder. For more information, see Using Folders in Notebooks and Dashboards.

Required Role

The following users can make this API call:

  • Users who belong to the system-admin or system-user group.
  • Users who belong to a group associated with a role that allows submitting a command. See Managing Groups and Managing Roles for more information.
Parameters

Note

Parameters marked in bold below are mandatory. Others are optional and have default values.

Parameter Description
name It is the name of the folder. It is a string and can contain alphanumeric characters.
location It is the location of the folder. By default, it goes to Users/current_user_email_id folders. For more information on notebook folders, see Using Folders in Notebooks. The accepted folder locations are: Users/current_user_email_id, Common, and Users/other_user_email_id based on permissions. The default location is Users/current_user_email_id and it is equivalent to My Home on the Notebooks UI. You need privileges to create/edit notebooks in Common and Users/other_user_email_id. For more information, see Managing Folder-level Permissions.
type It denotes if the type is Notebooks or Dashboards. Its value must be notes for a Notebook folder and notebook_dashboards for a Dashboard folder.
Request API Syntax
curl -i -X POST -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" \
-d '{"name":"<Name>", "location":"<Location>", "type": "<notes/notebook_dashboards>"}' \ "https://gcp.qubole.com/api/v1.2/folders"
Sample API Requests

Here is an example to create a notebook folder.

curl -i -X POST -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" \
-d '{"name":"Folder1", "location":"Users/user@qubole.com/Notebooks", "type": "notes"}' \ "https://gcp.qubole.com/api/v1.2/folders"

Here is an example to create a Dashboard folder.

curl -i -X POST -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" \
-d '{"name":"Folder1", "location":"Users/user@qubole.com/DashboardUser", "type": "notebook_dashboards"}' \ "https://gcp.qubole.com/api/v1.2/folders"
Rename a Folder
PUT /api/v1.2/folders/rename

Use this API to rename a Notebook/Dashboard folder.

Required Role

The following users can make this API call:

  • Users who belong to the system-admin or system-user group.
  • Users who belong to a group associated with a role that allows submitting a command. See Managing Groups and Managing Roles for more information.
Parameters

Note

Parameters marked in bold below are mandatory. Others are optional and have default values.

Parameter Description
name It is the new name to the folder.
folder_id It is the ID of the folder that you want to rename.
location It is the current location of the folder.
type It denotes if the type is Notebooks or Dashboards. Its value must be notes for a Notebook folder and notebook_dashboards for a Dashboard folder.
Request API Syntax
curl -i -X PUT -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" \
-d '{"name":"<New Folder Name>","folder_id":"<folder_id>","location":"<Folder Location>","type":"<notes/notebook_dashboards>"}' \
 "https://gcp.qubole.com/api/v1.2/folders/rename"
Sample API Requests

Here is a sample API call to rename the Spark1 Notebook folder.

curl -i -X PUT -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" \
-d '{"name":"SparkManage","folder_id":"12","location":"Users/user1@qubole.com/Spark1","type":"notes"}' \
"https://gcp.qubole.com/api/v1.2/folders/rename"

Here is a sample API call to rename the Sparkuser Dashboard folder.

curl -i -X PUT -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" \
-d '{"name":"SparkDashuser","folder_id":"14","location":"Users/user1@qubole.com/Sparkuser","type":"notebook_dashboards"}' \
"https://gcp.qubole.com/api/v1.2/folders/rename"
Move and Delete a Folder

You can move and delete a Notebook/Dashboard folder.

Move a Folder
POST /api/v1.2/folders/move

Use this API to move the folder to a different location.

Required Role

The following users can make this API call:

  • Users who belong to the system-admin or system-user group.
  • Users who belong to a group associated with a role that allows submitting a command. See Managing Groups and Managing Roles for more information.
Parameters

Note

Parameters marked in bold below are mandatory. Others are optional and have default values.

Parameter Description
destination_location It is the location to which you want to move this folder; you must have write/manage permission on the destination location.
location It is the current location of the folder.
type It denotes if the type is Notebooks or Dashboards. Its value must be notes for a Notebook folder and notebook_dashboards for a Dashboard folder.
Request API Syntax
curl -i -X POST -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" \
-d '{"destination_location":"<Destined Folder>", "location":"<Current Folder Location>", "type":"<notes/notebook_dashboards>"}' \
"https://gcp.qubole.com/api/v1.2/folders/move"
Sample API Request

Here is a sample API call to move the Notebook folder to a new destination.

curl -i -X POST -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" \
-d '{"destination_location":"Users/user1@qubole.com/Spark1", "location":"Users/user2@qubole.com/Spark1", "type":"notes"}' \
"https://gcp.qubole.com/api/v1.2/folders/move"
Delete a Folder
POST /api/v1.2/folders/delete
Required Role

The following users can make this API call:

  • Users who belong to the system-admin or system-user group.
  • Users who belong to a group associated with a role that allows submitting a command. See Managing Groups and Managing Roles for more information.
Parameters

Note

Parameters marked in bold below are mandatory. Others are optional and have default values.

Parameter Description
folder_id It is the ID of the folder that you want to delete.
location It is the current location of the folder.
type It denotes if the type is Notebooks or Dashboards. Its value must be notes for a Notebook folder and notebook_dashboards for a Dashboard folder.
Request API Syntax
curl -i -X POST -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" \
-d '{"folder_id":"<folder_id>","location":"<Folder Location>","type":"<notes/notebook_dashboards>"}' \
"https://gcp.qubole.com/api/v1.2/folders/delete"
Sample API Request

Here is a sample API call to delete the Notebook folder that has 12 as its ID.

curl -i -X POST -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" \
-d '{"folder_id":"12","location":"Users/user1@qubole.com/Spark1","type":"notes"}' \
"https://gcp.qubole.com/api/v1.2/folders/delete"
Set a Folder Policy
PUT /api/v1.2/folders/policy

Use this API to set access permissions for a Notebook/Dashboard folder. For more information, see Using Folders in Notebooks.

Required Role

The following users can make this API call:

  • Users who belong to the system-admin or system-user group, or users who have manage permission on the folder.
  • Users who belong to a group associated with a role that allows submitting a command. See Managing Groups and Managing Roles for more information.
Parameters

Note

Parameters marked in bold are mandatory. Others are optional and have default values.

Parameter Description
location It is the current location of the folder.
source_type For a folder, the source type is Folder.
type It denotes if the type is Notebooks or Dashboards. Its value must be notes for a Notebook folder and notebook_dashboards for a Dashboard folder.
policy

Array of policies to be assigned to the folder. Each policy includes the following parameters:

Note

Escape the quotes in the policy element names and their values, except for the user ID and group ID values.

  • action: Name of the action on the particular resource. The actions can be read, write, manage, or all. The actions imply the following:
    • read: This action is set to allow/deny a user/group to view the folder.
    • write: This action is set to allow/deny a user/group to edit, rename, or move the folder.
    • manage: This action is set to allow/deny a user/group to manage the folder permissions.
    • all: This action is set to allow/deny a user/group to read, edit, or delete the folder. It always gets the lowest priority; that is, the read, write, and manage actions take precedence over the all action.
  • access: It is set to allow or deny the actions. Its value is either allow or deny.
  • condition: Array of user IDs and group IDs to which this policy has to be assigned:
    • qbol_users: An array of IDs of the users for whom this policy needs to be set. These IDs are not the same as the account user IDs; create a ticket with Qubole Support to get them.
    • qbol_groups: An array of IDs of the groups for whom this policy needs to be set.
Request API Syntax
curl -X PUT -H "X-AUTH-TOKEN: <API-TOKEN>" -H "Content-Type: application/json" -H "Accept: application/json" \
-d '{"location":"<FolderLocation>", "type": "<notes/notebook_dashboards>", "source_type":"Folder",
      "policy": "[{\"access\":\"<Access>\",\"condition\":{\"qbol_users\":[<User ID>]},\"action\":[\"<Actions>\"]},
                 \"action\":[\"<Actions>\"]}, {\"access\":\"<Access>\",
                 \"condition\":{\"qbol_groups\":[<Group ID>] \"action\":[\"<Actions>\"]}, {\"access\":\"<Access>\"}]"}' \
"https://gcp.qubole.com/api/v1.2/folders/policy"
Sample API Requests

Here is a sample API call to assign permissions to the SparkNotes Notebook folder.

curl -i -X PUT -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" \
-d '{"name":"SparkNotes", "type":"notes", "location":"Users/user1@qubole.com/SparkNotes",
"source_type":"Folder","policy":"[{\"access\":\"allow\", \"condition\":{\"qbol_users\":[12902]},\"action\":[\"read\",\"write\"]},
{\"condition\":{\"qbol_groups\":[129]},\"access\":\"deny\",\"action\":[\"all\"]}]"}' \
"https://gcp.qubole.com/api/v1.2/folders/policy"

Here is a sample API call to assign permissions to SparkStatus Dashboard folder.

curl -i -X PUT -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" \
-d '{"name":"SparkStatus", "type":"notebook_dashboards", "location":"Users/user1@qubole.com/SparkStatus","source_type":"Folder","
policy":"[{\"access\":\"allow\",\"condition\":  {\"qbol_users\":[12902]},\"action\":[\"read\",\"write\"]},
{\"condition\":{\"qbol_groups\":[129]},\"access\":\"deny\",\"action\":[\"all\"]}]"}' \
"https://gcp.qubole.com/api/v1.2/folders/policy"


View a Folder Policy
GET /api/v1.2/folders/policy

Use this API to see the permissions of a Notebook/Dashboard folder.

Required Role

The following users can make this API call:

  • Users who belong to the system-admin or system-user group.
  • Users who belong to a group associated with a role that allows submitting a command. See Managing Groups and Managing Roles for more information.
Parameters

Note

Parameters marked in bold are mandatory. Others are optional and have default values.

Parameter Description
location It is the current location of the folder.
source_type For a folder, the source type is Folder.
type It denotes if the type is Notebooks or Dashboards. Its value must be notes for a Notebook folder and notebook_dashboards for a Dashboard folder.
Request API Syntax
curl -X GET -H "X-AUTH-TOKEN: <API-TOKEN>" -H "Content-Type: application/json" -H "Accept: application/json" \
-d '{"location":"<FolderLocation>", "type": "<notes/notebook_dashboards>", "source_type":"Folder"}' \
"https://gcp.qubole.com/api/v1.2/folders/policy"
Sample API Requests

Here is a sample API call to view permissions of the SparkNotes Notebook folder.

curl -i -X GET -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" \
-d '{"name":"SparkNotes", "type":"notes", "location":"Users/user1@qubole.com/SparkNotes", "source_type":"Folder"}' \
"https://gcp.qubole.com/api/v1.2/folders/policy"

Here is a sample API call to view permissions of the SparkStatus Dashboard folder.

curl -i -X GET -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" \
-d '{"name":"SparkStatus", "type":"notebook_dashboards", "location":"Users/user1@qubole.com/SparkStatus",
     "source_type":"Folder"}' \
"https://gcp.qubole.com/api/v1.2/folders/policy

Groups API

Create a Group
POST /api/v1.2/groups

This API is used to create a group on QDS.

Required Role

The following users can make this API call:

  • Users who belong to the system-admin or system-user group.
  • Users who belong to a group associated with a role that allows creating a group. See Managing Groups and Managing Roles for more information.
Parameters

Note

Parameters marked in bold are mandatory. Others are optional and have default values.

Parameter Description
name Name of the group that must be unique in the Qubole account.
members An array of Qubole users’ email addresses, who are already members of the Qubole account.
roles An array of Qubole role IDs or role names. Once a Qubole group is created, roles are attached to the group.
Request API Syntax

Here is the request API syntax.

curl -X POST -H "X-AUTH-TOKEN:  $AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" \
-d '{"name":"<group-name>","members":"<member1,member2,..","roles":"<role1>,<role2>,..."}' \
"https://gcp.qubole.com/api/v1.2/groups"
Sample Request

Here is a sample request.

curl -X POST -H "X-AUTH-TOKEN:  $AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" \
-d '{"name":"my_group_name","members":"71,72","roles":"1,2"}' \ "https://gcp.qubole.com/api/v1.2/groups"
Sample Response
Success
{"status":"done"}
Add Users to an Existing Group
PUT /api/v1.2/groups/<qubole-group-id>/qbol_users/<user-email-address>/add

This API is used to add users to an existing group on QDS.

Required Role

The following users can make this API call:

  • Users who belong to the system-admin or system-user group.
  • Users who belong to a group associated with a role that allows adding users to a group. See Managing Groups and Managing Roles for more information.
Parameters

Note

Parameters marked in bold are mandatory. Others are optional and have default values.

Parameter Description
<qubole_group_id> ID/name of the Qubole group
<user-email-address> An array of Qubole users’ email addresses, who are already members of a Qubole account.
Request Syntax

Here is the request API syntax.

curl -X PUT -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" \
-d '{}' \ "https://gcp.qubole.com/api/v1.2/groups/my_group_name/qbol_users/<<userID1,userID4,..., userIDN>/<user1-emailaddress1,
         useremailaddress2,....,useremailaddressN>>/add"
Sample Request

Here is a sample request.

curl -X PUT -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" \
-d '{}' \ "https://gcp.qubole.com/api/v1.2/groups/Users/qbol_users/71,72/add"
Sample Response
Success
{"status":"done"}
Assign and Unassign Group Roles

A role can be assigned to a specific group to perform certain functions, such as administrator and modifier. Similarly, a role can be unassigned from a group if the group does not require that role's function.

Assign a Role to a Group
PUT /api/v1.2/groups/<qbol_group_id>/roles/<role-id/name>/assign

This API is used to assign a specific role to a group.

Required Role

The following users can make this API call:

  • Users who belong to the system-admin or system-user group.
  • Users who belong to a group associated with a role that allows assigning roles to a group. See Managing Groups and Managing Roles for more information.
Parameters

Note

Parameters marked in bold are mandatory. Others are optional and have default values.

Parameter Description
<qubole_group_id> Qubole group ID to which a role is to be assigned
<role-id/name> ID or name of the role that is to be assigned to a group
Request API Syntax

Here is the syntax of the Request API.

curl -X PUT -H "X-AUTH-TOKEN:  $AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" \
-d '{}' \ "https://gcp.qubole.com/api/v1.2/groups/<qubole-group-id>/roles/<role-id/name>/assign"
Sample Request
curl -X PUT -H "X-AUTH-TOKEN:  $AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" \
-d '{}' \ "https://gcp.qubole.com/api/v1.2/groups/105/roles/19/assign"
Sample Response
Success
{"status":"done"}
Unassign a Role from a Group

When an assigned role is not required by a group, remove it. To remove an assigned role from a group, use this API:

PUT /api/v1.2/groups/<qubole_group_id>/roles/<role-id/name>/unassign
Required Role

The following users can make this API call:

  • Users who belong to the system-admin or system-user group.
  • Users who belong to a group associated with a role that allows unassigning roles from a group. See Managing Groups and Managing Roles for more information.
Parameters

Note

Parameters marked in bold are mandatory. Others are optional and have default values.

Parameter Description
<qubole_group_id> Qubole group ID from which a specific role must be removed
<role-id/name> ID/name of the role that must be unassigned from a group
Request API Syntax

Here is the syntax of the Request API.

curl -X PUT -H "X-AUTH-TOKEN:  $AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" \
-d '{}' \ "https://gcp.qubole.com/api/v1.2/groups/<qubole-group-id>/roles/<role-id/name>/unassign"
Sample Request

Here is a sample request.

curl -X PUT -H "X-AUTH-TOKEN:  $AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" \
-d '{}' \ "https://gcp.qubole.com/api/v1.2/groups/105/roles/19/unassign"
Sample Response
Success
{"status":"done"}
List Roles Mapped to a Group
GET /api/v1.2/groups/<group-id/name>/roles

This API is used to list all roles mapped to a given Qubole group.

Required Role

The following users can make this API call:

  • Users who belong to the system-admin or system-user group.
  • Users who belong to a group associated with a role that allows viewing roles assigned to a group. See Managing Groups and Managing Roles for more information.
Parameters

Note

Parameters marked in bold are mandatory. Others are optional and have default values.

Parameter Description
group-id The group's ID or name whose mapped roles are to be listed.
Request API Syntax

Here is the syntax of the API request.

curl -X GET -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" \
-d '{}' \ "https://gcp.qubole.com/api/v1.2/groups/<group-id>/roles"
Sample API Request
curl -X GET -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" \
-d '{}' \ "https://gcp.qubole.com/api/v1.2/groups/24/roles"
List Users in an Existing Group
GET /api/v1.2/groups/<group-id/name>/qbol_users

This API is used to list all users in a given Qubole group.

Required Role

The following users can make this API call:

  • Users who belong to the system-admin or system-user group.
  • Users who belong to a group associated with a role that allows viewing users in a specific group. See Managing Groups and Managing Roles for more information.
Parameters

Note

Parameters marked in bold are mandatory. Others are optional and have default values.

Parameter Description
<group-id/name> Qubole group’s ID or name
Request API Syntax
curl -X GET -H "X-AUTH-TOKEN:  $AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" \
-d '{}' \ "https://gcp.qubole.com/api/v1.2/groups/<group-id/name>/qbol_users"
Sample API Request
curl -X GET -H "X-AUTH-TOKEN:  $AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" \
-d '{}' \ "https://gcp.qubole.com/api/v1.2/groups/20/qbol_users"
Delete a Group
DELETE /api/v1.2/groups/<group-id>

This API is used to delete an existing group on QDS.

Required Role

The following users can make this API call:

  • Users who belong to the system-admin or system-user group.
  • Users who belong to a group associated with a role that allows deleting a group. See Managing Groups and Managing Roles for more information.
Parameters

Note

Parameters marked in bold are mandatory. Others are optional and have default values.

Parameter Description
<group-id> ID/name of the group that needs to be deleted
Request API Syntax

Here is the request API syntax.

curl -X DELETE -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" \
-d '{}' \ "https://gcp.qubole.com/api/v1.2/groups/<group-id>"
Sample Request

Here is a sample request.

curl -X DELETE -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" \
-d '{}' \ "https://gcp.qubole.com/api/v1.2/groups/2192"
Sample Response
Success
"status":"success"
Delete Users from an Existing Group
PUT /api/v1.2/groups/<qubole-group-id/name>/qbol_users/<user-email-address>/remove

This API is used to delete users in an existing group on QDS.

Required Role

The following users can make this API call:

  • Users who belong to the system-admin or system-user group.
  • Users who belong to a group associated with a role that allows deleting users from a group. See Managing Groups and Managing Roles for more information.
Parameters

Note

Parameters marked in bold are mandatory. Others are optional and have default values.

Parameter Description
<qubole_group_id/name> ID or name of the group from which users must be deleted.
<user-email-address> An array of the email addresses of users who belong to the Qubole group.
Request API Syntax

Here is the API request syntax.

curl -X PUT -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" \
-d '{}' \ "https://gcp.qubole.com/api/v1.2/groups/<qubole-group-id/name>/qbol_users/<<user-id>/<user-email-address>>/remove"
Sample Request

Here is a sample request.

curl -X PUT -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" \
"https://gcp.qubole.com/api/v1.2/groups/106/qbol_users/73,74/remove"

Hive Metadata API

Schema or Database
GET /api/v1.2/hive/default/

Use this API to view the Hive tables in Qubole.

Required Role

The following users can make this API call:

  • Users who belong to the system-admin group.
  • Users who belong to a group associated with a role that allows viewing Hive tables in Qubole. See Managing Groups and Managing Roles for more information.
Parameters
Parameter Description
filter A regular expression to filter the result.
describe Its value has to be true. When it is set, the result contains the columns of each table.

Returns a JSON array of tables available in Qubole.

Example

Goal

With filter.

curl -i -X GET \
-H "Accept: application/json" \
-H "Content-type: application/json" \
-H "X-AUTH-TOKEN: $AUTH_TOKEN" \
"https://gcp.qubole.com/api/v1.2/hive/default/?filter=.*qubole.*"

Sample Response

["default_qubole_airline_origin_destination","default_qubole_memetracker"]

Goal

With filter and describe.

curl -i -X GET \
-H "Accept: application/json" \
-H "Content-type: application/json" \
-H "X-AUTH-TOKEN: $AUTH_TOKEN" \
"https://gcp.qubole.com/api/v1.2/hive/default/?filter=.*qubole.*&describe=true"

Sample Response

[
  {
    "default_qubole_airline_origin_destination": {
      "columns": [
        {
          "name": "break",
          "type": "string"
        },
        {
          "name": "coupongeotype",
          "type": "string"
        },
        {
          "name": "coupons",
          "type": "string"
        },
        {
          "name": "coupontype",
          "type": "string"
        },
        {
          "name": "dest",
          "type": "string"
        },
        {
          "name": "destaptind",
          "type": "string"
        },
        {
          "name": "destcitynum",
          "type": "string"
        },
        {
          "name": "destcountry",
          "type": "string"
        },
        {
          "name": "deststate",
          "type": "string"
        },
        {
          "name": "deststatefips",
          "type": "string"
        },
        {
          "name": "deststatename",
          "type": "string"
        },
        {
          "name": "destwac",
          "type": "string"
        },
        {
          "name": "distance",
          "type": "string"
        },
        {
          "name": "distancegroup",
          "type": "string"
        },
        {
          "name": "fareclass",
          "type": "string"
        },
        {
          "name": "gateway",
          "type": "string"
        },
        {
          "name": "itingeotype",
          "type": "string"
        },
        {
          "name": "itinid",
          "type": "string"
        },
        {
          "name": "mktid",
          "type": "string"
        },
        {
          "name": "opcarrier",
          "type": "string"
        },
        {
          "name": "origin",
          "type": "string"
        },
        {
          "name": "originaptind",
          "type": "string"
        },
        {
          "name": "origincitynum",
          "type": "string"
        },
        {
          "name": "origincountry",
          "type": "string"
        },
        {
          "name": "originstate",
          "type": "string"
        },
        {
          "name": "originstatefips",
          "type": "string"
        },
        {
          "name": "originstatename",
          "type": "string"
        },
        {
          "name": "originwac",
          "type": "string"
        },
        {
          "name": "passengers",
          "type": "string"
        },
        {
          "name": "quarter",
          "type": "string"
        },
        {
          "name": "rpcarrier",
          "type": "string"
        },
        {
          "name": "seqnum",
          "type": "string"
        },
        {
          "name": "tkcarrier",
          "type": "string"
        },
        {
          "name": "year",
          "type": "string"
        }
      ]
    }
  },
  {
    "default_qubole_memetracker": {
      "columns": [
        {
          "name": "lnks",
          "type": "string"
        },
        {
          "name": "phr",
          "type": "string"
        },
        {
          "name": "site",
          "type": "string"
        },
        {
          "name": "ts",
          "type": "string"
        },
        {
          "name": "month",
          "type": "string"
        }
      ]
    }
  }
]
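
Because the response is plain JSON, it can be post-processed with standard command-line tools. For example, assuming jq is installed, the following sketch lists only the column names of each table returned by the describe call above:

curl -s -X GET \
-H "Accept: application/json" \
-H "Content-type: application/json" \
-H "X-AUTH-TOKEN: $AUTH_TOKEN" \
"https://gcp.qubole.com/api/v1.2/hive/default/?filter=.*qubole.*&describe=true" \
| jq '.[] | to_entries[] | {table: .key, columns: [.value.columns[].name]}'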
View a Hive Table Definition
GET /api/v1.2/hive/<schema_name>/<table>

Use this API to get the Hive table definition. This API also fetches the schema of AVRO tables.

Required Role

The following users can make this API call:

  • Users who belong to the system-admin or system-user group.
  • Users who belong to a group associated with a role that allows viewing a Hive table definition. See Managing Groups and Managing Roles for more information.
Example
curl -i -X GET -H "Accept: application/json" -H "Content-type: application/json" -H "X-AUTH-TOKEN: $AUTH_TOKEN" \
https://gcp.qubole.com/api/v1.2/hive/default/miniwikistats

Sample Response

[
    {
        "name": "projcode",
        "type": "string",
        "comment": null
    },
    {
        "name": "pagename",
        "type": "string",
        "comment": null
    },
    {
        "name": "pageviews",
        "type": "int",
        "comment": null
    },
    {
        "name": "bytes",
        "type": "int",
        "comment": null
    },
    {
        "name": "dt",
        "type": "string",
        "comment": null
    }
]
Store Table Properties
POST /api/v1.2/hive/schema/table

Use this API to modify the metadata of tables in the given schema of the Hive metastore.

Required Role

The following users can make this API call:

  • Users who belong to the system-admin group.
  • Users who belong to a group associated with a role that allows modifying Hive table metadata in Qubole. See Managing Groups and Managing Roles for more information.
Parameters

Note

Parameters marked in bold below are mandatory. Others are optional and have default values.

Parameter Description
interval Number representing the interval at which data is loaded.
interval_unit Unit of the interval. Valid values are minutes, hours, days, weeks and months.
columns A JSON hash mapping each partition column to its date/time format. The format must be a valid input to the strftime function. If there are no partition columns, pass an empty hash. For partition columns that are not date/time based, pass an empty string as the value.
Example

Consider a table daily_tick_data in the default Hive schema that has the following partition columns:

  1. stock_exchange
  2. stock_symbol
  3. year
  4. date
$ cat payload.json
{
 "interval": 1,
 "interval_unit": "days",
 "columns": {
     "stock_exchange": "",
     "stock_symbol": "",
     "year": "%Y",
     "date": "%Y-%m-%d"
 }
}
$ curl -i -X POST -H "Accept: application/json" \
     -H "Content-type: application/json" \
     -H "X-AUTH-TOKEN: $AUTH_TOKEN" \
     --data @payload.json \
       https://gcp.qubole.com/api/v1.2/hive/default/daily_tick_data/properties

Response

{"status":"successful"}
Get Table Properties
GET /api/v1.2/hive/<schema_name>/<table_name>/table_properties

Use this API to get the Hive table properties.

Required Role

The following users can make this API call:

  • Users who belong to the system-admin or system-user group.
  • Users who belong to a group associated with a role that allows viewing a Hive table’s properties. See Managing Groups and Managing Roles for more information.
Example
curl -i -X GET -H "Accept: application/json" -H "Content-type: application/json" -H "X-AUTH-TOKEN: $AUTH_TOKEN" \
"https://gcp.qubole.com/api/v1.2/hive/default/daily_tick_data/table_properties"

Sample Response

{
    "location": "gs://paid-qubole/default-datasets/stock_ticker",
    "owner": "ec2-user",
    "create-time": 1362388416,
    "table-type": "EXTERNAL_TABLE",
    "field.delim": ",",
    "serialization.format": ","
}
Delete Table Properties
DELETE /api/v1.2/hive/default/<table_name>/properties

Delete properties associated with a Hive table.

Required Role

The following users can make this API call:

  • Users who belong to the system-admin group.
  • Users who belong to a group associated with a role that allows deleting a Hive table’s properties. See Managing Groups and Managing Roles for more information.
Example

Goal

To delete the Hive table properties.

curl -i -X DELETE -H "Accept: application/json" \
-H "Content-type: application/json" \
-H "X-AUTH-TOKEN: $AUTH_TOKEN" \
"https://gcp.qubole.com/api/v1.2/hive/default/daily_tick_data/properties"
Response

The response will be a JSON object with success or failure information.

HTTP/1.1 200 OK
Content-Type: application/json; charset=utf-8

{"status":"successful"}
View Table Partitions and Location
GET /api/v1.2/hive/<schema_name>/<table>/partitions

Use this API to get the Hive table partitions and locations.

Required Role

The following users can make this API call:

  • Users who belong to the system-admin or system-user group.
  • Users who belong to a group associated with a role that allows viewing a Hive table definition. See Managing Groups and Managing Roles for more information.
Example
curl -X GET -H "Accept: application/json" -H "Content-type: application/json" -H "X-AUTH-TOKEN: 02a9528b98ad4b0c8abe
50fd4535960d7777g37b18514fffa41017f5dac16e2b" "https://gcp.qubole.com/api/v1.2/hive/default/cars/partitions"

Sample Response

[
     {
        "PART_NAME":"year=2013/month=1",
        "LOCATION":"gs://dev.canopydata.com/sqoop/qa/account_id/4646/warehouse/cars/year=2013/month=1"
     }
]

Notebook API

This section explains how to create, configure, clone, bind, run, and delete a notebook through REST API calls. For more information on Notebooks UI, see Notebooks. This section covers:

Create a Notebook
POST /api/v1.2/notebooks/

Use this API to create a notebook. To know how to create a notebook using the Notebooks UI, see Creating a Notebook.

Required Role

The following users can make this API call:

  • Users who belong to the system-admin or system-user group.
  • Users who belong to a group associated with a role that allows submitting a command. See Managing Groups and Managing Roles for more information.
Parameters

Note

Parameters marked in bold below are mandatory. Others are optional and have default values. Presto is not currently supported on all Cloud platforms; see QDS Components: Supported Versions and Cloud Platforms.

Parameter Description
name The name of the notebook. It is a string and can contain alphanumeric characters.
location The folder in which the notebook is created. The accepted locations are Users/current_user_email_id, Common, and Users. The default is Users/current_user_email_id, which is equivalent to My Home on the Notebooks UI. You need privileges to create/edit notebooks in Common and Users. For more information, see Using Folders in Notebooks and Managing Folder-level Permissions.
note_type The type of notebook. The values are hive, spark, and presto. Hive notebooks are a beta feature and are available by default.
cluster_id The ID of the cluster to which the notebook is assigned. You must assign the notebook to a cluster to use it, though you can leave the notebook unassigned when you create it. Assign the notebook to the corresponding type of cluster: a Spark notebook to a Spark cluster and a Presto notebook to a Presto cluster.
Request API Syntax
curl -i -X POST -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" \
-d '{"name":"<Name>", "location":"<Location>", "note_type": "<Note Type>", "cluster_id":"<Cluster ID"}' \
"https://gcp.qubole.com/api/v1.2/notebooks"
Sample API Request

Here is an example to create a Spark notebook and assign it to a cluster with its ID as 4001.

curl -i -X POST -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" \
-d '{"name":"Spark", "location":"Users/user@abc.com/Notebooks", "note_type": "spark", "cluster_id":"4001"}' \
"https://gcp.qubole.com/api/v1.2/notebooks"
Configure a Notebook
PUT /api/v1.2/notebooks/<notebook ID>

Use this API to configure/edit a notebook. You cannot change the notebook type while configuring or editing it.

You can change the cluster associated with the notebook only when the following conditions are met:

  • Notebook does not have any active command.
  • Notebook does not have any active schedules associated.
  • Notebook does not have any scheduled dashboard.

To know how to configure a notebook using the Notebooks UI, see Configuring a Notebook.

Required Role

The following users can make this API call:

  • Users who belong to the system-admin or system-user group.
  • Users who belong to a group associated with a role that allows submitting a command. See Managing Groups and Managing Roles for more information.
Parameters

Note

Parameters marked in bold below are mandatory. Others are optional and have default values.

Parameter Description
name The name of the notebook. It is a string and can contain alphanumeric characters.
location The folder in which the notebook is stored. The accepted locations are Users/current_user_email_id, Common, and Users. The default is Users/current_user_email_id, which is equivalent to My Home on the Notebooks UI. You need privileges to create/edit notebooks in Common and Users. For more information, see Using Folders in Notebooks and Managing Folder-level Permissions.
default_lang Default language of the notebook. Default value is spark. Other possible values are pyspark, sql, and r.
cluster_id The ID of the cluster to which the notebook is assigned. You must assign the notebook to a cluster to use it, though you can leave the notebook unassigned when you create it. Make sure you assign the notebook to the appropriate type of cluster: for example, assign a Spark notebook to a Spark cluster.
Request API Syntax
curl -i -X PUT -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" \
-d '{"name":"<Name>", "location":"<Location>", "cluster_id":"<Cluster ID>", "default_lang":"<language>"}' \
"https://gcp.qubole.com/api/v1.2/notebooks/<notebook ID>"
Sample API Request

Here is an example to configure a Spark notebook with its ID as 2000.

curl -i -X PUT -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" \
-d '{"name":"Spark", "location":"Users/user@abc.com/Notebooks", "cluster_id":"4001", "default_lang":"sql"}' \
"https://gcp.qubole.com/api/v1.2/notebooks/2000"
Import a Notebook
POST /api/v1.2/notebooks/import

Use this API to import a Spark notebook from a location and add it to the notebooks list in the QDS account. As a prerequisite, you must ensure that the object in cloud storage or on Github is public.

Required Role

The following users can make this API call:

  • Users who belong to the system-admin or system-user group.
  • Users who belong to a group associated with a role that allows submitting a command. See Managing Groups and Managing Roles for more information.
Parameters

Note

Parameters marked in bold below are mandatory. Others are optional and have default values. Presto is not currently supported on all Cloud platforms; see QDS Components: Supported Versions and Cloud Platforms.

Parameter Description
name The name of the notebook. It is a string and can contain alphanumeric characters.
location The folder in which the notebook is created. The accepted locations are Users/current_user_email_id, Common, and Users. The default is Users/current_user_email_id, which is equivalent to My Home on the Notebooks UI. You need privileges to create/edit notebooks in Common and Users. For more information, see Using Folders in Notebooks and Managing Folder-level Permissions.
note_type The type of notebook. The values are hive, spark, and presto. Hive notebooks are a beta feature and are available by default.
file A parameter that you specify only to import a notebook from a location on the local hard disk. The complete location path must start with @/. For example, "file":"@/home/spark....
nbaddmode It must only be used with the file parameter. Its value is import-from-computer.
url It is the cloud storage or Github location, a valid JSON URL, or an ipynb URL of the notebook that you want to import.
cluster_id It is the ID of the cluster to which the notebook is assigned. If you specify this parameter, then the notebook is imported with the attached cluster.
Request API Syntax

Syntax to use for importing a notebook that is on a cloud storage bucket.

curl -X "POST" -H "X-AUTH-TOKEN:$AUTH_TOKEN" -H "Content-Type:multipart/form-data" -H "Accept: application/json" \
 -F "name"="<Name>" -F "location"="<Location>" -F "note_type"="<Note Type>" -F "url"="<storage location in the URL
 format/valid-JSON-URL/ipynb-URL>" \
  "https://gcp.qubole.com/api/v1.2/notebooks/import"

Syntax to use for importing a notebook that is on Github.

curl -X "POST" -H "X-AUTH-TOKEN:$AUTH_TOKEN" -H "Content-Type:multipart/form-data" -H "Accept: application/json" \
 -F "name"="<Name>" -F "location"="<Location>" -F "note_type"="<Note Type>" -F "url"="<Github location in the URL
 format/valid-JSON-URL/ipynb-URL>" \
  "https://gcp.qubole.com/api/v1.2/notebooks/import"

Syntax to use for importing a notebook that is on a local hard disk.

Note

When specifying the file location on the local hard disk, you must start it with @/. For example, "file":"@/home/spark....

curl   -X "POST" -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Content-Type: multipart/form-data" -H "Accept: application/json" \
-F "name"="<Name>" -F  "location"="<Location>" -F "note_type"="<Note Type>" -F  "file"="<local hard disk location>"
-F  "nbaddmode"="import-from-computer" \
"https://gcp.qubole.com/api/v1.2/notebooks/import"
Sample API Request

Here is an example to import a Spark notebook from a storage bucket.

curl -X "POST" -H "X-AUTH-TOKEN:$AUTH_TOKEN" -H "Content-Type:multipart/form-data" -H "Accept: application/json" \
-F "name"="SparkNote" -F "location"="Users/user1@qubole.com" -F "note_type"="spark"
-F "url"="https://spk.gcp.com/notebook-samples/spark_examples" \
"https://gcp.qubole.com/api/v1.2/notebooks/import"

Here is an example to import a Spark notebook from Github using the raw Github link.

curl -X "POST" -H "X-AUTH-TOKEN:$AUTH_TOKEN" -H "Content-Type:multipart/form-data" -H "Accept: application/json" \
-F "name"="SparkNote" -F "location"="Users/user1@qubole.com" -F "note_type"="spark"
-F "url"="https://raw.githubusercontent.com/phelps-sg/python-bigdata/master/src/main/ipynb/intro-python.ipynb" \
"https://gcp.qubole.com/api/v1.2/notebooks/import"

Here is an example to import a Spark notebook from Github using the gist link.

curl -X "POST" -H "X-AUTH-TOKEN:$AUTH_TOKEN" -H "Content-Type:multipart/form-data" -H "Accept: application/json" \
-F "name"="SparkNote" -F "location"="Users/user1@qubole.com" -F "note_type"="spark"
-F "url"="https://gist.githubusercontent.com/user1/fbc05748f4c660ee656d50f1d8cdad11/raw/a6332f4b0cba2fc3cd44eac9956a2fe135744a8f/urltest2.ipynb" \
"https://gcp.qubole.com/api/v1.2/notebooks/import"

Here is an example to import a Spark notebook from the local hard disk.

curl -X "POST" -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Content-Type: multipart/form-data" -H "Accept: application/json" \
-F "name"="SparkNote1" -F  "location"="Users/user2@qubole.com" -F "note_type"="spark" -F  "file"="@/home/spark/SparkNoteb.ipynb"
-F  "nbaddmode"="import-from-computer" \
"https://gcp.qubole.com/api/v1.2/notebooks/import"
Export a Notebook
GET /api/v2/notes/<note_id>/export

Use this API to export a notebook in the JSON format.

Required Role

The following users can make this API call:

  • Users who belong to the system-admin or system-user group.
Parameters

Note

Parameters marked in bold below are mandatory. Others are optional and have default values.

Parameter Description
note_id ID of the notebook to be exported.
Request API Syntax
curl -X GET -H 'X-AUTH-TOKEN: <AUTH TOKEN>' -H 'Content-Type: application/json' -H 'Accept: application/json' \
"https://gcp.qubole.com/api/v2/notes/<note_id>/export"
Sample API Request

Here is an example to export a Spark notebook.

curl -X GET -H 'X-AUTH-TOKEN: <AUTH TOKEN>' -H 'Content-Type: application/json' -H 'Accept: application/json' \
"https://gcp.qubole.com/api/v2/notes/26962/export"
Clone a Notebook
PUT /api/v1.2/notebooks/<notebook ID>/clone

Use this API to clone a notebook. You cannot change the notebook type of the parent notebook while cloning it. To know how to clone a notebook using the Notebooks UI, see Cloning a Notebook.

Required Role

The following users can make this API call:

  • Users who belong to the system-admin or system-user group.
  • Users who belong to a group associated with a role that allows submitting a command. See Managing Groups and Managing Roles for more information.
Parameters

Note

Parameters marked in bold below are mandatory. Others are optional and have default values. Presto is not currently supported on all Cloud platforms; see QDS Components: Supported Versions and Cloud Platforms.

Parameter Description
name The name of the cloned notebook. It is a string and can contain alphanumeric characters. By default, - Clone is appended to the name; you can change the name if required.
location The folder in which the cloned notebook is created. The accepted locations are Users/current_user_email_id, Common, and Users. The default is Users/current_user_email_id, which is equivalent to My Home on the Notebooks UI. You need privileges to create/edit notebooks in Common and Users. For more information, see Using Folders in Notebooks and Managing Folder-level Permissions.
cluster_id The ID of the cluster to which the notebook is assigned. You must assign the notebook to a cluster to use it, though you can leave the notebook unassigned when you clone it. Assign the notebook to the corresponding type of cluster: a Spark notebook to a Spark cluster and a Presto notebook to a Presto cluster.
Request API Syntax
 curl -i -X POST -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" \
-d '{"name":"<Name>", "location":"<Location>", "cluster_id":"<Cluster ID"}' \
 "https://gcp.qubole.com/api/v1.2/notebooks/<notebook ID>/clone"
Sample API Request

Here is an example to clone a Spark notebook with its ID as 2000.

curl -i -X PUT -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" \
-d '{"name":"Spark", "location":"Users/user@abc.com/Notebooks", "cluster_id":"4001"}' \
"https://gcp.qubole.com/api/v1.2/notebooks/2000/clone"
Bind a Notebook to a Cluster
PUT /api/v1.2/notebooks/<notebook ID>

Use this API to assign a cluster to a notebook. To know how to bind a notebook using the Notebooks UI, see Viewing a Notebook Information or Configuring a Notebook.

Required Role

The following users can make this API call:

  • Users who belong to the system-admin or system-user group.
  • Users who belong to a group associated with a role that allows submitting a command. See Managing Groups and Managing Roles for more information.
Parameters

Note

Parameters marked in bold below are mandatory. Others are optional and have default values.

Parameter Description
cluster_id The ID of the cluster that you want to assign to the notebook.
Request API Syntax
curl -i -X PUT -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" \
-d '{"cluster_id":"<Cluster ID"}' \
 "https://gcp.qubole.com/api/v1.2/notebooks/<notebook ID>"
Sample API Request

Here is a sample request to assign a notebook with its ID as 2002 to a cluster with its ID as 4002.

curl -i -X PUT -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" \
-d '{"cluster_id":"4002"}' \
 "https://gcp.qubole.com/api/v1.2/notebooks/2002"
Run a Notebook
POST /api/v1.2/commands

Use this API to run a Spark notebook with optional parameters. Currently, this API does not support Presto notebooks. You can view the command's status and result, or cancel the command, using the same Command API calls that are used for other command types.

Note

You can run a Spark notebook only when it is associated with a Spark cluster. You must have permissions to run or schedule a notebook.

These are a few points to know about running a notebook through the API:

  • Invoking the notebook API is useful when you want to run all of a notebook's paragraphs at once.
  • If another user edits the notebook while you invoke the API to run it, the run may fail or may not return the expected result.
  • If two users simultaneously invoke the API to run the same notebook, both runs complete successfully, but the paragraphs may not run in the same order.
  • As with other command APIs, when a notebook is run through the API, you can see the Qubole logs and the App logs.

A Spark notebook can also be scheduled as a Spark command using the Scheduler API. scheduler-api describes how to create, edit, view, and list schedules.

Required Role

The following users can make this API call:

  • Users who belong to the system-admin or system-user group.
  • Users who belong to a group associated with a role that allows submitting a command. See Managing Groups and Managing Roles for more information.
Parameters

Note

Parameters marked in bold below are mandatory. Others are optional and have default values.

Parameter Description
command_type SparkCommand
note_id The ID of the notebook that you want to run.
language The language must be notebook. If not specified, it is added by default when a notebook ID is specified in the API call.
label One of the labels of the cluster that is associated with the notebook you want to run.
name A name for the command, which is useful when filtering commands in the command history. It does not accept the special characters & (ampersand), < (less than), > (greater than), " (double quote), and ' (single quote), nor HTML tags. It can contain a maximum of 255 characters.
tags Add a tag to a command so that it is easily identifiable and searchable from the commands list in the Commands History. Add a tag as a filter value while searching commands. It can contain a maximum of 255 characters. A comma-separated list of tags can be associated with a single command. While adding a tag value, enclose it in square brackets. For example, {"tags":["<tag-value>"]}.
arguments

Use it to pass parameters in the run-notebook API call. These parameters fill the notebook's dynamic forms with the given values. You can pass more than one variable. The syntax for using arguments is given below.

"arguments": {"key1":"value1", "key2":"value2", ..., "keyN":"valueN"}

Where key1, key2, … keyN are the parameters that you want to pass before you run the notebook. You can just change the corresponding values (value1, value2,…, valueN) each time you run the API call (if required).

Request API Syntax

Here is the Request API syntax for running a Spark notebook.

curl  -i -X POST -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" \
-d '{"name" : "<name_command>", "command_type":"SparkCommand", "language":"notebook", "note_id":"<Notebook_Id>",
"tag":"<tags>", "label":"<cluster-label>", "arguments": {"key1":"value1", "key2":"value2", ..., "keyN":"valueN"} }' \
"https://gcp.qubole.com/api/v1.2/commands"
Sample API Requests

Here is an example with a successful response.

curl  -i -X POST -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" \
-d '{"name" : "note_command", "command_type":"SparkCommand", "language":"notebook", "note_id":"123","tag":"notes",
     "label":"spark1"}' \
     "https://gcp.qubole.com/api/v1.2/commands"

Successful Response

{
  "id": 363,
  "path": "/tmp/2016-10-03/1/363",
  "status": "waiting",
  "created_at": "2016-10-03T07:14:39Z",
  "command_type": "SparkCommand",
  "progress": 0,
  "qbol_session_id": 69,
  "qlog": null,
  "resolved_macros": null,
  "pid": null,
  "template": "generic",
  "submit_time": 1475478879,
  "start_time": null,
  "end_time": null,
  "can_notify": false,
  "num_result_dir": 0,
  "pool": null,
  "timeout": null,
  "name": "note_command",
  "command_source": "API",
  "account_id": 1,
  "saved_query_mutable_id": null,
  "user_id": 1,
  "label": "spark1",
  "meta_data": {
    "results_resource": "commands/363/results",
    "logs_resource": "commands/363/logs"
  },
  "uid": 1,
  "perms": null,
  "command": {
    "cmdline": null,
    "language": "notebook",
    "note_id": 123,
    "program": null,
    "arguments": "",
    "user_program_arguments": null,
    "sql": null,
    "md_cmd": true,
    "app_id": null,
    "retry": 0
  },
  "instance": null
}
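
Once a run is submitted, you can use the command ID from the response (363 in the example above) with the Command API to check its status or fetch its results, as the meta_data fields indicate. A minimal sketch:

# Check the command status
curl -X GET -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Accept: application/json" \
"https://gcp.qubole.com/api/v1.2/commands/363"

# Fetch the results once the command completes
curl -X GET -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Accept: application/json" \
"https://gcp.qubole.com/api/v1.2/commands/363/results"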

Here is an example with a failed response.

curl  -i -X POST -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" \
-d '{"name" : "note_command", "command_type":"SparkCommand", "language":"notebook", "note_id":"111", "tag":"notes"}' \
    "https://gcp.qubole.com/api/v1.2/commands"

Failed Response

{
"error": {
 "error_code": 422,
 "error_message": "Command type could not be created. Errors: There is no cluster associated with notebook with Id: 111"
 }
}

Here is another example with a failed response.

curl  -i -X POST -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" \
-d  '{"name" : "note_command", "command_type":"SparkCommand", "language":"notebook", "note_id":"12321", "tag":"notes",
      "label":"spark1"}' \
      "https://gcp.qubole.com/api/v1.2/commands"

Failed Response

{
  "error": {
   "error_code": 422,
   "error_message": ""Command type could not be created. Errors: There is no spark notebook for account 54321 with Id: 3333""
 }
}

Here is a sample REST API call with optional parameters.

curl  -i -X POST -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" \
-d  '{"name" : "note_command", "command_type":"SparkCommand", "language":"notebook", "note_id":"1000", "tag":"notes",
      "label":"spark2", "arguments":{"Name":"AssetNote", "Year":"2017"}}' \
      "https://gcp.qubole.com/api/v1.2/commands"
Delete a Notebook
DELETE /api/v1.2/notebooks/<notebook ID>

Use this API to delete a notebook. To know how to delete a notebook using the Notebooks UI, see Deleting a Notebook.

Before deleting a notebook, you must ensure that the notebook does not have any active commands, associated active schedules, or scheduled dashboards.

Required Role

The following users can make this API call:

  • Users who belong to the system-admin or system-user group.
  • Users who belong to a group associated with a role that allows submitting a command. See Managing Groups and Managing Roles for more information.
Parameters

Note

No additional parameters are required in the delete a notebook API call.

Request API Syntax
curl -i -X DELETE -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" \
"https://gcp.qubole.com/api/v1.2/notebooks/<notebook ID>"
Sample API Request

Here is an example to delete a Spark notebook with its ID as 2000.

curl -i -X DELETE -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" \
"https://gcp.qubole.com/api/v1.2/notebooks/2000"

Jupyter Notebook API

This section explains how to run and schedule a Jupyter notebook through REST API calls. For more information on JupyterLab interface, see Jupyter Notebooks. This section covers:

Run a Jupyter Notebook
POST /api/v1.2/commands

Use this API to run a Jupyter notebook. You can view the command's status and result, or cancel the command, using the same Command API calls that are used for other command types.

Required Role

The following users can make this API call:

  • Users who belong to the system-user or system-admin group.
  • Users who belong to a group associated with a role that allows update on Jupyter Notebook and directory. See Managing Groups and Managing Roles for more information.
Parameters

Note

Parameters marked in bold below are mandatory. Others are optional and have default values.

Parameter Description
path Path, including the name and extension (.ipynb), of the Jupyter notebook to be run.
label Label of the cluster on which the Jupyter notebook should be run. If this parameter is not specified then label = “default” is used.
command_type Type of command to be executed. For Jupyter notebook, the command type is JupyterNotebookCommand.
arguments

Valid JSON to be sent to the notebook. Define the parameters in the notebook and pass their values in JSON format, where key is the parameter's name and value is the parameter's value.

Supported types in parameters are string, integer, float, and boolean.

Request API Syntax

Here is the Request API syntax for running a Jupyter notebook.

curl  -i -X POST -H "X-AUTH-TOKEN: <auth_token>" -H "Content-Type: application/json" -H "Accept: application/json" \
-d '{"path":"<Path>/<Name>”, "command_type":"JupyterNotebookCommand",  "label":"<ClusterLabel>", "arguments": {"key1":"value1", "key2":"value2", ..., "keyN":"valueN"}}' \
 "https://gcp.qubole.com/api/v1.2/commands"
Sample API Request
curl  -i -X POST -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" \
-d '{"path":"Users/abc@xyz.com/note1.ipynb”, "command_type":"JupyterNotebookCommand",  "label":"spark-cluster-1", "arguments": {"name":"xyz", "age":"20"}}' \
 "https://gcp.qubole.com/api/v1.2/commands"
Known Limitation

If there is a warning in one of the cells when this API is run, the notebook stops executing at that cell. As a workaround, to skip the warning and continue execution, add raises-exception in that cell’s metadata field by performing the following steps:

  1. Select the cell that shows the warning.
  2. Click on the Tools icon on the left side bar.
  3. Click Advanced Tools.
  4. Add raises-exception in the Cell Metadata tags field.
  5. Re-run the API.
Schedule a Jupyter Notebook
POST /api/v1.2/scheduler

Use this API to schedule a Jupyter notebook. You can view the command's status and result, or cancel the command, using the same Command API calls that are used for other command types.

Note

This API is not available by default. Create a ticket with Qubole Support to enable this API on your QDS account.

Required Role

The following users can make this API call:

  • Users who belong to the system-user or system-admin group.
  • Users who belong to a group associated with a role that allows update on Jupyter Notebook and directory. See Managing Groups and Managing Roles for more information.
  • Users who belong to a group associated with a role that allows create on the Jupyter Notebook command.
Parameters

Note

Parameters marked in bold below are mandatory. Others are optional and have default values.

Parameter Description
name Name for the schedule. If name is not specified, then a system-generated Schedule ID is set as the name.
label Label of the cluster on which the Jupyter notebook should be scheduled.
command_type Type of command to be executed. For Jupyter notebook, the command type is JupyterNotebookCommand.
command

JSON object that contains path (the path, including the name and extension (.ipynb), of the Jupyter notebook to be run).

retry (optional): denotes the number of retries for a job. Valid values are 1, 2, and 3.

retry_delay(optional): denotes the time interval (in minutes) between the retries when a job fails.

arguments (optional): Valid JSON to be sent to the notebook. Specify the parameters in notebooks and pass the parameter value using the JSON format. key is the parameter’s name and value is the parameter’s value. Supported types in parameters are string, integer, float, and boolean.

start_time Start datetime for the schedule. When a Cron expression is used, the scheduler calculates the Next Materialized Time (NMT)/start time from the current time (as the base time) and the Cron expression passed; the start time is not honored in that case.
end_time End datetime for the schedule.
frequency Set this option or cron_expression but do not set both options. Specify how often the schedule should run. Input is an integer. For example, frequency of one hour/day/month is represented as {"frequency":"1"}
time_unit Denotes the time unit for the frequency. Its default value is days. Accepted value is minutes, hours, days, weeks, or months.

For more information about the schedule parameters, see scheduler-api.

Request API Syntax

Here is the Request API syntax for scheduling a Jupyter notebook.

curl -i -X POST -H "X-AUTH-TOKEN: <token>" -H "Accept: application/json" -H "Content-type: application/json" -d \
 '{"command_type":"JupyterNotebookCommand", "command": {"path":"<Path>/<Name>", "retry": 2, "retry_delay": 4, "arguments": {"key1": "value1", …, "keyN": "valueN"}}, "start_time": "2019-12-26T02:00Z","end_time": "2020-07-01T02:00Z","frequency": 1,"time_unit": "days", "label": "<ClusterLabel>"}' \
 "https://gcp.qubole.com/api/v1.2/scheduler"
Sample API Request
curl -i -X POST -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Accept: application/json" -H "Content-type: application/json" \
-d  '{"command_type":"JupyterNotebookCommand", "command": {"path":"Users/abc@xyz.com/note1.ipynb", "retry": 2, "retry_delay": 4, "arguments": {"name": "abc", "age": "20"}}, "start_time": "2019-12-26T02:00Z","end_time": "2020-07-01T02:00Z","frequency": 1,"time_unit": "days", "label": "spark-cluster-1"}' \
"https://gcp.qubole.com/api/v1.2/scheduler"
Known Limitation

If there is a warning in one of the cells when a scheduled notebook runs, the notebook stops executing at that cell. As a workaround, to skip the warning and continue execution, add raises-exception in that cell’s metadata field by performing the following steps:

  1. Select the cell that shows the warning.
  2. Click on the Tools icon on the left side bar.
  3. Click Advanced Tools.
  4. Add raises-exception in the Cell Metadata tags field.
  5. Re-run the API.

Object Policy API

Qubole supports managing access control for each notebook and cluster by overriding the access granted to the object at the account level in the Control Panel. You can override the account-level access for an individual cluster or notebook of a specific account through REST API calls. For more information, see Managing Roles.

Managing access control to each object is described in these topics:

Set Object Policy for a Cluster

You can set a policy for an individual object and restrict users or groups from accessing the object. This overrides the access granted to the object at the account-level in the Control Panel. For more information, see Managing Roles.

Managing Cluster Permissions through the UI describes how to set cluster permissions through the QDS UI.

Note

Understanding the Soft-enforced Cluster Permissions lists the cluster permissions that are enforced along with a given cluster permission.

PUT /api/v1.2/object_policy/policy

Use this API to set an object policy. Qubole supports object policy API on notebooks and clusters. This section describes setting an object policy for a cluster.

Note

If you grant a permission to a user who is part of a group for which that permission is restricted, the user is allowed access, and vice versa. For more information, see Understanding the Precedence of Cluster Permissions.

Required Role

The following users can make this API call:

  • Users who belong to the system-admin group or owner of the object.
  • Users who belong to a group associated with a role that allows setting an object policy. See Managing Groups and Managing Roles for more information.
Parameters

Note

Parameters marked in bold are mandatory. Others are optional and have default values.

Parameter Description
source_id It specifies the ID of the object based on the source_type.
source_type It specifies the object. It must be a cluster for a cluster.
policy

Array of policies to be assigned to a cluster. Each policy includes the following parameters:

Note

Because the policy is passed as a JSON-encoded string, escape the quotation marks around the policy keys and values, except for the user ID and group ID values, which are numeric.

  • action: Name of the action on the resource. The actions can be read, update, delete, manage, or all. The actions are as follows:
    • read: Allows or denies a user/group to view the object.
    • update: Allows or denies a user/group to edit the object.
    • delete: Allows or denies a user/group to delete the object.
    • manage: Allows the user/group to manage the cluster's permissions.
    • all: Allows or denies a user/group to read, edit, and delete the object. It always gets the lowest priority; that is, the read, update, and delete actions take precedence over the all action.
    • start: Allows or denies a user/group to start the cluster.
    • terminate: Allows or denies a user/group to terminate the cluster.
  • access: Whether the listed actions are allowed or denied. Its value is either allow or deny.
  • condition: User IDs and group IDs to which this policy is assigned:
    • qbol_users: An array of IDs of the users for whom this policy is set. These IDs are not the same as regular user IDs; create a ticket with Qubole Support to obtain them.
    • qbol_groups: An array of IDs of the groups for whom this policy is set.
Request API Syntax
curl -X PUT -H "X-AUTH-TOKEN: <API-TOKEN>" -H "Content-Type: application/json" -H "Accept: application/json" \
-d '{"source_id":"<Object_ID>", "source_type": "<Object>",
      "policy": "[{\"access\":\"<Access>\",\"condition\":{\"qbol_users\":[<User ID>]},\"action\":[\"<Actions>\"]},
                 {\"access\":\"<Access>\",\"condition\":{\"qbol_groups\":[<Group ID>]},\"action\":[\"<Actions>\"]},
                 {\"access\":\"<Access>\",\"condition\":{\"qbol_users\":[<User ID>],\"qbol_groups\":[<Group ID>]},\"action\":[\"<Actions>\"]}]"}' \
"https://gcp.qubole.com/api/v1.2/object_policy/policy"
Sample API Request

Here is a sample API call to set an object policy for a cluster with its ID as 2001.

curl -X PUT -H "X-AUTH-TOKEN: <API-TOKEN>" -H "Content-Type: application/json" -H "Accept: application/json" \
-d '{"source_id":"2001", "source_type": "cluster",
      "policy": "[{\"access\":\"allow\",\"condition\":{\"qbol_users\":[1715]},\"action\":[\"read\"]},
                 {\"access\":\"allow\",\"condition\":{\"qbol_groups\":[2352]},\"action\":[\"read\",\"update\"]},
                 {\"access\":\"deny\",\"condition\":{\"qbol_users\":[1715],\"qbol_groups\":[2352]},\"action\":[\"all\"]}]"}' \
"https://gcp.qubole.com/api/v1.2/object_policy/policy"
Set Object Policy for a Notebook

You can set a policy for an individual object and restrict users or groups from accessing the object. This overrides the access granted to the object at the account-level in the Control Panel. For more information, see Managing Roles.

PUT /api/v1.2/object_policy/policy

Use this API to set an object policy. Qubole supports object policy API on notebooks, clusters, and scheduler. This section describes setting an object policy for a notebook.

Managing Notebook Permissions describes how to control access for each notebook.

Note

If you grant a permission to a user who is part of a group for which that permission is restricted, the user is allowed access, and vice versa.

Required Role

The following users can make this API call:

  • Users who belong to the system-admin group or owner of the object.
  • Users who belong to a group associated with a role that allows setting an object policy. See Managing Groups and Managing Roles for more information.
Parameters

Note

Parameters marked in bold are mandatory. Others are optional and have default values.

Parameter Description
source_id It specifies the ID of the object based on the source_type.
source_type It specifies the object. It must be a note for a notebook.
policy

Array of policies to be assigned to a notebook. Each policy includes the following parameters:

Note

Because the policy is passed as a JSON-encoded string, escape the quotation marks around the policy keys and values, except for the user ID and group ID values, which are numeric.

  • action: Name of the action on the resource. The actions can be read, update, delete, manage, or all. The actions are as follows:
    • read: Allows or denies a user/group to view the object.
    • update: Allows or denies a user/group to edit the object.
    • delete: Allows or denies a user/group to delete the object.
    • manage: Allows the user/group to manage the notebook's permissions.
    • all: Allows or denies a user/group to read, edit, and delete the object. It always gets the lowest priority; that is, the read, update, and delete actions take precedence over the all action.
  • access: Whether the listed actions are allowed or denied. Its value is either allow or deny.
  • condition: User IDs and group IDs to which this policy is assigned:
    • qbol_users: An array of IDs of the users for whom this policy is set. These IDs are not the same as regular user IDs; create a ticket with Qubole Support to obtain them.
    • qbol_groups: An array of IDs of the groups for whom this policy is set.
Request API Syntax
curl -X PUT -H "X-AUTH-TOKEN: <API-TOKEN>" -H "Content-Type: application/json" -H "Accept: application/json" \
-d '{"source_id":"<Object_ID>", "source_type": "<Object>",
      "policy": "[{\"access\":\"<Access>\",\"condition\":{\"qbol_users\":[<User ID>]},\"action\":[\"<Actions>\"]},
                 {\"access\":\"<Access>\",\"condition\":{\"qbol_groups\":[<Group ID>]},\"action\":[\"<Actions>\"]},
                 {\"access\":\"<Access>\",\"condition\":{\"qbol_users\":[<User ID>],\"qbol_groups\":[<Group ID>]},\"action\":[\"<Actions>\"]}]"}' \
"https://gcp.qubole.com/api/v1.2/object_policy/policy"
Sample API Request

Here is a sample API call to set an object policy for a notebook with its ID as 250.

curl -X PUT -H "X-AUTH-TOKEN: <API-TOKEN>" -H "Content-Type: application/json" -H "Accept: application/json" \
-d '{"source_id":"250", "source_type": "note",
      "policy": "[{\"access\":\"allow\",\"condition\":{\"qbol_users\":[1715]},\"action\":[\"read\"]},
                 {\"access\":\"allow\",\"condition\":{\"qbol_groups\":[2352]},\"action\":[\"read\",\"update\"]},
                 {\"access\":\"deny\",\"condition\":{\"qbol_users\":[1715],\"qbol_groups\":[2352]},\"action\":[\"all\"]}]"}` \
"https://gcp.qubole.com/api/v1.2/object_policy/policy"

Note

It is recommended to include a deny all policy for the list of users and groups, so that the API-managed permissions stay consistent with the permissions managed through the Notebooks UI.

In the above example, the last condition meets that requirement.

{\"access\":\"deny\",\"condition\":{\"qbol_users\":[1715],\"qbol_groups\":[2352]},\"action\":[\"all\"]}

Set Object Policy for a Scheduler

You can set a policy for an individual object and restrict users or groups from accessing the object. This overrides the access granted to the object at the account-level in the Control Panel. For more information, see Managing Roles.

Managing Scheduler Permissions through the UI describes how to set schedule permissions through the QDS UI.

PUT /api/v1.2/object_policy/policy

Use this API to set an object policy. Qubole supports object policy API on notebooks, clusters, and scheduler. This section describes setting an object policy for a schedule.

Note

If you grant a permission to a user who is part of a group for which that permission is restricted, the user is allowed access, and vice versa. For more information, see Understanding the Precedence of Scheduler Permissions.

Required Role

The following users can make this API call:

  • Users who belong to the system-admin group or owner of the object.
  • Users who belong to a group associated with a role that allows setting an object policy. See Managing Groups and Managing Roles for more information.
Parameters

Note

Parameters marked in bold are mandatory. Others are optional and have default values.

Parameter Description
source_id It specifies the ID of the object based on the source_type.
source_type It specifies the object. It must be a scheduler for a schedule.
policy

Array of policies to be assigned to a schedule. Each policy includes the following parameters:

Note

Because the policy is passed as a JSON-encoded string, escape the quotation marks around the policy keys and values, except for the user ID and group ID values, which are numeric.

  • action: Name of the action on the resource. The actions can be read, update, delete, manage, or all. The actions are as follows:
    • read: Allows or denies a user/group to view the object.
    • update: Allows or denies a user/group to edit the object.
    • delete: Allows or denies a user/group to delete the object.
    • manage: Allows the user/group to manage the schedule's permissions.
    • all: Allows or denies a user/group to read, edit, and delete the object. It always gets the lowest priority; that is, the read, update, and delete actions take precedence over the all action.
  • access: Whether the listed actions are allowed or denied. Its value is either allow or deny.
  • condition: User IDs and group IDs to which this policy is assigned:
    • qbol_users: An array of IDs of the users for whom this policy is set. These IDs are not the same as regular user IDs; create a ticket with Qubole Support to obtain them.
    • qbol_groups: An array of IDs of the groups for whom this policy is set.
Request API Syntax
curl -X PUT -H "X-AUTH-TOKEN: <API-TOKEN>" -H "Content-Type: application/json" -H "Accept: application/json" \
-d '{"source_id":"<Object_ID>", "source_type": "<Object>",
      "policy": "[{\"access\":\"<Access>\",\"condition\":{\"qbol_users\":[<User ID>]},\"action\":[\"<Actions>\"]},
                 {\"access\":\"<Access>\",\"condition\":{\"qbol_groups\":[<Group ID>]},\"action\":[\"<Actions>\"]},
                 {\"access\":\"<Access>\",\"condition\":{\"qbol_users\":[<User ID>],\"qbol_groups\":[<Group ID>]},\"action\":[\"<Actions>\"]}]"}' \
"https://gcp.qubole.com/api/v1.2/object_policy/policy"
Sample API Request

Here is a sample API call to set an object policy for a schedule with its ID as 140.

curl -X PUT -H "X-AUTH-TOKEN: <API-TOKEN>" -H "Content-Type: application/json" -H "Accept: application/json" \
-d '{"source_id":"140", "source_type": "scheduler",
      "policy": "[{\"access\":\"allow\",\"condition\":{\"qbol_users\":[1715]},\"action\":[\"read\"]},
                 {\"access\":\"allow\",\"condition\":{\"qbol_groups\":[2352]},\"action\":[\"read\",\"update\"]},
                 {\"access\":\"deny\",\"condition\":{\"qbol_users\":[1715],\"qbol_groups\":[2352]},\"action\":[\"all\"]}]"}' \
"https://gcp.qubole.com/api/v1.2/object_policy/policy"
Set Object Policy for a Package Management Environment

You can set a policy for an individual object and restrict users or groups from accessing the object. This overrides the access granted to the object at the account-level in the Control Panel. For more information, see Managing Roles.

PUT /api/v1.2/object_policy/policy

Use this API to set an object policy. Qubole supports the object policy API on notebooks, clusters, schedulers, and environments. This section describes setting an object policy for a Package Management environment.

Managing Permissions of an Environment describes how to control access for each environment.

Note

If you grant a permission to a user who is part of a group for which that permission is restricted, the user is allowed access, and vice versa.

Required Role

The following users can make this API call:

  • Users who belong to the system-admin group or owner of the object.
  • Users who belong to a group associated with a role that allows setting an object policy. See Managing Groups and Managing Roles for more information.
Parameters

Note

Parameters marked in bold are mandatory. Others are optional and have default values.

Parameter Description
source_id It specifies the ID of the object based on the source_type.
source_type It specifies the object. It must be the environment for a Package Management Environment.
policy

Array of policies to be assigned to a Package Management environment. Each policy includes the following parameters:

Note

Because the policy is passed as a JSON-encoded string, escape the quotation marks around the policy keys and values, except for the user ID and group ID values, which are numeric.

  • action: Name of the action on the resource. The actions can be read, update, delete, manage, or all. The actions are as follows:
    • read: Allows or denies a user/group to view the object.
    • update: Allows or denies a user/group to edit the object.
    • delete: Allows or denies a user/group to delete the object.
    • manage: Allows the user/group to manage the object's permissions.
    • all: Allows or denies a user/group to read, edit, and delete the object. It always gets the lowest priority; that is, the read, update, and delete actions take precedence over the all action.
  • access: Whether the listed actions are allowed or denied. Its value is either allow or deny.
  • condition: User IDs and group IDs to which this policy is assigned:
    • qbol_users: An array of IDs of the users for whom this policy is set. These IDs are not the same as regular user IDs; create a ticket with Qubole Support to obtain them.
    • qbol_groups: An array of IDs of the groups for whom this policy is set.
Request API Syntax
curl -X PUT -H "X-AUTH-TOKEN: <API-TOKEN>" -H "Content-Type: application/json" -H "Accept: application/json" \
-d '{"source_id":"<Object_ID>", "source_type": "<Object>",
      "policy": "[{\"access\":\"<Access>\",\"condition\":{\"qbol_users\":[<User ID>]},\"action\":[\"<Actions>\"]},
                 {\"access\":\"<Access>\",\"condition\":{\"qbol_groups\":[<Group ID>]},\"action\":[\"<Actions>\"]},
                 {\"access\":\"<Access>\",\"condition\":{\"qbol_users\":[<User ID>],\"qbol_groups\":[<Group ID>]},\"action\":[\"<Actions>\"]}]"}' \
"https://gcp.qubole.com/api/v1.2/object_policy/policy"
Sample API Request

Here is a sample API call to set an object policy for a Package Management environment with its ID as 20.

curl -X PUT -H "X-AUTH-TOKEN: <API-TOKEN>" -H "Content-Type: application/json" -H "Accept: application/json" \
-d '{"source_id":"20", "source_type": "environment",
      "policy": "[{\"access\":\"allow\",\"condition\":{\"qbol_users\":[1715]},\"action\":[\"read\"]},
                 {\"access\":\"allow\",\"condition\":{\"qbol_groups\":[2352]},\"action\":[\"read\",\"update\"]},
                 {\"access\":\"deny\",\"condition\":{\"qbol_users\":[1715],\"qbol_groups\":[2352]},\"action\":[\"all\"]}]"}` \
"https://gcp.qubole.com/api/v1.2/object_policy/policy"

Note

It is recommended to include a deny all policy for the list of users and groups, so that the API-managed permissions stay consistent with the permissions managed through the UI for environments.

In the above example, the last condition meets that requirement.

{\"access\":\"deny\",\"condition\":{\"qbol_users\":[1715],\"qbol_groups\":[2352]},\"action\":[\"all\"]}

View the Object Policy
GET /api/v1.2/object_policy/policy

Use this API to view the policy or policies set for an individual object. Since Qubole supports the object policy API on notebooks, clusters, package management environments, and schedulers, you can view the policy set for any of them.

Parameters

Note

Parameters marked in bold are mandatory. Others are optional and have default values.

Parameter Description
source_id It specifies the ID of the object, a cluster, an environment, a notebook, or a scheduler based on the source_type.
source_type It specifies the object. The values are cluster, environment, note, and scheduler.
Required Role

The following users can make this API call:

  • Users who belong to the system-admin or system-user group.
  • Users who belong to a group associated with a role that allows viewing the object policy. See Managing Groups and Managing Roles for more information.
Request API Syntax
curl -X GET -H "X-AUTH-TOKEN: <API-TOKEN>" -H "Content-Type: application/json" -H "Accept: application/json" \
-d `{"source_id":"<Object_ID>", "source_type": "<Object>"}` \ "https://gcp.qubole.com/api/v1.2/object_policy/policy"
Sample API Requests

Here is an example to see the object policy for a cluster with its ID as 2001.

curl -X GET -H "X-AUTH-TOKEN: <API-TOKEN>" -H "Content-Type: application/json" -H "Accept: application/json" \
-d `{"source_id":"2001", "source_type": "cluster"}` \ "https://gcp.qubole.com/api/v1.2/object_policy/policy"

Here is an example to see the object policy for a notebook with its ID as 250.

curl -X GET -H "X-AUTH-TOKEN: <API-TOKEN>" -H "Content-Type: application/json" -H "Accept: application/json" \
-d `{"source_id":"250", "source_type": "note"}` \ "https://gcp.qubole.com/api/v1.2/object_policy/policy"

Package Management Environment API

QDS provides a package management environment to add and manage Python and R packages in Spark applications. For more information, see Using the Default Package Management UI.

Note

This feature is not available by default. Create a ticket with Qubole Support to enable this feature on the QDS account.

Create a Package Management Environment
POST /api/v1.2/package/create_environment

Use this API to create a package management environment.

Note

This feature is not available by default. Create a ticket with Qubole Support to enable this feature on the QDS account.

Required Role

The following users can make this API call:

  • Users who belong to the system-admin or system-user group.
  • Users who belong to a group associated with a role that allows creating an environment. See Managing Groups and Managing Roles for more information.
Parameters

Note

Parameters marked in bold are mandatory. Others are optional and have default values.

Parameter Description
name The name of the environment. This parameter is mandatory.
description A brief description of the environment.
python_version The default Python version is 2.7. The other supported version is 3.5.
r_version The R version. The only supported (and default) version is 3.3.
Request API Syntax

Here is the syntax to create an environment.

curl -X POST -H "X-AUTH-TOKEN: <API-TOKEN>" -H "Content-Type: application/json" -H "Accept: application/json" \
-d '{"name": "<EnvironmentName>", "description": "<Description>", "python_version":"<Supported Python Version>",
      "r_version":"<Supported R version>"}' \ "https://gcp.qubole.com/api/v1.2/package/create_environment"
Sample API Request

Here is a sample API call to create an environment.

curl -X POST -H "X-AUTH-TOKEN: <API-TOKEN>" -H "Content-Type: application/json" -H "Accept: application/json" \
-d '{"name": "PackageEnv", "description": "Environment for adding packages", "python_version":"3.5",
    "r_version":"3.3"}' \ "https://gcp.qubole.com/api/v1.2/package/create_environment"

If you do not specify the Python and R versions, the defaults (Python 2.7 and R 3.3) are installed. A newly created environment gets a list of Python and R packages installed by default. For more information, see Using the Default Package Management UI.

Response

{ "environment":
    {
        "account_id": 1,
        "cluster_id": null,
        "created_at": "2017-11-22T12:19:40Z",
        "description": "Environment for adding packages",
        "env_predefine_python_id": 1,
        "env_predefine_r_id": 3,
        "env_python_detail_id": 200,
        "env_r_detail_id": null,
        "id": 81,
        "name": "PackageEnv",
        "python_version_ui": "3.5",
        "qbol_user_id": 88,
        "r_version_ui": "3.3",
        "status": true,
        "updated_at": "2017-12-06T12:05:48Z"
    }
}
Attach a Cluster to the Package Management Environment
PUT /api/v1.2/package/<env ID>/attach_cluster

Use this API to attach a cluster to a package management environment. A cluster can have only one environment attached to it. You can attach a cluster to an environment only when it is down/inactive.

When you create a new Spark cluster, a package management environment is created and attached to the cluster automatically. This feature is not enabled by default; create a ticket with Qubole Support to enable it on the QDS account.

In addition to Spark clusters, you can attach environments with Python 3.5 to Airflow clusters that use Airflow version 1.8.2, but only if the environment is detached from (that is, not currently attached to) a cluster. For more information, see Configuring an Airflow Cluster.

A Conda virtual environment gets created for Python and R environments. In the Spark cluster, Python and R Conda environments are located in /usr/lib/envs/. The spark.pyspark.python configuration in /usr/lib/spark/conf/spark-defaults.conf points to the Python version installed in the Conda virtual environment for a Spark cluster.

In a Spark notebook associated with a cluster attached to the package management environment, configure these in its interpreter settings to point to the virtual environment:

  • Set zeppelin.R.cmd to cluster_env_default_r
  • Set zeppelin.pyspark.python to cluster_env_default_py
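
For reference, here is a minimal sketch showing how you might verify these settings on a cluster node; it assumes shell access to the node, and the exact environment directory name under /usr/lib/envs/ varies per environment.

# List the Conda virtual environments created for the attached package management environment
ls /usr/lib/envs/

# Confirm which Python interpreter the Spark configuration points to
grep spark.pyspark.python /usr/lib/spark/conf/spark-defaults.conf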
Required Role

The following users can make this API call:

  • Users who belong to the system-admin or system-user group.
  • Users who belong to a group associated with a role that allows updating an environment. See Managing Groups and Managing Roles for more information.
Parameters

Note

Parameters marked in bold are mandatory. Others are optional and have default values.

Parameter Description
cluster_id It is the Spark cluster’s unique ID to which you want to attach the environment. One Spark cluster can only have one environment attached to it.
Request API Syntax
curl -X PUT -H "X-AUTH-TOKEN: <API-TOKEN>" -H "Content-Type: application/json" -H "Accept: application/json" \
-d '{"cluster_id": "<Cluster ID>"}` \
"https://gcp.qubole.com/api/v1.2/package/<env ID>/attach_cluster"
Sample API Request

Here is a sample request that attaches cluster 125 to the environment with 100 as its ID.

curl -X PUT -H "X-AUTH-TOKEN: <API-TOKEN>" -H "Content-Type: application/json" -H "Accept: application/json" \
-d '{"cluster_id": "125"}` \
"https://gcp.qubole.com/api/v1.2/package/100/attach_cluster"
Detach the Cluster from the Package Management Environment
PUT /api/v1.2/package/<env ID>/detach_cluster

Use this API to detach the cluster from the Package Management environment. If you detach an environment from an Airflow 1.8.2 cluster that uses Python 3.5, you must attach another Python 3.5 environment to get the Airflow cluster running; otherwise, the cluster does not start.

Required Role

The following users can make this API call:

  • Users who belong to the system-admin or system-user group.
  • Users who belong to a group associated with a role that allows updating an environment. See Managing Groups and Managing Roles for more information.
Parameters

None

Request API Syntax
curl -X PUT -H "X-AUTH-TOKEN: <API-TOKEN>" -H "Content-Type: application/json" -H "Accept: application/json" \
"https://gcp.qubole.com/api/v1.2/package/<env ID>/detach_cluster"
Sample API Request

Here is a sample request that detaches the cluster from the environment with 120 as its ID.

curl -X PUT -H "X-AUTH-TOKEN: <API-TOKEN>" -H "Content-Type: application/json" -H "Accept: application/json" \
"https://gcp.qubole.com/api/v1.2/package/120/detach_cluster"
Edit a Package Management Environment
PUT /api/v1.2/package/<Env ID>/update_environment

Use this API to edit a package management environment. You cannot change the Python and R versions.

Required Role

The following users can make this API call:

  • Users who belong to the system-admin or system-user group.
  • Users who belong to a group associated with a role that allows updating an environment. See Managing Groups and Managing Roles for more information.
Parameters

Note

Parameters marked in bold are mandatory. Others are optional and have default values.

Parameter Description
name The new name for the environment, if you want to change it.
description You can add a brief description about the environment.
Request API Syntax
curl -X PUT -H "X-AUTH-TOKEN: <API-TOKEN>" -H "Content-Type: application/json" -H "Accept: application/json" \
-d '{"name": "<EnvironmentName>", "description": "<Description>"}' \
"https://gcp.qubole.com/api/v1.2/package/<env ID>/update_environment"
Sample API Request

Here is a sample request to edit a package environment that has 97 as its ID.

curl -X PUT -H "X-AUTH-TOKEN: <API-TOKEN>" -H "Content-Type: application/json" -H "Accept: application/json" \
-d '{"name": "PackageEnv1", "description": "Environment for adding Python and R packages"}' \
"https://gcp.qubole.com/api/v1.2/package/97/update_environment"
Clone a Package Management Environment
PUT /api/v1.2/package/<env ID>/clone_environment

Use this API to clone a package management environment. You cannot change the Python and R versions.

Required Role

The following users can make this API call:

  • Users who belong to the system-admin or system-user group.
  • Users who belong to a group associated with a role that allows creating an environment. See Managing Groups and Managing Roles for more information.
Parameters

Note

Parameters marked in bold are mandatory. Others are optional and have default values.

Parameter Description
name Provide a name for the cloned environment. By default, the cloned environment inherits the parent environment’s name with -clone as its suffix, that is, <environmentname>-clone.
description You can add a brief description about the environment.
Request API Syntax
curl -X PUT -H "X-AUTH-TOKEN: <API-TOKEN>" -H "Content-Type: application/json" -H "Accept: application/json" \
-d '{"name": "<EnvironmentName>", "description": "<Description>"}' \
"https://gcp.qubole.com/api/v1.2/package/<env ID>/clone_environment"
Sample API Request

Here is a sample request to clone a package environment that has 100 as its ID.

curl -X PUT -H "X-AUTH-TOKEN: <API-TOKEN>" -H "Content-Type: application/json" -H "Accept: application/json" \
-d '{"name": "PackageEnv2.0", "description": "Environment for adding packages"}' \
"https://gcp.qubole.com/api/v1.2/package/100/clone_environment"
List Package Management Environments
GET /api/v1.2/package/list_environments

Use this API to list all environments. On the QDS UI, navigate to Control Panel > Environments to see the list of environments. For more information, see Using the Default Package Management UI.

Required Role

The following users can make this API call:

  • Users who belong to the system-admin or system-user group.
  • Users who belong to a group associated with a role that allows reading/updating an environment. See Managing Groups and Managing Roles for more information.
Parameters

None

Request API Syntax

Because this API takes no parameters, a separate sample request is not shown.

curl -X GET -H "X-AUTH-TOKEN: <API-TOKEN>" -H "Content-Type: application/json" -H "Accept: application/json" \
"https://gcp.qubole.com/api/v1.2/package/list_environments"
Delete a Package Management Environment
DELETE /api/v1.2/package/<env ID>/delete_environment

Use this API to delete a Package Management environment. You can delete an environment only when its attached cluster is down/inactive. After the environment is deleted, the cluster is detached from it automatically.

Required Role

The following users can make this API call:

  • Users who belong to the system-admin or system-user group.
  • Users who belong to a group associated with a role that allows deleting an environment. See Managing Groups and Managing Roles for more information.
Parameters

None

Request API Syntax
curl -X DELETE -H "X-AUTH-TOKEN: <API-TOKEN>" -H "Content-Type: application/json" -H "Accept: application/json" \
"https://gcp.qubole.com/api/v1.2/package/<env ID>/delete_environment"
Sample API Request

Here is a sample request to delete the environment with 140 as its ID.

curl -X DELETE -H "X-AUTH-TOKEN: <API-TOKEN>" -H "Content-Type: application/json" -H "Accept: application/json" \
"https://gcp.qubole.com/api/v1.2/package/140/delete_environment"
Add Packages to the Package Management Environment
POST /api/v1.2/package/<env ID>/add_packages

Use this API to add Python and R packages to a Package Management environment. A newly created environment supports the conda-forge channel, which provides access to a larger set of packages. Viewing the List of Pre-installed Python and R Packages provides the list of pre-installed packages, and Questions about Package Management provides answers to questions related to adding packages.

Note

In addition to the default Conda package repo, you can also add R packages (but not Python packages) from the CRAN package repo. For more information, see Using the Default Package Management UI.

Required Role

The following users can make this API call:

  • Users who belong to the system-admin or system-user group.
  • Users who belong to a group associated with a role that allows updating an environment. See Managing Groups and Managing Roles for more information.
Parameters

Note

Parameters marked in bold are mandatory. Others are optional and have default values.

Parameter Description
source_type The type of package to add; only Python and R packages can be added. Specify python_package or r_package. This parameter is mandatory when adding packages through an environment.
package_names Specify the name of the package. If you specify only the name, the latest version of the package gets installed. To install a specific version of the package, specify it in this format: <packagename>==<version-number>. For example, biopython==0.1. You can add any number of packages as a comma-separated list.
repo_url Set this parameter to https://cran.r-project.org only when you want to add an R package from the CRAN package repo.
Request API Syntax
curl -X POST -H "X-AUTH-TOKEN: <API-TOKEN>" -H "Content-Type: application/json" -H "Accept: application/json" \
-d ' {"source_type": "<package type>", "package_names": "<packagename1>,<packagename2>,....,<packagenameN>"}' \
      "https://gcp.qubole.com/api/v1.2/package/<envID>/add_packages"
Sample API Requests

Note

If you upgrade or downgrade a Python package, the changed version is reflected only after you restart the Spark interpreter. Interpreter Operations lists the restart and other Spark interpreter operations.

Here is a sample API call for adding Python packages to an environment that has 100 as its ID.

curl -X POST -H "X-AUTH-TOKEN: <API-TOKEN>" -H "Content-Type: application/json" -H "Accept: application/json" \
-d '{"source_type": "python_package", "package_names": "numpy,biopython==0.1,tensorflow"}' \
     "https://gcp.qubole.com/api/v1.2/package/100/add_packages"

Here is a sample API call for adding R packages to an environment that has 100 as its ID.

curl -X POST -H "X-AUTH-TOKEN: <API-TOKEN>" -H "Content-Type: application/json" -H "Accept: application/json" \
-d '{"source_type": "r_package", "package_names": "r-rserve,r-brew==1.0"}' \
     "https://gcp.qubole.com/api/v1.2/package/100/add_packages"

Here is a sample API call for adding an R package from the CRAN package repo to an environment that has 100 as its ID.

curl -X POST -H "X-AUTH-TOKEN: <API-TOKEN>" -H "Content-Type: application/json" -H "Accept: application/json" \
-d '{"source_type": "r_package", "repo_packages": "[ {\"name\": \"RMySQL\", \"repo_url\": \"https://cran.r-project.org\"}]" }' \
"https://gcp.qubole.com/api/v1.2/package/100/add_packages"
List Packages of a Package Management Environment
GET /api/v1.2/package/<env ID>/list_packages

Use this API to see the packages added in a specific Package Management environment. <env ID> denotes the environment’s ID. Questions about Package Management provides answers to questions related to adding packages.

Required Role

The following users can make this API call:

  • Users who belong to the system-admin or system-user group.
  • Users who belong to a group associated with a role that allows reading/updating an environment. See Managing Groups and Managing Roles for more information.
Parameters

None

Request API Syntax
curl -X GET -H "X-AUTH-TOKEN: <API-TOKEN>" -H "Content-Type: application/json" -H "Accept: application/json" \
 "https://gcp.qubole.com/api/v1.2/package/<env ID>/list_packages"
Sample API Request

Here is the sample API request to list packages of a Package Management environment that has 516 as its ID.

curl -X GET -H "X-AUTH-TOKEN: <API-TOKEN>" -H "Content-Type: application/json" -H "Accept: application/json" \
 "https://gcp.qubole.com/api/v1.2/package/516/list_packages"
Response

Here is the sample response that provides the list of packages and installed packages’ history.

Note

installed_pkgs_history provides the list of successfully installed packages per request, and packages provides the last five unsuccessful add-packages requests and the last successful add-packages request.

{
 "env_id": 516,
 "packages": {
     "python": [
         {
             "cluster_status": null,
             "cluster_status_msg": null,
             "created_at": "2018-01-09T14:50:53Z",
             "env_id": 516,
             "error": null,
             "id": 548,
             "qbol_user_id": <user ID>,
             "requirements": "tensorflow,ipython,numpy",
             "status": "success",
             "updated_at": "2018-01-09T14:53:03Z"
         },
    {
             "cluster_status": null,
             "cluster_status_msg": null,
             "created_at": "2018-01-09T10:32:18Z",
             "env_id": 516,
             "error": "  Could not find a version that satisfies the requirement biopython==0.1 (from versions: 1.0a3, 1.0a4, 1.10, 1.20, 1.21, 1.22, 1.23, 1.24, 1.30, 1.40b0, 1.41, 1.42, 1.43, 1.44, 1.45, 1.46, 1.47, 1.48, 1.49b0, 1.49, 1.50b0, 1.50, 1.51b0, 1.51, 1.52, 1.53, 1.54b0, 1.54, 1.55b0, 1.55, 1.56, 1.57, 1.58, 1.59, 1.60, 1.61, 1.62b0, 1.62, 1.63b0, 1.63, 1.64, 1.65, 1.66, 1.67, 1.68, 1.69, 1.70)\nNo matching distribution found for biopython==0.1\nAddition of Python package biopython==0.1 creation failed for the last job",
             "id": 547,
             "qbol_user_id": 12609,
             "requirements": "tensorflow,ipython,biopython==0.1",
             "status": "error",
             "updated_at": "2018-01-09T10:33:54Z"
    }
    ],
   "r": [
         {
             "cluster_status": null,
             "cluster_status_msg": null,
             "created_at": "2018-01-09T14:59:55Z",
             "env_id": 516,
             "error": null,
             "id": 129,
             "qbol_user_id": <user ID>,
             "repo_packages": null,
             "requirements": "r-rserve",
             "status": "success",
             "updated_at": "2018-01-09T15:03:03Z"
         },
         {
             "cluster_status": null,
             "cluster_status_msg": null,
             "created_at": "2018-01-09T15:18:41Z",
             "env_id": 516,
             "error": "PackageNotFoundError: Package not found: '' Package missing in current linux-64 channels: \n  - r-rserve ==0.1\n\nYou can search for packages on anaconda.org with\n\n    anaconda search -t conda r-rserve\n\n\nAddition of R package r-rserve==0.1,r-brew creation failed for the last job",
             "id": 130,
             "qbol_user_id": 12609,
             "repo_packages": null,
             "requirements": "r-rserve,r-rserve==0.1,r-brew",
             "status": "error",
             "updated_at": "2018-01-09T15:21:03Z"
         }
   ]
   },
   "installed_pkgs_history": {
       "python": [
         {
             "cluster_status": null,
             "cluster_status_msg": null,
             "created_at": "2018-01-04T05:09:20Z",
             "env_id": 516,
             "error": null,
             "id": 527,
             "qbol_user_id": <user ID>,
             "requirements": "tensorflow",
             "status": "success",
             "updated_at": "2018-01-04T05:10:08Z"
         },
         {
             "cluster_status": null,
             "cluster_status_msg": null,
             "created_at": "2018-01-09T10:27:32Z",
             "env_id": 516,
             "error": null,
             "id": 546,
             "qbol_user_id": <user ID>,
             "requirements": "tensorflow,ipython",
             "status": "success",
             "updated_at": "2018-01-09T10:28:34Z"
         },
         {
             "cluster_status": null,
             "cluster_status_msg": null,
             "created_at": "2018-01-09T14:50:53Z",
             "env_id": 516,
             "error": null,
             "id": 548,
             "qbol_user_id": <user ID>,
             "requirements": "tensorflow,ipython,numpy",
             "status": "success",
             "updated_at": "2018-01-09T14:53:03Z"
         }
   ],
   "r":  [
         {
             "cluster_status": null,
             "cluster_status_msg": null,
             "created_at": "2018-01-09T14:59:55Z",
             "env_id": 516,
             "error": null,
             "id": 129,
             "qbol_user_id": <user ID>,
             "repo_packages": null,
             "requirements": "r-rserve",
             "status": "success",
             "updated_at": "2018-01-09T15:03:03Z"
         }
   ]
   }
 }
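
As a convenience, here is a hedged sketch that filters this response for failed Python add-packages requests; it assumes the jq utility is installed locally and relies only on the fields shown in the sample response above.

curl -s -X GET -H "X-AUTH-TOKEN: <API-TOKEN>" -H "Accept: application/json" \
 "https://gcp.qubole.com/api/v1.2/package/516/list_packages" \
 | jq '.packages.python[] | select(.status == "error") | {id, requirements, error}'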
Remove Packages from a Package Management Environment
DELETE /api/v1.2/package/<env ID>/remove_packages

Use this API to remove Python and R packages from a Package Management environment.

Required Role

The following users can make this API call:

  • Users who belong to the system-admin or system-user group.
  • Users who belong to a group associated with a role that allows updating an environment. See Managing Groups and Managing Roles for more information.
Parameters

Note

Parameters marked in bold are mandatory. Others are optional and have default values.

Parameter Description
source_type Specify python_package or r_package. This parameter is mandatory when removing packages from an environment.
package_names Specify the name of the package. To remove a specific version of the package, specify it in this format: <packagename>==<version-number>. For example, biopython==0.1. You can remove any number of packages as a comma-separated list.
Request API Syntax
curl -X DELETE -H "X-AUTH-TOKEN: <API-TOKEN>" -H "Content-Type: application/json" -H "Accept: application/json" \
-d '{"source_type": "<package type>", "package_names": "<packagename1>,<packagename2>,....,<packagenameN>"}' \
      "https://gcp.qubole.com/api/v1.2/package/<envID>/remove_packages"
Sample API Request

Here is a sample API call for removing Python packages from an environment that has 120 as its ID.

curl -X DELETE -H "X-AUTH-TOKEN: <API-TOKEN>" -H "Content-Type: application/json" -H "Accept: application/json" \
-d '{"source_type": "python_package", "package_names": "numpy,bipython==0.1,tensorflow"}' \
     "https://gcp.qubole.com/api/v1.2/package/120/remove_packages"

Here is a sample API call for removing R packages from an environment that has 120 as its ID.

curl -X DELETE -H "X-AUTH-TOKEN: <API-TOKEN>" -H "Content-Type: application/json" -H "Accept: application/json" \
-d '{"source_type": "r_package", "package_names": "r-rserve,r-brew==1.0"}' \
     "https://gcp.qubole.com/api/v1.2/package/120/remove_packages"

Reports API

All Commands Report
GET /api/v1.2/reports/all_commands

This API retrieves the All Commands report containing the query metrics in JSON format.

Note

The following points are related to a report API:

  • If the difference between the start date and end date is more than 60 days, the system defaults to a 1-month window from the current day’s date.
  • If either the start date or end date is not provided, the system defaults to a 1-month window from the current day’s date.
  • If you want to get data for a window of more than 2 months, create a ticket with Qubole Support.
Required Role

The following users can make this API call:

  • Users who belong to the system-admin or system-user group.
  • Users who belong to a group associated with a role that allows viewing commands reports. See Managing Groups and Managing Roles for more information.
Parameters
Parameter Description
start_date The date from which you want the report (inclusive). This parameter supports the timestamp in the UTC timezone (YYYY-MM-DDTHH:MM:SSZ) format. The date cannot be earlier than 180 days before the current date.
end_date The date until which you want the report (inclusive). The API default is today. This parameter also supports timestamp in the UTC timezone (YYYY-MM-DDTHH:MM:SSZ) format.
offset The starting point of the results. The API default is 0.
limit The number of results to fetch. The API default is 10.
sort_column The column used to sort the report. The valid choices are time, cpu, fs_bytes_read, and fs_bytes_written. The API default is time (chronological order).
by_user Report only those queries which are created by the current user. By default all queries by the current account are reported.
status It enables you to filter queries based on the following status values: done, error, running, waiting, cancelled, and cancelling.
Response Parameters
Parameter Description
start_date The starting date of the report. This parameter filters on the created_at parameter value.
end_date The ending date of the report. This parameter filters on the created_at parameter value.
sort_column The sort column used.
queries Contains an array of query-related parameters and values as provided in the following table.

An array of parameters and values associated with the queries parameter.

id The query ID of the command.
created_at The time when the query was created.
submitted_by The email address of the user who created the query.
command_type The type of the command (HiveCommand, PrestoCommand, and so on.)
command_summary The summary of the command (query/latin_statements/script_location, and so on.)
status The status of the command (whether it succeeded or failed, and so on.)
label It is the label of the cluster on which the command is run.
end_time Denotes the time at which the query execution completed.
cpu The total cumulative CPU (in ms) consumed by this command. It signifies the cumulative CPU time spent by all cluster nodes in the cluster that processed the command.
fs_bytes_read The total bytes read by this command.
fs_bytes_written The total bytes written by this command.
Examples

Goal

To get the default report.

curl -i -X GET -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Accept: application/json" \
"https://gcp.qubole.com/api/v1.2/reports/all_commands"

Goal

To get the report for commands executed only by the current user.

curl -i -X GET -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Accept: application/json" \
"https://gcp.qubole.com/api/v1.2/reports/all_commands?by_user"

Goal

To get the report for commands executed during a specific time period and sorted by total bytes read.

curl -i -X GET -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Accept: application/json" \
"https://gcp.qubole.com/api/v1.2/reports/all_commands?start_date=2017-06-10&end_date=2017-07-10&sort_column=fs_bytes_read"

Sample Response

{
"sort_column": "fs_bytes_read",
"start_date": "2017-06-10T00:00:00Z",
"end_date": "2017-07-10T00:00:00Z",
"queries": [
    {
        "id": 79520285,
        "created_at": "2017-06-19T08:56:06Z",
        "submitted_by": "user1@qubole.com",
        "command_type": "HiveCommand",
        "command_summary": "select count(*) from default_qubole_memetracker where month=\"2008-08\";",
        "status": "done",
        "end_time": "2017-06-19 09:05:12 +0000",
        "tags": "",
        "fs_bytes_read": 4429520896,
        "fs_bytes_written": 8,
        "cpu": null
    },
    {
        "id": 78826792,
        "created_at": "2017-06-15T07:01:23Z",
        "submitted_by": "user@qubole.com",
        "command_type": "HiveCommand",
        "command_summary": "SELECT * FROM `tpcds_orc_1000`.`customer`",
        "status": "done",
        "end_time": "2017-06-15 07:02:53 +0000",
        "tags": "JDBC 1.0.7",
        "fs_bytes_read": 683353856,
        "fs_bytes_written": 1629612160,
        "cpu": 74910
    },
    {
        "id": 78828387,
        "created_at": "2017-06-15T07:11:24Z",
        "submitted_by": "user@qubole.com",
        "command_type": "HiveCommand",
        "command_summary": "SELECT COUNT(* ) FROM `tpcds_orc_1000`.`customer`",
        "status": "done",
        "end_time": "2017-06-15 07:12:03 +0000",
        "tags": "JDBC 1.0.7-SNAPSHOT",
        "fs_bytes_read": 683353856,
        "fs_bytes_written": 9,
        "cpu": 25680
    },
    {
        "id": 78830962,
        "created_at": "2017-06-15T07:23:20Z",
        "submitted_by": "user@qubole.com",
        "command_type": "HiveCommand",
        "command_summary": "SELECT * FROM `tpcds_orc_1000`.`customer`",
        "status": "done",
        "end_time": "2017-06-15 07:24:44 +0000",
        "tags": "JDBC 1.0.7",
        "fs_bytes_read": 683353856,
        "fs_bytes_written": 1629612160,
        "cpu": 76200
    },
    {
        "id": 78652381,
        "created_at": "2017-06-14T08:22:17Z",
        "submitted_by": "user@qubole.com",
        "command_type": "HiveCommand",
        "command_summary": "set hive.cli.print.header=false;\nset hive.resultset.use.unique.column.names=false; \nSELECT * FROM `table`.`calcs`",
        "status": "done",
        "end_time": "2017-06-14 08:24:14 +0000",
        "tags": "ODBC 1.0.0.1001",
        "fs_bytes_read": 23357610,
        "fs_bytes_written": 7470759,
        "cpu": 4330
    },
    {
        "id": 78825573,
        "created_at": "2017-06-15T06:54:44Z",
        "submitted_by": "user@qubole.com",
        "command_type": "HiveCommand",
        "command_summary": "SELECT * FROM `table`.`calcs`",
        "status": "done",
        "end_time": "2017-06-15 06:57:54 +0000",
        "tags": "JDBC 1.0.7",
        "fs_bytes_read": 23357610,
        "fs_bytes_written": 7470759,
        "cpu": 4320
    }
    ]
}
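
Goal (illustrative)

To page through failed commands by combining the documented offset, limit, and status parameters. This sketch is not part of the original examples; it simply reuses parameters from the table above.

curl -i -X GET -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Accept: application/json" \
"https://gcp.qubole.com/api/v1.2/reports/all_commands?status=error&limit=50&offset=0"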
Canonical Hive Commands Report
GET /api/v1.2/reports/canonical_hive_commands

This API provides the canonical Hive commands report in JSON format. Currently, this report is not generated; Qubole intends to provide it soon.

Note

The following points are related to a report API:

  • If the difference between the start date and end date is more than 60 days, the system defaults to a 1-month window from the current day’s date.
  • If either the start date or end date is not provided, the system defaults to a 1-month window from the current day’s date.
  • If you want to get data for a window of more than 2 months, write an email to help@qubole.com.
Required Role

The following users can make this API call:

  • Users who belong to the system-admin or system-user group.
  • Users who belong to a group associated with a role that allows viewing canonical Hive commands reports. See Managing Groups and Managing Roles for more information.
Parameters
Parameter Description
start_date The date from which you want the report (inclusive). This parameter supports the timestamp in the UTC timezone (YYYY-MM-DDTHH:MM:SSZ) format. The date cannot be earlier than 90 days before the current date.
end_date The date until which you want the report. The report contains data from this date also. The API default is today or now. This parameter also supports timestamp in the UTC timezone (YYYY-MM-DDTHH:MM:SSZ) format.
offset The starting point of the results. The API default is 0.
limit The number of results to fetch. The API default is 10.
sort_column The column used to sort the report. Since this report returns the top canonical Hive commands, the sort order is always descending. Valid choices are frequency, cpu, fs_bytes_read, and fs_bytes_written. The API default is frequency.
show_ast Also return the serialized AST corresponding to the canonical query. (Not returned by default.)
Response Parameters
Parameter Description
start_date The actual starting date of the report.
end_date The actual ending date of the report.
sort_column The sort column used.

An array of:

canonical_query_id The ID of the canonical query.
canonical_query The AST dump of the canonical query. (This is returned only when the show_ast parameter is passed.)
recent_example The most recent example of this type of query.
frequency The number of queries of this type.
cpu The total cumulative CPU (in ms) consumed by these queries.
fs_bytes_read The total bytes read by these queries.
fs_bytes_written The total bytes written by these queries.
Examples
Without any parameters
curl -i -X GET -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Accept: application/json" \
"https://gcp.qubole.com/api/v1.2/reports/canonical_hive_commands"

Sample Response

{
  "sort_column": "frequency",
  "canonical_queries": [
    {
      "canonical_query_id": "af09cd5799e52f450a87e236f453b864833afac97603409a17f3df4d010b1814",
      "fs_bytes_written": 0,
      "fs_bytes_read": 0,
      "frequency": 8800,
      "cpu": 0,
      "recent_example": "alter table demo_memetracker recover partitions"
    },
    {
      "canonical_query_id": "9548ac7ec7defe3c3251da2544ec545c9bd578cb308c6c3c1936e48df0bdfdb4",
      "fs_bytes_written": 0,
      "fs_bytes_read": 0,
      "frequency": 4547,
      "cpu": 0,
      "recent_example": "alter table daily_tick_data recover partitions"
    },
    {
      "canonical_query_id": "69dae07fc876927b9daba6279c962bc343131c08d8f9f98adfa0c05ef90b40a4",
      "fs_bytes_written": 0,
      "fs_bytes_read": 0,
      "frequency": 768,
      "cpu": 0,
      "recent_example": "show tables"
    },
    {
      "canonical_query_id": "89720fe23d2a85ac217a3b230e992c45dd523b65e6d45863cc410f4b5e4795ea",
      "fs_bytes_written": 0,
      "fs_bytes_read": 0,
      "frequency": 53,
      "cpu": 0,
      "recent_example": "select * from `default`.`memetracker` limit 400"
    },
    {
      "canonical_query_id": "04bccd848172842c8fadd687aef72ac2161f72895dfd3c1d3c31a96411d34095",
      "fs_bytes_written": 0,
      "fs_bytes_read": 0,
      "frequency": 49,
      "cpu": 0,
      "recent_example": "select * from `default`.`30days_test` limit 1 "
    },
    {
      "canonical_query_id": "9996d665cf077f60b1ee87d3b5b80cd65ce078f77935096441433628909b9ddb",
      "fs_bytes_written": 9,
      "fs_bytes_read": 28482500000,
      "frequency": 48,
      "cpu": 0,
      "recent_example": "select count(*) from memetracker"
    },
    {
      "canonical_query_id": "492ea35ff3d58d0d07e70bcc68ea33eadb7bb572f6fdf14f7220931cc94b1abc",
      "fs_bytes_written": 0,
      "fs_bytes_read": 0,
      "frequency": 37,
      "cpu": 0,
      "recent_example": "select * from `default`.`default_qubole_airline_origin_destination` limit 1000"
    },
    {
      "canonical_query_id": "5e2bb3326c4cf2c5b669d60729c5697fb6fdef4f1e09be41e37f424eb96b0c74",
      "fs_bytes_written": 0,
      "fs_bytes_read": 0,
      "frequency": 30,
      "cpu": 0,
      "recent_example": "select * from default_qubole_memetracker limit 10;"
    },
    {
      "canonical_query_id": "aa1c7a294f6e18feec68175a643814f06e180f4ac5e62eb6b556d9bf72830bc2",
      "fs_bytes_written": 25326,
      "fs_bytes_read": 85944,
      "frequency": 22,
      "cpu": 0,
      "recent_example": "select * from test_csv limit 5;"
    },
    {
      "canonical_query_id": "2145af0ee70e1cd93c9901cd41dee8285faacaefe86e4a4f22880316cc4e63c3",
      "fs_bytes_written": 21050200,
      "fs_bytes_read": 321138000,
      "frequency": 21,
      "cpu": 0,
      "recent_example": "select * from demo_memetracker limit 100"
    }
  ]
}
With a different sort column and limit and show_ast=true
curl -i -X GET -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Accept: application/json" \ "https://gcp.qubole.com/api/v1.2/reports/canonical_hive_commands?sort_column=cpu&limit=2&show_ast=true"

Sample Response

{
 "sort_column": "cpu",
 "canonical_queries": [
    {
      "canonical_query_id": "d9635e3ad5501c9ad47bb728c35b63e1b41f8c2ba0fb4f7533b9ab701ce503c4",
      "canonical_query": "(null(TOK_QUERY(TOK_FROM(TOK_TABREF(TOK_TABNAME(lineitem))))(TOK_INSERT(TOK_DESTINATION(TOK_DIR(TOK_TMP_FILE)))(TOK_SELECT(TOK_SELEXPR(TOK_FUNCTIONSTAR(count))))))())",
      "fs_bytes_written": 18,
      "fs_bytes_read": 7726360000,
      "frequency": 2,
      "cpu": 423130,
      "recent_example": "set fs.s3.inputpathprocessor=false;\nselect count(*) from lineitem;"
    },
    {
      "canonical_query_id": "ea1a37cf6b4694293d15cadfaf4bbae2459f12475cd86ea90e6c4f8e31945bda",
      "canonical_query": "(null(TOK_QUERY(TOK_FROM(TOK_TABREF(TOK_TABNAME(default_qubole_memetracker))))(TOK_INSERT(TOK_DESTINATION(TOK_DIR(TOK_TMP_FILE)))(TOK_SELECT(TOK_SELEXPR(TOK_FUNCTIONSTAR(count))))(TOK_WHERE(=(TOK_TABLE_OR_COL(month))(LITERAL)))))())",
      "fs_bytes_written": 108,
      "fs_bytes_read": 54673600000,
      "frequency": 20,
      "cpu": 416200,
      "recent_example": "SELECT count(*) FROM default_qubole_memetracker where month = '2008-08';"
    }
 ]
}
For a specific time period
curl -i -X GET -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Accept: application/json" \ "https://gcp.qubole.com/api/v1.2/reports/canonical_hive_commands?start_date=2014-04-01&end_date=2014-04-21&sort_column=fs_bytes_read&limit=2"

Sample Response

{
  "canonical_queries": [
     {
       "canonical_query_id": "55ebb0cc47e0dc74c70245b026126bba191969dee1dc380a6f98698e6b194085",
       "cpu": 75720,
       "frequency": 1,
       "recent_example": "select dt, count(*) from  junk_temp \ngroup by dt order by dt\n\n\n",
       "fs_bytes_read": 308582016,
       "fs_bytes_written": 1514
     },
     {
       "canonical_query_id": "1c421d2e65407650650cbc2ee80f9a59863875f52d1a9ddd5a051118678a3a6c",
       "cpu": 34980,
       "frequency": 1,
       "recent_example": "select created_at from  junk_temp \nwhere dt=2014-01-03 limit 10\n\n",
       "fs_bytes_read": 308582016,
       "fs_bytes_written": 0
     }
  ],
   "start_date": "2014-03-31T10:00:00Z",
   "end_date": "2014-04-21T20:00:00Z",
   "sort_column": "fs_bytes_read"
}

To learn more about the canonicalization of Hive queries, see the blog post.

Caution

Qubole started collecting CPU metrics only in the last week of December 2013, so for queries run before that, the CPU metric is reported as 0.

Cluster Nodes Report
GET /api/v1.2/reports/cluster_nodes

This API provides the cluster nodes report in JSON format.

Note

The following points are related to a report API:

  • If the difference between the start date and end date is more than 60 days, the system defaults to a 1-month window from the current day’s date.
  • If either the start date or end date is not provided, the system defaults to a 1-month window from the current day’s date.
  • If you want to get data for a window of more than 2 months, write an email to help@qubole.com.
Required Role

The following users can make this API call:

  • Users who belong to the system-admin or system-user group.
  • Users who belong to a group associated with a role that allows viewing cluster nodes reports. See Managing Groups and Managing Roles for more information.
Parameters
Parameter Description
start_date The date from which you want the report (inclusive). This parameter supports the timestamp in the UTC timezone (YYYY-MM-DDTHH:MM:SSZ) format. The date cannot be earlier than 90 days before the current date.
end_date The date until which you want the report (inclusive). The API default is today. This parameter also supports timestamp in the UTC timezone (YYYY-MM-DDTHH:MM:SSZ) format.
Response Parameters
Parameter Description
start_date The starting date of the report.
end_date The ending date of the report.

An array of:

role The role of the instance (coordinator or worker).
cluster_id The ID of the cluster.
public_ip The public hostname of the cluster node.
ec2_instance_id The ec2 instance ID of the cluster node.
private_ip The private hostname of the cluster node.
start_time The time at which the cluster node was started.
end_time The time at which the cluster node was terminated.
Examples

Goal

To get the default report.

curl -i -X GET -H "X-AUTH-TOKEN: $AUTH_TOKEN" \
-H "Accept: application/json" \
"https://gcp.qubole.com/api/v1.2/reports/cluster_nodes"

Goal

To get the report for clusters online during a specific time period.

curl -i -X GET -H "X-AUTH-TOKEN: $AUTH_TOKEN" \
-H "Accept: application/json" \
"https://gcp.qubole.com/api/v1.2/reports/cluster_nodes?start_date=2014-04-01&end_date=2014-04-21"

Sample Response

{
  "end_date": "2014-04-21T10:00:00Z",
  "cluster_nodes": [
    {
      "ec2_instance_id":"i-437ad9ac",
      "private_ip":"ip-10-40-7-209.ec2.internal",
      "start_time":"2015-02-12T06:59:42Z",
      "role":"master",
      "public_ip":"23-20-255-83.compute-1.gcp.com",
      "end_time":"2015-02-12T08:13:52Z",
      "cluster_id":10268
    },
    {
      "ec2_instance_id":"i-887bd867",
      "private_ip":"ip-10-165-32-171.ec2.internal",
      "start_time":"2015-02-12T06:59:42Z",
      "role":"node0001",
      "public_ip":"54-144-51-140.compute-1.gcp.com",
      "end_time":"2015-02-12T08:13:52Z",
      "cluster_id":10268
    }
  ],
  "start_date": "2014-04-01T05:00:00Z"
}
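
For convenience, here is a hedged sketch that extracts a compact per-node summary from this report; it assumes the jq utility is installed locally and uses only the fields shown in the sample response above.

curl -s -X GET -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Accept: application/json" \
"https://gcp.qubole.com/api/v1.2/reports/cluster_nodes?start_date=2014-04-01&end_date=2014-04-21" \
| jq '.cluster_nodes[] | {cluster_id, role, private_ip, start_time, end_time}'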

Roles API

Create a Role
POST /api/v1.2/roles

This API is used to create a role, such as a system user or system administrator role, in a QDS account.

Required Role

The following users can make this API call:

  • Users who belong to the system-admin or system-user group.
  • Users who belong to a group associated with a role that allows creating roles. See Managing Groups and Managing Roles for more information.
Parameters

Note

Parameters marked in bold are mandatory. Others are optional and have default values.

Parameter Description
name Name of the role that must be unique in the Qubole account.
policies

Array of policies to be assigned to the role. Each policy includes the following parameters:

  • resource: Name of the resource such as all and cluster. The resources are:

    Note

    Resources, Actions, and What they Mean describes the policy resources that are listed in the Control Panel > Manage Roles UI.

    • all - It denotes All resources.
    • account - It denotes the Account resource.
    • app - It denotes the App resource.
    • cluster - It denotes the Clusters resource.
    • command - It denotes the Commands resource.
    • datastore - It denotes the Data Source resource.
    • data_preview - It denotes Read Access to data.
    • environment - It denotes the Environments and Packages resource.
    • folder - It denotes the Folder resource.
    • group - It denotes the Groups resource.
    • note - It denotes the Notes (notebooks) resource.
    • notebook_dashboard - It denotes the Notebook Dashboards resource.
    • object_storage - It denotes the Object Storage resource.
    • role - It denotes the Roles resource.
    • saved_query - It denotes the Workspace resource.
    • scheduler - It denotes the Scheduler resource.
    • scheduler_instance - It denotes the Scheduler Instance resource.
    • template - It denotes the Template resource.
    • qbol_user - It denotes the Users resource.
  • action: Name of the action on the particular resource.

  • access: Whether to allow or deny the action. It can be either allow or deny.

Request API Syntax
curl -X POST -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" \
-d '{"policies":"[{\"resource\":\"all\",\"action\":[\"all\"],\"access\":\"allow\"}]",
     "name":"my_role"}' \ "https://gcp.qubole.com/api/v1.2/roles"
Sample Request
curl -X POST -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" \
-d '{"policies":"[{\"resource\":\"all\",\"action\":[\"all\"],\"access\":\"allow\"}]",
     "name":"my_role"}' \ "https://gcp.qubole.com/api/v1.2/roles"
Sample Response
{"status":"success","roles":{"id":6,"name":"my_role","source":"user-defined",
 "policy":"[{\"resource\":\"all\",\"action\":[\"all\"],\"access\":\"allow\"}]","account_id":64}}
List Groups with a Specific Role
GET /api/v1.2/roles/<role-id/name>/groups

This API is used to list groups assigned with a particular role.

Required Role

The following users can make this API call:

  • Users who belong to the system-admin or system-user group.
  • Users who belong to a group associated with a role that allows viewing groups assigned with a specific role. See Managing Groups and Managing Roles for more information.
Parameters

Note

Parameters marked in bold are mandatory. Others are optional and have default values.

Parameter Description
<role-id/name> ID or name of the role whose associated groups in the current account you want to list.
Request API Syntax

Here is the syntax of the Request API.

curl -X GET -H "X-AUTH-TOKEN:  $AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" \
-d '{}' \ "https://gcp.qubole.com/api/v1.2/roles/<role-id/name>/groups"
Sample Request
curl -X GET -H "X-AUTH-TOKEN:  $AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" \
-d '{}' \ "https://gcp.qubole.com/api/v1.2/roles/system-admin/groups"
Delete a Role
DELETE /api/v1.2/roles/<role-id/name>

This API is used to delete a role that is not required.

Required Role

The following users can make this API call:

  • Users who belong to the system-admin or system-user group.
  • Users who belong to a group associated with a role that allows deleting roles. See Managing Groups and Managing Roles for more information.
Parameters

Note

Parameters marked in bold are mandatory. Others are optional and have default values.

Parameter Description
<role-id/name> ID or name of the role that is to be deleted
Request API Syntax

Here is the Request API syntax.

curl -X DELETE -H "X-AUTH-TOKEN:  $AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" \
-d '{}' \ "https://gcp.qubole.com/api/v1.2/roles/<role-id>"
Sample Request
curl -X DELETE -H "X-AUTH-TOKEN:  $AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" \
-d '{}' \ "https://gcp.qubole.com/api/v1.2/groups/105/roles/19"
Sample Response
Success
{"status":"done"}

Scheduler API

Create a Schedule
POST /api/v1.2/scheduler/

This API creates a new schedule to run commands automatically at a certain frequency within a specified interval.

Required Role

The following users can make this API call:

  • Users who belong to the system-admin or system-user group.
  • Users who belong to a group associated with a role that allows creating a schedule. See Managing Groups and Managing Roles for more information.
Parameters

Note

Parameters marked in bold below are mandatory. Others are optional and have default values.

Parameter Description
command_type A valid command type supported by Qubole. For example, HiveCommand, HadoopCommand, PigCommand.
command

JSON object describing the command. Refer to the command-api for more details.

Sub fields can use macros. Refer to the Qubole Scheduler for more details.

start_time Start datetime for the schedule. When a cron expression is used, the scheduler calculates the Next Materialized Time (NMT)/start time using the current time as the base time and the cron expression that is passed; the start time itself is not honored with a cron expression.
end_time End datetime for the schedule
retry

Denotes the number of retries for a job. Valid values of retry are 1, 2, and 3.

Caution

Configuring retries will just do a blind retry of a Presto query. This may lead to data corruption for non-Insert Overwrite Directory (IOD) queries.

retry_delay Denotes the time interval between the retries when a job fails.
frequency Set this option or cron_expression but do not set both options. Specify how often the schedule should run. Input is an integer. For example, frequency of one hour/day/month is represented as {"frequency":"1"}
time_unit Denotes the time unit for the frequency. Its default value is days. Accepted value is minutes, hours, days, weeks, or months.
cron_expression Set this option or frequency but do not set both options. The standard cron format is “s, m, h, d, M, D, Y” where s is second, m is minute, M is month, d is date, and D is day of the week. Only year (Y) is optional. Example - "cron_expression":"0 0 12 * * ?". For more information, see Cron Trigger Tutorial.
name A user-defined name for a schedule. If name is not specified, then a system-generated Schedule ID is set as the name.
label Specify a cluster label that identifies the cluster on which the schedule API call must be run.
macros

Expressions to evaluate macros. Macros can be used in parameterized commands.

Refer to the Macros in Scheduler page for more details.

no_catch_up Set this parameter to true if you want to skip the schedule actions that were supposed to have run in the past and run only the upcoming schedule actions. By default, this parameter is set to false. When a new schedule is created, the scheduler runs schedule actions from the start time to the current time. For example, if a daily schedule with a start time of Jun 1, 2015 is created on Dec 1, 2015, schedule actions are run for Jun 1, 2015, Jun 2, 2015, and so on. If you do not want the scheduler to run the missed schedule actions for the months before December, set no_catch_up to true. The main use of skipping schedule actions is when you suspend a schedule and resume it later, in which case there will be more than one pending schedule action and you might want to skip the earlier ones. For more information, see Understanding the Qubole Scheduler Concepts.
time_zone

Timezone of the start and end time of the schedule.

Scheduler will understand ZoneInfo identifiers. For example, Asia/Kolkata.

For a list of identifiers, check column 3 in List of TZ in databases.

Default value is UTC.

command_timeout You can set the command timeout in seconds. Its default value is 129600 seconds (36 hours), and any other value that you set must be less than 36 hours. QDS checks the timeout for a command every 60 seconds. If the timeout is set to 80 seconds, the command gets killed at the next check, that is, after 120 seconds. By setting this parameter, you can prevent the command from running for the full 36 hours.
time_out Unit is minutes. A number that represents the maximum amount of time the schedule should wait for dependencies to be satisfied.
concurrency Specify how many schedule actions can run at a time. Default value is 1.
dependency_info

Describe dependencies for this schedule.

Check the Hive Datasets as Schedule Dependency for more information.

notification It is an optional parameter that is set to false by default. You can set it to true if you want to be notified through email about instance failure. notification provides more information.
notification
Parameter Description
is_digest It is a notification email type that is set to true if the schedule periodicity is in minutes or hours. If it is set to false, the email type is immediate by default.
notify_failure If this option is set to true, you receive schedule failure notifications.
notify_success If this option is set to true, you receive schedule success notifications.
notification_channels It is the notification channel ID. To learn how to get the notification channel ID, see Creating Notification Channels.
dependency_info
Parameter Description
files Use this parameter if there is a dependency on S3 files; it has the following sub-options. For more information, see Configuring GS Files Data Dependency.
path It is the S3 path of the dependent file (with data) based on which the schedule runs.
window_start It denotes the start day or time.
window_end It denotes the end day or time.
hive_tables Use this parameter if there is a dependency on Hive table data that has partitions. For more information, see Configuring Hive Tables Data Dependency.
schema It is the database that contains the partitioned Hive table.
name It is the name of the partitioned Hive table.
window_start It denotes the start day or time.
window_end It denotes the end day or time.
interval It denotes the dataset interval and defines how often the data is generated. Hive Datasets as Schedule Dependency provides more information. You must also specify the incremental time that can be in minutes, hours, days, weeks, or months. The usage is "interval":{"days":"1"}. The default interval is 1 day.
column It denotes the partitioned column name. You must specify the date-time mask through the the_date parameter, which denotes how to convert the date to a string for the partition. The usage is "columns":{"the_date":"<value>"}. The <value> can be a macro or a string.
Response

The response contains a JSON object representing the created schedule.

Note

There is a rerun limit for schedule reruns to be processed concurrently at a given point of time. Understanding the Qubole Scheduler Concepts provides more information.

Example 1

Goal: Create a new schedule to run Hive queries.

Use the following query as shown in the example below:

CREATE EXTERNAL TABLE daily_tick_data (
    date2 string,
    open float,
    close float,
    high float,
    low float,
    volume INT,
    average FLOAT)
PARTITIONED BY (
    stock_exchange STRING,
    stock_symbol STRING,
    year STRING,
    date1 STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION 's3n://paid-qubole/default-datasets/stock_ticker';

date1 is the date in the format YYYY-MM-DD

The dataset is available from 2012-07-01.

For this example, let us assume that the dataset is updated every day at 1 AM UTC, and the schedules run at 2 AM UTC every day.

The query shown below aggregates the data for every stock symbol, every day.

Command

curl -i -X POST -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Accept: application/json" -H "Content-type: application/json" \
-d  '{
      "command_type":"HiveCommand",
      "command": {
                  "query": "select stock_symbol, max(high), min(low), sum(volume) from daily_tick_data where date1='$formatted_date$' group by stock_symbol"
                 },
      "macros": [
                {
                 "formatted_date": "Qubole_nominal_time.format('YYYY-MM-DD')"
                }
                ],
      "notification":{"is_digest": false,
                    "notification_channels" : [728, 400],
                    "notify_failure": true, "notify_success": false},
      "start_time": "2012-07-01T02:00Z",
      "end_time": "2022-07-01T02:00Z",
      "frequency": 1,
      "time_unit": "days",
      "time_out":10,
      "command_timeout":36000,
      "dependency_info": {}
      }' \
      "https://gcp.qubole.com/api/v1.2/scheduler"

Sample Response

{
 "time_out":10,
 "status":"RUNNING",
 "start_time":"2012-07-01 02:00",
 "label":"default",
 "concurrency":1,
 "frequency":1,
 "no_catch_up":false,
 "template":"generic",
 "command":{
            "sample":false,"loader_table_name":null,"md_cmd":null,"script_location":null,"approx_mode":false,"query":"select stock_symbol, max(high), min(low), sum(volume) from daily_tick_data where date1='$formatted_date$' group by stock_symbol","loader_stable":null,"approx_aggregations":false
           },
 "command_timeout":"36000"
 "time_zone":"UTC",
 "time_unit":"days",
 "end_time":"2022-07-01 02:00",
 "user_id":108,
 "macros":[{"formatted_date":"Qubole_nominal_time.format('YYYY-MM-DD')"}],
 "incremental":{},
 "command_type":"HiveCommand",
 "name":"3159",
 "dependency_info":{},
 "id":3159,
 "next_materialized_time":null
 "template": "generic",
 "pool": null,
 "label": "default",
 "is_digest": false,
 "can_notify": false,
 "digest_time_hour": 0,
 "digest_time_minute": 0,
 "email_list": "qubole@qubole.com",
 "bitmap": 0
}

Note the schedule ID (in this case 3159), which is used in other examples.

export SCHEDID=3159
Example 2

Here is an API sample request that has notification parameters set.

curl -i -X POST -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Accept: application/json" -H "Content-type: application/json" \
-d  '{
      "command_type":"HiveCommand",
      "command": {
                  "query": "select stock_symbol, max(high), min(low), sum(volume) from daily_tick_data where date1='$formatted_date$' group by stock_symbol"
                 },
      "macros": [
                 {
                  "formatted_date": "Qubole_nominal_time.format('YYYY-MM-DD')"
                  }
                ],
      "notification":{"is_digest": true, "digest_time_hour":04, "digest_time_minute":30,
                      "notification_channels" : [728, 400],
                      "notify_failure": true, "notify_success": false}`,
       "start_time": "2012-07-01T02:00Z",
       "end_time": "2022-07-01T02:00Z",
       "frequency": 1,
       "time_unit": "days",
       "time_out":10,
       "dependency_info": {"wait_for_s3_files": [{file1: {start_time:}, {end_time:}, {file2: }]
       }' \
       "https://gcp.qubole.com/api/v1.2/scheduler"
Example 3

Here is an API sample request that has dependency on files on S3.

curl -i -X POST -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Accept: application/json" -H "Content-type: application/json" \
-d  '{
      "command_type":"HiveCommand",
      "command": {
                  "query": "select stock_symbol, max(high), min(low), sum(volume) from daily_tick_data where date1='$formatted_date$' group by stock_symbol"
                 },
      "macros": [
                 {
                  "formatted_date": "Qubole_nominal_time.format('YYYY-MM-DD')"
                  }
                ],
      "notification":{"is_digest": true, "digest_time_hour":04, "digest_time_minute":30,
                      "notification_channels" : [728, 400],
                      "notify_failure": true, "notify_success": false}`,
       "start_time": "2012-07-01T02:00Z",
       "end_time": "2022-07-01T02:00Z",
       "frequency": 1,
       "time_unit": "days",
       "time_out":10,
       "dependency_info": {
                      "files": [
                            {
                             "path" : "s3://<your S3 bucket>/data/data1_30days/170614/",
                      "window_start": -29,
                      "window_end": 0
                      }
                     ]
                   }
       }' \
       "https://gcp.qubole.com/api/v1.2/scheduler"
Example 4

Here is an API sample request that has dependency on partitioned columns of a Hive table.

curl -i -X POST -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Accept: application/json" -H "Content-type: application/json" \
-d  '{
      "command_type":"HiveCommand",
      "command": {
              "query": "select stock_symbol, max(high), min(low), sum(volume) from daily_tick_data where date1='$formatted_date$' group by stock_symbol"
             },
      "macros": [
             {
              "formatted_date": "Qubole_nominal_time.format('YYYY-MM-DD')"
              }
            ],
      "notification":{"is_digest": true, "digest_time_hour":04, "digest_time_minute":30,
                  "notification_channels" : [728, 400],
                  "notify_failure": true, "notify_success": false}`,
      "start_time": "2018-02-12 00:00",
      "end_time": "2021-02-12 00:00",
      "frequency": 12,
      "time_unit": “months”,
      "time_out":10,
      "dependency_info": {
                  "hive_tables":[
                  {"schema":"daily_tick_data","name":"daily_cluster_nodes","window_start":"-1","window_end":"-1","interval":{"days":"1"},"columns":{"dt":"%Y-%d","source”:[“”]}}]
     }' \
     "https://gcp.qubole.com/api/v1.2/scheduler"
Example 5

Here is an API sample request to schedule a workflow command.

curl -i -X POST -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Accept: application/json" -H "Content-type: application/json" \
-d  '{
       "command_type": "CompositeCommand",
       "command": {
                "sub_commands":
                [{
                    "query": "select stock_symbol, max(high), min(low), sum(volume) from daily_tick_data where date1='$formatted_date$' group by stock_symbol",
                    "command_type": "HiveCommand"
                 },
                 {
                     "query": "select stock_symbol, max(high), min(low), sum(volume) from daily_tick_data where date2='$formatted_date$' group by stock_symbol",
                     "command_type": "HiveCommand"
                 }
                ]
       },

       "start_time": "2012-07-01T02:00Z",
       "end_time": "2022-07-01T02:00Z",
       "frequency": 1,
       "time_unit": "days",
       "time_out": 10,
       "dependency_info": {}
       }' \ "https://gcp.qubole.com/api/v1.2/scheduler"
Clone a Schedule
POST /api/v1.2/scheduler/(SchedulerID)/duplicate

Use this API to clone an existing schedule by providing a new schedule name.

Required Role

The following users can make this API call:

  • Users who belong to the system-admin or system-user group.
  • Users who belong to a group associated with a role that allows cloning a schedule. See Managing Groups and Managing Roles for more information.
Parameters

Note

Parameters marked in bold below are mandatory. Others are optional and have default values.

Parameter Description
command_type A valid command type supported by Qubole. For example, HiveCommand, HadoopCommand, PigCommand.
command

JSON object describing the command. Refer to the Command API for more details.

Sub fields can use macros. Refer to the Qubole Scheduler for more details.

start_time Start datetime for the schedule
end_time End datetime for the schedule
frequency Set this option or cron_expression but do not set both options. Specify how often the schedule should run. Input is an integer. For example, frequency of one hour/day/month is represented as {"frequency":"1"}
time_unit Denotes the time unit for the frequency. Its default value is days. Accepted value is minutes, hours, days, weeks, or months.
cron_expression Set this option or frequency but do not set both options. The standard cron format is "s, m, h, d, M, D, Y" where s is second, m is minute, h is hour, d is date, M is month, and D is day of the week. Only year (Y) is optional. Example - "cron_expression":"0 0 12 * * ?". For more information, see Cron Trigger Tutorial.
name A user-defined name for a schedule. If name is not specified, then a system-generated Schedule ID is set as the name. While cloning an existing schedule, you must change the name.
label Specify a cluster label that identifies the cluster on which the schedule API call must be run.
macros

Expressions to evaluate macros. Macros can be used in parameterized commands.

Refer to the Macros in Scheduler page for more details.

no_catch_up Set this parameter to true if you want to skip the schedule actions that were supposed to have run in the past and run only the upcoming schedule actions. By default, this parameter is set to false. When a new schedule is created, the scheduler runs schedule actions from the start time to the current time. For example, if a daily schedule with a start time of Jun 1, 2015 is created on Dec 1, 2015, schedule actions are run for Jun 1, 2015, Jun 2, 2015, and so on. If you do not want the scheduler to run the missed schedule actions for the months before December, set no_catch_up to true. Skipping schedule actions is mainly useful when you suspend a schedule and resume it later; in that case there can be more than one pending schedule action, and you might want to skip the earlier ones. For more information, see Understanding the Qubole Scheduler Concepts.
time_zone

Timezone of the start and end time of the schedule.

Scheduler will understand ZoneInfo identifiers. For example, Asia/Kolkata.

For a list of identifiers, check column 3 in List of TZ in databases.

Default value is UTC.

command_timeout The command timeout, in seconds. Its default value is 129600 seconds (36 hours), and any value that you set must be less than 36 hours. QDS checks the timeout for a command every 60 seconds, so if the timeout is set to 80 seconds, the command gets killed at the next check, that is, after 120 seconds. Set this parameter to prevent a command from running for the full 36 hours.
time_out Unit is minutes. A number that represents the maximum amount of time the schedule should wait for dependencies to be satisfied.
concurrency Specify how many schedule actions can run at a time. Default value is 1.
dependency_info

Describe dependencies for this schedule.

Check the Hive Datasets as Schedule Dependency for more information.

notification It is an optional parameter that is set to false by default. Set it to true if you want to be notified through email about instance failure. The notification sub-parameters are described below.
notification
Parameter Description
is_digest It is the notification email type. Set it to true if the schedule periodicity is in minutes or hours. If it is set to false, the email type is immediate by default.
notify_failure If this option is set to true, you receive schedule failure notifications.
notify_success If this option is set to true, you receive schedule success notifications.
notification_email_list By default, the current user’s email ID is added. You can add additional email IDs as required.
dependency_info
Parameter Description
files Use this parameter if there is dependency on S3 files and it has the following sub options. For more information, see Configuring GS Files Data Dependency.
path It is the S3 path of the dependent file (with data) based on which the schedule runs.
window_start It denotes the start day or time.
window_end It denotes the end day or time.
hive_tables Use this parameter if there is dependency on Hive table data that has partitions. For more information, see Configuring Hive Tables Data Dependency.
schema It is the database that contains the partitioned Hive table.
name It is the name of the partitioned Hive table.
window_start It denotes the start day or time.
window_end It denotes the end day or time.
interval It denotes the dataset interval and defines how often the data is generated. Hive Datasets as Schedule Dependency provides more information. You must also specify the incremental time that can be in minutes, hours, days, weeks, or months. The usage is "interval":{"days":"1"}. The default interval is 1 day.
column It denotes the partitioned column name. You must specify a date-time mask through the the_date parameter, which denotes how to convert a date to a string for the partition. The usage is "columns":{"the_date":"<value>"}. The <value> can be a macro or a string.
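
For illustration, here is a sketch of a clone request that, besides the new name, also overrides a few of the optional attributes listed above. The schedule ID 3159 and the specific values are assumptions chosen for this sketch; the sample below the Response section passes only the name.

curl -i -X POST -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Accept: application/json" -H "Content-type: application/json" \
-d '{
      "name": "schedule1-clone",
      "frequency": 2,
      "time_unit": "hours",
      "time_zone": "Asia/Kolkata",
      "no_catch_up": true
    }' \
"https://gcp.qubole.com/api/v1.2/scheduler/3159/duplicate"
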
Response

The response contains a JSON object representing the cloned schedule.

Note

There is a rerun limit for schedule reruns to be processed concurrently at a given point of time. Understanding the Qubole Scheduler Concepts provides more information.

Example

Goal: Clone an existing schedule, for example: schedule ID 3159, to create a new schedule. For more information on how to create a schedule, see Create a Schedule.

While creating a schedule (see Create a Schedule), we created a schedule that aggregates data every day, for every stock symbol, and for each stock exchange. If, for example, you want the cloned schedule to use an edited query that also calculates the total transaction amount for each stock in a day, provide the following query.

{
  "command_type":"HiveCommand",
  "command": {
    "query": "select stock_symbol, stock_exchange, max(high), min(low), sum(volume) from daily_tick_data where date1='$formatted_date$' group by stock_symbol, stock_exchange"
  },
  "start_time": "2012-11-01T02:00Z",
  "end_time": "2022-10-01T02:00Z"
}

Command

curl -i -X POST -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Accept: application/json" -H "Content-type: application/json" \
-d '{ "name": "schedule1" }' \
"https://gcp.qubole.com/api/v1.2/scheduler/3159/duplicate"

Sample Response

{
 "time_out":10,
 "status":"RUNNING",
 "start_time":"2012-07-01 02:00",
 "label":"default",
 "concurrency":1,
 "frequency":1,
 "no_catch_up":false,
 "template":"generic",
 "command":{
            "sample":false,"loader_table_name":null,"md_cmd":null,"script_location":null,"approx_mode":false,"query":"select stock_symbol, max(high), min(low), sum(volume) from daily_tick_data where date1='$formatted_date$' group by stock_symbol","loader_stable":null,"approx_aggregations":false
           },
 "time_zone":"UTC",
 "time_unit":"days",
 "end_time":"2022-07-01 02:00",
 "user_id":108,
 "macros":[{"formatted_date":"Qubole_nominal_time.format('YYYY-MM-DD')"}],
 "incremental":{},
 "command_type":"HiveCommand",
 "name":"schedule1",
 "dependency_info":{},
 "id":3160,
 "next_materialized_time":null
}
View a Schedule
GET /api/v1.2/scheduler/(int: id)

This API is used to view an existing schedule that is created to run commands automatically at a certain frequency in a specified interval.

Resource URI scheduler/id
Request Type GET
Supporting Versions v2.0
Return Value Json object representing the schedule.
Required Role

The following users can make this API call:

  • Users who belong to the system-admin or system-user group.
  • Users who belong to a group associated with a role that allows viewing schedule information. See Managing Groups and Managing Roles for more information.
Example
curl -i -X GET -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Accept: application/json" -H "Content-type: application/json" \
"https://gcp.qubole.com/api/v1.2/scheduler/${SCHEDID}"

Response

{
  "concurrency": 1,
  "time_unit": "days",
  "command": {
    "approx_mode": false,
    "query": "select stock_symbol, max(high), min(low), sum(volume) from daily_tick_data where date1='$formatted_date$' group by stock_symbol",
    "approx_aggregations": false,
    "sample": false
  },
  "user_id": 39,
  "dependency_info": {
    "hive_tables": [
      {
        "window_end": "0",
        "initial_instance": "2012-07-01T00:00Z",
        "name": "daily_tick_data",
        "interval": {
          "days": "1"
        },
        "columns": {
          "stock_symbol": [
            "ibm",
            "orcl"
          ],
          "stock_exchange": [
            "nasdaq",
            "nyse"
          ]
        },
        "window_start": "-1",
        "time_zone": "UTC"
      }
    ]
  },
  "time_out": 10,
  "macros": [
    {
      "formatted_date": "Qubole_nominal_time.format('YYYY-MM-DD')"
    }
  ],
  "end_time": "2022-07-01 02:00",
  "start_time": "2012-07-01 02:00",
  "frequency": 1,
  "id": 2266,
  "time_zone": "UTC",
  "command_type": "HiveCommand",
  "status": "RUNNING"
}
Change Ownership
PUT /api/v1.2/scheduler/(int: id)/owner

Use this API to change the ownership of an existing schedule.

Required Role

The following users can make this API call:

  • Users who belong to the system-admin group.
  • Users who belong to a group associated with a role that allows managing a schedule. See Managing Groups and Managing Roles for more information.
Parameters

Note

Parameters marked in bold below are mandatory.

Parameter Description
new_owner_email Provide the email ID of the (new) owner to whom you want to transfer the ownership of the schedule.
Response

The response contains a JSON object representing the email ID of the schedule’s new owner.

Example

Goal: Modify the owner of the schedule

curl -i -X PUT -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Accept: application/json" -H "Content-type: application/json" \
-d '{"new_owner_email":"user@xyz.com"}' \
"https://gcp.qubole.com/api/v1.2/scheduler/24/owner"

Response

{
"success": true,
"new_owner_email":"user@xyz.com"
}
List Schedules
GET /api/v1.2/scheduler/

This API is used to list all existing schedules created to run commands automatically at a certain frequency in a specified interval.

Resource URI scheduler/
Request Type GET
Supporting Versions v2.0
Return Value A JSON array of schedules. It displays all schedules in all states.

Note

A _SUCCESS file is created in the output folder for successful schedules. You can set mapreduce.fileoutputcommitter.marksuccessfuljobs to false to disable creation of _SUCCESS file or to true to enable creation of the _SUCCESS file.

Required Role

The following users can make this API call:

  • Users who belong to the system-admin or system-user group.
  • Users who belong to a group associated with a role that allows viewing all schedules. See Managing Groups and Managing Roles for more information.

Note

You can use the name parameter to fetch scheduled jobs by name. The search pattern must contain at least 3 characters. QDS displays partial and complete matches.

Parameters

Note

Parameters marked in bold below are mandatory. Others are optional and have default values.

Parameter Description
name Denotes the name of the job.
status Denotes the status (In Progress/Done/All/Killed/Failed) of the job.
page Denotes the page number that contains the scheduler jobs’ history. Its default value is 1. To see the entire list of jobs, set the value to all.
per_page Denotes the number of job instances to be displayed on a page. Its default value is 10.
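
As a sketch, the optional parameters above can be passed as query parameters on the GET request, in the same way per_page is used in the schedule actions examples later in this document. The specific values, and the assumption that name and status are accepted as query parameters, are illustrative only:

curl -i -X GET -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Accept: application/json" -H "Content-type: application/json" \
"https://gcp.qubole.com/api/v1.2/scheduler?name=daily&status=All&page=1&per_page=20"
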
Example
curl -i -X GET -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Accept: application/json" -H "Content-type: application/json" \
"https://gcp.qubole.com/api/v1.2/scheduler"

Sample Response

{
    "paging_info": {
        "next_page": 2,
        "previous_page": null,
        "per_page": 10
    },
    "schedules": [
        {
            "id": 8,
            "name": "8",
            "status": "KILLED",
            "concurrency": 1,
            "frequency": 1,
            "time_unit": "days",
            "no_catch_up": false,
            "cron_expression": null,
            "user_id": 108,
            "start_time": "2012-07-01 02:00",
            "end_time": "2022-07-01 02:00",
            "created_at": "2012-07-01 02:00",
            "time_zone": "UTC",
            "next_materialized_time": null,
            "command": {
                "query": "select stock_symbol, max(high), min(low), sum(volume) from daily_tick_data where date1='$formatted_date$'",
                "sample": false,
                "approx_mode": false,
                "approx_aggregations": false,
                "loader_table_name": null,
                "loader_stable": null,
                "md_cmd": null,
                "script_location": null,
                "retry": 0
            },
            "dependency_info": {
                "hive_tables": [
                    {
                        "window_end": "0",
                        "time_zone": "UTC",
                        "window_start": "-1",
                        "interval": {
                            "days": "1"
                        },
                        "name": "daily_tick_data",
                        "initial_instance": "2012-07-01T00:00Z",
                        "columns": {
                            "stock_exchange": [
                                "nasdaq",
                                "nyse"
                            ],
                            "stock_symbol": [
                                "ibm",
                                "orcl"
                            ]
                        }
                    }
                ]
            },
            "incremental": {},
            "time_out": 10,
            "command_type": "HiveCommand",
            "macros": [
                {
                    "formatted_date": "Qubole_nominal_time.format('YYYY-MM-DD')"
                }
            ],
            "template": "generic",
            "pool": null,
            "label": "default",
            "is_digest": false,
            "can_notify": false,
            "digest_time_hour": 0,
            "digest_time_minute": 0,
            "email_list": "qubole@qubole.com",
            "bitmap": 0
        },
        {
            "id": 51,
            "name": "51",
            "status": "KILLED",
            "concurrency": 1,
            "frequency": 1,
            "time_unit": "days",
            "no_catch_up": false,
            "cron_expression": null,
            "user_id": 108,
            "start_time": "2013-03-30 07:30",
            "end_time": "2015-01-01 00:00",
            "created_at": "2012-07-01 02:00",
            "time_zone": "Amsterdam",
            "next_materialized_time": null,
            "command": {
                "query": "alter table recover partitions demo_data3",
                "sample": false,
                "approx_mode": false,
                "approx_aggregations": false,
                "loader_table_name": null,
                "loader_stable": null,
                "md_cmd": null,
                "script_location": null,
                "retry": 0
            },
            "dependency_info": {
                "hive_tables": null
            },
            "incremental": {},
            "time_out": 10,
            "command_type": "HiveCommand",
            "macros": [],
            "template": "generic",
            "pool": null,
            "label": "default"
        },
        {
            "id": 52,
            "name": "52",
            "status": "SUSPENDED",
            "concurrency": 1,
            "frequency": 1,
            "time_unit": "days",
            "no_catch_up": false,
            "cron_expression": null,
            "user_id": 108,
            "start_time": "2013-03-30 07:30",
            "end_time": "2015-01-01 00:00",
            "created_at": "2012-07-01 02:00",
            "time_zone": "Amsterdam",
            "next_materialized_time": "2013-04-02 07:30",
            "command": {
                "query": "alter table demo_data3 recover partitions",
                "sample": false,
                "approx_mode": false,
                "approx_aggregations": false,
                "loader_table_name": null,
                "loader_stable": null,
                "md_cmd": null,
                "script_location": null,
                "retry": 0
            },
            "dependency_info": {
                "hive_tables": null
            },
            "incremental": {},
            "time_out": 10,
            "command_type": "HiveCommand",
            "macros": [],
            "template": "generic",
            "pool": null,
            "label": "default",
            "is_digest": false,
            "can_notify": false,
            "digest_time_hour": 0,
            "digest_time_minute": 0,
            "email_list": "qubole@qubole.com",
            "bitmap": 0
        },
        {
            "id": 53,
            "name": "53",
            "status": "DONE",
            "concurrency": 1,
            "frequency": 1,
            "time_unit": "days",
            "no_catch_up": false,
            "cron_expression": null,
            "user_id": 108,
            "start_time": "2013-04-01 07:00",
            "end_time": "2015-01-01 00:00",
            "created_at": "2012-07-01 02:00",
            "time_zone": "Amsterdam",
            "next_materialized_time": "2015-01-01 07:00",
            "command": {
                "query": "alter table daily_tick_data recover partitions",
                "sample": false,
                "approx_mode": false,
                "approx_aggregations": false,
                "loader_table_name": null,
                "loader_stable": null,
                "md_cmd": null,
                "script_location": null,
                "retry": 0
            },
            "dependency_info": {
                "hive_tables": null
            },
            "incremental": {},
            "time_out": 10,
            "command_type": "HiveCommand",
            "macros": [],
            "template": "generic",
            "pool": null,
            "label": "default",
            "is_digest": false,
            "can_notify": false,
            "digest_time_hour": 0,
            "digest_time_minute": 0,
            "email_list": "qubole@qubole.com",
            "bitmap": 0
        },
        {
            "id": 71,
            "name": "71",
            "status": "KILLED",
            "concurrency": 1,
            "frequency": 1440000,
            "time_unit": "minutes",
            "no_catch_up": false,
            "cron_expression": null,
            "user_id": 12,
            "start_time": "2013-04-10 00:00",
            "end_time": "2037-04-10 00:00",
            "created_at": "2012-07-01 02:00",
            "time_zone": "UTC",
            "next_materialized_time": null,
            "command": {
                "mode": 2,
                "dbtap_id": 15,
                "hive_table": "xxx",
                "part_spec": null,
                "hive_serde": null,
                "db_where": null,
                "db_columns": null,
                "schema": null,
                "md_cmd": true,
                "db_parallelism": 1,
                "db_extract_query": "select a,b,c from 3int_100M where $CONDITIONS",
                "retry": 0
            },
            "dependency_info": {},
            "incremental": {},
            "time_out": 0,
            "command_type": "DbImportCommand",
            "macros": {},
            "template": "generic",
            "pool": null,
            "label": "default"
        },
        {
            "id": 108,
            "name": "108",
            "status": "KILLED",
            "concurrency": 1,
            "frequency": 1449000,
            "time_unit": "minutes",
            "no_catch_up": false,
            "cron_expression": null,
            "user_id": 12,
            "start_time": "2013-05-01 00:00",
            "end_time": "2037-05-01 00:00",
            "created_at": "2012-07-01 02:00",
            "time_zone": "UTC",
            "next_materialized_time": null,
            "command": {
                "query": "alter table 3int_100m_sqooped recover partitions",
                "sample": false,
                "approx_mode": false,
                "approx_aggregations": false,
                "loader_table_name": "3int_100m_sqooped",
                "loader_stable": 60,
                "md_cmd": null,
                "script_location": null,
                "retry": 0
            },
            "dependency_info": {},
            "incremental": {},
            "time_out": 0,
            "command_type": "HiveCommand",
            "macros": {},
            "template": "s3import",
            "pool": null,
            "label": "default"
        },
        {
            "id": 128,
            "name": "128",
            "status": "KILLED",
            "concurrency": 1,
            "frequency": 1440,
            "time_unit": "minutes",
            "no_catch_up": false,
            "cron_expression": null,
            "user_id": 108,
            "start_time": "2013-05-13 00:00",
            "end_time": "2037-05-13 00:00",
            "created_at": "2012-07-01 02:00",
            "time_zone": "UTC",
            "next_materialized_time": "2014-04-18 00:00",
            "command": {
                "query": "alter table demo_memetracker recover partitions",
                "sample": false,
                "approx_mode": false,
                "approx_aggregations": false,
                "loader_table_name": "demo_memetracker",
                "loader_stable": 60,
                "md_cmd": null,
                "script_location": null,
                "retry": 0
            },
            "dependency_info": {},
            "incremental": {},
            "time_out": 0,
            "command_type": "HiveCommand",
            "macros": {},
            "template": "s3import",
            "pool": null,
            "label": "default",
            "is_digest": false,
            "can_notify": false,
            "digest_time_hour": 0,
            "digest_time_minute": 0,
            "email_list": "qubole@qubole.com",
            "bitmap": 0
        },
        {
            "id": 200,
            "name": "200",
            "status": "RUNNING",
            "concurrency": 1,
            "frequency": 14,
            "time_unit": "days",
            "no_catch_up": false,
            "cron_expression": null,
            "user_id": 12,
            "start_time": "2013-05-15 00:00",
            "end_time": "2037-05-15 00:00",
            "created_at": "2012-07-01 02:00",
            "time_zone": "UTC",
            "next_materialized_time": "2016-04-27 00:00",
            "command": {
                "query": "show tables;",
                "sample": false,
                "approx_mode": false,
                "approx_aggregations": false,
                "loader_table_name": null,
                "loader_stable": null,
                "md_cmd": null,
                "script_location": null,
                "retry": 0
            },
            "dependency_info": {
                "hive_tables": null
            },
            "incremental": {},
            "time_out": 10,
            "command_type": "HiveCommand",
            "macros": [],
            "template": "generic",
            "pool": null,
            "label": "default",
            "is_digest": false,
            "can_notify": false,
            "digest_time_hour": 0,
            "digest_time_minute": 0,
            "email_list": "qubole1@qubole.com",
            "bitmap": 0
        },
        {
            "id": 201,
             "name": "201",
             "status": "KILLED",
            "concurrency": 1,
            "frequency": 1449000,
            "time_unit": "minutes",
            "no_catch_up": false,
            "cron_expression": null,
            "user_id": 12,
            "start_time": "2013-05-28 00:00",
            "end_time": "2037-05-28 00:00",
            "created_at": "2012-07-01 02:00",
            "time_zone": "UTC",
            "next_materialized_time": null,
            "command": {
                "query": "alter table 3int_100m_sqooped recover partitions",
                "sample": false,
                "approx_mode": false,
                "approx_aggregations": false,
                "loader_table_name": "3int_100m_sqooped",
                "loader_stable": 60,
                "md_cmd": null,
                "script_location": null,
                "retry": 0
            },
            "dependency_info": {
                "hive_tables": null
            },
            "incremental": {},
            "time_out": 10,
            "command_type": "HiveCommand",
            "macros": [],
            "template": "s3import",
            "pool": null,
            "label": "default"
        },
        {
            "id": 203,
            "name": "203",
            "status": "SUSPENDED",
            "concurrency": 1,
            "frequency": 40,
            "time_unit": "minutes",
            "no_catch_up": false,
            "cron_expression": null,
            "user_id": 108,
            "start_time": "2013-05-13 00:00",
            "created_at": "2012-07-01 02:00",
            "end_time": "2037-05-13 00:00",
            "time_zone": "UTC",
            "next_materialized_time": "2014-01-08 21:20",
            "command": {
                "query": "alter table demo_memetracker recover partitions",
                "sample": false,
                "approx_mode": false,
                "approx_aggregations": false,
                "loader_table_name": "demo_memetracker",
                "loader_stable": 60,
                "md_cmd": null,
                "script_location": null,
                "retry": 0
            },
            "dependency_info": {
                "hive_tables": null
            },
            "incremental": {},
            "time_out": 10,
            "command_type": "HiveCommand",
            "macros": [],
            "template": "s3import",
            "pool": null,
            "label": "default",
            "is_digest": false,
            "can_notify": false,
            "digest_time_hour": 0,
            "digest_time_minute": 0,
            "email_list": "qubole@qubole.com",
            "bitmap": 0
        }
    ]
}
Edit a Schedule
PUT /api/v1.2/scheduler/(Scheduler ID)

Use this API to edit an existing schedule that is created to run commands automatically in a specified interval. You can edit a schedule by sending a PUT request with attributes that you want to modify.

Required Role

The following users can make this API call:

  • Users who belong to the system-admin group.
  • Users who belong to a group associated with a role that allows editing a schedule. See Managing Groups and Managing Roles for more information.
Parameters

Note

Parameters marked in bold below are mandatory. Others are optional and have default values.

Parameter Description
command_type A valid command type supported by Qubole. For example, HiveCommand, HadoopCommand, PigCommand.
command

JSON object describing the command. Refer to the Command API for more details.

Sub fields can use macros. Refer to the Qubole Scheduler for more details.

name A user-defined name for a schedule. If name is not specified, then a system-generated Schedule ID is set as the name.
label Specify a cluster label that identifies the cluster on which the schedule API call must be run.
start_time Start datetime for the schedule
end_time End datetime for the schedule
frequency Set this option or cron_expression but do not set both options. Specify how often the schedule should run. Input is an integer. For example, frequency of one hour/day/month is represented as {"frequency":"1"}
time_unit Denotes the time unit for the frequency. Its default value is days. Accepted value is minutes, hours, days, weeks, or months.
cron_expression Set this option or frequency but do not set both options. The standard cron format is "s, m, h, d, M, D, Y" where s is second, m is minute, h is hour, d is date, M is month, and D is day of the week. Only year (Y) is optional. Example - "cron_expression":"0 0 12 * * ?". For more information, see Cron Trigger Tutorial.
macros

Expressions to evaluate macros. Macros can be used in parameterized commands.

Refer to the Macros in Scheduler page for more details.

no_catch_up Set this parameter to true if you want to skip the schedule actions that were supposed to have run in the past and run only the upcoming schedule actions. By default, this parameter is set to false. When a new schedule is created, the scheduler runs schedule actions from the start time to the current time. For example, if a daily schedule with a start time of Jun 1, 2015 is created on Dec 1, 2015, schedule actions are run for Jun 1, 2015, Jun 2, 2015, and so on. If you do not want the scheduler to run the missed schedule actions for the months before December, set no_catch_up to true. Skipping schedule actions is mainly useful when you suspend a schedule and resume it later; in that case there can be more than one pending schedule action, and you might want to skip the earlier ones. For more information, see Understanding the Qubole Scheduler Concepts.
time_zone

Timezone of the start and end time of the schedule.

Scheduler will understand ZoneInfo identifiers. For example, Asia/Kolkata.

For a list of identifiers, check column 3 in List of TZ in databases.

Default value is UTC.

command_timeout The command timeout, in seconds. Its default value is 129600 seconds (36 hours), and any value that you set must be less than 36 hours. QDS checks the timeout for a command every 60 seconds, so if the timeout is set to 80 seconds, the command gets killed at the next check, that is, after 120 seconds. Set this parameter to prevent a command from running for the full 36 hours.
time_out Unit is minutes. A number that represents the maximum amount of time the schedule should wait for dependencies to be satisfied.
concurrency Specify how many schedule actions can run at a time. Default value is 1.
dependency_info

Describe dependencies for this schedule.

Check the Hive Datasets as Schedule Dependency for more information.

notification It is an optional parameter that is set to false by default. Set it to true if you want to be notified through email about instance failure. The notification sub-parameters are described below.
notification
Parameter Description
is_digest It is the notification email type. Set it to true if the schedule periodicity is in minutes or hours. If it is set to false, the email type is immediate by default.
notify_failure If this option is set to true, you receive schedule failure notifications.
notify_success If this option is set to true, you receive schedule success notifications.
notification_email_list By default, the current user’s email ID is added. You can add additional email IDs as required.
dependency_info
Parameter Description
files Use this parameter if there is dependency on S3 files and it has the following sub options. For more information, see Configuring GS Files Data Dependency.
path It is the S3 path of the dependent file (with data) based on which the schedule runs.
window_start It denotes the start day or time.
window_end It denotes the end day or time.
hive_tables Use this parameter if there is dependency on Hive table data that has partitions. For more information, see Configuring Hive Tables Data Dependency.
schema It is the database that contains the partitioned Hive table.
name It is the name of the partitioned Hive table.
window_start It denotes the start day or time.
window_end It denotes the end day or time.
interval It denotes the dataset interval and defines how often the data is generated. Hive Datasets as Schedule Dependency provides more information. You must also specify the incremental time that can be in minutes, hours, days, weeks, or months. The usage is "interval":{"days":"1"}. The default interval is 1 day.
column It denotes the partitioned column name. You must specify a date-time mask through the the_date parameter, which denotes how to convert a date to a string for the partition. The usage is "columns":{"the_date":"<value>"}. The <value> can be a macro or a string.
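
For illustration, here is a sketch that edits only the timeout and notification settings of an existing schedule, leaving the command unchanged. The schedule ID 3159 is reused from the samples below, and the specific values (a 30-minute dependency timeout and a 7200-second command timeout) are assumptions:

curl -i -X PUT -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Accept: application/json" -H "Content-type: application/json" \
-d '{
      "time_out": 30,
      "command_timeout": 7200,
      "notification": {"notify_failure": true, "notify_success": false}
    }' \
"https://gcp.qubole.com/api/v1.2/scheduler/3159"
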
Response

The response contains a JSON object representing the edited schedule.

Note

There is a rerun limit for schedule reruns to be processed concurrently at a given point of time. Understanding the Qubole Scheduler Concepts provides more information.

Example

Sample 1

Goal: Modify a schedule to run every 30 days

curl -i -X PUT -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Accept: application/json" -H "Content-type: application/json" \
-d    '{ "frequency": 30, "time_unit": "days" }' \
"https://gcp.qubole.com/api/v1.2/scheduler/3159"

Response

{
 "email_list":"qubole@qubole.com",
 "dependency_info":{},
 "end_time":"2022-07-01 02:00",
 "status":"RUNNING",
 "no_catch_up":false,
 "label":"default",
 "concurrency":1,
 "frequency":30,
 "time_zone":"UTC",
 "template":"generic",
 "command":{
            "sample":false,"loader_table_name":null,"md_cmd":null,"approx_mode":false,"query":"select stock_symbol, max(high), min(low), sum(volume) from daily_tick_data where date1='$formatted_date$' group by stock_symbol","loader_stable":null,"script_location":null,"approx_aggregations":false
           },
 "user_id":108,
 "is_digest":false,
 "time_unit":"days",
 "digest_time_hour":0,
 "macros":[{"formatted_date":"Qubole_nominal_time.format('YYYY-MM-DD')"}],
 "incremental":{},
 "bitmap":0,
 "digest_time_minute":0,
 "can_notify":false,
 "command_type":"HiveCommand",
 "name":"3159",
 "start_time":"2012-07-01 02:00",
 "time_out":10,
 "id":3159,
 "next_materialized_time":"2012-07-07 02:00"
}

Sample 2

Goal: Modify a workflow command in a schedule

curl -i -X PUT -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Accept: application/json" -H "Content-type: application/json" \
-d '{"command_type": "CompositeCommand",
  "command":{ "sub_commands":
  [
   {
    "command_type": "SparkCommand",
    "language":"command_line",
    "cmdline": "A=123"
   },
   {
    "command_type": "SparkCommand",
    "language":"command_line",
    "cmdline": "B=456"
   }
  ]}
 }' "https://gcp.qubole.com/api/v1.2/scheduler/3159"
Preview a Command Before Running It
POST /api/v1.2/scheduler/preview_macro

After adding one or more macro variables to a command or query, you can use this API to preview the command before running it. It can be used to preview a new, existing, or edited command.

Parameters

None

REST API Syntax

Here is the API syntax for running a preview command API call.

curl -i -X POST -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Accept: application/json" -H "Content-type: application/json" \
-d '{ "command": { "query": "<query <macro>>", "command_type": "HiveCommand"}, "macros": [{"<macro>":"\"<macrovalue>\""}]
    }' \ "https://gcp.qubole.com/api/v1.2/scheduler/preview_macro"
Sample API Requests

Here are two sample API calls to preview the commands.

Hive Command Schedule Sample
curl -i -X POST -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Accept: application/json" -H "Content-type: application/json" \
-d '{"command": { "query": "show table $table$", "command_type": "HiveCommand" }, "macros": [{"table":"\"hivetabledata\""}]
}'\ "https://gcp.qubole.com/api/v1.2/scheduler/preview_macro"
Workflow Command Schedule Sample
curl -i -X POST -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Accept: application/json" -H "Content-type: application/json" \
-d '{
      "command": { "sub_commands" : [ {
                     "command_type" : "HiveCommand",
                     "query" : "Some $hive$ query"
                   },
                   {
                     "command_type" : "PrestoCommand",
                     "query" : "Some $presto$"
                   }
                 ],
                 "command_type": "CompositeCommand" },
      "macros": [{"presto":"\"prestoCmd\""}, {"hive":"\"hiveCmd\""}]
    }' \
"https://gcp.qubole.com/api/v1.2/scheduler/preview_macro"
Suspend, Resume or Kill a Schedule
PUT /api/v1.2/scheduler/(int: id)

Use this API to suspend, resume, or kill a schedule.

Note

After you stop a schedule, you cannot resume it. However, you can suspend a schedule and resume it later.

Required Role

The following users can make this API call:

  • Users who belong to the system-admin group.
  • Users who belong to a group associated with a role that allows suspending, resuming, or killing a schedule. See Managing Groups and Managing Roles for more information.

This API is used to suspend, resume, or kill an existing schedule created to run commands automatically at certain frequency in a specified interval.

Resource URI scheduler/id
Request Type PUT
Supporting Versions v2.0
Return Value JSON object with the status of the operation.
Parameters

Note

Parameters marked in bold below are mandatory. Others are optional and have default values.

Parameter Description
status It indicates the status and its valid values are suspend, resume or kill.
no_catch_up

Note

You can use this parameter while resuming a suspended schedule.

Set this parameter to true if you want to skip the schedule actions that were supposed to have run in the past and run only the upcoming schedule actions. By default, this parameter is set to false. When a new schedule is created, the scheduler runs schedule actions from the start time to the current time. For example, if a daily schedule with a start time of Jun 1, 2015 is created on Dec 1, 2015, schedule actions are run for Jun 1, 2015, Jun 2, 2015, and so on. If you do not want the scheduler to run the missed schedule actions for the months before December, set no_catch_up to true. Skipping schedule actions is mainly useful when you suspend a schedule and resume it later; in that case there can be more than one pending schedule action, and you might want to skip the earlier ones. For more information, see Understanding the Qubole Scheduler Concepts.

Examples
Example to Suspend a Schedule
curl -i -X PUT -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Accept: application/json" -H "Content-type: application/json" \
-d '{"status":"suspend"}' \ "https://gcp.qubole.com/api/v1.2.0}/scheduler/${SCHEDID}/"

Response

{"succeeded":"true","status":"SUSPENDED"}

Note

There is a rerun limit for a scheduled job. Understanding the Qubole Scheduler Concepts provides more information.

Example to Resume a Suspended Schedule

Note

A _SUCCESS file is created in the output folder for successful schedules. You can set mapreduce.fileoutputcommitter.marksuccessfuljobs to false to disable creation of _SUCCESS file or to true to enable creation of the _SUCCESS file.

curl -i -X PUT -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Accept: application/json" -H "Content-type: application/json" \
-d '{"status":"resume"}' \ "https://gcp.qubole.com/api/v1.2/scheduler/${SCHEDID}/"

Response

{"succeeded":"true","status":"RUNNING"}

Note

After you stop a schedule, you cannot resume it. However, you can suspend a schedule and resume it later.

Example to Kill a Schedule
curl -i -X PUT -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Accept: application/json" -H "Content-type: application/json" \
-d '{"status":"kill"}' \ "https://gcp.qubole.com/api/v1.2/scheduler/${SCHEDID}/"

Response

{"succeeded":"true","status":"KILLED"}
List Schedule Actions
GET /api/v1.2/scheduler/(int: id)/actions

Retrieves a list of actions run for a scheduler. The list is paginated.

Note

A _SUCCESS file is created in the output folder for successful schedules. You can set mapreduce.fileoutputcommitter.marksuccessfuljobs to false to disable creation of _SUCCESS file or to true to enable creation of the _SUCCESS file.

Required Role

The following users can make this API call:

  • Users who belong to the system-admin or system-user group.
  • Users who belong to a group associated with a role that allows viewing actions that a scheduler runs. See Managing Groups and Managing Roles for more information.
Response

The list contains information about commands that are run as part of the action. The list is ordered with the most recent actions first.

Example

Goal: Retrieve a list of actions for a schedule. The list has 3 actions per page.

curl -i -X GET -H "X-AUTH-TOKEN: $AUTH_TOKEN" \
-H "Accept: application/json" \
-H "Content-type: application/json" \
"https://gcp.qubole.com/api/v1.2/scheduler/${SCHEDID}/actions?per_page=3"

Response

{
  "actions": [
    {
      "status": "done",
      "is_rerun_of": 47519,
      "nominal_time": "2014-06-26T11:00:00Z",
      "done": true,
      "sequence_id": 4226,
      "query_hist_id": 277791,
      "rerun_number": 2,
      "id": 47520,
      "dependencies": {
        "not_found": [

        ],
        "found": [

        ]
      },
      .
      .
      .
      "periodic_job_id": 30562,
    },
    {
      "done": true,
      "sequence_id": 4226,
      "query_hist_id": 277790,
      .
      .
      .
      "status": "done",
    },
    {
      "status": "done",
      "is_rerun_of": null,
      "periodic_job_id": 30562,
    }
  ],
  "paging_info": {
    "previous_page": null,
    "per_page": "3",
    "next_page": 2
  }
}
View a Schedule’s Action
GET /api/v1.2/scheduler/<Schedule ID>/actions/<sequence_id>

Use this API to view an existing schedule’s action.

Schedule ID is the Schedule’s ID and it is a mandatory parameter.

sequence_id is the nth action of the schedule and it is optional.

Required Role

The following users can make this API call:

  • Users who belong to the system-admin or system-user group.
  • Users who belong to a group associated with a role that allows viewing a schedule’s actions. See Managing Groups and Managing Roles for more information.
Response

If you provide only <Schedule ID> in the API call, for example, /api/v1.2/scheduler/123/actions, then the API response contains the list of Scheduler instances in descending order based on instance ID, that is, the most recent run is the first action object in the actions array. The most recent Scheduler instance run might be a rerun of an older instance. This is illustrated in Example 1.

If you provide both <Schedule ID> and sequence_id in the API call, for example, /api/v1.2/scheduler/123/actions/5, then the API response contains the list of Scheduler run instances that belong to only that sequence ID, and the results are always in descending order based on rerun_number, that is, the most recent rerun is the first action object in the actions array. This is illustrated in Example 2.

The two types of API calls and the corresponding responses are illustrated in the below examples.

Examples

Here are the two different examples.

Example 1

Goal: View details of the schedule’s action by providing only the schedule’s ID.

curl -i -X GET -H "X-AUTH-TOKEN: $AUTH_TOKEN" \
-H "Accept: application/json" \
-H "Content-type: application/json" \
"https://gcp.qubole.com/api/v1.2/scheduler/16537/actions"

Response

For brevity, the JSON is truncated.

{
"paging_info": {
    "next_page": null,
    "previous_page": null,
    "per_page": 10
},
"actions": [
    {
        "id": 6587693,
        "sequence_id": 2,
        "nominal_time": "2017-03-16T00:00:00Z",
        "is_rerun_of": 3460235,
        "rerun_number": 2,
        "created_at": "2018-02-13T07:27:42Z",
        "query_hist_id": 128128536,
        "dependencies": {
            "found": [],
            "not_found": []
        },
        "done": true,
        "status": "done",
        "periodic_job_id": 16537,
        "perms": {
            "kill": true,
            "rerun": true
        }
        ,
        "command": {
            "id": 128128536,
            "path": "/tmp/2018-02-13/2056/128128536",
            "status": "done",
            "created_at": "2018-02-13T07:27:43Z",
            "updated_at": "2018-02-13T07:27:58Z",
            "command_type": "HiveCommand",
            "progress": 100,
            .....
            "command": {
                "query": "show tables",
                "sample": false,
                ....
            }
        }
    },
    {
        "id": 6587685,
        "sequence_id": 3,
        "nominal_time": "2017-03-16T00:00:00Z",
        "is_rerun_of": null,
        "rerun_number": 1,
        ....
        ,
        "command": {
          ....
        }
    },
    {
        "id": 3460235,
        "sequence_id": 2,
        "nominal_time": "2017-03-16T00:00:00Z",
        "is_rerun_of": null,
        "rerun_number": 1,
        .....
        ,
        "command": {
           .....
        }
    },
    {
        "id": 3460233,
        "sequence_id": 1,
        "nominal_time": "2017-03-15T18:30:00Z",
        "is_rerun_of": null,
        "rerun_number": 1,
        ....
        ,
        "command": {
          ....
        }
    }
]
}
Example 2

Goal: View a schedule’s action to retrieve its reruns by providing Scheduler ID and Sequence ID.

curl -i -X GET -H "X-AUTH-TOKEN: $AUTH_TOKEN" \
-H "Accept: application/json" \
-H "Content-type: application/json" \
"https://gcp.qubole.com/api/v1.2/scheduler/16537/actions/2"

Response

For brevity, the JSON is truncated.

{
"paging_info": {
    "next_page": null,
    "previous_page": null,
    "per_page": 10
},
"actions": [
    {
        "id": 6587693,
        "sequence_id": 2,
        "nominal_time": "2017-03-16T00:00:00Z",
        "is_rerun_of": 6587682,
        "rerun_number": 5,
        "created_at": "2018-02-13T07:27:42Z",
        "query_hist_id": 128128536,
        "dependencies": {
            "found": [],
            "not_found": []
        },
        "done": true,
        "status": "done",
        "periodic_job_id": 16537,
        "perms": {
            "kill": true,
            "rerun": true
        },
        "command": {
          ....
        }
    },
    {
        "id": 6587685,
        "sequence_id": 2,
        "nominal_time": "2017-03-16T00:00:00Z",
        "is_rerun_of": 3460235,
        "rerun_number": 4,
        "created_at": "2018-02-13T07:22:49Z",
        "query_hist_id": 128127532,
        ....
        ,
        "command": {
          ....
        }
    },
    {
        "id": 6587682,
        "sequence_id": 2,
        "nominal_time": "2017-03-16T00:00:00Z",
        "is_rerun_of": 3460235,
        "rerun_number": 3,
        "created_at": "2018-02-13T07:20:27Z",
        "query_hist_id": 128127085,
        .....
        ,
        "command": {
          ....
        }
    },
    {
        "id": 6587667,
        "sequence_id": 2,
        "nominal_time": "2017-03-16T00:00:00Z",
        "is_rerun_of": 3460235,
        "rerun_number": 2,
        "created_at": "2018-02-13T07:18:15Z",
        "query_hist_id": 128126570,
        ....
        ,
        "command": {
          ....
        }
    },
    {
        "id": 3460235,
        "sequence_id": 2,
        "nominal_time": "2017-03-16T00:00:00Z",
        "is_rerun_of": null,
        "rerun_number": 1,
        "created_at": "2017-03-16T15:18:25Z",
        "query_hist_id": 61115891,
        ....
        ,
        "command": {
          ....
        }
    }
]
}
Kill a Schedule Action
PUT /api/v1.2/actions/(int: id)/kill

Cancels a running scheduled action. The action ID can be obtained by listing the actions using either the List Schedule Actions or the List All Actions API.

Required Role

The following users can make this API call:

  • Users who belong to the system-admin or system-user group.
  • Users who belong to a group associated with a role that allows killing a schedule action. See Managing Groups and Managing Roles for more information.
Response

A JSON hash that encodes if the operation is a success or not.

{"kill_succeeded":"true"}
Example

Goal: Kill an action with ID ${ACTIONID}

curl -i -X PUT -H "X-AUTH-TOKEN: $AUTH_TOKEN" \
-H "Accept: application/json" \
-H "Content-type: application/json" \
{"status":"kill"} \
"https://gcp.qubole.com/api/v1.2/actions/${ACTIONID}/kill"

Response

{"kill_succeeded":"true"}
Rerun a Scheduled Action
POST /api/v1.2/actions/(int: id)/rerun

Use this API to rerun a scheduled action. Get the action ID by listing the actions using either the List Schedule Actions or the List All Actions API.

Required Role

The following users can make this API call:

  • Users who belong to the system-admin group.
  • Users who belong to a group associated with a role that allows rescheduling. See Managing Groups and Managing Roles for more information.
Response

A JSON hash that encodes if the operation is successful or not.

{"status":"rescheduled"}
Example

Goal: Rerun an action with id ${ACTIONID}

curl -i -X POST -H "X-AUTH-TOKEN: $AUTH_TOKEN" \
-H "Accept: application/json" \
-H "Content-type: application/json" \
"https://gcp.qubole.com/api/v1.2/actions/${ACTIONID}/rerun"

Response

{"status":"rescheduled"}
List All Actions
GET /api/v1.2/actions

Retrieves a list of actions run by the scheduler. The list can belong to any schedule in the account. It is ordered in reverse chronological order. The list is also paginated.

Required Role

The following users can make this API call:

  • Users who belong to the system-admin or system-user group.
  • Users who belong to a group associated with a role that allows viewing actions that a scheduler runs. See Managing Groups and Managing Roles for more information.
Response

The list contains information about commands that are run as part of the action.

Example

Goal: Retrieve a list of actions for a schedule. The list has 3 actions per page.

curl -i -X GET -H "X-AUTH-TOKEN: $AUTH_TOKEN" \
-H "Accept: application/json" \
-H "Content-type: application/json" \
"https://gcp.qubole.com/api/v1.2/actions?per_page=3"

Response

{
  "actions": [
    {
      "status": "done",
      "is_rerun_of": 47519,
      "nominal_time": "2014-06-26T11:00:00Z",
      "done": true,
      "sequence_id": 4226,
      "query_hist_id": 277791,
      "rerun_number": 2,
      "id": 47520,
      "dependencies": {
        "not_found": [

        ],
        "found": [

        ]
      },
      .
      .
      .
      "periodic_job_id": 30562,
    },
    {
      "done": true,
      "sequence_id": 4226,
      "query_hist_id": 277790,
      .
      .
      .
      "status": "done",
    },
    {
      "status": "done",
      "is_rerun_of": null,
      "periodic_job_id": 30562,
    }
  ],
  "paging_info": {
    "previous_page": null,
    "per_page": "3",
    "next_page": 2
  }
}
View an Action
GET /api/v1.2/actions/(int: id)
Required Role

The following users can make this API call:

  • Users who belong to the system-admin or system-user group.
  • Users who belong to a group associated with a role that allows viewing schedule information. See Managing Groups and Managing Roles for more information.
Response

The response is a scheduled action. The action may or may not have an empty command child object; a command may be empty if dependencies were not satisfied.

This API differs from View a Schedule’s Action: instead of the sequence_id of the action within the schedule, it accepts the unique ID of the action.

Example

Goal: View the details about the first action of a schedule.

curl -i -X GET -H "X-AUTH-TOKEN: $AUTH_TOKEN" \
-H "Accept: application/json" \
-H "Content-type: application/json" \
"https://gcp.qubole.com/api/v1.2/actions/37129/"

Response

{
"status": "done",
"done": true,
"command": {
    "meta_data": {
        "logs_resource": "commands/165025/logs",
        "results_resource": "commands/165025/results"
    },
    "status": "done",
    "command_source": "SCHEDULED",
    "progress": 100,
    "qbol_session_id": 49007,
    "command": {
        "approx_mode": false,
        "md_cmd": false,
        "loader_stable": null,
        "script_location": null,
        "query": "select stock_symbol, max(high), min(low), sum(volume) from daily_tick_data where date1='2012-07-01' group by stock_symbol",
        "sample": false,
        "loader_table_name": null,
        "approx_aggregations": false
    },
    "created_at": "2015-04-15T10:05:11Z",
    "start_time": 1429092315,
    "end_time": 1429092346,
    "command_type": "HiveCommand",
    "pid": 12110,
    "account_id": 3,
    "label": "default",
    "template": "generic",
    "timeout": null,
    "can_notify": false,
    "pool": null,
    "user_id": 3,
    "submit_time": 1429092311,
    "name": "",
    "id": 165025,
    "path": "/tmp/2015-04-15/3/165025",
    "qlog": "{\"QBOL-QUERY-SCHEMA\":{\"/tmp/2015-04-15/3/165025.dir/000\":[{\"ColumnType\":\"string\",\"ColumnName\":\"stock_symbol\"},{\"ColumnType\":\"float\",\"ColumnName\":\"_c1\"},{\"ColumnType\":\"float\",\"ColumnName\":\"_c2\"},{\"ColumnType\":\"bigint\",\"ColumnName\":\"_c3\"}]}}",
    "num_result_dir": 1,
    "resolved_macros": "{\"Qubole_nominal_time\":\"Sun Jul 01 2012 02:00:00 GMT+0000\",\"formatted_date\":\"2012-07-01\",\"Qubole_nominal_time_iso\":\"2012-07-01 02:00:00+00:00\"}"
},
"created_at": "2015-04-15T10:05:11Z",
"periodic_job_id": 3094,
"dependencies": {
    "not_found": [],
    "found": []
},
"nominal_time": "2012-07-01T02:00:00Z",
"is_rerun_of": null,
"sequence_id": 1,
"query_hist_id": 165025,
"rerun_number": 1,
"id": 46096
}

Sensor API

Qubole supports file and partition sensors to monitor file and Hive partition availability. For more information, see File and Partition Sensors.

Airflow uses file and partition sensors for programmatically monitoring workflows.

The APIs to create file and partition sensors are described in the following topics:

Create a File Sensor API
POST /api/v1.2/file_sensor

Use this API to create a file sensor. Airflow uses it to programmatically monitor workflows.

Required Role

The following users can make this API call:

  • Users who belong to the system-admin or system-user group.
  • Users who belong to a group associated with a role that allows creating a sensor. See Managing Groups and Managing Roles for more information.
Parameters

Note

Parameters marked in bold below are mandatory. Others are optional and have default values.

Parameter Description
files An array of the file locations that need monitoring.
Request API Syntax
curl -X POST -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" \
-d '{"files": "["<file1 location>", "<file2 location>]" }' \ "https://gcp.qubole.com/api/v1.2/sensors/file_sensor"
Sample API Request

The following example creates a file sensor:

curl -X POST -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" \
-d '{"files": "["s3://abc/xyz", "s3://abc/123"]" }' \ "https://gcp.qubole.com/api/v1.2/sensors/file_sensor"

Requests that return a 200 response code have just a status field, which contains either true or false. Requests that return a 422 response code also contain an error message.

{"status": "true"}
{"error": {"error_code": 422, "error_message": "File list is missing or not in array format"}}
Create a Partition Sensor API
POST /api/v1.2/partition_sensor

Use this API to create a partition sensor. Airflow uses it to programmatically monitor workflows.

Required Role

The following users can make this API call:

  • Users who belong to the system-admin or system-user group.
  • Users who belong to a group associated with a role that allows creating a sensor. See Managing Groups and Managing Roles for more information.
Parameters

Note

Parameters marked in bold below are mandatory. Others are optional and have default values.

Parameter Description
schema The database that contains the Hive table partition that needs monitoring.
table The name of the Hive table that contains the partition that needs monitoring.
columns An array of Hive table column names and the corresponding values.
Request API Syntax
curl -X POST -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" \
-d '{"schema": "<Database Name>", "table":"<Hive table name>",
     "columns":[{"column":"<column name>", "values":["<value>"]}]}' \
"https://gcp.qubole.com/api/v1.2/sensors/partition_sensor"
Sample API Request

Here is an example of creating a Hive table partition sensor:

curl -X POST -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" \
-d '{"schema": "default", "table":"hivetable", "columns":[{"column":"dt", "values":["2017-05-19"]}]}' \
"https://gcp.qubole.com/api/v1.2/sensors/partition_sensor"

Requests that return a 200 response code contain just a status field, which is either true or false. Requests that return a 422 response code also contain an error message.

{"status": "true"}
{"error": {"error_code": 422, "error_message": "Table can't be found in metastore"}}

Users API

Invite a User to a Qubole Account
POST /api/v1.2/users/invite_new

This API is used to invite new users to a Qubole account. New users are invited by an existing Qubole user of a Qubole account. After new users are added to the Qubole account, they become part of the system-users group by default.

Required Role

The following users can make this API call:

  • Users who belong to the system-admin or system-user group.
  • Users who belong to a group associated with a role that allows inviting users to an account. See Managing Groups and Managing Roles for more information.
Parameters

Note

Parameters marked in bold are mandatory. Others are optional and have default values.

Parameter Description
invitee_email Email address of the new user to be added to the Qubole account.
account Account ID of the current user.
groups This parameter is used to add groups to the new user. By default, a new user is added to the system-user group.
user_type This parameter defines the type of user to be created. Applicable values are service or regular. By default, a regular user is created. To enable service users for your account, create a ticket with Qubole Support.
Request API Syntax

Here is the API request syntax.

curl -X POST -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" \
-d '{"invitee_email":"<qubole email address>","account":"<account-ID>","groups":"<groups>", "user_type":"service" }' \
"https://gcp.qubole.com/api/v1.2/users/invite_new"
Sample Request

Here is a sample request.

curl -X POST -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" \
-d '{"invitee_email":"user@qubole.com","account":"4332","groups":"system-admin" }' \
"https://gcp.qubole.com/api/v1.2/users/invite_new"
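If service users are enabled for your account, you can also pass the user_type parameter described above. In the following sketch, the email address and group are placeholders:

curl -X POST -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" \
-d '{"invitee_email":"svc-user@example.com","account":"4332","groups":"system-user","user_type":"service"}' \
"https://gcp.qubole.com/api/v1.2/users/invite_new"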
Enable a User in a Qubole Account
POST /api/v1.2/accounts/enable_qbol_user

This API is used to enable a user in a particular Qubole account.

Required Role

The following users can make this API call:

  • Users who belong to the system-admin or system-user group.
  • Users who belong to a group associated with a role that allows enabling users. See Managing Groups and Managing Roles for more information.
Parameters

Note

Parameters marked in bold are mandatory. Others are optional and have default values.

Parameter Description
qbol_user_id The ID of the user who should be enabled, or the user's email address. View Users of a QDS Account describes how to get the users: send the get_users API request, locate the entry with the matching user_id and email in the successful response, and use its id value.
Request API Syntax

Here is the Request API syntax.

curl -X POST -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" \
-d '{"qbol_user_id" : "<id>/<user-email-address>"}' \
"https://gcp.qubole.com/api/v1.2/accounts/enable_qbol_user"

To get the qbol_user_id, perform the following steps:

  1. Send the GET API request, GET /api/v1.2/accounts/get_users as described in View Users of a QDS Account.
  2. A successful API response contains multiple values. For example, the values may look like { id: xyz, ... user_id: abcd, email:abc@.. }. Use the value of the id parameter.
Sample Request

Here is a sample request.

curl -X POST -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" \
-d '{"qbol_user_id" : "4"}' \
"https://gcp.qubole.com/api/v1.2/accounts/enable_qbol_user"

Here is a sample request with the user’s email address.

curl -X POST -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" \
-d '{"qbol_user_id" : "user@qubole.com"}' \
"https://gcp.qubole.com/api/v1.2/accounts/enable_qbol_user"
Disable a User in a Qubole Account
POST /api/v1.2/accounts/disable_qbol_user

This API is used to disable a user in a particular Qubole account.

Required Role

The following users can make this API call:

  • Users who belong to the system-admin or system-user group.
  • Users who belong to a group associated with a role that allows disabling users. See Managing Groups and Managing Roles for more information.
Parameters

Note

Parameters marked in bold are mandatory. Others are optional and have default values.

Parameter Description
qbol_user_id The ID of the user who should be disabled, or the user's email address. View Users of a QDS Account describes how to get the users of a Qubole account: send the get_users API request, locate the entry with the matching user_id and email in the successful response, and use its id value.
Request API Syntax

Here is the Request API syntax.

curl -X POST -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" \
-d '{"qbol_user_id" : "<id>/<user-email-address>"}' \
"https://gcp.qubole.com/api/v1.2/accounts/disable_qbol_user"

To get the qbol_user_id, perform the following steps:

  1. Send the GET API request, GET /api/v1.2/accounts/get_users as described in View Users of a QDS Account.
  2. A successful API response contains multiple values. For example, the values may look like { id: xyz, ... user_id: abcd, email:abc@.. }. Use the value of the id parameter.
Sample Request

Here is a sample request.

curl -X POST -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" \
-d '{"qbol_user_id" : "4"}' \
"https://gcp.qubole.com/api/v1.2/accounts/disable_qbol_user"

Here is a sample request with the user’s email address.

curl -X POST -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" \
-d '{"qbol_user_id" : "user@qubole.com"}' \
"https://gcp.qubole.com/api/v1.2/accounts/disable_qbol_user"

Troubleshooting Guide

This guide explains how to troubleshoot issues that you might encounter while using QDS.

Troubleshooting Query Problems – Before You Contact Support

To make sure your problem is resolved as quickly as possible, you should gather all the pertinent information before you create a support ticket. The following checklist will help you organize this information:

  • Fill in mandatory fields in the ticket.

  • Which account ID is impacted?

    You can find this on the My Accounts page of the Control Panel.

  • Which command ID(s) or Notebook ID(s) are impacted?

    You can find these in the respective QDS UI pages.

  • Can the command or notebook paragraph be rerun in case of failure? If not, please state the reason in the ticket description.

  • Which environment is the issue occurring in?

    You can share the URL in case you do not know the environment.

  • Is the issue happening intermittently or always? If intermittent, provide successful and failed command ID(s).

  • Was it running successfully before?

  • Have there been any changes to your environment lately that could contribute to the issue? If so, what did you change in your query environment (QDS, Cloud, and so on) before you ran this query?

  • What troubleshooting steps (if any) have you tried?

  • What is the business impact of this problem? Provide a description to further describe how it matches the selected ticket priority.

    For example:

    • Impact on internal SLA or production workloads?
    • Impact on new development with <deadline> (please specify)?
    • Testing only (low urgency)?
  • Provide references to previous tickets for the same issue if you have raised them before.

  • If there are any other observations about this issue, please provide details.

Using Qubole General Help

To report an issue related to QDS, create a ticket with Qubole Support, which logs a Helpdesk ticket. The Qubole Support team ensures that each Helpdesk ticket is attended to and resolved as quickly as possible.

Troubleshooting Query Problems – Before You Contact Support describes a checklist that helps you gather the required information before logging a Helpdesk ticket or contacting the Qubole Support team.

QDS Help Center describes how to use the Help Center to find more information, such as the documentation and the support Helpdesk.

Troubleshooting Airflow Issues

This topic describes best practices and common Airflow issues along with their solutions.

Cleaning up Root Partition Space by Removing the Task Logs

You can set up a cron job to clean up the root partition space filled by task logs. An Airflow cluster usually runs for a long time, so it can accumulate large volumes of logs, which can create issues for the scheduled jobs. To clear the logs, set up a cron job by following these steps:

  1. Edit the crontab:

    sudo crontab -e

  2. Add the following line at the end and save the file (a dry run of the find command is shown after these steps):

    0 0 * * * /bin/find $AIRFLOW_HOME/logs -type f -mtime +7 -exec rm -f {} \;
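Before adding the cron entry, you can dry-run the find command to list (without deleting) the task log files that would be removed; this assumes $AIRFLOW_HOME is set, as it is on the cluster node (see the FAQs below):

# List task log files older than 7 days without deleting them
/bin/find $AIRFLOW_HOME/logs -type f -mtime +7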

Using macros with Airflow

Macros on Airflow describes how to use macros.

Common Issues with Possible Solutions

Issue 1: A DAG has X tasks but only Y of them are running

Check the DAG concurrency settings in the Airflow configuration file (airflow.cfg), for example as shown below.
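A quick way to inspect the relevant settings is to grep the configuration file on the cluster node. The key names below (dag_concurrency, parallelism, max_active_runs_per_dag) are standard Airflow settings and may vary slightly with the Airflow version:

# Show the concurrency-related settings in airflow.cfg
grep -E "^(dag_concurrency|parallelism|max_active_runs_per_dag)" $AIRFLOW_HOME/airflow.cfg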

Issue 2: One of the DAGs is difficult to trigger

Check the connection ID used in the task/Qubole operator; there could be an issue with the API token used in the connection. To check the connection ID, navigate to Airflow Webserver > Admin > Connections. Also check the datastore connection (sql_alchemy_conn) in the Airflow configuration file (airflow.cfg), for example as shown below. If there is no issue with either of these, create a ticket with Qubole Support.
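For example, you can verify the datastore connection string directly on the cluster node; this is only a quick check and does not validate the API token itself:

# Show the datastore connection configured for Airflow
grep "^sql_alchemy_conn" $AIRFLOW_HOME/airflow.cfg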

Issue 3: Tasks for a specific DAG get stuck

Check if the depends_on_past property is enabled in airflow.cfg file. Based on the property, you can choose to do one of these appropriate solutions:

  1. If depends_on_past is enabled, check the runtime of the last task that has run successfully or failed before the task gets stuck. If the runtime of the last successful or failed task is greater than the frequency of the DAG, then DAG/tasks are stuck for this reason. It is an open-source bug. Create a ticket with Qubole Support to clear the stuck task. Before creating a ticket, gather the information as mentioned in Troubleshooting Query Problems – Before You Contact Support.
  2. If depends_on_past is not enabled, create a ticket with Qubole Support. Before creating a ticket, gather the information as mentioned in Troubleshooting Query Problems – Before You Contact Support.
Issue 4: A DAG cannot be run manually

If you are unable to manually run a DAG from the UI, do these steps:

  1. Go to line 902 of the /usr/lib/virtualenv/python27/lib/python2.7/site-packages/apache_airflow-1.9.0.dev0+incubating-py2.7.egg/airflow/www/views.py file.
  2. Change from airflow.executors import CeleryExecutor to from airflow.executors.celery_executor import CeleryExecutor. (A sed sketch of this change follows these steps.)
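Here is a sketch of applying the change from step 2 with sed; the file path is the one from step 1, and you should back up the file first:

VIEWS_FILE="/usr/lib/virtualenv/python27/lib/python2.7/site-packages/apache_airflow-1.9.0.dev0+incubating-py2.7.egg/airflow/www/views.py"
# Back up the original file, then rewrite the import
sudo cp "$VIEWS_FILE" "$VIEWS_FILE.bak"
sudo sed -i 's/from airflow.executors import CeleryExecutor/from airflow.executors.celery_executor import CeleryExecutor/' "$VIEWS_FILE"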

Questions on Airflow Service Issues

Here is a list of FAQs that are related to Airflow service issues with corresponding solutions.

  1. Which logs do I look up for Airflow cluster startup issues?

    Refer to the Airflow services logs, which are written during cluster startup.

  2. Where can I find Airflow Services logs?

    The Airflow services are the Scheduler, Webserver, Celery, and RabbitMQ. The service logs are available at /media/ephemeral0/logs/airflow inside the cluster node. Because Airflow runs on a single node, all logs are accessible on that node. These logs are helpful for troubleshooting cluster bring-up and scheduling issues.

  3. What is $AIRFLOW_HOME?

    $AIRFLOW_HOME is a location that contains all configuration files, DAGs, plugins, and task logs. It is an environment variable set to /usr/lib/airflow for all machine users.

  4. Where can I find Airflow Configuration files?

    The configuration file is present at $AIRFLOW_HOME/airflow.cfg.

  5. Where can I find Airflow DAGs?

    The DAG files are available in the $AIRFLOW_HOME/dags folder.

  6. Where can I find Airflow task logs?

    The task logs are available in $AIRFLOW_HOME/logs.

  7. Where can I find Airflow plugins?

    Plugins are available in $AIRFLOW_HOME/plugins.

  8. How do I restart Airflow Services?

    You can do start/stop/restart actions on an Airflow service and the commands used for each service are given below:

    • Run sudo monit <action> scheduler for Airflow Scheduler.
    • Run sudo monit <action> webserver for Airflow Webserver.
    • Run sudo monit <action> worker for Celery workers. A stop operation gracefully shuts down the existing workers. A start operation adds workers as per the configuration. A restart operation gracefully shuts down the existing workers and then adds workers as per the configuration.
    • Run sudo monit <action> rabbitmq for RabbitMQ.
  9. How do I invoke Airflow CLI commands within the node?

    Airflow is installed inside a virtual environment at /usr/lib/virtualenv/python27. First activate the virtual environment by running source /usr/lib/virtualenv/python27/bin/activate, and then run the Airflow command.

  10. How do I view the Airflow processes using the Monit dashboard?

    Navigate to the Clusters page and select Monit Dashboard from the Resources drop-down list of an up-and-running cluster. To know more about how to use the Monit dashboard, see monitoring-through-monit-dashboard.

  11. How do I manage the Airflow processes using the Monit dashboard when the status is Failed or Does not exist?

    If the status of the process is Execution failed or Does not exist, you need to restart the process. To know more about how to restart the process with the help of the Monit dashboard, see monitoring-through-monit-dashboard.

Questions on DAGs

Is there any button to run a DAG on Airflow?

There is no button to run a DAG in the QDS UI, but the Airflow 1.8.2 web server UI provides one.

How do I delete a DAG?

Deleting a DAG is still not very intuitive in Airflow. QDS provides its own implementation for deleting DAGs, but you must be careful using it.

To delete a DAG, submit the following command from the Workbench page of the QDS UI:

airflow delete_dag dag_id -f

The above command deletes the DAG Python code along with its history from the data source. Two types of errors may occur when you delete a DAG:

  • DAG isn't available in Dagbag:

    This happens when the DAG Python code is not found in the cluster’s DAG location. In that case, nothing can be done from the UI and manual inspection is needed.

  • Active DAG runs:

    If there are active DAG runs pending for the DAG, then QDS cannot delete it. In such a case, you can visit the DAG and mark all tasks under those DAG runs as completed and try again.

Error message when deleting a DAG from the UI

The following error message might appear when you delete a DAG from the QDS UI in Airflow v1.10.x:

<dag Id> is still in dagbag(). Remove the DAG file first.

Here is how the error message appears in the Airflow UI:

_images/Airflow_DAG_delete_error.png

The reason for this error message is that deleting a DAG from the UI causes the metadata of the DAG to be deleted, but not the DAG file itself. In Airflow 1.10.x, the DAG file must be removed manually before deleting the DAG from the UI. To remove the DAG file, perform the following steps:

  1. ssh into the Airflow cluster.
  2. Go to the following path: /usr/lib/airflow/dags
  3. Run the following command: grep -R "<dag_name_you_want_to_delete>". This command will return the file path linked to this DAG.
  4. Delete the DAG file using the following command: rm <file_name>
  5. With the DAG file removed, you can now delete the DAG from the QDS UI. (A combined sketch of these steps is shown below.)
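The steps above can be combined into a short shell session; the DAG name and the file name below are placeholders, so substitute the values that apply to your DAG:

cd /usr/lib/airflow/dags
grep -R "my_dag_name" .        # note the file path that grep reports
rm ./my_dag_definition.py      # substitute the file reported by grep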

If you still face issues with deleting a DAG, raise a ticket with Qubole Support.

Can I create a configuration to externally trigger an Airflow DAG?

No, but you can trigger DAGs from the QDS UI using the shell command airflow trigger_dag <DAG>....

If there is no connection password, the qubole_example_operator DAG will fail when it is triggered.

Troubleshooting Hadoop Issues

This section describes how to troubleshoot Hadoop issues related to memory and disk space. They are categorized as follows:

Accessing the Logs of Hadoop Components

Here are the log locations of Hadoop components:

  • The logs of ResourceManager/NodeManager are saved in /media/ephemeral0/logs/yarn.
  • The logs of NameNode/DataNode are saved in /media/ephemeral0/logs/hdfs.
  • The logs of the EBS upscaling are saved in /media/ephemeral0/logs/others/disk_check_daemon.log.
  • The logs written by the node bootstrap script are saved in /media/ephemeral0/logs/others/node_bootstrap.log.
  • The logs of the autoscaling-nodes are saved in /media/ephemeral0/logs/yarn/autoscaling.log or /media/ephemeral0/logs/yarn/scaling.log.

Hadoop Logs Location provides the location of YARN logs, daemon logs, and the MapReduce history files.

Disk Space Issues in Hadoop

This topic describes how to troubleshoot a few common Hadoop disk space issues.

Handling a Disk Space Issue When Creating a Directory

While running Hadoop jobs, you can hit this exception: cannot create directory :No space left on device.

This exception usually appears when the HDFS disk space is full. In Qubole, only temporary/intermediate data is written to HDFS, and a cron job runs regularly to delete these temporary files. This issue is typically seen in cases such as:

  • Long-running jobs that write a lot of intermediate data, which the cron job cannot delete while the jobs are still running.
  • Long-running clusters where, in rare cases, the data written by failed or killed tasks may not get deleted.

Solution: Verify the actual cause by checking the HDFS disk usage using one of these methods:

  • On the Qubole UI, through the DFS Status from the running cluster’s UI page.

  • By logging into the cluster node and running this command:

    hadoop dfsadmin -report

    A sample response is mentioned here.

    Configured Capacity: 153668681728 (143.12 GB)
    Present Capacity: 153668681728 (143.12 GB)
    DFS Remaining: 153555091456 (143.01 GB)
    DFS Used: 113590272 (108.33 MB)
    DFS Used%: 0.07%
    Under replicated blocks: 33
    Blocks with corrupt replicas: 0
    Missing blocks: 0
    
    -------------------------------------------------
    Live datanodes (2):
    
    Name:x.x.x.x:50010 (ip-x-x-x-x.ec2.internal)
    Hostname:ip-x-x-x-x.ec2.internal
    Decommission Status : Normal
    Configured Capacity: 76834340864 (71.56 GB)
    DFS Used: 56795136 (54.16 MB)
    Non DFS Used: 0 (0 B)
    DFS Remaining: 76777545728 (71.50 GB)
    DFS Used%: 0.07%
    DFS Remaining%: 99.93%
    Configured Cache Capacity: 0 (0 B)
    Cache Used: 0 (0 B)
    Cache Remaining: 0 (0 B)
    Cache Used%: 100.00%
    Cache Remaining%: 0.00%
    Xceivers: 2
    Last contact: Tue Dec 26 11:21:19 UTC 2017
    
    
    Name: x.x.x.x:50010 (ip-x-x-x-x.ec2.internal)
    Hostname: ip-x-x-x-x.ec2.internal
    Decommission Status : Normal
    Configured Capacity: 76834340864 (71.56 GB)
    DFS Used: 56795136 (54.16 MB)
    Non DFS Used: 0 (0 B)
    DFS Remaining: 76777545728 (71.50 GB)
    DFS Used%: 0.07%
    DFS Remaining%: 99.93%
    Configured Cache Capacity: 0 (0 B)
    Cache Used: 0 (0 B)
    Cache Remaining: 0 (0 B)
    Cache Used%: 100.00%
    Cache Remaining%: 0.00%
    Xceivers: 2
    Last contact: Tue Dec 26 11:21:21 UTC 2017
    
Handling a Device Disk Space Error

While running jobs, you may hit this exception - java.io.IOException: No space left on device.

Cause: This exception usually appears when there is no disk space on the worker or coordinator nodes. You can confirm this by logging into the corresponding node and running a df -h on the node when the query is still running.

Solution: You can avoid this error by one of these solutions:

  1. Enable EBS autoscaling. After enabling it, additional EBS volumes can be attached based on the query’s requirements.
  2. You can also try using cluster instance types with larger disk space.

Memory Issues in Hadoop

This topic describes memory issues in Hadoop and YARN, as listed here:

Job History Server Memory Issues

While running Hadoop jobs, you may get the org.apache.hadoop.ipc.RemoteException shown below.

Caused by: org.apache.hadoop.ipc.RemoteException(java.lang.OutOfMemoryError): GC overhead limit exceeded

at org.apache.hadoop.ipc.Client.call(Client.java:1471)
at org.apache.hadoop.ipc.Client.call(Client.java:1402)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
at com.sun.proxy.$Proxy48.getJobReport(Unknown Source)
at org.apache.hadoop.mapreduce.v2.api.impl.pb.client.MRClientProtocolPBClientImpl.getJobReport(MRClientProtocolPBClientImpl.java:133)

Error Implication: The above exception indicates an out-of-memory issue in the Job History Server (JHS).

Solution: Qubole recommends using a larger coordinator node with at least 60 GB RAM, which allows a 4 GB heap for the JHS. If you still face the issue, you can increase the JHS memory by using the node bootstrap.

Increase the JHS memory by adding the following script to the node bootstrap.

#increase the JHS memory to 8G
sudo echo 'export HADOOP_JOB_HISTORYSERVER_HEAPSIZE="8192"' >> /etc/hadoop/mapred-env.sh

#restart the JHS
sudo -u mapred /usr/lib/hadoop2/sbin/mr-jobhistory-daemon.sh stop historyserver
sudo -u mapred /usr/lib/hadoop2/sbin/mr-jobhistory-daemon.sh start historyserver
YARN ApplicationMaster Exceeding its Physical Memory Limit Error

While running a Hadoop job, you may get the following error about the ApplicationMaster exceeding its physical memory limit.

Application application_<XXXXXXXXXXXXXX> failed 1 times due to AM Container for appattempt_<XXXXXXXXXXXXX> exited
with exitCode: -104

Diagnostics: Container pid=<XXXX>,containerID=container_<XXXXXXXXXXXXX> is running beyond physical memory limits.
Current usage: 5.2 GB of 4.8 GB physical memory used; 27.5 GB of 10.1 GB virtual memory used. Killing container.
Dump of the process-tree for container_<XXXXXXXXXXXXX>

Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
Failing this attempt. Failing the application.

Error implication:

The error implies that the YARN container is running beyond its physical memory limits.

Solution:

YARN provides processing capacity to each application by allocating containers. A container is the basic unit of processing capacity in YARN and is an encapsulation of resource elements (memory, cores, and so on). This issue can usually be resolved by configuring appropriate values for the parameters below:

  • yarn.app.mapreduce.am.command-opts=-Xmx<XXXX>m. It sets the JVM arguments for an Application Master.
  • yarn.app.mapreduce.am.resource.mb=<XXXX>. It sets the container size.

Note

The container size must be greater than the heap size, which is set by the yarn.app.mapreduce.am.command-opts parameter.

Handling Exceeded Physical Memory Limit Error in a Mapper

When a mapper exceeds its physical memory limit, you would see this error in its logs. For information on the logs’ location, see Accessing the Logs of Hadoop Components.

Application application_<XXXXXXXXXXXXXX> failed 1 times due to AM Container for appattempt_<XXXXXXXXXXXXX> exited with exitCode: -104

Diagnostics: Container pid=<XXXX>,containerID=container_<XXXXXXXXXXXXX> is running beyond physical memory limits.
Current usage: 5.2 GB of 4.8 GB physical memory used; 27.5 GB of 10.1 GB virtual memory used. Killing container.
Dump of the process-tree for container_<XXXXXXXXXXXXX>

Solution: Try increasing the mapper’s container and heap memory using these configuration properties.

mapreduce.map.java.opts = -Xmx<XXXX>m;
mapreduce.map.memory.mb = <XXXX>;
Handling Exceeded Physical Memory Limit Error in a Reducer

When a reducer exceeds its physical memory limit, you would see this error in its logs. For information on the logs’ location, see Accessing the Logs of Hadoop Components.

Application application_<XXXXXXXXXXXXXX> failed 1 times due to AM Container for appattempt_<XXXXXXXXXXXXX> exited with exitCode: -104

Diagnostics: Container pid=<XXXX>,containerID=container_<XXXXXXXXXXXXX> is running beyond physical memory limits.
Current usage: 5.2 GB of 4.8 GB physical memory used; 27.5 GB of 10.1 GB virtual memory used. Killing container.
Dump of the process-tree for container_<XXXXXXXXXXXXX>

Solution: Try increasing the reducer’s container and heap memory using this configuration.

mapreduce.reduce.java.opts = -Xmx<XXXX>m;
mapreduce.reduce.memory.mb = <XXXX>;
Handling the Java Heap Space Error

Exception: java.lang.OutOfMemoryError: Java heap space

If you are seeing this error in the mapper task, try to increase the mapper memory.

set mapreduce.map.memory.mb=<XXXX>;
set mapreduce.map.java.opts=-Xmx<XXXX>m;

If you are seeing this error in the reducer task, try to increase the reducer memory.

set mapreduce.reduce.memory.mb=<XXXX>;
set mapreduce.reduce.java.opts=-Xmx<XXXX>m;
Hadoop Client-side Memory Issues

If you get Java heap space or exceeded Garbage Collection (GC) overhead limit errors on the client node while running a Hadoop job, increase the client JVM heap size for the job by setting the HADOOP_CLIENT_OPTS variable, as shown in this example.

sed -i "s/HADOOP_CLIENT_OPTS=\"-Xmx512m/HADOOP_CLIENT_OPTS=\"-Xmx4000m/g" /usr/lib/hadoop2/etc/hadoop/hadoop-env.sh

When you get the exception Failed to sanitize XML document destined for handler class while running Hadoop jobs, it indicates low client-side memory. You can increase the client-side memory as shown in this example:

export HADOOP_CLIENT_OPTS=-Xmx1024m

Troubleshooting Presto Issues

This topic describes common Presto issues with suitable workarounds. The issues are categorized as follows:

Presto Query Issues

This topic describes common Presto query issues and their solutions:

Handling Memory Issues

When you hit memory issues in Presto queries, as a workaround, perform the following steps:

  1. Use a bigger cluster by increasing the maximum worker node count.
  2. Add a limit clause for all subqueries.
  3. Use a larger cluster instance.

Presto Configuration Properties describes the query execution configuration properties along with other settings.

Common Issues and Potential Solutions

Here are some common issues in Presto with potential solutions.

Query exceeded max memory size of <XXXX> GB

This issue appears when the memory limit is exhausted at the cluster level. Set a higher value for query.max-memory. This is a cluster-level limit that denotes the maximum memory a query can consume, aggregated across all nodes.

Query exceeded local memory limit of <XXXX> GB

Increase the value of query.max-memory-per-node to up to 40% of the worker instance memory. The query.max-memory-per-node property determines the maximum memory that a query can take up on a node.

Here are recommendations to avoid memory issues:

  • If the larger table is on the right side of a JOIN, Presto is likely to error out. Ideally, put the smaller table on the right side and the bigger table on the left side of the JOIN.
  • Alternatively, use distributed JOINs. By default, Presto uses map-side JOINs, but you can also enable reduce-side (distributed) JOINs. You can also rework the query to bring down its memory usage.
No nodes available to run the query

When the coordinator node cannot find a node to run the query, one common reason is that the cluster is not configured properly. This can be a generic error that needs further triage to find the root cause. The same error message is also seen when no data source is attached for the connector.

Ensure that the connector data source configuration is correct and that the catalog properties are defined as shown below.

_images/PrestoConnectorDatasourceConfig.png

This might also happen due to a configuration error in which the worker daemons did not come up, or because nodes died due to an out-of-memory error. Check server.log on the worker nodes.

This can also be seen when the coordinator node is too small to keep up with heartbeat collection.

Presto Queries Failing Sporadically with java.net.SocketTimeoutException

Presto queries may fail with the following java.net.SocketTimeoutException when a custom Hive metastore (HMS) is used, even when connectivity to the metastore is good.

2019-06-20T16:01:40.570Z    ERROR    transaction-finishing-12    com.facebook.presto.transaction.TransactionManager    Connector threw exception on abort
com.facebook.presto.spi.PrestoException: 172.18.40.110: java.net.SocketTimeoutException: Read timed out
at com.facebook.presto.hive.metastore.ThriftHiveMetastore.getTable(ThriftHiveMetastore.java:214)
at com.facebook.presto.hive.metastore.BridgingHiveMetastore.getTable(BridgingHiveMetastore.java:74)
at com.facebook.presto.hive.metastore.CachingHiveMetastore.loadTable(CachingHiveMetastore.java:362)
at com.facebook.presto.hive.metastore.CachingHiveMetastore.access$500(CachingHiveMetastore.java:64)
at com.facebook.presto.hive.metastore.CachingHiveMetastore$6.load(CachingHiveMetastore.java:210)
at com.facebook.presto.hive.metastore.CachingHiveMetastore$6.load(CachingHiveMetastore.java:205)
at com.google.common.cache.CacheLoader$1.load(CacheLoader.java:182)
at com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3716)
at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2424)
at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2298)
at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2211)
at com.google.common.cache.LocalCache.get(LocalCache.java:4154)
at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:4158)
at com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:5147)
at com.facebook.presto.hive.metastore.qubole.QuboleLoadingCache.get(QuboleLoadingCache.java:57)
at com.facebook.presto.hive.metastore.CachingHiveMetastore.get(CachingHiveMetastore.java:312)
at com.facebook.presto.hive.metastore.CachingHiveMetastore.getTable(CachingHiveMetastore.java:356)
at com.facebook.presto.hive.metastore.CachingHiveMetastore.loadTable(CachingHiveMetastore.java:362)
at com.facebook.presto.hive.metastore.CachingHiveMetastore.access$500(CachingHiveMetastore.java:64)
at com.facebook.presto.hive.metastore.CachingHiveMetastore$6.load(CachingHiveMetastore.java:210)
at com.facebook.presto.hive.metastore.CachingHiveMetastore$6.load(CachingHiveMetastore.java:205)
at com.google.common.cache.CacheLoader$1.load(CacheLoader.java:182)
at com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3716)
at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2424)

Solution:

Note

First, check the Hive metastore logs to understand why the SocketTimeout occurred. HMS logs are available on the coordinator node at /media/ephemeral0/logs/hive<version>/hive_ms.log. The HMS logs show the relevant errors, and you should take steps according to the errors that you see. Depending on the errors, increasing the Thrift heap memory or increasing the socket timeout helps in resolving this issue.

  1. Increase the size of the coordinator node to enhance its memory. (For example, increase the coordinator node’s memory from 30 GB RAM to 60 GB RAM.)

  2. Create a ticket with Qubole Support to increase the metastore’s maximum heap memory at the cluster level. (If you get the metastore’s memory increased at the account-level through Qubole Support, then it applies to all clusters.)

  3. Increase the socket-timeout values of the custom Hive metastore by overriding the default values. Pass them as catalog/hive.properties in the cluster’s Override Presto Configuration that is under Advanced Configuration > PRESTO SETTINGS as shown in this example.

    catalog/hive.properties:
    hive.metastore-timeout=3m
    hive.s3.connect-timeout=3m
    hive.s3.socket-timeout=3m
    

    For more information on cluster settings, see Managing Clusters.

Presto Server and Cluster Issues

This section describes issues related to the Presto server and cluster, along with their solutions:

Handling Presto Server Connection Issues

If you get this error message while trying to connect to a Presto cluster:

Error running command: Server refused connection:

One possible workaround is to ensure that you have provided access to the Qubole public buckets so that the Presto cluster can boot up.

Trace the Presto logs that are at the location(s) below:

  • On the cluster, logs are at: /media/ephemeral0/presto/var/log or /usr/lib/presto/logs

You can go to the logs location on the cluster using these commands.

[ec2-user@ip-XX-XXX-XX-XX logs]$ cd /media/ephemeral0/presto/var/log
[ec2-user@ip-XX-XXX-XX-XX log]$ pwd
/media/ephemeral0/presto/var/log
[ec2-user@ip-XX-XXX-XX-XX log]$ ls -ltr
total 692

-rw-r--r-- 1 root root 231541 Dec 18 07:10 gc.log
-rw-r--r-- 1 root root 248166 Dec 18 07:10 launcher.log
-rw-r--r-- 1 root root 160394 Dec 18 07:10 server.log
-rw-r--r-- 1 root root  40822 Dec 18 07:10 http-request.log

The different types of logs are:

  • server.log: For any job failure in Presto, it is important to check the Presto server log, which contains error stack traces, warning messages, and so on.
  • launcher.log: A Python process starts the Presto process, and the logs for that Python process go to launcher.log. If you do not find anything in server.log, the next option is to check launcher.log.
  • gc.log: This log is helpful for analyzing why a job is running long or why a query is stuck. It is quite verbose, which makes it useful for looking at the Garbage Collection (GC) pauses caused by minor and full GC.
  • http-request.log: This log records the incoming requests to the Presto server and the responses from the Presto server.
Handling the Exception - Encountered too many errors talking to a worker node

This can be a generic error message, so you must check the logs. Handling Presto Server Connection Issues mentions the location of the logs on the cluster.

Here are a few common causes of the error:

  • The node may have run out of memory; this shows up in the launcher.log of the worker node.
  • There is a high Garbage Collection (GC) pause on the node; this shows up in the gc.log of the worker node.
  • The coordinator node is too busy to receive heartbeats from the node; this shows up in the server.log of the coordinator node.
Handling Query Failures due to an Exceeded Memory Limit

A query failure due to an exceeded maximum memory limit may be the result of incorrect property values overridden in the cluster. The overridden values may not be required, or the property names may be misspelled or incorrectly entered.

Handling the Exception - Server did not reply

When you get the Server did not reply exception, check the logs and look for the phrase SERVER STARTED. If the phrase is not in the logs, then there can be an error in the overridden Presto configuration on the cluster.
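For example, you can check for the startup marker directly on the cluster; the log path below is the one listed under Handling Presto Server Connection Issues:

# Check whether the Presto server reported a successful start
grep "SERVER STARTED" /media/ephemeral0/presto/var/log/server.log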

Investigating Ganglia Reports

Ganglia is a monitoring system for distributed systems. You can access the Ganglia Monitoring page by navigating to Control Panel > Clusters. Under the Resources column for the running cluster in question, there is a Ganglia Metrics link. If that link does not exist, an administrator needs to enable it for the cluster in question.

For more information on how to enable Ganglia monitoring, see Performance Monitoring with Ganglia.

Ganglia provides visibility into many detailed metrics, such as presto-jvm.metrics, disk metrics, CPU metrics, memory metrics, and network metrics. It is crucial for understanding system resource utilization during specific windows of time and for troubleshooting performance issues.

Investigating Datadog Metrics

presto-system-metrics describes the list of metrics that can be seen on the Datadog monitoring service. It also describes the abnormalities and actions that you can perform to handle abnormalities.

Handling the query.max-memory-per-node configuration

The maximum memory a query can take up on a node is defined by the query.max-memory-per-node configuration property. Its value only applies to the worker nodes and does not apply to the cluster’s coordinator node.

If the value of query.max-memory-per-node is set to more than 42% of the physical memory, cluster failures occur. For more information, see the query execution properties table under Presto Configuration Properties.

If the queries are failing with the maximum memory limit exceeded exception, then reduce the value of query.max-memory-per-node by overriding it in the cluster’s Override Presto Configuration. You can also try reducing the worker node size.

Handling Presto Query Failures due to the Abnormal Server Shutdown

Sometimes, when you run the node bootstrap scripts, you can see the Presto queries intermittently fail with the following error.

2017-09-13T23:05:19.309Z    ERROR    remote-task-callback-828    com.facebook.presto.execution.StageStateMachine    Stage 20170913_230512_00045_9tvic.21 failed
com.facebook.presto.spi.PrestoException: Server is shutting down. Task 20170913_230512_00045_9tvic.21.8 has been canceled
at com.facebook.presto.execution.SqlTaskManager.close(SqlTaskManager.java:227)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at io.airlift.bootstrap.LifeCycleManager.stop(LifeCycleManager.java:135)
at io.airlift.bootstrap.LifeCycleManager$1.run(LifeCycleManager.java:101)

Solution: This error occurs when the node bootstrap scripts of the cluster contain the presto server stop command. Otherwise, the scripts may have just caused the Presto server to abnormally shut down.

To resolve or avoid this error, run a node bootstrap script for Presto changes using the Qubole Presto Server bootstrap, which is an alternative to the node bootstrap. For more information, see Using the Qubole Presto Server Bootstrap.

For other changes, you may still have to use the node bootstrap script.

Tuning Presto

This topic describes tips for tuning parallelism and memory in Presto. The tips are categorized as follows:

Tuning Parallelism at a Task Level

The number of splits in a cluster = node-scheduler.max-splits-per-node * number of worker nodes. The node-scheduler.max-splits-per-node property denotes the target value for the total number of splits that can be running on any worker node; its default value is 100. For example, with the default value of 100 and 10 worker nodes, up to 1000 splits can be running in the cluster at a time.

If there are queries submitted in large batches or for connectors that produce many splits, which get completed quickly, then it is better to set a higher value for node-scheduler.max-splits-per-node. The higher value may improve the query latency as it ensures that the worker nodes have sufficient splits to completely engage them.

Conversely, if you set a very high value, it may lower performance because the splits may not be balanced across workers. Typically, set the value so that at any time there is only one split waiting to be processed.

Note

If a query on the Hive catalog suffers from low parallelism because too few splits are being generated, you can use hive.max-initial-splits and hive.max-initial-split-size to achieve higher parallelism.

Tuning Parallelism at an Operator Level

The task concurrency denotes the default local concurrency for parallel operators such as JOINs and AGGREGATIONs. Its default value is 16, and the value must be a power of 2. You can increase or reduce the value depending on the query concurrency and worker node utilization, as described below:

  • Lower values are better for clusters running many queries concurrently, because the running queries already use the cluster nodes. In such a case, increasing the concurrency causes context switching and other overheads, which slows down query execution.
  • Higher values are better for clusters that run just one query or a few queries.

You can set the operator concurrency at the cluster level using the task.concurrency property. You can also specify the operator concurrency at the session level using the task_concurrency session property.

Tuning Memory

Presto features these three Memory Pools to manage the available resources:

  • General Pool
  • Reserved Pool
  • System Pool

All queries are initially submitted to the General Pool. As long as the General Pool has memory, queries continue to run in it. Once it runs out of memory, the query using the highest amount of memory in the General Pool is moved to the Reserved Pool; that query then runs in the Reserved Pool while the other queries continue to run in the General Pool. While the Reserved Pool is running a query, if the General Pool runs out of memory again, the query using the highest amount of memory in the General Pool is moved to the Reserved Pool, but it does not resume execution until the query currently running in the Reserved Pool finishes. The Reserved Pool can hold multiple queries, but it allows only one query to execute at any given time.

The System Pool provides memory for operations whose memory Presto does not track, such as network buffers and I/O buffers.

This table describes the memory parameters.

Memory Type Description Parameter and default value
maxHeap The JVM container size. Defaults to up to 70% of instance memory.
System Memory The overhead allocation. Defaults to 40% of maxHeap.
Reserved Memory If General Memory is exhausted and jobs require more memory, Reserved Memory is used by one job at a time to ensure progress until General Memory is available again. query.max-memory-per-node
Total Query Memory The total tracked memory used by the query. It applies to Presto 0.208 and later versions. query.max-total-memory-per-node
General Memory The first stop for all jobs. maxHeap - Reserved Memory - System Memory
Query Memory The maximum memory for a job across the cluster. query_max_memory is the session property and query.max-memory is the cluster-level property.
Tips to Avoid Memory Issues

Presto delays jobs when there are not enough split slots to support the dataset. Jobs fail when there is not sufficient memory to process the query. If any of the conditions below apply to the current environment, the configuration is not powerful enough and you can expect job lag and failures.

  • Reserved Memory * Number of Nodes < Peak Job Size: As a first recommendation, increase the Reserved Memory. However, increasing the Reserved Memory can impact concurrency, as the General Pool shrinks accordingly. As a second recommendation, use a larger instance.
  • General Memory * Number of Nodes < Average Job Size * Concurrent Jobs: As a first recommendation, increase the Reserved Memory. However, increasing the Reserved Memory can shrink the General Pool. When it is not possible to shrink the Reserved Pool, use a larger instance.
  • Reserved Memory * Number of Nodes < Query Memory: Adjust the setting.
  • Reserved Memory * Number of Nodes < Query Memory Limit: Adjust the setting.
Disabling Reserved Pool

Presto version 0.208 adds an experimental configuration property, experimental.reserved-pool-enabled, that allows you to disable the Reserved Pool. The Reserved Pool prevents deadlocks when memory is exhausted in the General Pool by promoting the biggest query to the Reserved Pool. However, only one query gets promoted to the Reserved Pool, and queries in the General Pool become blocked whenever it is full. To avoid this scenario, you can set experimental.reserved-pool-enabled to false to disable the Reserved Pool.

When the Reserved Pool is disabled (experimental.reserved-pool-enabled=false), the General Pool can take advantage of the memory previously allocated to the Reserved Pool and support higher concurrency. To avoid deadlocks when the General Pool is full, enable Presto’s OOM killer by setting query.low-memory-killer.policy=total-reservation-on-blocked-nodes. When the General Pool is full on a worker node, the OOM killer resolves the situation by killing the query with the highest memory usage on that node. This allows queries with reasonable memory requirements to keep making progress, while a small number of high-memory queries may be killed to prevent them from monopolizing cluster resources.


Troubleshooting Hive Issues

When troubleshooting a failed Hive job or Hive application, you can analyze the command logs to identify the errors and exceptions to understand the root cause of the job failure.

Hive Tuning

OutOfMemory Issues

OutOfMemory issues are sometimes caused by there being too many files in split computation. To resolve this problem, increase the Application Master (AM) memory. To increase the AM memory, set the following parameters:

set tez.am.resource.memory.mb=<Size in MB>;
set tez.am.launch.cmd-opts=-Xmx<Size in MB>;

The default value for tez.am.resource.memory.mb is 1536 MB.
Block & Split Tuning

HDFS block size manages the storage of the data in the cluster and the split size drives how that data is read for processing by MapReduce. Make sure the block sizing and the Mapper maximum and minimum split size are not causing the creation of an unnecessarily large number of files.

dfs.blocksize       Sets the HDFS Block Size for storage - defaults to 128 MB
mapred.min.split.size       Sets the minimum split size - defaults to dfs.blocksize
mapred.max.split.size       Sets the maximum split size - defaults to dfs.blocksize

Configuring the split size boundaries for MapReduce may have cascading effects on the number of mappers created and the number of files each Mapper will access.

Blocks Required     Dataset Size / dfs.blocksize
Maximum Mappers Required    Dataset Size / mapred.min.split.size
Minimum Mappers Required    Dataset Size / mapred.max.split.size
Maximum Mappers per Block   Maximum Mappers Required / Blocks Required
Maximum Blocks per Mapper   Blocks Required / Minimum Mappers Required
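As an illustration of the formulas above, here is a purely hypothetical calculation for a 1 TB dataset with a 128 MB block size, a 128 MB minimum split size, and a 256 MB maximum split size (all values are examples, not recommendations):

# Illustrative shell arithmetic only; sizes are in MB
DATASET_MB=$((1024 * 1024))   # 1 TB dataset
BLOCK_MB=128                  # dfs.blocksize
MIN_SPLIT_MB=128              # mapred.min.split.size
MAX_SPLIT_MB=256              # mapred.max.split.size

echo "Blocks required:          $((DATASET_MB / BLOCK_MB))"        # 8192
echo "Maximum Mappers required: $((DATASET_MB / MIN_SPLIT_MB))"    # 8192
echo "Minimum Mappers required: $((DATASET_MB / MAX_SPLIT_MB))"    # 4096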
Parallelism Tuning

The number of tasks configured for worker nodes determines the parallelism of the cluster for processing Mappers and Reducers. As the slots get used by MapReduce jobs, there may be job delays due to constrained resources if the number of slots is not appropriately configured. Try to set maximums, not constants, so as to put boundaries on Hive without handcuffing it to a certain number of tasks.

mapred.tasktracker.map.tasks.maximum        Maximum number of map tasks
mapred.tasktracker.reduce.tasks.maximum     Maximum number of reduce tasks
Memory Tuning

If analysis of the tasks reveals that the memory utilization is low, consider modifying the memory allocation for the Hadoop cluster. Reducing the allocated memory for the tasks frees up space on the cluster and allows for an increase in the number of Mappers or Reducers.

mapred.map.child.java.opts  Java heap memory setting for the map tasks
mapred.reduce.child.java.opts       Java heap memory setting for the reduce tasks

Analyzing Hive Job Failures

Note

If you face any intermittent lock or dead lock issues during the custom Hive metastore migration, see Intermittent Lock and Deadlock Issues in Hive Metastore Migration to resolve them.

When there is an issue with a Hive job, you can first start by analyzing the job’s logs and results.

To analyze a job, navigate to the Workbench in the QDS UI and perform the following steps:

  1. Each Qubole job or command has a unique ID. You can search for a job using the command_id as depicted below.
_images/01_anal_hive_fail-cmd_id.png
  2. Any Hive or shell command job shows its logs in the bottom-right section of the UI. Generally, these logs show the number of MapReduce jobs the command is going to start, and each MapReduce job has its own Application UI link that opens a new browser tab and displays the job details.
_images/02a_anal_hive_fail-logs01.png _images/02a_anal_hive_fail-logs02.png
  3. The Application UI page shows important details, as mentioned below:
  • Job Status (Succeeded/Failed/Killed)
  • Total Mapper/Reducer tasks
  • Failed/Killed tasks
  • Counter Link: this table shows very useful counters, such as bytes_read and file_bytes_written. These counters are very useful for understanding the nature of a job. For example, they provide details about how much data is being read, how much data is being written to HDFS or cloud object storage, and so on.
Intermittent Lock and Deadlock Issues in Hive Metastore Migration

To prevent intermittent lock and deadlock issues that occur during the migration from a Qubole-managed Hive metastore to a custom-managed Hive metastore, Qubole recommends that you set the SQL transaction isolation level to READ COMMITTED.

The above configuration change may require restarting the RDS instance that hosts the metastore database.

For more information, see Migrating Data from Qubole Hive Metastore to a Custom Hive Metastore and Connecting to a Custom Hive Metastore.

Troubleshooting Errors and Exceptions in Hive Jobs

This topic provides information about the errors and exceptions that you might encounter when running Hive jobs or applications. You can resolve these errors and exceptions by following the respective workarounds.

Container memory requirement exceeds physical memory limits
Problem Description

A Hive job fails, and the error message below appears in the Qubole UI under the Logs tab of the Workbench page, or in the Mapper logs, Reducer logs, or ApplicationMaster logs:

Container [pid=18196,containerID=container_1526931816701_34273_02_000003] is running beyond physical memory limits.
Current usage: 2.2 GB of 2.2 GB physical memory used; 3.2 GB of 4.6 GB virtual memory used. Killing container.
Diagnosis

Three different kinds of failure can result in this error message:

  • Mapper failure
    • This error can occur because the Mapper is requesting more memory than the configured memory. The parameter mapreduce.map.memory.mb represents Mapper memory.
  • Reducer failure
    • This error can occur because the Reducer is requesting more memory than the configured memory. The parameter mapreduce.reduce.memory.mb represents Reducer memory.
  • ApplicationMaster failure
    • This error can occur when the container hosting the ApplicationMaster is requesting more than the assigned memory. The parameter yarn.app.mapreduce.am.resource.mb represents the memory allocated.
Solution

Mapper failure: Modify the two parameters below to increase the memory for Mapper tasks if a Mapper fails with the above error.

  • mapreduce.map.memory.mb: The upper memory limit that Hadoop allows to be allocated to a Mapper, in megabytes.
  • mapreduce.map.java.opts: Sets the heap size for a Mapper.

Reducer failure: Modify the two parameters below to increase the memory for Reducer tasks if a Reducer fails with the above error.

  • mapreduce.reduce.memory.mb: The upper memory limit that Hadoop allows to be allocated to a Reducer, in megabytes.
  • mapreduce.reduce.java.opts: Sets the heap size for a Reducer.

ApplicationMaster failure: Modify the two parameters below to increase the memory for the ApplicationMaster if the ApplicationMaster fails with the above error.

  • yarn.app.mapreduce.am.resource.mb: The amount of memory the ApplicationMaster needs, in megabytes.
  • yarn.app.mapreduce.am.command-opts: Sets the heap size for the ApplicationMaster.

Make sure that the heap size set by yarn.app.mapreduce.am.command-opts is less than yarn.app.mapreduce.am.resource.mb. Qubole recommends that the heap size set by yarn.app.mapreduce.am.command-opts be around 80% of yarn.app.mapreduce.am.resource.mb.

Example: Use the set command to update the configuration property at the query level.

set yarn.app.mapreduce.am.resource.mb=3500;

set yarn.app.mapreduce.am.command-opts=-Xmx2000m;

To update configs at the cluster level:

  • Add or update the parameters under Override Hadoop Configuration Variables in the Advanced Configuration tab in Cluster Settings and restart the cluster.
  • See also: MapReduce Configuration in Hadoop 2
GC overhead limit exceeded, causing out of memory error
Problem Description

A Hive job fails with an out-of-memory error “GC overhead limit exceeded,” as shown below.

java.io.IOException: org.apache.hadoop.ipc.RemoteException(java.lang.OutOfMemoryError): GC overhead limit exceeded
at org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:337)
at org.apache.hadoop.mapred.ClientServiceDelegate.getJobStatus(ClientServiceDelegate.java:422)
at org.apache.hadoop.mapred.YARNRunner.getJobStatus(YARNRunner.java:579)
at org.apache.hadoop.mapreduce.Job$1.run(Job.java:348)
at org.apache.hadoop.mapreduce.Job$1.run(Job.java:345)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
Diagnosis

This out-of-memory error comes from the getJobStatus method call and is likely an issue with the JobHistory server running out of memory. You can confirm this by checking the JobHistory server log on the coordinator node in /media/ephemeral0/logs/mapred. The JobHistory server log will show an out-of-memory exception stack trace like the one above.

The out of memory error for the JobHistory server usually happens in the following cases:

  1. The cluster coordinator node is too small and the JobHistory server is set to, for example, a heap size of 1 GB.
  2. The jobs are very large, with thousands of mapper tasks running.
Solution
  • Qubole recommends that you use a larger cluster coordinator node, with at least 60 GB RAM and a heap size of 4 GB for the JobHistory server process.
  • Depending on the nature of the job, even 4 GB for the JobHistory server heap size might not be sufficient. In this case, set the JobHistory server memory to a higher value, such as 8 GB, using the following bootstrap commands:
sudo echo 'export HADOOP_JOB_HISTORYSERVER_HEAPSIZE="8192"' >> /etc/hadoop/mapred-env.sh
sudo -u mapred /usr/lib/hadoop2/sbin/mr-jobhistory-daemon.sh stop historyserver
sudo -u mapred /usr/lib/hadoop2/sbin/mr-jobhistory-daemon.sh start historyserver
Mapper or reducer job fails because no valid local directory is found
Problem Description

Mapper or reducer job fails with the following error:

Caused by: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid directory for <file_path>

at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext$DirSelector.getPathForWrite(LocalDirAllocator.java:541)

at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:627)

at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:173)

at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:154)

at org.apache.tez.runtime.library.common.task.local.output.TezTaskOutputFiles.getInputFileForWrite(TezTaskOutputFiles.java:250)

at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.MapOutput.createDiskMapOutput(MapOutput.java:100)

at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.MergeManager.reserve(MergeManager.java:404)

at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.copyMapOutput(FetcherOrderedGrouped.java:476)

at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.copyFromHost(FetcherOrderedGrouped.java:278)

at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.fetchNext(FetcherOrderedGrouped.java:178)

at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.callInternal(FetcherOrderedGrouped.java:191)

at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.callInternal(FetcherOrderedGrouped.java:54)
Diagnosis

This error can appear on the Workbench page of the QDS UI or in the Hadoop Mapper or Reducer logs.

MapReduce stores intermediate data in local directories specified by the parameter mapreduce.cluster.local.dir in the mapred-site.xml file. During job processing, MapReduce checks these directories to see if there is enough space to create the intermediate files. If there is no directory that has enough space, the MapReduce job will fail with the error shown above.

Solution
  1. Make sure that there is enough space in the local directories, based on the requirements of the data to be processed (a quick check is sketched at the end of this section).
  2. You can compress the intermediate output files to minimize space consumption.

Parameters to be set for compression:

set mapreduce.map.output.compress = true;
set mapreduce.map.output.compress.codec = org.apache.hadoop.io.compress.SnappyCodec; -- Snappy will be used for compression
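
If you are unsure whether the configured local directories have enough free space, a quick check along the following lines can help. This is only a sketch: the mapred-site.xml location and the /media/ephemeral0 mount point shown here are common defaults and might differ on your cluster.

grep -A 1 'mapreduce.cluster.local.dir' /etc/hadoop/conf/mapred-site.xml   # show the configured local directories

df -h /media/ephemeral0   # check free space on the mount that hosts them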
Out of Memory error when using ORC file format
Problem Description

An Out of Memory error occurs while generating splits information when the ORC file format is used.

Diagnosis

The following logs appear on the Workbench page of the QDS UI under the Logs tab:

Caused by: java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.util.concurrent.FutureTask.report(FutureTask.java:122)
at java.util.concurrent.FutureTask.get(FutureTask.java:192)
at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1098)
... 15 more
Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
at com.google.protobuf.ByteString.copyFrom(ByteString.java:192)
at com.google.protobuf.CodedInputStream.readBytes(CodedInputStream.java:324)
at org.apache.hadoop.hive.ql.io.orc.OrcProto$StringStatistics.<init>(OrcProto.java:1331)
at org.apache.hadoop.hive.ql.io.orc.OrcProto$StringStatistics.<init>(OrcProto.java:1281)
at org.apache.hadoop.hive.ql.io.orc.OrcProto$StringStatistics$1.parsePartialFrom(OrcProto.java:1374)
Solution

The out-of-memory error can occur because the default ORC split strategy (HYBRID) requires more memory. Qubole recommends switching the ORC split strategy to BI by setting the parameter below (for example, with a set command at the query level):

set hive.exec.orc.split.strategy=BI;
Hive job fails when “lock wait timeout” is exceeded
Problem Description

A Hive job fails with the following error message:

Lock wait timeout exceeded; try restarting transaction.

The timeout occurs during INSERT operations that create partitions.

Diagnosis

The following content will appear in the hive.log file:

ERROR metastore.RetryingHMSHandler (RetryingHMSHandler.java:invoke(173)) - Retrying HMSHandler after 2000 ms (attempt 9 of 10) with error:
javax.jdo.JDODataStoreException: Insert of object “org.apache.hadoop.hive.metastore.model.MPartition@74adce4e” using statement
“INSERT INTO `PARTITIONS` (`PART_ID`,`TBL_ID`,`LAST_ACCESS_TIME`,`CREATE_TIME`,`PART_NAME`,`SD_ID`) VALUES (?,?,?,?,?,?)” failed :
Lock wait timeout exceeded; try restarting transaction

This MySQL transaction timeout can happen during heavy traffic on the Hive Metastore when the RDS server is too busy.

Solution

Try setting a higher value for innodb_lock_wait_timeout on the MySQL side. innodb_lock_wait_timeout defines the length of time in seconds an InnoDB transaction waits for a row lock before giving up. The default value is 50 seconds.
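
For example, you might check and then raise the timeout on the metastore database as follows. The 120-second value and the connection placeholders are illustrative; on a managed RDS instance, change innodb_lock_wait_timeout in the DB parameter group instead of using SET GLOBAL.

mysql -h <metastore-host> -u <user> -p -e "SHOW VARIABLES LIKE 'innodb_lock_wait_timeout';"

mysql -h <metastore-host> -u <user> -p -e "SET GLOBAL innodb_lock_wait_timeout = 120;"

Note that SET GLOBAL affects only new connections, so long-lived metastore connections may need to be re-established to pick up the change.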

Troubleshooting Spark Issues

When any Spark job or application fails, you should identify the errors and exceptions that cause the failure. You can access the Spark logs to identify errors and exceptions.

This topic provides information about the errors and exceptions that you might encounter when running Spark jobs or applications. You can resolve these errors and exceptions by following the respective workarounds.

You can also use Sparklens, an experimental open service available at http://sparklens.qubole.net, to identify potential optimization opportunities related to driver-side computations, lack of parallelism, skew, and so on. For more information about Sparklens, see the Sparklens blog.

Out of Memory Exceptions

Spark jobs might fail due to out-of-memory exceptions at the driver or executor end. When troubleshooting these exceptions, start by understanding how much memory and how many cores the application requires; these are the essential parameters for optimizing the Spark application. Based on the resource requirements, you can modify the Spark application parameters to resolve the out-of-memory exceptions.

For more information about resource allocation, Spark application parameters, and determining resource requirements, see An Introduction to Apache Spark Optimization in Qubole.
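
For example, a Spark Submit Command Line Options entry that explicitly sizes the driver and executors might look like the following; the values are illustrative and should be derived from your own resource requirements.

--conf spark.driver.memory=4g --conf spark.executor.memory=8g --conf spark.executor.cores=4 --conf spark.executor.instances=10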

Driver Memory Exceptions
Exception due to Spark driver running out of memory
  • Description: When the Spark driver runs out of memory, exceptions similar to the following exception occur.

    Exception in thread "broadcast-exchange-0" java.lang.OutOfMemoryError: Not enough memory to build and broadcast the table
    to all worker nodes. As a workaround, you can either disable broadcast by setting spark.sql.autoBroadcastJoinThreshold to -1
    or increase the spark driver memory by setting spark.driver.memory to a higher value
    
  • Resolution: Set a higher value for the driver memory, using one of the following commands in Spark Submit Command Line Options on the Workbench page:

    • --conf spark.driver.memory=<XX>g

      OR

    • --driver-memory <XX>G

Job failure because the Application Master that launches the driver exceeds memory limits
  • Description: A Spark job may fail when the Application Master (AM) that launches the driver exceeds the memory limit and is eventually terminated by YARN. The following error occurs:

    Diagnostics: Container [pid=<XXXXX>,containerID=container_<XXXXXXXXXX>_<XXXX>_<XX>_<XXXXXX>] is running beyond physical memory limits.
    Current usage: <XX> GB of <XX> GB physical memory used; <XX> GB of <XX> GB virtual memory used. Killing container
    
  • Resolution: Set a higher value for the driver memory, using one of the following commands in Spark Submit Command Line Options on the Workbench page:

    • --conf spark.driver.memory=<XX>g

      OR

    • --driver-memory <XX>G

    As a result, a higher value is set for the AM memory limit.

Executor Memory Exceptions
Exception because executor runs out of memory
  • Description: When the executor runs out of memory, the following exception might occur.

    Executor task launch worker for task XXXXXX ERROR Executor: Exception in task XX.X in stage X.X (TID XXXXXX)
    java.lang.OutOfMemoryError: GC overhead limit exceeded
    
  • Resolution: Set a higher value for the executor memory, using one of the following commands in Spark Submit Command Line Options on the Workbench page:

    • --conf spark.executor.memory=<XX>g

      OR

    • --executor-memory <XX>G

FetchFailedException due to executor running out of memory
  • Description: When the executor runs out of memory, the following exception may occur.

    ShuffleMapStage XX (sql at SqlWrapper.scala:XX) failed in X.XXX s due to org.apache.spark.shuffle.FetchFailedException:
    failed to allocate XXXXX byte(s) of direct memory (used: XXXXX, max: XXXXX)
    
  • Resolution: From the Workbench page, perform the following steps in Spark Submit Command Line Options:

    1. Set a higher value for the executor memory, using one of the following commands:

      • --conf spark.executor.memory=<XX>g

        OR

      • --executor-memory <XX>G

    2. Increase the number of shuffle partitions, using the following command: --conf spark.sql.shuffle.partitions=<XXXX> (see the combined example below).
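
    Combined, the entry in Spark Submit Command Line Options might look like the following (the values are illustrative):

      --executor-memory 8G --conf spark.sql.shuffle.partitions=400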

Executor container killed by YARN for exceeding memory limits
  • Description: When the container hosting the executor needs more memory for overhead tasks or executor tasks, the following error occurs.

    org.apache.spark.SparkException: Job aborted due to stage failure: Task X in stage X.X failed X times,
    most recent failure: Lost task X.X in stage X.X (TID XX, XX.XX.X.XXX, executor X): ExecutorLostFailure
    (executor X exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. XX.X GB
    of XX.X GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead
    
  • Resolution: Set a higher value for spark.yarn.executor.memoryOverhead based on the requirements of the job. The executor memory overhead value increases with the executor size (approximately by 6-10%). As a best practice, modify the executor memory value accordingly.

    To set a higher value for executor memory overhead, enter the following command in Spark Submit Command Line Options on the Workbench page: --conf spark.yarn.executor.memoryOverhead=XXXX

    Note

    For Spark 2.3 and later versions, use the new parameter spark.executor.memoryOverhead instead of spark.yarn.executor.memoryOverhead.

    If increasing the executor memory overhead value or executor memory value does not resolve the issue, you can either use a larger instance, or reduce the number of cores.

    To reduce the number of cores, enter the following in the Spark Submit Command Line Options on the Workbench page: --executor-cores=XX. Reducing the number of cores can waste memory, but the job will run. A combined example appears below.
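
    For example, a combined entry in Spark Submit Command Line Options might look like the following (the values are illustrative; on Spark 2.3 and later, use spark.executor.memoryOverhead instead):

      --conf spark.yarn.executor.memoryOverhead=2048 --executor-cores=3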

Spark job repeatedly fails

  • Description: When the cluster is fully scaled up but still cannot handle the size of the job, the Spark job may fail repeatedly.
  • Resolution: Run the Sparklens tool to analyze the job execution and optimize the configuration accordingly.
For more information about Sparklens, see the Sparklens blog.

Spark Shell Command failure

  • Description: When a Spark application is submitted through a shell command in QDS, it may fail with the following error.

    Qubole > Shell Command failed, exit code unknown
    2018-08-02 12:43:18,031 WARNING shellcli.py:265 - run - Application failed or got killed
    

    In this case, the actual reason that kills the application is hidden and you might not be able to find it in the logs directly.

  • Resolution:

    1. Navigate to the Workbench page.
    2. Click on the Resources tab to analyze the errors and perform the appropriate action.
    3. Run the job again using the Spark Submit Command Line Options on the Workbench page.

Error when the total size of results is greater than the Spark Driver Max Result Size value

  • Description: When the total size of results is greater than the Spark Driver Max Result Size value, the following error occurs.

    org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of x tasks (y MB) is bigger
    than spark.driver.maxResultSize (z MB)
    
  • Resolution: Increase the Spark Driver Max Result Size value by modifying the value of --conf spark.driver.maxResultSize in the Spark Submit Command Line Options on the Workbench page (see the example below).
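
    For example, to allow up to 4 GB of serialized results (the value is illustrative):

      --conf spark.driver.maxResultSize=4g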

Too Large Frame error

  • Description: When the size of a shuffle data block exceeds the 2 GB limit that Spark can handle, the following error occurs.

    org.apache.spark.shuffle.FetchFailedException: Too large frame: XXXXXXXXXX
    at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:513)
    at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:444)
    
    
    
    Caused by: java.lang.IllegalArgumentException: Too large frame: XXXXXXXXXX
    at org.spark_project.guava.base.Preconditions.checkArgument(Preconditions.java:119)
    at org.apache.spark.network.util.TransportFrameDecoder.decodeNext(TransportFrameDecoder.java:133)
    
  • Resolution: Perform one of the following steps to resolve this error:

    Solution 1:

    1. Run the job on Spark 2.2 or a later version, because Spark 2.2 and later handle this issue better than earlier versions. For information, see https://issues.apache.org/jira/browse/SPARK-19659.

    2. Use the following Spark configuration (see the combined example after Solution 2):
      1. Modify the value of spark.sql.shuffle.partitions from the default 200 to a value greater than 2001.
      2. Set the value of spark.default.parallelism to the same value as spark.sql.shuffle.partitions.

    Solution 2:

    1. Identify the DataFrame that is causing the issue.
      1. Add a Spark action (for instance, df.count()) after creating a new DataFrame.
      2. Print anything to check the DataFrame.
      3. If the print statement is not executed for a DataFrame, then the issue is with that DataFrame.
    2. After the DataFrame is identified, repartition the DataFrame by using df.repartition() and then cache it by using df.cache().
    3. If there is skewness in the data and you are using Spark version earlier than 2.2, then modify the code.
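
    For Solution 1, the corresponding entry in Spark Submit Command Line Options might look like the following (the value 2048 is illustrative; choose any value greater than 2001):

      --conf spark.sql.shuffle.partitions=2048 --conf spark.default.parallelism=2048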

Spark jobs fail because of compilation failures

When you run a Spark program or application from the Workbench page, the code is compiled and then submitted for execution.

If there are syntax errors, or the JARs or classes are missing, jobs may fail during compilation or at runtime.

  • Description: If there are any errors in the syntax, the job may fail even before the job is submitted.

    The following figure shows a syntax error in a Spark program written in Scala.

    _images/ScalaSyntaxError.png
  • Resolution: Check the code for any syntax errors and rectify the syntax. Rerun the program.

    The following figure shows a Spark job that ran successfully and displayed results.

    _images/ScalaPositiveResult.png
  • Description: class/JAR-not-found errors occur when you run a Spark program that uses functionality in a JAR that is not available in the Spark program’s classpath; the error occurs either during compilation, or, if the program is compiled locally and then submitted for execution, at runtime.

    The following figure shows an example of a class-not-found error.

    _images/SparkClassNotFoundError.png
  • Resolution: Add the dependent classes and jars and rerun the program. See Specifying Dependent Jars for Spark Jobs.

Troubleshooting Notebook Issues

When troubleshooting a failed paragraph in the notebook, you can analyze the Zeppelin logs and Spark logs to identify the errors and exceptions to understand the root cause of the failure.

Accessing the Logs

You can access the Zeppelin logs that reside on the cluster coordinator node at /media/ephemeral0/logs/zeppelin/logs. When analyzing the Zeppelin startup logs, you should also check the <zeppelin-root-ip>.out file.

You can access the interpreter logs and the logs for the Spark jobs that run on the Notebooks page.

Accessing the Interpreter Logs
  1. From the QDS UI, navigate to the Notebooks page.

  2. Click on Interpreters.

  3. From the Interpreters page, click Logs on the top right corner.

    The following figure shows the Interpreters page.

    _images/int-logs.png
  4. The logs files are displayed in a separate tab. Click on the files for the details.

    The following figure shows the log files that are displayed.

    _images/int-logs-details.png

For information about accessing the Spark Application UI, see Accessing the Spark Application UI from the Notebooks page.

Troubleshooting Errors and Exceptions in Notebook Paragraphs

This topic provides information about the errors and exceptions that you might encounter when running notebook paragraphs. You can resolve these errors and exceptions by following the respective workarounds.

Notebook fails to load
  • Description: The notebook fails to load in the UI and the following error might occur.

    502 Bad Gateway
    

    This error occurs mainly when the Zeppelin server is not running or the underlying daemon has been killed.

  • Resolution:

    1. Check the Zeppelin server logs in /media/ephemeral0/logs/zeppelin/logs/ (the <zeppelin_server.log> and <zeppelin_server_log.out> files).

    2. The logs might show that Zeppelin ran out of memory (OOM), as shown below.

      at org.eclipse.jetty.server.Server.doStart(Server.java:354)
      at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:68)
      at org.apache.zeppelin.server.ZeppelinServer.main(ZeppelinServer.java:204)
      Caused by: java.lang.OutOfMemoryError: Java heap space
      at java.util.Arrays.copyOf(Arrays.java:2367)
      at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:130)
      at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:114)
      

      The default heap space might not be sufficient for Zeppelin to load all the notebooks. The Zeppelin server's default heap space is configured to be 10% of the coordinator node memory. If the number of notebooks is large, either configure the coordinator node with more memory or delete notebooks that are no longer in use.

    3. Increase heap memory using node bootstrap. Contact Qubole Support.

Paragraph stops responding
  • Description: While using a notebook, paragraphs might stop responding due to various reasons.

  • Resolution:

    1. Click the Cancel button.

    2. If canceling the paragraph fails, then navigate to the Interpreters page and restart the corresponding interpreter.

    3. If the issue still persists, restart the Zeppelin server by running the following command as a root user:

      /usr/lib/zeppelin/bin/zeppelin-daemon.sh restart
      
  • Description: Paragraphs might stop responding when the Spark job is sluggish or when the Spark job fails.

  • Resolution:

    1. In the Notebooks page, navigate to Interpreters and click Logs.
    2. Open the corresponding Interpreter logs.
    3. Analyze the log files for the container, executor or task.
    4. Check connectivity to the thrift server.
Paragraph keeps running for a long time
  • Description: Due to insufficient resources, paragraphs might run for a long time.

  • Resolution: Tune the job by providing more resources, such as the minimum number of executors, executor memory, executor memory overhead, and maximum number of executors.

    1. Set appropriately high values for the minimum number of executors, executor memory, executor memory overhead, and maximum executors in the Interpreter settings (see the illustrative example below).
    2. Restart the interpreter.
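
    For example, the Interpreter settings might include properties along the following lines. The property names are standard Spark settings and the values are illustrative; the exact names exposed in your Interpreter settings may differ.

      spark.dynamicAllocation.minExecutors   2
      spark.dynamicAllocation.maxExecutors   10
      spark.executor.memory                  4g
      spark.yarn.executor.memoryOverhead     1024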
Error due to insufficient Spark driver memory
  • Description: For a Qubole notebook, if the configured Spark driver memory is not sufficient to run the job, the following error occurs.

    Interpreter JVM has stopped responding. Restart interpreter with higher driver memory controlled by setting spark.driver.memory.
    
  • Resolution:

    1. Set an appropriately high value for driver memory by configuring spark.driver.memory in the Interpreter settings (see the example below).
    2. Restart the interpreter.
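
    For example, you might add or update the following property in the Interpreter settings (the value is illustrative):

      spark.driver.memory   6g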
Paragraphs Failed

Paragraphs might fail for various reasons. You should identify if the paragraph failed due to the interpreter or the job.

  1. Analyze the interpreter logs to check if there is an issue with the interpreter.
  2. If there are no failures at the interpreter, then check the logs in the Spark UI. Analyze the executor container logs or failed executor logs. See Accessing the Spark Logs.
  3. If the failures are in the Spark job, see Troubleshooting Spark Issues.

If the issue still persists, then contact Qubole Support.

TTransportException
  • Description: While running paragraphs, a TTransportException, as shown below, might occur for various unexpected reasons. This exception signifies that there is an error in the communication between Zeppelin and the Spark driver/applications.

    org.apache.thrift.transport.TTransportException
    at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132)
    at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)
    at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:429)
    at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:318)
    at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:219)
    at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:69)
    at org.apache.zeppelin.interpreter.thrift.RemoteInterpreterService$Client.recv_interpret(RemoteInterpreterService.java:249)
    
  • Resolution: Depending on the error in the communication between Zeppelin and driver/spark-applications, perform the appropriate actions:

    • Metastore connectivity failure

      1. Check the Interpreter logs. In the Notebooks page, navigate to Interpreters and click Logs.

      2. If Zeppelin is not able to connect to metastore, then the logs might contain one of the following errors.

        Error 1:
        
        MetaStoreClient lost connection. Attempting to reconnect.
        org.apache.thrift.transport.TTransportException: java.net.SocketException: Connection reset
        
        OR
        
        Error 2:
        
        Got exception: org.apache.thrift.transport.TTransportException java.net.SocketTimeoutException: Read timed out
        org.apache.thrift.transport.TTransportException: java.net.SocketTimeoutException: Read timed out
        
        
        at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:129)
        at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)
        at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:429)
        at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:318)
        at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:219)
        at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:77)
        at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client
        
      3. Verify the metastore connectivity and rerun the job.

    • Interpreter not initiated

      The interpreter might not have been initiated because the driver memory is set too high or is insufficient. Set an appropriately high value for driver memory by configuring spark.driver.memory in the Interpreter settings. Rerun the job.

    • Restart the interpreter and rerun the job.

NullPointerException in a Spark Notebook
  • Description: In a Spark notebook, sometimes you cannot create a Spark session and the following NullPointerException occurs.

    java.lang.NullPointerException
    at org.apache.zeppelin.spark.Utils.invokeMethod(Utils.java:39)
    at org.apache.zeppelin.spark.Utils.invokeMethod(Utils.java:34)
    at org.apache.zeppelin.spark.SparkInterpreter.createSparkContext_2(SparkInterpreter.java:467)
    at org.apache.zeppelin.spark.SparkInterpreter.createSparkContext(SparkInterpreter.java:456)
    at org.apache.zeppelin.spark.SparkInterpreter.getSparkContext(SparkInterpreter.java:156)
    at org.apache.zeppelin.spark.SparkInterpreter.open(SparkInterpreter.java:938)
    at org.apache.zeppelin.interpreter.LazyOpenInterpreter.open(LazyOpenInterpreter.java:70)
    at org.apache.zeppelin.spark.PySparkInterpreter.getSparkInterpreter(PySparkInterpreter.java:531)
    at org.apache.zeppelin.spark.PySparkInterpreter.createGatewayServerAndStartScript(PySparkInterpreter.java:201)
    at org.apache.zeppelin.spark.PySparkInterpreter.open(PySparkInterpreter.java:170)
    at org.apache.zeppelin.interpreter.LazyOpenInterpreter.open(LazyOpenInterpreter.java:70)
    at org.apache.zeppelin.interpreter.LazyOpenInterpreter.interpret(LazyOpenInterpreter.java:95)
    at org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer$InterpretJob.jobRun(RemoteInterpreterServer.java:344)
    at org.apache.zeppelin.scheduler.Job.run(Job.java:185)
    at org.apache.zeppelin.scheduler.FIFOScheduler$1.run(FIFOScheduler.java:139)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:178)
    ...
    
  • Resolution:

    1. Check if there are any artifacts (dependencies in Interpreter Settings). Remove artifacts (if any) and restart the interpreter.
    2. If the problem persists even after removing artifacts, trace the error in the Interpreter logs as described here:
      1. In the Notebooks page, navigate to Interpreters and click Logs.
      2. Open the corresponding Interpreter logs.
      3. Trace the errors in the log file(s).

    If you are still unable to trace the error, create a ticket with Qubole Support.

Troubleshooting Jupyter Notebook Issues

When troubleshooting failures and issues in Jupyter notebooks, you can access the logs on the JupyterLab interface to identify the errors and exceptions, and understand the root cause of the failures.

Accessing the Logs for Jupyter Notebooks

You can access the Spark Driver logs and Kernel logs on the JupyterLab interface when troubleshooting any issues or failures. These logs are applicable to Spark-based kernels only.

You can perform one of the following steps to access the Logs:

  • In a Jupyter notebook, click the down arrow next to the kernel at the right corner. From the widget, you can click Spark UI, Driver Logs, or Kernel Log.

    The following image shows the UI options to access the logs.

    _images/jupy-logs.png

    The Spark UI, Driver Log, and Kernel Log open in a separate tab.

  • From the output table of the Job run, click on the corresponding links for Spark UI and Driver Log.

    The Spark UI and Driver Log open in a separate tab.

The following image shows a sample Spark UI page.

_images/jupy-spark-ui.png

The following image shows a sample Driver Log page.

_images/jupy-driver-logs.png

The following image shows a sample Kernel Log page.

_images/jupy-kernel-logs.png

Alternatively, you can navigate to Spark >> Resource Manager from the Menu bar. The Resource Manager page opens in a separate tab. From the Resource Manager page, browse through the application logs.

The following image shows a sample Resource Manager page.

_images/jupy-rm.png

Troubleshooting Errors and Exceptions in Jupyter Notebooks

This topic provides information about the errors and exceptions that you might encounter when running Jupyter notebooks. You can resolve these errors and exceptions by following the respective workarounds.

Execution stops at a cell in a scheduled Jupyter notebook
  • Description: If there is a warning in one of the cells when a scheduled Jupyter notebook runs, the execution stops at that cell.
  • Resolution: As a workaround, to skip the warning and continue execution, add raises-exception in that cell’s metadata field by performing the following steps:
    1. Select the cell that shows the warning.
    2. Click on the Tools icon on the left side bar.
    3. Click Advanced Tools.
    4. Add raises-exception in the Cell Metadata tags field (see the example below).
    5. Re-run the API.
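
    With the tag in place, the cell's metadata in the underlying notebook file looks roughly like the following (a minimal sketch of the standard Jupyter cell-metadata format):

      "metadata": {
        "tags": ["raises-exception"]
      }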
Import of Zeppelin Notebooks might fail
  • Description: When you import a Zeppelin notebook of size greater than 1 MB, the import operation might fail.
  • Resolution: As a workaround, perform the following steps:
    1. Clear the output of the Zeppelin notebook to reduce the size of the notebook.
    2. Import the Zeppelin notebook in JupyterLab interface.

Providing Feedback

The top-right corner of the QDS user interface (UI) contains a Feedback icon.

Clicking the Feedback icon displays the QDS UI feedback dialog as shown below.

_images/DialogueBox.png

You can click the Feedback button from any screen. Enter your feedback in the textbox provided and click Send Feedback.

You can also directly access the Community posts or submit a support ticket using the Ask the Community or Submit Support Ticket icons below the textbox.

Important

We would like to publicly and unequivocally acknowledge that a few words and phrases in the terminology used in our industry, and subsequently adopted by Qubole over the last decade, are insensitive, non-inclusive, and harmful. We are committed to inclusivity and to correcting these terms and the negative impressions they have facilitated. Qubole is actively replacing the following terms in our documentation and education materials:

  • Master becomes Coordinator
  • Slave becomes Worker
  • Whitelist becomes Allow List or Allowed
  • Blacklist becomes Deny List or Denied

These terms have been pervasive in this industry for too long, which is wrong, and we will move as fast as we can to make the necessary corrections. While we ask for your patience, we will not take it for granted. Please do not hesitate to reach out if you feel there are other areas for improvement, if you feel we are not doing enough or moving fast enough, or if you want to discuss anything further in this area.

Alex Aidun (alex@qubole.com)
Director, Education Services
Director, Technical Publications