Qubole Operator API
This page describes the Qubole Operator API. For more information on the Qubole Operator, see Introduction to Airflow in Qubole,
Qubole Operator Examples, and Questions about Airflow.
class airflow.contrib.operators.QuboleOperator(qubole_conn_id='qubole_default', *args, **kwargs)
Execute tasks (commands) on QDS.
Parameters

Parameter | Description
--- | ---
qubole_conn_id | The connection ID, which consists of the QDS auth_token.
kwargs

Parameter | Description
--- | ---
command_type | The type of command to execute, for example a Hive, Shell, or Hadoop command.
tags | An array of tags that you can assign to the command.
cluster_label | The label of the cluster on which the command is executed.
name | A name that you can assign to the command. This is a template-supported field.
notify | Set this to receive an email when the command completes, whether it succeeds or fails.
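The base parameters above can be sketched as a keyword-argument set. This is a minimal illustration, not a complete DAG: the task ID, tags, and cluster label are hypothetical, and the commented-out instantiation assumes an Airflow installation that includes the Qubole contrib package.

```python
# Keyword arguments common to every QuboleOperator task; the task_id,
# tags, and cluster_label values below are hypothetical placeholders.
base_kwargs = dict(
    task_id="daily_report",           # hypothetical Airflow task id
    command_type="hivecmd",           # e.g. a Hive, Shell, or Hadoop command
    tags=["etl", "daily"],            # array of tags assigned to the command
    cluster_label="default",          # cluster on which the command runs
    name="daily_report_{{ ds }}",     # template-supported command name
    notify=True,                      # email on success or failure
    qubole_conn_id="qubole_default",  # connection holding the QDS auth_token
)
# With Airflow installed, the task would be created roughly as:
# from airflow.contrib.operators.qubole_operator import QuboleOperator
# task = QuboleOperator(dag=dag, **base_kwargs)
```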
Understanding Command-type-specific Parameters
Here are the different command-type-specific parameters:
Note
You can also use .txt files for template-driven use cases.
hivecmd Parameters
Parameter | Description
--- | ---
query | An inline query statement. This is a template-supported field. Either query or script_location is required.
script_location | For GCP, the GS location that contains the query statement. This is a template-supported field. Either query or script_location is required.
sample_size | The sample size, in bytes, on which to run the query.
macros | Macro values used in the query. This is a template-supported field.
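A minimal hivecmd sketch follows. The query, macro, and sample-size values are hypothetical, and `validate_hivecmd` is a made-up helper that simply mirrors the documented rule that either `query` or `script_location` must be supplied.

```python
# Hypothetical helper mirroring the documented requirement that a
# hivecmd needs either an inline query or a script_location.
def validate_hivecmd(kwargs):
    if not (kwargs.get("query") or kwargs.get("script_location")):
        raise ValueError("hivecmd requires either 'query' or 'script_location'")
    return kwargs

hive_kwargs = validate_hivecmd(dict(
    command_type="hivecmd",
    query="SELECT COUNT(*) FROM default.sales WHERE dt = '{{ ds }}'",
    macros='[{"day": "{{ ds }}"}]',  # macro values used in the query
    sample_size=1024 * 1024,         # run the query on a 1 MB sample
))
```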
hadoopcmd Parameters
Parameter | Description
--- | ---
sub_command | Must be jar, gsdistcp, or streaming, followed by one or more arguments. This is a template-supported field. (gsdistcp is valid for all platforms.)
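A hadoopcmd sketch, with a hypothetical JAR path and arguments; the only documented constraint shown is that sub_command starts with jar, gsdistcp, or streaming.

```python
# Hypothetical hadoopcmd parameter set: sub_command must begin with
# jar, gsdistcp, or streaming, followed by one or more arguments.
hadoop_kwargs = dict(
    command_type="hadoopcmd",
    sub_command="jar gs://example-bucket/jars/wordcount.jar "  # hypothetical path
                "gs://example-bucket/input gs://example-bucket/output",
)
print(hadoop_kwargs["sub_command"].split()[0])  # the required leading keyword
```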
shellcmd Parameters
Parameter | Description
--- | ---
script | An inline command with arguments. This is a template-supported field. Either script or script_location is required.
script_location | For GCP, the GS location that contains the script. This is a template-supported field. Either script or script_location is required.
files | A list of files in a GCP GS bucket, in the file1 and file2 format. These files are copied into the working directory where the Qubole command executes. This is a template-supported field.
archives | A list of archives in a GCP GS bucket, in the archive1 and archive2 format. These are unarchived into the working directory where the Qubole command executes. This is a template-supported field.
parameters | Any additional arguments to pass to the script (only when script_location is supplied). This is a template-supported field.
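A shellcmd sketch with hypothetical GS paths; the comma-separated form of the files and archives lists is an assumption, as is the inline script.

```python
# Hypothetical shellcmd parameter set. Either "script" or
# "script_location" is required; paths below are placeholders and the
# comma-separated list format is assumed.
shell_kwargs = dict(
    command_type="shellcmd",
    script="hadoop dfs -ls /user/example",                       # inline command
    files="gs://example-bucket/a.sh,gs://example-bucket/b.sh",   # copied to the working dir
    archives="gs://example-bucket/lib.tar.gz",                   # unarchived in the working dir
)
```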
sparkcmd Parameters
Parameter | Description
--- | ---
program | The complete Spark program in Scala, SQL, Command, R, or Python. This is a template-supported field. A Spark notebook can be run using the QuboleOperator; for more information, see Qubole Operator Examples.
cmdline | The spark-submit command line; specify the required information on this command line. This is a template-supported field.
sql | An inline SQL query statement. This is a template-supported field.
script_location | The local file path that contains the query statement. This is a template-supported field. One of script_location, program, cmdline, sql, or note_id must be specified.
language | The supported program languages are scala, sql, command_line, R, python, and notebook. Specify the language that you want to use.
app_id | The ID of a Spark job server app.
note_id | The ID of a notebook.
arguments | The spark-submit command line arguments.
user_program_arguments | The arguments that the user program accepts.
macros | Macro values used in the query. This is a template-supported field.
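A sparkcmd sketch running an inline SQL statement; the query and arguments are hypothetical. Per the table above, exactly one of program, cmdline, sql, script_location, or note_id should be supplied.

```python
# Hypothetical sparkcmd parameter set using the inline-SQL form.
spark_kwargs = dict(
    command_type="sparkcmd",
    sql="SELECT max(id) FROM users",  # inline SQL query (hypothetical)
    language="sql",                   # scala, sql, command_line, R, python, notebook
    arguments="--max-executors 10",   # spark-submit command line arguments
)
```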
dbtapquerycmd Parameters
Parameter | Description
--- | ---
db_tap_id | The data store ID of the target database in Qubole. This is a template-supported field. Its value is a string, not an integer.
query | An inline query statement. This is a template-supported field.
macros | Macro values used in the query. This is a template-supported field.
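A dbtapquerycmd sketch; the data store ID and query are hypothetical. Note that db_tap_id is passed as a string even though it looks numeric.

```python
# Hypothetical dbtapquerycmd parameter set; db_tap_id must be a
# string, not an integer.
dbtap_kwargs = dict(
    command_type="dbtapquerycmd",
    db_tap_id="2064",                       # data store ID as a string
    query="SELECT * FROM orders LIMIT 10",  # inline query (hypothetical)
)
```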
dbexportcmd Mode 1/Simple Mode Parameters
Parameter | Description
--- | ---
mode | The value must be 1 for simple mode, which pushes data from QDS to a relational database.
hive_table | The name of the Hive table. This is a template-supported field.
partition_spec | The partition specification for the Hive table. This is a template-supported field.
dbtap_id | The data store ID of the target database in Qubole. This is a template-supported field. Its value is a string, not an integer.
db_table | The name of the database table. This is a template-supported field.
db_update_mode | One of two update modes, allowinsert or updateonly.
db_update_keys | The columns used to determine the uniqueness of rows; valid only with db_update_mode. This is a template-supported field.
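A simple-mode dbexportcmd sketch pushing a Hive partition to a database table; table names, partition spec, and key column are all hypothetical.

```python
# Hypothetical simple-mode (mode 1) export: push a Hive table
# partition from QDS to a relational database.
export_kwargs = dict(
    command_type="dbexportcmd",
    mode=1,                       # 1 = simple mode
    hive_table="default.sales",   # hypothetical Hive table
    partition_spec="dt='{{ ds }}'",
    dbtap_id="2064",              # data store ID as a string
    db_table="sales_export",
    db_update_mode="updateonly",  # or "allowinsert"
    db_update_keys="order_id",    # row-uniqueness column(s)
)
```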
dbexportcmd Mode 2/Advanced Mode Parameters
Parameter | Description
--- | ---
mode | The value must be 2 for advanced mode, which exports data from an HDFS directory or a storage location.
dbtap_id | The data store ID of the target database in Qubole. This is a template-supported field. Its value is a string, not an integer.
db_table | The name of the database table. This is a template-supported field.
db_update_mode | One of two update modes, allowinsert or updateonly.
db_update_keys | The columns used to determine the uniqueness of rows; valid only with db_update_mode. This is a template-supported field.
export_dir | The HDFS/Cloud location from which data is exported. This is a template-supported field.
fields_terminated_by | The hex value of the character used as the column separator in the dataset.
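An advanced-mode dbexportcmd sketch; the export directory is hypothetical. Since fields_terminated_by takes the hex value of the separator character, the snippet computes the hex value of a comma (0x2c) for illustration.

```python
# The column separator is given as a hex value; for a comma that is
# 0x2c, computed here for illustration.
comma_hex = hex(ord(","))  # '0x2c'

# Hypothetical advanced-mode (mode 2) export from a Cloud directory.
export_kwargs = dict(
    command_type="dbexportcmd",
    mode=2,                                          # 2 = advanced mode
    export_dir="gs://example-bucket/exports/sales",  # hypothetical path
    dbtap_id="2064",                                 # data store ID as a string
    db_table="sales_export",
    fields_terminated_by=comma_hex,
)
```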
dbimportcmd Mode 1/Simple Mode Parameters
Parameter | Description
--- | ---
mode | The value must be 1 for simple mode, which pulls data from a relational database into a Hive table in QDS.
hive_table | The name of the Hive table. This is a template-supported field.
dbtap_id | The data store ID of the target database in Qubole. This is a template-supported field. Its value is a string, not an integer.
db_table | The name of the database table. This is a template-supported field.
where_clause | The WHERE clause, if any. This is a template-supported field.
parallelism | The number of parallel database connections used for extracting the data.
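A simple-mode dbimportcmd sketch; table names, the WHERE clause, and the parallelism value are hypothetical.

```python
# Hypothetical simple-mode (mode 1) import: pull rows from a
# relational database into a Hive table over parallel connections.
import_kwargs = dict(
    command_type="dbimportcmd",
    mode=1,                          # 1 = simple mode
    hive_table="default.sales_raw",  # hypothetical target Hive table
    dbtap_id="2064",                 # data store ID as a string
    db_table="sales",
    where_clause="dt = '{{ ds }}'",  # optional WHERE clause
    parallelism=4,                   # parallel database connections
)
```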
dbimportcmd Mode 2/Advanced Mode Parameters
Parameter | Description
--- | ---
mode | The value must be 2 for advanced mode, which lets you specify a custom query to transform the data before pulling it.
hive_table | The name of the Hive table. This is a template-supported field.
dbtap_id | The data store ID of the target database in Qubole. This is a template-supported field. Its value is a string, not an integer.
db_table | The name of the database table. This is a template-supported field.
parallelism | The number of parallel database connections used for extracting the data.
extract_query | The SQL query to extract data from the database. $CONDITIONS must be part of the WHERE clause. This is a template-supported field.
boundary_query | The query used to get the range of row IDs to extract. This is a template-supported field.
split_column | The column used as the row ID to split data into ranges. This is a template-supported field.
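An advanced-mode dbimportcmd sketch with a custom extract query; the queries and column names are hypothetical. The one documented hard requirement, that $CONDITIONS appears in the WHERE clause, is checked explicitly.

```python
# Hypothetical advanced-mode (mode 2) import. $CONDITIONS must be a
# literal token in the WHERE clause; the import substitutes it when
# splitting the extraction across parallel connections.
extract_query = (
    "SELECT id, amount FROM sales "
    "WHERE dt = '{{ ds }}' AND $CONDITIONS"
)
assert "$CONDITIONS" in extract_query  # documented requirement

import_kwargs = dict(
    command_type="dbimportcmd",
    mode=2,
    hive_table="default.sales_raw",
    dbtap_id="2064",
    db_table="sales",
    extract_query=extract_query,
    boundary_query="SELECT min(id), max(id) FROM sales",  # row-ID range
    split_column="id",
    parallelism=4,
)
```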
get_results
This method returns the standard output of the command represented by the Qubole Operator.
Parameter | Description
--- | ---
delim | The delimiter (for example, a comma or a space) used to separate each row's data. The delimiter replaces Ctrl+A in the results data.
fp | Use this to write the command results directly into a file. If you do not specify fp, Airflow creates one and returns it.
inline | Decides whether to display the command results inline as a CRLF-separated string.
fetch | Decides whether to download large results directly from the Cloud; it is set to true by default and takes effect only when inline is set to true. If inline is true and fetch is false, only the Cloud path is displayed.
ti | The TaskInstance object.
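The delim behavior can be illustrated in plain Python: raw Qubole results separate columns with the Ctrl+A control character (\x01), and the delimiter you pass replaces it. The result row below is hypothetical.

```python
# Raw results use Ctrl+A (\x01) as the column separator; passing
# delim to get_results replaces it, as this stand-in shows.
raw_row = "2023-01-01\x01order-42\x01199.99"  # hypothetical result row
delim = ","
print(raw_row.replace("\x01", delim))
# → 2023-01-01,order-42,199.99
```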
get_log
This method returns the standard logs (in raw text format) of the command represented by the Qubole Operator.
Parameter | Description
--- | ---
ti | The TaskInstance object.
get_jobs
This method returns the jobs of the command represented by the Qubole Operator. It calls the Jobs API and retrieves the details of the Hadoop jobs spawned on the cluster by the command (command_id). This information is available only for commands that have completed.
Parameter | Description
--- | ---
ti | The TaskInstance object.
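One way these retrieval methods might be used is from a downstream task. This is a hedged sketch: the task name "hadoop_task", the callable, and the surrounding DAG are hypothetical, and it assumes an Airflow installation with the Qubole contrib package.

```python
# Hypothetical downstream callable that retrieves job details from an
# upstream QuboleOperator task with the (assumed) task_id "hadoop_task".
def pull_job_details(ti, **context):
    qubole_task = ti.task.dag.get_task("hadoop_task")
    # get_jobs calls the Jobs API for the command (command_id) that the
    # operator ran; details are available only after the command completes.
    return qubole_task.get_jobs(ti)
```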