Submit a Hadoop CloudDistCp Command

POST /api/v1.2/commands/
Hadoop DistCp is the tool used for copying large amounts of data across clusters. Ensure that the output directory is new and does not exist before running a Hadoop job.
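For example, before submitting the job you can verify that the destination does not already exist. The following is a minimal sketch using the standard hadoop fs -test command; the destination path is illustrative:

# Exits with 0 if the path exists; submit the job only when this check fails.
if hadoop fs -test -e /datasets/HeritageHealthPrize; then
    echo "Destination already exists; choose a new output directory." >&2
    exit 1
fi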
Required Role
The following users can make this API call:
- Users who belong to the system-admin or system-user group.
- Users who belong to a group associated with a role that allows submitting a command. See Managing Groups and Managing Roles for more information.
Parameters
Note
Parameters marked in bold below are mandatory. Others are optional and have default values.
Parameter | Description |
---|---|
**command_type** | HadoopCommand |
**sub_command** | clouddistcp |
**sub_command_args** | [hadoop-generic-options] [clouddistcp-arg1] [clouddistcp-arg2] ... |
**src** | Location of the data to copy, on HDFS or in Google Cloud Storage. Important: CloudDistCp does not support bucket names that contain the underscore (_) character. |
**dest** | Destination path for the copied data, on HDFS or in Google Cloud Storage. Important: CloudDistCp does not support bucket names that contain the underscore (_) character. |
label | Specify the cluster label on which this command is to be run. |
name | Add a name to the command that is useful while filtering commands from the command history. It does not accept the special characters & (ampersand), < (less than), > (greater than), " (double quote), and ' (single quote), or HTML tags. It can contain a maximum of 255 characters. |
tags | Add a tag to a command so that it is easily identifiable and searchable from the commands list in the Commands History. Add a tag as a filter value while searching commands. A tag can contain a maximum of 255 characters, and a comma-separated list of tags can be associated with a single command. While adding a tag value, enclose it in square brackets. For example, {"tags":["<tag-value>"]}. |
macros | Denotes macros that are valid assignment statements containing variables and their expressions, in the form macros: [{"<variable>":<variable-expression>}, {..}]. You can add more than one variable. For more information, see Macros. |
srcPattern | A regular expression that filters the copy operation to a subset of the data at src. If the regular expression contains special characters such as an asterisk (*), enclose either the regular expression or the entire sub_command_args string in single quotes. |
groupBy | A regular expression that causes CloudDistCp to concatenate files that match the expression. For example, you could use this option to combine log files written in one hour into a single file. The concatenated filename is the value matched by the regular expression for the grouping. Parentheses indicate how files should be grouped, with all of the items that match the parenthetical statement being combined into a single output file. If the regular expression does not include a parenthetical statement, the cluster fails on the CloudDistCp step and returns an error. If the regular expression contains special characters such as an asterisk (*), enclose either the regular expression or the entire sub_command_args string in single quotes. When groupBy is specified, only files that match the pattern are copied. |
targetSize | The size, in mebibytes (MiB), of the files to create, based on the groupBy option. If the files concatenated by groupBy are larger than targetSize, they are broken into part files that are named sequentially with a numeric value appended. |
outputCodec | It specifies the compression codec to use for the copied files. It can take the values gzip, gz, lzo, snappy, or none. You can use this option, for example, to convert input files compressed with Gzip into output files with LZO compression, or to uncompress the files as part of the copy operation. If you choose an output codec, the filename is appended with the appropriate extension (for example, for gz and gzip, the extension is .gz). If you do not specify a value for outputCodec, the files are copied over with no change in compression. |
CloudServerSideEncryption | It ensures that the target data is transferred using SSL and automatically encrypted in Google Cloud Storage using a server-side key. When retrieving data using CloudDistCp, the objects are automatically decrypted. If you try to copy an unencrypted object to an encryption-required storage bucket, the operation fails. |
deleteOnSuccess | If the copy operation is successful, this option makes CloudDistCp delete copied files from the source location. It is useful if you are copying output files, such as log files, from one location to another as a scheduled task, and you do not want to copy the same files twice. |
disableMultipartUpload | It disables the use of multipart upload. |
encryptionKey | If SSE-KMS or SSE-C is specified as the algorithm, this parameter specifies the key with which the data is encrypted. If the algorithm is SSE-KMS, the key is not mandatory because the default KMS key is used. If the algorithm is SSE-C, you must specify the key or the job fails. |
filesPerMapper | The number of files placed in each map task. |
multipartUploadChunkSize | The multipart upload part size, in MiB. By default, CloudDistCp uses multipart upload when writing to cloud storage, with a chunk size of 16 MiB. |
numberFiles | It prepends output files with sequential numbers. The count starts at 0 unless a different value is specified by startingIndex. |
startingIndex | It is used with numberFiles to specify the first number in the sequence. |
outputManifest | It creates a text file, compressed with Gzip, that contains a list of all files copied by CloudDistCp. |
previousManifest | It reads a manifest file that was created during a previous call to CloudDistCp using outputManifest. When previousManifest is set, CloudDistCp excludes the files listed in the manifest from the copy operation. If outputManifest is specified along with previousManifest, files listed in the previous manifest also appear in the new manifest file, even though the files are not copied. |
copyFromManifest | It reverses the previousManifest behavior, causing CloudDistCp to use the specified manifest file as a list of files to copy instead of a list of files to exclude from copying. |
CloudEndpoint | It specifies the endpoint to use when uploading a file. This option sets the endpoint for both the source and destination. |
CloudSSEAlgorithm | It is the algorithm used for encryption. If you do not specify it but CloudServerSideEncryption is enabled, the AES256 algorithm is used by default. Valid values are AES256, SSE-KMS, and SSE-C. |
srcCloudEndpoint | It is the cloud storage endpoint to use for the source path. |
timeout | It is the command execution timeout, in seconds. Its default value is 129600 seconds (36 hours). QDS checks a command for timeout every 60 seconds, so a command with a timeout of 80 seconds is killed at the next check, that is, after 120 seconds. Set this parameter to prevent a command from running for the full 36 hours. |
tmpDir | It is the location (path) where files are stored temporarily when they are copied from the cloud object storage to the cluster. The default value is hdfs:///tmp. |
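The options above compose into a single sub_command_args string. The following is a hedged sketch of a request that combines srcPattern, groupBy, and outputCodec; the gs:// bucket, paths, and regular expressions are illustrative placeholders, not values from this page's dataset:

# Illustrative only: the bucket name, paths, and patterns are placeholders.
# The groupBy expression includes a parenthetical group, as required, and the
# backslashes are doubled so they survive JSON parsing.
curl -i -X POST -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" \
-d '{"command_type": "HadoopCommand", "sub_command": "clouddistcp", "sub_command_args": "--src gs://example-bucket/logs/ --dest /datasets/logs --srcPattern .*\\.log --groupBy .*/(\\w+)-[0-9]+\\.log --outputCodec gz --targetSize 128", "tags": ["clouddistcp-example"]}' \
"https://gcp.qubole.com/api/v1.2/commands"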
Example
curl -i -X POST -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" \
-d '{"sub_command": "clouddistcp", "sub_command_args": "--src cloud://paid-qubole/kaggle_data/HeritageHealthPrize/ --dest /datasets/HeritageHealthPrize", "command_type": "HadoopCommand"}' \
"https://gcp.qubole.com/api/v1.2/commands"
Sample Response
{
"id": 18167,
"meta_data": {
"logs_resource": "commands/18167/logs",
"results_resource": "commands/18167/results"
},
"command": {
"sub_command": "clouddistcp",
"sub_command_args": "--src cloud://paid-qubole/kaggle_data/HeritageHealthPrize/ --dest /datasets/HeritageHealthPrize"
},
"command_type": "HadoopCommand",
"created_at": "2013-03-14T09:34:15Z",
"path": "/tmp/2013-03-14/53/18167",
"progress": 0,
"qbol_session_id": 3525,
"qlog": null,
"resolved_macros": null,
"status": "waiting"
}
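The id and the meta_data resource paths in the response can be used to track the command. As a hedged sketch, assuming the standard v1.2 command status and logs endpoints:

# Poll the command status by id (18167 comes from the sample response above).
curl -X GET -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Accept: application/json" \
"https://gcp.qubole.com/api/v1.2/commands/18167"

# Fetch the logs via the logs_resource path returned in meta_data.
curl -X GET -H "X-AUTH-TOKEN: $AUTH_TOKEN" \
"https://gcp.qubole.com/api/v1.2/commands/18167/logs"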