Inserting Data

Note

Presto is supported on AWS, Azure, and GCP Cloud platforms; see QDS Components: Supported Versions and Cloud Platforms.

You may want to write the results of a query into another Hive table or to a Cloud location. QDS Presto supports inserting data into (and overwriting) Hive tables and Cloud directories, and provides an INSERT command for this purpose. It is currently available only in QDS; Qubole is in the process of contributing it to open-source Presto.

Keep in mind that Hive is a better option for large-scale ETL workloads that write terabytes of data; Presto's insertion capabilities are better suited for tens of gigabytes.

The INSERT syntax is very similar to Hive’s INSERT syntax.

Let us use default_qubole_airline_origin_destination as the source table in the examples that follow; it contains flight itinerary information.

Cloud Directories

You can write the result of a query directly to Cloud storage in a delimited format; for example:

INSERT INTO directory '<scheme>qubole.com-siva/experiments/quarterly_breakdown'
SELECT origin,
       quarter,
       count(*) AS c
FROM default_qubole_airline_origin_destination
WHERE YEAR='2007'
GROUP BY quarter,
         origin;

<scheme> is the Cloud-specific URI scheme: gs:// for GCP.

Here is a preview of what the result file looks like when viewed with cat -v. Fields in the results are separated by ^A (ASCII code \x01):

"DFW"^A1^A334973
"LAX"^A1^A216789
"OXR"^A1^A456
"HNL"^A1^A78271
"IAD"^A1^A115924
"ALB"^A1^A20779
"ORD"^A1^A414078

Simple Hive Tables

The target Hive table can be delimited, CSV, ORC, or RCFile. Qubole does not support inserting into Hive tables that use custom input formats or SerDes. You can create a target table in delimited format using the following DDL in Hive:

CREATE TABLE quarter_origin (quarter string, origin string, count int)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS TEXTFILE;
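
The target table does not have to be delimited text. As a minimal sketch, an equivalent ORC target table could be created with DDL like the following (the name quarter_origin_orc is only illustrative and is not used in the examples below):

CREATE TABLE quarter_origin_orc (quarter string, origin string, count int)
STORED AS ORC;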

Run DESC quarter_origin to confirm that Presto recognizes the table. It can take up to two minutes for Presto to pick up a newly created table in Hive. Now run the following INSERT statement as a Presto query:

INSERT INTO TABLE quarter_origin
SELECT quarter,
       origin,
       count(*)
FROM default_qubole_airline_origin_destination
WHERE YEAR='2007'
GROUP BY quarter,
         origin;

You can now run queries against quarter_origin to confirm that the data is in the table:

SELECT *
FROM quarter_origin LIMIT 5;

Similarly, you can overwrite data in the target table using the following query:

INSERT OVERWRITE TABLE quarter_origin
SELECT quarter,
       origin,
       count(*)
FROM default_qubole_airline_origin_destination
WHERE YEAR='2007'
GROUP BY quarter,
         origin;

Partitioned Hive Tables

You can also partition the target Hive table; for example (run this in Hive):

CREATE TABLE quarter_origin_p (origin string, count int)
PARTITIONED BY (quarter string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS TEXTFILE;

Now you can insert data into this partitioned table in a similar way. The only catch is that the partitioning column must appear at the very end of the select list. In the example below, quarter is the partitioning column:

INSERT INTO TABLE quarter_origin_p
SELECT origin,
       count(*),
       quarter
FROM default_qubole_airline_origin_destination
WHERE YEAR='2007'
GROUP BY quarter,
         origin;

Note that the partitioning column can also be set to a constant value. You can use INSERT OVERWRITE instead of INSERT INTO to erase previous content in partitions.
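
For example, the following sketch combines both: it overwrites only the first-quarter partition, supplying the partition value as a constant at the end of the select list (the literal '1' assumes quarter values are stored as strings, as in the preceding examples):

INSERT OVERWRITE TABLE quarter_origin_p
SELECT origin,
       count(*),
       '1' AS quarter
FROM default_qubole_airline_origin_destination
WHERE YEAR='2007'
  AND quarter='1'
GROUP BY origin;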

Configuring the Concurrent Writer Tasks Per Query

Caution

Use this configuration judiciously to avoid overloading the cluster through excessive resource utilization. It is recommended to raise the value through session properties only for queries that generate large outputs, such as ETL jobs.

Presto provides a configuration property that defines the per-node count of Writer tasks for a query. You can set it at the cluster level and at the session level. By default, when data is inserted through INSERT or CREATE TABLE AS SELECT operations, one Writer task is created per worker node, which can slow down the query if a lot of data needs to be written. In such cases, you can use the task_writer_count session property, whose value must be a power of 2, to increase the number of Writer tasks per node and speed up the writes.

The cluster-level property that you can override on the cluster is task.writer-count. Its value must also be a power of 2.
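
For example, to speed up a single large insert you might raise the Writer task count for just that session before running the query. The following is a sketch; the value 4 is only illustrative and must be a power of 2:

-- Session level: affects only queries run in this session
SET SESSION task_writer_count = 4;

INSERT INTO TABLE quarter_origin
SELECT quarter,
       origin,
       count(*)
FROM default_qubole_airline_origin_destination
WHERE YEAR='2007'
GROUP BY quarter,
         origin;

At the cluster level, the equivalent override would be task.writer-count=4 in the cluster's Presto configuration.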