Presto FAQs¶

How is Presto different from Hive?
How is Qubole’s Presto different from open-source Presto?
Where do I find Presto logs?
Why are new nodes not being used by my query during upscaling?
Where can I find the Presto Server Bootstrap logs?
How can I optimize the Presto query?

How is Presto different from Hive?¶

As a user, there are certain differences that you should be aware about Presto and Hive, even though they are able to execute SQL-like queries.

Presto:

Does not support User-defined functions (UDFs). However, Presto has a large number of built-in UDFs. Qubole provides additional UDFs, which can be added only before the cluster startup and runtime UDF additions such as Hive are not supported.
Does not support JOIN ordering. Ensure that a smaller table is to the right of the JOIN token.

How is Qubole’s Presto different from open-source Presto?¶

Note

Presto is supported on AWS, Azure, and GCP Cloud platforms; see QDS Components: Supported Versions and Cloud Platforms.

While Qubole’s Presto offering is heavily based on open-source Presto, there are a few differences. Qubole’s Presto:

Supports inserting data into Cloud Object Storage directories
Supports INSERT OVERWRITES
Supports autoscaling clusters
Supports GZIP compression
Supports data traffic encryption among the Presto cluster nodes
Supports additional connectors such as Kinesis and SerDes such as AVRO and Openx JSON

Where do I find Presto logs?¶

The coordinator cluster node’s logs are located at:
- DEFLOC/logs/presto/cluster_id/cluster_start_time/master/
- DEFLOC/logs/presto/cluster-id/cluster_start_time/master/queryinfo/
The worker cluster node’s logs are located at: DEFLOC/logs/presto/cluster_id/cluster_start_time/nodeIP/node_start_time/

Where:

DEFLOC refers to the default location of an account.
cluster_id is the cluster ID.
cluster_start_time is the time you start the cluster. You can fetch Presto logs using the above log location using the approximate start time of the cluster.

You can also get it by running a Presto command. When you run a Presto command, the log location is reported under the Logs tab.

For example, you’ll see the path as something like this:
```
Log location: gs://mydata.com/trackdata/logs/logs/presto/95907
Started Query: 20191110_092450_00096_bucas Query Tracker
Query: 20190810_092450_00096_bucas Progress: 0%
Query: 20190810_092450_00096_bucas Progress: 0%
```
95907 is the cluster instance ID; there are sub-directories for the coordinator and worker nodes.

Why are new nodes not being used by my query during upscaling?¶

New nodes are available only to certain operations (such as TableScans and Partial Aggregations) of queries already in progress when the nodes are added. For more information, see this explanation of how autoscaling works in a Presto cluster.

Where can I find the Presto Server Bootstrap logs?¶

A GCP user can see the Presto Server Bootstrap logs in /media/ephemeral0/presto/var/log/bootstrap.log.

How can I optimize the Presto query?¶

The following are some guidelines you can consider adopting to make the most of Presto on Qubole:

Check the health of the cluster before submitting a query and ensure that the Presto master is working and the cluster resources such as CPU, memory, and disk are not over utilized (> 95%).
Check the storage format of the data you are querying. Prepared data (columnar format, partitioned, statistics, and so on) provide better performance. Contact your Admin for further advice.
Ensure that PREDICATES and LIMIT statements are used when querying large datasets, doing large data scans, or joining multiple tables.
Consolidate small files into bigger files asynchronously to reduce network overheads.
Collect dataset statistics such as file size, rows, and histograms of values to optimize queries with JOIN reordering.
Enable runtime filtering to improve the performance of INNER JOIN queries.
Enable automatic selection of optimal JOIN distribution type and JOIN order based on table statistics.
Use broadcast/replicated JOIN (Map-side JOIN) when build side tables are small.
Use aggregations over DISTINCT values to speed up the query execution.