Spark Integration with BigQuery

Google’s BigQuery is a serverless data warehouse for storing and querying massive datasets. Spark on Qubole is integrated with BigQuery, enabling direct reads of data from BigQuery storage into Spark DataFrames. Data engineers can explore BigQuery datasets, or join data in Google Cloud Storage with data in BigQuery, to perform complex transformations and queries. For more information about BigQuery, see the Google BigQuery documentation.

Data scientists can look up BigQuery datasets and build machine learning models using Qubole’s Spark and Notebooks. The data is read in Apache Avro format over parallel streams, with dynamic sharding of data across the streams to support low-latency reads. The Spark connector for BigQuery eliminates the need to export data from BigQuery to Google Cloud Storage before processing it, which shortens data processing times.
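The read-and-join workflow described above can be sketched in PySpark as follows. This is a minimal illustration, not Qubole's reference code: it assumes the connector is registered under the `bigquery` data source name and accepts a `table` option, as the open-source Spark BigQuery connector does. The helper names (`bigquery_table_id`, `read_bigquery_table`, `join_with_gcs`) and the Parquet layout of the Cloud Storage data are hypothetical.

```python
def bigquery_table_id(project, dataset, table):
    """Build the fully qualified `project.dataset.table` identifier."""
    return f"{project}.{dataset}.{table}"


def read_bigquery_table(spark, table_id):
    """Load a BigQuery table directly into a Spark DataFrame.

    Assumption: the connector is exposed as the "bigquery" format and
    takes the table name through the "table" option.
    """
    return spark.read.format("bigquery").option("table", table_id).load()


def join_with_gcs(spark, bq_df, gcs_path, key):
    """Join a BigQuery-backed DataFrame with data stored in Cloud Storage.

    Assumption: the Cloud Storage data is Parquet and shares the join
    column `key` with the BigQuery table.
    """
    gcs_df = spark.read.parquet(gcs_path)
    return bq_df.join(gcs_df, on=key, how="inner")
```

In a notebook where a `spark` session is already available, the first two helpers might be combined as `read_bigquery_table(spark, bigquery_table_id("bigquery-public-data", "samples", "shakespeare"))`, which points at one of Google's public sample tables. Any `gs://` path passed to `join_with_gcs` would be a bucket of your own.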

Viewing BigQuery Datasets in the Qubole UI

Qubole displays BigQuery datasets directly in the Workbench and Notebooks interfaces, so data scientists and data engineers can discover BigQuery datasets and tables from within Qubole Data Service (QDS).

Qubole Workbench UI with Data from BigQuery

[Image: BigQuery-Workbench.jpg]

Qubole Notebooks UI with Data from BigQuery

[Image: BigQuery-Notebook.png]