Understanding the Hive Data Model¶
Data in QDS Hive is organized as tables and table partitions. These tables and partitions can either be created from data that you already have in Cloud storage, or can be generated as an output of running Hive queries. QDS uses HiveQL to query this data. For a primer on Hive, see the Apache Hive wiki.
The following topics are covered in this section:
Types of Hive Tables¶
Tables in QDS Hive are backed by data residing either in Cloud storage or in HDFS (Hadoop Distributed File System). The table types available in QDS Hive are:
- External Tables: These tables are assigned an explicit location by the user. When an external table is dropped, Hive does not delete the data in the location that it points to.
- Regular Tables: These tables do not have an explicit location attribute and are assigned by one by Hive directly. Hive assigns a location relative to a default location that is fixed for an account. When a regular table is dropped, the data in the table is deleted.
- Temporary Tables: QDS Hive allows a third form of tables that is deleted automatically once the user’s session is deleted.
- Mongo backed Tables : You can create a Hive table whose underlying data resides in a Mongo database collection. When the table is queried, Qubole dynamically fetches the data from the Mongo database. See Mongo Backed Tables for more information.
Given pre-existing data in a bucket in Cloud storage, we can create an external table over the data to begin analyzing it, as in the example(s) that follow.
CREATE EXTERNAL TABLE miniwikistats (projcode string, pagename string, pageviews int, bytes int) PARTITIONED BY(dt string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ' LINES TERMINATED BY '\n' LOCATION 'gs://paid-qubole/default-datasets/miniwikistats/';
This table is used as a reference in subsequent examples. It assumes the following:
- The data is entirely in text and contains four fields that are separated using the space character; the rows are separated by the newline character.
- The data lives under
<scheme>is the Cloud-specific URI and path.
- The dataset is partitioned (by hour for example) so that each hour’s statistics are in a
separate subfolder. For example, the statistics for first hour of 20110101 are in
- The CREATE statement above creates a partitioned table, but it does not populate any partitions in it, so the table is empty (even though this Cloud location has data).
You might create derived data sets while analyzing the original data sets and might want to keep them for a period. In this case, you can create a regular table in Cloud storage, for instance:
CREATE TABLE q1_miniwikistats AS SELECT projcode, sum(pageviews) FROM miniwikistats WHERE dt >= '20110101' AND dt <= '20110105' GROUP BY projcode;
You may want to force a table to reside in HDFS. Such tables provide faster throughput, but bear in mind that they are automatically deleted at the end of the session.
You can use either TMP or TEMPORARY when creating temporary tables in QDS. See How can I create a table in HDFS? for a discussion of the differences.
CREATE TMP TABLE tmp_stats AS SELECT * FROM miniwikistats WHERE projcode LIKE 'ab%' AND dt > '20110101';
You can look up the location of all tables created in this manner by using
DESCRIBE FORMATTED tmp_stats;
You do not need to configure storage for temporary tables (always in HDFS) or external tables (which you can see by explicitly listing the entire location).
Regular tables are handled as follows:
As part of setting up an account, you can set a default location in your Cloud storage, with the credentials needed to access this location (read/write/list). This location is used to store logs and results, and by default, QDS creates regular tables in a subdirectory of this same location, so you don’t need to supply the credentials again to get access to the tables. If you choose to create external tables in a location that is not accessible via the account’s storage credentials, you can specify the credentials as part of the location URL.
To help you get started, QDS creates some read-only tables for each account. These tables are backed by a public requester-pays bucket on cloud object storage. The tables are as follows:
- default_qubole_memetracker: 96 million memes collected between 2008 and 2009