Hive Datasets as Schedule Dependency¶
This section describes how schedules can be set up to run only if data is available in Apache Hive tables. Typically, schedules that run Hive commands depend on data in Hive tables.
CREATE EXTERNAL TABLE daily_tick_data (
    date2 STRING,
    open FLOAT,
    close FLOAT,
    high FLOAT,
    low FLOAT,
    volume INT,
    average FLOAT)
PARTITIONED BY (
    stock_exchange STRING,
    stock_symbol STRING,
    year STRING,
    date1 STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '<scheme>/stock_ticker';
date1 is a date with the format YYYY-MM-DD.
<scheme> is the Cloud-dependent URI and path: for example, gs://gs-bucket/default-datasets.
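With the default Hive layout of key=value partition directories (an assumption; custom partition locations are also possible), the data for one partition of this table might live under a path such as:

gs://gs-bucket/default-datasets/stock_ticker/stock_exchange=nasdaq/stock_symbol=ibm/year=2013/date1=2013-01-01/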
The dataset is available from 2012-07-01. For this example, let us assume that the dataset is updated every day at 1 AM UTC and jobs are scheduled every day at 2 AM UTC.
The following query has to be executed every day:
SELECT stock_symbol, max(high), min(low), sum(volume) FROM daily_tick_data WHERE date1='$yesterday$' GROUP BY stock_symbol
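$yesterday$ is a macro that the scheduler expands at run time. For example, for the instance that runs on 2013-01-02, the query would execute as:

SELECT stock_symbol, max(high), min(low), sum(volume) FROM daily_tick_data WHERE date1='2013-01-01' GROUP BY stock_symbol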
The following sub-topics provide more information:
- Partition Column Values
- Dataset Interval
- Initial Instance
- Understanding GCP GS Files Dependency
- Understanding Hive Partition Dependency
Partition Column Values¶
Qubole has to be informed about the new partitions that are added every day.
In the example, the following partitions are added on 2013-01-02:
stock_exchange = nasdaq | stock_symbol = ibm | year = 2013 | date1 = 2013-01-01 |
stock_exchange = nasdaq | stock_symbol = orcl | year = 2013 | date1 = 2013-01-01 |
stock_exchange = nyse | stock_symbol = ibm | year = 2013 | date1 = 2013-01-01 |
stock_exchange = nyse | stock_symbol = orcl | year = 2013 | date1 = 2013-01-01 |
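New partitions such as these are typically registered with standard HiveQL; a minimal sketch for the partitions above, assuming the data already sits under the table's LOCATION:

ALTER TABLE daily_tick_data ADD IF NOT EXISTS
    PARTITION (stock_exchange = 'nasdaq', stock_symbol = 'ibm', year = '2013', date1 = '2013-01-01')
    PARTITION (stock_exchange = 'nasdaq', stock_symbol = 'orcl', year = '2013', date1 = '2013-01-01')
    PARTITION (stock_exchange = 'nyse', stock_symbol = 'ibm', year = '2013', date1 = '2013-01-01')
    PARTITION (stock_exchange = 'nyse', stock_symbol = 'orcl', year = '2013', date1 = '2013-01-01');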
For example, the partition columns can have the following values:
stock_exchange | [nasdaq, nyse] |
stock_symbol | [ibm, orcl] |
year | %Y |
date1 | %Y-%m-%d |
The above information has to be entered while submitting a job either through the UI or API.
The format of the partition columns, year and date1, does not change from one job to another. These formats are stored in the Hive metastore and do not need to be specified every time.
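As a sketch, storing the formats as table properties could look like the following; the property keys here are hypothetical, and the actual mechanism is described under Store Table Properties:

ALTER TABLE daily_tick_data SET TBLPROPERTIES (
    'year.format' = '%Y',          -- hypothetical property key
    'date1.format' = '%Y-%m-%d');  -- hypothetical property key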
The format for date partition columns can be entered through the QDS UI or the API. For more information, see Store Table Properties.
See Configuring Hive Tables Data Dependency for more information on setting Hive table data dependency using the QDS UI.
Dataset Interval¶
In this example, the job runs every day and the dataset is generated every day. It is possible that the job runs at a frequency different from the interval at which the dataset is generated. For example, the following query is run once every seven days, while the dataset is generated once a day.
SELECT stock_symbol, max(high), min(low), sum(volume) FROM daily_tick_data WHERE date1>'$sevendaysago$' AND date1 < '$today$' GROUP BY stock_symbol
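As with $yesterday$ above, $sevendaysago$ and $today$ are macros expanded at run time. For a run on 2013-01-08, for example, the query would execute as:

SELECT stock_symbol, max(high), min(low), sum(volume) FROM daily_tick_data WHERE date1>'2013-01-01' AND date1 < '2013-01-08' GROUP BY stock_symbol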
Qubole needs additional information to schedule this job, as follows:
interval | How often the data is generated. |
window_start, window_end | Defines the range of intervals to wait for. Each is an integer expressed as a multiple of the interval. |
For the purposes of this example, the values for interval, window_start, and window_end are:
interval | 1 day |
window_start | -6 (inclusive of seven days ago) |
window_end | 0 (inclusive of today) |
As with the date formats of the partition columns, the interval at which the dataset is generated does not change often. interval can also be stored in the Hive metastore and need not be specified every time.
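Continuing the sketch above (again with a hypothetical property key):

ALTER TABLE daily_tick_data SET TBLPROPERTIES ('interval' = '1 day');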
For more information, see Configuring GS Files Data Dependency.
Initial Instance¶
Initial instance specifies the first instance of the data that is available. This is useful when a new dataset is introduced: some jobs at the beginning may not have all instances of data available and so should not be generated.
Let us now understand the dependency on data in GCP GS files and Hive partitions that the Qubole Scheduler requires for scheduling jobs. Dependencies are the prerequisites that must be met before a job can run.
Understanding GCP GS Files Dependency¶
A GS files dependency implies that a job runs only if the data has arrived in GS buckets. You can schedule a job to run at a specific date and time, either once or on a repetitive basis, if the data exists. You can define repeat intervals such as last 6 hours, last 6 days, last 3 weeks, and last 7 months. For more information, see Configuring GS Files Data Dependency.
To schedule jobs at periodic intervals, the Qubole Scheduler requires the following information:
- Start day or time (parameter: window_start)
- End day or time (parameter: window_end)
- Day or time interval that denotes when and how often data is generated (parameter: interval)
- Nominal time, which is the logical start time of an instance
The dependency must be defined as gs://<bucket>/<folderinGSbucket>/<abc>-%Y-%m-%d-%H-%M-%S, for example: gs://abc.com/data/schedule-2014-12-31-00-00-00.
See Time class for more information on date and time placeholders.
The following table shows how the dependency on the previous day’s data in GS files is resolved with a daily interval.
Sequence ID | Nominal Time | Created At | Dependency |
1 | 2015-01-01 00:00:00 | 2015-04-22 10:00:00 | gs://abc.com/data/schedule-2014-12-31-00… |
2 | 2015-01-02 00:00:00 | 2015-04-22 10:15:00 | gs://abc.com/data/schedule-2015-01-01-00… |
3 | 2015-01-03 00:00:00 | 2015-04-22 10:30:00 | gs://abc.com/data/schedule-2015-01-02-00… |
The window_start and window_end parameters are relative to Nominal Time.
Nominal Time is the time for which the Schedule Action was processed and Created At is the time at which the Scheduler picked up the schedule. For more information, see Understanding the Qubole Scheduler Concepts.
Interpreting window_start Parameter Values¶
For a daily interval, the value 0 implies now, -1 implies 1 day ago, and -2 implies 2 days ago.
Similarly, for an hourly/daily/weekly/monthly/yearly interval (frequency), the value 0 denotes now, -1 denotes 1 hour/day/week/month/year ago, -2 denotes 2 hours/days/weeks/months/years ago, and so on. For example, a daily schedule with window_start = -6 and window_end = 0 waits for seven daily instances of data: today’s and the six before it.
Interpreting window_end Parameter Values¶
The Qubole Scheduler supports waiting for data. For example, waiting for 6 weeks of data implies that window_start is -5 and window_end is 0.
Note
When the data arrival interval and the scheduler interval are different, the scheduler follows its own frequency to process the data. For example, if the data arrives hourly and the scheduler runs daily, the scheduler waits for an entire day’s data.
Data and the scheduler can be in two different timezones. For example:
{
window_start => -48
window_end => -24
frequency => hourly
time_zone => America/Los_Angeles
}
scheduler_frequency => daily
time_zone => America/New_York
Here, the hourly data dependency is evaluated in Los Angeles time and covers the hours from 48 to 24 hours before the nominal time, while the schedule itself runs daily in New York time.
Understanding Hive Partition Dependency¶
The Qubole Scheduler allows data units to have Hive partitions. Data in Hive tables can be categorized by Hive partitions such as country or date. The Hive query example on this page contains Hive partitions. The scheduler recognizes the Hive partitions from the corresponding table properties in the Hive metastore.
See Partitions for more information.
The timezone can be specified as an optional parameter in a Hive query, with daylight saving set to on or off.
Hive tables can be partitioned by date and country. The dependency is expressed as %Y/%m/%d/["US", "CAN", "IRE"].
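A minimal sketch of a table definition that would match such a dependency; the table and column names here are assumptions for illustration:

CREATE EXTERNAL TABLE daily_sales (
    amount FLOAT)
PARTITIONED BY (
    year STRING,      -- matches %Y
    month STRING,     -- matches %m
    day STRING,       -- matches %d
    country STRING)   -- one of 'US', 'CAN', 'IRE'
LOCATION '<scheme>/daily_sales';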