Skip to content

UTC Timestamp Data Type Proposal

Paul Rogers edited this page Apr 27, 2020 · 1 revision

UTC Timestamp for Drill

Drill provides the TIMESTAMP data type which is a millisecond-resolution timestamp in the local time zone. Proposed is a new UTCTIME timestamp with nanosecond resolution in the UTC time zone.

With a UTC timestamp available, we will find it necessary to revisit and rationalize Drill's other date/time data types. Proposed are new types to accurately capture the idea of a zone-less date and time, and ways to define the existing data types (with their quirks) to preserve backward compatibility.

Notation

SQL does not provide syntax for declaring data types. This note uses a simple extension:

FN(type1) --> type2

Where type and type2 are Drill data types. The --> notation simply means that FN maps type1 into type2.

Some Drill functions use a data type as a literal. In this case, we trust the reader to understand the difference between a type and literal:

CAST(BIGINT AS UTCTIME) --> UTCTIME

Here, the first UTCTIME is a literal.

UTCTIME Data Type

UTCTIME is a data type for a value, represented as a 64-bit integer, that gives the number of nanoseconds since the epoch, UTC.

Semantics

Drill has two informal groupings of data types: inferred types and explicit types. Since Drill operates without schema, readers map from incoming data to one of the inferred types (typically VARCHAR, INT or BIGINT, FLOAT8 and so on.) The explicit types add behavior, but typically cannot be inferred from an input. Explicit types are things like (VARDECIMAL, TIMESTAMP, DATE, TIME, INTERVAL and so on.) UTCTIME is an explicit data type: the user must write a SQL expression to convert an inferred BIGINT column to a UTCTIME value.

Constant Declaration

We add a new keyword to Drill's existing TIMESTAMP syntax:

  • TIMESTAMP string --> TIMESTAMP -- Declare a constant in the local time zone.
  • UTCTIME string --> UTCTIME -- Declare a constant in the UTC time zone. The string is in ISO notation and may include a timezone offset portion.

Conversion to/from BIGINT

UTCTIME is directly converable to and from the BIGINT data type:

CAST(UTCTIME AS BIGINT) --> BIGINT
CAST(BIGINT AS UTCTIME) --> UTCTIME

The conversion is direct: the numeric value of the BIGINT becomes the nanoseconds since the epoch value for UTCTIME and visa-vera.

The following invariants hold for BIGINT column bi and UTCTIME column utc:

CAST( CAST(utc AS BIGINT) AS UTCTIME) = utc --> TRUE
CAST( CAST(bi AS UTCTIME) AS BIGINT) = bi --> TRUE

Math Operations

The following math operations are extended for UTCTIME:

  • UTCTIME +/- INTERVAL --> UTCTIME -- Add or subtract a time interval to/from a UTC timestamp giving another UTC timestamp. (Same as DATE_ADD below.)
  • UTCTIME +/- BIGINT --> UTCTIME -- Add or subtract a given number of nanoseconds to/from a UTC timestamp giving another UTC timestamp. (Same as DATE_ADD below.)
  • UTCTIME - UTCTIME --> BIGINT -- Subtract one UTC timestamp from another to produce a difference in nanoseconds. (See also DATE_SUB below.)
  • UTCTIME relop UTCTIME -- Compare two UTC times. Comparision is based on the underlying 64-bit integer values.

Extensions to Existing Drill Date/Time Functions

Drill provides a number of date/time functions which are extended for UTCTIME:

  • AGE(UTCTIME) --> INTERVAL -- Returns the span of time between the given time and now.
  • AGE(UTCTIME, UTCTIME) --> INTERVAL -- Returns the span beween two UTC times.
  • DATE_ADD(UTCTIME, INTERVAL) --> UTCTIME -- Adds an interval to a UTC timestamp giving a new UTC timestamp.
  • DATE_DIFF(UTCTIME, UTCTIME) --> INTERVAL -- Same as AGE.
  • DATE_PART(keyword, UTCTIME) --> INTEGER -- Returns the date part of a UTC timestamp.
  • DATE_SUB(UTCTIME, INTERVAL) --> UTCTIME -- Subtracts a date/time interval from a UTC timestamp, giving another UTC timestamp.
  • EXTRACT(arg FROM UTCTIME) --> DOUBLE -- Returns the value of a date/time field.

Note that Drill intervals are accurate to milliseconds. To compute the precise difference between two UTC timestamps, convert them to BIGINT first, to get another BIGINT which represents the difference in milliseconds.

New UTC Functions Parallel to Local Time Functions

The following new functions complement existing functions.

  • NOW_UTC --> UTCTIME is parallel to CURRENT_TIMESTAMP --> TIMESTAMP and NOW --> TIMESTAMP, produces the current time, UTC to a resolution of nanoseconds.

Drill has other "CURRENT" functions (CURRENT_TIME, CURRENT_DATE, etc.) We omit UTC versions because UTC is used in use cases here current dates and times are seldom needed. The values can be computed if needed.

New UTC Functions for Time Buckets

New functions to generalize existing functions:

  • GROUP_TIME(UTCTIME BY arg) --> UTCTIME, where arg is one of DAY, MONTH, YEAR, or is INTEGER unit where unit is one of the units above, optionally with an S appended (e.g. 5 DAYS.) -- Truncates the time to the given unit so that all time values within that same range map to the time at the start of that range.
  • ROUND_TIME(UTCTIME TO arg) --> UTCTIME, where arg is as defined above. -- As above, but rounds rather than truncates.

Example Querie

Here is a current Drill query which uses a BIGINT as a stand-in for the missing UTCTIME data type:

SELECT hostname,
       time_bucket_ns(timestamp_ns, 36000000) as hour_ts,
       AVG(usage_user) as avg_metric
FROM dfs.test.`tsbs_test_large.json`
WHERE (timestamp_ns / 1000000000) BETWEEN 1451606400 AND (1451606400 + 3600*24)
GROUP BY hostname, hour_ts

And the proposed Drill version after the revisions proposed here:

SELECT hostname,
       GROUP_TIME(timestamp_ts BY HOUR) as hour_ts,
       AVG(usage_user) as avg_metric
FROM ...
WHERE timestamp_ts >= UTCTIME '2016-01-01T00:37:12Z'
  AND timestamp_ts <  UTCTIME '2016-01-01T00:37:12Z'
                               + INTERVAL(1 DAY)
GROUP BY hour_ts

Implementation

The UTCTIME is represented by an 8-byte signed integer; the same as the BIGINT data type.

Some basic implementation details:

  • Add UTCTIME to the MinorType enum.
  • Add UTCTIME to the Types class, metadata classes and so on.
  • Create a UtcTimeVector and associated nullable and repeated vectors, holders and so on.
  • Create UTCTIME column readers and writers (in both new and legacy formats.) Use a Java long to store the timestamp at the API level.
  • Implement each of the proposed Drill functions.
  • Implement code generation for the proposed math operations.

Note that the changes must be done in a way that are backward compatible since Drill provides an ODBC driver which we are no longer able to update.

UTCTIME Conversions to Existing Types

A robust product will provide conversion to/from Drill's existing data types.

Conversion to/from TIMESTAMP

The Drill TIMESTAMP class is defined as a timestamp in local time. However since Drill previously had no UTC time type, people sometimes used TIMESTAMP as if it were UTC. Still, Drill functions do assume local time, hence TIMESTAMP behavior is very confusing to most people.

It is possible (though perhaps not wise) to convert a UTCTIME to a Drill TIMESTAMP and visa-versa:

UTC_TO_LOCAL(UTCTIME) --> TIMESTAMP rounds the nanosecond resolution to the closest millisecond, then performs the required conversion from the UTC time zone to the local time zone using the Java 8 timezone operations.

LOCAL_TO_UTC(TIMESTAMP) --> UTCTIME performs timezone conversion from local time to UTC, then multiplies the value by a million to convert from milliseconds to nanoseconds.

Alternative syntax, since SQL allows syntactic structures inside functions:

CONVERT_TIME(UTCTIME TO LOCAL) --> TIMESTAMP
CONVERT_TIME(TIMESTAMP TO UTC) --> UTCTIME

With the obvious no-op transformations:

CONVERT_TIME(UTCTIME TO UTC) --> UTCTIME
CONVERT_TIME(TIMESTAMP TO LOCAL) --> TIMESTAMP

The above syntax could be extended to take a timezone name in place of the two kewords. However, it is beyond the scope of this project to convert to/from timezones other than the local time as Drill provides no data types to represent other timezones.

For those people who have (mistakenly) treated as TIMESTAMP as UTC, the following (undocumented, except for here) workaround is possible to undo the error:

CAST( CAST(TIMESTAMP AS BIGINT) AS UTCTIME) --> UTCTIME

Conversion to DATE

In SQL, DATE is a concept independent of timezones: it is simply a representation of the concept of a day on a calendar, it is not a date in a specific location. Drill, however, represents a DATE as a millisecond timestamp of midnight relative to the epoch in the local timezone. This representation is unfortunate as it leads to endless confusion. There are two interpretations of a DATE:

  • As a calendar date
  • As an instant in the local time zone at midnight on the given date.

Unfortunately, these two interpretations give different results with working with UTC dates. Under the calendar date interpretation, converting from UTC to a DATE would produce the date in the UTC timezone on which the instant occurred. Under the local time interpretation, we would get the date in the local time zone on which the instant occurred, which may be different.

Fixing this issue is, however, beyond the scope of this project. Instead, we live with the muddle by providing two functions:

localDate(utcTimeCol) AS localDate
utcDate(utcTimeCol) as utcDate

localDate(UTCTIME) --> DATE performs UTC-to-local time conversion as described above, then truncates the value to a date value.

utcDate(UTCTIME) --> DATE omits the timezone conversion and simply truncates the valute to a date value. Users must be careful not to interpret the DATE's timezone.

Conversion to TIME

In SQL, a TIME is simply a clock reading: a duration since midnight in a relative, unspecified time zone. In Drill, the time is assumed to be in the local time zone to be consistent with DATE. That is, add a DATE and a TIME to produce a TIMEZONE in local time.

In fact, a TIME is just an offset from midnight. But, given the historical interpretation in Drill, we also need to conversion functions:

localTime(utcTimeCol) AS localTime
utcTime(utcTimeCol) AS utcTime

localTime(UTCTIME) --> TIME performs UTC-to-local time conversion, then truncates the date portion, leving only the offset from midnight local time.

utcDate(UTCTIME) --> TIME performs only the truncation giveing the offset from midnight UTC.

With the two UTC functions, we end up with two numbers (DATE and TIME), which, if added, would produce a UTC timestamp in milliseconds for which Drill has no declared data type. Multiplying by a million would recover a (rounded) UTCTIME`.

Implicit Conversions in Functions

  • AGE(UTCTIME, TIMESTAMP) --> INTERVAL -- Converts local time to UTC, then works as above.
  • AGE(TIMESTAMP, UTCTIME) --> INTERVAL -- As above.
  • DATE_ADD(UTCTIME, arg) --> UTCTIME -- Adds an interval to a UTC timestamp producing another UTC timestamp.
  • DATE_DIFF(TIMESTAMP, UTCTIME) --> INTERVAL -- Same as AGE.
  • DATE_DIFF(UTCTIME, TIMESTAMP) --> INTERVAL -- Same as AGE.

The following new functions complement existing functions.

  • LOCALTIME ?
  • LOCALTIMESTAMP ?
  • NOW ?
  • TIMEOFDAY ?
  • UNIX_TIMESTAMP

An alternative is to create one-argument versions of each function:

  • CURRENT_DATE(LOCAL|UTC)
  • CURRENT_TIME(LOCAL|UTC)
  • CURRENT_TIMESTAMP(LOCAL|UTC)

Future Directions

Although beyond the scope of this project, it is worth thinking how we would round out Drill's date/time support.

Local Dates and Times

We would start by adding three new types which are explicitly not tied to any time zone:

  • Add LOCALDATE - A date with no implied relation to a timestamp, it is just a combination of a year, month and day.
  • Add LOCALTIME - A time with no implied relation to a timestamp or time zone, it is just a combination of a hour, month, second and fractions of a second.
  • Add DATETIME - A date and time with no implied time zone.

These form the following relation:

DATETIME = LOCALDATE + LOCALTIME

Clarify Existing Types

We cannot change the meaning of existing types, but we can be very clear how they work:

  • TIMESTAMP is the number of milliseconds since midnight Jan 1, 1970 in the local time zone.
  • DATE is a timestamp in the local time zone truncated to midnight.
  • TIME is a timestamp in the local time zone with days remove.

The existing types now are defined via the following relationship:

TIMESTAMP = DATE + TIME

Constant Keywords

Drill provides a number of keywords to define date/time constants. Exising keywords:

  • TIMESTAMP string --> TIMESTAMP -- Declare a TIMESTAMP constant.
  • INTERVAL args --> INTERVAL -- Declare an INTERVAL constant.
  • DATE string --> DATE -- Declare a date in the local time zone.
  • TIME string --> TIME -- Declare a time in the local time zone.

To these we add new keywords:

  • DATETIME string --> DATETIME
  • LOCALDATE string --> LOCALDATE
  • LOCALTIME string --> LOCALTIME

Type Conversions

Casts convert between these types. Casts simply reinterpret the value; they apply no time zone conversions.

  • CAST(DATETIME AS UTCTIME) --> UTCTIME -- Interpret a date/time as UTC.
  • CAST(DATETIME AS TIMESTAMP) --> TIMESTAMP -- Interpret a date/time as local.
  • CAST(TIMESAMP AS DATETIME) --> DATETIME -- Return the date and time (without timezone information) for a local timestamp.
  • CAST(UTCTIME AS DATETIME) --> DATETIME -- Return the date and time (without timezone information) for a UTC timestamp.
  • CAST(LOCALDATE AS DATETIME) --> Converts a local date to a local date/time at midnight.
  • CAST(DATETIME AS LOCALDATE) --> Returns the date portion of date/time.
  • CAST(DATETIME AS LOCALTIME) --> Returns the time portion of a date/time.
  • CAST(TIMESTAMP AS LOCALDATE) --> LOCALDATE -- Return the date part of a local timestamp.
  • CAST(UTCTIME AS LOCALDATE) --> LOCALDATE -- Return the date part of a UTC timestamp.
  • CAST(TIMESTAMP AS LOCALTIME) --> LOCALTIME -- Return the time part of a local timestamp.
  • CAST(UTCTIME AS LOCALTIME) --> LOCALTIME -- Return the time part of a UTC timestamp.

Automatic Conversions

In general, Drill should perform automatic converions (via casts) for all functions. If either argument is in localtime, then all arguments should be cast to local date and/or time. For example:

AGE(LOCALDATE, UTCTIME)

is equivalent to:

AGE( CAST(LOCALDATE AS DATETIME), CAST(UTCTIME AS DATETIME))

Both widening and narrowing operations are supported. If both are possible, narrowing is chosen (based on standard "significant digits" reasoning):

AGE(LOCALDATE, DATETIME)

Is converted to:

AGE( LOCALDATE, CAST(DATETIME AS LOCALDATE))

If one argument is in UTC time, and the other is local, then both arguments should be converted to UTC. For example:

AGE(TIMESTAMP, UTCTIME)

Is equivalent to:

AGE( UTCTIME(TIMESTAMP), UTCTIME)

Extensions to Existing Functions

Extensions to existing Drill functions:

  • AGE(LOCALDATE, LOCALDATE) --> INTERVAL -- Interval (of days) between two dates.
  • AGE(DATETIME, DATETIME) --> INTERVAL -- Interval (of days) between two date/times.
  • DATE_ADD(LOCALDATE, arg) --> LOCALDATE -- Add an interval to a local date.
  • DATE_ADD(DATETIME, arg) --> DATETIME -- Add an interval to a local date/time.
  • DATE_DIFF(LOCALDATE, LOCALDATE) --> INTERVAL -- Subtract to local dates.
  • DATE_DIFF(DATETIME, DATETIME) --> INTERVAL -- Subtract two local date/times.
  • DATE_PART(LOCALDATE, arg) --> INTEGER -- Extract a field from a local date.
  • DATE_PART(DATETIME, arg) --> INTEGER -- Extract a field from a local date/time.
  • DATE_SUB(LOCALTIME, INTERVAL) --> LOCALTIME -- Subtract an interval from a local time.
  • DATE_SUB(LOCALDateE, INTERVAL) --> LOCALTIME -- Subtract an interval from a local date.
  • DATE_SUB(DATETIME, INTERVAL) --> LOCALTIME -- Subtract an interval from a local date/time.
  • EXTRACT(field FROM LOCALDATE | LOCALTIME | DATETIME) --> DOUBLE -- Extract the given field from the date and/or time. Returns 0 if the field is not represented.

New functions to generalize existing functions:

  • ROUND_DATE(type TO arg) --> type, where type is one of DATE, TIMESTAMP, UTCTIME, LOCALDATE or DATETIME, and arg is one of DAY, MONTH, YEAR, or is INTEGER unit where unit is one of the units above, optionally with an S appended (e.g. 5 DAYS.)
  • ROUND_TIME(type TO arg) --> type, where type is one of TIME, TIMESTAMP, UTCTIME, LOCALTIME or DATETIME, and arg is one of HOUR, MINUTE, SECOND, or MS, or is INTEGER UNIT as above.

Functions to be deprecated (though left in the product) because they are redundant:

  • LOCALTIMESTAMP
  • NEARESTDATE(LOCALDATE | DATETIME) -- Rounds to the nearest date.
  • TIMESTAMPADD
  • TIMESTAMPDIFF
Clone this wiki locally