Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

time.Time column type #40

Open
jrkhan opened this issue Jun 29, 2021 · 4 comments
Open

time.Time column type #40

jrkhan opened this issue Jun 29, 2021 · 4 comments

Comments

@jrkhan
Copy link
Contributor

jrkhan commented Jun 29, 2021

Hello! I have a use case where I'd like to sort/filter/group based on time. While in practice, I could likely use an int/string column to represent my time and convert it (if necessary) within an operation to time.Time - I'm not inclined to think this will perform as well as a solution which stores the time internally.
Curious if you would:

  1. Think it makes sense to add a new column type vs piggybacking on int or string
  2. Be open to a pull request for a time.Time column,
  3. Have info on all the places to look to add a new column type/have a high level description of what would be required?

Thanks in advance!

@jrkhan
Copy link
Contributor Author

jrkhan commented Jun 30, 2021

As a starting point, I was thinking one might create a /tcolumn package that had roughly the equivalent of:
https://github.com/tobgu/qframe/tree/master/internal/icolumn
And then make sure to add a type to:
https://github.com/tobgu/qframe/blob/master/types/types.go
And make sure to add handling for the new types.Time for anything switching on type, so at the minimum:
https://github.com/tobgu/qframe/search?q=types.Int

@tobgu
Copy link
Owner

tobgu commented Jun 30, 2021

Hi!

I'm very much open to adding a time type to qframe if you see use for it. There have been discussions/suggestions for such a type before. Would it be possible to list a couple of use cases for such a type where it makes usage easier (or more intuitive) compared to an int for example. Just to know what the priorities are and what we're aiming for.

The entrypoints you list above are good starting points/templates for adding a new type. You'd probably need to add some basic functions for serialization to/from strings/ints as well.

I'd be happy to help out with any questions and PR reviewing if you want to contribute a new time type!

@jrkhan
Copy link
Contributor Author

jrkhan commented Jul 2, 2021

Thanks!
So, after doing some benchmarking, I'm a bit less concerned about performance - but there are a few quality of life benefits.

Here are some examples for talking points:

// building a qframe with int based time and filtering
times := []string{
    "Monday, 02-Jan-06 12:00:00 EST",
    "Tuesday, 03-Jan-06 12:00:00 EST",
    "Wednesday, 04-Jan-06 12:00:00 EST",
    "Wednesday, 11-Jan-06 12:00:00 EST",
}
dateUnix := make([]int, len(times))
for i, ts := range times {
    d, _ := time.Parse(time.RFC850, ts)
    dateUnix[i] = int(d.Unix())
}
f := qframe.New(map[string]interface{}{
    "Labels":   []string{"A", "B", "C", "D"},
    "UnixDate": dateUnix,
})
// filter For Wednesday - we lost our location by converting to unix time!
loc := location.LoadLocation("America/New_York")	
Wednesdays := f.Filter(qframe.Filter{Column: "UnixDate", Comparator: isUnixDayOfTheWeek(3, loc)})

And possible implementations for isUnixDayOfTheWeek:

type UnixTimeFilter = func(x int) bool
func isUnixDayOfTheWeekApprox(dotw int) UnixTimeFilter {
	return func(x int) bool {
		// example of a simple check for day of the week
                // but does not account for leap seconds/time zones
		return ((x/86400)+4)%7 == dotw
	}
}
func isUnixDayOfTheWeek(dotw int, loc *time.Location) UnixTimeFilter {
   return func(x int) bool {
       // slower, but should take into account time zones
       return time.Unix(int64(x), 0).In(loc).Day() == dotw
   }
}

In this scenario, my main concerns were the complexity of making filters time zone aware, and possible performance hit from doing conversion vs:

type TimeFilter = func(t time.Time) bool
func isDayOfTheWeek(dotw int) TimeFilter {
    return func(t time.Time) bool {
        return t.Day() == dotw
    }
}

However, after doing some benchmarking (in this case a filter testing if hour of the day is between noon and 2PM - see gist for implementation):

goos: darwin
goarch: amd64
cpu: Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
BenchmarkHoursBetween/UTC_Times_(int)-8         	1000000000	         0.006936 ns/op	       0 B/op	       0 allocs/op
BenchmarkHoursBetween/EST_Test_(int)-8          	1000000000	         0.02231 ns/op	       0 B/op	       0 allocs/op
BenchmarkHoursBetween/WET_Times_(int)-8         	1000000000	         0.02234 ns/op	       0 B/op	       0 allocs/op
BenchmarkHoursBetween/UTC_Times_(time.Time)-8   	1000000000	         0.006877 ns/op	       0 B/op	       0 allocs/op
BenchmarkHoursBetween/EST_Test_(time.Time)-8    	1000000000	         0.02426 ns/op	       0 B/op	       0 allocs/op
BenchmarkHoursBetween/WET_Times_(time.Time)-8   	1000000000	         0.006018 ns/op	       0 B/op	       0 allocs/op

it seems that filtering half a million timestamps isn't very time consuming regardless of if you convert from an int first
(One interesting and unexpected observation, it seems that the timezone has a big impact on how long this filtering operation takes! Times in EST were consistently 10x slower than equivalent timestamps in UTC for this test)

Another use case would be to drive logic off of column type:

if f.ColumnTypeMap()["time_StartTime"] == types.Int {
  // need to keep my own naming convention/external map to know if this is actually a Time type
}

vs

if f.ColumnTypeMap()["StartTime"] == types.Time {
   ...
}

Adopting a naming convention where I prefix my date column names with "time_" seems like a reasonable work around.

Being able to pull in time.Time directly from sql would also be helpful.

@tobgu
Copy link
Owner

tobgu commented Jul 3, 2021

Great investigation! I guess what t.Day() actually does is very similar to what you've done In(loc) for all timeones != UTC.

Being able to read time directly from SQL is a nice use case!

One point worth noting about time.Time is that it contains a pointer to a Location object. This means that a big frame with time columns is potentially quite heavy on the GC since it must now traverse all rows to follow the pointers in them to learn which objects are still alive. For the other types in QFrame I've made sure that they're only made up of primitive types to avoid this. If it's a problem or not is of course determined by the use case. In many cases it isn't. For the use case QFrame was originally developed (Qocache) it certainly would since it's basically a cache holding lots of QFrames of different sizes. I don't see that as a game stopper though as time.Time would not be used there.

One could of course come up with other, more space, and GC efficient, internal representations and then convert it on the fly to time.Time when calling filters, aggregations, etc. for convenience. This would be similar to what's done today for enum and string columns (which are just one big slice of bytes + a slice of integer offsets internally).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants