Joining dataframes #17

Open
jankom opened this issue Jan 29, 2020 · 3 comments

jankom commented Jan 29, 2020

Hi, thank you for writing this library. Are there any plans to add joins? I am not a very experienced Go developer and I doubt my code would be on par with your standards, but if I were to add them, at least for myself, where and how would be the smartest way to do it?

Btw, not totally related, but I am making an interpreter in Go and I will probably use your qframe for its dataframe implementation. It could be a nice solution for interactive data exploration and cleanup. I will show you once the language is more developed.

tobgu (Owner) commented Feb 2, 2020

Thanks for writing! I don't have a specific use case for joins myself at the moment, but I would still very much like them to be added since they are part of a broader "dataframe offering".

You should not worry about giving it a try if you're interested in contributing. We'll sort out how to do it as we go along.

Some ideas/thoughts:

  • It would probably be best/most natural to add a new top level function Join on the qframe which takes another QFrame and a variadic number of functional options.
  • Ultimately I think it would make sense to support the combinations of inner, outer and full outer joins that are available (left/right being determined by which frame the Join function is called on).
  • I think it would make sense to go for a hash join algorithm. Some of the building blocks required for this are already present in the code used for GroupBy and Distinct. There is a hash table here https://github.com/tobgu/qframe/blob/master/internal/grouper/grouper.go that can perhaps be re-used as is.
  • Some data copying will likely be required to merge the two dataframes together. This is probably OK performance-wise, but some care should be taken to reduce it.
  • NULL values should perhaps be configurable (in the case of outer joins) since not all column types have a zero/NULL representation.
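
To make the ideas above concrete, here is a minimal sketch of a hash join driven by functional options. It does not use qframe's actual types or its internal grouper hash table: the names JoinKind, JoinOption, WithKind, IndexPair and HashJoinIndices are all hypothetical, a plain Go map stands in for the hash table, and the key columns are simplified to string slices. It returns pairs of row indices rather than copying data, with -1 marking the side that has no match, so the caller can decide how to represent NULL when the result columns are built.

```go
package main

import "fmt"

// JoinKind selects which unmatched rows are kept (hypothetical type).
type JoinKind int

const (
	Inner JoinKind = iota
	LeftOuter
	FullOuter
)

// joinConfig and JoinOption are hypothetical and only illustrate the
// functional-option style; they are not part of qframe.
type joinConfig struct{ kind JoinKind }

type JoinOption func(*joinConfig)

// WithKind sets the join kind; the default is Inner.
func WithKind(k JoinKind) JoinOption {
	return func(c *joinConfig) { c.kind = k }
}

// IndexPair holds row indices into the left and right key columns.
// An index of -1 means that side has no matching row (a NULL row).
type IndexPair struct{ Left, Right int }

// HashJoinIndices joins two key columns and returns matching row index pairs.
// A right outer join is obtained by swapping the arguments.
func HashJoinIndices(left, right []string, opts ...JoinOption) []IndexPair {
	cfg := joinConfig{kind: Inner}
	for _, opt := range opts {
		opt(&cfg)
	}

	// Build phase: map each right key to the rows where it occurs.
	table := make(map[string][]int, len(right))
	for i, key := range right {
		table[key] = append(table[key], i)
	}

	// Probe phase: look up each left key in the table.
	matched := make([]bool, len(right))
	var out []IndexPair
	for li, key := range left {
		rows, ok := table[key]
		if !ok {
			if cfg.kind != Inner {
				out = append(out, IndexPair{Left: li, Right: -1})
			}
			continue
		}
		for _, ri := range rows {
			matched[ri] = true
			out = append(out, IndexPair{Left: li, Right: ri})
		}
	}

	// A full outer join also keeps right rows that were never matched.
	if cfg.kind == FullOuter {
		for ri, m := range matched {
			if !m {
				out = append(out, IndexPair{Left: -1, Right: ri})
			}
		}
	}
	return out
}

func main() {
	left := []string{"a", "b", "c"}
	right := []string{"b", "c", "c", "d"}
	fmt.Println(HashJoinIndices(left, right))
	fmt.Println(HashJoinIndices(left, right, WithKind(FullOuter)))
}
```

Returning index pairs first keeps the copying to a single pass per output column, which is one way to limit the data copying mentioned above.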

I'd be happy to hear your thoughts on this!

jankom (Author) commented Feb 3, 2020

Thank you for the very well thought out response. I will start looking at the code as you suggested and let you know when I have something working. I have a busy week ahead, so it may not be right away. Thank you!

tobgu (Owner) commented Feb 3, 2020

Cool! Take your time and let me know if you have ideas or questions that you would like to discuss.
