Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a libcudf/thrust-based TPC-H derived datagen #16294

Merged
merged 141 commits into from
Aug 30, 2024

Conversation

JayjeetAtGithub
Copy link
Contributor

@JayjeetAtGithub JayjeetAtGithub commented Jul 16, 2024

Description

This PR adds a TPC-H (according to spec 3.0.1) inspired datagen written using libcudf and thrust

Implementation Status

  • lineitem
  • orders
  • region
  • nation
  • supplier
  • customer
  • part
  • partsupp

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

Copy link

copy-pr-bot bot commented Jul 16, 2024

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions bot added libcudf Affects libcudf (C++/CUDA) code. CMake CMake build issue labels Jul 16, 2024
@JayjeetAtGithub JayjeetAtGithub added feature request New feature or request non-breaking Non-breaking change labels Jul 16, 2024
@JayjeetAtGithub JayjeetAtGithub changed the title Add a libcudf/thrust-based TPC-H inspired datagen [WIP] Add a libcudf/thrust-based TPC-H inspired datagen Jul 16, 2024
@JayjeetAtGithub JayjeetAtGithub marked this pull request as ready for review July 24, 2024 22:24
@JayjeetAtGithub JayjeetAtGithub requested a review from a team as a code owner July 24, 2024 22:24
@JayjeetAtGithub
Copy link
Contributor Author

JayjeetAtGithub commented Jul 24, 2024

TODO:

  • Complete the lineitem and orders tables
  • The stream and mr params need to be passed to the column generation functions throughout the code
  • Need to support scale factors <1

@JayjeetAtGithub JayjeetAtGithub changed the base branch from branch-24.08 to branch-24.10 July 24, 2024 23:48
@JayjeetAtGithub JayjeetAtGithub changed the title [WIP] Add a libcudf/thrust-based TPC-H inspired datagen Add a libcudf/thrust-based TPC-H inspired datagen Jul 27, 2024
Copy link
Contributor

@davidwendt davidwendt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you move the function definitions/implementations from header files to .cpp/.cu files? It appears that dbgen.cu is a .cu file because of header it includes and that header contains device code which seems unnecessarily restrictive.

@JayjeetAtGithub JayjeetAtGithub changed the title Add a libcudf/thrust-based TPC-H inspired datagen Add a libcudf/thrust-based TPC-H deriveddatagen Aug 8, 2024
@JayjeetAtGithub JayjeetAtGithub changed the title Add a libcudf/thrust-based TPC-H deriveddatagen Add a libcudf/thrust-based TPC-H derived datagen Aug 8, 2024
@JayjeetAtGithub JayjeetAtGithub requested a review from a team as a code owner August 8, 2024 19:46
Copy link
Member

@harrism harrism left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a lot to get your head around and to build cleanly, so kudos to you @JayjeetAtGithub for getting it done!

Really I would have expected us to build this in Python, and I'm not sure why we didn't, but I suppose it's a good SOL test to do it in libcudf.

Overall it looks good, I had a couple of questions, but I approve.

Aside: I think that many of the generate_* function implementations have very high cognitive complexity so breaking them up into subfunctions might help maintainability (but I guess there's not much reuse potential). I don't think this is necessary before merging.

@JayjeetAtGithub
Copy link
Contributor Author

Can we use std::span?

I tried using cudf::host_span, but looks like in the perform_left_join code, we use the column indices in the left_on and right_on params to select columns from a table using table_view.select. Now table_view.select accepts only a vector. So we can't use std::span here.

Copy link
Member

@PointKernel PointKernel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the great work. This is a big step forward, and I'm really excited about the upcoming NVBench migration.

@JayjeetAtGithub, could you please document the concerns we discussed in a GitHub issue to ensure we won't miss them in the future?

Comment on lines 385 to 389
auto const gather_map = cudf::table_view({indices->view(), keys->view()});
auto const gathered_table = cudf::gather(
gather_map, ternary_mask->view(), cudf::out_of_bounds_policy::DONT_CHECK, stream, mr);
auto gathered_table_columns = gathered_table->release();
return std::move(gathered_table_columns[1]);
Copy link
Contributor

@karthikeyann karthikeyann Aug 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: indices is not required here in gather_map table.

Comment on lines 402 to 405
auto const gather_map = cudf::table_view({indices->view(), keys->view()});
auto const gathered_table = cudf::gather(
gather_map, mask_index_type->view(), cudf::out_of_bounds_policy::DONT_CHECK, stream, mr);
auto gathered_table_columns = gathered_table->release();
Copy link
Contributor

@karthikeyann karthikeyann Aug 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: indices is not required here in gather_map table.

Copy link
Contributor

@vyasr vyasr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewing just the CMake.

How do you anticipate the generator be used? Will it just be used by other benchmarks in our repository? If so, the CMake looks fine as-is. If you intend broader usage (I'd caution against that unless we have a very good reason) then there is some additional machinery.

@JayjeetAtGithub
Copy link
Contributor Author

Reviewing just the CMake.

How do you anticipate the generator be used? Will it just be used by other benchmarks in our repository? If so, the CMake looks fine as-is. If you intend broader usage (I'd caution against that unless we have a very good reason) then there is some additional machinery.

Thanks for taking a look @vyasr. For now, we only want our internal benchmark code to use the datagen by just linking to the static library libtpch_data_generator.a.

Copy link
Contributor

@karthikeyann karthikeyann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me 👍
Nice work! This is going to be very useful for benchmarks.

@karthikeyann
Copy link
Contributor

/merge

@rapids-bot rapids-bot bot merged commit 23fb31e into rapidsai:branch-24.10 Aug 30, 2024
86 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CMake CMake build issue feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change
Projects
Status: Done
Status: Landed
Development

Successfully merging this pull request may close these issues.

9 participants