Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove 'sample' parameter from stats::mean API #2389

Open
wants to merge 15 commits into
base: branch-24.08
Choose a base branch
from

Conversation

mfoerste4
Copy link
Collaborator

This PR removes the sample-parameter from the raft::stats::mean API to prevent people from using it by accident when for example computing the mean for a sampled variance computation.

This also invalidates some of the testcases. Within raft only test-code is affected by this change as the active usage of the sample parameter was already removed in #2381.

This PR is based on #2381 but was separated for tracking purposes.

Note that this requires adaption of downstream libraries using the API. I am aware of at least one occurrence in cuml.

@mfoerste4 mfoerste4 requested a review from a team as a code owner July 24, 2024 12:16
@github-actions github-actions bot added the cpp label Jul 24, 2024
@mfoerste4
Copy link
Collaborator Author

@tfeher , I have already added the PR for this before #2381 is merged in order to start the pipeline, the only actual commit is this one.

@mfoerste4 mfoerste4 added breaking Breaking change improvement Improvement / enhancement to an existing function labels Jul 24, 2024
@@ -27,10 +27,9 @@ namespace stats {
namespace detail {

template <typename Type, typename IdxType = int>
void mean(
Type* mu, const Type* data, IdxType D, IdxType N, bool sample, bool rowMajor, cudaStream_t stream)
void mean(Type* mu, const Type* data, IdxType D, IdxType N, bool rowMajor, cudaStream_t stream)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the reason for removing the ability to compute a sample-based mean vs a population-based mean? This feature seems useful.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is correct for some statistical metrics like stddev or var, but IIUC that theory does not hold for the mean. The sample mean is the closest we have to the population mean.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both sample and population mean shall have 1/N factor. That gives the unbiased estimate. (This is unlike stdev or variance where we need to use different factor for sample and population mean).

Copy link
Contributor

@tfeher tfeher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Malte for this PR. I agree that using 1/(N-1) factor for sample mean is not correct, we should have a single option that has 1/N factor. Therefore it makes sense to remove the sample parameter.

One should still consider that this is a breaking change, so we can consider to postpone this PR for 24.10, and issue a warning in 24.08 if the sample param is used.

@cjnolet
Copy link
Member

cjnolet commented Jul 24, 2024

One should still consider that this is a breaking change, so we can consider to postpone this PR for 24.10, and issue a warning in 24.08 if the sample param is used.

This was the primary reason I pointed this out. We need to be careful about updates to public apis specifically, because we can only assume they are being used downstream (and we need to check).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
breaking Breaking change cpp improvement Improvement / enhancement to an existing function Public API
Projects
Status: In Progress
Development

Successfully merging this pull request may close these issues.

3 participants