Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Improve histogram bin logic #18761

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

mcrumiller
Copy link
Contributor

@mcrumiller mcrumiller commented Sep 16, 2024

Closes #18650.

This is a revamp of hist and how it deals with bin edges when data is not provided, with the intent of better aligning the defaults with numpy. Bins are still left half-open intervals, i.e. of the form (a, b]. Here are the main changes:

  1. If neither bin_count nor bins is provided, 10 bins are created. The bins are all equally-spaced, spanning from the minimum value to the maximum value. The lowest bin's left edge is increase by 0.1% the span of the entire data (in line with pandas, and our existing implementation) to include the lowest value. This way, all values are included in a call to series.hist() with no arguments, or bin_count provided:
    pl.Series([2, 5, 8]).hist(bin_count=4)
    # shape: (4, 3)
    # ┌────────────┬──────────────┬───────┐
    # │ breakpoint ┆ category     ┆ count │
    # │ ---        ┆ ---          ┆ ---   │
    # │ f64        ┆ cat          ┆ u32   │
    # ╞════════════╪══════════════╪═══════╡
    # │ 3.5        ┆ (1.994, 3.5] ┆ 1     │
    # │ 5.0        ┆ (3.5, 5.0]   ┆ 1     │
    # │ 6.5        ┆ (5.0, 6.5]   ┆ 0     │
    # │ 8.0        ┆ (6.5, 8.0]   ┆ 1     │
    # └────────────┴──────────────┴───────┘
  2. If the series has only a single (non-null) value, a span of 1.0 is used, regardless of the number of bins (when bin_counts is used). The default bin count is still 10.
    pl.Series([5]).hist()
    # shape: (10, 3)
    # ┌────────────┬────────────┬───────┐
    # │ breakpoint ┆ category   ┆ count │
    # │ ---        ┆ ---        ┆ ---   │
    # │ f64        ┆ cat        ┆ u32   │
    # ╞════════════╪════════════╪═══════╡
    # │ 4.6        ┆ (4.5, 4.6] ┆ 0     │
    # │ 4.7        ┆ (4.6, 4.7] ┆ 0     │
    # │ 4.8        ┆ (4.7, 4.8] ┆ 0     │
    # │ 4.9        ┆ (4.8, 4.9] ┆ 0     │
    # │ 5.0        ┆ (4.9, 5.0] ┆ 0     │
    # │ 5.1        ┆ (5.0, 5.1] ┆ 1     │
    # │ 5.2        ┆ (5.1, 5.2] ┆ 0     │
    # │ 5.3        ┆ (5.2, 5.3] ┆ 0     │
    # │ 5.4        ┆ (5.3, 5.4] ┆ 0     │
    # │ 5.5        ┆ (5.4, 5.5] ┆ 0     │
    # └────────────┴────────────┴───────┘
  3. If the series has no values, bins range over the unit interval.
    pl.Series([], dtype=pl.Int8).hist()
    # shape: (10, 3)
    # ┌────────────┬────────────┬───────┐
    # │ breakpoint ┆ category   ┆ count │
    # │ ---        ┆ ---        ┆ ---   │
    # │ f64        ┆ cat        ┆ u32   │
    # ╞════════════╪════════════╪═══════╡
    # │ 0.1        ┆ (0.0, 0.1] ┆ 0     │
    # │ 0.2        ┆ (0.1, 0.2] ┆ 0     │
    # │ 0.3        ┆ (0.2, 0.3] ┆ 0     │
    # │ 0.4        ┆ (0.3, 0.4] ┆ 0     │
    # │ 0.5        ┆ (0.4, 0.5] ┆ 0     │
    # │ 0.6        ┆ (0.5, 0.6] ┆ 0     │
    # │ 0.7        ┆ (0.6, 0.7] ┆ 0     │
    # │ 0.8        ┆ (0.7, 0.8] ┆ 0     │
    # │ 0.9        ┆ (0.8, 0.9] ┆ 0     │
    # │ 1.0        ┆ (0.9, 1.0] ┆ 0     │
    # └────────────┴────────────┴───────┘
  4. If bins is provided, it is always respected.
  5. If bins is provided with only a single edge, then it is zero-width, and no values are binned. The result is empty:
    pl.Series([1, 2, 3]).hist(bins=[2])
    # shape: (0, 3)
    # ┌────────────┬──────────┬───────┐
    # │ breakpoint ┆ category ┆ count │
    # │ ---        ┆ ---      ┆ ---   │
    # │ f64        ┆ cat      ┆ u32   │
    # ╞════════════╪══════════╪═══════╡
    # └────────────┴──────────┴───────┘
  6. You cannot provide both bins and bin_counts:
    pl.Series([2, 5, 8]).hist(bin_count=4)
    # polars.exceptions.ComputeError: can only provide one of `bin_count` or `bins`

This implementation is a bit cleaner in that it separates the bin-creation logic from the binning logic. It also includes a check for uniformity of bins, even if bins are provided by the user, enabling the fast path to be used in more cases. The fast path and the slower path are now a bit more obvious in separate functions, and I think that the new slow should be a bit faster than the older slow path. The new fast path revives an additional check that fixes a few floating point issues when dealing with integers but it's a bit improved (performance wise) compared to the old one.

I do think that hist could use some improvements with a include_lower parameter to avoid the "ugly" lowest bin which is extended by 0.1%, but that can be for another day.

@mcrumiller mcrumiller changed the title Add initial test fix: Improve histogram bin logic Sep 16, 2024
@github-actions github-actions bot added fix Bug fix python Related to Python Polars rust Related to Rust Polars and removed title needs formatting labels Sep 16, 2024
Copy link

codecov bot commented Sep 18, 2024

Codecov Report

Attention: Patch coverage is 96.02649% with 6 lines in your changes missing coverage. Please review.

Project coverage is 79.90%. Comparing base (4894e24) to head (6edf934).
Report is 20 commits behind head on main.

Files with missing lines Patch % Lines
crates/polars-ops/src/chunked_array/hist.rs 96.02% 6 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #18761      +/-   ##
==========================================
+ Coverage   79.85%   79.90%   +0.05%     
==========================================
  Files        1517     1518       +1     
  Lines      205530   205703     +173     
  Branches     2892     2893       +1     
==========================================
+ Hits       164119   164365     +246     
+ Misses      40863    40790      -73     
  Partials      548      548              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@ritchie46
Copy link
Member

@alexander-beedie can you take a look at this one?

@alexander-beedie
Copy link
Collaborator

@alexander-beedie can you take a look at this one?

Looks like an improvement to me, but I'll see if I can run it by one of our more hands-on researchers for bonus feedback tomorrow 👍

@Julian-J-S
Copy link
Contributor

Julian-J-S commented Sep 24, 2024

just sharing my thoughts:

Why deviate from numpy

numpy interval is opened the other side.

  • polars result: 1, 1, 0, 1
  • numpy result: 1, 0, 1, 1
pl.Series([2, 5, 8]).hist(bin_count=4)
shape: (4, 3)
┌────────────┬──────────────┬───────┐
│ breakpointcategorycount │
│ ---------   │
│ f64catu32   │
╞════════════╪══════════════╪═══════╡
│ 3.5        ┆ (1.994, 3.5] ┆ 1     │
│ 5.0        ┆ (3.5, 5.0]   ┆ 1     │
│ 6.5        ┆ (5.0, 6.5]   ┆ 0     │
│ 8.0        ┆ (6.5, 8.0]   ┆ 1     │
└────────────┴──────────────┴───────┘


np.histogram([2, 5, 8], bins=4)
(array([1, 0, 1, 1]), array([2. , 3.5, 5. , 6.5, 8. ]))

Return Struct with Edges (feature)

The current "category" is almost unusable since it is a string/cat.
Would love to see a struct with the edges here instead or additionally.
This also would remove the "strange" 1.994 hack",
Then just document how the intervals are defined and return struct with x1 & x2 as edges 😄
You could also completely remove that "breakpoint" column and use the struct values instead. simpler and more performance I would think 🤔

example

┌────────────┬──────────────┬───────┐
│ breakpoint ┆ EDGES        ┆ count │
│ ---        ┆ ---          ┆ ---   │
│ f64        ┆ STRUCT       ┆ u32   │
╞════════════╪══════════════╪═══════╡
│ 3.5        ┆ {2.0, 3.5}   ┆ 1     │
│ 5.0        ┆ {3.5, 5.0}   ┆ 1     │
│ 6.5        ┆ {5.0, 6.5}   ┆ 0     │
│ 8.0        ┆ {6.5, 8.0}   ┆ 1     │
└────────────┴──────────────┴───────┘

I can also open a feature request if this is too much for this PR?

@mcrumiller
Copy link
Contributor Author

@Julian-J-S I believe @MarcoGorelli has an issue open for exactly the point you make. I did not want to address that in this PR as I didn't want to cover too much ground but I wholeheartedly agree.

Regarding your other issue, I think the bounds difference from numpy is to align with pandas. In my opinion, the type of interval is probably not super important to the histogram, but I do agree about the edge bins being ugly.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
fix Bug fix python Related to Python Polars rust Related to Rust Polars
Projects
None yet
Development

Successfully merging this pull request may close these issues.

hist panics after creating zero bins
4 participants