Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Simplifying fallback kernels #303

Merged
merged 3 commits into from
Sep 21, 2023
Merged

Simplifying fallback kernels #303

merged 3 commits into from
Sep 21, 2023

Conversation

manodeep
Copy link
Owner

Reduced the amount of code in the fallback kernels. At least on my M2 laptop, it runs faster - slightly faster (5-10%) for DD-type (i.e. small number-density) and significantly (~20-25%) faster for RR-type calculations

@manodeep
Copy link
Owner Author

@lgarrison Didn't realise I hadn't requested a review -- oops!

@manodeep
Copy link
Owner Author

Ohh forgot to mention that I ran the INTEGRATION_TESTS for this change and the exhaustive tests passed

Copy link
Collaborator

@lgarrison lgarrison left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All looks fine to me! Did you do any tests to figure out where the speedup is coming from?

npairs[ibin]++;
if(need_rpavg) {
rpavg[ibin]+=rp;
src_npairs[ibin]++;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess the idea here is that there's no point in making stack buffers, since the passed buffers are already local to the current thread?

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup. I am also considering whether it would be worthwhile to move to a malloc'ed buffer rather than the stack (but there may be side-effects of false sharing under OpenMP with such a malloc'ed src_npairs[nthreads][nbins] kind of matrix)

@manodeep
Copy link
Owner Author

Didn't actually do a line-by-line comparison timer. Will attempt to do that on my laptop; plus, I will also check that the runtime is not adversely affected on our local linux supercomputer (Skylake and AMD EPYC)

@manodeep
Copy link
Owner Author

Timed the tests on the master and this branch on Skylake cpus - essentially no difference for theory but the mocks, specifically DDtheta is faster.

Timed the tests on master vs this branch on EPYC cpus - same as above difference in theory but slightly faster with the simplified kernels (but smaller improvements compared to SKX). In general, both branches run faster on EPYC compared to SKX.

@manodeep manodeep merged commit e72cf70 into master Sep 21, 2023
@manodeep manodeep deleted the simplify_fallback branch September 21, 2023 12:10
@manodeep
Copy link
Owner Author

Totally forgot to merge this PR!

@manodeep
Copy link
Owner Author

manodeep commented May 2, 2024

Commenting to add the link to the original #296 that spurred this work

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants