Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

special handling of linear binning #258

Merged
merged 22 commits into from
May 12, 2022

Conversation

adematti
Copy link

Hi,

As discussed with Lehman for DESI applications, this is a first attempt to implement a special handling of linear binning, which helps corrfunc run faster with a large number of bins.
There is no speed gain at low number of bins ~ 10, but a speed gain > 3x for bins ~ 200.
I added a flag bin_type to the Python wrapper, to choose between: 'auto', 'lin', 'custom'; with 'lin' for linear binning, 'auto' (default) to automatically detect linear binning
(which happens here: https://github.com/adematti/Corrfunc/blob/c500d2c9137ff7f66d34eb4189bc19f9325d84ff/utils/utils.c#L85).
The following pair counters have been updated:
mocks: DDrppi_mocks, DDsmu_mocks
theory: DD, DDrppi, DDsmu, xi, wp
I did not update DDtheta_mocks (yet), as there are implementation choices to be made:
a) should I consider the binning to be linear in theta or acos? (I guess the former, as the pair counts for the latter can be obained from DD if maximum separation < \pi)
b) should I use fast acos to compute the bin for a given pair? in this case, a pair can fall in the wrong bin due to numerical approximation
To be discussed, also:

  1. should I pass rstep (bin width), rmin to the kernels directly (currently taking the square root of sqr_rmin, sqr_rmax, which is slightly suboptimal)
  2. should I print the chosen binning type; in this case, where? in each countpairs_DOUBLE function?
  3. should I leave option 'auto'? it checks for linear binning, with absolute and relative tolerance of 1e-12 for both double and floats.
    Not sure how it behaves in practice for the latter case (floats), see remark ii) below.
    I also have a couple of remarks:
    i) this line
    return ra.astype(input_dtype), dec.astype(input_dtype)
    seems unnecessary?
    (as array updates are done in place and calls to this functions do not retrieve the returned arrays)
    ii) I don't think setup_bins_float is used anywhere in the code, and setup_bins_double is only used here:
    setup_bins_DOUBLE(binfile,&thetamin,&thetamax,&nthetabin,&theta_upp);

    Is there any reason not to use a single version of this function, setup_bins?
    iii) Some time ago I had binning errors with Corrfunc after calling matplotlib. I figured out this was due to how the string separation for floats was handled (matplotlib changed the environment variables specifying this convention;
    export LC_NUMERIC=C solved this.). For this issue not to happen to other people, I guess the simplest solution would be to pass bins as an array rather than a file to be read from disk. Is there any reason not to do so?

Thanks!
Best,
Arnaud

@pep8speaks
Copy link

pep8speaks commented Sep 20, 2021

Hello @adematti! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 225:80: E501 line too long (80 > 79 characters)
Line 233:80: E501 line too long (80 > 79 characters)
Line 236:80: E501 line too long (81 > 79 characters)
Line 346:80: E501 line too long (96 > 79 characters)

Line 235:80: E501 line too long (80 > 79 characters)
Line 238:80: E501 line too long (81 > 79 characters)
Line 267:80: E501 line too long (96 > 79 characters)

Line 202:80: E501 line too long (82 > 79 characters)
Line 212:80: E501 line too long (80 > 79 characters)
Line 215:80: E501 line too long (81 > 79 characters)
Line 297:80: E501 line too long (96 > 79 characters)

Line 15:1: E302 expected 2 blank lines, found 1
Line 17:26: E231 missing whitespace after ','
Line 17:38: E231 missing whitespace after ','
Line 23:80: E501 line too long (95 > 79 characters)
Line 29:67: E251 unexpected spaces around keyword / parameter equals
Line 29:69: E251 unexpected spaces around keyword / parameter equals
Line 29:80: E501 line too long (98 > 79 characters)
Line 33:1: W293 blank line contains whitespace
Line 40:52: E231 missing whitespace after ','
Line 44:21: E128 continuation line under-indented for visual indent
Line 49:22: E231 missing whitespace after ','
Line 52:1: W293 blank line contains whitespace
Line 54:1: W293 blank line contains whitespace
Line 55:1: E302 expected 2 blank lines, found 1
Line 55:62: E231 missing whitespace after ','
Line 59:21: E128 continuation line under-indented for visual indent
Line 64:22: E231 missing whitespace after ','
Line 67:1: W293 blank line contains whitespace
Line 77:44: E231 missing whitespace after ','
Line 77:52: E231 missing whitespace after ','
Line 77:58: E231 missing whitespace after ','
Line 78:80: E501 line too long (116 > 79 characters)
Line 78:95: E251 unexpected spaces around keyword / parameter equals
Line 78:97: E251 unexpected spaces around keyword / parameter equals
Line 80:7: E231 missing whitespace after ','
Line 80:11: E231 missing whitespace after ','
Line 80:14: E231 missing whitespace after ','
Line 81:1: W293 blank line contains whitespace
Line 83:19: E128 continuation line under-indented for visual indent
Line 84:19: E128 continuation line under-indented for visual indent
Line 85:19: E128 continuation line under-indented for visual indent
Line 86:19: E128 continuation line under-indented for visual indent
Line 87:1: W293 blank line contains whitespace
Line 88:1: E302 expected 2 blank lines, found 1
Line 88:44: E231 missing whitespace after ','
Line 88:52: E231 missing whitespace after ','
Line 88:58: E231 missing whitespace after ','
Line 89:80: E501 line too long (145 > 79 characters)
Line 89:125: E251 unexpected spaces around keyword / parameter equals
Line 89:127: E251 unexpected spaces around keyword / parameter equals
Line 89:138: E251 unexpected spaces around keyword / parameter equals
Line 89:140: E251 unexpected spaces around keyword / parameter equals
Line 91:7: E231 missing whitespace after ','
Line 91:11: E231 missing whitespace after ','
Line 91:14: E231 missing whitespace after ','
Line 92:1: W293 blank line contains whitespace
Line 94:19: E128 continuation line under-indented for visual indent
Line 95:19: E128 continuation line under-indented for visual indent
Line 96:19: E128 continuation line under-indented for visual indent
Line 97:19: E128 continuation line under-indented for visual indent
Line 100:44: E231 missing whitespace after ','
Line 100:52: E231 missing whitespace after ','
Line 100:58: E231 missing whitespace after ','
Line 101:80: E501 line too long (90 > 79 characters)
Line 103:7: E231 missing whitespace after ','
Line 103:11: E231 missing whitespace after ','
Line 103:14: E231 missing whitespace after ','
Line 104:1: W293 blank line contains whitespace
Line 106:19: E128 continuation line under-indented for visual indent
Line 107:19: E128 continuation line under-indented for visual indent
Line 108:19: E128 continuation line under-indented for visual indent
Line 109:19: E128 continuation line under-indented for visual indent
Line 110:19: E128 continuation line under-indented for visual indent
Line 113:44: E231 missing whitespace after ','
Line 113:52: E231 missing whitespace after ','
Line 113:58: E231 missing whitespace after ','
Line 114:80: E501 line too long (96 > 79 characters)
Line 116:6: E231 missing whitespace after ','
Line 116:8: E231 missing whitespace after ','
Line 116:10: E231 missing whitespace after ','
Line 118:1: W293 blank line contains whitespace
Line 122:1: W293 blank line contains whitespace
Line 124:44: E231 missing whitespace after ','
Line 124:52: E231 missing whitespace after ','
Line 124:58: E231 missing whitespace after ','
Line 125:80: E501 line too long (111 > 79 characters)
Line 127:6: E231 missing whitespace after ','
Line 127:8: E231 missing whitespace after ','
Line 127:10: E231 missing whitespace after ','
Line 131:19: E128 continuation line under-indented for visual indent
Line 132:19: E128 continuation line under-indented for visual indent
Line 133:19: E128 continuation line under-indented for visual indent
Line 134:19: E128 continuation line under-indented for visual indent
Line 137:44: E231 missing whitespace after ','
Line 137:52: E231 missing whitespace after ','
Line 137:58: E231 missing whitespace after ','
Line 138:80: E501 line too long (98 > 79 characters)
Line 139:39: E127 continuation line over-indented for visual indent
Line 141:6: E231 missing whitespace after ','
Line 141:8: E231 missing whitespace after ','
Line 141:10: E231 missing whitespace after ','
Line 145:19: E128 continuation line under-indented for visual indent
Line 146:19: E128 continuation line under-indented for visual indent
Line 147:19: E128 continuation line under-indented for visual indent
Line 148:19: E128 continuation line under-indented for visual indent
Line 149:19: E128 continuation line under-indented for visual indent
Line 151:1: W293 blank line contains whitespace
Line 152:44: E231 missing whitespace after ','
Line 152:52: E231 missing whitespace after ','
Line 152:58: E231 missing whitespace after ','
Line 153:80: E501 line too long (107 > 79 characters)
Line 155:6: E231 missing whitespace after ','
Line 155:8: E231 missing whitespace after ','
Line 155:10: E231 missing whitespace after ','
Line 157:1: W293 blank line contains whitespace
Line 159:19: E128 continuation line under-indented for visual indent
Line 160:19: E128 continuation line under-indented for visual indent
Line 161:19: E128 continuation line under-indented for visual indent
Line 163:1: W293 blank line contains whitespace
Line 164:44: E231 missing whitespace after ','
Line 164:52: E231 missing whitespace after ','
Line 164:58: E231 missing whitespace after ','
Line 165:80: E501 line too long (96 > 79 characters)
Line 167:6: E231 missing whitespace after ','
Line 167:8: E231 missing whitespace after ','
Line 167:10: E231 missing whitespace after ','
Line 169:1: W293 blank line contains whitespace
Line 171:19: E128 continuation line under-indented for visual indent
Line 172:19: E128 continuation line under-indented for visual indent
Line 173:19: E128 continuation line under-indented for visual indent

Line 153:80: E501 line too long (80 > 79 characters)
Line 156:80: E501 line too long (81 > 79 characters)
Line 224:80: E501 line too long (96 > 79 characters)

Line 163:80: E501 line too long (80 > 79 characters)
Line 171:80: E501 line too long (80 > 79 characters)
Line 174:80: E501 line too long (81 > 79 characters)
Line 271:80: E501 line too long (96 > 79 characters)

Line 179:80: E501 line too long (80 > 79 characters)
Line 182:80: E501 line too long (81 > 79 characters)
Line 277:80: E501 line too long (96 > 79 characters)

Line 288:80: E501 line too long (80 > 79 characters)
Line 412:80: E501 line too long (80 > 79 characters)
Line 420:80: E501 line too long (80 > 79 characters)
Line 423:80: E501 line too long (81 > 79 characters)
Line 500:80: E501 line too long (96 > 79 characters)

Line 139:80: E501 line too long (80 > 79 characters)
Line 142:80: E501 line too long (81 > 79 characters)
Line 210:80: E501 line too long (96 > 79 characters)

Line 18:13: E127 continuation line over-indented for visual indent
Line 18:80: E501 line too long (99 > 79 characters)

Comment last updated at 2022-05-05 19:51:25 UTC

@manodeep
Copy link
Owner

@adematti Great - thanks for the PR. I took a quick look at some C files and can see a couple of improvements. Quickly leaving comments for now - will add a full code review later.

  • It might be better to rename the new function setup_bin_type to detect_bin_type
  • What happens when there is exactly 1 bin - the current setup will end up with a divide by zero
  • In the kernels, you are likely to get better performance by computing the reciprocal of rpstep and then multiplying with that reciprocal (in both the SIMD and the regular trailing loop sections)
  • Will you please check if it is possible to cast the linear histogram bin calculation [truncate((r - rmin)*inv_rstep)] as a fused multiply-add (FMA)? If so, that might save a bit more cpu cycles

@manodeep
Copy link
Owner

Hi,

As discussed with Lehman for DESI applications, this is a first attempt to implement a special handling of linear binning, which helps corrfunc run faster with a large number of bins.
There is no speed gain at low number of bins ~ 10, but a speed gain > 3x for bins ~ 200.
I added a flag bin_type to the Python wrapper, to choose between: 'auto', 'lin', 'custom'; with 'lin' for linear binning, 'auto' (default) to automatically detect linear binning
(which happens here: https://github.com/adematti/Corrfunc/blob/c500d2c9137ff7f66d34eb4189bc19f9325d84ff/utils/utils.c#L85).
The following pair counters have been updated:
mocks: DDrppi_mocks, DDsmu_mocks
theory: DD, DDrppi, DDsmu, xi, wp
I did not update DDtheta_mocks (yet), as there are implementation choices to be made:
a) should I consider the binning to be linear in theta or acos? (I guess the former, as the pair counts for the latter can be obained from DD if maximum separation < \pi)

I suspect linear in theta is the more typical choice. Is it possible to support both through some user-option?

b) should I use fast acos to compute the bin for a given pair? in this case, a pair can fall in the wrong bin due to numerical approximation
To be discussed, also:

Hmmm - fast-acos option should be independent of the bin-type choice. If the user has specified fast_acos, then we should the approximations, otherwise, we should use the full-accuracy trig function

1. should I pass rstep (bin width), rmin to the kernels directly (currently taking the square root of sqr_rmin, sqr_rmax, which is slightly suboptimal)

Yeah - I saw that one. May be change the behaviour to pass rmin and rmax to the kernels, and then square those to get the sqr_rmin and sqr_rmax`?

2. should I print the chosen binning type; in this case, where? in each countpairs_DOUBLE function?

Yes, but only if options->verbose is set.

3. should I leave option 'auto'? it checks for linear binning, with absolute and relative tolerance of 1e-12 for both double and floats.
   Not sure how it behaves in practice for the latter case (floats), see remark ii) below.

I would change the tolerance to be appropriate for the precision of the numeric type. 1e-12 seems reasonable for double but too small for float (probably more like 1e-8). May be check what the default options for np.allclose are?

   I also have a couple of remarks:
   i) this line https://github.com/manodeep/Corrfunc/blob/596fe77078d59b296b34608927e301c427331919/Corrfunc/utils.py#L461
    seems unnecessary?
   (as array updates are done in place and calls to this functions do not retrieve the returned arrays)

This is for @lgarrison

   ii) I don't think setup_bins_float is used anywhere in the code, and setup_bins_double is only used here: https://github.com/manodeep/Corrfunc/blob/74c6fc29f9a0236eaebbc1830a1f59fd9a53bfc8/mocks/DDtheta_mocks/countpairs_theta_mocks_impl.c.src#L534

The source (.c.src files) are pre-processed by sed twice and DOUBLE becomes double and float -- i.e., setup_bins_DOUBLE translates to both setup_bins_float and setup_bins_double

   Is there any reason not to use a single version of this function, setup_bins?
   iii) Some time ago I had binning errors with Corrfunc after calling matplotlib. I figured out this was due to how the string separation for floats was handled (matplotlib changed the environment variables specifying this convention;
   export LC_NUMERIC=C solved this.). For this issue not to happen to other people, I guess the simplest solution would be to pass bins as an array rather than a file to be read from disk. Is there any reason not to do so?

Huh - okay! I can see that happening but that's poor behaviour on maplotlib's part.

The reason for specifying the bins from a file is historical - reading from a file lets us keep the exact same bins as the output from a previous Corrfunc run.

Many thanks for using Corrfunc and adding such a feature! Hopefully, I have answered all/most of the queries.

Cheers,
Manodeep

Thanks!
Best,
Arnaud

@manodeep
Copy link
Owner

* Will you please check if it is possible to cast the linear histogram bin calculation `[truncate((r - rmin)*inv_rstep)]` as a fused multiply-add (FMA)? If so, that might save a bit more cpu cycles

https://software.intel.com/sites/landingpage/IntrinsicsGuide/#avx512techs=AVX512F&expand=3441,3186&text=mm512_mask3_fmadd_round_p

If you rewrite the bin calculation from rpbin := (r - rmin)*inv_rstep to rpbin := r*inv_rstep - rmin*inv_rstep, then the calculation takes the required FMA form. The rounding parameter in the intrinsic is likely the _MM_FROUND_TO_ZERO |_MM_FROUND_NO_EXC but might require a separate truncate to integer.

@adematti
Copy link
Author

Thanks for your answer!
Having one bin (nbin = 2) is ok; zero bin (nbin < 2) fails after checks like

fprintf(stderr,"Error: Could not setup with S bins correctly. (smin = %lf, smax = %lf, with nbins = %d). Expected non-zero smin/smax with smax > smin and nbins >=1 \n",

I can have detect_bin_type return an EXIT_FAILURE in case nbin <= 1 (though that will have no impact on the code).
Concerning kernels and FMA, I will check that asap.

I suspect linear in theta is the more typical choice. Is it possible to support both through some user-option?

hum, like "cos_lin"? but that will only make sense for DDtheta...

Hmmm - fast-acos option should be independent of the bin-type choice. If the user has specified fast_acos, then we should the approximations, otherwise, we should use the full-accuracy trig function

My point is that using fast-acos to perform linear binning may yield different npairs than with the "custom" binning (fast-acos being an approximation). I don't think we want that. So linear binning should rely on true-acos, whatever is required for thetaavg. I guess that will degrade efficiency by a lot - though not tested yet.

By the way, "auto" is used by default, meaning linear binning whenever input bins are linear. My tests showed it did not slow down pair counts even with 10 bins, but these tests are not exhaustive...

May be check what the default options for np.allclose are?

This is rtol=1e-05, atol=1e-08. It may yield incorrect pair counts. Then I guess we should have "custom" as a default instead of "auto", as pair counts should be guaranteed correct with the default options.

The source (.c.src files) are pre-processed by sed twice and DOUBLE becomes double and float -- i.e., setup_bins_DOUBLE translates to both setup_bins_float and setup_bins_double

hm... setup_bins_float and setup_bins_double are defined in utils.c. There is no utils.c.src, hence no setup_bins_DOUBLE definition. All the calls are to setup_bins (except for DDtheta, where setup_bins_double in called). Maybe I missed something?

Copy link
Owner

@manodeep manodeep left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have looked through the code - thank you! This is an enormous effort (and apologies for all of the stray trailing white-spaces)

I have made some comments - nothing too major really but most of the comments apply to a lot of the files and I stopped repeating myself. Hopefully, my comments are useful enough to give you a sense of what to change.

Thanks again!
Manodeep

common.mk Outdated
@@ -19,7 +19,7 @@ CLINK ?=
## Set the python command (supply the full path to python you want to
## use, if different from directly calling `python` on the shell,
## as can be the case if python is set via an alias)
PYTHON:=python
PYTHON:=/local/home/adematti/anaconda3/envs/cosmopipe-dev/bin/python
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will need to be reverted to python

@@ -23,7 +23,8 @@ def DDrppi_mocks(autocorr, cosmology, nthreads, pimax, binfile,
xbin_refine_factor=2, ybin_refine_factor=2,
zbin_refine_factor=1, max_cells_per_dim=100,
copy_particles=True, enable_min_sep_opt=True,
c_api_timer=False, isa=r'fastest', weight_type=None):
c_api_timer=False, isa=r'fastest',
weight_type=None, bin_type=r'auto'):
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the r specifier required? @lgarrison?

Copy link
Collaborator

@lgarrison lgarrison Sep 23, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, should be fine without it (can remove it from the isa too for consistency)
EDIT: this is referring to the r for the raw string. I thought I was replying to a comment thread when I wrote this; same with some comments below. I guess it starts a new thread for any comments that come from a review...?

Comment on lines 72 to 82
print("\nRunning 2-D correlation function xi(rp,pi)")
results_DDrppi_ref = DDrppi_mocks(autocorr, cosmology, nthreads,
pimax, binfile,
ra, dec, cz,
weights1=w, weight_type='pair_product',
output_rpavg=True, verbose=True, isa=isa, bin_type='custom')
results_DDrppi = DDrppi_mocks(autocorr, cosmology, nthreads,
pimax, binfile,
ra, dec, cz,
weights1=w, weight_type='pair_product',
output_rpavg=True, verbose=False, isa=isa, bin_type='lin')
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to create a for loop that runs over bin_types=['custom', 'lin'] and compare that the previous output (if any) matches the current output? That will make the test code more concise

(Same applies to all of the other instances for the tests)

except NameError:
if not isinstance(bin_type, str):
raise TypeError(msg)
valid_bin_type = ['AUTO', 'CUSTOM', 'LIN']
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will you please rename to valid_bin_types (i.e., plural)

@@ -297,7 +297,7 @@ int countpairs_mocks_DOUBLE(const int64_t ND1, DOUBLE *ra1, DOUBLE *dec1, DOUBLE
double *rupp;
int nrpbin ;
double rpmin,rpmax;
setup_bins(binfile,&rpmin,&rpmax,&nrpbin,&rupp);
setup_bins(binfile,&rpmin,&rpmax,&nrpbin,&rupp,&(options->bin_type));
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to separate the functionality for reading the bins from a file and the detection of the bin type - i.e., revert this to before and then (after the rpmin etc check), add a new function call to detect the bin type

break;
AVX512_FLOATS m_sbin = AVX512_SETZERO_FLOAT();

if (bin_type == 1) {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

enum here

AVX512_FLOATS m_sbin = AVX512_SETZERO_FLOAT();

if (bin_type == 1) {
if (!need_savg) {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo with !?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted to compute SQRT(s2) if not computed before (i.e. need_savg is False).
Maybe I am missing something?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right; the custom binning version uses s^2, but the linear binning uses s, so we need to take the sqrt, unless it was already done because need_savg.

Perhaps line 272 above should just be if(need_savg || bin_type == BIN_LIN)?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup - what @lgarrison said

@@ -282,7 +283,7 @@ extern "C" {
#define AVX512_RECIPROCAL_FLOATS(X) _mm512_rcp14_pd(X)
#define AVX512_MASK_RECIPROCAL_FLOATS(FALSEVALS, MASK, X) _mm512_mask_rcp14_pd(FALSEVALS, MASK, X)
#define AVX512_MASKZ_RECIPROCAL_FLOATS(MASK, X) _mm512_maskz_rcp14_pd(MASK, X)

#define AVX512_CONVERT_INT_TO_FLOAT(X) _mm512_cvtepi32_pd(X)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am wondering if you need the (instruction-less) cast to float (these ops) rather than the expensive convert options

@@ -41,11 +41,12 @@ extern "C" {
#define SSE_TRUNCATE_FLOAT_TO_INT(X) _mm_cvttps_epi32(X)
#define SSE_SQUARE_FLOAT(X) _mm_mul_ps(X,X)
#define SSE_SET_FLOAT(X) _mm_set1_ps(X)
#define SSE_CONVERT_INT_TO_FLOAT(X) _mm_cvtepi32_ps(X)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as the AVX512F case - you might need a "free" cast operation rather than the convert operation

utils/utils.c Outdated
return EXIT_SUCCESS;
}

static int setup_bin_type_double(const double *rupp, int nbin, uint8_t *bin_type)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As noted earlier, these functions should be available to call publicly (rather than static)

@adematti
Copy link
Author

Hi,
Sorry, I meant to push this earlier, but forgot... I think all your comments above have been addressed, but:

  1. bin_type=r'auto' => r'' is not necessary, but have included it as it was in e.g. isa=r'fastest'
  2. linear binning treatment for DDtheta still missing
  3. I could not check implementation for AVX512 kernels as my processors do not support it
    I reverted the default binning type to 'custom', to avoid issues related to "binning is interpreted as linear but is actually not" with default options.

@lgarrison
Copy link
Collaborator

Hey @adematti, thank you so much for this contribution! This is not a small effort, and it looks great so far. And thanks @manodeep for reviewing this while I was away. I will try to catch up on the full set of comments above and do my own review, but probably not until tomorrow or Monday at the earliest.

Copy link
Collaborator

@lgarrison lgarrison left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great! Just some minor comments.

@@ -23,7 +23,8 @@ def DDrppi_mocks(autocorr, cosmology, nthreads, pimax, binfile,
xbin_refine_factor=2, ybin_refine_factor=2,
zbin_refine_factor=1, max_cells_per_dim=100,
copy_particles=True, enable_min_sep_opt=True,
c_api_timer=False, isa=r'fastest', weight_type=None):
c_api_timer=False, isa=r'fastest',
weight_type=None, bin_type=r'auto'):
Copy link
Collaborator

@lgarrison lgarrison Sep 23, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, should be fine without it (can remove it from the isa too for consistency)
EDIT: this is referring to the r for the raw string. I thought I was replying to a comment thread when I wrote this; same with some comments below. I guess it starts a new thread for any comments that come from a review...?

Comment on lines 223 to 238
bin_type : string, case-insensitive (default ``auto``)
If bins in ``binfile`` are linearly-spaced, set to ``lin`` for speed-up.
Else, set to ``custom``.
``auto`` allows for auto-detection of the binning type.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We ought to tell the user something here about accuracy. I think almost all users will be comfortable with the notion that a tiny fraction of pairs may shift bins when using this feature, but this should be mentioned explicitly.

return toret

def test_bin_type(pair_counter, *args, **kwargs):
res = (pair_counter(*args, bin_type=bin_type, weight_type=weight_type, **kwargs) for bin_type in bin_types for weight_type in [None, "pair_product"])
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can just test the weighted case; recall that the weighted case returns the integer number of raw pairs, in addition to the floating-point weighted pair count. So one can check both cases by just running the weighted computation. The same unweighted code path is exercised in both cases.

Comment on lines 59 to 64
binfile = pjoin(dirname(abspath(__file__)),
"../mocks/tests/", "bins_lin")
rbins = np.linspace(0.1, 45.1, 46)
with open(binfile,'w') as file:
for low, hi in zip(rbins[:-1], rbins[1:]):
file.write("{0} {1}\n".format(low, hi))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to actually write to a file here, probably. Can just pass the bins as a Python list/array. (This is calling the Python wrapper, not the CPython API, right?)

Corrfunc/tests.py Outdated Show resolved Hide resolved
Comment on lines 307 to 310
if (options->verbose) {
if (options->bin_type == BIN_LIN) fprintf(stderr,"Linear R binning\n");
else fprintf(stderr,"Custom R binning\n");
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you wanted to reduce duplication, I'd have no problem with detect_bin_type having the print statements internally, and either taking an extra int verbose flag or just taking the whole options struct as an arg (to enable checking options->verbose and options->bin_type).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok! I originally separated the print statement to mimic the check on rmin/rmax/nbins above.

AVX512_FLOATS m_sbin = AVX512_SETZERO_FLOAT();

if (bin_type == 1) {
if (!need_savg) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right; the custom binning version uses s^2, but the linear binning uses s, so we need to take the sqrt, unless it was already done because need_savg.

Perhaps line 272 above should just be if(need_savg || bin_type == BIN_LIN)?

@@ -179,6 +183,10 @@ extern "C" {
#define AVX2_STREAMING_STORE_FLOATS(X,Y) _mm256_stream_pd(X,Y)
#define AVX2_STREAMING_STORE_INTS(X,Y) _mm_stream_si128(X,Y)

/* returns Z + XY*/
#define AVX2_FMA_ADD_FLOATS(X,Y,Z) _mm256_fmadd_pd(X,Y,Z)
#define AVX2_FMA_ADD_TRUNCATE_FLOATS(X,Y,Z) _mm256_round_pd(_mm256_fmadd_pd(X,Y,Z),_MM_FROUND_TO_ZERO|_MM_FROUND_NO_EXC)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this being used anywhere? round_pd/ps returns a float, not an int, right? So it's not useful for the bin index?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, AVX2_FMA_ADD_TRUNCATE_FLOATS specifically is not used, indeed! (but e.g. AVX_FMA_ADD_TRUNCATE_FLOATS is useful in DDrppi - to take the floor value of the rp index before adding in the pi index)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, right; I forgot that part of the code used floats to compute the integer index. @manodeep I guess the reason it's done this way is to be able to use FMA?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Originally I had the cast-to-float to perform some operations (mult? truncate?) that there not available for integer vectors. The question might be whether using FMA here helps.

@adematti You might be able to combine the two ops into one with something from this list. _mm512_fmadd_round_pd would be the exact replacement for what you already have, but you will likely need to use the masked variant

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think _mm512_fmadd_round_pd and _mm512_maskz_fmadd_round_pd are only available within AVX512?
Your comment actually made me realize that I should be careful about clipping the binning index (using mask) for some pair counters. It should be ok now, though I couldn't compile/test with AVX512.

@manodeep
Copy link
Owner

Thanks for your answer!
Having one bin (nbin = 2) is ok; zero bin (nbin < 2) fails after checks like

Great! I always forget that the way the code is setup nbins in the code is not the usual nbins from the user-perspective

fprintf(stderr,"Error: Could not setup with S bins correctly. (smin = %lf, smax = %lf, with nbins = %d). Expected non-zero smin/smax with smax > smin and nbins >=1 \n",

I can have detect_bin_type return an EXIT_FAILURE in case nbin <= 1 (though that will have no impact on the code).
Concerning kernels and FMA, I will check that asap.

Awesome - yup!

I suspect linear in theta is the more typical choice. Is it possible to support both through some user-option?

hum, like "cos_lin"? but that will only make sense for DDtheta...

Or better yet, we can design for the ability to allow cos_lin in the future but only implement theta_lin for now.

Hmmm - fast-acos option should be independent of the bin-type choice. If the user has specified fast_acos, then we should the approximations, otherwise, we should use the full-accuracy trig function

My point is that using fast-acos to perform linear binning may yield different npairs than with the "custom" binning (fast-acos being an approximation). I don't think we want that. So linear binning should rely on true-acos, whatever is required for thetaavg. I guess that will degrade efficiency by a lot - though not tested yet.

If I am understanding correctly, then you are referring to the impact of fast_acos during the detection of the bin type. If so, yes, I agree - the bin detection should run with the full precision acos. The impact of fast_acos is only in the kernels where the actual calculations are done.

The source (.c.src files) are pre-processed by sed twice and DOUBLE becomes double and float -- i.e., setup_bins_DOUBLE translates to both setup_bins_float and setup_bins_double

hm... setup_bins_float and setup_bins_double are defined in utils.c. There is no utils.c.src, hence no setup_bins_DOUBLE definition. All the calls are to setup_bins (except for DDtheta, where setup_bins_double in called). Maybe I missed something?

Most of the functions in utils.c.src are not precision dependent - hence we did not have the two versions. However, that can certainly change if the need arises.

@adematti
Copy link
Author

adematti commented Sep 27, 2021

Or better yet, we can design for the ability to allow cos_lin in the future but only implement theta_lin for now.

bin_type = 'lin' is now implemented for DDtheta_mocks

If I am understanding correctly, then you are referring to the impact of fast_acos during the detection of the bin type.

No, I am referring to the impact of using fast_acos to get theta, and then the bin index (theta - theta_min)/nbins + theta_min.
With fast_acos=True instead of fast_acos=False, some fraction of pairs may shift bins (in addition to thetaavg being different if required). I have added a note in the docstrings to explain this.

@lgarrison
Copy link
Collaborator

@adematti Just enabled the CI checks. Have a look when the tests are finished running and see if there are any failures? I can't remember if the CI has AVX-512, but if not, I'll run the tests locally (Cori KNL also has AVX-512).

@lgarrison lgarrison added this to the v2.4.0 milestone Sep 27, 2021
@manodeep
Copy link
Owner

Noting that this will need to be tested with something similar to make tests -DINTEGRATION_TESTS (i.e., tested for all combinations of bin_refinement_factors, isa, min_sep_opt etc. These tests should that the output is identical between bin_type = custom and bin_type = lin (for a different set of linear bins) - i.e., what @adematti has already done within the python code

@manodeep
Copy link
Owner

@lgarrison 2.4.0 is already overdue - is this okay to push back to milestone 2.5.0?

@lgarrison
Copy link
Collaborator

I was thinking this was useful enough that we'd want to do a release when it is ready, but also that it is progressing pretty rapidly, so we'd just end up doing two releases back-to-back if we target 2.5. But I don't feel strongly; we can instruct DESI to install from git if necessary (although they'll prefer versioned code eventually).

@manodeep manodeep modified the milestones: v2.4.0, v2.5.0 Sep 28, 2021
@manodeep
Copy link
Owner

@lgarrison I agree - this is a significant enough change to warrant a release. I have moved the milestone to 2.5.0 (including all remaining 2.4.0 milestoned issues). Let me get to releasing the v2.4.0 now

@adematti
Copy link
Author

clang tests fail because the -mfma flag is required.
gcc tests fail because of this line (which I left untouched):

(void) sqr_rpmax, (void) sqr_rpmin;

I guess one could remove it?
I very naively tried compiling on Cori KNL, after loading craype-mic-knl, but install did not pick the AVX512 flag,
I guess there are other things to specify?

@lgarrison
Copy link
Collaborator

clang tests fail because the -mfma flag is required.

Hmm, would have hoped -march=native would have set that if the CPU supports it. The CI CPUs probably support it, as its quite old. But if it's a weird virtual environment they might not, in which case the correct behavior would be to protect the SSE kernels with an FMA check.

We should probably add a lscpu command to the GitHub CI so we can triage cases like this. Will PR this...

gcc tests fail because of this line (which I left untouched):

(void) sqr_rpmax, (void) sqr_rpmin;

I guess one could remove it?

Yeah, should remove that line now that those variables are not part of the function signature. And actually, sqr_rpmin below that can be removed completely, as its unused.

I very naively tried compiling on Cori KNL, after loading craype-mic-knl, but install did not pick the AVX512 flag,
I guess there are other things to specify?

I guess so. I'll try it myself once I fix a quota issue at NERSC!

@lgarrison
Copy link
Collaborator

Hi @adematti, can you merge the latest master into your feature branch and push so the tests are rerun with the new CI config? That will also dump the CPU flags so we can debug the OSX failure, although my preliminary look at #259 suggests that the OSX CI CPUs indeed have SSE but no FMA, and since this branch introduces an FMA requirement, it breaks. So, we could require that both FMA and SSE are present; that shouldn't be too burdensome, as FMA is quite old (2011/2012).

@lgarrison
Copy link
Collaborator

Here's a better solution, I think: in sse_calls.h, can you just alias the call to a non-FMA version if FMA isn't available? i.e.:

#ifdef __FMA__
#define SSE_FMA_ADD_FLOATS(X,Y,Z)          _mm_fmadd_ps(X,Y,Z)
#else
#define SSE_FMA_ADD_FLOATS(X,Y,Z)          _mm_add_ps(_mm_mul_ps(X,Y),Z)
#endif

@lgarrison
Copy link
Collaborator

So I think (but am not 100% positive) that the __FMA__ flag only corresponds to SSE FMA, not AVX FMA. E.g. the Wikipedia page calls it out as an SSE extension specifically: https://en.wikipedia.org/wiki/FMA_instruction_set.

@adematti
Copy link
Author

adematti commented Oct 4, 2021

I see, thanks! I read FMA was introduced at the same time as AVX2, hence wrongly thought FMA flag was defined for AVX as well. I will remove changes to avx_calls.h (and solve errors below).

@adematti
Copy link
Author

adematti commented Oct 4, 2021

Removing the FMA flag for AVX raises again errors like:

'countpairs_avx_intrinsics_double' that is compiled without support for 'fma'
                m_rpbin = AVX_FMA_ADD_FLOATS(union_mDperp.m_Dperp,m_inv_rpstep,m_rpmin_invstep); // r2, then union_mDperp.m_Dperp aready clipped

(there are also some more bugs with AVX512, not unexpected as I could not try compiling on my computer)

@lgarrison
Copy link
Collaborator

Okay, there was a bug in the C tests that took a while to track down. The bins were being written to file with the %g specifier, which uses only 6 digits by default even for doubles. More precision is needed to match the linear binning. Switching to an exact representation with %a fixed the issue.

So now the C test pass (including the linear binning integration tests), as does test_lin.py. I tried to support test_lin.py under Python 2.7, but the deeper I got, the more issues appeared. So while linear binning will likely run under Python 2.7, I think we'll just have to say the tests won't.

@manodeep, can you review test_lin.py and let me know if we are ready to merge?

@manodeep
Copy link
Owner

manodeep commented Dec 6, 2021

Okay, there was a bug in the C tests that took a while to track down. The bins were being written to file with the %g specifier, which uses only 6 digits by default even for doubles. More precision is needed to match the linear binning. Switching to an exact representation with %a fixed the issue.

Awesome - makes sense!

So now the C test pass (including the linear binning integration tests), as does test_lin.py. I tried to support test_lin.py under Python 2.7, but the deeper I got, the more issues appeared. So while linear binning will likely run under Python 2.7, I think we'll just have to say the tests won't.

Do you have details about the issues you encountered? I am happy to try and take a look...

If we can't test with python2.7, what do you think about only supporting other bin types in >= python3? That way, we also signal that we are planning to drop python2 support

@lgarrison
Copy link
Collaborator

lgarrison commented Dec 6, 2021

Do you have details about the issues you encountered? I am happy to try and take a look...

One problem was that Python 2.7 didn't like some of our test function signatures. I'm sure it's possible to fix, but the changes would touch all the test functions.

If we can't test with python2.7, what do you think about only supporting other bin types in >= python3? That way, we also signal that we are planning to drop python2 support

Sure, we could turn off linear binning just for Python 2.7, but it probably works fine and I tend to see all Python 2.7 usage these days as "use at your own risk" since it's been discontinued for so long! We do have the warning of Python 2.7 Corrfunc deprecation in place, too.

@manodeep
Copy link
Owner

manodeep commented Dec 8, 2021

Ohh good point about the deprecation warning - yup that is a reasonable warning that using python2.7 is in the "user-beware" territory.

Okay - I am going to look through the code in a little more detail. I remember having some thoughts about how the code could be structured, naming conventions etc

@lgarrison
Copy link
Collaborator

Sounds good! Let me know if I can help.

@manodeep
Copy link
Owner

It is embarrassing how long this PR is taking me to review - just commenting to say that I am doing my best to get the review done

@manodeep
Copy link
Owner

@adematti I am happy to review the remaining code - would that be useful?

@manodeep
Copy link
Owner

manodeep commented May 2, 2022

@lgarrison Not sure if @adematti has notifications enabled or perhaps is slammed. How do you want to proceed on this PR? Seems like there might be a good use-case and the speedups are fairly reasonable from what I recall

@adematti
Copy link
Author

adematti commented May 2, 2022

Hi @manodeep, sorry for the delay (saw your message, then got extremely busy). I think this is up to you and Lehman! (the Corrfunc version we are using within DESI has evolved significantly to address new weighting schemes & other line-of-sight definitions).

@lgarrison
Copy link
Collaborator

I'm happy to merge this, it's a great feature as-is! @adematti, are there any bugs you found in the linear binning that might have been addressed in the DESI fork but not here?

@adematti
Copy link
Author

adematti commented May 2, 2022

yes, I was just looking at this. The only bug I remember of was due to this line: https://github.com/adematti/Corrfunc/blob/6ae2b9f4e470558ceb0b1d59fb0c2f62ff4bdbdc/mocks/DDtheta_mocks/countpairs_theta_mocks_kernels.c.src#L924
which should be:
if (need_rpavg || bin_type == BIN_LIN) {
(see cosmodesi@8d95155)

@manodeep
Copy link
Owner

manodeep commented May 5, 2022

Ahh thanks @adematti :)

@lgarrison Do you want to merge this PR into a separate branch (we have one already right?) and then sort it out from there?

@manodeep
Copy link
Owner

manodeep commented May 5, 2022

yes, I was just looking at this. The only bug I remember of was due to this line: https://github.com/adematti/Corrfunc/blob/6ae2b9f4e470558ceb0b1d59fb0c2f62ff4bdbdc/mocks/DDtheta_mocks/countpairs_theta_mocks_kernels.c.src#L924 which should be: if (need_rpavg || bin_type == BIN_LIN) { (see adematti@8d95155)

@adematti This fix s already in the PR right?

@lgarrison
Copy link
Collaborator

@manodeep I think this branch still needs to be fixed; I'll do that in the next day or two.

@adematti
Copy link
Author

adematti commented May 5, 2022

(I found myself a couple of minutes to fix the bug)

@lgarrison
Copy link
Collaborator

@manodeep Checks are passing; let's merge?

@manodeep
Copy link
Owner

manodeep commented May 9, 2022

@lgarrison I was thinking of merging to a different branch - there are some naming conventions etc that I would like alter. Is the linearbinning-rounding appropriate or should we just create a new branch?

@lgarrison
Copy link
Collaborator

If you're feeling motivated to do another editing pass, then sure, let's merge to a new branch. But the PR has been open for 8 months, and I think functionally it's ready to go! It would be great to get this released so our users can experience the benefits.

@manodeep manodeep changed the base branch from master to linearbinning-rounding May 11, 2022 02:07
@manodeep manodeep merged commit a7e1dd5 into manodeep:linearbinning May 12, 2022
@manodeep
Copy link
Owner

@adematti I have merged this PR into the repo - many many thanks for such a massive effort!

@lgarrison Yeah I know - but now that I am back at ~full-time, the minor refinemennts should be much easier to get done

@lgarrison
Copy link
Collaborator

Hey @manodeep, have you had a chance to make these refinements? If not, I suggest we go ahead and merge this into master before the branches diverge even more.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants