special handling of linear binning #258

adematti · 2021-09-20T18:28:36Z

Hi,

As discussed with Lehman for DESI applications, this is a first attempt to implement a special handling of linear binning, which helps corrfunc run faster with a large number of bins.
There is no speed gain at low number of bins ~ 10, but a speed gain > 3x for bins ~ 200.
I added a flag bin_type to the Python wrapper, to choose between: 'auto', 'lin', 'custom'; with 'lin' for linear binning, 'auto' (default) to automatically detect linear binning
(which happens here: https://github.com/adematti/Corrfunc/blob/c500d2c9137ff7f66d34eb4189bc19f9325d84ff/utils/utils.c#L85).
The following pair counters have been updated:
mocks: DDrppi_mocks, DDsmu_mocks
theory: DD, DDrppi, DDsmu, xi, wp
I did not update DDtheta_mocks (yet), as there are implementation choices to be made:
a) should I consider the binning to be linear in theta or acos? (I guess the former, as the pair counts for the latter can be obained from DD if maximum separation < \pi)
b) should I use fast acos to compute the bin for a given pair? in this case, a pair can fall in the wrong bin due to numerical approximation
To be discussed, also:

should I pass rstep (bin width), rmin to the kernels directly (currently taking the square root of sqr_rmin, sqr_rmax, which is slightly suboptimal)
should I print the chosen binning type; in this case, where? in each countpairs_DOUBLE function?
should I leave option 'auto'? it checks for linear binning, with absolute and relative tolerance of 1e-12 for both double and floats.
Not sure how it behaves in practice for the latter case (floats), see remark ii) below.
I also have a couple of remarks:
i) this line

Corrfunc/Corrfunc/utils.py

Line 461 in 596fe77

return ra.astype(input_dtype), dec.astype(input_dtype)

seems unnecessary?
(as array updates are done in place and calls to this functions do not retrieve the returned arrays)
ii) I don't think setup_bins_float is used anywhere in the code, and setup_bins_double is only used here:

Corrfunc/mocks/DDtheta_mocks/countpairs_theta_mocks_impl.c.src

Line 534 in 74c6fc2

setup_bins_DOUBLE(binfile,&thetamin,&thetamax,&nthetabin,&theta_upp);

Is there any reason not to use a single version of this function, setup_bins?
iii) Some time ago I had binning errors with Corrfunc after calling matplotlib. I figured out this was due to how the string separation for floats was handled (matplotlib changed the environment variables specifying this convention;
export LC_NUMERIC=C solved this.). For this issue not to happen to other people, I guess the simplest solution would be to pass bins as an array rather than a file to be read from disk. Is there any reason not to do so?

Thanks!
Best,
Arnaud

pep8speaks · 2021-09-20T18:28:42Z

Hello @adematti! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

In the file Corrfunc/mocks/DDrppi_mocks.py:

Line 225:80: E501 line too long (80 > 79 characters)
Line 233:80: E501 line too long (80 > 79 characters)
Line 236:80: E501 line too long (81 > 79 characters)
Line 346:80: E501 line too long (96 > 79 characters)

In the file Corrfunc/mocks/DDsmu_mocks.py:

Line 235:80: E501 line too long (80 > 79 characters)
Line 238:80: E501 line too long (81 > 79 characters)
Line 267:80: E501 line too long (96 > 79 characters)

In the file Corrfunc/mocks/DDtheta_mocks.py:

Line 202:80: E501 line too long (82 > 79 characters)
Line 212:80: E501 line too long (80 > 79 characters)
Line 215:80: E501 line too long (81 > 79 characters)
Line 297:80: E501 line too long (96 > 79 characters)

In the file Corrfunc/tests/test_lin.py:

Line 15:1: E302 expected 2 blank lines, found 1
Line 17:26: E231 missing whitespace after ','
Line 17:38: E231 missing whitespace after ','
Line 23:80: E501 line too long (95 > 79 characters)
Line 29:67: E251 unexpected spaces around keyword / parameter equals
Line 29:69: E251 unexpected spaces around keyword / parameter equals
Line 29:80: E501 line too long (98 > 79 characters)
Line 33:1: W293 blank line contains whitespace
Line 40:52: E231 missing whitespace after ','
Line 44:21: E128 continuation line under-indented for visual indent
Line 49:22: E231 missing whitespace after ','
Line 52:1: W293 blank line contains whitespace
Line 54:1: W293 blank line contains whitespace
Line 55:1: E302 expected 2 blank lines, found 1
Line 55:62: E231 missing whitespace after ','
Line 59:21: E128 continuation line under-indented for visual indent
Line 64:22: E231 missing whitespace after ','
Line 67:1: W293 blank line contains whitespace
Line 77:44: E231 missing whitespace after ','
Line 77:52: E231 missing whitespace after ','
Line 77:58: E231 missing whitespace after ','
Line 78:80: E501 line too long (116 > 79 characters)
Line 78:95: E251 unexpected spaces around keyword / parameter equals
Line 78:97: E251 unexpected spaces around keyword / parameter equals
Line 80:7: E231 missing whitespace after ','
Line 80:11: E231 missing whitespace after ','
Line 80:14: E231 missing whitespace after ','
Line 81:1: W293 blank line contains whitespace
Line 83:19: E128 continuation line under-indented for visual indent
Line 84:19: E128 continuation line under-indented for visual indent
Line 85:19: E128 continuation line under-indented for visual indent
Line 86:19: E128 continuation line under-indented for visual indent
Line 87:1: W293 blank line contains whitespace
Line 88:1: E302 expected 2 blank lines, found 1
Line 88:44: E231 missing whitespace after ','
Line 88:52: E231 missing whitespace after ','
Line 88:58: E231 missing whitespace after ','
Line 89:80: E501 line too long (145 > 79 characters)
Line 89:125: E251 unexpected spaces around keyword / parameter equals
Line 89:127: E251 unexpected spaces around keyword / parameter equals
Line 89:138: E251 unexpected spaces around keyword / parameter equals
Line 89:140: E251 unexpected spaces around keyword / parameter equals
Line 91:7: E231 missing whitespace after ','
Line 91:11: E231 missing whitespace after ','
Line 91:14: E231 missing whitespace after ','
Line 92:1: W293 blank line contains whitespace
Line 94:19: E128 continuation line under-indented for visual indent
Line 95:19: E128 continuation line under-indented for visual indent
Line 96:19: E128 continuation line under-indented for visual indent
Line 97:19: E128 continuation line under-indented for visual indent
Line 100:44: E231 missing whitespace after ','
Line 100:52: E231 missing whitespace after ','
Line 100:58: E231 missing whitespace after ','
Line 101:80: E501 line too long (90 > 79 characters)
Line 103:7: E231 missing whitespace after ','
Line 103:11: E231 missing whitespace after ','
Line 103:14: E231 missing whitespace after ','
Line 104:1: W293 blank line contains whitespace
Line 106:19: E128 continuation line under-indented for visual indent
Line 107:19: E128 continuation line under-indented for visual indent
Line 108:19: E128 continuation line under-indented for visual indent
Line 109:19: E128 continuation line under-indented for visual indent
Line 110:19: E128 continuation line under-indented for visual indent
Line 113:44: E231 missing whitespace after ','
Line 113:52: E231 missing whitespace after ','
Line 113:58: E231 missing whitespace after ','
Line 114:80: E501 line too long (96 > 79 characters)
Line 116:6: E231 missing whitespace after ','
Line 116:8: E231 missing whitespace after ','
Line 116:10: E231 missing whitespace after ','
Line 118:1: W293 blank line contains whitespace
Line 122:1: W293 blank line contains whitespace
Line 124:44: E231 missing whitespace after ','
Line 124:52: E231 missing whitespace after ','
Line 124:58: E231 missing whitespace after ','
Line 125:80: E501 line too long (111 > 79 characters)
Line 127:6: E231 missing whitespace after ','
Line 127:8: E231 missing whitespace after ','
Line 127:10: E231 missing whitespace after ','
Line 131:19: E128 continuation line under-indented for visual indent
Line 132:19: E128 continuation line under-indented for visual indent
Line 133:19: E128 continuation line under-indented for visual indent
Line 134:19: E128 continuation line under-indented for visual indent
Line 137:44: E231 missing whitespace after ','
Line 137:52: E231 missing whitespace after ','
Line 137:58: E231 missing whitespace after ','
Line 138:80: E501 line too long (98 > 79 characters)
Line 139:39: E127 continuation line over-indented for visual indent
Line 141:6: E231 missing whitespace after ','
Line 141:8: E231 missing whitespace after ','
Line 141:10: E231 missing whitespace after ','
Line 145:19: E128 continuation line under-indented for visual indent
Line 146:19: E128 continuation line under-indented for visual indent
Line 147:19: E128 continuation line under-indented for visual indent
Line 148:19: E128 continuation line under-indented for visual indent
Line 149:19: E128 continuation line under-indented for visual indent
Line 151:1: W293 blank line contains whitespace
Line 152:44: E231 missing whitespace after ','
Line 152:52: E231 missing whitespace after ','
Line 152:58: E231 missing whitespace after ','
Line 153:80: E501 line too long (107 > 79 characters)
Line 155:6: E231 missing whitespace after ','
Line 155:8: E231 missing whitespace after ','
Line 155:10: E231 missing whitespace after ','
Line 157:1: W293 blank line contains whitespace
Line 159:19: E128 continuation line under-indented for visual indent
Line 160:19: E128 continuation line under-indented for visual indent
Line 161:19: E128 continuation line under-indented for visual indent
Line 163:1: W293 blank line contains whitespace
Line 164:44: E231 missing whitespace after ','
Line 164:52: E231 missing whitespace after ','
Line 164:58: E231 missing whitespace after ','
Line 165:80: E501 line too long (96 > 79 characters)
Line 167:6: E231 missing whitespace after ','
Line 167:8: E231 missing whitespace after ','
Line 167:10: E231 missing whitespace after ','
Line 169:1: W293 blank line contains whitespace
Line 171:19: E128 continuation line under-indented for visual indent
Line 172:19: E128 continuation line under-indented for visual indent
Line 173:19: E128 continuation line under-indented for visual indent

In the file Corrfunc/theory/DD.py:

Line 153:80: E501 line too long (80 > 79 characters)
Line 156:80: E501 line too long (81 > 79 characters)
Line 224:80: E501 line too long (96 > 79 characters)

In the file Corrfunc/theory/DDrppi.py:

Line 163:80: E501 line too long (80 > 79 characters)
Line 171:80: E501 line too long (80 > 79 characters)
Line 174:80: E501 line too long (81 > 79 characters)
Line 271:80: E501 line too long (96 > 79 characters)

In the file Corrfunc/theory/DDsmu.py:

Line 179:80: E501 line too long (80 > 79 characters)
Line 182:80: E501 line too long (81 > 79 characters)
Line 277:80: E501 line too long (96 > 79 characters)

In the file Corrfunc/theory/wp.py:

Line 288:80: E501 line too long (80 > 79 characters)
Line 412:80: E501 line too long (80 > 79 characters)
Line 420:80: E501 line too long (80 > 79 characters)
Line 423:80: E501 line too long (81 > 79 characters)
Line 500:80: E501 line too long (96 > 79 characters)

In the file Corrfunc/theory/xi.py:

Line 139:80: E501 line too long (80 > 79 characters)
Line 142:80: E501 line too long (81 > 79 characters)
Line 210:80: E501 line too long (96 > 79 characters)

In the file Corrfunc/utils.py:

Line 18:13: E127 continuation line over-indented for visual indent
Line 18:80: E501 line too long (99 > 79 characters)

Comment last updated at 2022-05-05 19:51:25 UTC

manodeep · 2021-09-20T20:19:05Z

@adematti Great - thanks for the PR. I took a quick look at some C files and can see a couple of improvements. Quickly leaving comments for now - will add a full code review later.

It might be better to rename the new function setup_bin_type to detect_bin_type
What happens when there is exactly 1 bin - the current setup will end up with a divide by zero
In the kernels, you are likely to get better performance by computing the reciprocal of rpstep and then multiplying with that reciprocal (in both the SIMD and the regular trailing loop sections)
Will you please check if it is possible to cast the linear histogram bin calculation [truncate((r - rmin)*inv_rstep)] as a fused multiply-add (FMA)? If so, that might save a bit more cpu cycles

manodeep · 2021-09-20T20:30:38Z

Hi,

As discussed with Lehman for DESI applications, this is a first attempt to implement a special handling of linear binning, which helps corrfunc run faster with a large number of bins.
There is no speed gain at low number of bins ~ 10, but a speed gain > 3x for bins ~ 200.
I added a flag bin_type to the Python wrapper, to choose between: 'auto', 'lin', 'custom'; with 'lin' for linear binning, 'auto' (default) to automatically detect linear binning
(which happens here: https://github.com/adematti/Corrfunc/blob/c500d2c9137ff7f66d34eb4189bc19f9325d84ff/utils/utils.c#L85).
The following pair counters have been updated:
mocks: DDrppi_mocks, DDsmu_mocks
theory: DD, DDrppi, DDsmu, xi, wp
I did not update DDtheta_mocks (yet), as there are implementation choices to be made:
a) should I consider the binning to be linear in theta or acos? (I guess the former, as the pair counts for the latter can be obained from DD if maximum separation < \pi)

I suspect linear in theta is the more typical choice. Is it possible to support both through some user-option?

b) should I use fast acos to compute the bin for a given pair? in this case, a pair can fall in the wrong bin due to numerical approximation
To be discussed, also:

Hmmm - fast-acos option should be independent of the bin-type choice. If the user has specified fast_acos, then we should the approximations, otherwise, we should use the full-accuracy trig function

1. should I pass rstep (bin width), rmin to the kernels directly (currently taking the square root of sqr_rmin, sqr_rmax, which is slightly suboptimal)

Yeah - I saw that one. May be change the behaviour to pass rmin and rmax to the kernels, and then square those to get the sqr_rmin and sqr_rmax`?

2. should I print the chosen binning type; in this case, where? in each countpairs_DOUBLE function?

Yes, but only if options->verbose is set.

3. should I leave option 'auto'? it checks for linear binning, with absolute and relative tolerance of 1e-12 for both double and floats.
   Not sure how it behaves in practice for the latter case (floats), see remark ii) below.

I would change the tolerance to be appropriate for the precision of the numeric type. 1e-12 seems reasonable for double but too small for float (probably more like 1e-8). May be check what the default options for np.allclose are?

   I also have a couple of remarks:
   i) this line https://github.com/manodeep/Corrfunc/blob/596fe77078d59b296b34608927e301c427331919/Corrfunc/utils.py#L461
    seems unnecessary?
   (as array updates are done in place and calls to this functions do not retrieve the returned arrays)

This is for @lgarrison

   ii) I don't think setup_bins_float is used anywhere in the code, and setup_bins_double is only used here: https://github.com/manodeep/Corrfunc/blob/74c6fc29f9a0236eaebbc1830a1f59fd9a53bfc8/mocks/DDtheta_mocks/countpairs_theta_mocks_impl.c.src#L534

The source (.c.src files) are pre-processed by sed twice and DOUBLE becomes double and float -- i.e., setup_bins_DOUBLE translates to both setup_bins_float and setup_bins_double

   Is there any reason not to use a single version of this function, setup_bins?
   iii) Some time ago I had binning errors with Corrfunc after calling matplotlib. I figured out this was due to how the string separation for floats was handled (matplotlib changed the environment variables specifying this convention;
   export LC_NUMERIC=C solved this.). For this issue not to happen to other people, I guess the simplest solution would be to pass bins as an array rather than a file to be read from disk. Is there any reason not to do so?

Huh - okay! I can see that happening but that's poor behaviour on maplotlib's part.

The reason for specifying the bins from a file is historical - reading from a file lets us keep the exact same bins as the output from a previous Corrfunc run.

Many thanks for using Corrfunc and adding such a feature! Hopefully, I have answered all/most of the queries.

Cheers,
Manodeep

Thanks!
Best,
Arnaud

manodeep · 2021-09-20T20:48:41Z

* Will you please check if it is possible to cast the linear histogram bin calculation `[truncate((r - rmin)*inv_rstep)]` as a fused multiply-add (FMA)? If so, that might save a bit more cpu cycles

https://software.intel.com/sites/landingpage/IntrinsicsGuide/#avx512techs=AVX512F&expand=3441,3186&text=mm512_mask3_fmadd_round_p

If you rewrite the bin calculation from rpbin := (r - rmin)*inv_rstep to rpbin := r*inv_rstep - rmin*inv_rstep, then the calculation takes the required FMA form. The rounding parameter in the intrinsic is likely the _MM_FROUND_TO_ZERO |_MM_FROUND_NO_EXC but might require a separate truncate to integer.

adematti · 2021-09-20T22:09:15Z

Thanks for your answer!
Having one bin (nbin = 2) is ok; zero bin (nbin < 2) fails after checks like

Corrfunc/mocks/DDsmu_mocks/countpairs_s_mu_mocks_impl.c.src

Line 302 in 74c6fc2

    
           fprintf(stderr,"Error: Could not setup with S bins correctly. (smin = %lf, smax = %lf, with nbins = %d). Expected non-zero smin/smax with smax > smin and nbins >=1 \n",

I can have detect_bin_type return an EXIT_FAILURE in case nbin <= 1 (though that will have no impact on the code).
Concerning kernels and FMA, I will check that asap.

I suspect linear in theta is the more typical choice. Is it possible to support both through some user-option?

hum, like "cos_lin"? but that will only make sense for DDtheta...

Hmmm - fast-acos option should be independent of the bin-type choice. If the user has specified fast_acos, then we should the approximations, otherwise, we should use the full-accuracy trig function

My point is that using fast-acos to perform linear binning may yield different npairs than with the "custom" binning (fast-acos being an approximation). I don't think we want that. So linear binning should rely on true-acos, whatever is required for thetaavg. I guess that will degrade efficiency by a lot - though not tested yet.

By the way, "auto" is used by default, meaning linear binning whenever input bins are linear. My tests showed it did not slow down pair counts even with 10 bins, but these tests are not exhaustive...

May be check what the default options for np.allclose are?

This is rtol=1e-05, atol=1e-08. It may yield incorrect pair counts. Then I guess we should have "custom" as a default instead of "auto", as pair counts should be guaranteed correct with the default options.

The source (.c.src files) are pre-processed by sed twice and DOUBLE becomes double and float -- i.e., setup_bins_DOUBLE translates to both setup_bins_float and setup_bins_double

hm... setup_bins_float and setup_bins_double are defined in utils.c. There is no utils.c.src, hence no setup_bins_DOUBLE definition. All the calls are to setup_bins (except for DDtheta, where setup_bins_double in called). Maybe I missed something?

manodeep

I have looked through the code - thank you! This is an enormous effort (and apologies for all of the stray trailing white-spaces)

I have made some comments - nothing too major really but most of the comments apply to a lot of the files and I stopped repeating myself. Hopefully, my comments are useful enough to give you a sense of what to change.

Thanks again!
Manodeep

manodeep · 2021-09-22T04:22:55Z

common.mk

@@ -19,7 +19,7 @@ CLINK ?=
 ## Set the python command (supply the full path to python you want to
 ## use, if different from directly calling `python` on the shell,
 ## as can be the case if python is set via an alias)
-PYTHON:=python
+PYTHON:=/local/home/adematti/anaconda3/envs/cosmopipe-dev/bin/python


This will need to be reverted to python

manodeep · 2021-09-22T04:23:33Z

Corrfunc/mocks/DDrppi_mocks.py

@@ -23,7 +23,8 @@ def DDrppi_mocks(autocorr, cosmology, nthreads, pimax, binfile,
                 xbin_refine_factor=2, ybin_refine_factor=2,
                 zbin_refine_factor=1, max_cells_per_dim=100,
                 copy_particles=True, enable_min_sep_opt=True,
-                 c_api_timer=False, isa=r'fastest', weight_type=None):
+                 c_api_timer=False, isa=r'fastest',
+                 weight_type=None, bin_type=r'auto'):


Is the r specifier required? @lgarrison?

No, should be fine without it (can remove it from the isa too for consistency)
EDIT: this is referring to the r for the raw string. I thought I was replying to a comment thread when I wrote this; same with some comments below. I guess it starts a new thread for any comments that come from a review...?

manodeep · 2021-09-22T04:26:26Z

Corrfunc/tests.py

+    print("\nRunning 2-D correlation function xi(rp,pi)")
+    results_DDrppi_ref = DDrppi_mocks(autocorr, cosmology, nthreads,
+                                      pimax, binfile,
+                                      ra, dec, cz,
+                                      weights1=w, weight_type='pair_product',
+                                      output_rpavg=True, verbose=True, isa=isa, bin_type='custom')
+    results_DDrppi = DDrppi_mocks(autocorr, cosmology, nthreads,
+                                  pimax, binfile,
+                                  ra, dec, cz,
+                                  weights1=w, weight_type='pair_product',
+                                  output_rpavg=True, verbose=False, isa=isa, bin_type='lin')


Is it possible to create a for loop that runs over bin_types=['custom', 'lin'] and compare that the previous output (if any) matches the current output? That will make the test code more concise

(Same applies to all of the other instances for the tests)

manodeep · 2021-09-22T04:28:36Z

Corrfunc/utils.py

+    except NameError:
+        if not isinstance(bin_type, str):
+            raise TypeError(msg)
+    valid_bin_type = ['AUTO', 'CUSTOM', 'LIN']


Will you please rename to valid_bin_types (i.e., plural)

manodeep · 2021-09-22T04:30:31Z

mocks/DDrppi_mocks/countpairs_rp_pi_mocks_impl.c.src

@@ -297,7 +297,7 @@ int countpairs_mocks_DOUBLE(const int64_t ND1, DOUBLE *ra1, DOUBLE *dec1, DOUBLE
    double *rupp;
    int nrpbin ;
    double rpmin,rpmax;
-    setup_bins(binfile,&rpmin,&rpmax,&nrpbin,&rupp);
+    setup_bins(binfile,&rpmin,&rpmax,&nrpbin,&rupp,&(options->bin_type));


Is it possible to separate the functionality for reading the bins from a file and the detection of the bin type - i.e., revert this to before and then (after the rpmin etc check), add a new function call to detect the bin type

manodeep · 2021-09-22T04:51:27Z

mocks/DDsmu_mocks/countpairs_s_mu_mocks_kernels.c.src

-                    break;
+            AVX512_FLOATS m_sbin = AVX512_SETZERO_FLOAT();
+
+            if (bin_type == 1) {


manodeep · 2021-09-22T04:51:43Z

mocks/DDsmu_mocks/countpairs_s_mu_mocks_kernels.c.src

+            AVX512_FLOATS m_sbin = AVX512_SETZERO_FLOAT();
+
+            if (bin_type == 1) {
+                if (!need_savg) {


typo with !?

I wanted to compute SQRT(s2) if not computed before (i.e. need_savg is False).
Maybe I am missing something?

Right; the custom binning version uses s^2, but the linear binning uses s, so we need to take the sqrt, unless it was already done because need_savg.

Perhaps line 272 above should just be if(need_savg || bin_type == BIN_LIN)?

Yup - what @lgarrison said

manodeep · 2021-09-22T05:06:43Z

utils/avx512_calls.h

@@ -282,7 +283,7 @@ extern "C" {
 #define AVX512_RECIPROCAL_FLOATS(X)                               _mm512_rcp14_pd(X)
 #define AVX512_MASK_RECIPROCAL_FLOATS(FALSEVALS, MASK, X)         _mm512_mask_rcp14_pd(FALSEVALS, MASK, X)
 #define AVX512_MASKZ_RECIPROCAL_FLOATS(MASK, X)                   _mm512_maskz_rcp14_pd(MASK, X)
-
+#define AVX512_CONVERT_INT_TO_FLOAT(X)                            _mm512_cvtepi32_pd(X)


I am wondering if you need the (instruction-less) cast to float (these ops) rather than the expensive convert options

manodeep · 2021-09-22T05:07:29Z

utils/sse_calls.h

@@ -41,11 +41,12 @@ extern "C" {
 #define SSE_TRUNCATE_FLOAT_TO_INT(X)     _mm_cvttps_epi32(X)
 #define SSE_SQUARE_FLOAT(X)              _mm_mul_ps(X,X)
 #define SSE_SET_FLOAT(X)                 _mm_set1_ps(X)
+#define SSE_CONVERT_INT_TO_FLOAT(X)      _mm_cvtepi32_ps(X)


Same as the AVX512F case - you might need a "free" cast operation rather than the convert operation

manodeep · 2021-09-22T05:08:35Z

utils/utils.c

+    return EXIT_SUCCESS;
+}
+
+static int setup_bin_type_double(const double *rupp, int nbin, uint8_t *bin_type)


As noted earlier, these functions should be available to call publicly (rather than static)

adematti · 2021-09-22T14:19:19Z

Hi,
Sorry, I meant to push this earlier, but forgot... I think all your comments above have been addressed, but:

bin_type=r'auto' => r'' is not necessary, but have included it as it was in e.g. isa=r'fastest'
linear binning treatment for DDtheta still missing
I could not check implementation for AVX512 kernels as my processors do not support it
I reverted the default binning type to 'custom', to avoid issues related to "binning is interpreted as linear but is actually not" with default options.

lgarrison · 2021-09-23T15:09:26Z

Hey @adematti, thank you so much for this contribution! This is not a small effort, and it looks great so far. And thanks @manodeep for reviewing this while I was away. I will try to catch up on the full set of comments above and do my own review, but probably not until tomorrow or Monday at the earliest.

lgarrison

This looks great! Just some minor comments.

lgarrison · 2021-09-23T19:06:17Z

Corrfunc/mocks/DDrppi_mocks.py

@@ -23,7 +23,8 @@ def DDrppi_mocks(autocorr, cosmology, nthreads, pimax, binfile,
                 xbin_refine_factor=2, ybin_refine_factor=2,
                 zbin_refine_factor=1, max_cells_per_dim=100,
                 copy_particles=True, enable_min_sep_opt=True,
-                 c_api_timer=False, isa=r'fastest', weight_type=None):
+                 c_api_timer=False, isa=r'fastest',
+                 weight_type=None, bin_type=r'auto'):


No, should be fine without it (can remove it from the isa too for consistency)
EDIT: this is referring to the r for the raw string. I thought I was replying to a comment thread when I wrote this; same with some comments below. I guess it starts a new thread for any comments that come from a review...?

lgarrison · 2021-09-23T19:10:27Z

Corrfunc/mocks/DDrppi_mocks.py

+    bin_type : string, case-insensitive (default ``auto``)
+        If bins in ``binfile`` are linearly-spaced, set to ``lin`` for speed-up.
+        Else, set to ``custom``.
+        ``auto`` allows for auto-detection of the binning type.
+


We ought to tell the user something here about accuracy. I think almost all users will be comfortable with the notion that a tiny fraction of pairs may shift bins when using this feature, but this should be mentioned explicitly.

lgarrison · 2021-09-23T19:13:44Z

Corrfunc/tests.py

+        return toret
+
+    def test_bin_type(pair_counter, *args, **kwargs):
+        res = (pair_counter(*args, bin_type=bin_type, weight_type=weight_type, **kwargs) for bin_type in bin_types for weight_type in [None, "pair_product"])


I think you can just test the weighted case; recall that the weighted case returns the integer number of raw pairs, in addition to the floating-point weighted pair count. So one can check both cases by just running the weighted computation. The same unweighted code path is exercised in both cases.

lgarrison · 2021-09-23T19:15:23Z

Corrfunc/tests.py

+    binfile = pjoin(dirname(abspath(__file__)),
+                    "../mocks/tests/", "bins_lin")
+    rbins = np.linspace(0.1, 45.1, 46)
+    with open(binfile,'w') as file:
+        for low, hi in zip(rbins[:-1], rbins[1:]):
+            file.write("{0} {1}\n".format(low, hi))


No need to actually write to a file here, probably. Can just pass the bins as a Python list/array. (This is calling the Python wrapper, not the CPython API, right?)

Corrfunc/tests.py

lgarrison · 2021-09-23T19:56:53Z

mocks/DDrppi_mocks/countpairs_rp_pi_mocks_impl.c.src

+    if (options->verbose) {
+        if (options->bin_type == BIN_LIN) fprintf(stderr,"Linear R binning\n");
+        else fprintf(stderr,"Custom R binning\n");
+    }


If you wanted to reduce duplication, I'd have no problem with detect_bin_type having the print statements internally, and either taking an extra int verbose flag or just taking the whole options struct as an arg (to enable checking options->verbose and options->bin_type).

Ok! I originally separated the print statement to mimic the check on rmin/rmax/nbins above.

lgarrison · 2021-09-24T19:42:06Z

mocks/DDsmu_mocks/countpairs_s_mu_mocks_kernels.c.src

+            AVX512_FLOATS m_sbin = AVX512_SETZERO_FLOAT();
+
+            if (bin_type == 1) {
+                if (!need_savg) {


Right; the custom binning version uses s^2, but the linear binning uses s, so we need to take the sqrt, unless it was already done because need_savg.

Perhaps line 272 above should just be if(need_savg || bin_type == BIN_LIN)?

lgarrison · 2021-09-24T19:48:16Z

utils/avx2_calls.h

@@ -179,6 +183,10 @@ extern "C" {
 #define AVX2_STREAMING_STORE_FLOATS(X,Y)   _mm256_stream_pd(X,Y)
 #define AVX2_STREAMING_STORE_INTS(X,Y)     _mm_stream_si128(X,Y)

+/* returns Z + XY*/
+#define AVX2_FMA_ADD_FLOATS(X,Y,Z)          _mm256_fmadd_pd(X,Y,Z)
+#define AVX2_FMA_ADD_TRUNCATE_FLOATS(X,Y,Z) _mm256_round_pd(_mm256_fmadd_pd(X,Y,Z),_MM_FROUND_TO_ZERO|_MM_FROUND_NO_EXC)


Is this being used anywhere? round_pd/ps returns a float, not an int, right? So it's not useful for the bin index?

Good catch, AVX2_FMA_ADD_TRUNCATE_FLOATS specifically is not used, indeed! (but e.g. AVX_FMA_ADD_TRUNCATE_FLOATS is useful in DDrppi - to take the floor value of the rp index before adding in the pi index)

Oh, right; I forgot that part of the code used floats to compute the integer index. @manodeep I guess the reason it's done this way is to be able to use FMA?

Originally I had the cast-to-float to perform some operations (mult? truncate?) that there not available for integer vectors. The question might be whether using FMA here helps.

@adematti You might be able to combine the two ops into one with something from this list. _mm512_fmadd_round_pd would be the exact replacement for what you already have, but you will likely need to use the masked variant

I think _mm512_fmadd_round_pd and _mm512_maskz_fmadd_round_pd are only available within AVX512?
Your comment actually made me realize that I should be careful about clipping the binning index (using mask) for some pair counters. It should be ok now, though I couldn't compile/test with AVX512.

manodeep · 2021-09-27T03:45:33Z

Thanks for your answer!
Having one bin (nbin = 2) is ok; zero bin (nbin < 2) fails after checks like

Great! I always forget that the way the code is setup nbins in the code is not the usual nbins from the user-perspective

Corrfunc/mocks/DDsmu_mocks/countpairs_s_mu_mocks_impl.c.src

Line 302 in 74c6fc2

fprintf(stderr,"Error: Could not setup with S bins correctly. (smin = %lf, smax = %lf, with nbins = %d). Expected non-zero smin/smax with smax > smin and nbins >=1 \n",

I can have detect_bin_type return an EXIT_FAILURE in case nbin <= 1 (though that will have no impact on the code).
Concerning kernels and FMA, I will check that asap.

Awesome - yup!

I suspect linear in theta is the more typical choice. Is it possible to support both through some user-option?

hum, like "cos_lin"? but that will only make sense for DDtheta...

Or better yet, we can design for the ability to allow cos_lin in the future but only implement theta_lin for now.

Hmmm - fast-acos option should be independent of the bin-type choice. If the user has specified fast_acos, then we should the approximations, otherwise, we should use the full-accuracy trig function

My point is that using fast-acos to perform linear binning may yield different npairs than with the "custom" binning (fast-acos being an approximation). I don't think we want that. So linear binning should rely on true-acos, whatever is required for thetaavg. I guess that will degrade efficiency by a lot - though not tested yet.

If I am understanding correctly, then you are referring to the impact of fast_acos during the detection of the bin type. If so, yes, I agree - the bin detection should run with the full precision acos. The impact of fast_acos is only in the kernels where the actual calculations are done.

The source (.c.src files) are pre-processed by sed twice and DOUBLE becomes double and float -- i.e., setup_bins_DOUBLE translates to both setup_bins_float and setup_bins_double

hm... setup_bins_float and setup_bins_double are defined in utils.c. There is no utils.c.src, hence no setup_bins_DOUBLE definition. All the calls are to setup_bins (except for DDtheta, where setup_bins_double in called). Maybe I missed something?

Most of the functions in utils.c.src are not precision dependent - hence we did not have the two versions. However, that can certainly change if the need arises.

adematti · 2021-09-27T22:31:11Z

Or better yet, we can design for the ability to allow cos_lin in the future but only implement theta_lin for now.

bin_type = 'lin' is now implemented for DDtheta_mocks

If I am understanding correctly, then you are referring to the impact of fast_acos during the detection of the bin type.

No, I am referring to the impact of using fast_acos to get theta, and then the bin index (theta - theta_min)/nbins + theta_min.
With fast_acos=True instead of fast_acos=False, some fraction of pairs may shift bins (in addition to thetaavg being different if required). I have added a note in the docstrings to explain this.

lgarrison · 2021-09-27T23:18:42Z

@adematti Just enabled the CI checks. Have a look when the tests are finished running and see if there are any failures? I can't remember if the CI has AVX-512, but if not, I'll run the tests locally (Cori KNL also has AVX-512).

manodeep · 2021-09-28T00:54:14Z

Noting that this will need to be tested with something similar to make tests -DINTEGRATION_TESTS (i.e., tested for all combinations of bin_refinement_factors, isa, min_sep_opt etc. These tests should that the output is identical between bin_type = custom and bin_type = lin (for a different set of linear bins) - i.e., what @adematti has already done within the python code

manodeep · 2021-09-28T00:55:01Z

@lgarrison 2.4.0 is already overdue - is this okay to push back to milestone 2.5.0?

lgarrison · 2021-09-28T01:00:11Z

I was thinking this was useful enough that we'd want to do a release when it is ready, but also that it is progressing pretty rapidly, so we'd just end up doing two releases back-to-back if we target 2.5. But I don't feel strongly; we can instruct DESI to install from git if necessary (although they'll prefer versioned code eventually).

manodeep · 2021-09-28T05:29:53Z

@lgarrison I agree - this is a significant enough change to warrant a release. I have moved the milestone to 2.5.0 (including all remaining 2.4.0 milestoned issues). Let me get to releasing the v2.4.0 now

adematti · 2021-09-28T07:08:22Z

clang tests fail because the -mfma flag is required.
gcc tests fail because of this line (which I left untouched):

Corrfunc/theory/DD/countpairs_kernels.c.src

Line 35 in 596fe77

(void) sqr_rpmax, (void) sqr_rpmin;

I guess one could remove it?
I very naively tried compiling on Cori KNL, after loading craype-mic-knl, but install did not pick the AVX512 flag,
I guess there are other things to specify?

lgarrison · 2021-09-28T15:06:19Z

clang tests fail because the -mfma flag is required.

Hmm, would have hoped -march=native would have set that if the CPU supports it. The CI CPUs probably support it, as its quite old. But if it's a weird virtual environment they might not, in which case the correct behavior would be to protect the SSE kernels with an FMA check.

We should probably add a lscpu command to the GitHub CI so we can triage cases like this. Will PR this...

gcc tests fail because of this line (which I left untouched):

Corrfunc/theory/DD/countpairs_kernels.c.src

Line 35 in 596fe77

(void) sqr_rpmax, (void) sqr_rpmin;

I guess one could remove it?

Yeah, should remove that line now that those variables are not part of the function signature. And actually, sqr_rpmin below that can be removed completely, as its unused.

I very naively tried compiling on Cori KNL, after loading craype-mic-knl, but install did not pick the AVX512 flag,
I guess there are other things to specify?

I guess so. I'll try it myself once I fix a quota issue at NERSC!

lgarrison · 2021-09-30T14:29:26Z

Hi @adematti, can you merge the latest master into your feature branch and push so the tests are rerun with the new CI config? That will also dump the CPU flags so we can debug the OSX failure, although my preliminary look at #259 suggests that the OSX CI CPUs indeed have SSE but no FMA, and since this branch introduces an FMA requirement, it breaks. So, we could require that both FMA and SSE are present; that shouldn't be too burdensome, as FMA is quite old (2011/2012).

lgarrison · 2021-10-01T15:31:03Z

Here's a better solution, I think: in sse_calls.h, can you just alias the call to a non-FMA version if FMA isn't available? i.e.:

#ifdef __FMA__
#define SSE_FMA_ADD_FLOATS(X,Y,Z)          _mm_fmadd_ps(X,Y,Z)
#else
#define SSE_FMA_ADD_FLOATS(X,Y,Z)          _mm_add_ps(_mm_mul_ps(X,Y),Z)
#endif

lgarrison · 2021-10-04T14:01:20Z

So I think (but am not 100% positive) that the __FMA__ flag only corresponds to SSE FMA, not AVX FMA. E.g. the Wikipedia page calls it out as an SSE extension specifically: https://en.wikipedia.org/wiki/FMA_instruction_set.

adematti · 2021-10-04T14:34:08Z

I see, thanks! I read FMA was introduced at the same time as AVX2, hence wrongly thought FMA flag was defined for AVX as well. I will remove changes to avx_calls.h (and solve errors below).

adematti · 2021-10-04T15:34:24Z

Removing the FMA flag for AVX raises again errors like:

'countpairs_avx_intrinsics_double' that is compiled without support for 'fma'
                m_rpbin = AVX_FMA_ADD_FLOATS(union_mDperp.m_Dperp,m_inv_rpstep,m_rpmin_invstep); // r2, then union_mDperp.m_Dperp aready clipped

(there are also some more bugs with AVX512, not unexpected as I could not try compiling on my computer)

lgarrison · 2021-12-03T20:49:31Z

Okay, there was a bug in the C tests that took a while to track down. The bins were being written to file with the %g specifier, which uses only 6 digits by default even for doubles. More precision is needed to match the linear binning. Switching to an exact representation with %a fixed the issue.

So now the C test pass (including the linear binning integration tests), as does test_lin.py. I tried to support test_lin.py under Python 2.7, but the deeper I got, the more issues appeared. So while linear binning will likely run under Python 2.7, I think we'll just have to say the tests won't.

@manodeep, can you review test_lin.py and let me know if we are ready to merge?

manodeep · 2021-12-06T00:47:56Z

Okay, there was a bug in the C tests that took a while to track down. The bins were being written to file with the %g specifier, which uses only 6 digits by default even for doubles. More precision is needed to match the linear binning. Switching to an exact representation with %a fixed the issue.

Awesome - makes sense!

So now the C test pass (including the linear binning integration tests), as does test_lin.py. I tried to support test_lin.py under Python 2.7, but the deeper I got, the more issues appeared. So while linear binning will likely run under Python 2.7, I think we'll just have to say the tests won't.

Do you have details about the issues you encountered? I am happy to try and take a look...

If we can't test with python2.7, what do you think about only supporting other bin types in >= python3? That way, we also signal that we are planning to drop python2 support

lgarrison · 2021-12-06T15:07:13Z

Do you have details about the issues you encountered? I am happy to try and take a look...

One problem was that Python 2.7 didn't like some of our test function signatures. I'm sure it's possible to fix, but the changes would touch all the test functions.

If we can't test with python2.7, what do you think about only supporting other bin types in >= python3? That way, we also signal that we are planning to drop python2 support

Sure, we could turn off linear binning just for Python 2.7, but it probably works fine and I tend to see all Python 2.7 usage these days as "use at your own risk" since it's been discontinued for so long! We do have the warning of Python 2.7 Corrfunc deprecation in place, too.

manodeep · 2021-12-08T04:16:10Z

Ohh good point about the deprecation warning - yup that is a reasonable warning that using python2.7 is in the "user-beware" territory.

Okay - I am going to look through the code in a little more detail. I remember having some thoughts about how the code could be structured, naming conventions etc

lgarrison · 2021-12-08T04:19:14Z

Sounds good! Let me know if I can help.

manodeep · 2021-12-21T04:47:41Z

It is embarrassing how long this PR is taking me to review - just commenting to say that I am doing my best to get the review done

manodeep · 2022-04-21T06:15:30Z

@adematti I am happy to review the remaining code - would that be useful?

manodeep · 2022-05-02T00:47:55Z

@lgarrison Not sure if @adematti has notifications enabled or perhaps is slammed. How do you want to proceed on this PR? Seems like there might be a good use-case and the speedups are fairly reasonable from what I recall

adematti · 2022-05-02T12:58:42Z

Hi @manodeep, sorry for the delay (saw your message, then got extremely busy). I think this is up to you and Lehman! (the Corrfunc version we are using within DESI has evolved significantly to address new weighting schemes & other line-of-sight definitions).

lgarrison · 2022-05-02T13:21:57Z

I'm happy to merge this, it's a great feature as-is! @adematti, are there any bugs you found in the linear binning that might have been addressed in the DESI fork but not here?

adematti · 2022-05-02T13:44:43Z

yes, I was just looking at this. The only bug I remember of was due to this line: https://github.com/adematti/Corrfunc/blob/6ae2b9f4e470558ceb0b1d59fb0c2f62ff4bdbdc/mocks/DDtheta_mocks/countpairs_theta_mocks_kernels.c.src#L924
which should be:
if (need_rpavg || bin_type == BIN_LIN) {
(see cosmodesi@8d95155)

manodeep · 2022-05-05T03:32:54Z

Ahh thanks @adematti :)

@lgarrison Do you want to merge this PR into a separate branch (we have one already right?) and then sort it out from there?

manodeep · 2022-05-05T03:36:52Z

yes, I was just looking at this. The only bug I remember of was due to this line: https://github.com/adematti/Corrfunc/blob/6ae2b9f4e470558ceb0b1d59fb0c2f62ff4bdbdc/mocks/DDtheta_mocks/countpairs_theta_mocks_kernels.c.src#L924 which should be: if (need_rpavg || bin_type == BIN_LIN) { (see adematti@8d95155)

@adematti This fix s already in the PR right?

lgarrison · 2022-05-05T13:20:14Z

@manodeep I think this branch still needs to be fixed; I'll do that in the next day or two.

adematti · 2022-05-05T19:52:08Z

(I found myself a couple of minutes to fix the bug)

lgarrison · 2022-05-09T22:51:01Z

@manodeep Checks are passing; let's merge?

manodeep · 2022-05-09T23:18:28Z

@lgarrison I was thinking of merging to a different branch - there are some naming conventions etc that I would like alter. Is the linearbinning-rounding appropriate or should we just create a new branch?

lgarrison · 2022-05-10T13:13:43Z

If you're feeling motivated to do another editing pass, then sure, let's merge to a new branch. But the PR has been open for 8 months, and I think functionally it's ready to go! It would be great to get this released so our users can experience the benefits.

manodeep · 2022-05-12T03:36:47Z

@adematti I have merged this PR into the repo - many many thanks for such a massive effort!

@lgarrison Yeah I know - but now that I am back at ~full-time, the minor refinemennts should be much easier to get done

lgarrison · 2022-10-31T21:40:04Z

Hey @manodeep, have you had a chance to make these refinements? If not, I suggest we go ahead and merge this into master before the branches diverge even more.

manodeep requested review from manodeep and lgarrison September 20, 2021 20:09

manodeep reviewed Sep 22, 2021

View reviewed changes

adematti force-pushed the linearbinning branch from 40467c0 to 93c2835 Compare September 22, 2021 14:06

lgarrison requested changes Sep 24, 2021

View reviewed changes

lgarrison added this to the v2.4.0 milestone Sep 27, 2021

manodeep modified the milestones: v2.4.0, v2.5.0 Sep 28, 2021

adematti force-pushed the linearbinning branch from a941bcb to 4b91ef2 Compare October 4, 2021 15:17

lgarrison added 3 commits December 3, 2021 14:40

Fix nthreads

c3d0666

Don't test linear binning on Python 2.7

c6bbf7d

Try to support old Numpy

6ae2b9f

manodeep mentioned this pull request Dec 13, 2021

Drop certain code options #267

Open

lgarrison mentioned this pull request Dec 22, 2021

GNU Assembler bug workaround #196

Merged

adematti mentioned this pull request Feb 3, 2022

separation average #271

Open

fixing issue in theta AVX linear binning

d3bf99d

manodeep changed the base branch from master to linearbinning-rounding May 11, 2022 02:07

manodeep merged commit a7e1dd5 into manodeep:linearbinning May 12, 2022

lgarrison mentioned this pull request Nov 5, 2022

pytest fails on local supercomputer (Skylake cpus) #283

Open

special handling of linear binning #258

special handling of linear binning #258

Conversation

adematti commented Sep 20, 2021

pep8speaks commented Sep 20, 2021 • edited Loading

Comment last updated at 2022-05-05 19:51:25 UTC

manodeep commented Sep 20, 2021

manodeep commented Sep 20, 2021

manodeep commented Sep 20, 2021

adematti commented Sep 20, 2021

manodeep left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

lgarrison Sep 23, 2021 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

adematti commented Sep 22, 2021

lgarrison commented Sep 23, 2021

lgarrison left a comment

Choose a reason for hiding this comment

lgarrison Sep 23, 2021 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

manodeep commented Sep 27, 2021

adematti commented Sep 27, 2021 • edited Loading

lgarrison commented Sep 27, 2021

manodeep commented Sep 28, 2021

manodeep commented Sep 28, 2021

lgarrison commented Sep 28, 2021

manodeep commented Sep 28, 2021

adematti commented Sep 28, 2021

lgarrison commented Sep 28, 2021

lgarrison commented Sep 30, 2021

lgarrison commented Oct 1, 2021

lgarrison commented Oct 4, 2021

adematti commented Oct 4, 2021 • edited Loading

adematti commented Oct 4, 2021 • edited Loading

lgarrison commented Dec 3, 2021

manodeep commented Dec 6, 2021

lgarrison commented Dec 6, 2021 • edited Loading

manodeep commented Dec 8, 2021

lgarrison commented Dec 8, 2021

manodeep commented Dec 21, 2021

manodeep commented Apr 21, 2022

manodeep commented May 2, 2022

adematti commented May 2, 2022 • edited Loading

lgarrison commented May 2, 2022

adematti commented May 2, 2022 • edited Loading

manodeep commented May 5, 2022

manodeep commented May 5, 2022

lgarrison commented May 5, 2022

adematti commented May 5, 2022

lgarrison commented May 9, 2022

manodeep commented May 9, 2022

lgarrison commented May 10, 2022

manodeep commented May 12, 2022

lgarrison commented Oct 31, 2022

pep8speaks commented Sep 20, 2021 •

edited

Loading

lgarrison Sep 23, 2021 •

edited

Loading

lgarrison Sep 23, 2021 •

edited

Loading

adematti commented Sep 27, 2021 •

edited

Loading

adematti commented Oct 4, 2021 •

edited

Loading

adematti commented Oct 4, 2021 •

edited

Loading

lgarrison commented Dec 6, 2021 •

edited

Loading

adematti commented May 2, 2022 •

edited

Loading

adematti commented May 2, 2022 •

edited

Loading