
feat: add fit_curve and predict_curve #139

Merged
merged 23 commits into main on Aug 22, 2023
Conversation

LukeWeidenwalker
Contributor

@LukeWeidenwalker LukeWeidenwalker commented Jul 17, 2023

I've now taken a shot at implementing fit_curve too, and thought I'd just start from scratch to see what the problem really is. With just the xarray built-in function curvefit, I get a throughput of 1000 pixels/sec with a vanilla OpenEO dask cluster (6 workers with 4 CPUs and 12 GB RAM each). Dask added constant-memory rechunking a while ago with its P2P rechunking scheme, so I'm somewhat hopeful that this will be okay for larger datacubes too. I haven't tried a truly humongous dataset yet (the largest I tried was 1 million pixels, 2x what was used in the SRR2 notebook, which took 30 minutes), but I'm inclined not to worry about the rechunking anymore unless it comes up again.

@clausmichele do you think this level of performance allows you to make progress on this use case?

There's some open questions about the interface:

  • I don't think fit_curve should be run through apply_dimension or reduce_dimension, as was changed in fit_curve return schema openeo-processes#425, because this makes everything awkward. I've therefore added the dimension parameter back in.
  • I don't know how to support ignore_nodata right now
  • The inference (predict_curve) is fine to run as a reducer though! I haven't had a chance to take a closer look yet, so it might need changes there too!

I've only eyeballed the parameter unpacking so far - it could use a closer look to confirm that what I'm doing there makes sense!

Note on performance: most of the wall time is spent in a vectorize__wrapper step - afaict this is where apply_ufunc vectorizes the function before passing it to scipy's curve_fit, and this work isn't parallelising. I've already tried pre-compiling the function with numba, but to no avail. I'm not entirely sure how we'd go about speeding this up further, but maybe the current throughput is already good enough to demonstrate the use case?
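To illustrate why this step dominates: what the vectorized path effectively boils down to is scipy's curve_fit applied pixel by pixel, which is a serial Python-level loop per chunk. A minimal standalone sketch (the model function, shapes, and data here are illustrative stand-ins, not the actual implementation, which goes through xarray's curvefit/apply_ufunc):

```python
import numpy as np
from scipy.optimize import curve_fit

def model(t, a, b):
    # Illustrative model; the real process accepts a user-defined function.
    return a * np.sin(2 * np.pi * t) + b

# Tiny synthetic cube with dims (t, y, x), true params a=2.0, b=0.5 plus noise.
rng = np.random.default_rng(42)
t = np.linspace(0, 2, 50)
cube = model(t, 2.0, 0.5)[:, None, None] + rng.normal(0, 0.05, size=(50, 4, 4))

# Fit each pixel independently -- this per-pixel Python loop is essentially
# what apply_ufunc's vectorize wrapper does under the hood, and it is the
# part that doesn't parallelise within a single chunk.
params = np.empty((2, 4, 4))
for i in range(4):
    for j in range(4):
        popt, _ = curve_fit(model, t, cube[:, i, j])
        params[:, i, j] = popt
```

Smaller spatial chunks mean more of these loops running concurrently across workers, which matches the chunk-size observations further down the thread.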

@codecov

codecov bot commented Jul 17, 2023

Codecov Report

Merging #139 (20a8326) into main (7490bc3) will increase coverage by 0.69%.
Report is 1 commit behind head on main.
The diff coverage is 93.47%.

@@            Coverage Diff             @@
##             main     #139      +/-   ##
==========================================
+ Coverage   78.95%   79.64%   +0.69%     
==========================================
  Files          28       29       +1     
  Lines        1207     1253      +46     
==========================================
+ Hits          953      998      +45     
- Misses        254      255       +1     
Files Changed Coverage Δ
...s_dask/process_implementations/ml/curve_fitting.py 93.33% <93.33%> (ø)
...cesses_dask/process_implementations/ml/__init__.py 100.00% <100.00%> (ø)

... and 2 files with indirect coverage changes


@clausmichele
Member

@LukeWeidenwalker can you clarify how many time steps and how many pixels in the spatial dimension are present in your test set?

@LukeWeidenwalker
Contributor Author

LukeWeidenwalker commented Jul 17, 2023

@LukeWeidenwalker can you clarify how many time steps and how many pixels in the spatial dimension are present in your test set?

bands: 5, t: 160, y: 1151, x: 953 (1.63 GiB); this took 31 minutes to complete. The temporal extent corresponds to ["2016-09-01", "2018-10-01"] of the boa_sentinel_2 collection.

@clausmichele
Member

What is the chunk size? We would also need to check what happens if we increase the size in the spatial domain.

Anyway, to make use of this functionality, it should also work with a larger area, like a Sentinel-2 tile (~10000x10000).
Possibilities:

  • Allocate more resources depending on the requested area? We need to check whether throwing more workers at it reduces the required time.
  • Split the jobs, similarly to what the Aggregator does for large-scale processing?

@LukeWeidenwalker
Contributor Author

LukeWeidenwalker commented Jul 18, 2023

What is the chunk size? We would also need to check what happens if we increase the size in the spatial domain.

Chunk size after rechunking was (1, 160, 1151, 953)!

Anyway, to make usage of this functionality it should work also with a larger area, like a Sentinel-2 tile (~10000x10000).

The time seems to scale linearly as far as I can see - doing half the number of pixels took ~50% of the time. We should therefore expect a full Sentinel-2 tile to take ~50h with the current setup.

  • Allocate more resources depending on the requested area? Need to check if throwing more workers reduces the required time.

I will try doubling the dask cluster size and report back on how that changes the runtime - although I don't expect it to help, because as I said above, most of the time is spent in single cores working on this vectorize__wrapper call.

  • Splitting the jobs, similarly to what the Aggregator does for large scale processing?

That should totally work, only subject to resource limitations on our backend!

@clausmichele
Member

Where is the re-chunking size set? From the experiments I did, I remember that overall it performed much better with small chunks like 128x128, 256x256, ...
It would be nice if you could test how long it takes with the same settings but smaller chunks; currently they're pretty large.

@LukeWeidenwalker
Contributor Author

LukeWeidenwalker commented Jul 18, 2023

I've now done a run with 12 workers instead of 6 - I get a minor speedup, from 31 min to 24 min, but it does indeed look like the extra resources aren't being fully utilised in this vectorisation step.

I've also tried different chunksizes (still with 12 workers):

  • 256x256: took 30 minutes, so actually a bit slower than the larger chunk
  • 128x128: took 20 minutes, so a bit faster
  • 64x64: took 20 minutes, pretty much the same as 128x128

Interestingly, with smaller chunks I get pretty much full CPU utilisation across these 12 workers: there are more vectorize__wrapper calls, but each of them is smaller, so all workers can work on them in parallel.

So extrapolating from these, I'd expect a sentinel-2 tile to take ~35h.
lmk what you think!
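For what it's worth, the extrapolation can be checked with a quick back-of-the-envelope calculation, assuming throughput scales linearly with pixel count (numbers taken from the 20-minute, 12-worker runs above):

```python
# Back-of-the-envelope check of the runtime extrapolation, assuming linear scaling.
test_pixels = 1151 * 953              # spatial extent of the 1.63 GiB test cube
runtime_s = 20 * 60                   # best runs above: 20 min with 12 workers
throughput = test_pixels / runtime_s  # ~914 pixels/sec

tile_pixels = 10_000 * 10_000         # full Sentinel-2 tile
est_hours = tile_pixels / throughput / 3600
print(f"~{est_hours:.0f} h")          # ~30 h, the same ballpark as the estimate above
```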

@LukeWeidenwalker
Contributor Author

Actually @clausmichele, do you still have performance characteristics from your experiments around? Would be interesting to compare the throughput (pixels/second)! If your numbers were higher than this, I think we could expect more low-hanging fruit and try optimising a bit further - otherwise, we should probably just accept this level of performance!

@LukeWeidenwalker
Contributor Author

I've also launched (and pre-emptively killed) a few experiments on a 10000x10000 spatial extent - my dask cluster did not run out of memory handling this array along the way, so although it would take a fair bit longer, I'm fairly confident that this computation would run through!

@clausmichele
Member

Thanks Lukas for the tests. I actually don't have my old ones any more, but they showed that with "small" chunks I was getting better performance.
When you have time, could you provide me the code you use to call fit_curve with the python client?
If predict works as well, I could start testing them soon.

@LukeWeidenwalker
Contributor Author

Thanks Lukas for the tests. I actually don't have my old ones any more, but they were showing that with "small" chunks I was getting better performance. When you have time, could you provide me the code to call fit_curve as you're doing using the python client? If predict works as well I could start testing them any time soon.

Ah, I'm not calling this with the python client, just locally with a remote dask cluster attached. Are you suggesting that we merge this, deploy it to prod as an experimental process and you start testing on that? Or should we add predict_curve too first and then deploy?

@clausmichele
Member

We should have predict_curve too, of course. My notebook using fit and predict is still available here: https://github.com/openEOPlatform/SRR2_notebooks/blob/main/UC6%20-%20Forest%20Dynamics.ipynb

@LukeWeidenwalker changed the title from "Draft: feat: add fit_curve" to "feat: add fit_curve" on Aug 16, 2023
@LukeWeidenwalker
Contributor Author

@clausmichele @ValentinaHutter I've now also implemented predict_curve, so this PR is ready for a thorough review! Since you're both more familiar with the actual usecase, it would be great if you could both have a look and tell me whether there's anything else missing here! :)

Collaborator

@ValentinaHutter ValentinaHutter left a comment


LGTM! There are just some things I was wondering about - see comments :)

@LukeWeidenwalker
Contributor Author

@clausmichele would be great if you get some time to have a look over this PR this week - once this is merged, you should be able to start testing the usecase in short order!

parameters = {f"param_{i}": v for i, v in enumerate(parameters)}

# The dimension along which to predict cannot be chunked!
rechunked_data = data.chunk({dimension: -1})
Member


Shouldn't we do the same in predict_curve as well?

Contributor Author


I don't think it matters for predict, because there each timestep can be inferred independently of the other steps!
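A minimal numpy sketch of why prediction is independent per timestep (the model function and parameter array here are hypothetical stand-ins for the fitted output of fit_curve, not the actual implementation):

```python
import numpy as np

# Hypothetical model, standing in for the user-supplied curve.
def model(t, a, b):
    return a * np.sin(2 * np.pi * t) + b

# Pretend these came out of fit_curve: one (a, b) pair per pixel, shape (param, y, x).
params = np.stack([np.full((4, 4), 2.0), np.full((4, 4), 0.5)])

# Prediction only needs the parameters and the timestep itself, never the
# neighbouring timesteps -- so chunking along the time dimension is harmless.
t = np.linspace(0, 2, 50)
pred = model(t[:, None, None], params[0], params[1])  # shape (t, y, x)
```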

Member


@LukeWeidenwalker that's true. But I was wondering if it would be faster when predicting on a datacube with many timesteps?

Contributor Author


Hmm - not sure tbh, I haven't profiled predict_curve at all yet - I think I'll merge and deploy this now so we can start a training run at least, and revisit this if performance of inference turns out to be a problem!

Member

@clausmichele clausmichele left a comment


@LukeWeidenwalker I left two comments, other than that it seems fine.

@LukeWeidenwalker changed the title from "feat: add fit_curve" to "feat: add fit_curve and predict_curve" on Aug 22, 2023
@LukeWeidenwalker LukeWeidenwalker merged commit 3a55003 into main Aug 22, 2023
5 checks passed
@LukeWeidenwalker LukeWeidenwalker deleted the add-fit-curve-that-works branch August 22, 2023 12:43