
feat: add fit_curve and predict_curve #139

Merged
merged 23 commits into main on Aug 22, 2023
Conversation

LukeWeidenwalker
Contributor

@LukeWeidenwalker LukeWeidenwalker commented Jul 17, 2023

I've now taken a shot at implementing fit_curve too, and thought I'd just start from scratch to see what the problem really is. With just the xarray built-in function curvefit, I get a throughput of 1000 pixels/sec with a vanilla OpenEO dask cluster (6 workers with 4 CPUs and 12 GB RAM each). Dask added constant-memory rechunking a while ago with its P2P rechunking scheme, so I'm somewhat hopeful that this will be okay for larger datacubes too. I haven't tried a truly humongous dataset yet (the largest I tried was 1 million pixels, 2x what was used in the SRR2 notebook, which took 30 minutes), but I'm inclined not to worry about the rechunking anymore unless it comes up again.

@clausmichele do you think this level of performance allows you to make progress on this use case?

There's some open questions about the interface:

  • I don't think fit_curve should be run through apply_dimension or reduce_dimension, as was changed in fit_curve return schema openeo-processes#425, because this makes everything awkward. I've therefore added the dimension parameter back in.
  • I don't know how to support ignore_nodata right now
  • The inference (predict_curve) is fine to run as a reducer though! I haven't had a chance to take a closer look yet, so it might need changes there too!

I've only eyeballed the parameter unpacking so far - it could use a closer look to confirm that what I'm doing there makes sense!

Note on performance: most of the wall time is spent in a vectorize__wrapper step - afaict this is where apply_ufunc vectorizes the function before passing it to scipy's curve_fit, and this work isn't parallelising. I've already tried pre-compiling the function with numba, but to no avail. I'm not entirely sure how we'd go about speeding this up further, but maybe the current throughput is already good enough to demonstrate the use case?
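To illustrate why this step dominates: what the vectorized path effectively boils down to is scipy's curve_fit applied pixel by pixel, which is a serial Python-level loop per chunk. A minimal standalone sketch (the model function, shapes, and data here are illustrative stand-ins, not the actual implementation, which goes through xarray's curvefit/apply_ufunc):

```python
import numpy as np
from scipy.optimize import curve_fit

def model(t, a, b):
    # Illustrative model; the real process accepts a user-defined function.
    return a * np.sin(2 * np.pi * t) + b

# Tiny synthetic cube with dims (t, y, x), true params a=2.0, b=0.5 plus noise.
rng = np.random.default_rng(42)
t = np.linspace(0, 2, 50)
cube = model(t, 2.0, 0.5)[:, None, None] + rng.normal(0, 0.05, size=(50, 4, 4))

# Fit each pixel independently -- this per-pixel Python loop is essentially
# what apply_ufunc's vectorize wrapper does under the hood, and it is the
# part that doesn't parallelise within a single chunk.
params = np.empty((2, 4, 4))
for i in range(4):
    for j in range(4):
        popt, _ = curve_fit(model, t, cube[:, i, j])
        params[:, i, j] = popt
```

Smaller spatial chunks mean more of these loops running concurrently across workers, which matches the chunk-size observations further down the thread.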

@codecov

codecov bot commented Jul 17, 2023

Codecov Report

Merging #139 (20a8326) into main (7490bc3) will increase coverage by 0.69%.
Report is 1 commit behind head on main.
The diff coverage is 93.47%.

@@            Coverage Diff             @@
##             main     #139      +/-   ##
==========================================
+ Coverage   78.95%   79.64%   +0.69%     
==========================================
  Files          28       29       +1     
  Lines        1207     1253      +46     
==========================================
+ Hits          953      998      +45     
- Misses        254      255       +1     
Files Changed Coverage Δ
...s_dask/process_implementations/ml/curve_fitting.py 93.33% <93.33%> (ø)
...cesses_dask/process_implementations/ml/__init__.py 100.00% <100.00%> (ø)

... and 2 files with indirect coverage changes


@clausmichele
Member

@LukeWeidenwalker can you clarify how many time steps and how many pixels in the spatial dimension are present in your test set?

@LukeWeidenwalker
Contributor Author

LukeWeidenwalker commented Jul 17, 2023

@LukeWeidenwalker can you clarify how many time steps and how many pixels in the spatial dimension are present in your test set?

bands: 5, t: 160, y: 1151, x: 953 (1.63 GiB); this took 31 minutes to complete. The temporal extent corresponds to ["2016-09-01", "2018-10-01"] of the boa_sentinel_2 collection.

@clausmichele
Member

What is the chunk size? We would also need to check what happens if we increase the size in the spatial domain.

Anyway, to make use of this functionality, it should also work with a larger area, like a Sentinel-2 tile (~10000x10000).
Possibilities:

  • Allocate more resources depending on the requested area? We need to check whether throwing more workers at it reduces the required time.
  • Split the jobs, similarly to what the Aggregator does for large-scale processing?

@LukeWeidenwalker
Contributor Author

LukeWeidenwalker commented Jul 18, 2023

What is the chunk size? We would also need to check what happens if we increase the size in the spatial domain.

Chunk size after rechunking was (1, 160, 1151, 953)!

Anyway, to make usage of this functionality it should work also with a larger area, like a Sentinel-2 tile (~10000x10000).

The time seems to scale linearly as far as I can see - doing half the number of pixels took ~50% of the time. We should therefore expect a full Sentinel-2 tile to take ~50h with the current setup.

  • Allocate more resources depending on the requested area? Need to check if throwing more workers reduces the required time.

I will try doubling the dask cluster size and report back on how that changes the runtime - although I don't expect it to help, because as I said above, most of the time is spent in single cores working on this vectorize__wrapper call.

  • Splitting the jobs, similarly to what the Aggregator does for large scale processing?

That should totally work, only subject to resource limitations on our backend!

@clausmichele
Member

Where is the re-chunking size set? From the experiments I did, I remember that overall it performed much better with small chunks like 128x128, 256x256, ...
It would be nice if you could test how long it takes with the same settings but smaller chunks; currently they're pretty large.

@LukeWeidenwalker
Contributor Author

LukeWeidenwalker commented Jul 18, 2023

I've now done a run with 12 workers instead of 6 - I get a minor speedup, from 31 min to 24 min, but it does indeed look like the extra resources aren't being fully utilised in this vectorisation step.

I've also tried different chunksizes (still with 12 workers):

  • 256x256: took 30 minutes, so actually a bit slower than the larger chunk
  • 128x128: took 20 minutes, so a bit faster
  • 64x64: took 20 minutes, pretty much the same as 128x128

Interestingly, with smaller chunks I get pretty much full CPU utilisation across these 12 workers: there are more vectorize__wrapper calls, but each of them is smaller, so all workers can work on them in parallel.

So extrapolating from these, I'd expect a sentinel-2 tile to take ~35h.
lmk what you think!
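For what it's worth, the extrapolation can be checked with a quick back-of-the-envelope calculation, assuming throughput scales linearly with pixel count (numbers taken from the 20-minute, 12-worker runs above):

```python
# Back-of-the-envelope check of the runtime extrapolation, assuming linear scaling.
test_pixels = 1151 * 953              # spatial extent of the 1.63 GiB test cube
runtime_s = 20 * 60                   # best runs above: 20 min with 12 workers
throughput = test_pixels / runtime_s  # ~914 pixels/sec

tile_pixels = 10_000 * 10_000         # full Sentinel-2 tile
est_hours = tile_pixels / throughput / 3600
print(f"~{est_hours:.0f} h")          # ~30 h, the same ballpark as the estimate above
```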

@LukeWeidenwalker
Contributor Author

Actually @clausmichele, do you still have performance characteristics from your experiments around? Would be interesting to compare the throughput (pixels/second)! If your numbers were higher than this, I think we could expect more low-hanging fruit and try optimising a bit further - otherwise, we should probably just accept this level of performance!

@LukeWeidenwalker
Contributor Author

I've also launched (and pre-emptively killed) a few experiments on a 10000x10000 spatial extent - my dask cluster did not run out of memory handling this array along the way, so although it would take a fair bit longer, I'm fairly confident that this computation would run through!

@clausmichele
Member

Thanks Lukas for the tests. I actually don't have my old ones any more, but they showed that with "small" chunks I was getting better performance.
When you have time, could you provide me the code you use to call fit_curve with the python client?
If predict works as well, I could start testing them soon.

@LukeWeidenwalker
Contributor Author

Thanks Lukas for the tests. I actually don't have my old ones any more, but they were showing that with "small" chunks I was getting better performance. When you have time, could you provide me the code to call fit_curve as you're doing using the python client? If predict works as well I could start testing them any time soon.

Ah, I'm not calling this with the python client, just locally with a remote dask cluster attached. Are you suggesting that we merge this, deploy it to prod as an experimental process and you start testing on that? Or should we add predict_curve too first and then deploy?

@clausmichele
Member

We should have predict_curve too, of course. My notebook using fit and predict is still available here: https://github.com/openEOPlatform/SRR2_notebooks/blob/main/UC6%20-%20Forest%20Dynamics.ipynb

@LukeWeidenwalker changed the title from "Draft: feat: add fit_curve" to "feat: add fit_curve" on Aug 16, 2023
@LukeWeidenwalker
Contributor Author

@clausmichele @ValentinaHutter I've now also implemented predict_curve, so this PR is ready for a thorough review! Since you're both more familiar with the actual usecase, it would be great if you could both have a look and tell me whether there's anything else missing here! :)

Collaborator

@ValentinaHutter ValentinaHutter left a comment


LGTM! There are just some things I was wondering about - see comments :)

@LukeWeidenwalker
Contributor Author

@clausmichele would be great if you get some time to have a look over this PR this week - once this is merged, you should be able to start testing the usecase in short order!

parameters = {f"param_{i}": v for i, v in enumerate(parameters)}

# The dimension along which to predict cannot be chunked!
rechunked_data = data.chunk({dimension: -1})
Member


Shouldn't we do the same in predict_curve as well?

Contributor Author


I don't think it matters for predict, because there each timestep can be inferred independently of the other steps!
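A minimal numpy sketch of why prediction is independent per timestep (the model function and parameter array here are hypothetical stand-ins for the fitted output of fit_curve, not the actual implementation):

```python
import numpy as np

# Hypothetical model, standing in for the user-supplied curve.
def model(t, a, b):
    return a * np.sin(2 * np.pi * t) + b

# Pretend these came out of fit_curve: one (a, b) pair per pixel, shape (param, y, x).
params = np.stack([np.full((4, 4), 2.0), np.full((4, 4), 0.5)])

# Prediction only needs the parameters and the timestep itself, never the
# neighbouring timesteps -- so chunking along the time dimension is harmless.
t = np.linspace(0, 2, 50)
pred = model(t[:, None, None], params[0], params[1])  # shape (t, y, x)
```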

Member


@LukeWeidenwalker that's true. But I was wondering if it would be faster when predicting on a datacube with many timesteps?

Contributor Author


Hmm - not sure tbh, I haven't profiled predict_curve at all yet - I think I'll merge and deploy this now so we can start a training run at least, and revisit this if performance of inference turns out to be a problem!

Member

@clausmichele clausmichele left a comment


@LukeWeidenwalker I left two comments, other than that it seems fine.

@LukeWeidenwalker changed the title from "feat: add fit_curve" to "feat: add fit_curve and predict_curve" on Aug 22, 2023
@LukeWeidenwalker LukeWeidenwalker merged commit 3a55003 into main Aug 22, 2023
5 checks passed
@LukeWeidenwalker LukeWeidenwalker deleted the add-fit-curve-that-works branch August 22, 2023 12:43