Speedup caching of audbcards.Dataset #83

Merged
merged 22 commits into from
Apr 30, 2024

hagenw
Member

@hagenw hagenw commented Apr 24, 2024

When caching audbcards.Dataset, we store objects that are not needed to create a datacard,
e.g. the dependency table and header of a dataset. This increases the size of the cache and makes loading slower than necessary.
This pull request speeds up caching of audbcards.Dataset by pickling only the cached properties, as listed by audbcards.Dataset._cached_properties() (formerly audbcards.Dataset.properties()).
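The idea of pickling only the cached properties can be illustrated with a minimal sketch. `ToyDataset` and its attributes are hypothetical stand-ins and not the actual audbcards implementation; the sketch only assumes that `_cached_properties()` collects every attribute decorated with `functools.cached_property`.

```python
import functools
import inspect
import pickle


class ToyDataset:
    """Hypothetical stand-in for audbcards.Dataset (not the real class)."""

    def __init__(self, name):
        self.name = name
        self._deps = None

    @property
    def deps(self):
        # Heavy object that is not needed on the datacard itself
        if self._deps is None:
            self._deps = list(range(100_000))
        return self._deps

    @functools.cached_property
    def files(self):
        return len(self.deps)

    @functools.cached_property
    def short_description(self):
        return f"Toy dataset {self.name}"

    def _cached_properties(self):
        # Collect name and value of every functools.cached_property
        return {
            attr: getattr(self, attr)
            for attr, value in inspect.getmembers(type(self))
            if isinstance(value, functools.cached_property)
        }


dataset = ToyDataset("toy")
cached = dataset._cached_properties()

# Pickle only the cached properties ...
slim = pickle.dumps(cached)
# ... versus the full instance state, which also holds the heavy `_deps`
full = pickle.dumps(dataset.__dict__)
```

In this toy setup `slim` is much smaller than `full`, because the heavy dependency list never reaches the pickle.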

The execution time for building our database overview page is as follows on compute5:

| branch | fresh build | build from cache |
| --- | --- | --- |
| main | 15 minutes | 3 minutes |
| this branch | 15 minutes | 2 minutes |

The size of the cache is reduced from 2.6G to 133M.
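The difference between a fresh build and a build from cache can be sketched as a load-or-compute helper. This is only an illustration: `load_or_build`, `build_properties`, and the cache file layout are hypothetical and do not reflect how audbcards organizes its cache.

```python
import os
import pickle
import tempfile


def load_or_build(cache_file, build):
    """Return (properties, loaded_from_cache) for a hypothetical cache file."""
    if os.path.exists(cache_file):
        # Build from cache: only unpickle the stored properties
        with open(cache_file, "rb") as fp:
            return pickle.load(fp), True
    # Fresh build: run the expensive step and store its result
    properties = build()
    os.makedirs(os.path.dirname(cache_file), exist_ok=True)
    with open(cache_file, "wb") as fp:
        pickle.dump(properties, fp)
    return properties, False


def build_properties():
    # Stand-in for the expensive step of inspecting a dataset
    return {"name": "toy", "files": 3}


cache_root = tempfile.mkdtemp()
cache_file = os.path.join(cache_root, "toy", "1.0.0", "toy-1.0.0.pkl")

first, first_from_cache = load_or_build(cache_file, build_properties)
second, second_from_cache = load_or_build(cache_file, build_properties)
```

The smaller the pickled object, the cheaper the second call becomes, which is where the reduced cache size pays off.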

We can further improve execution time by also caching the images / audio examples from audbcards.Datacard, but I will handle this in a follow-up pull request.


Further changes:

  • Renamed audbcards.Dataset.properties() to audbcards.Dataset._cached_properties()
  • Added cached property audbcards.Dataset.schemes_summary, which holds entries needed for the dataset overview page
  • Added docstring for audbcards.Dataset.cache_root attribute
  • Converted audbcards.Dataset.deps and audbcards.Dataset.header to properties, and added them to the documentation
  • Added audbcards.Dataset.backend and audbcards.Dataset.repository_object properties
  • Removed adding __getstate__ and __setstate__ methods to the dohq_artifactory.GenericRepository object, as the repository is no longer pickled

Newly added API entries:

(six screenshots of the new documentation entries; images not preserved)

audbcards/core/datacard.py (outdated review thread, resolved)
@hagenw hagenw marked this pull request as draft April 24, 2024 14:17
@hagenw hagenw marked this pull request as ready for review April 26, 2024 10:28
Member

@ChristianGeng ChristianGeng left a comment

This achieves a smaller footprint by changing the caching mechanism,
and now distinguishes between cached properties and "normal" ones, i.e. between ones decorated with @functools.cached_property and others decorated with @property.

These are now treated differently: the heavy ones that are conceptually not descriptive of a dataset (backend, deps, header, repository_object) are "normal" properties that are not cached.

These "normal" properties also implement a kind of lazy loading and are loaded once in the lifetime of an object. As data artifacts change slowly (and the audbcards descriptives too), I see no problem with that, so one does not have to deal with datasets that change during the lifetime of an object. So I believe this is fine.
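A small sketch of the two kinds of properties described here, counting how often each one is actually computed. The class, the `loads` counter, and the header content are made up for illustration and are not taken from audbcards.

```python
import functools


class ToyDataset:
    """Hypothetical class contrasting @property and cached_property."""

    def __init__(self):
        self.loads = {"header": 0, "description": 0}
        self._header = None

    @property
    def header(self):
        # "Normal" property with lazy loading:
        # the value is kept in a private attribute after the first access
        if self._header is None:
            self.loads["header"] += 1
            self._header = {"description": "A toy dataset"}
        return self._header

    @functools.cached_property
    def description(self):
        # Cached property:
        # the value is stored in the instance __dict__ under its own name
        self.loads["description"] += 1
        return self.header["description"]


dataset = ToyDataset()
for _ in range(3):
    dataset.header
    dataset.description
```

Both properties are computed only once per object, but only `description` shows up under its public name in `dataset.__dict__`, which makes cached properties easy to collect for pickling.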

The whole MR makes building the "carddeck" from cache 50% faster, but the saving in disk space is larger by an order of magnitude, which is great.

The tests cover the new features and look sound to me.

The only concern I have is about dependencies: does the code already depend on a newer version of audbackend? I see no changes in the pyproject.toml.

I do not think this will be a big deal and am approving tentatively.

@hagenw
Member Author

hagenw commented Apr 30, 2024

> The only concern I have is about dependencies: is the code depending on a newer version of audbackend already? I see no changes in the pyproject.toml.

No, this does not yet depend on a newer audbackend version (version 2.0.0 is also not released yet, but exists only in the dev branch of audbackend). I will prepare a pull request for testing audbackend 2.0.0 after the caching speed is handled, to avoid merge conflicts.

@hagenw hagenw merged commit fa0cbf0 into main Apr 30, 2024
6 checks passed
@hagenw hagenw deleted the speedup-caching branch April 30, 2024 08:01