Add audbcards.Dataset.segments #30

hagenw · 2023-10-09T06:46:25Z

In #28 (comment) @ChristianGeng proposed adding a property audbcards.Dataset.datapoints that would return the number of files or segments, based on a files or segments table if one exists.

The downside is that neither a files table nor a segments table has to exist inside a database. The ground truth for all files in a database is audformat.Database.files, and for all segments it is audformat.Database.segments. We already have audbcards.Dataset.files to get the number of files, so it seems reasonable to also add audbcards.Dataset.segments.

This has one downside though: to get the number of all possible segments, we need to load all tables first and calculate the union of their segments, compare https://github.com/audeering/audformat/blob/07b000266735ce460af3e4c09b611c15a63f76c0/audformat/core/database.py#L280-L286.
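The union described above can be illustrated with a minimal sketch in plain Python. It does not use audformat: each table is modeled as a list of hypothetical `(file, start, end)` tuples with times in seconds, and the names `tables` and `count_unique_segments` are made up for this illustration.

```python
# Hypothetical stand-in for segmented tables:
# each table is a list of (file, start, end) entries, times in seconds.
tables = {
    "emotion": [("a.wav", 0.0, 1.5), ("a.wav", 1.5, 3.0), ("b.wav", 0.0, 2.0)],
    "speaker": [("a.wav", 0.0, 1.5), ("b.wav", 0.0, 2.0), ("c.wav", 0.5, 1.0)],
}


def count_unique_segments(tables):
    """Count segments that occur in at least one table."""
    unique = set()
    for index in tables.values():
        # Every table has to be loaded completely before it can contribute.
        unique.update(index)
    return len(unique)


n_segments = count_unique_segments(tables)
print(n_segments)
```

The point of the sketch is the cost structure: the count is only known after every table has been read and merged.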

hagenw · 2024-07-10T14:33:18Z

Returning the actual number of unique segments per dataset is a very challenging problem, as we need to load all segmented tables and build a union of their indices, which can be heavy in both memory and computation.

An alternative to counting unique samples would be to just count all samples. This can be done easily for tables stored in parquet, as those files contain a metadata entry listing how many samples they contain. In this case, we would not need to download any table, but could stream the metadata from the backend (if the parquet file is not available in the local cache).

The only downside is that this would not allow us to create a duration distribution plot as we do for the files (compare #95).

But at the moment, I don't see a meaningful way to achieve this.

@ChristianGeng would the overall number of segments be of any value, or do you see another solution for this problem?

ChristianGeng · 2024-07-11T09:55:30Z

I am not sure whether I am correct, but the worst case would be to count the same segments many times, thereby exaggerating the number of segments by a large amount. So the segment count might end up too optimistic. While the "numbers" would look nice they might not be very accurate.
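To make the worst case concrete: if every table indexes the same segments, summing the table sizes overstates the unique count by a factor equal to the number of tables. A small sketch with made-up data, assuming each table lists a segment at most once:

```python
# Three hypothetical tables that all annotate the same two segments.
segments = [("a.wav", 0.0, 1.5), ("a.wav", 1.5, 3.0)]
tables = [list(segments) for _ in range(3)]

summed = sum(len(t) for t in tables)   # cheap count over all rows
unique = len(set().union(*tables))     # accurate count via union
largest = max(len(t) for t in tables)  # size of the biggest table

print(summed, unique, largest)
```

Under the same assumption, `largest <= unique <= summed` always holds, so the cheap count is an upper bound on the number of unique segments and the largest table is a lower bound.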

I cannot judge the relevance, but my understanding is that accurate segment information would be nice to have.

On the implementation side you mention that the accurate computation would depend on audformat.utils.union, and that this can become memory and computation heavy. I am not sure whether I can appreciate the implementation in its full complexity; in particular, there are some points I cannot digest and relate to the code from only a superficial look. These are mainly the commutativity property and its relation to sorting, and the role that UNION_MAX_INDEX_LEN_THRES plays. My understanding is that the objects get concatenated using pd.concat in most or even all cases.

So on that side I do not know whether the union implementation can be improved. I found here that there are a lot of implementation details, but I cannot judge whether any of these approaches can be made productive at all. One could hope that itertools.chain is more memory-friendly and possibly faster than chaining pd.concat and drop_duplicates, but this would need to be tried out, and it might even be inapplicable to our use case.
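For illustration only, the itertools.chain idea could look like the following sketch. It works on plain lists of hypothetical `(file, start, end)` tuples rather than pandas indices, the helper name `iter_unique` is made up, and whether this is actually faster or lighter than the pandas route is untested.

```python
import itertools


def iter_unique(*indices):
    """Yield each segment once, in order of first appearance."""
    seen = set()
    for segment in itertools.chain(*indices):
        if segment not in seen:
            seen.add(segment)
            yield segment


index_a = [("a.wav", 0.0, 1.5), ("b.wav", 0.0, 2.0)]
index_b = [("b.wav", 0.0, 2.0), ("c.wav", 0.5, 1.0)]

unique_segments = list(iter_unique(index_a, index_b))
print(len(unique_segments))
```

Chaining avoids building one large concatenated object, but the `seen` set still has to hold every unique segment, so the memory cost of the result itself does not go away.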

So taken together, relevance is something that depends on external stakeholders, and the implementation is hard for me to judge. Unfortunately I cannot offer a third solution: either one downgrades relevance or upgrades computation (if at all possible). For now we could stay with the "cheap" solution and then see how urgent this is considered.

The second approach would definitely be cumbersome at first, as it would require benchmarking efforts to be able to say anything about the performance. If one goes for this, I would probably first want to finish the polars evaluation before working on a second performance-related issue.

ChristianGeng · 2024-07-11T10:01:00Z

There is possibly yet another angle to look at this from, the "usage of audb.publish" angle: Would a set of recommendations for publishing exist, that helps to facilitate the extraction of correct segment counts? And in case these existed, how would they need to be phrased?

hagenw · 2024-07-11T10:07:00Z

> So on that side I do not know whether the union implementation can be improved.

We have already devoted some time to making audformat.utils.union() as fast as possible. E.g. you can find a benchmark script at https://github.com/audeering/audformat/blob/main/benchmarks/benchmark_union.py, and see audeering/audformat#354 for a speed-up we implemented.

Most likely it can be improved even further, but I think this would be a huge effort, and I would not recommend doing it.

> Would a set of recommendations for publishing exist, that helps to facilitate the extraction of correct segment counts?

You could recommend that a database should always contain a segments table listing all unique segments. But I do not find this very elegant, and it is also cumbersome for users that just want to publish data. An alternative would be that audb automatically publishes this information in another "dependency table", this time listing all segments instead of all files. But at the moment, I'm also not very convinced by that approach.

> While the "numbers" would look nice they might not be very accurate.

I would also be in favor of counting the actual number of segments (using union()) and presenting statistics on the durations of the segments. So, I would propose we first try the implementation from #31 on all our datasets and see how long it takes. I created https://gitlab.audeering.com/data/data.pp.audeering.com/-/merge_requests/99 to try this (this fails for another reason at the moment, see audeering/audformat#449).
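Once the unique segments are known, duration statistics follow directly from `end - start`. The sketch below uses made-up `(file, start, end)` tuples in seconds and the standard library only; it assumes every segment has a defined end time and is not the implementation from #31.

```python
import statistics

# Hypothetical unique segments as (file, start, end), times in seconds.
unique_segments = [
    ("a.wav", 0.0, 1.5),
    ("a.wav", 1.5, 3.0),
    ("b.wav", 0.0, 2.0),
    ("c.wav", 0.5, 1.0),
]

durations = [end - start for _, start, end in unique_segments]

summary = {
    "count": len(durations),
    "min": min(durations),
    "max": max(durations),
    "mean": statistics.mean(durations),
    "median": statistics.median(durations),
}
print(summary)
```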

This was referenced Oct 9, 2023

Add Dataset.segments + Dataset.segment_durations #31

Merged

Separate card from dataset #28

Merged

hagenw closed this as completed in #31 Jul 24, 2024
