Add audbcards.Dataset.segments #30
Comments
Returning the actual number of unique segments per dataset is a very challenging problem, as we need to load all segmented tables and build a union of their indices, which can be both memory- and computation-heavy. An alternative to counting unique samples would be to just count all samples. This can be done easily for tables stored in parquet, as those files contain a metadata entry listing how many samples they contain. In this case, we would not need to download any table, but could stream the metadata from the backend (if the parquet file is not available in the local cache). The only downside is that this would not allow us to create a duration distribution plot as we do for the files (compare: #95). But at the moment, I don't see a meaningful way to achieve this. @ChristianGeng would the overall number of segments be of any value, or do you see another solution for this problem?
I am not sure whether I am correct that the worst case would be to count segments possibly many times, thereby exaggerating the number of segments by a large amount. So the segment count might end up too optimistic. While the "numbers" would look nice, they might not be very accurate. I cannot judge the relevance, but my understanding is that accurate segment information would be nice to have. On the implementation side you mention that the accurate computation would depend on So on that side I do not know whether the union implementation can be improved. I found here that there are large implementation details, but cannot judge whether any of these approaches can be made productive at all. One could hope that So taken together, relevance is something that depends on external stakeholders, and implementation is hard for me to judge. Unfortunately, I cannot offer a third solution: either one downgrades relevance or upgrades computation (if at all possible). For now we could stay with the "cheap" solution and then see how urgent this is considered. The second approach would definitely be cumbersome in the first place, as it would require benchmarking efforts to be able to have a say on the performance. In case one goes for this, I would probably first want to finish the polars evaluation before working on a second performance-related issue.
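To make the possible over-count concrete, here is a small sketch that bounds the number of unique segments using only per-table row counts. It assumes that every table has a duplicate-free index; the function name and the numbers are hypothetical.

```python
def segment_count_bounds(rows_per_table: list[int]) -> tuple[int, int]:
    """Bound the number of unique segments from per-table row counts.

    Assumes that the index of every table contains no duplicates.
    """
    if not rows_per_table:
        return 0, 0
    # Lower bound: every other table only repeats segments of the largest one
    lower = max(rows_per_table)
    # Upper bound: no segment is shared between any two tables
    upper = sum(rows_per_table)
    return lower, upper


# Hypothetical database: three tables annotating the same 1000 segments
print(segment_count_bounds([1000, 1000, 1000]))  # (1000, 3000)
```

In this example the summed count would be three times the true value, so the over-count grows with the number of tables that annotate the same segments.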
And possibly there is yet another angle from which one could see it, the "usage of audb.publish" angle: could there be a set of recommendations for publishing that helps to facilitate the extraction of correct segment counts? And if so, how would these need to be phrased?
We have already devoted some time to making Most likely, it can be improved even further, but I think this would be a huge effort, and I would not recommend doing it.
You could recommend that a database should always contain a
I would also be in favor of counting the actual number of segments (using
In #28 (comment) @ChristianGeng proposed to add a property `audbcards.Dataset.datapoints` that would return the number of files or segments based on a possibly existing `files` or `segments` table.

The downside is that neither a `files` nor a `segments` table has to exist inside a database. The ground truth to get all files in a database is `audformat.Database.files`, and `audformat.Database.segments` to get all segments. We already have `audbcards.Dataset.files` to get the number of files. So it seems reasonable to also add `audbcards.Dataset.segments`.

This has one downside though: in order to get the number of all possible segments, we need to load all tables first and calculate the union of existing segments, compare https://github.com/audeering/audformat/blob/07b000266735ce460af3e4c09b611c15a63f76c0/audformat/core/database.py#L280-L286.
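The difference between counting all rows and counting the union can be sketched with plain pandas. This is an illustration only, not the audformat implementation linked above; it assumes a segmented index is a `pandas.MultiIndex` with the levels `file`, `start` and `end`, and the helper `segmented_index` and the example data are hypothetical.

```python
import pandas as pd


def segmented_index(files, starts, ends) -> pd.MultiIndex:
    """Build an index with the levels file, start, end (times in seconds)."""
    return pd.MultiIndex.from_arrays(
        [
            files,
            pd.to_timedelta(starts, unit="s"),
            pd.to_timedelta(ends, unit="s"),
        ],
        names=["file", "start", "end"],
    )


# Two hypothetical tables that share one segment of a.wav
index1 = segmented_index(["a.wav", "a.wav"], [0, 1], [1, 2])
index2 = segmented_index(["a.wav", "b.wav"], [1, 0], [2, 3])

# Cheap count: add up the rows of all tables
total_rows = len(index1) + len(index2)
# Exact count: build the union, which needs all indices in memory
unique_segments = len(index1.union(index2))

print(total_rows, unique_segments)
```

The union requires every index to be loaded, which is where the memory and compute cost for large databases comes from.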