Add map argument to audb.load_table() #447

hagenw · 2024-08-14T13:52:09Z

Adds the map argument to audb.load_table() to provide the user the possibility to map values of columns if they contain respective scheme labels. Without the new added argument, a user would have to use db = audb.load() + db.get(map=) instead.

It also fixes a bug inside audb.load_table() to only load misc tables, that are needed by a scheme of the selected table. Before, it was downloading all misc tables that were used as labels inside any scheme of the database.

audb/core/load.py

tests/test_load.py

ChristianGeng

semantics of "map"

https://github.com/audeering/audb/pull/447/files#top

Already previously the map argument in audb has confused me:

In pd.rename, the mapper (or the columns argument) accepts the union of dict-like and function, whereas in pd.map or pd.DataFrame.apply it only accepts a callable, i.e. a function or a lambda. And audb only accepts a dict.

In other words, the audb use of map uses map to designate a hashmap aka dict as a data structure, and not as a higher order function as in functional programming, where the function is applied on each element of a collection.

So there seems to be a slight mismatch of audb on one side, and pandas (and also the builting python map) on the other. Would it make sense to mention this terminology in the documentation at some level? Eventually this behavior is implemented in audformat.Column.get, probably this would be the right location for this. This is not a need-be, but reviewing this PR, I got surprised about the signature of the load_table arg and the terminology of map.

In the docstring you describe the argument as dict_mapping. In a world in which api breaking was not be considrered evil, I would find it a better argument name.

`test_load_table_map`

Interestingly, this led me to some confusing patterns when testing:

Running a test from the commandline succeeded:

pytest --color=yes 'tests/test_load.py::test_load_table_map'

But when I run the same test through my editor plugin I get this:


FAIL Required test coverage of 100% not reached. Total coverage: 0.00%
======================================= short test summary info ========================================
FAILED tests/test_load.py::test_load_table_map[1.0.0-files-None-expected0] - TypeError: load_table() got an unexpected keyword argument 'map'
FAILED tests/test_load.py::test_load_table_map[1.0.0-files-map1-expected1] - TypeError: load_table() got an unexpected keyword argument 'map'
========================================== 2 failed in 1.12s ===========================================

Normally they converge quite well. I have no explanation on that side.

hagenw · 2024-08-20T06:59:11Z

semantics of "map"

The map argument is indeed not straightforward, independent of its name. New users tend to struggle already with the concept, or do not understand how to get "age" if the table contains only a "speaker" column. On the other hand, it's also one of the main features of audformat as it allows us to have several tables, without the need to repeat a lot of the columns all the time, e.g. it is sufficient to have a single speaker column instead of repeating 8 columns with speaker attributes. HuggingFace Datasets solves the problem by not having several tables, but a dict (json) with the single datapoints, each datapoint containing all possible labels. This is more straightforward for the user, but I think we have more flexibility in audformat.

Regarding your question of the allowed type and name of map: it depends if you call it on a column or table. Let's use emodb as an example. First we inspect the speaker table, that is used as labels for the speaker scheme:

>>> db = audb.load("emodb", version="1.4.1", only_metadata=True, full_path=False, verbose=False)
>>> db.schemes["speaker"].labels
'speaker'
>>> db["speaker"].df.head()
         age  gender language
speaker                      
3         31    male      deu
8         34  female      deu
9         21  female      deu
10        32    male      deu
11        26    male      deu

If you use audformat.Column.get(), the map argument is of type string as column to map is already selected, and we can only map to a single label as the return value needs to be a column, e.g.

>>> db["files"]["speaker"].get(map="age").head()
file
wav/03a01Fa.wav    31
wav/03a01Nc.wav    31
wav/03a01Wa.wav    31
wav/03a02Fc.wav    31
wav/03a02Nc.wav    31
Name: age, dtype: Int64

When you use audformat.Table.get() the type of map is Dict[str, Union[str, Sequence[str]]]]. You have to select the column that should be mapped, and you can map it to one or more labels, including the original column, e.g.

>>> db["files"].get(map={"speaker": "age"}).head()
                                 duration transcription  age
file                                                        
wav/03a01Fa.wav    0 days 00:00:01.898250           a01   31
wav/03a01Nc.wav    0 days 00:00:01.611250           a01   31
wav/03a01Wa.wav 0 days 00:00:01.877812500           a01   31
wav/03a02Fc.wav    0 days 00:00:02.006250           a02   31
wav/03a02Nc.wav 0 days 00:00:01.439812500           a02   31

>>> db["files"].get(map={"speaker": ["age", "gender"]}).head()
                                 duration transcription  age gender
file                                                               
wav/03a01Fa.wav    0 days 00:00:01.898250           a01   31   male
wav/03a01Nc.wav    0 days 00:00:01.611250           a01   31   male
wav/03a01Wa.wav 0 days 00:00:01.877812500           a01   31   male
wav/03a02Fc.wav    0 days 00:00:02.006250           a02   31   male
wav/03a02Nc.wav 0 days 00:00:01.439812500           a02   31   male

>>> db["files"].get(map={"speaker": ["speaker", "age"]}).head()
                                 duration speaker transcription  age
file                                                                
wav/03a01Fa.wav    0 days 00:00:01.898250       3           a01   31
wav/03a01Nc.wav    0 days 00:00:01.611250       3           a01   31
wav/03a01Wa.wav 0 days 00:00:01.877812500       3           a01   31
wav/03a02Fc.wav    0 days 00:00:02.006250       3           a02   31
wav/03a02Nc.wav 0 days 00:00:01.439812500       3           a02   31

We thought it would be nice to have the same name for the argument in audformat.Column.get() and audformat.Table.get(), and it should be short, hence we came up with map.

So there seems to be a slight mismatch of audb on one side, and pandas (and also the builting python map) on the other.

You are right, this is indeed not ideal. I'm still not in favor of changing the argument name, but maybe extending the docstring would help.

hagenw · 2024-08-20T06:59:51Z

`test_load_table_map`

The failing tests inside your editor look to me, as the editor has not yet updated the virtual environment to include the latest changes

ChristianGeng · 2024-08-20T08:06:25Z

test_load_table_map

The failing tests inside your editor look to me, as the editor has not yet updated the virtual environment to include the latest changes

It was a different problem: Putting the cursor in the test parametrization selects the previos test. So an editor problem.

ChristianGeng

All points have been adressed. The meaning of the "map" kwarg departs a little from how it is used elsewhere but this can be mended in a docstring at a later stage.

hagenw added 3 commits August 14, 2024 15:47

Add map argument to audb.load_table()

68dfd4a

Improve docstring example

61f59fa

Add another example

b760c8d

hagenw marked this pull request as draft August 14, 2024 13:59

Download only required misc tables

264884d

hagenw marked this pull request as ready for review August 15, 2024 09:33

hagenw mentioned this pull request Aug 15, 2024

Add audb.stream() and audb.DatabaseIterator #448

Merged

hagenw requested a review from ChristianGeng August 19, 2024 11:56