Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add map argument to audb.load_table() #447

Merged
merged 6 commits into from
Aug 20, 2024
Merged

Add map argument to audb.load_table() #447

merged 6 commits into from
Aug 20, 2024

Conversation

hagenw
Copy link
Member

@hagenw hagenw commented Aug 14, 2024

Adds the map argument to audb.load_table() to provide the user the possibility to map values of columns if they contain respective scheme labels. Without the new added argument, a user would have to use db = audb.load() + db.get(map=) instead.

It also fixes a bug inside audb.load_table() to only load misc tables, that are needed by a scheme of the selected table. Before, it was downloading all misc tables that were used as labels inside any scheme of the database.

image

@hagenw hagenw marked this pull request as draft August 14, 2024 13:59
@hagenw hagenw marked this pull request as ready for review August 15, 2024 09:33
audb/core/load.py Outdated Show resolved Hide resolved
Copy link
Member

@ChristianGeng ChristianGeng left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

semantics of "map"

https://github.com/audeering/audb/pull/447/files#top

Already previously the map argument in audb has confused me:

In pd.rename, the mapper (or the columns argument) accepts the union of dict-like and function, whereas in pd.map or pd.DataFrame.apply it only accepts a callable, i.e. a function or a lambda. And audb only accepts a dict.

In other words, the audb use of map uses map to designate a hashmap aka dict as a data structure, and not as a higher order function as in functional programming, where the function is applied on each element of a collection.

So there seems to be a slight mismatch of audb on one side, and pandas (and also the builting python map) on the other. Would it make sense to mention this terminology in the documentation at some level? Eventually this behavior is implemented in audformat.Column.get, probably this would be the right location for this. This is not a need-be, but reviewing this PR, I got surprised about the signature of the load_table arg and the terminology of map.

In the docstring you describe the argument as dict_mapping. In a world in which api breaking was not be considrered evil, I would find it a better argument name.

test_load_table_map

Interestingly, this led me to some confusing patterns when testing:

Running a test from the commandline succeeded:

pytest --color=yes 'tests/test_load.py::test_load_table_map'

But when I run the same test through my editor plugin I get this:


FAIL Required test coverage of 100% not reached. Total coverage: 0.00%
======================================= short test summary info ========================================
FAILED tests/test_load.py::test_load_table_map[1.0.0-files-None-expected0] - TypeError: load_table() got an unexpected keyword argument 'map'
FAILED tests/test_load.py::test_load_table_map[1.0.0-files-map1-expected1] - TypeError: load_table() got an unexpected keyword argument 'map'
========================================== 2 failed in 1.12s ===========================================

Normally they converge quite well. I have no explanation on that side.

@hagenw
Copy link
Member Author

hagenw commented Aug 20, 2024

semantics of "map"

The map argument is indeed not straightforward, independent of its name. New users tend to struggle already with the concept, or do not understand how to get "age" if the table contains only a "speaker" column. On the other hand, it's also one of the main features of audformat as it allows us to have several tables, without the need to repeat a lot of the columns all the time, e.g. it is sufficient to have a single speaker column instead of repeating 8 columns with speaker attributes. HuggingFace Datasets solves the problem by not having several tables, but a dict (json) with the single datapoints, each datapoint containing all possible labels. This is more straightforward for the user, but I think we have more flexibility in audformat.

Regarding your question of the allowed type and name of map: it depends if you call it on a column or table. Let's use emodb as an example. First we inspect the speaker table, that is used as labels for the speaker scheme:

>>> db = audb.load("emodb", version="1.4.1", only_metadata=True, full_path=False, verbose=False)
>>> db.schemes["speaker"].labels
'speaker'
>>> db["speaker"].df.head()
         age  gender language
speaker                      
3         31    male      deu
8         34  female      deu
9         21  female      deu
10        32    male      deu
11        26    male      deu

If you use audformat.Column.get(), the map argument is of type string as column to map is already selected, and we can only map to a single label as the return value needs to be a column, e.g.

>>> db["files"]["speaker"].get(map="age").head()
file
wav/03a01Fa.wav    31
wav/03a01Nc.wav    31
wav/03a01Wa.wav    31
wav/03a02Fc.wav    31
wav/03a02Nc.wav    31
Name: age, dtype: Int64

When you use audformat.Table.get() the type of map is Dict[str, Union[str, Sequence[str]]]]. You have to select the column that should be mapped, and you can map it to one or more labels, including the original column, e.g.

>>> db["files"].get(map={"speaker": "age"}).head()
                                 duration transcription  age
file                                                        
wav/03a01Fa.wav    0 days 00:00:01.898250           a01   31
wav/03a01Nc.wav    0 days 00:00:01.611250           a01   31
wav/03a01Wa.wav 0 days 00:00:01.877812500           a01   31
wav/03a02Fc.wav    0 days 00:00:02.006250           a02   31
wav/03a02Nc.wav 0 days 00:00:01.439812500           a02   31

>>> db["files"].get(map={"speaker": ["age", "gender"]}).head()
                                 duration transcription  age gender
file                                                               
wav/03a01Fa.wav    0 days 00:00:01.898250           a01   31   male
wav/03a01Nc.wav    0 days 00:00:01.611250           a01   31   male
wav/03a01Wa.wav 0 days 00:00:01.877812500           a01   31   male
wav/03a02Fc.wav    0 days 00:00:02.006250           a02   31   male
wav/03a02Nc.wav 0 days 00:00:01.439812500           a02   31   male

>>> db["files"].get(map={"speaker": ["speaker", "age"]}).head()
                                 duration speaker transcription  age
file                                                                
wav/03a01Fa.wav    0 days 00:00:01.898250       3           a01   31
wav/03a01Nc.wav    0 days 00:00:01.611250       3           a01   31
wav/03a01Wa.wav 0 days 00:00:01.877812500       3           a01   31
wav/03a02Fc.wav    0 days 00:00:02.006250       3           a02   31
wav/03a02Nc.wav 0 days 00:00:01.439812500       3           a02   31

We thought it would be nice to have the same name for the argument in audformat.Column.get() and audformat.Table.get(), and it should be short, hence we came up with map.

So there seems to be a slight mismatch of audb on one side, and pandas (and also the builting python map) on the other.

You are right, this is indeed not ideal. I'm still not in favor of changing the argument name, but maybe extending the docstring would help.

@hagenw
Copy link
Member Author

hagenw commented Aug 20, 2024

test_load_table_map

The failing tests inside your editor look to me, as the editor has not yet updated the virtual environment to include the latest changes

@ChristianGeng
Copy link
Member

test_load_table_map

The failing tests inside your editor look to me, as the editor has not yet updated the virtual environment to include the latest changes

It was a different problem: Putting the cursor in the test parametrization selects the previos test. So an editor problem.

Copy link
Member

@ChristianGeng ChristianGeng left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All points have been adressed. The meaning of the "map" kwarg departs a little from how it is used elsewhere but this can be mended in a docstring at a later stage.

@hagenw hagenw merged commit 06b029f into main Aug 20, 2024
8 checks passed
@hagenw hagenw deleted the load-table-map branch August 20, 2024 09:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants