add list of all names to Taxon #18

mkuhn · 2024-09-06T13:17:08Z

Hi Antonio, not sure if you will find this useful, but I needed a more complete list of names for a taxon in addition to the scientific name. I didn't update the docs yet. If you'd consider merging this, I can also add something to the documentation.

(Pdb) taxopy.Taxon(9606, ncbi_taxonomy).name
'Homo sapiens'

(Pdb) taxopy.Taxon(9606, ncbi_taxonomy).names
[('authority', 'Homo sapiens Linnaeus, 1758'), ('scientific name', 'Homo sapiens'), ('genbank common name', 'human')]

(Pdb) taxopy.Taxon(10090, ncbi_taxonomy).names
[('genbank common name', 'house mouse'), ('includes', 'LK3 transgenic mice'), ('common name', 'mouse'), ('authority', 'Mus musculus Linnaeus, 1758'), ('scientific name', 'Mus musculus'), ('includes', 'Mus sp. 129SV'), ('includes', 'nude mice'), ('includes', 'transgenic mice')]

(Pdb) taxopy.Taxon(9823, ncbi_taxonomy).names
[('genbank common name', 'pig'), ('common name', 'pigs'), ('authority', 'Sus scrofa Linnaeus, 1758'), ('scientific name', 'Sus scrofa'), ('common name', 'swine'), ('common name', 'wild boar')]

Summary by Sourcery

Introduce a new feature to the Taxon class that allows retrieval of a comprehensive list of names for a taxon, enhancing the existing functionality by including various name types such as scientific, common, and authority names.

New Features:

Add a new property names to the Taxon class that provides a list of all names associated with a taxon, including scientific, common, and authority names.

Enhancements:

Extend the _import_names method to populate a new dictionary taxid2names that stores all names associated with each taxon ID.

sourcery-ai · 2024-09-06T13:17:14Z

Reviewer's Guide by Sourcery

This pull request adds a new feature to the Taxon class in the taxopy library, introducing a 'names' property that provides a list of all names associated with a taxon, including scientific names, common names, and other variants. The implementation involves modifications to the TaxDb class to store and manage multiple names per taxon, and updates to the Taxon class to expose this new information.

File-Level Changes

| Change | Details | Files |
| --- | --- | --- |
| Added support for multiple names per taxon in TaxDb class | Modified `_import_names` method to store all names for each taxon; added `taxid2names` property to TaxDb class; updated `__init__` method to initialize `taxid2names` | `taxopy/core.py` |
| Enhanced Taxon class with new 'names' property | Added `_names` attribute to store all names for a taxon; implemented `names` property to access all names; updated class docstring to include information about the new `names` property | `taxopy/core.py` |
| Refactored code for better readability and maintainability | Improved formatting in `_download_taxonomy` method; updated type hints and return values in relevant methods | `taxopy/core.py` |


sourcery-ai

Hey @mkuhn - I've reviewed your changes - here's some feedback:

Overall Comments:

Please update the documentation to reflect the new names property and its usage. This will help users understand and utilize this new feature effectively.
Consider the performance implications of storing all names for every taxon. You might want to implement lazy loading of names or provide an option to disable this feature for users who don't need it, to minimize memory usage.
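One way the lazy-loading idea could look, as a minimal sketch only: `LazyNameStore` and its `loader` argument are hypothetical names invented for illustration and are not part of taxopy. The point is that the full name table is built on first access rather than at construction time.

```python
from functools import cached_property


class LazyNameStore:
    """Hypothetical holder that builds the name table only when first used."""

    def __init__(self, loader):
        # `loader` is any zero-argument callable returning {taxid: [(kind, name), ...]}.
        self._loader = loader
        self.load_count = 0

    @cached_property
    def taxid2names(self):
        # Runs once; the result is cached on the instance afterwards.
        self.load_count += 1
        return self._loader()


store = LazyNameStore(lambda: {9606: [("scientific name", "Homo sapiens")]})
loads_before = store.load_count
first = store.taxid2names
second = store.taxid2names
```

Users who never touch `taxid2names` would not pay for building it; whether that is worth the extra indirection depends on the measured memory cost.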

Here's what I looked at during the review

🟡 General issues: 2 issues found
🟢 Security: all looks good
🟢 Testing: all looks good
🟢 Complexity: all looks good
🟢 Documentation: all looks good


sourcery-ai · 2024-09-06T13:18:20Z

taxopy/core.py

@@ -136,6 +136,10 @@ def __init__(
    def taxid2name(self) -> Dict[int, str]:
        return self._taxid2name

+    @property
+    def taxid2names(self) -> Dict[int, str]:


suggestion: Add a docstring for the taxid2names property

To maintain consistency with other properties in the class and improve documentation, consider adding a docstring to the taxid2names property. For example:

    @property
    def taxid2names(self) -> Dict[int, List[Tuple[str, str]]]:
        """Returns a dictionary mapping taxon IDs to lists of (kind, name) tuples."""
        return self._taxid2names

Also, note that the return type hint should be updated to reflect the actual structure of the returned dictionary.

sourcery-ai · 2024-09-06T13:18:20Z

taxopy/core.py

@@ -302,6 +317,7 @@ def __init__(self, taxid: int, taxdb: TaxDb):
                "The input integer is not a valid NCBI taxonomic identifier."
            )
        self._name = taxdb.taxid2name[self.taxid]
+        self._names = taxdb.taxid2names[self.taxid]


suggestion: Improve robustness of names attribute assignment

To handle cases where a taxid might not be present in the taxid2names dictionary, consider using the dict.get() method with a default value. This would make the code more robust:

self._names = taxdb.taxid2names.get(self.taxid, [])

This approach ensures that even if a taxid is not found, an empty list is assigned instead of raising a KeyError.

Suggested change

-        self._names = taxdb.taxid2names[self.taxid]
+        self._names = taxdb.taxid2names.get(self.taxid, [])
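The difference between the two lookups can be seen in isolation. This is a small sketch that uses a plain dictionary as a stand-in for `taxdb.taxid2names`; the helper names `lookup_strict` and `lookup_lenient` are made up for illustration.

```python
# Stand-in for taxdb.taxid2names; the real mapping is built from names.dmp.
taxid2names = {
    9606: [("scientific name", "Homo sapiens"), ("genbank common name", "human")],
}


def lookup_strict(taxid):
    # Mirrors taxid2names[taxid]: raises KeyError for an unknown taxid.
    return taxid2names[taxid]


def lookup_lenient(taxid):
    # Mirrors taxid2names.get(taxid, []): falls back to an empty list.
    return taxid2names.get(taxid, [])


human = lookup_lenient(9606)
missing = lookup_lenient(-1)

try:
    lookup_strict(-1)
    strict_raised = False
except KeyError:
    strict_raised = True
```

Note that in the diff above the constructor has already rejected invalid taxids before this line runs, so the fallback would only matter if the two dictionaries could get out of sync.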

apcamargo · 2024-09-10T13:15:49Z

taxopy/core.py

        if self._merged_dmp:
            for oldtaxid, newtaxid in self._oldtaxid2newtaxid.items():
                taxid2name[oldtaxid] = taxid2name[newtaxid]
-        return taxid2name
+        return taxid2name, taxid2names


Maybe here we could convert taxid2names to a dict just so both taxid2names and taxid2name are of the same class
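A sketch of what that conversion changes, assuming `taxid2names` is accumulated in a `collections.defaultdict(list)` (the diff excerpt does not show how it is built, so this is a guess):

```python
from collections import defaultdict

# Assumed accumulation pattern; not taken from the actual taxopy code.
names_by_taxid = defaultdict(list)
names_by_taxid[9823].append(("common name", "swine"))
names_by_taxid[9823].append(("scientific name", "Sus scrofa"))

# Convert once, before returning, so callers get an ordinary dict.
plain = dict(names_by_taxid)

# A defaultdict silently creates an entry on a missing key ...
_ = names_by_taxid[0]
created_on_miss = 0 in names_by_taxid

# ... while the plain dict raises KeyError, like taxid2name does.
try:
    plain[0]
    plain_raised = False
except KeyError:
    plain_raised = True
```

Besides keeping both mappings the same class, the conversion avoids the dictionary growing whenever someone looks up an unknown taxid.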

apcamargo · 2024-09-10T13:23:49Z

Hi @mkuhn,

Sorry for the delay. I’ve been on post-congress vacation and will be back to work on Thursday.

Thanks for the PR! Let's merge it into the main branch :)

A few comments:

* The names taxid2name/taxid2names and name/names are a bit too similar, and I feel that users will associate the word "name" with "scientific name." Could we consider alternative names for the attribute? Perhaps something like all_names, though I’m not entirely sure if that solves the issue.
* I’m not very familiar with the non-scientific names in the taxdump. Are their types consistent? If they are, would it make more sense to use a dictionary for names (or whatever we decide to call this attribute) instead of a list of tuples?
* Have you checked the additional memory usage for the TaxDb object with the extra names? If it’s substantial (though I doubt it), we might want to make this feature optional when building the TaxDb.
* Once we sort out these details, it would be great to update the documentation.

mkuhn · 2024-09-11T14:44:08Z

No worries! I could go ahead with my own fork for now.

> * The names `taxid2name`/`taxid2names` and `name`/`names` are a bit too similar, and I feel that users will associate the word "name" with "scientific name." Could we consider alternative names for the attribute? Perhaps something like `all_names`, though I’m not entirely sure if that solves the issue.
> * I’m not very familiar with the non-scientific names in the taxdump. Are their types consistent? If they are, would it make more sense to use a dictionary for `names` (or whatever we decide to call this attribute) instead of a list of tuples?

Hmm, it's probably cleaner to have one internal dictionary, taxid2names, which contains a tuple of names for each kind of name. E.g. for Sus scrofa, these are the names from names.dmp:

9823    |       pig     |               |       genbank common name     |
9823    |       pigs    |       pigs <Sus scrofa>       |       common name     |
9823    |       Sus scrofa Linnaeus, 1758       |               |       authority       |
9823    |       Sus scrofa      |               |       scientific name |
9823    |       swine   |               |       common name     |
9823    |       wild boar       |               |       common name     |

So there are two common names. To initialize the name of a species, one could then get it with self._name = taxdb.taxid2names[self.taxid]["scientific name"][0] and we could change the property names to all_names which returns a dictionary.
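A rough sketch of how rows like these could be grouped into the proposed nested structure. It assumes the usual taxdump layout, where fields are separated by `\t|\t` and each row ends with `\t|`; the function name `group_names` is hypothetical and the sample rows are rebuilt from the listing above.

```python
from collections import defaultdict

# Fields per row: tax_id, name_txt, unique name, name class.
RAW_ROWS = [
    ("9823", "pig", "", "genbank common name"),
    ("9823", "pigs", "pigs <Sus scrofa>", "common name"),
    ("9823", "Sus scrofa Linnaeus, 1758", "", "authority"),
    ("9823", "Sus scrofa", "", "scientific name"),
    ("9823", "swine", "", "common name"),
    ("9823", "wild boar", "", "common name"),
]
LINES = ["\t|\t".join(row) + "\t|\n" for row in RAW_ROWS]


def group_names(lines):
    """Return {taxid: {kind: (name, ...)}} from names.dmp-style lines."""
    grouped = defaultdict(lambda: defaultdict(list))
    for line in lines:
        # Drop the row terminator, then split on the field separator.
        taxid, name, _unique_name, kind = line.rstrip("\t|\n").split("\t|\t")
        grouped[int(taxid)][kind].append(name)
    # Freeze into plain dicts of tuples so lookups of unknown keys raise KeyError.
    return {
        taxid: {kind: tuple(names) for kind, names in kinds.items()}
        for taxid, kinds in grouped.items()
    }


taxid2names = group_names(LINES)
scientific = taxid2names[9823]["scientific name"][0]
common = taxid2names[9823]["common name"]
```

This sketch keeps the `name_txt` column and ignores the unique-name column; which of the two to expose is a separate decision.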

Here are the counts of name types:

25 genbank acronym
233 blast name
662 in-part
2139 acronym
14690 common name
30555 genbank common name
62480 equivalent name
85458 includes
277067 synonym
285547 type material
747958 authority
2608692 scientific name

So it's a limited list.
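Counts like these could be reproduced with `collections.Counter` over the name-class column. The sketch below runs on the six Sus scrofa rows only, not on the full names.dmp, and is not necessarily how the numbers above were obtained.

```python
from collections import Counter

# Name-class column of the six Sus scrofa rows shown earlier.
name_classes = [
    "genbank common name",
    "common name",
    "authority",
    "scientific name",
    "common name",
    "common name",
]

counts = Counter(name_classes)

# Print in ascending order of frequency, like the listing above.
for kind, n in sorted(counts.items(), key=lambda item: item[1]):
    print(n, kind)
```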

> * Have you checked the additional memory usage for the `TaxDb` object with the extra names? If it’s substantial (though I doubt it), we might want to make this feature optional when building the `TaxDb`.

Just a rough estimate: names.dmp has 4.1m lines, 2.6m of which are scientific names. So the extra names do take up more memory, but I don't think it's a problem.

> * Once we sort out these details, it would be great to update the documentation.

Please let me know if I should go ahead and make the internal taxid2names dict and change the new names tuple to an all_names dict.

apcamargo · 2024-09-20T03:59:03Z

So, would taxid2names replace taxid2name? That's my assumption, since having both taxid2names[taxid]["scientific name"] and taxid2name[taxid] would be redundant. This change would cause an API break that I'd like to avoid as much as possible.

A potential solution is to convert taxid2name into a class that contains multiple dictionaries (e.g., scientific names, common names, etc.). By implementing the __getitem__ method, we can ensure that taxid2name[taxid] continues to return the scientific name. The other dictionaries could be exposed as properties, returning tuples to accommodate the possibility of multiple entries for common names and other types.

>>> taxid2name[9823]
"Sus scrofa"
>>> taxid2name.common_name[9823]
("pigs <Sus scrofa>", "swine", "wild boar")
>>> taxid2name.acronym[9823]
None  # or KeyError

Converting taxid2name to a class would pave the way for storing additional taxonomy-related data and implementing some additional functionality.
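A minimal sketch of such a class, to make the proposal concrete. `NameIndex` and its constructor arguments are hypothetical, and only one extra name type is shown.

```python
class NameIndex:
    """Hypothetical dictionary-like wrapper around several name tables."""

    def __init__(self, scientific, common):
        self._scientific = scientific  # {taxid: str}
        self._common = common          # {taxid: (str, ...)}

    def __getitem__(self, taxid):
        # Keeps taxid2name[taxid] returning the scientific name, as before.
        return self._scientific[taxid]

    @property
    def common_name(self):
        # Exposes the other table as a separate mapping.
        return self._common


taxid2name = NameIndex(
    scientific={9823: "Sus scrofa"},
    common={9823: ("pigs", "swine", "wild boar")},
)
sci = taxid2name[9823]
commons = taxid2name.common_name[9823]
```

Code that uses other dictionary methods on `taxid2name` (for example `.get()`, `.items()` or `in`) would still break unless those are implemented too, for instance by subclassing `collections.abc.Mapping`.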

mkuhn · 2024-09-23T16:01:40Z

Ah, I thought that taxid2name was considered internal and not part of an API. I would also avoid breaking changes.

I find the code example a bit confusing. As a Python coder, I wouldn't expect a dictionary(-like object) to have an attribute that is then also a dictionary.

Which level of organization makes more sense to you: kind → species or species → kind? The latter is closer to how the data is structured in the dumps, but what I actually want in my use case is all the common names. So querying it the way you propose does make sense.

How about:

>>> taxid2name[9823]
"Sus scrofa"
>>> taxid2all_names["scientific name"][9823]
"Sus scrofa"
>>> taxid2all_names["common name"][9823]
("pigs <Sus scrofa>", "swine", "wild boar")
>>> taxid2all_names["acronym"][9823]
None  # or KeyError

With taxid2name = taxid2all_names["scientific name"] – so there's an additional name, but no overhead in terms of memory.
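The "no overhead" point follows from Python assignment binding a second name to the same object rather than copying it. A small sketch with hand-written data (the variable names follow the proposal; the contents are illustrative):

```python
# Outer key: kind of name. Inner key: taxid.
taxid2all_names = {
    "scientific name": {9823: "Sus scrofa"},
    "common name": {9823: ("pigs", "swine", "wild boar")},
}

# Alias, not a copy: both names refer to the same inner dictionary.
taxid2name = taxid2all_names["scientific name"]
same_object = taxid2name is taxid2all_names["scientific name"]

# A kind with no entries can be handled with .get() instead of raising KeyError.
acronym = taxid2all_names.get("acronym", {}).get(9823)
```

Because the inner dictionary is shared, existing code that reads `taxid2name[taxid]` keeps working unchanged.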

apcamargo · 2024-10-03T22:09:34Z

> I find the code example a bit confusing. As a Python coder, I wouldn't expect a dictionary(-like object) to have an attribute that is then also a dictionary.

Good point.

I like your suggestion! Do you think you could update the PR with this API?

add list of all names to Taxon

d84d6da

update types

8da90f5
