add list of all names to Taxon #18

Open
wants to merge 2 commits into
base: master

@mkuhn mkuhn commented Sep 6, 2024

Hi Antonio, not sure if you will find this useful, but I needed a more complete list of names for a taxon in addition to the scientific name. I didn't update the docs yet. If you'd consider merging this, I can also add something to the documentation.

(Pdb) taxopy.Taxon(9606, ncbi_taxonomy).name
'Homo sapiens'

(Pdb) taxopy.Taxon(9606, ncbi_taxonomy).names
[('authority', 'Homo sapiens Linnaeus, 1758'), ('scientific name', 'Homo sapiens'), ('genbank common name', 'human')]

(Pdb) taxopy.Taxon(10090, ncbi_taxonomy).names
[('genbank common name', 'house mouse'), ('includes', 'LK3 transgenic mice'), ('common name', 'mouse'), ('authority', 'Mus musculus Linnaeus, 1758'), ('scientific name', 'Mus musculus'), ('includes', 'Mus sp. 129SV'), ('includes', 'nude mice'), ('includes', 'transgenic mice')]

(Pdb) taxopy.Taxon(9823, ncbi_taxonomy).names
[('genbank common name', 'pig'), ('common name', 'pigs'), ('authority', 'Sus scrofa Linnaeus, 1758'), ('scientific name', 'Sus scrofa'), ('common name', 'swine'), ('common name', 'wild boar')]

Summary by Sourcery

Introduce a new feature to the Taxon class that allows retrieval of a comprehensive list of names for a taxon, enhancing the existing functionality by including various name types such as scientific, common, and authority names.

New Features:

  • Add a new property names to the Taxon class that provides a list of all names associated with a taxon, including scientific, common, and authority names.

Enhancements:

  • Extend the _import_names method to populate a new dictionary taxid2names that stores all names associated with each taxon ID.
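As a rough illustration of the enhancement described above, the sketch below collects every (kind, name) pair per taxon ID from names.dmp-style rows. The function name `parse_names` and the inline sample rows are assumptions made for this example; this is not the code from the pull request, only one way the list-of-tuples structure could be built.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def parse_names(lines: Iterable[str]) -> Dict[int, List[Tuple[str, str]]]:
    """Collect every (kind, name) pair per taxid from names.dmp-style rows.

    Hypothetical helper for illustration; not taxopy's _import_names.
    """
    taxid2names = defaultdict(list)
    for line in lines:
        # Row layout: taxid | name | unique name | name class |
        fields = [field.strip() for field in line.split("|")]
        if len(fields) < 4:
            continue  # skip blank or malformed rows
        taxid, name, _unique_name, kind = fields[:4]
        taxid2names[int(taxid)].append((kind, name))
    # Return a plain dict so unknown taxids raise KeyError on lookup.
    return dict(taxid2names)


rows = [
    "9606\t|\tHomo sapiens Linnaeus, 1758\t|\t\t|\tauthority\t|",
    "9606\t|\tHomo sapiens\t|\t\t|\tscientific name\t|",
    "9606\t|\thuman\t|\t\t|\tgenbank common name\t|",
]
taxid2names = parse_names(rows)
print(taxid2names[9606])
```

The result for taxon 9606 has the same shape as the `names` output shown in the Pdb session above.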

sourcery-ai bot commented Sep 6, 2024

Reviewer's Guide by Sourcery

This pull request adds a new feature to the Taxon class in the taxopy library, introducing a 'names' property that provides a list of all names associated with a taxon, including scientific names, common names, and other variants. The implementation involves modifications to the TaxDb class to store and manage multiple names per taxon, and updates to the Taxon class to expose this new information.

File-Level Changes

Change: Added support for multiple names per taxon in TaxDb class
Details:
  • Modified _import_names method to store all names for each taxon
  • Added taxid2names property to TaxDb class
  • Updated __init__ method to initialize taxid2names
Files: taxopy/core.py

Change: Enhanced Taxon class with new 'names' property
Details:
  • Added _names attribute to store all names for a taxon
  • Implemented names property to access all names
  • Updated class docstring to include information about the new names property
Files: taxopy/core.py

Change: Refactored code for better readability and maintainability
Details:
  • Improved formatting in _download_taxonomy method
  • Updated type hints and return values in relevant methods
Files: taxopy/core.py


@sourcery-ai sourcery-ai bot left a comment

Hey @mkuhn - I've reviewed your changes - here's some feedback:

Overall Comments:

  • Please update the documentation to reflect the new names property and its usage. This will help users understand and utilize this new feature effectively.
  • Consider the performance implications of storing all names for every taxon. You might want to implement lazy loading of names or provide an option to disable this feature for users who don't need it, to minimize memory usage.
Here's what I looked at during the review
  • 🟡 General issues: 2 issues found
  • 🟢 Security: all looks good
  • 🟢 Testing: all looks good
  • 🟢 Complexity: all looks good
  • 🟢 Documentation: all looks good


taxopy/core.py Outdated
@@ -136,6 +136,10 @@ def __init__(
     def taxid2name(self) -> Dict[int, str]:
         return self._taxid2name

+    @property
+    def taxid2names(self) -> Dict[int, str]:

suggestion: Add a docstring for the taxid2names property

To maintain consistency with other properties in the class and improve documentation, consider adding a docstring to the taxid2names property. For example:

@property
def taxid2names(self) -> Dict[int, List[Tuple[str, str]]]:
    """Returns a dictionary mapping taxon IDs to lists of (kind, name) tuples."""
    return self._taxid2names

Also, note that the return type hint should be updated to reflect the actual structure of the returned dictionary.


@@ -302,6 +317,7 @@ def __init__(self, taxid: int, taxdb: TaxDb):
                 "The input integer is not a valid NCBI taxonomic identifier."
             )
         self._name = taxdb.taxid2name[self.taxid]
+        self._names = taxdb.taxid2names[self.taxid]

suggestion: Improve robustness of names attribute assignment

To handle cases where a taxid might not be present in the taxid2names dictionary, consider using the dict.get() method with a default value. This would make the code more robust:

self._names = taxdb.taxid2names.get(self.taxid, [])

This approach ensures that even if a taxid is not found, an empty list is assigned instead of raising a KeyError.

Suggested change
- self._names = taxdb.taxid2names[self.taxid]
+ self._names = taxdb.taxid2names.get(self.taxid, [])

         if self._merged_dmp:
             for oldtaxid, newtaxid in self._oldtaxid2newtaxid.items():
                 taxid2name[oldtaxid] = taxid2name[newtaxid]
-        return taxid2name
+        return taxid2name, taxid2names
Owner

Maybe here we could convert taxid2names to a dict just so both taxid2names and taxid2name are of the same class
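A minimal sketch of why that conversion can matter, assuming (the thread does not say so explicitly) that taxid2names is built as a `collections.defaultdict`. The variable names and the placeholder key -1 are made up for this example.

```python
from collections import defaultdict

# Assumption: names are accumulated in a defaultdict while parsing.
building = defaultdict(list)
building[9606].append(("scientific name", "Homo sapiens"))

# Looking up an unknown key on a defaultdict silently inserts an empty list.
_ = building[-1]
inserted_by_lookup = -1 in building

# Converting to a plain dict restores the usual KeyError for unknown keys,
# which matches how a regular taxid2name dictionary behaves.
del building[-1]
taxid2names = dict(building)
try:
    taxid2names[-1]
    raised = False
except KeyError:
    raised = True

print(inserted_by_lookup, raised)
```

With a plain dict, an invalid taxid fails loudly instead of growing the mapping, so both dictionaries would behave the same way on lookup.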

@apcamargo
Owner

Hi @mkuhn,

Sorry for the delay. I’ve been on post-congress vacation and will be back to work on Thursday.

Thanks for the PR! Let's merge it into the main branch :)

A few comments:

  • The names taxid2name/taxid2names and name/names are a bit too similar, and I feel that users will associate the word "name" with "scientific name." Could we consider alternative names for the attribute? Perhaps something like all_names, though I’m not entirely sure if that solves the issue.

  • I’m not very familiar with the non-scientific names in the taxdump. Are their types consistent? If they are, would it make more sense to use a dictionary for names (or whatever we decide to call this attribute) instead of a list of tuples?

  • Have you checked the additional memory usage for the TaxDb object with the extra names? If it’s substantial (though I doubt it), we might want to make this feature optional when building the TaxDb.

  • Once we sort out these details, it would be great to update the documentation.

@mkuhn
Author

mkuhn commented Sep 11, 2024

No worries! I could go ahead with my own fork for now.

* The names `taxid2name`/`taxid2names` and `name`/`names` are a bit too similar, and I feel that users will associate the word "name" with "scientific name." Could we consider alternative names for the attribute? Perhaps something like `all_names`, though I’m not entirely sure if that solves the issue.
* I’m not very familiar with the non-scientific names in the taxdump. Are their types consistent? If they are, would it make more sense to use a dictionary for `names` (or whatever we decide to call this attribute) instead of a list of tuples?

Hmm, it's probably cleaner to have one internal dictionary, taxid2names, which contains a tuple of names for each kind of name. E.g., for Sus scrofa, these are the names from names.dmp:

9823    |       pig     |               |       genbank common name     |
9823    |       pigs    |       pigs <Sus scrofa>       |       common name     |
9823    |       Sus scrofa Linnaeus, 1758       |               |       authority       |
9823    |       Sus scrofa      |               |       scientific name |
9823    |       swine   |               |       common name     |
9823    |       wild boar       |               |       common name     |

So a taxon can have several names of the same kind (three common names here). To initialize the name of a species, one could then use self._name = taxdb.taxid2names[self.taxid]["scientific name"][0], and we could rename the names property to all_names, which would return a dictionary.

Here are the counts of name types:

25 genbank acronym
233 blast name
662 in-part
2139 acronym
14690 common name
30555 genbank common name
62480 equivalent name
85458 includes
277067 synonym
285547 type material
747958 authority
2608692 scientific name

So it's a limited list.
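The nested layout proposed above (taxid, then kind, then a tuple of names) could be sketched as follows. This is only an illustration under assumptions: `build_taxid2names` is a hypothetical helper, and the sample text reuses the six Sus scrofa rows quoted above rather than reading a real names.dmp file.

```python
from collections import Counter, defaultdict
from typing import Dict, Tuple

NAMES_DMP_ROWS = """\
9823\t|\tpig\t|\t\t|\tgenbank common name\t|
9823\t|\tpigs\t|\tpigs <Sus scrofa>\t|\tcommon name\t|
9823\t|\tSus scrofa Linnaeus, 1758\t|\t\t|\tauthority\t|
9823\t|\tSus scrofa\t|\t\t|\tscientific name\t|
9823\t|\tswine\t|\t\t|\tcommon name\t|
9823\t|\twild boar\t|\t\t|\tcommon name\t|
"""


def build_taxid2names(text: str) -> Dict[int, Dict[str, Tuple[str, ...]]]:
    """Group names by taxid and then by kind (hypothetical helper)."""
    grouped = defaultdict(lambda: defaultdict(list))
    for line in text.splitlines():
        fields = [field.strip() for field in line.split("|")]
        if len(fields) < 4:
            continue  # skip blank or malformed rows
        taxid, name, _unique_name, kind = fields[:4]
        grouped[int(taxid)][kind].append(name)
    # Freeze the result into plain dicts holding tuples.
    return {
        taxid: {kind: tuple(names) for kind, names in kinds.items()}
        for taxid, kinds in grouped.items()
    }


taxid2names = build_taxid2names(NAMES_DMP_ROWS)
scientific_name = taxid2names[9823]["scientific name"][0]

# Count how many names of each kind occur, as in the tally above.
kind_counts = Counter(
    kind
    for kinds in taxid2names.values()
    for kind, names in kinds.items()
    for _ in names
)
print(scientific_name, dict(kind_counts))
```

Because each kind maps to a tuple, a kind with a single entry (such as the scientific name) still needs the `[0]` index.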

* Have you checked the additional memory usage for the `TaxDb` object with the extra names? If it’s substantial (though I doubt it), we might want to make this feature optional when building the `TaxDb`.

Just a rough estimate: names.dmp has 4.1m lines, 2.6m of which are scientific names. So the extra names do take up more memory, but I don't think it's a problem.

* Once we sort out these details, it would be great to update the documentation.

Please let me know if I should go ahead and make the internal taxid2names dict and change the new names tuple to an all_names dict.

@apcamargo
Owner

So, would taxid2names replace taxid2name? That's my assumption, since having both taxid2names[taxid]["scientific name"] and taxid2name[taxid] would be redundant. This change would cause an API break that I'd like to avoid as much as possible.

A potential solution is to convert taxid2name into a class that contains multiple dictionaries (e.g., scientific names, common names, etc.). By implementing the __getitem__ method, we can ensure that taxid2name[taxid] continues to return the scientific name. The other dictionaries could be exposed as properties, returning tuples to accommodate the possibility of multiple entries for common names and other types.

>>> taxid2name[9823]
"Sus scrofa"
>>> taxid2name.common_name[9823]
("pigs <Sus scrofa>", "swine", "wild boar")
>>> taxid2name.acronym[9823]
None  # or KeyError

Converting taxid2name to a class would pave the way for storing additional taxonomy-related data and implementing additional functionality.
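To make the idea concrete, here is a minimal sketch of such a class. The class name `TaxidToName`, its constructor argument, and the sample data are all assumptions for illustration; nothing like this exists in taxopy according to this thread.

```python
from typing import Dict, Mapping, Tuple

NamesByTaxid = Dict[int, Tuple[str, ...]]


class TaxidToName:
    """Hypothetical wrapper; not part of taxopy."""

    def __init__(self, names_by_kind: Mapping[str, NamesByTaxid]) -> None:
        self._names_by_kind = names_by_kind

    def __getitem__(self, taxid: int) -> str:
        # Plain indexing keeps the existing behaviour: the scientific name.
        return self._names_by_kind["scientific name"][taxid][0]

    def _kind(self, kind: str) -> NamesByTaxid:
        # A kind that is absent from the data yields an empty mapping.
        return self._names_by_kind.get(kind, {})

    @property
    def common_name(self) -> NamesByTaxid:
        return self._kind("common name")

    @property
    def acronym(self) -> NamesByTaxid:
        return self._kind("acronym")


taxid2name = TaxidToName(
    {
        "scientific name": {9823: ("Sus scrofa",)},
        "common name": {9823: ("pigs", "swine", "wild boar")},
    }
)
print(taxid2name[9823])
print(taxid2name.common_name[9823])
print(taxid2name.acronym.get(9823))
```

In this sketch, existing callers that write `taxid2name[taxid]` would keep working, while the extra name kinds are reachable through properties. Whether a missing kind should return None or raise KeyError is left open, as in the example above.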

@mkuhn
Author

mkuhn commented Sep 23, 2024

Ah, I thought that taxid2name was considered internal and not part of an API. I would also avoid breaking changes.

I find the code example a bit confusing. As a Python coder, I wouldn't expect a dictionary(-like object) to have an attribute that is then also a dictionary.

Which organization level makes more sense to you: kind → species, or species → kind? The latter is closer to how the data is structured in the dumps, but what I actually want in my use case is all the common names. So querying it the way you propose does make sense.

How about:

>>> taxid2name[9823]
"Sus scrofa"
>>> taxid2all_names["scientific name"][9823]
"Sus scrofa"
>>> taxid2all_names["common name"][9823]
("pigs <Sus scrofa>", "swine", "wild boar")
>>> taxid2all_names["acronym"][9823]
None  # or KeyError

With taxid2name = taxid2all_names["scientific name"] – so there's an additional name, but no overhead in terms of memory.
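A small sketch of this layout, using made-up sample data, shows why the extra name costs no additional memory: `taxid2name` is just a second reference to the same inner dictionary. The `.get` chain for a missing kind is an assumption, since the thread leaves None versus KeyError undecided.

```python
from typing import Any, Dict

# Outer key: kind of name. Inner key: taxid. Sample data for illustration.
taxid2all_names: Dict[str, Dict[int, Any]] = {
    "scientific name": {9606: "Homo sapiens", 9823: "Sus scrofa"},
    "common name": {9823: ("pigs", "swine", "wild boar")},
    "genbank common name": {9606: ("human",), 9823: ("pig",)},
}

# taxid2name is an alias, not a copy: both names refer to one dict object.
taxid2name = taxid2all_names["scientific name"]
same_object = taxid2name is taxid2all_names["scientific name"]

common_names = taxid2all_names["common name"][9823]

# One possible way to treat a kind or taxid with no entries: return None.
acronym = taxid2all_names.get("acronym", {}).get(9823)

print(taxid2name[9823], common_names, acronym, same_object)
```

Since the alias shares the underlying object, existing code that reads `taxid2name[taxid]` keeps working unchanged.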

@apcamargo
Owner

> I find the code example a bit confusing. As a Python coder, I wouldn't expect a dictionary(-like object) to have an attribute that is then also a dictionary.

Good point.

I like your suggestion! Do you think you could update the PR with this API?

2 participants