Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for download of single instances instead of whole series #97

Open
DanielaSchacherer opened this issue Jul 3, 2024 · 13 comments
Labels
enhancement New feature or request

Comments

@DanielaSchacherer
Copy link
Contributor

For pathology use cases, it is crucial to be able to download only single levels (and safe the effort of downloading a huge amount of data not actually needed).

I think this could be added in the context of adding the instance-level slide-microscopy index (sm_index).

fedorov added a commit to ImagingDataCommons/idc-index-data that referenced this issue Jul 5, 2024
Follow up on #15 and ImagingDataCommons/idc-index#97

This adds selected attributes that vary across individual instances within the series,
tying them to the instances via SOPInstanceUID.

Series-level attributes (including storage bucket folder containing the series)
are expected to be accessible by JOIN on the included SeriesInstanceUID.
fedorov added a commit to ImagingDataCommons/idc-index-data that referenced this issue Jul 9, 2024
* enh: add instance-level SM index

Follow up on #15 and ImagingDataCommons/idc-index#97

This adds selected attributes that vary across individual instances within the series,
tying them to the instances via SOPInstanceUID.

Series-level attributes (including storage bucket folder containing the series)
are expected to be accessible by JOIN on the included SeriesInstanceUID.

* enh: added instance file size
@DanielaSchacherer
Copy link
Contributor Author

DanielaSchacherer commented Jul 10, 2024

Proposal for implementation in case the instance-level index file is added automatically in the same way as the series-level index during IDCClient instantiation:

  • add sopInstanceUID attribute to function IDCClient.download_from_selection()
  • add sopInstanceUID attribute to function IDCClient._safe_filter_by_selection()
  • add function IDCClient._filter_by_dicom_instance_uid()
  • additionally provide IDCClient.download_dicom_instance() function in the same fashion as e.g. download_dicom_series() is provided.

Proposal for implementation in case we don't add the instance-level index automatically.

  • Option 1: We follow the same approach as outlined above, however we request that the instance-level index file is added as another attribute (I personally don't like that approach as it's not good practice implementation-wise)
  • Option 2: We define IDCClient.download_dicom_instance() to be used in that case and require as additional attribute the instance-level index file.

Let me know what you think or if I've missed something.

@vkt1414
Copy link
Collaborator

vkt1414 commented Jul 10, 2024

@DanielaSchacherer
In my opinion, all features specific to pathology could live in a dedicated class for pathology, similar to our cli implementation. We will import IDCClient into it. Then during pathology class client initialization, we can fetch the pathology index.

The pathology client should be able to do everything that idc_index already does, simply by doing a join on series level index. Yes, we will need to rewrite all endpoints available in idc_index again but it should not be hard. After that we can have download_dicom_instance or anything wild you can think of with in the attributes available

@DanielaSchacherer
Copy link
Contributor Author

I like that idea, Vamsi.
@fedorov do you agree? And if so, would it be fine, if I go ahead, implement it in a fork and then make a pull request?

@fedorov
Copy link
Member

fedorov commented Jul 12, 2024

Sorry I missed this discussion earlier.

I would strongly prefer to not introduce additional class for pathology. Going forward, we will only have more of the subject-specific indices (clinical data, segmentations, measurements, CT, MR, RTSTRUCT), so the complexity will only increase, and often we will need to join multiple indices and I think it will make more sense to join them in the same class. Creation of new clients will also require unnecessary duplication of the main index. Also, I do not see any need to separate pathology-specific functionality. What is new is instance-level access, which may be justifiable in situations beyond pathology.

Instead, I suggest (as I proposed and discussed earlier on several occasions) to allow download of additional indices by the user on request. I have no concern adding endpoints @DanielaSchacherer proposed. Whenever instance-level access is requested, we could search all of the indices that are available locally, identify those that contain _instance_index in the name, find SOPInstanceUID in that index, require that instance-level index always includes SeriesInstanceUID and join by series with the main index. If no local index by instance is available - return error.

Along those lines, we would just need to add the following endpoints to the existing IDCClient:

  • list_indices(): this would return name, description, and availability (downloaded or not) for each index we have
  • fetch_indices([index_id]): download index by id if not available locally and load into client variable

@DanielaSchacherer
Copy link
Contributor Author

Which part of these tasks should I do, which one do you, @vkt1414 or @fedorov prefer to do yourself?

@vkt1414
Copy link
Collaborator

vkt1414 commented Jul 15, 2024

Which part of these tasks should I do, which one do you, @vkt1414 or @fedorov prefer to do yourself?

I don't have a good idea on how I'd implement this as there are too many variables. Please go ahead and take a swing if you can @DanielaSchacherer .

@vkt1414
Copy link
Collaborator

vkt1414 commented Jul 15, 2024

if you need any help with unnesting with duckdb, I can certainly help..Here's one foolproof way of unnesting with duckdb. It is slow but I guarantee the accuracy of unnesting.

import pandas as pd
import duckdb

sql=f'''
WITH pathology_index as
(
SELECT
DISTINCT
SOPInstanceUID,
SeriesInstanceUID,
embeddingMediumCodeMeaning,
embeddingMediumCodeDesignatorValueStr,
tissueFixativeCodeMeaning,
tissueFixativeCodeDesignatorValueStr,
stainingUsingSubstanceCodeMeaning,
stainingUsingSubstanceCodeDesignatorValueStr,
PixelSpacing_0,
ImageTypes, 
TransferSyntaxUID, 
instance_size,
crdc_instance_uuid
FROM
  'sm_instance_index.parquet' sii
LEFT JOIN UNNEST(embeddingMedium_CodeMeaning) t1(embeddingMediumCodeMeaning) ON TRUE
LEFT JOIN UNNEST(embeddingMedium_code_designator_value_str) t2(embeddingMediumCodeDesignatorValueStr) ON TRUE
LEFT JOIN UNNEST(tissueFixative_CodeMeaning) t3(tissueFixativeCodeMeaning) ON TRUE
LEFT JOIN UNNEST(tissueFixative_code_designator_value_str) t4(tissueFixativeCodeDesignatorValueStr) ON TRUE
LEFT JOIN UNNEST(staining_usingSubstance_CodeMeaning) t5(stainingUsingSubstanceCodeMeaning) ON TRUE
LEFT JOIN UNNEST(staining_usingSubstance_code_designator_value_str) t6(stainingUsingSubstanceCodeDesignatorValueStr) ON TRUE
LEFT JOIN UNNEST(ImageType) t7(ImageTypes) ON TRUE
)
SELECT 
DISTINCT
* FROM pathology_index
WHERE
    TransferSyntaxUID IN ('1.2.840.10008.1.2.4.50','1.2.840.10008.1.2.4.91')
    AND ImageTypes IN ('VOLUME')
    AND stainingUsingSubstanceCodeMeaning IN ('hematoxylin stain')
'''
df = duckdb.query(sql).to_df()

@DanielaSchacherer
Copy link
Contributor Author

I don't have much time left to work on IDC this month, so progress might be slow, but I can certainly work on it! :)

@DanielaSchacherer
Copy link
Contributor Author

DanielaSchacherer commented Jul 22, 2024

Along those lines, we would just need to add the following endpoints to the existing IDCClient:

  • list_indices(): this would return name, description, and availability (downloaded or not) for each index we have
  • fetch_indices([index_id]): download index by id if not available locally and load into client variable

Started with this in #101.
Also talked with Vamsi on how to best manage multiple index files. In the discussion we didn't see how to avoid having multiple classes for the multiple indices as we need to make sure that all current functionality still works properly.

@fedorov
Copy link
Member

fedorov commented Jul 22, 2024

I am seeing those comments just now, I was away last week. I will review the PR from Daniela after it clears the CI checks and comment then.

@fedorov fedorov added the enhancement New feature or request label Jul 31, 2024
@DanielaSchacherer
Copy link
Contributor Author

Here now my proposal (slightly adapted from the proposal in a previous comment) on how to proceed with the implementation of instance-level access for pathology slides:

  • add sopInstanceUID attribute to function IDCClient.download_from_selection()
    • function will also check for the existence of instance_level index file and throw error otherwise
    • if instance_level index file is available, it will be joined with the main index. The joined table is required as input to the following function IDCClient._safe_filter_by_selection()
  • add sopInstanceUID attribute to function IDCClient._safe_filter_by_selection()
  • add function IDCClient._filter_by_dicom_instance_uid()
  • additionally provide IDCClient.download_dicom_instance() function in the same fashion as e.g. download_dicom_series() is provided.

@fedorov
Copy link
Member

fedorov commented Aug 7, 2024

Sounds good to me overall.

if instance_level index file is available, it will be joined with the main index

Would it make sense to join it when instance index is loaded to include the attributes needed for building the folder hierarchy? Or join every time we need those attributes when download is requested?

@DanielaSchacherer
Copy link
Contributor Author

Would it make sense to join it when instance index is loaded to include the attributes needed for building the folder hierarchy? Or join every time we need those attributes when download is requested?

It would make very much sense. I did that.

Opened a new PR with some comments/things that we might need to discuss: #112

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

3 participants