Add support for download of single instances instead of whole series #97

DanielaSchacherer · 2024-07-03T13:35:23Z

For pathology use cases, it is crucial to be able to download only single levels (and safe the effort of downloading a huge amount of data not actually needed).

I think this could be added in the context of adding the instance-level slide-microscopy index (sm_index).

Follow up on #15 and ImagingDataCommons/idc-index#97 This adds selected attributes that vary across individual instances within the series, tying them to the instances via SOPInstanceUID. Series-level attributes (including storage bucket folder containing the series) are expected to be accessible by JOIN on the included SeriesInstanceUID.

* enh: add instance-level SM index Follow up on #15 and ImagingDataCommons/idc-index#97 This adds selected attributes that vary across individual instances within the series, tying them to the instances via SOPInstanceUID. Series-level attributes (including storage bucket folder containing the series) are expected to be accessible by JOIN on the included SeriesInstanceUID. * enh: added instance file size

DanielaSchacherer · 2024-07-10T15:01:59Z

Proposal for implementation in case the instance-level index file is added automatically in the same way as the series-level index during IDCClient instantiation:

add sopInstanceUID attribute to function IDCClient.download_from_selection()
add sopInstanceUID attribute to function IDCClient._safe_filter_by_selection()
add function IDCClient._filter_by_dicom_instance_uid()
additionally provide IDCClient.download_dicom_instance() function in the same fashion as e.g. download_dicom_series() is provided.

Proposal for implementation in case we don't add the instance-level index automatically.

Option 1: We follow the same approach as outlined above, however we request that the instance-level index file is added as another attribute (I personally don't like that approach as it's not good practice implementation-wise)
Option 2: We define IDCClient.download_dicom_instance() to be used in that case and require as additional attribute the instance-level index file.

Let me know what you think or if I've missed something.

vkt1414 · 2024-07-10T16:30:26Z

@DanielaSchacherer
In my opinion, all features specific to pathology could live in a dedicated class for pathology, similar to our cli implementation. We will import IDCClient into it. Then during pathology class client initialization, we can fetch the pathology index.

The pathology client should be able to do everything that idc_index already does, simply by doing a join on series level index. Yes, we will need to rewrite all endpoints available in idc_index again but it should not be hard. After that we can have download_dicom_instance or anything wild you can think of with in the attributes available

DanielaSchacherer · 2024-07-11T09:16:51Z

I like that idea, Vamsi.
@fedorov do you agree? And if so, would it be fine, if I go ahead, implement it in a fork and then make a pull request?

fedorov · 2024-07-12T15:59:43Z

Sorry I missed this discussion earlier.

I would strongly prefer to not introduce additional class for pathology. Going forward, we will only have more of the subject-specific indices (clinical data, segmentations, measurements, CT, MR, RTSTRUCT), so the complexity will only increase, and often we will need to join multiple indices and I think it will make more sense to join them in the same class. Creation of new clients will also require unnecessary duplication of the main index. Also, I do not see any need to separate pathology-specific functionality. What is new is instance-level access, which may be justifiable in situations beyond pathology.

Instead, I suggest (as I proposed and discussed earlier on several occasions) to allow download of additional indices by the user on request. I have no concern adding endpoints @DanielaSchacherer proposed. Whenever instance-level access is requested, we could search all of the indices that are available locally, identify those that contain _instance_index in the name, find SOPInstanceUID in that index, require that instance-level index always includes SeriesInstanceUID and join by series with the main index. If no local index by instance is available - return error.

Along those lines, we would just need to add the following endpoints to the existing IDCClient:

list_indices(): this would return name, description, and availability (downloaded or not) for each index we have
fetch_indices([index_id]): download index by id if not available locally and load into client variable

DanielaSchacherer · 2024-07-15T12:39:25Z

Which part of these tasks should I do, which one do you, @vkt1414 or @fedorov prefer to do yourself?

vkt1414 · 2024-07-15T12:43:11Z

Which part of these tasks should I do, which one do you, @vkt1414 or @fedorov prefer to do yourself?

I don't have a good idea on how I'd implement this as there are too many variables. Please go ahead and take a swing if you can @DanielaSchacherer .

vkt1414 · 2024-07-15T12:51:36Z

if you need any help with unnesting with duckdb, I can certainly help..Here's one foolproof way of unnesting with duckdb. It is slow but I guarantee the accuracy of unnesting.

import pandas as pd
import duckdb

sql=f'''
WITH pathology_index as
(
SELECT
DISTINCT
SOPInstanceUID,
SeriesInstanceUID,
embeddingMediumCodeMeaning,
embeddingMediumCodeDesignatorValueStr,
tissueFixativeCodeMeaning,
tissueFixativeCodeDesignatorValueStr,
stainingUsingSubstanceCodeMeaning,
stainingUsingSubstanceCodeDesignatorValueStr,
PixelSpacing_0,
ImageTypes, 
TransferSyntaxUID, 
instance_size,
crdc_instance_uuid
FROM
  'sm_instance_index.parquet' sii
LEFT JOIN UNNEST(embeddingMedium_CodeMeaning) t1(embeddingMediumCodeMeaning) ON TRUE
LEFT JOIN UNNEST(embeddingMedium_code_designator_value_str) t2(embeddingMediumCodeDesignatorValueStr) ON TRUE
LEFT JOIN UNNEST(tissueFixative_CodeMeaning) t3(tissueFixativeCodeMeaning) ON TRUE
LEFT JOIN UNNEST(tissueFixative_code_designator_value_str) t4(tissueFixativeCodeDesignatorValueStr) ON TRUE
LEFT JOIN UNNEST(staining_usingSubstance_CodeMeaning) t5(stainingUsingSubstanceCodeMeaning) ON TRUE
LEFT JOIN UNNEST(staining_usingSubstance_code_designator_value_str) t6(stainingUsingSubstanceCodeDesignatorValueStr) ON TRUE
LEFT JOIN UNNEST(ImageType) t7(ImageTypes) ON TRUE
)
SELECT 
DISTINCT
* FROM pathology_index
WHERE
    TransferSyntaxUID IN ('1.2.840.10008.1.2.4.50','1.2.840.10008.1.2.4.91')
    AND ImageTypes IN ('VOLUME')
    AND stainingUsingSubstanceCodeMeaning IN ('hematoxylin stain')
'''
df = duckdb.query(sql).to_df()

DanielaSchacherer · 2024-07-15T12:59:15Z

I don't have much time left to work on IDC this month, so progress might be slow, but I can certainly work on it! :)

DanielaSchacherer · 2024-07-22T12:58:59Z

Along those lines, we would just need to add the following endpoints to the existing IDCClient:

list_indices(): this would return name, description, and availability (downloaded or not) for each index we have

fetch_indices([index_id]): download index by id if not available locally and load into client variable

Started with this in #101.
Also talked with Vamsi on how to best manage multiple index files. In the discussion we didn't see how to avoid having multiple classes for the multiple indices as we need to make sure that all current functionality still works properly.

fedorov · 2024-07-22T13:44:11Z

I am seeing those comments just now, I was away last week. I will review the PR from Daniela after it clears the CI checks and comment then.

DanielaSchacherer · 2024-08-07T14:12:12Z

Here now my proposal (slightly adapted from the proposal in a previous comment) on how to proceed with the implementation of instance-level access for pathology slides:

add sopInstanceUID attribute to function IDCClient.download_from_selection()
- function will also check for the existence of instance_level index file and throw error otherwise
- if instance_level index file is available, it will be joined with the main index. The joined table is required as input to the following function IDCClient._safe_filter_by_selection()
add sopInstanceUID attribute to function IDCClient._safe_filter_by_selection()
add function IDCClient._filter_by_dicom_instance_uid()
additionally provide IDCClient.download_dicom_instance() function in the same fashion as e.g. download_dicom_series() is provided.

fedorov · 2024-08-07T22:38:00Z

Sounds good to me overall.

if instance_level index file is available, it will be joined with the main index

Would it make sense to join it when instance index is loaded to include the attributes needed for building the folder hierarchy? Or join every time we need those attributes when download is requested?

DanielaSchacherer · 2024-08-09T13:28:08Z

Would it make sense to join it when instance index is loaded to include the attributes needed for building the folder hierarchy? Or join every time we need those attributes when download is requested?

It would make very much sense. I did that.

Opened a new PR with some comments/things that we might need to discuss: #112

fedorov mentioned this issue Jul 5, 2024

Add slide microscopy (SM) index ImagingDataCommons/idc-index-data#15

Closed

fedorov mentioned this issue Jul 5, 2024

enh: add instance-level SM index ImagingDataCommons/idc-index-data#31

Merged

fedorov mentioned this issue Jul 22, 2024

list_indices and fetch_index implemented #101

Merged

fedorov added the enhancement New feature or request label Jul 31, 2024

DanielaSchacherer mentioned this issue Aug 9, 2024

Implementation of instance-level download #112

Closed

DanielaSchacherer mentioned this issue Aug 13, 2024

Implementation of instance-level download #113

Draft

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add support for download of single instances instead of whole series #97

Add support for download of single instances instead of whole series #97

DanielaSchacherer commented Jul 3, 2024

DanielaSchacherer commented Jul 10, 2024 •

edited

Loading

vkt1414 commented Jul 10, 2024

DanielaSchacherer commented Jul 11, 2024

fedorov commented Jul 12, 2024

DanielaSchacherer commented Jul 15, 2024

vkt1414 commented Jul 15, 2024

vkt1414 commented Jul 15, 2024

DanielaSchacherer commented Jul 15, 2024

DanielaSchacherer commented Jul 22, 2024 •

edited

Loading

fedorov commented Jul 22, 2024

DanielaSchacherer commented Aug 7, 2024

fedorov commented Aug 7, 2024

DanielaSchacherer commented Aug 9, 2024

Add support for download of single instances instead of whole series #97

Add support for download of single instances instead of whole series #97

Comments

DanielaSchacherer commented Jul 3, 2024

DanielaSchacherer commented Jul 10, 2024 • edited Loading

vkt1414 commented Jul 10, 2024

DanielaSchacherer commented Jul 11, 2024

fedorov commented Jul 12, 2024

DanielaSchacherer commented Jul 15, 2024

vkt1414 commented Jul 15, 2024

vkt1414 commented Jul 15, 2024

DanielaSchacherer commented Jul 15, 2024

DanielaSchacherer commented Jul 22, 2024 • edited Loading

fedorov commented Jul 22, 2024

DanielaSchacherer commented Aug 7, 2024

fedorov commented Aug 7, 2024

DanielaSchacherer commented Aug 9, 2024

DanielaSchacherer commented Jul 10, 2024 •

edited

Loading

DanielaSchacherer commented Jul 22, 2024 •

edited

Loading