diff --git a/docs/datasets.md b/docs/datasets.md
index 58f6561..c6e82c6 100644
--- a/docs/datasets.md
+++ b/docs/datasets.md
@@ -3,48 +3,55 @@ Datasets
Datasets used in DeSide
***
-## scRNA-seq datasets
-
-| Dataset ID | Journal | DOI | Publish Date | Reported cells (total) | Organism | Tissue | Data location | Sequencing method | #patients |
-|-------------|----------------------|-----------------------------|--------------|------------------|------------------------|----------------------------------|-----------------------------------|-------------------------|-----------|
-| hnscc_cillo_01 | Immunity | 10.1016/j.immuni.2019.11.014 | 20200107 | 131,224 | Human | Head and Neck Cancer (HNSC) | GSE139324 | 10x Single Cell 3' v2 | 26 |
-| pdac_pengj_02 | Cell Res | 10.1038/s41422-019-0195-y | 20190704 | 57,530 | Human | Pancreatic Ductal Adenocarcinoma (PDAC)| [Link](https://bigd.big.ac.cn/gsa/browse/CRA001160) | 10x Single Cell 3' v2 | 22 |
-| hnscc_puram_03 | Cell | 10.1016/j.cell.2017.10.044 | 20171130 | 5,902 | Human | Head and Neck Cancer (HNSC) | GSE103322 | Smart-seq2 | 16 |
-| pdac_steele_04 | Nat Cancer | 10.1038/s43018-020-00121-4 | 20201026 | 124,898 | Human | Pancreatic Ductal Adenocarcinoma (PDAC)| GSE155698 | 10x Single Cell 3' v2 | 15 |
-| luad_kim_05 | Nat Commun | 10.1038/s41467-020-16164-1 | 20200508 | 208,506 | Human | Lung Adenocarcinoma (LUAD) | GSE131907 | 10x Single Cell 3' v2 | 13 |
-| nsclc_guo_06 | Nature Medicine | 10.1038/s41591-018-0045-3 | 20180625 | 12,346 | Human | Non-Small-Cell Lung Cancer (NSCLC) | GSE99254 | Smart-Seq2 | 13 |
-| pan_cancer_07 | Nat Genet | 10.1038/s41588-020-00726-6 | 20201030 | 53,513 | Human | Cancer cell lines | GSE157220 | Illumina NextSeq 500 | - |
-
-
-- The number of **reported cells** may include cells that don't originate from solid tumors, which were removed during integrating.
-
-## Merged datasets and Synthetic datasets
-
-| Dataset name | #samples | Sampling method | Filtering | #cell types | #genes | Input dataset | GEPs (type, fortmat) | Dataset type | Notation |
-|:--------------------------------------:|----------|-----------------|-----------|-------------|--------|--------------------------------|:-------------------------------:|:-----------------------------:|:----------:|
-| TCGA | 7,699 | - | - | - | 19,712 | - | MCT, `TPM` | Downloaded from TCGA | DA |
-| merged_7_sc_datasets | 135,049 | - | - | 11 | 12,114 | 7 collected scRNA-seq datasets | Single cell, `log2(TPM+1)` | Raw dataset from scRNA-seq | S0 |
-| SCT_POS_N10K | 110,000 | n_base=100 | - | 11 | 12,114 | S0 | SCT, `log2(TPM+1)` | Used to simulate MCT datasets | S1 |
-| Mixed_N100K_random | 100,000 | Random | No | 11 | 12,114 | S1 | MCT, `log2(TPM+1)` | Training set | D0 |
-| Mixed_N100K_segment | 100,000 | Segment | Yes | 11 | 6,168 | S1 | MCT, `log2(TPM+1)` | Training set | D1 |
-| Mixed_N100K_segment_without_filtering | 100,000 | Segment | No | 11 | 12,114 | S1 | MCT, `log2(TPM+1)` | Training set | D2 |
-| Test_set_random | 3,000 | Random | No | 11 | 12,114 | S1 | MCT, `log2(TPM+1)` | Test set | T0 |
-| Test_set1 | 3,000 | Segment | Yes | 11 | 6,168 | S1 | MCT, `log2(TPM+1)` | Test set | T1 |
-| Test_set2 | 3,000 | Segment | No | 11 | 12,114 | S1 | MCT, `log2(TPM+1)` | Test set | T2 |
-| SCT_POS_N100 | 1100 | n_base=100 | - | 11 | 12,114 | S0 | SCT, `log2(TPM+1)` | Test set | T3 |
-
-- MCT: Bulk gene expression profile with multiple different cell types
-- SCT: Bulk gene expression profile with single cell type (scGEP)
+
+## Merged datasets and Synthetic datasets (Table S1)
+
+| Dataset name | #samples | Sampling method | Filtering | #cell types | #genes | Input dataset | GEPs (type, format) | Dataset type | Notation |
+|:------------------------------------------:|-------------|-----------------|-----------|-------------|--------|---------------------------------|:-------------------------------:|:-----------------------------:|:--------:|
+| TCGA | 7,699 | - | - | - | 19,712 | - | MCT, `TPM` | Downloaded from TCGA | DA |
+| merged_12_sc_datasets | 325,474 | - | - | 19 | 17,834 | 12 collected scRNA-seq datasets | Single cell, `log2(TPM+1)` | Raw dataset from scRNA-seq | S0 |
+| SCT_POS_N10K | 10,000 x 16 | n_base=100 | - | 16 | 17,834 | S0 | SCT, `log2(TPM+1)` | Used to simulate MCT datasets | S1 |
+| Mixed_N100K_random | 100,000 | Random | No | 16 | 17,834 | S1 | MCT, `log2(TPM+1)` | Training set | D0 |
+| Mixed_N100K_segment | 100,000 | Segment | Yes | 16 | 9,028 | S1 | MCT, `log2(TPM+1)` | Training set | D1 |
+| Mixed_N100K_segment_without_filtering | 100,000 | Segment | No | 16 | 17,834 | S1 | MCT, `log2(TPM+1)` | Training set | D2 |
+| Test_set_random | 3,000 | Random | No | 16 | 17,834 | S1 | MCT, `log2(TPM+1)` | Test set | T0 |
+| Test_set1 | 3,000 | Segment | Yes | 16 | 9,028 | S1 | MCT, `log2(TPM+1)` | Test set | T1 |
+| Test_set2 | 3,000 | Segment | No | 16 | 17,834 | S1 | MCT, `log2(TPM+1)` | Test set | T2 |
+| SCT_POS_N100 | 100 x 16 | n_base=100 | - | 16 | 17,834 | S0 | SCT, `log2(TPM+1)` | Test set | T3 |
+
+- MCT: Bulk gene expression profiles with multiple different cell types
+- SCT: Bulk gene expression profiles with single cell type (sctGEPs)
- GEPs: Gene expression profiles
+## Collected scRNA-seq datasets (Table S2)
+
+| Dataset ID | Journal | DOI | Publish Date | Reported cells (total)* | Integrated cells (used) | Organism | Tissue | Data location | Sequencing method | #patients** |
+|--------------------|-----------------|------------------------------|--------------|-------------------------|-------------------------|----------|-----------------------------------------|---------------------------------------------------------|---------------------------|-------------|
+| hnscc_cillo_01 | Immunity | 10.1016/j.immuni.2019.11.014 | 20200107 | 131,224 | 57,034 | Human | Head and Neck Cancer (HNSC) | GSE139324 | 10x Single Cell 3' v2 | 26 |
+| pdac_pengj_02 | Cell Res | 10.1038/s41422-019-0195-y | 20190704 | 57,530 | 37,079 | Human | Pancreatic Ductal Adenocarcinoma (PDAC) | [Link](https://bigd.big.ac.cn/gsa/browse/CRA001160) | 10x Single Cell 3' v2 | 22 |
+| hnscc_puram_03 | Cell | 10.1016/j.cell.2017.10.044 | 20171130 | 5,902 | 4,647 | Human | Head and Neck Cancer (HNSC) | GSE103322 | Smart-seq2 | 16 |
+| pdac_steele_04 | Nat Cancer | 10.1038/s43018-020-00121-4 | 20201026 | 124,898 | 32,062 | Human | Pancreatic Ductal Adenocarcinoma (PDAC) | GSE155698 | 10x Single Cell 3' v2 | 15 |
+| luad_kim_05 | Nat Commun | 10.1038/s41467-020-16164-1 | 20200508 | 208,506 | 49,959 | Human | Lung Adenocarcinoma (LUAD) | GSE131907 | 10x Single Cell 3' v2 | 13 |
+| nsclc_guo_06 | Nature Medicine | 10.1038/s41591-018-0045-3 | 20180625 | 12,346 | 4,050 | Human | Non-Small-Cell Lung Cancer (NSCLC) | GSE99254 | Smart-seq2 | 13 |
+| pan_cancer_07 | Nat Genet | 10.1038/s41588-020-00726-6 | 20201030 | 53,513 | 30,681 | Human | Cancer cell lines | GSE157220 | Illumina NextSeq 500 | - |
+| prad_cheng_08 | Nat Cell Biol | 10.1038/s41556-020-00613-6 | 20211108 | 36,424 | 28,253 | Human | Prostate cancer (PRAD) | https://www.weizmann.ac.il/sites/3CA/prostate | 10x Genomics | 12 |
+| prad_dong_09 | Commun Biol | 10.1038/s42003-020-01476-1 | 20201216 | 21,292 | 16,472 | Human | Prostate cancer (PRAD) | https://www.weizmann.ac.il/sites/3CA/prostate | 10x Genomics | 6 |
+| hcc_sun_10 | Cell | 10.1016/j.cell.2020.11.041 | 20201123 | 16,498 | 11,365 | Human | Hepatocellular carcinoma (HCC) | https://www.weizmann.ac.il/sites/3CA/liverbiliary | 10x Genomics | 16 |
+| gbm_neftel_11 | Cell | 10.1016/j.cell.2019.06.024 | 20190618 | 24,131 | 16,835 | Human | Glioblastoma multiforme (GBM) | https://www.weizmann.ac.il/sites/3CA/brain (GSE131928) | 10x Genomics | 36 |
+| gbm_abdelfattah_12 | Nat Commun | 10.1038/s41467-022-28372-y | 20220909 | 201,986 | 37,037 | Human | Glioblastoma multiforme (GBM) | GSE182109 | 10x Chromium / HiSeq 4000 | 8 |
+
+- \* The number of **reported cells** may include cells that do not originate from solid tumors; these cells were removed during integration.
+- \*\* Counts only the patients (samples) whose data were integrated into the final dataset.
+
+
## Download
-- TCGA: [download link](https://figshare.com/articles/dataset/Merged_gene_expression_profiles_TPM_/23047547)
-- merged_7_sc_datasets (S0): [download link](https://figshare.com/articles/dataset/Dataset_S0/23283908)
-- SCT_POS_N10K (S1): [download link](https://figshare.com/articles/dataset/Dataset_S1/23043560)
-- Mixed_N100K_random (D0): [download link](https://figshare.com/articles/dataset/Dataset_D0/23283932)
-- Mixed_N100K_segment (D1): [download link](https://figshare.com/articles/dataset/Dataset_D1/23047391)
-- Mixed_N100K_segment_without_filtering (D2): [download link](https://figshare.com/articles/dataset/Dataset_D2/23284256)
-- All Test Sets: [download link](https://figshare.com/articles/dataset/All_Test_Sets/23283884)
+- TCGA (DA): [merged_tpm.csv.zip](https://doi.org/10.6084/m9.figshare.23047547.v1)
+- merged_12_sc_datasets (S0): [merged_12_sc_datasets_231003.h5ad](https://doi.org/10.6084/m9.figshare.23283908.v2)
+- SCT_POS_N10K (S1): [simu_bulk_exp_SCT_N10K_S1_16sct.h5ad](https://doi.org/10.6084/m9.figshare.23043560.v2)
+- Mixed_N100K_random (D0): [simu_bulk_exp_Mixed_N100K_random_log2cpm1p.h5ad](https://doi.org/10.6084/m9.figshare.23283932.v2)
+- Mixed_N100K_segment (D1): [simu_bulk_exp_Mixed_N100K_D1.h5ad](https://doi.org/10.6084/m9.figshare.23047391.v2)
+- Mixed_N100K_segment_without_filtering (D2): [simu_bulk_exp_Mixed_N100K_D2.h5ad](https://doi.org/10.6084/m9.figshare.23284256.v2)
+- All Test Sets: [all_test_sets.zip](https://doi.org/10.6084/m9.figshare.23283884.v3)
- Test_set_random (T0)
- Test_set1 (T1)
- Test_set2 (T2)
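
As a quick consistency check on the two tables, the per-dataset `Integrated cells (used)` counts in Table S2 can be summed and compared with the `#samples` value of the merged single-cell dataset S0 (325,474) in Table S1. The sketch below only copies numbers from the tables; the variable names are illustrative and not part of DeSide.

```python
# Integrated cell counts copied from the "Integrated cells (used)" column of Table S2.
integrated_cells = {
    'hnscc_cillo_01': 57034,
    'pdac_pengj_02': 37079,
    'hnscc_puram_03': 4647,
    'pdac_steele_04': 32062,
    'luad_kim_05': 49959,
    'nsclc_guo_06': 4050,
    'pan_cancer_07': 30681,
    'prad_cheng_08': 28253,
    'prad_dong_09': 16472,
    'hcc_sun_10': 11365,
    'gbm_neftel_11': 16835,
    'gbm_abdelfattah_12': 37037,
}

# "#samples" reported for dataset S0 in Table S1.
s0_reported_samples = 325474

total_cells = sum(integrated_cells.values())
print(f'{len(integrated_cells)} datasets, {total_cells:,} integrated cells')
print('matches S0:', total_cells == s0_reported_samples)
```

The two figures agree, so every cell in S0 is accounted for by one of the 12 collected datasets.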
diff --git a/docs/usage.md b/docs/usage.md
index 74ab6fb..02ad8b0 100644
--- a/docs/usage.md
+++ b/docs/usage.md
@@ -75,16 +75,22 @@ This package provides functions to plot the results of DeSide.
## DeSide model
There are two ways to use DeSide.
-Firstly, you can use the provided pre-trained model to directly predict cell proportions,
+Firstly, you can use the provided pre-trained model to predict cell proportions directly,
eliminating the need to train the model by yourself.
Alternatively, you can sequentially execute the `Dataset Simulation` and `Model Training` modules, training the model from scratch.
-Subsequently, you can use the self-trained model to predict cell proportions.
+You can then use the self-trained model to predict cell proportions.
### Model Prediction
Using the pre-trained model or self-trained model, you can predict cell proportions in bulk gene expression profiles (bulk GEPs) by using the [`deside_model.predict`](https://deside.readthedocs.io/en/latest/func/deconvolution.html#deside.decon_cf.DeSide.predict) function.
```python
+import os
+import pandas as pd
+from deside.utility import check_dir
+from deside.decon_cf import DeSide
+from deside.utility.read_file import read_gene_set
+
# bulk gene expression profiles (GEPs) in TPM format
bulk_tpm_file_path = 'path/xx_TPM.csv'
bulk_tpm = pd.read_csv(bulk_tpm_file_path, index_col=0)
@@ -93,6 +99,26 @@ bulk_tpm = pd.read_csv(bulk_tpm_file_path, index_col=0)
result_dir = './results'
y_pred_file_path = os.path.join(result_dir, 'y_pred.csv')
check_dir(result_dir)
+dataset_dir = './datasets/'
+
+# hyper-parameters of the DNN model
+deside_parameters = {
+ 'architecture': ([200, 2000, 2000, 2000, 50], [0.05, 0.05, 0.05, 0.2, 0]),
+ 'architecture_for_pathway_network': ([50, 500, 500, 500, 50], [0, 0, 0, 0, 0]),
+ 'loss_function_alpha': 0.5, # alpha*mae + (1-alpha)*rmse, mae means mean absolute error
+ 'normalization': 'layer_normalization', # batch_normalization / layer_normalization / None
+ # 1 means to add a normalization layer, input | the first hidden layer | ... | output
+ 'normalization_layer': [0, 0, 1, 1, 1, 1], # 1 more parameter than the number of hidden layers
+ 'pathway_network': True, # using an independent pathway network
+ 'last_layer_activation': 'sigmoid', # sigmoid / softmax
+ 'learning_rate': 1e-4,
+ 'batch_size': 128}
+
+# read two gene sets as pathway mask
+gene_set_file_path1 = os.path.join(dataset_dir, 'gene_set', 'c2.cp.kegg.v2023.1.Hs.symbols.gmt')
+gene_set_file_path2 = os.path.join(dataset_dir, 'gene_set', 'c2.cp.reactome.v2023.1.Hs.symbols.gmt')
+all_pathway_files = [gene_set_file_path1, gene_set_file_path2]
+pathway_mask = read_gene_set(all_pathway_files) # genes by pathways
# read pre-trained DeSide model
model_dir = './DeSide_model/'
@@ -102,7 +128,9 @@ deside_model = DeSide(model_dir=model_dir)
deside_model.predict(input_file=bulk_tpm_file_path,
output_file_path=y_pred_file_path,
exp_type='TPM', transpose=True,
- scaling_by_sample=False, scaling_by_constant=True)
+ scaling_by_sample=False, scaling_by_constant=True,
+ hyper_params=deside_parameters,
+ pathway_mask=pathway_mask)
```
- A complete example in jupyter notebook can be found: [E1 - Using pre-trained model.ipynb](https://github.com/OnlyBelter/DeSide_mini_example/blob/main/E1%20-%20Using%20pre-trained%20model.ipynb).
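
The `loss_function_alpha` comment in the snippet above describes the loss as `alpha*mae + (1-alpha)*rmse`. A minimal pure-Python sketch of that weighting is shown below. The function name `combined_loss` and the choice to flatten all values into a single list are assumptions made for illustration; how DeSide aggregates errors across samples and cell types is not shown here.

```python
import math


def combined_loss(y_true, y_pred, alpha=0.5):
    """Hypothetical helper: alpha * MAE + (1 - alpha) * RMSE over two flat lists."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError('y_true and y_pred must be non-empty and of equal length')
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / len(errors)             # mean absolute error
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))  # root mean squared error
    return alpha * mae + (1 - alpha) * rmse


# Two true proportions and their predictions (made-up numbers).
example_loss = combined_loss([0.5, 0.5], [0.4, 0.7], alpha=0.5)
print(round(example_loss, 4))
```

With `alpha=1` the expression reduces to the plain MAE, and with `alpha=0` to the plain RMSE; `0.5` weights the two equally.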
@@ -110,25 +138,60 @@ deside_model.predict(input_file=bulk_tpm_file_path,
Training a model using the provided training set.
```python
+import os
+import pandas as pd
+from deside.decon_cf import DeSide
+from deside.utility import check_dir, sorted_cell_types
+from deside.utility.read_file import read_gene_set
+
# create output directory
result_dir = './results'
check_dir(result_dir)
+dataset_dir = './datasets/'
# using dataset D1 as the training set
training_set2file_path = {
- 'D1': './datasets/simulated_bulk_cell_dataset/simu_bulk_exp_Mixed_N100K_D1.h5ad',
+ 'D1': './datasets/simulated_bulk_cell_dataset/D1/simu_bulk_exp_Mixed_N100K_D1.h5ad',
}
-all_cell_types = sorted_cell_types
-
-# set hyper-parameters of the DNN model
-deside_parameters = {'architecture': ([100, 1000, 1000, 1000, 50],
- [0, 0, 0, 0.2, 0]),
- 'loss_function': 'mae+rmse',
- 'batch_normalization': False,
- 'last_layer_activation': 'sigmoid',
- 'learning_rate': 2e-5,
- 'batch_size': 128}
+cell_type2subtypes = {'B Cells': ['Non-plasma B cells', 'Plasma B cells'],
+ 'CD4 T': ['CD4 T'], 'CD8 T': ['CD8 T (GZMK high)', 'CD8 T effector'],
+ 'DC': ['DC'], 'Endothelial Cells': ['Endothelial Cells'],
+ 'Cancer Cells': ['Cancer Cells'],
+ 'Fibroblasts': ['CAFs', 'Myofibroblasts'], 'Macrophages': ['Macrophages'],
+ 'Mast Cells': ['Mast Cells'], 'NK': ['NK'], 'Neutrophils': ['Neutrophils'],
+ 'Double-neg-like T': ['Double-neg-like T'], 'Monocytes': ['Monocytes']}
+all_cell_types = sorted([i for v in cell_type2subtypes.values() for i in v])
+all_cell_types = [i for i in sorted_cell_types if i in all_cell_types]
+
+# set hyper-parameters of the DNN model and other parameters for training
+# hyper-parameters of the DNN model
+deside_parameters = {
+ 'architecture': ([200, 2000, 2000, 2000, 50], [0.05, 0.05, 0.05, 0.2, 0]),
+ 'architecture_for_pathway_network': ([50, 500, 500, 500, 50], [0, 0, 0, 0, 0]),
+ 'loss_function_alpha': 0.5, # alpha*mae + (1-alpha)*rmse, mae means mean absolute error
+ 'normalization': 'layer_normalization', # batch_normalization / layer_normalization / None
+ # 1 means to add a normalization layer, input | the first hidden layer | ... | output
+ 'normalization_layer': [0, 0, 1, 1, 1, 1], # 1 more parameter than the number of hidden layers
+ 'pathway_network': True, # using an independent pathway network
+ 'last_layer_activation': 'sigmoid', # sigmoid / softmax
+ 'learning_rate': 1e-4,
+ 'batch_size': 128}
+
+# read two gene sets as pathway mask
+gene_set_file_path1 = os.path.join(dataset_dir, 'gene_set', 'c2.cp.kegg.v2023.1.Hs.symbols.gmt')
+gene_set_file_path2 = os.path.join(dataset_dir, 'gene_set', 'c2.cp.reactome.v2023.1.Hs.symbols.gmt')
+all_pathway_files = [gene_set_file_path1, gene_set_file_path2]
+pathway_mask = read_gene_set(all_pathway_files) # genes by pathways
+
+# filtered gene list (gene-level filtering, filtered by correlation coefficients and quantiles)
+filtered_gene_list = None # for other datasets
+if list(training_set2file_path.keys())[0] == 'D1':
+ filtered_gene_file_path = os.path.join(dataset_dir, 'simulated_bulk_cell_dataset/D1/gene_list_filtered_by_high_corr_gene_and_quantile_range.csv')
+ filtered_gene_list = pd.read_csv(filtered_gene_file_path, index_col=0).index.to_list()
+
+# input gene list type for pathway profiles
+input_gene_list = 'filtered_genes'
# remove cancer cell during training process
remove_cancer_cell = True
@@ -144,7 +207,9 @@ deside_obj.train_model(training_set_file_path=[training_set2file_path['D1']],
hyper_params=deside_parameters, cell_types=all_cell_types,
scaling_by_constant=True, scaling_by_sample=False,
remove_cancer_cell=remove_cancer_cell,
- n_patience=100, n_epoch=3000, verbose=0)
+ n_patience=100, n_epoch=3000, verbose=0,
+ pathway_mask=pathway_mask, method_adding_pathway='add_to_end',
+ filtered_gene_list=filtered_gene_list, input_gene_list=input_gene_list)
```
- A complete example in jupyter notebook can be found: [E2 - Training a model from scratch.ipynb](https://github.com/OnlyBelter/DeSide_mini_example/blob/main/E2%20-%20Training%20a%20model%20from%20scratch.ipynb)
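
The `cell_type2subtypes` mapping in the training snippet groups 16 subtypes under 13 major cell types, which agrees with the 16 cell types listed for dataset S1. The sketch below repeats the flattening and ordering steps on their own. Because `sorted_cell_types` is imported from `deside.utility` and its contents are not shown here, `preferred_order` is a hypothetical stand-in with made-up entries.

```python
# Mapping copied from the training snippet: major cell type -> subtypes.
cell_type2subtypes = {
    'B Cells': ['Non-plasma B cells', 'Plasma B cells'],
    'CD4 T': ['CD4 T'],
    'CD8 T': ['CD8 T (GZMK high)', 'CD8 T effector'],
    'DC': ['DC'],
    'Endothelial Cells': ['Endothelial Cells'],
    'Cancer Cells': ['Cancer Cells'],
    'Fibroblasts': ['CAFs', 'Myofibroblasts'],
    'Macrophages': ['Macrophages'],
    'Mast Cells': ['Mast Cells'],
    'NK': ['NK'],
    'Neutrophils': ['Neutrophils'],
    'Double-neg-like T': ['Double-neg-like T'],
    'Monocytes': ['Monocytes'],
}

# Flatten the mapping into one alphabetically sorted list of subtypes.
all_subtypes = sorted(s for subtypes in cell_type2subtypes.values() for s in subtypes)

# Hypothetical stand-in for `sorted_cell_types`: a canonical order that may
# also contain names absent from the mapping.
preferred_order = ['Cancer Cells', 'Some Other Type', 'CD4 T', 'CAFs']

# Keep the canonical order, dropping names that are not in the mapping.
ordered_subtypes = [c for c in preferred_order if c in all_subtypes]

print(len(cell_type2subtypes), len(all_subtypes))
print(ordered_subtypes)
```

Filtering against a canonical list rather than relying on alphabetical order keeps the column order of the predicted proportions stable when the mapping changes.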
@@ -155,49 +220,88 @@ deside_obj.train_model(training_set_file_path=[training_set2file_path['D1']],
In this module, you can synthesize bulk tumors based on the dataset `S1`.
```python
+import os
+import pandas as pd
+from deside.utility.read_file import ReadH5AD, ReadExp
+from deside.utility import check_dir, sorted_cell_types
+from deside.simulation import (BulkGEPGenerator, get_gene_list_for_filtering,
+ filtering_by_gene_list_and_pca_plot)
+
# the list of single cell RNA-seq datasets
sc_dataset_ids = ['hnscc_cillo_01', 'pdac_pengj_02', 'hnscc_puram_03',
- 'pdac_steele_04', 'luad_kim_05', 'nsclc_guo_06', 'pan_cancer_07']
+ 'pdac_steele_04', 'luad_kim_05', 'nsclc_guo_06',
+ 'pan_cancer_07', 'prad_cheng_08', 'prad_dong_09',
+ 'hcc_sun_10', 'gbm_neftel_11', 'gbm_abdelfattah_12']
# the list of cancer types in the TCGA dataset
cancer_types = ['ACC', 'BLCA', 'BRCA', 'GBM', 'HNSC', 'LGG', 'LIHC', 'LUAD', 'PAAD', 'PRAD',
'CESC', 'COAD', 'KICH', 'KIRC', 'KIRP', 'LUSC', 'READ', 'THCA', 'UCEC']
+cancer_types_for_filtering = cancer_types.copy()
+
+# coefficients that correct for differences in total RNA abundance among cell types
+# setting all coefficients to 1 has no effect on the final results
+alpha_total_rna_coefficient = {'B Cells': 1.0, 'CD4 T': 1.0, 'CD8 T': 1.0, 'DC': 1.0,
+ 'Endothelial Cells': 1.0, 'Cancer Cells': 1.0, 'Fibroblasts': 1.0,
+ 'Macrophages': 1.0, 'Mast Cells': 1.0, 'NK': 1.0, 'Neutrophils': 1.0,
+ 'Double-neg-like T': 1.0, 'Monocytes': 1.0}
+
+# cell types and the corresponding subtypes
+cell_type2subtypes = {'B Cells': ['Non-plasma B cells', 'Plasma B cells'],
+ 'CD4 T': ['CD4 T'], 'CD8 T': ['CD8 T (GZMK high)', 'CD8 T effector'],
+ 'DC': ['DC'], 'Endothelial Cells': ['Endothelial Cells'],
+ 'Cancer Cells': ['Cancer Cells'],
+ 'Fibroblasts': ['CAFs', 'Myofibroblasts'], 'Macrophages': ['Macrophages'],
+ 'Mast Cells': ['Mast Cells'], 'NK': ['NK'], 'Neutrophils': ['Neutrophils'],
+ 'Double-neg-like T': ['Double-neg-like T'], 'Monocytes': ['Monocytes']}
+
# the list of cell types
-all_cell_types = sorted_cell_types
+all_cell_types = sorted([i for v in cell_type2subtypes.values() for i in v])
+all_cell_types = [i for i in sorted_cell_types if i in all_cell_types]
# parameters
# for gene-level filtering
gene_list_type = 'high_corr_gene_and_quantile_range'
-gene_quantile_range = [0.05, 0.5, 0.95] # gene-level filtering
+gene_quantile_range = [0.005, 0.5, 0.995] # gene-level filtering
# for GEP-level filtering
gep_filtering_quantile = (0.0, 0.95) # GEP-level filtering, L1-norm threshold
+filtering_in_pca_space = True
+pca_n_components = 0.9
n_base = 100 # averaging 100 GEPs sampled from S1 to synthesize 1 bulk GEP, used by S1 generation
cell_prop_prior = None
dataset2parameters = {
'Mixed_N10K_segment': {
'sc_dataset_ids': sc_dataset_ids,
- 'cell_types': all_cell_types,
- 'n_samples': 10000,
+ 'cell_type2subtype': cell_type2subtypes,
+ 'n_samples': 8000,
'sampling_method': 'segment', # or `random` used by Scaden
'filtering': True,
}
}
# skipped steps here ...
-
+simu_bulk_exp_dir = './datasets/simulated_bulk_cell_dataset'
+sct_dataset_file_path = 'path/to/simu_bulk_exp_SCT_N10K_S1_16sct.h5ad'
+tcga2cancer_type_file_path = 'path/to/tcga_sample_id2cancer_type.csv'
+tcga_merged_tpm_file_path = 'path/to/merged_tpm.csv'
+high_corr_gene_file_path = 'path/to/gene_list_filtered_by_high_corr_gene.csv'
+high_corr_gene_list = pd.read_csv(high_corr_gene_file_path)
+high_corr_gene_list = high_corr_gene_list['gene_name'].to_list()
for dataset_name, params in dataset2parameters.items():
# skipped steps here ...
bulk_generator = BulkGEPGenerator(simu_bulk_dir=simu_bulk_exp_dir,
merged_sc_dataset_file_path=None,
- cell_types=params['cell_types'],
+ cell_type2subtype=params['cell_type2subtype'],
sc_dataset_ids=params['sc_dataset_ids'],
bulk_dataset_name=dataset_name,
sct_dataset_file_path=sct_dataset_file_path,
check_basic_info=False,
- tcga2cancer_type_file_path=tcga2cancer_type_file_path)
+ tcga2cancer_type_file_path=tcga2cancer_type_file_path,
+ total_rna_coefficient=alpha_total_rna_coefficient,
+ cell_type_col_name='cell_type',
+ subtype_col_name='cell_type')
# GEP-filtering will be performed during this generation process
generated_bulk_gep_fp = bulk_generator.generated_bulk_gep_fp
dataset2path[dataset_name] = generated_bulk_gep_fp
@@ -214,7 +318,9 @@ for dataset_name, params in dataset2parameters.items():
log_file_path=log_file_path,
show_filtering_info=False,
filtering_method='median_gep',
- cell_prop_prior=cell_prop_prior)
+ cell_prop_prior=cell_prop_prior,
+ filtering_in_pca_space=filtering_in_pca_space,
+ norm_ord=1, pca_n_components=pca_n_components)
# gene-level filtering that depends on the high correlation genes and quantile range (each dataset itself)
if params['filtering']:
@@ -268,4 +374,6 @@ for dataset_name, params in dataset2parameters.items():
If you want to use other scRNA-seq datasets to simulate GEPs, you can follow our workflow to preprocess single cell datasets and merge them together. The Python package `Scanpy` was used heavily in our workflow.
- Preprocessing a single dataset: [03deal_with_Puram et al Cell.ipynb](https://github.com/OnlyBelter/DeSide_mini_example/blob/main/single_cell_dataset_integration/03deal_with_Puram%20et%20al%20Cell.ipynb).
-- Merging multiple datasets together: [08filter_and_merge_01_06.py](https://github.com/OnlyBelter/DeSide_mini_example/blob/main/single_cell_dataset_integration/08filter_and_merge_01_06.py).
+- Merging multiple datasets together (part 1): [Merge_12_scRNA-seq_datasets_part1.ipynb](https://github.com/OnlyBelter/DeSide_mini_example/blob/main/single_cell_dataset_integration/Merge_12_scRNA-seq_datasets_part1.ipynb).
+- Merging multiple datasets together (part 2-round1): [Merge_12_scRNA-seq_datasets_part2_first_round.ipynb](https://github.com/OnlyBelter/DeSide_mini_example/blob/main/single_cell_dataset_integration/Merge_12_scRNA-seq_datasets_part2_first_round.ipynb).
+- Merging multiple datasets together (part 2-round2): [Merge_12_scRNA-seq_datasets_part2_second_round.ipynb](https://github.com/OnlyBelter/DeSide_mini_example/blob/main/single_cell_dataset_integration/Merge_12_scRNA-seq_datasets_part2_second_round.ipynb).