diff --git a/docs/datasets.md b/docs/datasets.md
index 58f6561..c6e82c6 100644
--- a/docs/datasets.md
+++ b/docs/datasets.md
@@ -3,48 +3,55 @@ Datasets
 
 Datasets used in DeSide
 ***
-## scRNA-seq datasets
-
-| Dataset ID  | Journal              | DOI                         | Publish Date | Reported cells (total) | Organism               | Tissue                           | Data location                     | Sequencing method       | #patients |
-|-------------|----------------------|-----------------------------|--------------|------------------|------------------------|----------------------------------|-----------------------------------|-------------------------|-----------|
-| hnscc_cillo_01 | Immunity             | 10.1016/j.immuni.2019.11.014 | 20200107     | 131,224          | Human                  | Head and Neck Cancer (HNSC)      | GSE139324                         | 10x Single Cell 3' v2    | 26        |
-| pdac_pengj_02 | Cell Res             | 10.1038/s41422-019-0195-y  | 20190704     | 57,530           | Human                  | Pancreatic Ductal Adenocarcinoma (PDAC)| [Link](https://bigd.big.ac.cn/gsa/browse/CRA001160) | 10x Single Cell 3' v2    | 22        |
-| hnscc_puram_03 | Cell                 | 10.1016/j.cell.2017.10.044 | 20171130     | 5,902            | Human                  | Head and Neck Cancer (HNSC)      | GSE103322                         | Smart-seq2               | 16        |
-| pdac_steele_04 | Nat Cancer           | 10.1038/s43018-020-00121-4 | 20201026     | 124,898          | Human                  | Pancreatic Ductal Adenocarcinoma (PDAC)| GSE155698                         | 10x Single Cell 3' v2    | 15        |
-| luad_kim_05 | Nat Commun           | 10.1038/s41467-020-16164-1 | 20200508     | 208,506          | Human                  | Lung Adenocarcinoma (LUAD)       | GSE131907                         | 10x Single Cell 3' v2    | 13        |
-| nsclc_guo_06 | Nature Medicine      | 10.1038/s41591-018-0045-3  | 20180625     | 12,346           | Human                  | Non-Small-Cell Lung Cancer (NSCLC) | GSE99254                          | Smart-Seq2               | 13        |
-| pan_cancer_07 | Nat Genet            | 10.1038/s41588-020-00726-6 | 20201030     | 53,513           | Human                  | Cancer cell lines                | GSE157220                         | Illumina NextSeq 500    | -         |
-
-
-- The number of **reported cells** may include cells that don't originate from solid tumors, which were removed during integrating.
-
-## Merged datasets and Synthetic datasets
-
-|              Dataset name              | #samples | Sampling method | Filtering | #cell types | #genes | Input dataset                  |      GEPs <br/>(type, fortmat)       |         Dataset type          |  Notation  |
-|:--------------------------------------:|----------|-----------------|-----------|-------------|--------|--------------------------------|:-------------------------------:|:-----------------------------:|:----------:|
-|                  TCGA                  | 7,699    | -               | -         | -           | 19,712 | -                              |           MCT, `TPM`            |     Downloaded from TCGA      |     DA     |
-|          merged_7_sc_datasets          | 135,049  | -               | -         | 11          | 12,114 | 7 collected scRNA-seq datasets | Single cell, <br/>`log2(TPM+1)` |  Raw dataset from scRNA-seq   |     S0     |
-|              SCT_POS_N10K              | 110,000  | n_base=100      | -         | 11          | 12,114 | S0                             |       SCT, `log2(TPM+1)`        | Used to simulate MCT datasets |     S1     |
-|           Mixed_N100K_random           | 100,000  | Random          | No        | 11          | 12,114 | S1                             |       MCT, `log2(TPM+1)`        |         Training set          |     D0     |
-|          Mixed_N100K_segment           | 100,000  | Segment         | Yes       | 11          | 6,168  | S1                             |       MCT, `log2(TPM+1)`        |         Training set          |     D1     |
-| Mixed_N100K_segment_<br/>without_filtering  | 100,000  | Segment   | No        | 11          | 12,114 | S1                             |       MCT, `log2(TPM+1)`        |         Training set          |     D2     |
-|            Test_set_random             | 3,000    | Random          | No        | 11          | 12,114 | S1                             |       MCT, `log2(TPM+1)`        |           Test set            |     T0     |
-|               Test_set1                | 3,000    | Segment         | Yes       | 11          | 6,168  | S1                             |       MCT, `log2(TPM+1)`        |           Test set            |     T1     |
-|               Test_set2                | 3,000    | Segment         | No        | 11          | 12,114 | S1                             |       MCT, `log2(TPM+1)`        |           Test set            |     T2     |
-|              SCT_POS_N100              | 1100     | n_base=100      | -         | 11          | 12,114 | S0                             |       SCT, `log2(TPM+1)`        |           Test set            |     T3     |
-
-- MCT: Bulk gene expression profile with multiple different cell types
-- SCT: Bulk gene expression profile with single cell type (scGEP)
+
+## Merged datasets and Synthetic datasets (Table S1)
+
+|                Dataset name                | #samples    | Sampling method | Filtering | #cell types | #genes | Input dataset                   |    GEPs <br/>(type, format)     |         Dataset type          | Notation |
+|:------------------------------------------:|-------------|-----------------|-----------|-------------|--------|---------------------------------|:-------------------------------:|:-----------------------------:|:--------:|
+|                    TCGA                    | 7,699       | -               | -         | -           | 19,712 | -                               |           MCT, `TPM`            |     Downloaded from TCGA      |    DA    |
+|           merged_12_sc_datasets            | 325,474     | -               | -         | 19          | 17,834 | 12 collected scRNA-seq datasets | Single cell, <br/>`log2(TPM+1)` |  Raw dataset from scRNA-seq   |    S0    |
+|                SCT_POS_N10K                | 10,000 x 16 | n_base=100      | -         | 16          | 17,834 | S0                              |       SCT, `log2(TPM+1)`        | Used to simulate MCT datasets |    S1    |
+|             Mixed_N100K_random             | 100,000     | Random          | No        | 16          | 17,834 | S1                              |       MCT, `log2(TPM+1)`        |         Training set          |    D0    |
+|            Mixed_N100K_segment             | 100,000     | Segment         | Yes       | 16          | 9,028  | S1                              |       MCT, `log2(TPM+1)`        |         Training set          |    D1    |
+| Mixed_N100K_segment_<br/>without_filtering | 100,000     | Segment         | No        | 16          | 17,834 | S1                              |       MCT, `log2(TPM+1)`        |         Training set          |    D2    |
+|              Test_set_random               | 3,000       | Random          | No        | 16          | 17,834 | S1                              |       MCT, `log2(TPM+1)`        |           Test set            |    T0    |
+|                 Test_set1                  | 3,000       | Segment         | Yes       | 16          | 9,028  | S1                              |       MCT, `log2(TPM+1)`        |           Test set            |    T1    |
+|                 Test_set2                  | 3,000       | Segment         | No        | 16          | 17,834 | S1                              |       MCT, `log2(TPM+1)`        |           Test set            |    T2    |
+|                SCT_POS_N100                | 100 x 16    | n_base=100      | -         | 16          | 17,834 | S0                              |       SCT, `log2(TPM+1)`        |           Test set            |    T3    |
+
+- MCT: Bulk gene expression profiles with multiple different cell types
+- SCT: Bulk gene expression profiles with a single cell type (sctGEPs)
 - GEPs: Gene expression profiles
 
+## Collected scRNA-seq datasets (Table S2)
+
+| Dataset ID         | Journal         | DOI                          | Publish Date | Reported cells (total)* | Integrated cells (used) | Organism | Tissue                                  | Data location                                           | Sequencing method         | #patients** |
+|--------------------|-----------------|------------------------------|--------------|-------------------------|-------------------------|----------|-----------------------------------------|---------------------------------------------------------|---------------------------|-------------|
+| hnscc_cillo_01     | Immunity        | 10.1016/j.immuni.2019.11.014 | 20200107     | 131,224                 | 57,034                  | Human    | Head and Neck Cancer (HNSC)             | GSE139324                                               | 10x Single Cell 3' v2     | 26          |
+| pdac_pengj_02      | Cell Res        | 10.1038/s41422-019-0195-y    | 20190704     | 57,530                  | 37,079                  | Human    | Pancreatic Ductal Adenocarcinoma (PDAC) | [Link](https://bigd.big.ac.cn/gsa/browse/CRA001160)     | 10x Single Cell 3' v2     | 22          |
+| hnscc_puram_03     | Cell            | 10.1016/j.cell.2017.10.044   | 20171130     | 5,902                   | 4,647                   | Human    | Head and Neck Cancer (HNSC)             | GSE103322                                               | Smart-seq2                | 16          |
+| pdac_steele_04     | Nat Cancer      | 10.1038/s43018-020-00121-4   | 20201026     | 124,898                 | 32,062                  | Human    | Pancreatic Ductal Adenocarcinoma (PDAC) | GSE155698                                               | 10x Single Cell 3' v2     | 15          |
+| luad_kim_05        | Nat Commun      | 10.1038/s41467-020-16164-1   | 20200508     | 208,506                 | 49,959                  | Human    | Lung Adenocarcinoma (LUAD)              | GSE131907                                               | 10x Single Cell 3' v2     | 13          |
+| nsclc_guo_06       | Nature Medicine | 10.1038/s41591-018-0045-3    | 20180625     | 12,346                  | 4,050                   | Human    | Non-Small-Cell Lung Cancer (NSCLC)      | GSE99254                                                | Smart-Seq2                | 13          |
+| pan_cancer_07      | Nat Genet       | 10.1038/s41588-020-00726-6   | 20201030     | 53,513                  | 30,681                  | Human    | Cancer cell lines                       | GSE157220                                               | Illumina NextSeq 500      | -           |
+| prad_cheng_08      | Nat Cell Biol   | 10.1038/s41556-020-00613-6   | 20211108     | 36,424                  | 28,253                  | Human    | Prostate cancer (PRAD)                  | https://www.weizmann.ac.il/sites/3CA/prostate           | 10X Genomics              | 12          |
+| prad_dong_09       | Commun Biol     | 10.1038/s42003-020-01476-1   | 20201216     | 21,292                  | 16,472                  | Human    | Prostate cancer (PRAD)                  | https://www.weizmann.ac.il/sites/3CA/prostate           | 10X Genomics              | 6           |
+| hcc_sun_10         | Cell            | 10.1016/j.cell.2020.11.041   | 20201123     | 16,498                  | 11,365                  | Human    | Hepatocellular carcinoma (HCC)          | https://www.weizmann.ac.il/sites/3CA/liverbiliary       | 10X Genomics              | 16          |
+| gbm_neftel_11      | Cell            | 10.1016/j.cell.2019.06.024   | 20190618     | 24,131                  | 16,835                  | Human    | Glioblastoma multiforme (GBM)           | https://www.weizmann.ac.il/sites/3CA/brain (GSE131928)  | 10X Genomics              | 36          |
+| gbm_abdelfattah_12 | Nat Commun      | 10.1038/s41467-022-28372-y   | 20220909     | 201,986                 | 37,037                  | Human    | Glioblastoma multiforme (GBM)           | GSE182109                                               | 10× Chromium / HiSeq 4000 | 8           |
+
+- \* The number of **reported cells** may include cells that do not originate from solid tumors; these cells were removed during integration.
+- \*\* Counts only the patients (samples) whose data were integrated into the final dataset.
+
+
 ## Download
-- TCGA: [download link](https://figshare.com/articles/dataset/Merged_gene_expression_profiles_TPM_/23047547)
-- merged_7_sc_datasets (S0): [download link](https://figshare.com/articles/dataset/Dataset_S0/23283908)
-- SCT_POS_N10K (S1): [download link](https://figshare.com/articles/dataset/Dataset_S1/23043560)
-- Mixed_N100K_random (D0): [download link](https://figshare.com/articles/dataset/Dataset_D0/23283932)
-- Mixed_N100K_segment (D1): [download link](https://figshare.com/articles/dataset/Dataset_D1/23047391)
-- Mixed_N100K_segment_without_filtering (D2): [download link](https://figshare.com/articles/dataset/Dataset_D2/23284256)
-- All Test Sets: [download link](https://figshare.com/articles/dataset/All_Test_Sets/23283884)
+- TCGA (DA): [merged_tpm.csv.zip](https://doi.org/10.6084/m9.figshare.23047547.v1)
+- merged_12_sc_datasets (S0): [merged_12_sc_datasets_231003.h5ad](https://doi.org/10.6084/m9.figshare.23283908.v2)
+- SCT_POS_N10K (S1): [simu_bulk_exp_SCT_N10K_S1_16sct.h5ad](https://doi.org/10.6084/m9.figshare.23043560.v2)
+- Mixed_N100K_random (D0): [simu_bulk_exp_Mixed_N100K_random_log2cpm1p.h5ad](https://doi.org/10.6084/m9.figshare.23283932.v2)
+- Mixed_N100K_segment (D1): [simu_bulk_exp_Mixed_N100K_D1.h5ad](https://doi.org/10.6084/m9.figshare.23047391.v2)
+- Mixed_N100K_segment_without_filtering (D2): [simu_bulk_exp_Mixed_N100K_D2.h5ad](https://doi.org/10.6084/m9.figshare.23284256.v2)
+- All Test Sets: [all_test_sets.zip](https://doi.org/10.6084/m9.figshare.23283884.v3)
   - Test_set_random (T0)
   - Test_set1 (T1)
   - Test_set2 (T2)
diff --git a/docs/usage.md b/docs/usage.md
index 74ab6fb..02ad8b0 100644
--- a/docs/usage.md
+++ b/docs/usage.md
@@ -75,16 +75,22 @@ This package provides functions to plot the results of DeSide.
 
 ## DeSide model
 There are two ways to use DeSide. 
-Firstly, you can use the provided pre-trained model to directly predict cell proportions, 
+Firstly, you can use the provided pre-trained model to predict cell proportions directly, 
 eliminating the need to train the model by yourself. 
 Alternatively, you can sequentially execute the `Dataset Simulation` and `Model Training` modules, training the model from scratch. 
-Subsequently, you can use the self-trained model to predict cell proportions.
+You can then use the self-trained model to predict cell proportions.
 
 ### Model Prediction
 
 Using the pre-trained model or self-trained model, you can predict cell proportions in bulk gene expression profiles (bulk GEPs) by using the [`deside_model.predict`](https://deside.readthedocs.io/en/latest/func/deconvolution.html#deside.decon_cf.DeSide.predict) function.
 
 ```python
+import os
+import pandas as pd
+from deside.utility import check_dir
+from deside.decon_cf import DeSide
+from deside.utility.read_file import read_gene_set
+
 # bulk gene expression profiles (GEPs) in TPM format
 bulk_tpm_file_path = 'path/xx_TPM.csv'
 bulk_tpm = pd.read_csv(bulk_tpm_file_path, index_col=0)
@@ -93,6 +99,26 @@ bulk_tpm = pd.read_csv(bulk_tpm_file_path, index_col=0)
 result_dir = './results'
 y_pred_file_path = os.path.join(result_dir, 'y_pred.csv')
 check_dir(result_dir)
+dataset_dir = './datasets/'
+
+# hyper-parameters of the DNN model
+deside_parameters = {
+    'architecture': ([200, 2000, 2000, 2000, 50], [0.05, 0.05, 0.05, 0.2, 0]),
+    'architecture_for_pathway_network': ([50, 500, 500, 500, 50], [0, 0, 0, 0, 0]),
+    'loss_function_alpha': 0.5,  # alpha*mae + (1-alpha)*rmse, mae means mean absolute error
+    'normalization': 'layer_normalization',  # batch_normalization / layer_normalization / None
+    # 1 means to add a normalization layer, input | the first hidden layer | ... | output
+    'normalization_layer': [0, 0, 1, 1, 1, 1],  # 1 more parameter than the number of hidden layers
+    'pathway_network': True,  # using an independent pathway network
+    'last_layer_activation': 'sigmoid',  # sigmoid / softmax
+    'learning_rate': 1e-4,
+    'batch_size': 128}
+
+# read two gene sets as pathway mask
+gene_set_file_path1 = os.path.join(dataset_dir, 'gene_set', 'c2.cp.kegg.v2023.1.Hs.symbols.gmt')
+gene_set_file_path2 = os.path.join(dataset_dir, 'gene_set', 'c2.cp.reactome.v2023.1.Hs.symbols.gmt')
+all_pathway_files = [gene_set_file_path1, gene_set_file_path2]
+pathway_mask = read_gene_set(all_pathway_files)  # genes by pathways
 
 # read pre-trained DeSide model
 model_dir = './DeSide_model/'
@@ -102,7 +128,9 @@ deside_model = DeSide(model_dir=model_dir)
 deside_model.predict(input_file=bulk_tpm_file_path, 
                      output_file_path=y_pred_file_path, 
                      exp_type='TPM', transpose=True,
-                     scaling_by_sample=False, scaling_by_constant=True)
+                     scaling_by_sample=False, scaling_by_constant=True,
+                     hyper_params=deside_parameters,
+                     pathway_mask=pathway_mask)
 ```
 - A complete example in jupyter notebook can be found: [E1 - Using pre-trained model.ipynb](https://github.com/OnlyBelter/DeSide_mini_example/blob/main/E1%20-%20Using%20pre-trained%20model.ipynb).
 
@@ -110,25 +138,60 @@ deside_model.predict(input_file=bulk_tpm_file_path,
 
 Training a model using the provided training set.
 ```python
+import os
+import pandas as pd
+from deside.decon_cf import DeSide
+from deside.utility import check_dir, sorted_cell_types
+from deside.utility.read_file import read_gene_set
+
 # create output directory
 result_dir = './results'
 check_dir(result_dir)
+dataset_dir = './datasets/'
 
 # using dataset D1 as the training set
 training_set2file_path = {
-    'D1': './datasets/simulated_bulk_cell_dataset/simu_bulk_exp_Mixed_N100K_D1.h5ad',
+    'D1': './datasets/simulated_bulk_cell_dataset/D1/simu_bulk_exp_Mixed_N100K_D1.h5ad',
 }
 
-all_cell_types = sorted_cell_types
-
-# set hyper-parameters of the DNN model
-deside_parameters = {'architecture': ([100, 1000, 1000, 1000, 50],
-                                      [0, 0, 0, 0.2, 0]),
-                     'loss_function': 'mae+rmse',
-                     'batch_normalization': False,
-                     'last_layer_activation': 'sigmoid',
-                     'learning_rate': 2e-5,
-                     'batch_size': 128}
+cell_type2subtypes = {'B Cells': ['Non-plasma B cells', 'Plasma B cells'],
+                      'CD4 T': ['CD4 T'], 'CD8 T': ['CD8 T (GZMK high)', 'CD8 T effector'],
+                      'DC': ['DC'], 'Endothelial Cells': ['Endothelial Cells'],
+                      'Cancer Cells': ['Cancer Cells'],
+                      'Fibroblasts': ['CAFs', 'Myofibroblasts'], 'Macrophages': ['Macrophages'],
+                      'Mast Cells': ['Mast Cells'], 'NK': ['NK'], 'Neutrophils': ['Neutrophils'],
+                      'Double-neg-like T': ['Double-neg-like T'], 'Monocytes': ['Monocytes']}
+all_cell_types = sorted([i for v in cell_type2subtypes.values() for i in v])
+all_cell_types = [i for i in sorted_cell_types if i in all_cell_types]
+
+# set hyper-parameters of the DNN model and other parameters for training
+# hyper-parameters of the DNN model
+deside_parameters = {
+    'architecture': ([200, 2000, 2000, 2000, 50], [0.05, 0.05, 0.05, 0.2, 0]),
+    'architecture_for_pathway_network': ([50, 500, 500, 500, 50], [0, 0, 0, 0, 0]),
+    'loss_function_alpha': 0.5,  # alpha*mae + (1-alpha)*rmse, mae means mean absolute error
+    'normalization': 'layer_normalization',  # batch_normalization / layer_normalization / None
+    # 1 means to add a normalization layer, input | the first hidden layer | ... | output
+    'normalization_layer': [0, 0, 1, 1, 1, 1],  # 1 more parameter than the number of hidden layers
+    'pathway_network': True,  # using an independent pathway network
+    'last_layer_activation': 'sigmoid',  # sigmoid / softmax
+    'learning_rate': 1e-4,
+    'batch_size': 128}
+
+# read two gene sets as pathway mask
+gene_set_file_path1 = os.path.join(dataset_dir, 'gene_set', 'c2.cp.kegg.v2023.1.Hs.symbols.gmt')
+gene_set_file_path2 = os.path.join(dataset_dir, 'gene_set', 'c2.cp.reactome.v2023.1.Hs.symbols.gmt')
+all_pathway_files = [gene_set_file_path1, gene_set_file_path2]
+pathway_mask = read_gene_set(all_pathway_files)  # genes by pathways
+
+# filtered gene list (gene-level filtering, filtered by correlation coefficients and quantiles)
+filtered_gene_list = None  # for other datasets
+if list(training_set2file_path.keys())[0] == 'D1':
+    filtered_gene_file_path = os.path.join(dataset_dir, 'simulated_bulk_cell_dataset/D1/gene_list_filtered_by_high_corr_gene_and_quantile_range.csv')
+    filtered_gene_list = pd.read_csv(filtered_gene_file_path, index_col=0).index.to_list()
+
+# input gene list type for pathway profiles
+input_gene_list = 'filtered_genes'
 
 # remove cancer cell during training process
 remove_cancer_cell = True
@@ -144,7 +207,9 @@ deside_obj.train_model(training_set_file_path=[training_set2file_path['D1']],
                        hyper_params=deside_parameters, cell_types=all_cell_types,
                        scaling_by_constant=True, scaling_by_sample=False,
                        remove_cancer_cell=remove_cancer_cell,
-                       n_patience=100, n_epoch=3000, verbose=0)
+                       n_patience=100, n_epoch=3000, verbose=0,
+                       pathway_mask=pathway_mask, method_adding_pathway='add_to_end',
+                       filtered_gene_list=filtered_gene_list, input_gene_list=input_gene_list)
 ```
 - A complete example in jupyter notebook can be found: [E2 - Training a model from scratch.ipynb](https://github.com/OnlyBelter/DeSide_mini_example/blob/main/E2%20-%20Training%20a%20model%20from%20scratch.ipynb)
 
@@ -155,49 +220,88 @@ deside_obj.train_model(training_set_file_path=[training_set2file_path['D1']],
 In this module, you can synthesize bulk tumors based on the dataset `S1`.
 
 ```python
+import os
+import pandas as pd
+from deside.utility.read_file import ReadH5AD, ReadExp
+from deside.utility import check_dir, sorted_cell_types
+from deside.simulation import (BulkGEPGenerator, get_gene_list_for_filtering, 
+                               filtering_by_gene_list_and_pca_plot)
+
 # the list of single cell RNA-seq datasets
 sc_dataset_ids = ['hnscc_cillo_01', 'pdac_pengj_02', 'hnscc_puram_03',
-                  'pdac_steele_04', 'luad_kim_05', 'nsclc_guo_06', 'pan_cancer_07']
+                  'pdac_steele_04', 'luad_kim_05', 'nsclc_guo_06', 
+                  'pan_cancer_07', 'prad_cheng_08', 'prad_dong_09', 
+                  'hcc_sun_10', 'gbm_neftel_11', 'gbm_abdelfattah_12']
 
 # the list of cancer types in the TCGA dataset
 cancer_types = ['ACC', 'BLCA', 'BRCA', 'GBM', 'HNSC', 'LGG', 'LIHC', 'LUAD', 'PAAD', 'PRAD',
                 'CESC', 'COAD', 'KICH', 'KIRC', 'KIRP', 'LUSC', 'READ', 'THCA', 'UCEC']
 
+cancer_types_for_filtering = cancer_types.copy()
+
+# coefficients to correct for differences in total RNA abundance between cell types
+# setting all coefficients to 1 has no effect on the final results
+alpha_total_rna_coefficient = {'B Cells': 1.0, 'CD4 T': 1.0, 'CD8 T': 1.0, 'DC': 1.0,
+                               'Endothelial Cells': 1.0, 'Cancer Cells': 1.0, 'Fibroblasts': 1.0,
+                               'Macrophages': 1.0, 'Mast Cells': 1.0, 'NK': 1.0, 'Neutrophils': 1.0,
+                               'Double-neg-like T': 1.0, 'Monocytes': 1.0}
+
+# cell types and the corresponding subtypes
+cell_type2subtypes = {'B Cells': ['Non-plasma B cells', 'Plasma B cells'],
+                      'CD4 T': ['CD4 T'], 'CD8 T': ['CD8 T (GZMK high)', 'CD8 T effector'],
+                      'DC': ['DC'], 'Endothelial Cells': ['Endothelial Cells'],
+                      'Cancer Cells': ['Cancer Cells'],
+                      'Fibroblasts': ['CAFs', 'Myofibroblasts'], 'Macrophages': ['Macrophages'],
+                      'Mast Cells': ['Mast Cells'], 'NK': ['NK'], 'Neutrophils': ['Neutrophils'],
+                      'Double-neg-like T': ['Double-neg-like T'], 'Monocytes': ['Monocytes']}
+
 # the list of cell types
-all_cell_types = sorted_cell_types
+all_cell_types = sorted([i for v in cell_type2subtypes.values() for i in v])
+all_cell_types = [i for i in sorted_cell_types if i in all_cell_types]
 
 # parameters
 # for gene-level filtering
 gene_list_type = 'high_corr_gene_and_quantile_range'
-gene_quantile_range = [0.05, 0.5, 0.95]  # gene-level filtering
+gene_quantile_range = [0.005, 0.5, 0.995]  # gene-level filtering
 
 # for GEP-level filtering
 gep_filtering_quantile = (0.0, 0.95)  # GEP-level filtering, L1-norm threshold
+filtering_in_pca_space = True
+pca_n_components = 0.9
 n_base = 100  # averaging 100 GEPs sampled from S1 to synthesize 1 bulk GEP, used by S1 generation
 
 cell_prop_prior = None
 dataset2parameters = {
     'Mixed_N10K_segment': {
         'sc_dataset_ids': sc_dataset_ids,
-        'cell_types': all_cell_types,
-        'n_samples': 10000,
+        'cell_type2subtype': cell_type2subtypes,
+        'n_samples': 8000,
         'sampling_method': 'segment', # or `random` used by Scaden
         'filtering': True,
     }
 }
 
 # skipped steps here ...
-
+simu_bulk_exp_dir = './datasets/simulated_bulk_cell_dataset'
+sct_dataset_file_path = 'path/to/simu_bulk_exp_SCT_N10K_S1_16sct.h5ad'
+tcga2cancer_type_file_path = 'path/to/tcga_sample_id2cancer_type.csv'
+tcga_merged_tpm_file_path = 'path/to/merged_tpm.csv'
+high_corr_gene_file_path = 'path/to/gene_list_filtered_by_high_corr_gene.csv'
+high_corr_gene_list = pd.read_csv(high_corr_gene_file_path)
+high_corr_gene_list = high_corr_gene_list['gene_name'].to_list()
 for dataset_name, params in dataset2parameters.items():
     # skipped steps here ...
     bulk_generator = BulkGEPGenerator(simu_bulk_dir=simu_bulk_exp_dir,
                                       merged_sc_dataset_file_path=None,
-                                      cell_types=params['cell_types'],
+                                      cell_type2subtype=params['cell_type2subtype'],
                                       sc_dataset_ids=params['sc_dataset_ids'],
                                       bulk_dataset_name=dataset_name,
                                       sct_dataset_file_path=sct_dataset_file_path,
                                       check_basic_info=False,
-                                      tcga2cancer_type_file_path=tcga2cancer_type_file_path)
+                                      tcga2cancer_type_file_path=tcga2cancer_type_file_path,
+                                      total_rna_coefficient=alpha_total_rna_coefficient,
+                                      cell_type_col_name='cell_type',
+                                      subtype_col_name='cell_type')
     # GEP-filtering will be performed during this generation process
     generated_bulk_gep_fp = bulk_generator.generated_bulk_gep_fp
     dataset2path[dataset_name] = generated_bulk_gep_fp
@@ -214,7 +318,9 @@ for dataset_name, params in dataset2parameters.items():
                                     log_file_path=log_file_path,
                                     show_filtering_info=False,
                                     filtering_method='median_gep',
-                                    cell_prop_prior=cell_prop_prior)
+                                    cell_prop_prior=cell_prop_prior,
+                                    filtering_in_pca_space=filtering_in_pca_space,
+                                    norm_ord=1, pca_n_components=pca_n_components)
 
     # gene-level filtering that depends on the high correlation genes and quantile range (each dataset itself)
     if params['filtering']:
@@ -268,4 +374,6 @@ for dataset_name, params in dataset2parameters.items():
 If you want to use other scRNA-seq datasets to simulate GEPs, you can follow our workflow to preprocess single cell datasets and merge them together. The Python package `Scanpy` was used heavily in our workflow.
 
 - Preprocessing a single dataset: [03deal_with_Puram et al Cell.ipynb](https://github.com/OnlyBelter/DeSide_mini_example/blob/main/single_cell_dataset_integration/03deal_with_Puram%20et%20al%20Cell.ipynb).
-- Merging multiple datasets together: [08filter_and_merge_01_06.py](https://github.com/OnlyBelter/DeSide_mini_example/blob/main/single_cell_dataset_integration/08filter_and_merge_01_06.py).
+- Merging multiple datasets together (part 1): [Merge_12_scRNA-seq_datasets_part1.ipynb](https://github.com/OnlyBelter/DeSide_mini_example/blob/main/single_cell_dataset_integration/Merge_12_scRNA-seq_datasets_part1.ipynb).
+- Merging multiple datasets together (part 2-round1): [Merge_12_scRNA-seq_datasets_part2_first_round.ipynb](https://github.com/OnlyBelter/DeSide_mini_example/blob/main/single_cell_dataset_integration/Merge_12_scRNA-seq_datasets_part2_first_round.ipynb).
+- Merging multiple datasets together (part 2-round2): [Merge_12_scRNA-seq_datasets_part2_second_round.ipynb](https://github.com/OnlyBelter/DeSide_mini_example/blob/main/single_cell_dataset_integration/Merge_12_scRNA-seq_datasets_part2_second_round.ipynb).