qc broken b.c. it expects auxiliary folder to simultaneously exist and not exist #160

Open
bytewife opened this issue Dec 9, 2023 · 11 comments · May be fixed by #163

Comments

@bytewife
Member

bytewife commented Dec 9, 2023

The qc subcommand requires an 'auxiliary' folder to exist in order to work. This folder is meant to be produced by train. However, chrombpnet does not allow the folder to exist, because os.makedirs(..., exist_ok=False) is used both in chrombpnet_qc() and in the block that handles qc as an input subcommand. As a result, qc cannot work correctly. I recommend changing that flag to True in that block and in chrombpnet_qc().
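To illustrate the conflict described above, here is a minimal sketch of how os.makedirs behaves with each value of exist_ok. It uses a temporary directory as a stand-in for the real output directory and is not taken from the chrombpnet code; only the folder name "auxiliary" comes from this issue.

```python
import os
import tempfile

# Temporary stand-in for the output directory; "auxiliary" mirrors the folder name in the issue.
with tempfile.TemporaryDirectory() as output_dir:
    aux = os.path.join(output_dir, "auxiliary")

    # First call succeeds because the folder does not exist yet.
    os.makedirs(aux, exist_ok=False)

    # A second call with exist_ok=False raises FileExistsError.
    try:
        os.makedirs(aux, exist_ok=False)
        raised = False
    except FileExistsError:
        raised = True

    # With exist_ok=True the existing folder is accepted silently.
    os.makedirs(aux, exist_ok=True)
    still_there = os.path.isdir(aux)

print(raised, still_there)
```

With exist_ok=False, any run that finds a pre-existing 'auxiliary' folder stops before doing any work, which matches the FileExistsError reported here.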

For completeness, here's the error when the train folders are provided:

Traceback (most recent call last):
  File "/opt/conda/bin/chrombpnet", line 33, in <module>
    sys.exit(load_entry_point('chrombpnet', 'console_scripts', 'chrombpnet')())
  File "/scratch/chrombpnet/chrombpnet/CHROMBPNET.py", line 26, in main
    os.makedirs(os.path.join(args.output_dir,"auxiliary"), exist_ok=False)
  File "/opt/conda/lib/python3.9/os.py", line 225, in makedirs
    mkdir(name, mode)
FileExistsError: [Errno 17] File exists: '/projectTest/output/auxiliary'
Exit code: 1

and here's when they're not provided:

Traceback (most recent call last):
got the model
loading peaks...
  File "/opt/conda/bin/chrombpnet", line 33, in <module>
    sys.exit(load_entry_point('chrombpnet', 'console_scripts', 'chrombpnet')())
  File "/scratch/chrombpnet/chrombpnet/CHROMBPNET.py", line 29, in main
    pipelines.chrombpnet_qc(args)
  File "/scratch/chrombpnet/chrombpnet/pipelines.py", line 196, in chrombpnet_qc
    predict.main(args_copy)
  File "/scratch/chrombpnet/chrombpnet/training/predict.py", line 105, in main
    test_generator = initializers.initialize_generators(args, mode="test", parameters=None, return_coords=True)
  File "/scratch/chrombpnet/chrombpnet/training/data_generators/initializers.py", line 69, in initialize_generators
    peak_regions=pd.read_csv(args.peaks,header=None,sep='\t',names=NARROWPEAK_SCHEMA)
  File "/opt/conda/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 948, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/opt/conda/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 611, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/opt/conda/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 1448, in __init__
    self._engine = self._make_engine(f, self.engine)
  File "/opt/conda/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 1705, in _make_engine
    self.handles = get_handle(
  File "/opt/conda/lib/python3.9/site-packages/pandas/io/common.py", line 863, in get_handle
    handle = open(
FileNotFoundError: [Errno 2] No such file or directory: '/projectTest/output/auxiliary/filtered.peaks.bed'
Exit code: 1
Model: "model"
@bytewife changed the title from "qc broken because it" to "qc broken b.c. it expects auxiliary folder to simultaneously exist and not exist" Dec 9, 2023
@akundaje

akundaje commented Dec 9, 2023 via email

@bytewife
Member Author

@akundaje I already made this PR in the link above

@panushri25
Collaborator

@ivyraine is the output_dir you provided to chrombpnet qc the same as the one for chrombpnet train?

@panushri25
Collaborator

Can you provide the exact commands you used to run chrombpnet qc and chrombpnet train?

@panushri25
Collaborator

I think you are using the same output dir path for both commands, which is why you are seeing this error. Is there a reason why you are using the same path?

@bytewife
Member Author

bytewife commented Dec 12, 2023

No, this is from using two different output paths, one for each command.
train cmd:

                chrombpnet train \
                  -itag /mnt/volume/oak/stanford/projects/igvf/Y2AVE/E2G_Predictions/inputs/230601_iPSC_art_ven_EC_10xmultiome_Cluster0/230601_iPSC_art_ven_EC_10Xmultiome_Cluster0.atac.filter.cutsites.hg38.tagAlign.gz \
                  -d "ATAC" \
                  -g /mnt/volume/oak/stanford/groups/akundaje/soumyak/refs/hg38/GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta \
                  -c /mnt/volume/oak/stanford/groups/akundaje/soumyak/refs/hg38/GRCh38_EBV.chrom.sizes.tsv \
                  -p /mnt/volume/oak/stanford/groups/engreitz/Projects/IGVF-Y2AVE/outputs/230601_iPSC_art_ven_EC_10xmultiome_Cluster0/Peaks/macs2_peaks.narrowPeak.sorted.candidateRegions.bed \
                  -n /mnt/volume/oak/stanford/groups/engreitz/Projects/IGVF-Y2AVE/outputs/230601_iPSC_art_ven_EC_10xmultiome_Cluster0/fold_0/_negatives.bed \
                  -fl /mnt/volume/oak/stanford/groups/akundaje/anusri/chrombpnet_data/input_files/folds/fold_0.json \
                  -b /mnt/volume/oak/stanford/groups/akundaje/anusri/chrombpnet_data/input_files/bias_models/ATAC/ENCSR868FGK_bias_fold_0.h5 \
                  -o /mnt/volume/oak/stanford/groups/engreitz/Projects/IGVF-Y2AVE/outputs/230601_iPSC_art_ven_EC_10xmultiome_Cluster0/fold_0/chrombpnet_model/ \
                  | tee "$output_file"

qc cmd:

                chrombpnet qc \
                  -bw /mnt/volume/oak/stanford/groups/engreitz/Projects/IGVF-Y2AVE/outputs/230601_iPSC_art_ven_EC_10xmultiome_Cluster0/fold_0/auxiliary/data_unstranded.bw \
                  -cm /mnt/volume/oak/stanford/groups/engreitz/Projects/IGVF-Y2AVE/outputs/230601_iPSC_art_ven_EC_10xmultiome_Cluster0/fold_0/chrombpnet_model/models/chrombpnet.h5 \
                  -cmb /mnt/volume/oak/stanford/groups/engreitz/Projects/IGVF-Y2AVE/outputs/230601_iPSC_art_ven_EC_10xmultiome_Cluster0/fold_0/chrombpnet_model/models/chrombpnet_nobias.h5 \
                  -d "ATAC" \
                  -g /mnt/volume/oak/stanford/groups/akundaje/soumyak/refs/hg38/GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta \
                  -c /mnt/volume/oak/stanford/groups/akundaje/soumyak/refs/hg38/GRCh38_EBV.chrom.sizes.tsv \
                  -p /mnt/volume/oak/stanford/groups/engreitz/Projects/IGVF-Y2AVE/outputs/230601_iPSC_art_ven_EC_10xmultiome_Cluster0/Peaks/macs2_peaks.narrowPeak.sorted.candidateRegions.bed \
                  -n /mnt/volume/oak/stanford/groups/engreitz/Projects/IGVF-Y2AVE/outputs/230601_iPSC_art_ven_EC_10xmultiome_Cluster0/fold_0/_negatives.bed \
                  -fl /mnt/volume/oak/stanford/groups/akundaje/anusri/chrombpnet_data/input_files/folds/fold_0.json \
                  -o /mnt/volume/oak/stanford/groups/engreitz/Projects/IGVF-Y2AVE/outputs/230601_iPSC_art_ven_EC_10xmultiome_Cluster0/fold_0/pipeline_output/ \
                  | tee "$output_file"

then it leads to the following error:

FileNotFoundError: [Errno 2] No such file or directory: '/mnt/volume/oak/stanford/groups/engreitz/Projects/IGVF-Y2AVE/outputs/230601_iPSC_art_ven_EC_10xmultiome_Cluster0/fold_2/pipeline_output/auxiliary/filtered.peaks.bed'

Please allow me to save both your and my time. The reason why the code doesn't work is as I provided in the first comment. qc is expecting the output files from train to exist, but the exist_ok=False flag of makedirs() prevents that from working. See my PR for the fixes.

@panushri25
Collaborator

Hello @ivyraine, I appreciate your intention to save both your and my time. But your PR suggests a fix that bypasses a folder-existence check, which is important to prevent overwriting existing folders/files.

Allow me some time to reproduce this and fix it.

@bytewife
Member Author

Gotcha- TY

@panushri25
Collaborator

Also, your fix won't work: the filtered.peaks.bed from the chrombpnet train command will be at /mnt/volume/oak/stanford/groups/engreitz/Projects/IGVF-Y2AVE/outputs/230601_iPSC_art_ven_EC_10xmultiome_Cluster0/fold_0/chrombpnet_model/auxiliary/filtered.peaks.bed

But chrombpnet qc is looking for it here: /mnt/volume/oak/stanford/groups/engreitz/Projects/IGVF-Y2AVE/outputs/230601_iPSC_art_ven_EC_10xmultiome_Cluster0/fold_2/pipeline_output/auxiliary/filtered.peaks.bed

Was your fix to change exist_ok to True and just pass the output dir from train to qc?
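To make the mismatch concrete, here is a small sketch that builds both locations from their respective output directories. The short paths are hypothetical placeholders for the long ones in this thread, and the join pattern is inferred from the tracebacks rather than copied from the chrombpnet source.

```python
import os

# Hypothetical short paths standing in for the long ones in this thread.
train_output_dir = "/data/fold_0/chrombpnet_model"
qc_output_dir = "/data/fold_0/pipeline_output"

# Where train is said to write the file.
written = os.path.join(train_output_dir, "auxiliary", "filtered.peaks.bed")
# Where qc tried to open it, per the FileNotFoundError.
expected = os.path.join(qc_output_dir, "auxiliary", "filtered.peaks.bed")

print(written)
print(expected)
print(written == expected)
```

Unless the two output directories are the same, the two paths differ, so setting exist_ok=True alone would not let qc find the file.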

@bytewife
Member Author

I just symlinked the subdirs produced in the train output dir into the qc output dir. But you're right, it would be better if it was clear that the user needs to provide the train outputs as well. Perhaps it would be best if qc had another required flag --train-output, which would be the output dir of the train command. What do you think?
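A rough sketch of how such a flag might look, assuming argparse. The parser, the function names parse_qc_args and resolve_qc_paths, and the --train-output flag itself are all hypothetical and do not exist in chrombpnet today. The idea is to read filtered.peaks.bed from the train directory while keeping the exist_ok=False guard on the qc output directory.

```python
import argparse
import os


def parse_qc_args(argv):
    # Hypothetical parser: only the two flags relevant to this proposal.
    parser = argparse.ArgumentParser(prog="qc-sketch")
    parser.add_argument("-o", "--output-dir", required=True,
                        help="fresh directory for qc outputs")
    parser.add_argument("--train-output", required=True,
                        help="output directory of a finished train run")
    return parser.parse_args(argv)


def resolve_qc_paths(args):
    # Read the train artifact from the train directory; nothing is written there.
    peaks = os.path.join(args.train_output, "auxiliary", "filtered.peaks.bed")
    if not os.path.isfile(peaks):
        raise FileNotFoundError(f"expected train output at {peaks}")
    # Keep the overwrite guard: fail if the qc auxiliary folder already exists.
    os.makedirs(os.path.join(args.output_dir, "auxiliary"), exist_ok=False)
    return peaks
```

This would keep the overwrite protection on the qc side and make the dependency on train outputs explicit, without symlinks.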

@panushri25
Collaborator

I think the chrombpnet qc command needs to be restructured a bit based on some utilities added recently (re: filtering of peaks at the edge). I will think about how to do this and get back to you.


3 participants