Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: porting ngsderive #35

Open
wants to merge 91 commits into
base: main
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
91 commits
Select commit Hold shift + click to select a range
0c01978
feat(derive): new readlen subcommand
a-frantz Dec 3, 2023
6bb2c2a
fix(derive/command/readlen): proper error message
a-frantz Dec 4, 2023
23d8a6c
[WIP] protoype for endedness and skeleton of encoding
a-frantz Dec 4, 2023
5ed13e6
[WIP]: doesn't compile. Begin implementing RPT calculations
a-frantz Dec 5, 2023
c2fe1d8
revise: applies Clay's edits
claymcleod Dec 5, 2023
b4ce76b
[WIP]: Broken, not compiling. starting to calc RPT.
a-frantz Dec 6, 2023
9aeebf4
[WIP]
a-frantz Dec 6, 2023
137beea
feat: working derive endedness subcommand
a-frantz Dec 7, 2023
d1cac4d
chore: better comments/test name/log message
a-frantz Dec 8, 2023
d92368f
fix: apply some of Clay's performance suggestions
a-frantz Dec 9, 2023
7e6ad22
fix(derive/endedness/compute.rs): test updates
a-frantz Dec 9, 2023
34bf08d
chore(derive/readlen/compute): cleanup
a-frantz Dec 9, 2023
43aaec3
fix(derive/command/endedness): try using HashMap instead of Trie
a-frantz Dec 10, 2023
553fd68
feat(derive/command/endedness): lazy record reading
a-frantz Dec 11, 2023
1b1117c
Revert "feat(derive/command/endedness): lazy record reading"
a-frantz Dec 14, 2023
b1e9e86
tests(derive/endedness/comput): reimplement tests
a-frantz Dec 14, 2023
ae133d6
[WIP] suggestions from @zaeleus. Compiler error
a-frantz Dec 14, 2023
e3b8e85
tests(derive/command/endedness): rewrite tests with latest changes
a-frantz Dec 15, 2023
8b23de0
chore(endedness): handle error where RG tag can't be parsed as_str()
a-frantz Dec 15, 2023
65ea501
fix(endedness): remove all Arcs and lazy_statics
a-frantz Dec 15, 2023
76305f2
refactor(derive/readlen): move num_samples counting to outer func
a-frantz Dec 15, 2023
d63eb2a
perf(derive/readlen): don't iterate through all read_lengths
a-frantz Dec 15, 2023
cb1d2f5
feat(derive/endedness): add `validate_read_group_info()` call
a-frantz Dec 15, 2023
9b45316
fix: typos
a-frantz Dec 15, 2023
520ae46
revert
a-frantz Dec 15, 2023
f280c0f
revert
a-frantz Dec 15, 2023
12f3f66
fix: corrections made after previous reverts
a-frantz Dec 15, 2023
81511e2
tests(derive/endedness): reimplement tests
a-frantz Dec 15, 2023
55ee258
chore(derive): disable index checking when not needed
a-frantz Dec 16, 2023
35ccc3f
chore:(derive/readlen): return an anyhow::Ok instead of plain Ok
a-frantz Dec 16, 2023
e8c4816
docs(derive/readlen): correction in module name
a-frantz Dec 16, 2023
7e1578b
docs(derive/endedness): fix docs referring to wrong subcommand
a-frantz Dec 16, 2023
eab7cf9
fix: cap QNAME warnings to 100 QNAMES
a-frantz Dec 18, 2023
019d61d
feat(src/derive): use NumberOfRecords and RecordCounter structs
a-frantz Dec 20, 2023
dab1dca
style: make arg_in_range() nicer everywhere its used
a-frantz Dec 20, 2023
489c2bb
[WIP]: junction annotation
a-frantz Dec 22, 2023
14db813
feat(derive): junction-annotation subcommand
a-frantz Dec 26, 2023
bf31648
chore: remove radix_trie dep
a-frantz Dec 26, 2023
390dc1f
feat(derive/junction-annotation): better results reporting
a-frantz Dec 26, 2023
3a7051f
docs(derive/junction_annotation): be more clear in results docs
a-frantz Dec 27, 2023
5e3cb89
feat(derive/junction_annotation): add short options to params
a-frantz Dec 28, 2023
1ea4154
chore: return anyhow::Ok where appropriate
a-frantz Dec 28, 2023
ee77abd
tests: add a test for process() and summarize()
a-frantz Dec 28, 2023
65c4935
chore: typos
a-frantz Dec 28, 2023
6fcbd60
feat: remove fuzzy searching ability. Boosts performance as well.
a-frantz Jan 10, 2024
d872c82
feat: first pass implementation of `encoding`
a-frantz Feb 6, 2024
89b60e0
[WIP] to share code. Partial strandedness implementation
a-frantz Feb 8, 2024
d5b5b5c
fix(derive/endedness): add logic for 0x1 bit
a-frantz Feb 8, 2024
fe6eabd
[WIP] pushing to share. Partial strandedness implementation
a-frantz Feb 8, 2024
aaaf1ac
style: rename ignored_flags to filtered_by_flags
a-frantz Feb 8, 2024
9b4f6ea
feat: derive strandedness (prototype)
a-frantz Feb 8, 2024
33f5afc
style: much prettier code. One broken test.
a-frantz Feb 9, 2024
3c15c10
refactor: make read_groups util nicer
a-frantz Feb 9, 2024
7f95d34
refactor(derive/strandedness): separate out results from compute
a-frantz Feb 9, 2024
d8d9413
fix(strandedness): break when successful
a-frantz Feb 9, 2024
b084993
docs: junction_annotation to junction-annotation
a-frantz Feb 9, 2024
7767da4
[WIP]: switch min_mapq to a proper MappingQuality
a-frantz Feb 9, 2024
64318f3
fix: rework MappingQuality argument to work
a-frantz Feb 10, 2024
2280e15
style: f32 -> f64 and code cleanup
a-frantz Feb 10, 2024
8f90308
feat: add a log_every param to RecordCounter
a-frantz Feb 10, 2024
85d9f19
style: code clean up
a-frantz Feb 10, 2024
54dc217
style: use custom Strand enum more
a-frantz Feb 10, 2024
725b936
fix: junction-annotation works again
a-frantz Feb 10, 2024
9668a6d
feat(derive/instrument): behave more like other derive commands
a-frantz Feb 11, 2024
f569b38
fix(derive): print!(output) -> println!(output)
a-frantz Feb 11, 2024
6a23773
fix(derive/strandedness): move RG validation out of compute
a-frantz Feb 11, 2024
9be3850
tests(derive/endedness): fix the broken tests
a-frantz Feb 11, 2024
2eac8d7
style(derive/junction-annotation): group reported junctions by contig
a-frantz Feb 11, 2024
b67bac8
fix: lots of code cleanup
a-frantz Feb 11, 2024
e1473a6
tests(derive): all derive commands have tests
a-frantz Feb 11, 2024
42d09dc
fix: consistently return Options for String results
a-frantz Feb 12, 2024
55b3bc7
style: prettify imports
a-frantz Feb 12, 2024
ded9a89
style: wrap optional variable in Option
a-frantz Feb 12, 2024
11dc18d
docs: fix intra links
a-frantz Feb 12, 2024
4225ec0
tests: fix broken gene test
a-frantz Feb 12, 2024
780d3bc
feat(derive/instrument): more debug statements
a-frantz Feb 12, 2024
09a2be2
feat(derive/instrument): in output, report found unique names
a-frantz Feb 13, 2024
05a53fc
feat: more info in JSON report. Not complete yet. Open TODOs
a-frantz Feb 13, 2024
eae95c7
style(derive/instrument): a bit of code clean up
a-frantz Feb 13, 2024
00999c1
docs: filling in TODOs
a-frantz Feb 13, 2024
2661300
chore: delete dead code
a-frantz Feb 14, 2024
bfac8a1
tests(derive/junction-annotation): rewrite tests more modular
a-frantz Feb 14, 2024
7ab84f2
stlye: Michael M. feedback
a-frantz Feb 22, 2024
68f999c
Apply suggestions from code review
a-frantz Mar 22, 2024
451a2a4
fix(derive/endedness): complete rename from last commit
a-frantz Mar 22, 2024
aa0880d
feat: use NonZeroUsize for Number of Records
a-frantz Mar 22, 2024
406a7b8
feat(utils/args): improved behavior for NumberOfRecords CL utility
a-frantz Mar 23, 2024
f83015f
feat(derive): report by read group where appropriate and feasible
a-frantz Mar 23, 2024
cf3a869
chore: removing dead code
a-frantz Mar 25, 2024
7bbb7e5
tests(derive/instrument): assert that read groups succeed
a-frantz Mar 25, 2024
6091af4
fix(instrument): properly init flowcell entries
a-frantz Aug 26, 2024
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions Cargo.lock

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

1 change: 1 addition & 0 deletions Cargo.toml
Original file line number Diff line number Diff line change
Expand Up @@ -19,6 +19,7 @@ git-testament = "0.2.1"
indexmap = "1.9.1"
indicatif = "0.16.2"
itertools = "0.10.5"
lazy_static = "1.4.0"
noodles = { version = "0.34.0", features = [
"async",
"bam",
Expand Down
4 changes: 2 additions & 2 deletions src/convert/bam.rs
Original file line number Diff line number Diff line change
Expand Up @@ -45,7 +45,7 @@ pub async fn to_sam_async(
.await
.with_context(|| "writing SAM header")?;

let mut counter = RecordCounter::new();
let mut counter = RecordCounter::default();
let mut record = Record::default();

// (4) Write each record in the BAM file to the SAM file.
Expand Down Expand Up @@ -131,7 +131,7 @@ pub async fn to_cram_async(
.await
.with_context(|| "writing CRAM file header")?;

let mut counter = RecordCounter::new();
let mut counter = RecordCounter::default();
let mut record = Record::default();

// (6) Write each record in the BAM file to the CRAM file.
Expand Down
11 changes: 8 additions & 3 deletions src/convert/command.rs
Original file line number Diff line number Diff line change
Expand Up @@ -31,8 +31,13 @@ pub struct ConvertArgs {
to: PathBuf,

/// Number of records to process before exiting the conversion.
#[arg(short = 'n', long, value_name = "USIZE")]
num_records: Option<usize>,
#[arg(
short,
long,
default_value_t,
value_name = "'all' or a positive, non-zero integer"
)]
num_records: NumberOfRecords,

/// If available, the FASTA reference file used to generate the file.
#[arg(short, long)]
Expand Down Expand Up @@ -91,7 +96,7 @@ pub fn convert(args: ConvertArgs) -> anyhow::Result<()> {
// Number of Records //
//===================//

let max_records = NumberOfRecords::from(args.num_records);
let max_records = args.num_records;

//==========================//
// Bioinformatics File Pair //
Expand Down
4 changes: 2 additions & 2 deletions src/convert/cram.rs
Original file line number Diff line number Diff line change
Expand Up @@ -52,7 +52,7 @@ pub async fn to_sam_async(
.with_context(|| "writing SAM header")?;

// (5) Write each record in the CRAM file to the SAM file.
let mut counter = RecordCounter::new();
let mut counter = RecordCounter::default();
let mut records = reader.records(&repository, &header.parsed);

while let Some(record) = records
Expand Down Expand Up @@ -125,7 +125,7 @@ pub async fn to_bam_async(
.with_context(|| "writing BAM reference sequences")?;

// (6) Write each record in the CRAM file to the BAM file.
let mut counter = RecordCounter::new();
let mut counter = RecordCounter::default();
let mut records = reader.records(&repository, &header.parsed);

while let Some(record) = records
Expand Down
4 changes: 2 additions & 2 deletions src/convert/sam.rs
Original file line number Diff line number Diff line change
Expand Up @@ -58,7 +58,7 @@ pub async fn to_bam_async(
.await
.with_context(|| "writing BAM reference sequences")?;

let mut counter = RecordCounter::new();
let mut counter = RecordCounter::default();
let mut record = Record::default();

// (5) Write each record in the BAM file to the SAM file.
Expand Down Expand Up @@ -151,7 +151,7 @@ pub async fn to_cram_async(
.await
.with_context(|| "writing CRAM file header")?;

let mut counter = RecordCounter::new();
let mut counter = RecordCounter::default();
let mut record = Record::default();

// (6) Write each record in the SAM file to the CRAM file.
Expand Down
5 changes: 5 additions & 0 deletions src/derive.rs
Original file line number Diff line number Diff line change
@@ -1,4 +1,9 @@
//! Functionality related to the `ngs derive` subcommand.

pub mod command;
pub mod encoding;
pub mod endedness;
pub mod instrument;
pub mod junction_annotation;
pub mod readlen;
pub mod strandedness;
22 changes: 22 additions & 0 deletions src/derive/command.rs
Original file line number Diff line number Diff line change
@@ -1,6 +1,11 @@
//! Functionality related to the `ngs derive` subcommand itself.

pub mod encoding;
pub mod endedness;
pub mod instrument;
pub mod junction_annotation;
pub mod readlen;
pub mod strandedness;

use clap::Args;
use clap::Subcommand;
Expand All @@ -20,6 +25,23 @@ pub struct DeriveArgs {
/// All possible subcommands for `ngs derive`.
#[derive(Subcommand)]
pub enum DeriveSubcommand {
/// Derives the quality score encoding used to produce the file.
Encoding(self::encoding::DeriveEncodingArgs),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI, we're going to change how this is named slightly in the future.


/// Derives the endedness of the file.
Endedness(self::endedness::DeriveEndednessArgs),

/// Derives the instrument used to produce the file.
Instrument(self::instrument::DeriveInstrumentArgs),

/// Derives the read length of the file.
Readlen(self::readlen::DeriveReadlenArgs),

/// Derives the strandedness of the RNA-Seq file.
Strandedness(self::strandedness::DeriveStrandednessArgs),

/// Annotates junctions in the file.
/// Note that, technically, this command doesn't derive anything—it will moved in the future to a better home.
/// convenience.
JunctionAnnotation(self::junction_annotation::JunctionAnnotationArgs),
}
78 changes: 78 additions & 0 deletions src/derive/command/encoding.rs
Original file line number Diff line number Diff line change
@@ -0,0 +1,78 @@
//! Functionality relating to the `ngs derive encoding` subcommand itself.

use anyhow::{Context, Ok};
use clap::Args;
use noodles::bam;
use num_format::{Locale, ToFormattedString};
use std::collections::HashSet;
use std::io::BufReader;
use std::path::PathBuf;
use tracing::info;

use crate::derive::encoding::compute;
use crate::utils::args::NumberOfRecords;
use crate::utils::display::RecordCounter;

/// Clap arguments for the `ngs derive encoding` subcommand.
#[derive(Args)]
pub struct DeriveEncodingArgs {
/// Source BAM.
#[arg(value_name = "BAM")]
src: PathBuf,

/// Examine the first `n` records in the file.
#[arg(
short,
long,
default_value_t,
value_name = "'all' or a positive, non-zero integer"
)]
num_records: NumberOfRecords,
}

/// Main function for the `ngs derive encoding` subcommand.
pub fn derive(args: DeriveEncodingArgs) -> anyhow::Result<()> {
info!("Starting derive encoding subcommand.");

let file = std::fs::File::open(args.src);
let reader = file
.map(BufReader::new)
.with_context(|| "opening BAM file")?;
let mut reader = bam::Reader::new(reader);
let _header: String = reader.read_header()?.parse()?;
reader.read_reference_sequences()?;

let mut score_set: HashSet<u8> = HashSet::new();

// (1) Collect quality scores from reads within the
// file. Support for sampling only a portion of the reads is provided.
let mut counter = RecordCounter::default();
for result in reader.lazy_records() {
let record = result?;

for i in 0..record.quality_scores().len() {
let score = record.quality_scores().as_ref()[i];
score_set.insert(score);
}

counter.inc();
if counter.time_to_break(&args.num_records) {
break;
}
}

info!(
"Processed {} records.",
counter.get().to_formatted_string(&Locale::en)
);

// (2) Derive encoding from the observed quality scores
let result = compute::predict(score_set)?;

// (3) Print the output to stdout as JSON (more support for different output
// types may be added in the future, but for now, only JSON).
let output = serde_json::to_string_pretty(&result).unwrap();
println!("{}", output);

Ok(())
a-frantz marked this conversation as resolved.
Show resolved Hide resolved
}
164 changes: 164 additions & 0 deletions src/derive/command/endedness.rs
Original file line number Diff line number Diff line change
@@ -0,0 +1,164 @@
//! Functionality relating to the `ngs derive endedness` subcommand itself.

use anyhow::Context;
use clap::Args;
use num_format::{Locale, ToFormattedString};
use std::collections::{HashMap, HashSet};
use std::path::PathBuf;
use std::sync::Arc;
use tracing::{info, trace};

use crate::derive::endedness::compute;
use crate::derive::endedness::compute::OrderingFlagsCounts;
use crate::utils::args::arg_in_range as deviance_in_range;
use crate::utils::args::NumberOfRecords;
use crate::utils::display::RecordCounter;
use crate::utils::formats::bam::ParsedBAMFile;
use crate::utils::formats::utils::IndexCheck;
use crate::utils::read_groups::{get_read_group, validate_read_group_info, ReadGroupPtr};

/// Clap arguments for the `ngs derive endedness` subcommand.
#[derive(Args)]
pub struct DeriveEndednessArgs {
/// Source BAM.
#[arg(value_name = "BAM")]
src: PathBuf,

/// Examine the first `n` records in the file.
#[arg(
short,
long,
default_value_t,
value_name = "'all' or a positive, non-zero integer"
)]
num_records: NumberOfRecords,

/// Distance from 0.5 split between number of f+l- reads and f-l+ reads
/// allowed to be called 'Paired-End'. The default value of `0.0` is only appropriate
/// if the whole file is being processed.
#[arg(long, value_name = "F64", default_value = "0.0")]
paired_deviance: f64,

/// Calculate and output Reads-Per-Template. This will produce a more
/// sophisticated estimate for endedness, but uses substantially more memory.
#[arg(long, default_value = "false")]
calculate_reads_per_template: bool,

/// Round RPT to the nearest INT before comparing to expected values.
/// Appropriate if using `-n` > 0. Unrounded value is reported in output.
#[arg(long, default_value = "false")]
round_reads_per_template: bool,
}

/// Main function for the `ngs derive endedness` subcommand.
pub fn derive(args: DeriveEndednessArgs) -> anyhow::Result<()> {
// (0) Parse arguments needed for subcommand.
let paired_deviance = deviance_in_range(args.paired_deviance, 0.0..=0.5)
.with_context(|| "Paired deviance is not within acceptable range")?;

info!("Starting derive endedness subcommand.");

let mut found_rgs = HashSet::new();

let mut ordering_flags: HashMap<ReadGroupPtr, OrderingFlagsCounts> = HashMap::new();

// only used if args.calc_rpt is true
let mut read_names: Option<HashMap<String, Vec<ReadGroupPtr>>> = None;

let ParsedBAMFile {
mut reader, header, ..
} = crate::utils::formats::bam::open_and_parse(args.src, IndexCheck::None)?;

// (1) Collect ordering flags (and QNAMEs) from reads within the
// file. Support for sampling only a portion of the reads is provided.
let mut counter = RecordCounter::default();
for result in reader.records(&header.parsed) {
let record = result?;

// Only count primary alignments and unmapped reads.
if (record.flags().is_secondary() || record.flags().is_supplementary())
&& !record.flags().is_unmapped()
{
continue;
}

let read_group = get_read_group(&record, Some(&mut found_rgs));

if args.calculate_reads_per_template {
let read_name_map = read_names.get_or_insert_with(HashMap::new);
match record.read_name() {
Some(rn) => {
let rn = rn.to_string();
let rg_vec = read_name_map.get_mut(&rn);

match rg_vec {
Some(rg_vec) => {
rg_vec.push(Arc::clone(&read_group));
}
None => {
read_name_map.insert(rn, vec![(Arc::clone(&read_group))]);
}
}
}
None => {
trace!("Could not parse a QNAME from a read in the file.");
trace!("Skipping this read and proceeding.");
continue;
}
}
}

match (
record.flags().is_segmented(),
record.flags().is_first_segment(),
record.flags().is_last_segment(),
) {
(false, _, _) => {
ordering_flags.entry(read_group).or_default().unsegmented += 1;
}
(true, true, false) => {
ordering_flags.entry(read_group).or_default().first += 1;
}
(true, false, true) => {
ordering_flags.entry(read_group).or_default().last += 1;
}
(true, true, true) => {
ordering_flags.entry(read_group).or_default().both += 1;
}
(true, false, false) => {
ordering_flags.entry(read_group).or_default().neither += 1;
}
}

counter.inc();
if counter.time_to_break(&args.num_records) {
break;
}
}

info!(
"Processed {} records.",
counter.get().to_formatted_string(&Locale::en)
);

// (2) Validate the read group information.
let rgs_in_header_not_records = validate_read_group_info(&found_rgs, &header.parsed);
for rg_id in rgs_in_header_not_records {
ordering_flags.insert(Arc::new(rg_id), OrderingFlagsCounts::new());
}

// (3) Derive the endedness based on the ordering flags gathered.
let result = compute::predict(
ordering_flags,
read_names,
paired_deviance,
args.round_reads_per_template,
);

// (4) Print the output to stdout as JSON (more support for different output
// types may be added in the future, but for now, only JSON).
let output = serde_json::to_string_pretty(&result).unwrap();
println!("{}", output);

anyhow::Ok(())
}
Loading
Loading