Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Require sequence dictionary MD5s when writing CRAM. #1612

Draft
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

cmnbroad
Copy link
Collaborator

@cmnbroad cmnbroad commented Jun 1, 2022

Fixes #506 (and by extension will also fix broadinstitute/gatk#5195). If the SAM header used to create a CRAM doesn't have a sequence dictionary with MD5s, the code will attempt to use consult the reference's dictionary to inject them if possible. Otherwise, the write operation will fail with an error message.

@cmnbroad cmnbroad marked this pull request as ready for review October 4, 2022 16:34
@cmnbroad cmnbroad marked this pull request as draft October 4, 2022 16:38
Copy link
Member

@lbergelson lbergelson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cmnbroad I wasn't sure if you re-marked it as draft by accident or on purpose. It looks like it needs some touching up though. There's a bunch of stuff that I think should be changed though. Sorry if you're already aware / working on that.

// CRAMs are required to have a SAMHeader with a sequence dictionary that contains MD5
// checksums for each sequence. Check the proposed dictionary to see if it has MD5s, and
// if not, attempt to update it with MD5s from the reference dictionary, if one exists.
repairedHeader = repairSequenceMD5s(header, new HtsPath(referenceFasta.toUri().toString()));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An aside, should we add direct HtsPath(Path) HtsPath(File) constructors?

repairedHeader.setSequenceDictionary(newDictionary);
return repairedHeader;
} catch (final IOException | RuntimeException e) {
throw new RuntimeException(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be a CRAM or SAMException instead?

"%s Attempting to use the reference dictionary to repair missing MD5s required for CRAM output.",
missingMD5Message));
final IOPath referenceDictionary = SAMSequenceDictionary.getFastaDictionaryFileName(referenceFasta);
try (final InputStream fastaDictionaryStream = referenceDictionary.getInputStream()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems really awkward:

  1. We already have SAMSequenceDictionaryExtractor and this reimplements what it does although it might be more efficient.
  2. Immediately after this method returns we open a ReferenceSource which owns a ReferenceSequenceFile which can be queried directly for a sequence dictionary.
  3. This is IOPath aware, which is good, but it will fail in any complex cases in the future like using RefGt where there's no simple way to go from a path name to a dictionary file.
  4. I think the reference fasta can be null in the case the program wants to use the default one.

I think we should eliminate these lines and the new methods on SAMSequenceDictionary and instead add an accessor for the sequence dictionary to ReferenceSource. Then do the repair call after opening the reference source and get the sequence dictionary directly from the ReferenceSource.

/**
* Given a fasta filename, return the name of the corresponding dictionary file.
*/
public static IOPath getFastaDictionaryFileName(IOPath fastaFile) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should be adding more filename guessing and munging here. I think we can rely on ReferenceSequenceFile to find and provide the SequenceDictionary.

* @param fastaDictionaryStream InputStream connected to a fasta dictionary
* @return the SAMSequenceDictionary from the fastaDictionaryStream
*/
public static SAMSequenceDictionary loadSAMSequenceDictionary( final InputStream fastaDictionaryStream ) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is probably a good thing to have here. Can you update SAMSequenceDictionaryExtractor to call this method?

@@ -148,6 +148,23 @@ public SAMSequenceRecord setMd5(final String value) {
return this;
}

/**
* Set the M5 attribute for this sequence using the provided sequenceBases.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo: M5 -> MD5

public SAMSequenceRecord setComputedMd5(final byte[] sequenceBases) {
try {
final MessageDigest md5Digester = MessageDigest.getInstance("MD5");
if (md5Digester != null) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method seems broken. It doesn't actually digest the bases. Also, the bases should be normalized.

Why not use SequenceUtil.calculateMD5String()?

/**
* @return all sequences in this dictionary that do not have an MD5 attribute
*/
List<SAMSequenceRecord> getSequencesWithMissingMD5s() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These seem like very specific methods. It's a bit funny to have them here as package private. I wonder if it makes more sense to move them to the callsite.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants