VCFHeader and VCFHeaderLine rewrite/refactoring #1581

Open
wants to merge 12 commits into
base: master

@codecov-commenter
codecov-commenter commented Nov 9, 2021

Codecov Report

Merging #1581 (3084903) into master (347c0ac) will increase coverage by 0.022%.
The diff coverage is 79.088%.

❗ Current head 3084903 differs from pull request most recent head a4c529d. Consider uploading reports for the commit a4c529d to get more accurate results

@@               Coverage Diff               @@
##              master     #1581       +/-   ##
===============================================
+ Coverage     69.856%   69.878%   +0.022%     
- Complexity      9695      9832      +137     
===============================================
  Files            703       706        +3     
  Lines          37772     38022      +250     
  Branches        6139      6155       +16     
===============================================
+ Hits           26386     26569      +183     
- Misses          8929      8965       +36     
- Partials        2457      2488       +31     
Impacted Files Coverage Δ
...in/java/htsjdk/samtools/SAMSequenceDictionary.java 75.000% <ø> (ø)
...c/main/java/htsjdk/variant/vcf/VCFRecordCodec.java 0.000% <0.000%> (ø)
src/main/java/htsjdk/variant/vcf/VCFCodec.java 25.000% <22.222%> (-52.273%) ⬇️
src/main/java/htsjdk/variant/vcf/VCFUtils.java 45.570% <28.571%> (-3.706%) ⬇️
src/main/java/htsjdk/variant/vcf/VCF3Codec.java 57.895% <50.000%> (-5.520%) ⬇️
...n/java/htsjdk/variant/vcf/VCFFormatHeaderLine.java 66.667% <57.895%> (-3.333%) ⬇️
...tsjdk/variant/variantcontext/writer/VCFWriter.java 77.451% <65.000%> (-5.345%) ⬇️
...main/java/htsjdk/variant/vcf/VCFHeaderVersion.java 75.000% <66.667%> (+3.125%) ⬆️
...n/java/htsjdk/variant/vcf/VCFSampleHeaderLine.java 66.667% <66.667%> (-33.333%) ⬇️
...main/java/htsjdk/variant/vcf/VCFAltHeaderLine.java 72.727% <68.421%> (-27.273%) ⬇️
... and 22 more

... and 69 files with indirect coverage changes

Contributor

@andersleung andersleung left a comment

A nitpicky detail: some uses of "its" and "it's" in comments are backwards.

for (final VCFHeader sourceHeader : headers) {
for (final VCFHeaderLine line : sourceHeader.getMetaDataInSortedOrder()) {
final String key = line.getKey();
if (VCFHeaderVersion.isFormatString(key) || key.equals(VCFHeader.CONTIG_KEY)) {
Contributor

I know we've had this conversation before, but just putting this here to see what the other reviewers think of this idea. This seems like another case where internally representing the fileformat line as just another header line doesn't seem to make much sense, because its semantics are special.

VCFHeader::getMetaDataIn{Input,Sorted}Order returns the fileformat line along with all other lines, but it needs to be dropped, special-cased, or converted back into a VCFHeaderVersion for a lot of our applications. A similar pattern (scanning for the fileformat line and dropping it) occurs in VCFMetaDataLines::validateMetaDataLines and VCFWriter::writeHeader. The functionality of storing the version is duplicated by the VCFHeaderVersion field as well, and the two have to be kept in sync (technically it is triplicated, by the version field in VCFHeader).

Storing it only as a VCFHeaderVersion field rather than as a line, and converting fileformat lines to version objects when they're added to the header, might be cleaner in my opinion. VCFHeader::getMetaDataIn{Input,Sorted}Order would still have to include the fileformat line to keep behaviour the same.

Obviously we'll still have to accept fileformat lines at the API boundary.

Collaborator Author

Removing the redundant VCFHeaderVersion from VCFHeader makes sense now I think - we'll just need to query VCFMetaDataLines before and after every change, but that's cheap.

Changing VCFMetaDataLines to not model the version as a line in the list will result in awkward semantics and implementation for things like removeMetaDataLine, findEquivalentHeaderLine, getOtherHeaderLines, but I don't have a strong opinion on that - as long as we keep the awkward public header API, there will be awkward semantics one way or another. I'll make the change and we can see which way we think is better.

None of that will address the external consumers though, such as VCFWriter, which is still going to have to strip the file format line manually, since we need to hand them out at the public boundary to be compatible with past behavior. The right long term fix is to have a real codec that encodes as well as decodes, such as we have in the CZI package.
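
For readers following the thread, the "version as a field, fileformat line only at the API boundary" idea can be sketched roughly as follows. This is an illustrative sketch only, not htsjdk code: `SketchHeader`, its `Version` enum (a stand-in for VCFHeaderVersion) and all method names are hypothetical, and header lines are modelled as plain `key=value` strings.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not htsjdk code: the version lives in exactly one field, and a
// "fileformat" line is converted to or from that field only at the public boundary.
final class SketchHeader {
    enum Version { VCF4_2, VCF4_3 }   // stand-in for VCFHeaderVersion

    private static final String FILEFORMAT_PREFIX = "fileformat=";

    private Version version;                                     // single source of truth
    private final List<String> otherLines = new ArrayList<>();   // everything except fileformat

    // Accepts raw "key=value" lines; a fileformat line updates the field instead of being stored.
    void addLine(final String line) {
        if (line.startsWith(FILEFORMAT_PREFIX)) {
            version = parseVersion(line.substring(FILEFORMAT_PREFIX.length()));
        } else {
            otherLines.add(line);
        }
    }

    Version getVersion() {
        return version;
    }

    // The public view keeps the old behaviour: a fileformat line is synthesized and listed first.
    List<String> getLines() {
        final List<String> result = new ArrayList<>();
        if (version != null) {
            result.add(FILEFORMAT_PREFIX + formatVersion(version));
        }
        result.addAll(otherLines);
        return result;
    }

    private static Version parseVersion(final String text) {
        switch (text) {
            case "VCFv4.2": return Version.VCF4_2;
            case "VCFv4.3": return Version.VCF4_3;
            default: throw new IllegalArgumentException("Unknown version: " + text);
        }
    }

    private static String formatVersion(final Version v) {
        return v == Version.VCF4_2 ? "VCFv4.2" : "VCFv4.3";
    }
}
```

With this shape nothing has to be kept in sync internally, but, as noted in the reply above, consumers that receive the synthesized line at the boundary still have to strip it themselves.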

dict.add(idLine.getID());
seen.add(idLine.getID());
if (!line.isIDHeaderLine()) {
//is there a better way to ensure that shouldBeAddedToDictionary==true only when isIDHeaderLine==true
Contributor

I'm wondering whether shouldBeAddedToDictionary should be removed from VCFHeaderLine and replaced with something like a static method here, since it's a BCF 2.2 specific implementation detail.

Collaborator Author

Yes!

Collaborator Author

We should do that in the BCF branch.
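
A rough sketch of what moving the check into a static helper next to the dictionary-building code might look like. Everything here is hypothetical: `Line` is a minimal stand-in for a header line, and the rule for which keys qualify (ID-bearing INFO, FORMAT and FILTER lines) is an assumption made for illustration rather than something stated in this thread.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: the BCF-specific decision lives beside the dictionary builder
// instead of on the header line class.
final class BcfDictionarySketch {
    // Minimal stand-in for a header line: a key plus an optional ID (null for non-ID lines).
    static final class Line {
        final String key;
        final String id;

        Line(final String key, final String id) {
            this.key = key;
            this.id = id;
        }
    }

    // Assumption for illustration: only ID-bearing INFO, FORMAT and FILTER lines contribute.
    static boolean shouldBeAddedToDictionary(final Line line) {
        return line.id != null
                && (line.key.equals("INFO") || line.key.equals("FORMAT") || line.key.equals("FILTER"));
    }

    // Collects each qualifying ID once, preserving first-seen order.
    static List<String> buildDictionary(final List<Line> lines) {
        final Set<String> seen = new LinkedHashSet<>();
        for (final Line line : lines) {
            if (shouldBeAddedToDictionary(line)) {
                seen.add(line.id);
            }
        }
        return new ArrayList<>(seen);
    }
}
```

Because the helper already requires an ID, the "only true when isIDHeaderLine is true" invariant asked about in the code comment holds by construction.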

return Optional.of(validationFailure);
} else {
// warn for older versions - this line can't be used as a v4.3 line
logger.warn(validationFailure.getFailureMessage());
Contributor

I'm wondering whether it makes more sense to move the logging (and checking validation strictness) out of these functions making them pure, and into the functions that actually consume them and have side effects, which can choose to throw or log based on the validation strictness.
These functions also aren't totally consistent: the one in VCFHeaderLine doesn't check validation strictness.

Member

I like the idea of passing back the error and letting the caller handle it, but it looks like the superclass handler is intended to ignore validation strictness so I'm not sure that it would be easy to redesign.

I would lean towards making the validation stringency a parameter instead of grabbing it through defaults because it makes that dependency much less hidden and also easier to test.
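
The two suggestions in this thread (pure validation functions, and stringency passed in as a parameter) could be combined along these lines. This is a sketch under stated assumptions, not the PR's code: `ValidationSketch`, `Stringency`, `Failure` and both method names are made up for the example, and the warning sink is a plain `Consumer<String>` standing in for a logger.

```java
import java.util.Optional;
import java.util.function.Consumer;

// Hypothetical sketch: validation reports, the consumer decides what to do about it.
final class ValidationSketch {
    enum Stringency { STRICT, LENIENT }

    // Stand-in for a validation failure object: it only carries a message.
    static final class Failure {
        final String message;

        Failure(final String message) {
            this.message = message;
        }
    }

    // Pure: describes the problem but neither logs nor throws.
    static Optional<Failure> validateFlagCount(final int count) {
        return count == 0
                ? Optional.empty()
                : Optional.of(new Failure("FLAG fields must have a count of 0, but saw " + count));
    }

    // The side effects live here, and the stringency is an explicit argument rather than
    // something read from global defaults.
    static void handle(final Optional<Failure> result, final Stringency stringency, final Consumer<String> warn) {
        result.ifPresent(failure -> {
            if (stringency == Stringency.STRICT) {
                throw new IllegalStateException(failure.message);
            }
            warn.accept(failure.message);
        });
    }
}
```

A test can then exercise both policies by passing `Stringency.STRICT` or `Stringency.LENIENT` directly, without touching any process-wide setting.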

// to retain backward compatibility with previous implementations, we accept (and repair) and the line here.
updateGenericField(NUMBER_ATTRIBUTE, "0");
lineCount = 0;
logger.warn(String.format("FLAG fields must have a count value of 0, but saw count %d for header line %s. A value of 0 will be used",
Contributor

Small detail, but this logging should come before lineCount is updated, otherwise the message will print the contradictory warning "FLAG fields must have a count value of 0, but saw count 0".

Collaborator Author

Yeah good catch - I've been meaning to fix that since it looks weird every time I see it.
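
The ordering fix amounts to building the message from the offending value before repairing it. A minimal sketch, using a hypothetical `FlagCountRepair` class in place of the real header line and returning the warning text instead of logging it:

```java
// Hypothetical sketch of the ordering: capture the bad value in the message first, then repair.
final class FlagCountRepair {
    private int lineCount;

    FlagCountRepair(final int lineCount) {
        this.lineCount = lineCount;
    }

    // Returns the warning text; a real implementation would pass it to a logger.
    String repair() {
        final String warning = String.format(
                "FLAG fields must have a count value of 0, but saw count %d. A value of 0 will be used",
                lineCount);
        lineCount = 0;   // mutate only after the message has been formatted
        return warning;
    }

    int getLineCount() {
        return lineCount;
    }
}
```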

// We only allow going forward to a newer version, not backwards to an older one, since there
// is really no way to validate old header lines (pre vcfV4.2). The only way to create a header with
// an old version is to create it that way from the start.
// to be created with the old version from the start.
Contributor

Comment looks like an old line got left in accidentally

Collaborator Author

Done.

final Set<VCFHeaderVersion> headerVersions = new HashSet<>(2);
// a global mutable static - is there an alternative ?
// there isn't any other reasonable place to keep this state
private static boolean vcfStrictVersionValidation = true;
Contributor

I think it might be better to have this in Defaults, making it immutable over the course of one process, because the semantics of turning it on or off during one run might be unexpected. Starting with strict validation off and then turning it on doesn't trigger validation of existing header lines, for example, so we might still be outputting sloppily validated lines.

Collaborator Author

This setting is already hooked up to Defaults, but I forgot to delete this variable - so this just needs to be removed.

Collaborator Author

Done (removed).
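
The difference between the removed mutable static and a process-wide default can be illustrated with a small sketch. `SketchDefaults`, its `parse` helper and the property name `sketch.strict_vcf_version_validation` are all hypothetical; the real Defaults class has its own property lookup.

```java
// Hypothetical sketch: the flag is read once at class initialization and is final,
// so it cannot be flipped part-way through a run.
final class SketchDefaults {
    static final boolean STRICT_VCF_VERSION_VALIDATION =
            parse(System.getProperty("sketch.strict_vcf_version_validation"), true);

    // Falls back to the default when the property is not set.
    static boolean parse(final String value, final boolean defaultValue) {
        return value == null ? defaultValue : Boolean.parseBoolean(value);
    }

    private SketchDefaults() {
    }
}
```

Since the value is fixed for the life of the process, the "validation was off when these lines were added, then turned on" scenario described above cannot arise.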

@@ -330,6 +326,7 @@ public void TestWritingLargeVCF(final String extension) throws FileNotFoundExcep
@DataProvider(name = "vcfExtensionsDataProvider")
public Object[][]vcfExtensionsDataProvider() {
return new Object[][] {
//TODO: fix this BCF problem!
Contributor

I'll make a note to fix these tests in my BCF branch, but from what I can tell these failures are expected. Two of them (TestWritingLargeVCF, testBasicWriteAndRead) make sense to fix: they fail because of a missing standard header line (DP), and BCF requires every VC attribute to have a corresponding line in the header.

The other 2 (testWriteAndReadVCFHeaderless and testWriteWithEmptyHeader) don't make sense at all to test for BCF, because headerless or empty header BCFs aren't well formed.

{ "key=<", new VCFHeaderLine("key", "<") },
// taken from Funcotator test file as ##ID=<Description="ClinVar Variation ID">
// technically, this is invalid due to the lack of an "ID" attribute, but it should still parse
// into a VCFHeaderLine (but noa VCFSimpleHeaderLine
Contributor

Typo in "noa", missing closing )

Collaborator Author

Fixed.

return null;
} else if (lineList.size() > 1) {
throw new TribbleException(
String.format(
Contributor

This format string is missing a %s placeholder for the list of duplicated "other" lines argument
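
This kind of bug is easy to miss because `String.format` silently ignores arguments that have no matching placeholder. A small sketch with made-up message text (the real message in the PR is not shown in this excerpt):

```java
import java.util.List;

// Hypothetical sketch: the same arguments, with and without a placeholder for the list.
final class FormatPlaceholderSketch {
    // The duplicates argument has no placeholder, so it never appears in the message.
    static String withoutPlaceholder(final String key, final List<String> duplicates) {
        return String.format("Multiple lines found for key %s", key, duplicates);
    }

    // One placeholder per argument: the duplicated lines are reported too.
    static String withPlaceholder(final String key, final List<String> duplicates) {
        return String.format("Multiple lines found for key %s: %s", key, duplicates);
    }
}
```

No exception is thrown in the first case; the only symptom is a less informative message.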

@cmnbroad
Collaborator Author

Note: this needs to be resolved/updated to account for the final disposition of structured vs. unstructured lines, attribute ordering, and quoting, as described in samtools/hts-specs#642 #1610, and also samtools/hts-specs#620.

@ohofmann
ohofmann commented Mar 2, 2023

We've been discussing the lack of support for VCF 4.4 in IGV (igvteam/igv#1289) and started looking at the current htsjdk / VCF interaction. This PR looks like a major step forward to resolve issues around VCF 4.3 - thanks for all your work on this! Any chance of merging this soon so we can consider what changes would be required for 4.4?

@cmnbroad
Collaborator Author

cmnbroad commented Mar 6, 2023

@ohofmann Hi Oliver - yes, this PR addresses quite a few issues with VCF support in htsjdk, and is a prerequisite for a couple of other large PRs for VCF/BCF support. The team has been sidetracked for quite a while with upgrading several of these repos to Java 17, which turned out to be a lot more work than anticipated, but we finally got the main PR for that merged last week. We discussed this PR in our team meeting this morning, and we do plan to get it merged sometime in the next quarter. It has had a fair amount of review/discussion already, but it's quite large and will require a major version release, and has significant downstream test impact on projects such as GATK (I haven't tried to build IGV with it, but I expect the IGV fallout to be minimal). Hope that helps.

@ohofmann
ohofmann commented Mar 6, 2023

@cmnbroad Thanks for the update! It does help, though we may have to consider postponing the VCF 4.4 announcement / PR drive. I wouldn't want the first experience of many users to be an htsjdk error. Will keep an eye on this PR in the meantime.

@cmnbroad
Collaborator Author

@lbergelson I've rebased this on master/Java 17.

@cmnbroad cmnbroad mentioned this pull request Nov 7, 2023
5 tasks
Member

@lbergelson lbergelson left a comment

@cmnbroad I have a lot of comments. Some are more important than others. Some of them are much older than others also... There are a bunch of places where we can make use of new Java 17 (and soon 21) style features to clean up code and I've mentioned some of them. Feel free to ignore that though, we can always revisit.

I think it looks good overall. I found some things that I think need fixing but I think the overall approach is a huge improvement.

@@ -134,6 +139,7 @@ public class Defaults {
SAM_FLAG_FIELD_FORMAT = SamFlagField.valueOf(getStringProperty("sam_flag_field_format", SamFlagField.DECIMAL.name()));
SRA_LIBRARIES_DOWNLOAD = getBooleanProperty("sra_libraries_download", false);
DISABLE_SNAPPY_COMPRESSOR = getBooleanProperty(DISABLE_SNAPPY_PROPERTY_NAME, false);
STRICT_VCF_VERSION_VALIDATION = getBooleanProperty("strict_version_validation", true);
Member

I would make the property name also "strict_vcf_version_validation".

* order in the list
*/
//TODO: this method ignores (and actually mutates) the sequenceRecord's contig index to make it match
// the record's relative placement in the dictionary's internal list
Member

Wow, that's a nasty side effect that we've been living with forever.

}
writer.write(header.getHeaderFields().stream()
.map(f -> f.name())
.collect(Collectors.joining(VCFConstants.FIELD_SEPARATOR)).toString());
Member

I think this toString is unnecessary?
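
`Collectors.joining` already produces a `String`, so the trailing call adds nothing. A sketch with a stand-in enum and a tab separator (both assumptions made for the example; the PR uses its own header field type and VCFConstants.FIELD_SEPARATOR):

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch: joining enum names into one header row.
final class JoinSketch {
    enum HeaderField { CHROM, POS, ID }   // stand-in for the real header field enum

    // collect(Collectors.joining(...)) returns a String, so no toString() is needed.
    static String headerRow(final List<HeaderField> fields) {
        return fields.stream()
                .map(HeaderField::name)
                .collect(Collectors.joining("\t"));
    }
}
```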

// collect metadata lines until we hit the required header line, or a non-metadata line,
// in which case throw since there was no header line
while (lineIterator.hasNext()) {
final String line = lineIterator.peek();
Member

BUG! This goes into an infinite loop if it hits a line that doesn't start with either of these.
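
A peek-based loop terminates only if every iteration either consumes a line or exits. The sketch below shows one way to guarantee that, using a `Deque<String>` as a stand-in for the line iterator; the class name, method name and exception type are hypothetical, and the `##` / `#` prefixes are the usual VCF metadata and header-line markers.

```java
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical sketch of a header scan that cannot spin on an unexpected line.
final class HeaderScanSketch {
    // Collects "##" metadata lines up to and including the "#" header line.
    // Each iteration consumes a line, returns, or throws.
    static List<String> readHeaderLines(final Deque<String> lines) {
        final List<String> headerLines = new ArrayList<>();
        while (!lines.isEmpty()) {
            final String line = lines.peek();
            if (line.startsWith("##")) {
                headerLines.add(lines.poll());        // metadata line: consume and continue
            } else if (line.startsWith("#")) {
                headerLines.add(lines.poll());        // required header line: consume and stop
                return headerLines;
            } else {
                // Anything else means the header line is missing; leave the line unconsumed and fail.
                throw new IllegalStateException("No header line found before: " + line);
            }
        }
        throw new IllegalStateException("Input ended before a header line was found");
    }
}
```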

return this.header;
}
}
throw new TribbleException.InvalidHeader(
Member

It would be nice if there was information available at this point to say WHICH VCF file had the error, but I don't see any way to get that at the moment.


// List of expected tags (for this base class, its ID only; subclasses with more required tags
// should use a custom tag order if more required tags are expected
protected static final List<String> expectedTagOrder = Collections.unmodifiableList(
Member

yay for List.of()
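
For reference, `List.of` (Java 9 and later) returns an unmodifiable list directly, so the `Collections.unmodifiableList` wrapper is no longer needed. A minimal sketch with a hypothetical holder class:

```java
import java.util.List;

// Hypothetical sketch: an unmodifiable constant list without a wrapper call.
final class TagOrderSketch {
    // For the base class the only expected tag is ID.
    static final List<String> EXPECTED_TAG_ORDER = List.of("ID");

    private TagOrderSketch() {
    }
}
```

Attempts to modify the list throw `UnsupportedOperationException`, and `List.of` also rejects null elements, which the wrapper approach does not.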

* A class representing a VCF validation failure.
* @param <T> a type representing the object that is being validated
*/
class VCFValidationFailure<T> {
Member

This is returned by the public method validateForVersion so probably it should be public. Or maybe the validate method should be protected.

// have to be able to "repair" header lines (via a call to updateGenericField) during constructor validation.
//
// Otherwise the values here should never change during the lifetime of the header line.
private final Map<String, String> genericFields = new LinkedHashMap();
Member

<>
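
The terse comment refers to the diamond operator: `new LinkedHashMap()` is a raw type, which compiles but produces an unchecked-conversion warning. A short sketch with a hypothetical holder class:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: the diamond lets the compiler infer <String, String>.
final class GenericFieldsSketch {
    private final Map<String, String> genericFields = new LinkedHashMap<>();

    void put(final String key, final String value) {
        genericFields.put(key, value);
    }

    // LinkedHashMap iterates in insertion order, so the keys come back as they were added.
    String keysInOrder() {
        return String.join(",", genericFields.keySet());
    }
}
```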

// without string literal escaping and quoting the regex would be: replaceAll( ([^\])" , $1\" )
// ie replace: something that's not a backslash ([^\]) followed by a double quote
// with: the thing that wasn't a backslash ($1), followed by a backslash, followed by a double quote
return value.replaceAll("([^\\\\])\"", "$1\\\\\"");
Member

It might be possible to write this more legibly with text blocks nowadays?
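
One caveat on the text block idea: text blocks remove the need to escape double quotes, but backslashes inside them are still escape characters, so the doubled backslashes would remain. An alternative that may help legibility is to name the two pieces and document each in its unescaped form. The sketch below keeps the same expression as the quoted code; the class and constant names are hypothetical.

```java
// Hypothetical sketch: same regex and replacement as the quoted line, split into named constants.
final class QuoteEscapeSketch {
    // Regex as the regex engine sees it:       ([^\\])"
    // i.e. one character that is not a backslash (captured), followed by a double quote.
    private static final String UNESCAPED_QUOTE = "([^\\\\])\"";

    // Replacement as the regex engine sees it: $1\\"
    // i.e. the captured character, then a literal backslash, then the double quote.
    private static final String ESCAPED_QUOTE = "$1\\\\\"";

    static String escapeQuotes(final String value) {
        return value.replaceAll(UNESCAPED_QUOTE, ESCAPED_QUOTE);
    }

    private QuoteEscapeSketch() {
    }
}
```

Because the pattern needs a preceding non-backslash character, a quote that is already escaped is left alone, and so is a quote at the very start of the string.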


@Override
public Optional<VCFValidationFailure<VCFHeaderLine>> validateForVersion(final VCFHeaderVersion vcfTargetVersion) {
if (!vcfTargetVersion.isAtLeastAsRecentAs(VCFHeaderVersion.VCF4_0)) {
Member

You use the pattern of either warning or returning an Optional here. It seems like it's done a number of times.
It seems a bit weird, an extra layer of optionalness. Should the decision to log a warning or not be left up to the caller? If some options are only errors during strict validation, maybe that should be recorded in the ValidationFailure object and then the caller can decide what to do with all of them. If you don't like that idea, maybe pull out a method that either returns the Optional or warns and swallows it.

I guess the reason behind it is that the super call might return a higher priority error, but that could be reworked so it's called first, or the two are compared, or even both are returned.

Not super important, but it seems awkward.
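
The "record it in the failure object and return all of them" variant might look roughly like this. It is a sketch only: the class, the `strictOnly` flag and both example rules are invented for illustration and do not reflect actual VCF validation rules or the PR's VCFValidationFailure class.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: each failure records whether it is fatal only under strict validation,
// and the validator returns every failure instead of logging some and returning one.
final class SeverityFailureSketch {
    static final class Failure {
        final String message;
        final boolean strictOnly;   // true: an error only when strict validation is on

        Failure(final String message, final boolean strictOnly) {
            this.message = message;
            this.strictOnly = strictOnly;
        }
    }

    // Base-class style and subclass style checks both contribute; nothing is swallowed here.
    // Both rules below are made up for the example.
    static List<Failure> validate(final boolean hasId, final boolean hasDescription) {
        final List<Failure> failures = new ArrayList<>();
        if (!hasId) {
            failures.add(new Failure("missing ID", false));
        }
        if (!hasDescription) {
            failures.add(new Failure("missing Description", true));
        }
        return failures;
    }

    // Caller-side policy: under lenient validation, strict-only failures are not fatal.
    static long countFatal(final List<Failure> failures, final boolean strict) {
        return failures.stream().filter(f -> strict || !f.strictOnly).count();
    }
}
```

The caller can then warn about the non-fatal entries and throw on the fatal ones, which removes the need for the validator to choose between logging and returning.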

5 participants