Add Name Tokenization Codec (Update CRAM Codecs to CRAM 3.1) #1663

Open
wants to merge 83 commits into
base: master
83 commits
75171ee
adding comments to Frequencies.java
Jan 14, 2022
7821469
separate encode and decode classes
Mar 1, 2022
defc174
Add Frequency methods to encode and decode classes
Mar 1, 2022
f3734ca
clean up rans tests and add separate packages for rans 4x8 and nx16
Mar 7, 2022
8582ab8
filter out extra column from q40+dir file
Mar 8, 2022
0769ecc
rans nx16 order 1 freq tables + refactor
Mar 18, 2022
faf7c10
clean up
Mar 18, 2022
720357b
Update RAN test method names.
cmnbroad Apr 20, 2022
03773c6
Remove unncessary params arg from uncompress methods (params are embe…
cmnbroad Apr 20, 2022
4a41948
Remove unnecessary RANSNx16Params state.
cmnbroad Apr 20, 2022
ba088c6
Fix bug in the case where the cat bit is set.
cmnbroad Apr 20, 2022
ed68e3b
Reduce unncessary buffer allocation.
cmnbroad Apr 20, 2022
0ce9080
Thread RANSNx16 params through RANSNx16 implementation.
cmnbroad Apr 20, 2022
671d21f
Dont initialize RANSNx16 decoding structures unless we're going to us…
cmnbroad Apr 20, 2022
9cd168a
Move/inline RANS Nx16 D0N uncompress method into RANSNx16Decode.
cmnbroad Apr 22, 2022
2688906
Move/inline RANS Nx16 D1N uncompress method into RANSNx16Decode.
cmnbroad Apr 25, 2022
e01b08e
Move/inline RANS Nx16 E0N compress method into RANSNx16Encode.
cmnbroad Apr 25, 2022
3c7ebb8
Move/inline RANS Nx16 E1N compress method into RANSNx16Encode.
cmnbroad Apr 25, 2022
56f2b86
Suppress spotbugs warnings.
cmnbroad Apr 25, 2022
55e290d
Don't initialize RANS4x8 decoding structure unless we're going to use…
Apr 27, 2022
b89a222
Move/inline RANS 4x8 E04 compress method into RANS4x8Encode.
Apr 28, 2022
e23a7e3
Move/inline RANS 4x8 E14 compress method into RANS4x8Encode.
Apr 28, 2022
0b3fd27
Move/inline RANS 4x8 D04 uncompress method into RANS4x8Decode.
Apr 28, 2022
e53f109
Move/inline RANS 4x8 D14 uncompress method into RANS4x8Decode.
Apr 28, 2022
d0279aa
Fix normalized Frequency (4096), add normalize Frequency using bit sh…
May 17, 2022
c2cac35
Add ransNx16 for format flags = 1,4,5 (N=32) and replace division wit…
Jun 3, 2022
5034915
When CAT is true, add limit and rewind the outBuffer before returning…
Jun 6, 2022
c966eec
Add RANSTest with formatflags = 32, 33, 36, 37
Jun 6, 2022
890940e
Remove initialization of alphabet array.
Jun 6, 2022
c3dd46d
Add RLE Encode and Decode. Works as expected for RANSNx16 Order 0
Jul 25, 2022
5020477
Move declaration of variables used within the for loop to inside the …
Jul 25, 2022
c22cd8b
Convert symbols from int to byte
Jul 25, 2022
b457c74
rename getInterleaveSize to getNumInterleavedRANSStates in RANSNx16Pa…
Jul 25, 2022
03e9297
RLE encode and decode works as expected for RANSNx16 Order 1
Jul 26, 2022
1fb6800
add encode and decode Pack. Add test cases for pack
Aug 11, 2022
d79a4cf
rename variable for better readability
Aug 16, 2022
68994c0
add exception when num of distinct symbols = 0 or > 16
Aug 19, 2022
8d93534
Add Decode Stripe to RANS Nx16. Add getFormatFlags() to RANSParams
Aug 23, 2022
d523eac
Add test for Encoding when Stripe Flag is set
Aug 26, 2022
14df7b2
Fix Spot Bugs warn - Use && for logical and
Aug 26, 2022
7f7e613
Addressing the feedback from Aug 30, 2022
Sep 6, 2022
8b38306
Use the Interop Test files from samtools-1.14/htslib-1.14/htscodecs/t…
Sep 8, 2022
cec6e3e
Replace hex literals with bit flag masks in RANSInteropTest Data Prov…
Sep 8, 2022
577d5d1
Addressing the feedback so far
Sep 9, 2022
2db0878
debug CI test failure
Sep 9, 2022
c8fb550
Fix the htscodecs path
Sep 13, 2022
d51f9ce
rename methods that return boolean to start with 'is' instead of 'get'
Oct 1, 2022
789f50c
debug
Oct 17, 2022
6b0c4f1
Addressing the feedback from 10/25/22
Oct 31, 2022
e72147a
undo inadvertent deletion of RANSInterop roundtrip test logic
Nov 22, 2022
850280d
debug - add decodePack and decodeRLE on top of CAT flag
Mar 21, 2023
3f84b2a
rewind outBuffer before it is returned
Mar 21, 2023
43145d4
remove duplicate outBuffer creation
Mar 21, 2023
f7e6c57
Addressing the feedback from oct 11, 2023 except implementing the Str…
Oct 19, 2023
f9041e8
Move common methods to CRAMInteropTestUtils class
Oct 26, 2023
7126507
Addressing the feedback from Nov 7 and Nov 20 - part 1
Dec 1, 2023
d2802b1
Addressing the feedback from Nov 7 and Nov 20 - part 2
Dec 5, 2023
e6b06a5
Addressing the feedback from Nov 7 and Nov 20 - part 3
Dec 6, 2023
1a89cb4
Addressing the feedback from Nov 7 and Nov 20 - part 4
Dec 13, 2023
f4fd67c
Addressing the feedback from Nov 7 and Nov 20 - part 5
Dec 18, 2023
b095b1c
Addressing the feedback from Nov 7 and Nov 20 - part 6
Dec 20, 2023
b2187c3
Addressing the feedback from Nov 7 and Nov 20 - part 6
Dec 21, 2023
52549f5
Move common code to CompressionUtils
Jan 10, 2024
2db77e9
add Range Encode
Oct 17, 2022
43a68c9
Fix RangeEncode for order 0 and formatflags=0x00
Oct 28, 2022
49eac0e
rebase - Add Range Codec, RangeTest, RangeInteropTest for order, rle,…
Dec 12, 2022
7b454d6
Add uncompressEXT and decodePack to RangeDecode
Dec 14, 2022
81bcac7
add Pack flag to tests
Dec 16, 2022
5de4036
Add Range encode and decode for EXT flag
Jan 12, 2023
fc6227d
debug spotbugs error
Jan 12, 2023
249db30
debug - add decodePack on top of CAT flag
Mar 21, 2023
e2d5a37
Addressing format related feedback from RANS PR that applies to Range…
Oct 25, 2023
58ace69
Rebase on RANS branch and use common methods from CRAMInteropTestUtil…
Oct 26, 2023
f9b066c
Addressing feedback from nov 21 - part 1
Jan 4, 2024
4a416f7
Addressing feedback from nov 21 - part 2
Jan 24, 2024
55f6086
Addressing feedback from nov 21, nov 28 - part 3
Jan 29, 2024
6c0387c
Add NameTokenization Decoder
Mar 2, 2023
902ca33
Add NameTokenization Encoder
Jun 12, 2023
d3ae09d
add descriptive variable names
Jun 12, 2023
1c0cb8d
Add unittests
Jul 26, 2023
eff17cf
Use List<Token> instead of TokenStreams in NameTokenisationEncoder
Jul 26, 2023
67ad384
Addressing feedback from dec 5 - part 1
Jan 30, 2024
3233e39
Addressing feedback from dec 5 - part 2
Jan 31, 2024
3 changes: 2 additions & 1 deletion scripts/install-samtools.sh
@@ -1,5 +1,6 @@
#!/bin/sh
set -ex
wget https://github.com/samtools/samtools/releases/download/1.14/samtools-1.14.tar.bz2
+# Note that the CRAM Interop Tests are dependent on the test files in samtools-1.14/htslib-1.14/htscodecs/tests/dat
tar -xjvf samtools-1.14.tar.bz2
-cd samtools-1.14 && ./configure --prefix=/usr && make && sudo make install
+cd samtools-1.14 && ./configure --prefix=/usr && make && sudo make install
177 changes: 177 additions & 0 deletions src/main/java/htsjdk/samtools/cram/compression/CompressionUtils.java
@@ -0,0 +1,177 @@
package htsjdk.samtools.cram.compression;

import htsjdk.samtools.cram.CRAMException;
import htsjdk.samtools.cram.compression.rans.Constants;

import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class CompressionUtils {
public static void writeUint7(final int i, final ByteBuffer cp) {
int s = 0;
int X = i;
do {
s += 7;
X >>= 7;
} while (X > 0);
do {
s -= 7;
//writeByte
final int s_ = (s > 0) ? 1 : 0;
cp.put((byte) (((i >> s) & 0x7f) + (s_ << 7)));
} while (s > 0);
}

public static int readUint7(final ByteBuffer cp) {
int i = 0;
int c;
do {
//read byte
c = cp.get();
i = (i << 7) | (c & 0x7f);
} while ((c & 0x80) != 0);
return i;
}
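
Review note: the two methods above write and read a variable-length integer, seven payload bits per byte, most significant group first, with the top bit set on every byte except the last. A minimal standalone sketch of the same logic follows. The class name `Uint7Sketch` and the `encode` helper are hypothetical and not part of htsjdk, and the sketch assumes non-negative inputs.

```java
import java.nio.ByteBuffer;

// Hypothetical standalone class; mirrors the writeUint7/readUint7 logic shown in the diff.
final class Uint7Sketch {

    // Writes value as 7-bit groups, most significant group first.
    static void writeUint7(final int value, final ByteBuffer out) {
        int shift = 0;
        int remaining = value;
        // Count how many 7-bit groups are needed (at least one).
        do {
            shift += 7;
            remaining >>= 7;
        } while (remaining > 0);
        // Emit the groups; every byte except the last carries the 0x80 continuation bit.
        do {
            shift -= 7;
            final int continuation = (shift > 0) ? 0x80 : 0;
            out.put((byte) (((value >> shift) & 0x7f) | continuation));
        } while (shift > 0);
    }

    // Reads groups until a byte without the continuation bit is seen.
    static int readUint7(final ByteBuffer in) {
        int value = 0;
        int b;
        do {
            b = in.get();
            value = (value << 7) | (b & 0x7f);
        } while ((b & 0x80) != 0);
        return value;
    }

    // Convenience helper (assumption, for illustration only): returns just the encoded bytes.
    static byte[] encode(final int value) {
        final ByteBuffer buf = ByteBuffer.allocate(5); // 5 bytes cover any non-negative int
        writeUint7(value, buf);
        buf.flip();
        final byte[] bytes = new byte[buf.remaining()];
        buf.get(bytes);
        return bytes;
    }

    public static void main(final String[] args) {
        for (final int v : new int[] {0, 127, 128, 300, 16384}) {
            final byte[] enc = encode(v);
            final int dec = readUint7(ByteBuffer.wrap(enc));
            System.out.println(v + " -> " + enc.length + " byte(s) -> " + dec);
        }
    }
}
```

For example, 300 is written as the two bytes 0x82 0x2C, and any value up to 127 fits in a single byte.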

public static ByteBuffer encodePack(
final ByteBuffer inBuffer,
final ByteBuffer outBuffer,
final int[] frequencyTable,
final int[] packMappingTable,
final int numSymbols){
final int inSize = inBuffer.remaining();
final ByteBuffer encodedBuffer;
if (numSymbols <= 1) {
encodedBuffer = CompressionUtils.allocateByteBuffer(0);
} else if (numSymbols <= 2) {

// 1 bit per value
final int encodedBufferSize = (int) Math.ceil((double) inSize/8);
encodedBuffer = CompressionUtils.allocateByteBuffer(encodedBufferSize);
int j = -1;
for (int i = 0; i < inSize; i ++) {
if (i % 8 == 0) {
encodedBuffer.put(++j, (byte) 0);
}
encodedBuffer.put(j, (byte) (encodedBuffer.get(j) + (packMappingTable[inBuffer.get(i) & 0xFF] << (i % 8))));
}
} else if (numSymbols <= 4) {

// 2 bits per value
final int encodedBufferSize = (int) Math.ceil((double) inSize/4);
encodedBuffer = CompressionUtils.allocateByteBuffer(encodedBufferSize);
int j = -1;
for (int i = 0; i < inSize; i ++) {
if (i % 4 == 0) {
encodedBuffer.put(++j, (byte) 0);
}
encodedBuffer.put(j, (byte) (encodedBuffer.get(j) + (packMappingTable[inBuffer.get(i) & 0xFF] << ((i % 4) * 2))));
}
} else {

// 4 bits per value
final int encodedBufferSize = (int) Math.ceil((double)inSize/2);
encodedBuffer = CompressionUtils.allocateByteBuffer(encodedBufferSize);
int j = -1;
for (int i = 0; i < inSize; i ++) {
if (i % 2 == 0) {
encodedBuffer.put(++j, (byte) 0);
}
encodedBuffer.put(j, (byte) (encodedBuffer.get(j) + (packMappingTable[inBuffer.get(i) & 0xFF] << ((i % 2) * 4))));
}
}

// write numSymbols
outBuffer.put((byte) numSymbols);

// write mapping table "packMappingTable" that converts mapped value to original symbol
for(int i = 0; i < Constants.NUMBER_OF_SYMBOLS; i ++) {
if (frequencyTable[i] > 0) {
outBuffer.put((byte) i);
}
}

// write the length of data
CompressionUtils.writeUint7(encodedBuffer.limit(), outBuffer);
return encodedBuffer; // Here position = 0 since we have always accessed the data buffer using index
}

public static ByteBuffer decodePack(
final ByteBuffer inBuffer,
final byte[] packMappingTable,
final int numSymbols,
final int uncompressedPackOutputLength) {
final ByteBuffer outBufferPack = CompressionUtils.allocateByteBuffer(uncompressedPackOutputLength);
int j = 0;
if (numSymbols <= 1) {
for (int i=0; i < uncompressedPackOutputLength; i++){
outBufferPack.put(i, packMappingTable[0]);
}
}

// 1 bit per value
else if (numSymbols <= 2) {
int v = 0;
for (int i=0; i < uncompressedPackOutputLength; i++){
if (i % 8 == 0){
v = inBuffer.get(j++);
}
outBufferPack.put(i, packMappingTable[v & 1]);
v >>=1;
}
}

// 2 bits per value
else if (numSymbols <= 4){
int v = 0;
for(int i=0; i < uncompressedPackOutputLength; i++){
if (i % 4 == 0){
v = inBuffer.get(j++);
}
outBufferPack.put(i, packMappingTable[v & 3]);
v >>=2;
}
}

// 4 bits per value
else if (numSymbols <= 16){
int v = 0;
for(int i=0; i < uncompressedPackOutputLength; i++){
if (i % 2 == 0){
v = inBuffer.get(j++);
}
outBufferPack.put(i, packMappingTable[v & 15]);
v >>=4;
}
}
return outBufferPack;
}
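
Review note: the 2-bits-per-value branch (`numSymbols <= 4`) of `encodePack` and `decodePack` above can be exercised in isolation. The sketch below re-implements only that branch on plain byte arrays. `PackSketch`, `distinctSymbols`, `pack2`, and `unpack2` are hypothetical names, and the header that `encodePack` writes (symbol count, mapping table, uint7 length) is deliberately left out.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical standalone class; illustrates the 2-bit packing branch only.
final class PackSketch {

    // Returns the distinct byte values in ascending (unsigned) order; index = packed code.
    static byte[] distinctSymbols(final byte[] input) {
        final boolean[] seen = new boolean[256];
        int count = 0;
        for (final byte b : input) {
            if (!seen[b & 0xFF]) {
                seen[b & 0xFF] = true;
                count++;
            }
        }
        final byte[] symbols = new byte[count];
        int k = 0;
        for (int s = 0; s < 256; s++) {
            if (seen[s]) {
                symbols[k++] = (byte) s;
            }
        }
        return symbols;
    }

    // Packs four 2-bit codes per byte, lowest bits first.
    static byte[] pack2(final byte[] input, final byte[] symbols) {
        if (symbols.length > 4) {
            throw new IllegalArgumentException("2-bit packing needs at most 4 distinct symbols");
        }
        final int[] symbolToCode = new int[256];
        for (int code = 0; code < symbols.length; code++) {
            symbolToCode[symbols[code] & 0xFF] = code;
        }
        final byte[] packed = new byte[(input.length + 3) / 4];
        for (int i = 0; i < input.length; i++) {
            packed[i / 4] |= (byte) (symbolToCode[input[i] & 0xFF] << ((i % 4) * 2));
        }
        return packed;
    }

    // Reverses pack2: reads a new byte every four values and peels off 2 bits at a time.
    static byte[] unpack2(final byte[] packed, final byte[] symbols, final int outLength) {
        final byte[] out = new byte[outLength];
        int v = 0;
        int j = 0;
        for (int i = 0; i < outLength; i++) {
            if (i % 4 == 0) {
                v = packed[j++];
            }
            out[i] = symbols[v & 3];
            v >>= 2;
        }
        return out;
    }

    public static void main(final String[] args) {
        final byte[] input = "ACGTAC".getBytes(StandardCharsets.US_ASCII);
        final byte[] symbols = distinctSymbols(input);
        final byte[] packed = pack2(input, symbols);
        final byte[] restored = unpack2(packed, symbols, input.length);
        System.out.println(input.length + " bytes packed into " + packed.length);
        System.out.println("round trip ok: " + Arrays.equals(input, restored));
    }
}
```

With the input `ACGTAC`, the codes are A=0, C=1, G=2, T=3, so the six input bytes pack into the two bytes 0xE4 0x04.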

public static ByteBuffer allocateOutputBuffer(final int inSize) {
// This calculation is identical to the one in samtools rANS_static.c
// Presumably the frequency table (always big enough for order 1) = 257*257,
// then * 3 for each entry (byte->symbol, 2 bytes -> scaled frequency),
// + 9 for the header (order byte, and 2 int lengths for compressed/uncompressed lengths).
final int compressedSize = (int) (inSize + 257 * 257 * 3 + 9);
final ByteBuffer outputBuffer = allocateByteBuffer(compressedSize);
if (outputBuffer.remaining() < compressedSize) {
throw new CRAMException("Failed to allocate sufficient buffer size for RANS coder.");
}
return outputBuffer;
}
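
Review note: the size computed above is the input length plus a fixed 257 * 257 * 3 + 9 = 198,156 bytes. `ByteBuffer.allocate` either returns a buffer whose `remaining()` equals the requested capacity or throws (for example `IllegalArgumentException` for a negative capacity), so as far as I can tell the `remaining()` check cannot fail; the case that can go wrong is `int` overflow for inputs close to `Integer.MAX_VALUE`. The following is a sketch of an overflow-checked variant, offered as a suggestion rather than a description of what the PR does. All names in it are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical standalone class; not part of htsjdk.
final class OutputBufferSizeSketch {

    // Frequency table worst case (257 * 257 entries, 3 bytes each) plus a 9-byte header.
    static final int WORST_CASE_OVERHEAD = 257 * 257 * 3 + 9;

    // Computes the worst-case output size, failing loudly instead of wrapping around.
    static int worstCaseSize(final int inSize) {
        if (inSize < 0) {
            throw new IllegalArgumentException("inSize must be non-negative: " + inSize);
        }
        return Math.addExact(inSize, WORST_CASE_OVERHEAD); // throws ArithmeticException on overflow
    }

    static ByteBuffer allocateOutputBuffer(final int inSize) {
        return ByteBuffer.allocate(worstCaseSize(inSize)).order(ByteOrder.LITTLE_ENDIAN);
    }

    public static void main(final String[] args) {
        System.out.println("overhead = " + WORST_CASE_OVERHEAD);
        System.out.println("buffer for 1000 bytes = " + allocateOutputBuffer(1000).remaining());
        try {
            worstCaseSize(Integer.MAX_VALUE);
        } catch (final ArithmeticException e) {
            System.out.println("overflow detected");
        }
    }
}
```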

// returns a new LITTLE_ENDIAN ByteBuffer of size = bufferSize
public static ByteBuffer allocateByteBuffer(final int bufferSize){
return ByteBuffer.allocate(bufferSize).order(ByteOrder.LITTLE_ENDIAN);
}

// returns a LITTLE_ENDIAN ByteBuffer that is created by wrapping a byte[]
public static ByteBuffer wrap(final byte[] inputBytes){
return ByteBuffer.wrap(inputBytes).order(ByteOrder.LITTLE_ENDIAN);
}

// returns a LITTLE_ENDIAN ByteBuffer that is created by inputBuffer.slice()
public static ByteBuffer slice(final ByteBuffer inputBuffer){
return inputBuffer.slice().order(ByteOrder.LITTLE_ENDIAN);
}
}
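
Review note: the three helpers at the end of `CompressionUtils` exist because a `ByteBuffer` defaults to big-endian, and `slice()` returns a big-endian view regardless of the order of the buffer it was sliced from. A small standalone sketch of that behaviour, using only `java.nio` (the class and method names are hypothetical):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical standalone class; shows why the order has to be set explicitly each time.
final class ByteOrderSketch {

    // Same idea as CompressionUtils.wrap in the diff.
    static ByteBuffer wrapLittleEndian(final byte[] bytes) {
        return ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN);
    }

    public static void main(final String[] args) {
        final byte[] bytes = {1, 0, 0, 0};

        // Default order is big-endian, so the leading 0x01 is the most significant byte.
        System.out.println(ByteBuffer.wrap(bytes).getInt(0));

        // Little-endian interpretation of the same four bytes.
        final ByteBuffer le = wrapLittleEndian(bytes);
        System.out.println(le.getInt(0));

        // A slice does not inherit the little-endian order of its parent.
        System.out.println(le.slice().order());

        // Re-applying the order, as CompressionUtils.slice does, restores the expected value.
        System.out.println(le.slice().order(ByteOrder.LITTLE_ENDIAN).getInt(0));
    }
}
```

Forgetting to re-apply the order after `slice()` would silently change how multi-byte reads such as `getInt` interpret the data, which is presumably why the PR routes buffer creation through these helpers.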
@@ -1,6 +1,9 @@
package htsjdk.samtools.cram.compression;

-import htsjdk.samtools.cram.compression.rans.RANS;
+import htsjdk.samtools.cram.compression.range.RangeDecode;
+import htsjdk.samtools.cram.compression.range.RangeEncode;
+import htsjdk.samtools.cram.compression.rans.rans4x8.RANS4x8Decode;
+import htsjdk.samtools.cram.compression.rans.rans4x8.RANS4x8Encode;
import htsjdk.samtools.cram.structure.block.BlockCompressionMethod;
import htsjdk.utils.ValidationUtils;

@@ -71,8 +74,13 @@ public static ExternalCompressor getCompressorForMethod(

case RANS:
return compressorSpecificArg == NO_COMPRESSION_ARG ?
-new RANSExternalCompressor(new RANS()) :
-new RANSExternalCompressor(compressorSpecificArg, new RANS());
+new RANSExternalCompressor(new RANS4x8Encode(), new RANS4x8Decode()) :
+new RANSExternalCompressor(compressorSpecificArg, new RANS4x8Encode(), new RANS4x8Decode());

+case RANGE:
+return compressorSpecificArg == NO_COMPRESSION_ARG ?
+new RangeExternalCompressor(new RangeEncode(), new RangeDecode()) :
+new RangeExternalCompressor(compressorSpecificArg, new RangeEncode(), new RangeDecode());
+
case BZIP2:
ValidationUtils.validateArg(
@@ -85,5 +93,4 @@ public static ExternalCompressor getCompressorForMethod(
}
}

-}
-
+}
@@ -24,48 +24,60 @@
*/
package htsjdk.samtools.cram.compression;

-import htsjdk.samtools.cram.compression.rans.RANS;
+import htsjdk.samtools.cram.compression.rans.RANSParams;
+import htsjdk.samtools.cram.compression.rans.rans4x8.RANS4x8Decode;
+import htsjdk.samtools.cram.compression.rans.rans4x8.RANS4x8Encode;
+import htsjdk.samtools.cram.compression.rans.rans4x8.RANS4x8Params;
import htsjdk.samtools.cram.structure.block.BlockCompressionMethod;

import java.nio.ByteBuffer;
import java.util.Objects;

public final class RANSExternalCompressor extends ExternalCompressor {
-private final RANS.ORDER order;
-private final RANS rans;
+private final RANSParams.ORDER order;
+private final RANS4x8Encode ransEncode;
+private final RANS4x8Decode ransDecode;

/**
* We use a shared RANS instance for all compressors.
* @param rans
*/
-public RANSExternalCompressor(final RANS rans) {
-this(RANS.ORDER.ZERO, rans);
+public RANSExternalCompressor(
+final RANS4x8Encode ransEncode,
+final RANS4x8Decode ransDecode) {
+this(RANSParams.ORDER.ZERO, ransEncode, ransDecode);
}

-public RANSExternalCompressor(final int order, final RANS rans) {
-this(RANS.ORDER.fromInt(order), rans);
+public RANSExternalCompressor(
+final int order,
+final RANS4x8Encode ransEncode,
+final RANS4x8Decode ransDecode) {
+this(RANSParams.ORDER.fromInt(order), ransEncode, ransDecode);
}

-public RANSExternalCompressor(final RANS.ORDER order, final RANS rans) {
+public RANSExternalCompressor(
+final RANSParams.ORDER order,
+final RANS4x8Encode ransEncode,
+final RANS4x8Decode ransDecode) {
super(BlockCompressionMethod.RANS);
-this.rans = rans;
+this.ransEncode = ransEncode;
+this.ransDecode = ransDecode;
this.order = order;
}

@Override
public byte[] compress(final byte[] data) {
-final ByteBuffer buffer = rans.compress(ByteBuffer.wrap(data), order);
+final RANS4x8Params params = new RANS4x8Params(order);
+final ByteBuffer buffer = ransEncode.compress(CompressionUtils.wrap(data), params);
return toByteArray(buffer);
}

@Override
public byte[] uncompress(byte[] data) {
-final ByteBuffer buf = rans.uncompress(ByteBuffer.wrap(data));
+final ByteBuffer buf = ransDecode.uncompress(CompressionUtils.wrap(data));
return toByteArray(buf);
}

-public RANS.ORDER getOrder() { return order; }
-
@Override
public String toString() {
return String.format("%s(%s)", this.getMethod(), order);
@@ -96,4 +108,4 @@ private byte[] toByteArray(final ByteBuffer buffer) {
return bytes;
}

-}
+}
@@ -0,0 +1,61 @@
package htsjdk.samtools.cram.compression;

import htsjdk.samtools.cram.compression.range.RangeDecode;
import htsjdk.samtools.cram.compression.range.RangeEncode;
import htsjdk.samtools.cram.compression.range.RangeParams;
import htsjdk.samtools.cram.structure.block.BlockCompressionMethod;

import java.nio.ByteBuffer;

public class RangeExternalCompressor extends ExternalCompressor{

private final int formatFlags;
private final RangeEncode rangeEncode;
private final RangeDecode rangeDecode;

public RangeExternalCompressor(
final RangeEncode rangeEncode,
final RangeDecode rangeDecode) {
this(0, rangeEncode, rangeDecode);
}

public RangeExternalCompressor(
final int formatFlags,
final RangeEncode rangeEncode,
final RangeDecode rangeDecode) {
super(BlockCompressionMethod.RANGE);
this.rangeEncode = rangeEncode;
this.rangeDecode = rangeDecode;
this.formatFlags = formatFlags;
}

@Override
public byte[] compress(byte[] data) {
final RangeParams params = new RangeParams(formatFlags);
final ByteBuffer buffer = rangeEncode.compress(CompressionUtils.wrap(data), params);
return toByteArray(buffer);
}

@Override
public byte[] uncompress(byte[] data) {
final ByteBuffer buf = rangeDecode.uncompress(CompressionUtils.wrap(data));
return toByteArray(buf);
}

@Override
public String toString() {
return String.format("%s(%s)", this.getMethod(),formatFlags);
}

private byte[] toByteArray(final ByteBuffer buffer) {
if (buffer.hasArray() && buffer.arrayOffset() == 0 && buffer.array().length == buffer.limit()) {
return buffer.array();
}

final byte[] bytes = new byte[buffer.remaining()];
buffer.get(bytes);
return bytes;
}


}
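
Review note: `toByteArray` above has two paths. When the buffer is backed by an array that it fills exactly, the backing array is returned without copying; otherwise the remaining bytes are copied out. The fast path does not look at the buffer's position, so it appears to assume the codec returns buffers positioned at 0. A standalone sketch of both paths, with the method body copied from the diff into a hypothetical class:

```java
import java.nio.ByteBuffer;

// Hypothetical standalone class; the method body matches the diff.
final class ToByteArraySketch {

    static byte[] toByteArray(final ByteBuffer buffer) {
        // Fast path: the backing array is exactly the buffer contents, so no copy is needed.
        if (buffer.hasArray() && buffer.arrayOffset() == 0 && buffer.array().length == buffer.limit()) {
            return buffer.array();
        }
        // Slow path: copy only the bytes between position and limit.
        final byte[] bytes = new byte[buffer.remaining()];
        buffer.get(bytes);
        return bytes;
    }

    public static void main(final String[] args) {
        // Exact fit: the very same array instance comes back.
        final byte[] exact = {1, 2, 3};
        System.out.println(toByteArray(ByteBuffer.wrap(exact)) == exact);

        // Oversized buffer: only the two written bytes are copied out.
        final ByteBuffer oversized = ByteBuffer.allocate(8);
        oversized.put((byte) 7).put((byte) 9);
        oversized.flip();
        System.out.println(toByteArray(oversized).length);
    }
}
```

Because the fast path hands back the backing array itself, callers that later modify the returned array would also modify the buffer; that seems fine here since the buffer is a local in `compress` and `uncompress`.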