Skip to content

Commit

Permalink
Spec change: never quote-encode cells whose genotypes aren't referenc…
Browse files Browse the repository at this point in the history
…e-identical or non-called

This ensures all non-reference genotypes on a row can be read without reference to any previous row, a significant convenience in exchange for a negligible size increase.
  • Loading branch information
mlin committed Oct 2, 2018
1 parent 4b9e325 commit 140334c
Show file tree
Hide file tree
Showing 3 changed files with 44 additions and 12 deletions.
8 changes: 5 additions & 3 deletions doc/SPEC.md
Original file line number Diff line number Diff line change
Expand Up @@ -10,15 +10,17 @@ Project VCF (pVCF; aka multi-sample VCF) is the prevailing file format for small

spVCF adopts from pVCF the tab-delimited text format with header, and the first nine columns providing all variant-level details. The sparse encoding concerns the genotype matrix `V[i,j]`, *i* indexing variant sites and *j* indexing the *N* samples, written across tab-delimited columns ten through 9+*N* of the pVCF text file. Each cell `V[i,j]` is a colon-delimited text string including the genotype and various QC measures (DP, AD, PL, etc.).

In the spVCF encoding, cells are first replaced with a double-quotation mark `"` if they're identical to the cell *above*:
In the spVCF encoding, cells are first replaced with a double-quotation mark `"` if they're (i) identical to the cell *above*, and (ii) their GT field is reference-identical or non-called:

```
S[i,j] := " if i>0 and V[i,j] == V[i-1,j],
S[i,j] := " if i>0 and V[i,j] == V[i-1,j] and V[i,j]["GT"] in ["0/0","0|0","./.",".|."],
V[i,j] otherwise.
```

Here 'identical' covers all QC measures exactly. Such exact repetition is common in pVCF produced using tools like [GATK GenotypeGVCFs](https://software.broadinstitute.org/gatk/documentation/tooldocs/3.8-0/org_broadinstitute_gatk_tools_walkers_variantutils_GenotypeGVCFs.php) and [GLnexus](https://github.com/dnanexus-rnd/GLnexus), which merge gVCF or similar files summarizing reference coverage in lengthy bands.

For clarity, the list of "quotable" GTs enumerated above shows diploid genotypes only. In general, quotable GTs are those whose constituent allele calls are either all reference (0), or all non-called (.).

Second, within each row of `S`, consecutive runs of quotation marks are abbreviated with a text integer, so for example a horizontal run of 42 quotes is written `"42` and tab-delimited from adjacent cells. The result is a ragged, tab-delimited matrix.

**Worked example**
Expand Down Expand Up @@ -59,7 +61,7 @@ With checkpoints, it's possible to reuse the familiar `bgzip` and `tabix` utilit

Lastly, spVCF suggests the following convention to remove typically-unneeded detail from the matrix, and increase the compressibility of what remains, prior to the sparse encoding. In any cell with QC measures indicating zero non-reference reads (typically `AD=d,0` for some *d*, but this depends on how the pVCF-generating pipeline expresses non-reference read depth), report only `GT` and `DP` and omit any other fields. Also, round `DP` down to a power of two (0, 1, 2, 4, 8, 16, ...).

This "squeezing" requires the encoder to reorder the colon-delimited fields in each cell so that `GT` and `DP` precede any other fields. Then it's valid for a subset of cells to omit remaining fields completely, as permitted by VCF. The FORMAT specification in column 9 of each line must reflect this reordering.
This "squeezing" requires the encoder to reorder the colon-delimited fields in each cell so that `GT` and `DP` precede any other fields. Then it's valid for a subset of cells to omit remaining fields completely, as permitted by VCF. The FORMAT specification in column 9 of each line must reflect this reordering. Notice that not all reference-identical genotype calls are necessarily squeezed, namely if the QC data indicate even one non-reference read.

The optional squeezing transformation can be applied to any pVCF, usually to great benefit, whether or not the spVCF sparse encoding is also used.

Expand Down
40 changes: 35 additions & 5 deletions src/spVCF.cc
Original file line number Diff line number Diff line change
Expand Up @@ -158,6 +158,7 @@ class EncoderImpl : public TranscoderBase {
const char* ProcessLine(char* input_line) override;

private:
bool unquotableGT(const char* entry);
void Squeeze(const vector<char*>& line);

uint64_t checkpoint_period_ = 0;
Expand All @@ -172,6 +173,7 @@ class EncoderImpl : public TranscoderBase {
vector<string> roundDP_table_;
};

#include <iostream>
const char* EncoderImpl::ProcessLine(char* input_line) {
++line_number_;
// Pass through header lines
Expand All @@ -187,8 +189,11 @@ const char* EncoderImpl::ProcessLine(char* input_line) {
if (tokens.size() < 10) {
fail("Invalid: fewer than 10 columns");
}
uint64_t N = tokens.size() - 9;
if (strncmp(tokens[8], "GT:", 3) && strcmp(tokens[8], "GT")) {
fail("cells don't start with genotype (GT)");
}

uint64_t N = tokens.size() - 9;
if (dense_entries_.empty()){ // First line: allocate the dense entries
dense_entries_.resize(N);
stats_.N = N;
Expand Down Expand Up @@ -248,7 +253,7 @@ const char* EncoderImpl::ProcessLine(char* input_line) {
if (*t == '"') {
fail("Input seems to be sparse-encoded already");
}
if (m.empty() || strcmp(m.c_str(), t) != 0) {
if (m.empty() || strcmp(m.c_str(), t) != 0 || unquotableGT(t)) {
// Entry doesn't match the last one recorded densely for this
// column. Output any accumulated run of quotes in the current row,
// then this new entry, and update the state appropriately.
Expand Down Expand Up @@ -323,6 +328,33 @@ const char* EncoderImpl::ProcessLine(char* input_line) {
return buffer_.Get();
}

// Determine if the entry's GT makes it "unquotable", meaning the called
// allele(s) don't consist of all 0 or all .
// A half-call like ./0 is considered unquotable.
// ASSUMES GT is the first FORMAT field (as required by VCF)
bool EncoderImpl::unquotableGT(const char* entry) {
bool zero = false, dot = false;
if (*entry == 0 || *entry == ':') {
fail("missing GT entry");
}
for (; *entry && *entry != ':'; entry++) {
switch (*entry) {
case '0':
zero = true;
break;
case '.':
dot = true;
break;
case '/':
case '|':
break;
default:
return true;
}
}
return (zero == dot);
}

// Truncate cells to GT:DP, and round DP down to a power of two, if
// - AD is present and indicates zero read depth for alternate alleles;
// - VR is present and zero
Expand All @@ -347,9 +379,7 @@ void EncoderImpl::Squeeze(const vector<char*>& line) {
size_t formatsz = split(line[8], ':', back_inserter(format));

// locate fields of interest
if (format[0] != "GT") {
fail("cells don't start with genotype (GT)");
}
assert(format[0] == "GT");
int iDP = -1;
auto pDP = find(format.begin(), format.end(), "DP");
if (pDP != format.end()) {
Expand Down
8 changes: 4 additions & 4 deletions test/spVCF.t
Original file line number Diff line number Diff line change
Expand Up @@ -15,11 +15,11 @@ plan tests 25
pigz -dc "$HERE/data/small.vcf.gz" > $D/small.vcf
"$EXE" encode -o $D/small.spvcf $D/small.vcf
is "$?" "0" "filename I/O"
is "$(cat $D/small.spvcf | wc -c)" "36976928" "filename I/O output size"
is "$(cat $D/small.spvcf | wc -c)" "37095284" "filename I/O output size"

pigz -dc "$HERE/data/small.vcf.gz" | "$EXE" encode -q > $D/small.spvcf
is "$?" "0" "piped I/O"
is "$(cat $D/small.spvcf | wc -c)" "36976928" "piped I/O output size"
is "$(cat $D/small.spvcf | wc -c)" "37095284" "piped I/O output size"

"$EXE" decode -o $D/small.roundtrip.vcf $D/small.spvcf
is "$?" "0" "decode"
Expand All @@ -35,7 +35,7 @@ is "$(egrep -o "spVCF_checkpointPOS=[0-9]+" $D/small.spvcf | uniq | cut -f2 -d =
"$EXE" encode -S -p 500 -o $D/small.squeezed.spvcf $D/small.vcf
is "$?" "0" "squeeze"
is "$(cat $D/small.squeezed.spvcf | wc -c)" "17447427" "squeezed output size"
is "$(cat $D/small.squeezed.spvcf | wc -c)" "17553478" "squeezed output size"
"$EXE" decode -q -o $D/small.squeezed.roundtrip.vcf $D/small.squeezed.spvcf
is "$?" "0" "squeezed roundtrip decode"
Expand Down Expand Up @@ -79,7 +79,7 @@ is "$(cat $D/small.squeezed.slice_chr21.spvcf | sha256sum)" \
pigz -dc "$HERE/data/small.vcf.gz" | "$EXE" encode -t $(nproc) - > $D/small.mt.spvcf
is "$?" "0" "multithreaded encode"
is "$(cat $D/small.mt.spvcf | wc -c)" "36973335" "multithreaded output size"
is "$(cat $D/small.mt.spvcf | wc -c)" "37091691" "multithreaded output size"
time "$EXE" decode -o $D/small.mt.roundtrip.vcf $D/small.mt.spvcf
is "$?" "0" "decode from multithreaded"
Expand Down

0 comments on commit 140334c

Please sign in to comment.