Home
Hi, if you found this page you are probably looking for some additional information on WISECONDOR.
The algorithm used is described in the paper which can be found here:
WISECONDOR: detection of fetal aberrations from shallow sequencing maternal plasma based on a within-sample comparison scheme
For now, this page just lists some questions I've received about the script, which I'll answer as best I can.
Plotting Z-Scores
Traceback (most recent call last):
File "test.py", line 562, in <module>
plotResults(sample,markedBins,kept,kept2,outputBase,zScoresDict,zSmoothDict,blindsDict)
File "test.py", line 344, in plotResults
ax = plt.figure(2)
File "/usr/lib/pymodules/python2.6/matplotlib/pyplot.py", line 254, in figure
**kwargs)
File "/usr/lib/pymodules/python2.6/matplotlib/backends/backend_tkagg.py", line 90, in new_figure_manager
window = Tk.Tk()
File "/usr/lib/python2.6/lib-tk/Tkinter.py", line 1646, in __init__
self.tk = _tkinter.create(screenName, baseName, className, interactive, wantobjects, useTk, sync, use)
_tkinter.TclError: no display name and no $DISPLAY environment variable
This error appears on machines without a graphical display (no X server running, so $DISPLAY is unset), which is often the case on compute nodes (headless servers); the default Tk backend cannot open a window there. An easy fix is to tell matplotlib to use another backend for its graphical work:
- Open test.py
- Find the list of imports, somewhere about line 25 to 35
- Add these lines to the list, before any import of matplotlib.pyplot:
import matplotlib
matplotlib.use('Agg')
As the plots seem to differ slightly with this backend, I preferred not to make it the default until everything has been tested to work correctly with it. This fix was suggested by W.Y.Leung, thank you!
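For those who would rather switch backends only when no display is available, the fix above can be sketched as follows. This is a minimal sketch, not part of test.py: the helper names `pick_backend` and `save_demo_plot` are hypothetical, and checking `DISPLAY` assumes a Linux/X11 setup.

```python
import os


def pick_backend(environ=None):
    """Return 'Agg' when no X display is advertised, else None (keep the default).

    Assumption: an empty or missing DISPLAY variable means a headless machine.
    """
    if environ is None:
        environ = os.environ
    return None if environ.get("DISPLAY") else "Agg"


def save_demo_plot(path):
    """Draw a tiny plot and save it, choosing the backend before pyplot is imported."""
    import matplotlib
    backend = pick_backend()
    if backend is not None:
        matplotlib.use(backend)  # must run before importing matplotlib.pyplot
    import matplotlib.pyplot as plt

    fig = plt.figure()
    ax = fig.add_subplot(111)
    ax.plot([0, 1, 2], [0.5, -1.2, 3.1])  # made-up values, for illustration only
    fig.savefig(path)
    plt.close(fig)
```

Calling `save_demo_plot("demo.pdf")` on a headless node should then write the file without trying to open a Tk window.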
When specifying the output for test.py, it assumes you provide a path including a filename without an extension. The plots take this path/file combination, add a dot, and then add the plot type and the pdf extension. As a result, providing a path such as ./output/ will result in output files such as ./output/.zscores.pdf which, on Unix systems, end up hidden because their names start with a dot. Simply add a basic file name to the output path, such as ./output/filename, and your files will show up. If you really need to see which hidden files you created, try pressing ctrl-h in a file browser to show hidden files.
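The naming rule described above can be reproduced in a few lines. This is a sketch of the rule as stated here, not a copy of the code in test.py; `plot_path` and `is_hidden` are hypothetical helpers.

```python
import os


def plot_path(output_base, plot_type):
    """Build a plot file name as described: base + '.' + plot type + '.pdf'."""
    return "{0}.{1}.pdf".format(output_base, plot_type)


def is_hidden(path):
    """True when the file name itself starts with a dot (hidden on Unix)."""
    return os.path.basename(path).startswith(".")


for base in ("./output/", "./output/filename"):
    path = plot_path(base, "zscores")
    print("{0} -> {1} (hidden: {2})".format(base, path, is_hidden(path)))
```

The first base ends in a slash, so the file name begins with the added dot; the second keeps a visible name.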
The script was written rather rapidly and this never really seemed a problem to me. If you want to save this text output, the usual Unix approach (directing the output to a file) should suffice:
python test.py ./input.gcc ./output ./reference > ./output.txt
There is read depth normalization, but it's done implicitly by the LOWESS GC-Correction. WISECONDOR previously had a separate step to normalize the data (which was only useful after applying the RETRO-filter, due to the read-towers or spikes in the data), but I decided to remove it once I applied LOWESS using a division:
correctedValue = sample[chrom][bin]/lowessCurve.pop(0)
As the read count of every bin gets scaled to about 1 by this step, the separate normalization step became obsolete: results with and without normalizing the data prior to GC-Correction showed no noticeable differences.
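Why a division makes the separate normalization redundant can be illustrated with a small sketch. Note the assumptions: the fitted curve here is a per-GC-bucket median, a simple stand-in for LOWESS that keeps the example dependency-free, and `gc_correct` as well as the numbers are made up for illustration.

```python
from collections import defaultdict


def median(values):
    """Median of a non-empty list of numbers."""
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2:
        return float(ordered[mid])
    return (ordered[mid - 1] + ordered[mid]) / 2.0


def gc_correct(read_counts, gc_fractions, step=5):
    """Divide each bin by the typical count of bins with similar GC content.

    Stand-in for the LOWESS curve: the median count per GC bucket of `step`
    percent. Assumes every bucket has a non-zero median.
    """
    keys = [int(round(gc * 100)) // step for gc in gc_fractions]
    buckets = defaultdict(list)
    for count, key in zip(read_counts, keys):
        buckets[key].append(count)
    curve = dict((key, median(vals)) for key, vals in buckets.items())
    return [count / curve[key] for count, key in zip(read_counts, keys)]


counts = [100, 110, 90, 200, 220, 180]       # made-up read counts per bin
gc = [0.40, 0.41, 0.42, 0.50, 0.51, 0.52]    # made-up GC fraction per bin

corrected = gc_correct(counts, gc)
# A sample sequenced three times deeper yields the same corrected values,
# because the fitted curve scales along with the counts.
deeper = gc_correct([3 * c for c in counts], gc)
```

Both `corrected` and `deeper` hover around 1, which is the implicit normalization the text refers to.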
The data we obtained from our lab contains enough reads to allow us to use this setting. We prefer using only the most reliable data we can get over more, less reliably mapped data. Of course, you are free to test WISECONDOR with mismatches allowed; just make sure your reference set is built using the same settings as your test samples. WISECONDOR does not care about these mismatches: it only counts reads based on their position on the genome.
Not much. WISECONDOR will probably not report any aberrated bins (not fetal ones anyway), but it does not check the fetal percentage; it simply assumes there is enough fetal DNA.
I built a reference using relatively messy samples, what will a test on a good sample show me using this reference?
Hard to predict, and it depends on what you call messy. If you use mostly the same protocol to prepare the samples in the laboratory, WISECONDOR may happily get along with your messy reference set. Any structurally comparable behaviour among bins can still be identified when building the reference set. If the tested sample has completely different read depths over all bins, but the bins identified in the reference still behave structurally alike, the calls are not necessarily bad.
The problem occurs when the read depths over bins start to behave differently, which may happen when the workflow in a laboratory changes, although even that may be less influential than expected. Still, the general bioinformatics rule applies here:
- Rubbish in is rubbish out.
Indeed, there is a bit of a list:
- Single bin, bin test
- Single bin, aneuploidy test
- Windowed, bin test
- Windowed, aneuploidy test
- Chromosome wide, aneuploidy test
And well, it depends on how strictly you are looking at the data. In general, the list right after Windowed, aneuploidy test should give the most reliable results for aneuploidy calling, while the Windowed, bin test attempts to provide a list of the areas on the chromosomes that are actually aberrated. The Single bin methods are not very reliable as they are not sensitive enough for most samples: instead of combining the power of several bins, just a single bin is taken into account. The calls they make are usually strongly deviating values, while most fetal aberrations do not deviate that much.
The last one, Chromosome wide, aneuploidy test, appears to be way too sensitive for calling aneuploidy cases, but it has shown its use in a set of samples that had far too little fetal DNA: it somehow was able to point us to the right samples, as their aberrated chromosomes showed up as more deviating values than usual using this approach.
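The remark about combining the power of several bins can be made concrete with a small sketch. It assumes the z-scores in a window are combined with Stouffer's method (sum divided by the square root of the count); whether that matches test.py line for line is an assumption, and the cut-off of 3 and the values are purely illustrative.

```python
import math


def stouffer_z(z_scores):
    """Combine z-scores: their sum divided by the square root of their count."""
    return sum(z_scores) / math.sqrt(len(z_scores))


THRESHOLD = 3.0  # illustrative cut-off, not taken from WISECONDOR's code

# Five neighbouring bins that are each mildly increased (made-up values).
window = [2.0, 1.8, 2.2, 1.9, 2.1]

single_bin_calls = [z for z in window if abs(z) > THRESHOLD]
combined = stouffer_z(window)

print("single bin calls: {0}".format(single_bin_calls))
print("combined window z-score above cut-off: {0}".format(combined > THRESHOLD))
```

None of the bins is called on its own, yet the window as a whole clearly exceeds the cut-off, which is why the windowed tests pick up modest fetal aberrations that the single bin tests miss.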
This line:
Average allowed deviation: XX.XX% WARNING: High value (>5%) calls are unreliable
often pops up even in the better samples. It is a single value that tries to give an impression of how reliable the sample is, by calculating how much the read frequency needs to deviate on average before a call is made. To get this number, calculate the standard deviation for every tested bin (based on the reference bins for that bin), triple it, express that tripled standard deviation as a percentage difference in read depth, and average this percentage over all bins within the sample. It was meant to tell whether the applied tests made sense compared to how much the sample deviates from the reference set in general.
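The calculation described above can be sketched as follows. The assumptions: each tested bin comes with a list of read frequencies from its reference bins, the percentage is taken relative to the mean of those values, and the sample standard deviation (n - 1) is used. The function names and numbers are hypothetical.

```python
import math


def sample_sd(values):
    """Sample standard deviation (n - 1 in the denominator)."""
    mean = sum(values) / float(len(values))
    return math.sqrt(sum((v - mean) ** 2 for v in values) / (len(values) - 1))


def average_allowed_deviation(reference_values_per_bin):
    """Mean over bins of 3 * SD, expressed as a percentage of the reference mean."""
    percentages = []
    for ref_values in reference_values_per_bin:
        mean = sum(ref_values) / float(len(ref_values))
        percentages.append(3.0 * sample_sd(ref_values) / mean * 100.0)
    return sum(percentages) / len(percentages)


# Two tested bins with made-up read frequencies of their reference bins.
bins = [
    [0.98, 1.00, 1.02],  # SD 0.02 -> 6% allowed deviation
    [0.99, 1.00, 1.01],  # SD 0.01 -> 3% allowed deviation
]
value = average_allowed_deviation(bins)
warn = value > 5.0  # the 5% limit mentioned in the warning line
print("Average allowed deviation: {0:.2f}%".format(value))
```

With these numbers a bin would need to deviate by 4.5% on average before being called, so no warning would be shown.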
Right now, the warning is a bit too sensitive: it shows up even if the results are perfectly fine. Do take care though, values of 14% and higher usually mean the sample does give WISECONDOR some trouble. A better value has yet to be determined.
The number increases if the tested sample's read depth behaviour structurally differs from the reference set. A 'messy' reference set and a 'stable' one do not necessarily provide high numbers here; the number just tells you how much the sample structurally differs from the set it is tested against.
A high number may be caused by a very stable reference set and a messy sample though, as a stable set may provide wrong information to WISECONDOR: bins that appear to behave the same only behave the same within this stable set.
True. During development I just wanted a quick overview of my results without wasting any space, so all chromosomes have their lengths scaled. An alternative was proposed by S.Ghesquiere, which shows every chromosome by its real length, plus additional, detailed plots for chromosomes 13, 18 and 21 next to their cytobands. We are looking into this approach and may incorporate it in the future. In the meantime, you are free to fork this project and make a pull request; all your input is welcome.
The deviation for any bin compared to its reference. The blue line shows the deviations tested per bin. To get a rough idea of what this means, consider it this way: a high spike shows that the bin is strongly increased compared to what it should be and to the deviation that is expected. A bin whose reference set of bins has a lot of variation will therefore show a smaller spike than a bin that increased just as much but has a more stable set of reference bins. It is not a direct comparison, and it does not show the actual increase for any area; it shows how WISECONDOR looks at that area. Plots that show actual read frequencies for bins do not provide a lot of information, as the small change in read depth caused by a fetal aberration often gets completely drowned out by natural fluctuations in the read depth data.
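How the same increase can produce spikes of different height is shown in the sketch below. It assumes a plain z-score of the tested bin against the values of its reference bins; `z_score` and the numbers are illustrative and not taken from test.py.

```python
import math


def z_score(value, reference_values):
    """Deviation of a bin from its reference bins, in standard deviations."""
    mean = sum(reference_values) / float(len(reference_values))
    variance = sum((v - mean) ** 2 for v in reference_values) / (len(reference_values) - 1)
    return (value - mean) / math.sqrt(variance)


observed = 1.05                     # both tested bins are increased by 5%
stable_refs = [0.99, 1.00, 1.01]    # reference bins with little variation
noisy_refs = [0.95, 1.00, 1.05]     # reference bins with a lot of variation

z_stable = z_score(observed, stable_refs)
z_noisy = z_score(observed, noisy_refs)
print("stable reference bins: {0:.1f}".format(z_stable))
print("noisy reference bins:  {0:.1f}".format(z_noisy))
```

The bin with stable reference bins gives a spike roughly five times higher than the one with noisy reference bins, even though the actual increase is identical.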
The red line shows the windowed approach, which basically just combines a set of results shown in the blue line and determines how much this set deviates. A group of deviating bins shown in blue will therefore result in a strongly deviating red line, hence the visual correlation between the two.
That is considered an artifact. If it shows up in just one sample for that area, the pregnant woman is likely the cause of this:
If a maternal CNV is large enough to cover more than one bin and makes up a relatively large part of the bins' total covered area, it will appear as an aberrated area in WISECONDOR. The windowed method removes the highest bin from its window, but two adjacent spiking bins will leave the window with a strongly deviating value, which will then influence the total Z-Score for all bins close to it.
If the spike does show up in several samples, the reference set used may seem more stable in this area than the tested samples. This structural artifact can be removed by adding more samples to the reference set, allowing WISECONDOR to learn about the spikiness in this area.
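The effect of removing the highest bin from a window, as described above, can be sketched like this. The assumptions: "highest" is taken as the largest absolute z-score, and the remaining bins are combined with Stouffer's method. `windowed_z` and the values are hypothetical.

```python
import math


def windowed_z(z_scores):
    """Combine a window of z-scores after dropping its most extreme bin."""
    kept = sorted(z_scores, key=abs)[:-1]  # drop the largest absolute value
    return sum(kept) / math.sqrt(len(kept))


one_spike = [0.2, -0.1, 9.0, 0.1, -0.2]   # a single spiking bin
two_spikes = [0.2, -0.1, 9.0, 8.5, -0.2]  # two adjacent spiking bins

print("one spike:  {0:.2f}".format(windowed_z(one_spike)))
print("two spikes: {0:.2f}".format(windowed_z(two_spikes)))
```

A lone spike is dropped and leaves the window near zero, while the second of two adjacent spikes stays in and pulls the whole window up.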
If you run into issues, please create a ticket so I can take care of it.
If you have other trouble running WISECONDOR or any related questions, feel free to contact me through the e-mail address on my GitHub page.