Duration and Heterozygosity #22
Before starting over from scratch, you should check whether the long read mapping, short read mapping and variant calling steps finished. If so, then you can save a lot of time by running nphase partial instead. If you have any output in the Phased folder, that's a good sign that all of the previous steps are over. If you're unsure how to determine whether the pre-processing steps ended cleanly, I can look into it later and give you some pretty clear ways of identifying whether something was interrupted partway through.
It looks like there is data in both the long read and short read mapping folders, but there are two empty folders (FastQ and Plots) in the Phased folder, which I'm guessing means that it cut off right in the middle of the phasing step? What would the command look like to do just the partial run? If you can look into that later and let me know some ways to see its progress, that would be really helpful!
If the long reads are mapped and the short reads are mapped and variant called, then you can run nphase partial with the mapped long read file and the short read VCF, where MAPPED_LONG_READ_FILE is the long read SAM produced by the mapping step. If the variant calling wasn't over but the short reads were mapped, then it's the same nphase partial command given the mapped short read file instead of the VCF, where MAPPED_SHORT_READ_FILE is the output of the short read mapping step.
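To make that concrete, here is a hypothetical sketch of the two invocations. The flag names (--mappedLongReads, --vcf, --mappedShortReads) and the file paths are assumptions for illustration, so check nphase partial --help for the real interface:

# Case 1: long reads mapped, short reads mapped AND variant called.
# Flag names below are assumptions -- verify with: nphase partial --help
nphase partial --sampleName trialRun1 --reference s288c.fa \
    --output trialRun1Output \
    --mappedLongReads trialRun1Output/Mapped/LongReads/MAPPED_LONG_READ_FILE.sam \
    --vcf trialRun1Output/VariantCalls/ShortReads/SHORT_READ_VCF.vcf

# Case 2: short reads mapped but variant calling unfinished -- pass the
# mapped short reads (hypothetical flag) so the calling can be redone:
nphase partial --sampleName trialRun1 --reference s288c.fa \
    --output trialRun1Output \
    --mappedLongReads trialRun1Output/Mapped/LongReads/MAPPED_LONG_READ_FILE.sam \
    --mappedShortReads MAPPED_SHORT_READ_FILE.bam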
Hello! I'm so sorry for the long wait - I started working through other projects but I have finally circled back to this issue and I have some updates. I went to run nphase partial using the mapped long read SAM and VCF files from the interrupted run.
And then this is what was in the log file:
Oof! I know this turned into a long message, but is there anything I can do to fix this? It's doing everything BUT phasing my sample and I don't know what I'm doing wrong. Thank you in advance!! Other info:
Based on your output it looks like, for some reason, the GATK variant selection is happening twice simultaneously. Is there any possible way you somehow ran the same command line twice? I'll be checking the code to see if there's a bug no one has run into before. What I think is most efficient, and definitely what I would do, is to download the example data in https://github.com/OmarOakheart/nPhase/tree/master/example, then run it and check that it works properly (it should be over in a matter of minutes). If it does work fine, then my recommendation is to just start over with your raw data. There's still a risk that by accidentally restarting in the middle of a run you ended up with an incomplete BAM file or something of that nature, and now we're seeing weird things happen because the input is weird; better to rule that out quickly, in my opinion. If you're strapped for computing time, let me know and I can try to come up with another recommendation.
I ran through the pipeline with the example data and it went all the way through and finished fine, so I am going to go back and rerun it. I'll let you know what happens!
Hello again! I reran the pipeline and it ended at virtually the same spot. This was printed to STDOUT:
And this is the log file:
I'm sorry this is such a long post, but I am so confused why it ends the way it does - and consistently, every time I run it. I'm going to try running it with another sample today just to make sure it's not this sample that's giving it grief. Thank you for all your help so far!!!
So, looking at the output log, you can see that a lot of lines are duplicated, which shouldn't happen, and which I also presume didn't happen with the example dataset?
Could you show me what command line you typed to launch nPhase? I'm really not seeing a situation that would cause this behavior. Could you also check whether, when you launch the example dataset, it does the same thing of seemingly running the entire pipeline several times in parallel?
I noticed that too! I thought it was weird! I'm glad I wasn't seeing things... I used conda to activate polyploidPhasing and then I use this command (with nohup and & so it can continue running on the server without me needing to be logged in):
When I ran the example dataset it didn't seem to run the pipeline several times either.
I usually use screen.
It writes the STDOUT (and whatever prints to the command line - so I'm assuming the STDERR as well) to a txt file called nohup.out. It should work the same way as screen? But I could try using that instead and see if that makes a difference.
I think there's the slightest little chance that STDERR isn't redirected to nohup.out and that it's hiding an error from us that would help us understand what's going on here. I do see the documentation for nohup seems to say that STDERR should be in nohup.out, but maybe doing an explicit redirect would help. Screen is a bit weird to use if you're not familiar with it. Sorry this is such trouble, I really don't see what could cause it to run several times in parallel. I suppose I've also never tested running nPhase in the background with &, but that shouldn't break anything. I'm thinking there's an error being output that we're not seeing. The example dataset is tiny, so maybe we wouldn't see anything, but did you test it with nohup or without?
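For reference, one way to force both streams into the same log is plain shell redirection (nothing nPhase-specific; the pipeline flags are just the ones from your own command):

# Send STDOUT to a log file and STDERR to the same place, then background the job:
nohup nphase pipeline --sampleName trialRun1 --reference s288c.fa \
    --longReads pacbioHiFireads.fastq.gz --longReadPlatform pacbio \
    --R1 illuminaShortRead_1.fastq --R2 illuminaShortRead_2.fastq \
    --output trialRun1Output --threads 8 > nphase.log 2>&1 &

# Follow progress live:
tail -f nphase.log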
Hi again! I ran it again with the example data using nohup and &, and the log files didn't show any signs of duplication in the process. So I'm back to square one and I can't figure out what to do! Hopefully we can figure this out together - it would be great to get this working, and in case someone else has a problem like this they can use this thread to help!
Okay, let's troubleshoot. First, making sure we're on the same page. Initially, the issue was that your machine restarted by accident in the middle of a run and my advice was to run nphase partial. Running nphase partial, for some reason, failed silently. The example data works without issue, but when you tried to run the entire pipeline (in a new folder, starting over completely), it failed again, silently, at the same spot. We're also seeing a weird glitch where the heterozygous SNPs are being extracted twice, as shown in the logs, but it doesn't happen with the example data.

Maybe let's look at the data then and see what's going on. Hopefully you still have the data from the last failed attempt at running nphase pipeline. Do you have a .hetPositions.SNPxLongReads.validated.tsv file in your VariantCalls/longReads folder? Does it show SNPs in chromosomes I to XVI?
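A quick way to list which chromosomes appear in that file, assuming (a guess on my part) that the chromosome name is the first tab-separated column:

# List the chromosomes present in the validated SNP x long read table
# (adjust cut -f1 if the chromosome is stored in a different column):
cut -f1 VariantCalls/longReads/YOUR_SAMPLE.hetPositions.SNPxLongReads.validated.tsv | sort -u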
Okay, the weirdest thing happened... it worked??? I reran the SAME code last night with my sample and it went all the way through and gave me phased fastq files? I have no idea what was different or what changed, but it went all the way through. Maybe it had something to do with the server I was on? I'm going to use another sample now and see if it runs all the way or stops at the same place, but everything you said was right: my machine restarted by accident in the middle of the run, trying to run nphase partial to pick up where it left off failed silently, the example data worked fine with no duplicated processes, but when I go to run it with my data the same errors come up (pretty much at the same place). Depending on what happens with the next sample, maybe it was just something to do with how I was running it.

Now that I have data, I have some questions on how to interpret it. In the resulting /Phased/FastQ folder there are 834 cluster fastq files and 430 merged cluster fastq files. Is that normal for some (yeast) data? I know these are haploTIGs and not full haploTYPEs, so how do I combine them into 4 haplotypes if I think the sample is supposed to be tetraploid? Is this where I would go to the Phasing-Toolkit to combine the necessary clusters together to generate the 4 files I am looking for?

My next steps with this data would be to take the 4 haplotypes and do a de novo assembly, polish with short reads, and scaffold them to the reference to then do other downstream analysis with the processed data. So my next step would be to take these 834 clusters.fq and combine them into 4 fastq files - I'm just not sure how to do that now.
Haha, I mean, from my perspective it's weird that it wasn't working before! Hopefully it all runs fine again with your other sample. Are you saying you have a total of over 1,200 fastQ files? The naming convention of the fastQs being output doesn't have any particular meaning. This sounds to me like nPhase didn't do a good job at phasing and is giving you way too many haplotigs. Would you be able to share the _covVis.png output? It should be easy for me to see quickly whether your data was phased well and there's just a lot of noise, or whether the data didn't phase well at all. Then I can try to give you advice depending on the case.
I hope so too! I have a feeling it might've been a fluke though, just based on the number of fastq files it resulted in... This is what the _covVis.png looks like: I'm also adding the _phasedVis.png just to show that as well: I also ran this sample using only chromosome IX as the reference sequence (it's just the area I am focused on right now, so I wanted to see if only using that specific chromosome would make it go faster and give more accurate results), and that resulted in these _covVis.png and _phasedVis.png images: Running with only chrIX also resulted in only 4 clustered fastq files and 2 merged cluster fastq files - does that mean that for this area it picked up on 4 haplotypes (each cluster fastq = 1 haplotype), showing that it's tetraploid? All in all, what does the data look like to you?
What's your total coverage? Based on the total covVis it looks like it's around 10X? But then your chrIX run looks like you have about 100X. The full run looks extremely fragmented, which makes sense if you only have 10-20X total, but if you have around 100X then maybe it's a read length distribution issue. I find that contiguity depends mostly on three factors: coverage (I recommend 10X per haplotype minimum, so 40X for a 4n), read length distribution (generally a mean read length > 15 kb will lead to good results), and heterozygosity level. The less heterozygosity you have, the more the other two parameters need to make up for it.
I believe the coverage should be about 170X - but I'm not exactly sure how to check. I ran both using the same raw long reads (ONT). Is there a way to specify the coverage in the full run? If I were to try this again with this sample, what would the command look like to take all of that into account?
I found this in one of your outputs:
So that says 120,626 reads at an average length of 7,898 bp. Let's say 120,000 reads * 7,900 bp / 12,500,000 bp, for a coverage of about 75X. So the good news is that the coverage is high, but the read length is low for a lot of these reads. I would downsample to 40 or 50X, trying to maximize read length. You can use a tool called Filtlong for that: https://github.com/rrwick/Filtlong. The keep_percent parameter can be a bit higher or lower, too.

Downsample, then check the mean read length (you could use NanoPlot to generate those stats). If it's at least 12 kb, ideally more than 15 kb, run nPhase again; otherwise downsample further. I wouldn't go below 30X coverage. Running it again with a better read length profile should improve the results a lot. Right now the results you show are not interpretable, too fragmented, I would assume because of the read length. Although I still find it confusing that the output is showing a coverage of only about 3X per haplotig.
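A sketch of what the downsampling and the follow-up check could look like. The specific thresholds are illustrative rather than prescriptive (~600 Mb of bases is roughly 48X of a ~12.5 Mb genome); adjust them to your data:

# Keep the best ~90% of bases, drop reads under 10 kb, and cap total bases
# at ~600 Mb (about 48X for a 12.5 Mb genome):
filtlong --min_length 10000 --keep_percent 90 --target_bases 600000000 \
    rawLongReads.fastq.gz | gzip > filteredLongReads.fastq.gz

# Check the mean read length and yield of the filtered reads:
NanoPlot --fastq filteredLongReads.fastq.gz -o nanoplot_filtered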
Okay, I ran that command with the raw reads in Filtlong and came up with 17,754 bp as the mean read length, which comes out to ~48X coverage! I just wrote the command and started running it on my server, so hopefully by tomorrow I'll have results and we can reconvene. Thank you so much for all your help so far, I really really appreciate it. Here's to this run working!!!!!!!!
I'm very confused by these results. If I saw that as an output I would start double-checking everything. So I would check my command line, in case I accidentally input the wrong long read file. Then I would use NanoPlot to check the statistics of the long read file used as input, to see whether the mean read quality, or length, or coverage are different than I expected. Then I would check the coverage of the .sam file in Mapped/LongReads to make sure that the coverage is what I expect along the entire genome.

The output is showing us that the heterozygous SNPs are evenly distributed along the genome (with some small exceptions; for example, the beginning of chromosome VI appears to be homozygous, though maybe that's not true and just an artefact of whatever is going wrong here). So I wouldn't immediately check the short reads or the GATK variant calling; that appears to have run fairly smoothly.

If everything above is fine, then another possibility is that the long read sample and the short read sample aren't the same, and therefore nPhase is finding too few long reads that are heterozygous in the same positions as the short read data (perhaps even errors in the long read data, which would account for the very low coverage). I'm not sure that this is what it would look like if this happened, but I suppose it's something to consider. Maybe you have a way of double-checking that on your side; if not, you could use something like https://github.com/WGLab/NanoCaller to variant call the long reads and then compare the variants in the short read VCF and the long read VCF. Or I could try to advise you on a way to analyze the contextDepths file to determine whether the majority of variants are absent or low coverage in your long read data, which would be a sign that the long read sample does not present the variants found in the short reads, either due to it being a different sample or due to issues in long read quality.

Finally, maybe I messed something up with my latest update to nPhase a few months ago. If it's none of the above, we could test that by trying an earlier version of nPhase in a new environment. I'm sure we can figure this out; I'm sorry using nPhase hasn't been the painless experience I was hoping it would be for everyone.
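For the two concrete checks mentioned above (coverage along the genome, and comparing the two VCFs), standard samtools/bcftools one-liners are enough; the file names here are placeholders:

# Per-chromosome depth and breadth of the long read mapping
# (sort and index the SAM first, since coverage tools expect sorted input):
samtools sort -o longReads.sorted.bam Mapped/LongReads/MAPPED_LONG_READ_FILE.sam
samtools index longReads.sorted.bam
samtools coverage longReads.sorted.bam

# Intersect short read and long read calls (VCFs must be bgzipped + indexed);
# the outputs in vcf_comparison/ separate private and shared variants:
bcftools isec -p vcf_comparison shortReads.vcf.gz longReads.vcf.gz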
One last thing: when you tested chrIX by itself, that output looks normal (not perfect, though I think if it was run a second time with the subsampled data it would give better results). So maybe the easiest way to figure out what's going on is to check for any differences between that run and the full-genome run? The reference sequence shouldn't be the issue; maybe you used a different sample? Or maybe there were memory issues and the full-genome run is getting killed? Though I don't believe the plots would still be generated if that was happening. If the chrIX run was done with different samples, I recommend trying them again in a full-genome run (and subsample them first); this way we would know whether the problem is with nPhase or with the data.
I completely agree... these results are confusing me as well. I went back to the command I used and made sure it was the correct long read file, which it was. I included a summary of the stats after running NanoPlot on the filtered long reads below:
One thing to note is that I did not filter the short reads with Filtlong - is that something I should take into account as well? I only filtered the long reads because I felt that would be enough and the short reads would fill in the rest of the data. I'll have a look at the short read variants. I'll probably end up calling variants with the long reads and comparing the two VCF files, as well as the differences between the full-genome and the chrIX call. When I ran the command using just chrIX it was the same sample, so I think there is either something wrong with the way it's being run on my machine or something fundamentally wrong from the sequencing run. I wish there was a way to send you the samples so you could try the run, but your suggestions have helped a ton so far, so I'm sure we'll find a solution.
But then, if chrIX was generated using the same long read file, why aren't we seeing nearly the same level of coverage on chrIX when running the full genome? We should expect to see at least something similar. You did it correctly: the short reads shouldn't be subsampled, only the long reads, since they're the basis for the clustering. You could upload the VCF generated in VariantCalls/ShortReads/ and the contextDepths file in the Overlaps folder to a Google Drive folder for me to take a look. You can email me a link at omaroakheart@gmail.com. I don't have access to a server to do a full run, unfortunately.
Hi! I sent you an email earlier today, but in case it's easier to respond here, I have some results from the new run, which I was able to finish! I think these results are much cleaner and make more sense to me. However, I have ~186 files in the fastq folder from the run. Does this show the total number of haplotigs/haplotypes? If I predicted this sample to be tetraploid, how do I extract 4 fastq files that represent each haplotype? What are your thoughts on these new plots from this run?
Hey, I've responded in detail through email, but just in case someone else sees this issue and is curious, here are some short answers:

One thing you can do now is run "nphase cleaning", which will attempt to merge together complementary haplotigs and get rid of small, inconsequential haplotigs such as the light blue haplotig at the beginning of "contig_47" in the bottom right, which is very lowly covered and does not cover many heterozygous SNPs.

Each of the 186 fastQ files represents the reads associated with one haplotig. For example, chrI only has three, and the coverage plot shows that they are all roughly equally covered. I would check whether there is an aneuploidy in chrI which leaves it with only three copies (and three associated haplotypes). ChrI of S. cerevisiae is usually rather messy, but if you do find evidence of aneuploidy (for example, if the short read data shows evidence for a 1/3, 2/3 distribution of heterozygous SNPs), then that would be great corroborating evidence.

You should use the discordanceVis plot to help with interpretation. It will show you whether there are any haplotigs with an unexpected allele frequency distribution. We would expect the allele frequency distribution to show nearly 90% of SNPs in agreement, with a margin of about 10% for sequencing errors, since we're looking at long reads. If there's a cluster of SNPs with an allele frequency closer to 50%, this is evidence that nPhase has mistakenly mixed reads from different haplotypes.

Looking at the end of chromosome 2 or chromosome 15, we can tell that there have been mistakes made by nPhase, since the heterozygous SNPs there are all represented by only one haplotig. You should expect to see evidence of this in the discordanceVis plot. One thing you can do is take the fastQ file that corresponds to this haplotig and run nPhase with it as the long read input, so that you might find a way to reduce the discordance levels and obtain a prediction that is coherent on that level; see the sketch below.

There's a lot to be interpreted in these plots. For example, chromosome 13 shows evidence that it might have a +1 aneuploidy, making it 5n: on the first 300 kb it seems clear that there are 4 distinct haplotypes, three of which are covered at about a 1X level, and the fourth covered twice as much as the others, which is evidence there might be 5 copies.
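As a rough sketch, the cleaning step and the haplotig re-phasing could look like the following. The flag names for nphase cleaning are my guesses for illustration (check nphase cleaning --help and the README for the actual interface); the nphase pipeline flags are the documented ones, with placeholder file names:

# Hypothetical flag names -- verify with: nphase cleaning --help
nphase cleaning --sampleName trialRun1 \
    --longReads filteredLongReads.fastq.gz \
    --resultFolder trialRun1Output

# Re-run nPhase using one suspect haplotig's reads as the long read input,
# to see whether the discordant cluster splits into coherent haplotypes:
nphase pipeline --sampleName chr15HaplotigRecheck --reference s288c.fa \
    --longReads trialRun1Output/Phased/FastQ/SUSPECT_HAPLOTIG.fastq \
    --longReadPlatform ont --R1 illuminaShortRead_1.fastq \
    --R2 illuminaShortRead_2.fastq --output chr15RecheckOutput --threads 8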
Hi, I'm trying to run nPhase on a tetraploid yeast genome I sequenced with PacBio HiFi reads.
Here is my code:
nphase pipeline --sampleName trialRun1 --reference s288c.fa --longReads pacbioHiFireads.fastq.gz --longReadPlatform pacbio --R1 illuminaShortRead_1.fastq --R2 illuminaShortRead_2.fastq --output trialRun1Output --threads 8
The issue I am having is that it's taking a long time to run - my computer accidentally restarted during its run, so the progress I made running it for the past day or two is gone. Before I run the code again, I wanted to see if there was a flag I could add, or something I could do to the data, to make it go faster.
I saw your messages in issues #11 and #10 but was wondering if anything was done to the nPhase code to account for this. I also read up on your solution to #12 but I'm not sure how to go about doing that. I can run the code using the command I have written out above, and I don't mind if it takes a longer amount of time, but how much time on average would it take for a tetraploid yeast genome to run through nPhase?