<!DOCTYPE html>
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
<title>Assignment RNA sequence work flow for Week 5</title>
<script type="text/javascript">
window.onload = function() {
var imgs = document.getElementsByTagName('img'), i, img;
for (i = 0; i < imgs.length; i++) {
img = imgs[i];
// center an image if it is the only element of its parent
if (img.parentElement.childElementCount === 1)
img.parentElement.style.textAlign = 'center';
}
};
</script>
<style type="text/css">
body, td {
font-family: sans-serif;
background-color: white;
font-size: 13px;
}
body {
max-width: 800px;
margin: auto;
padding: 1em;
line-height: 20px;
}
tt, code, pre {
font-family: 'DejaVu Sans Mono', 'Droid Sans Mono', 'Lucida Console', Consolas, Monaco, monospace;
}
h1 {
font-size:2.2em;
}
h2 {
font-size:1.8em;
}
h3 {
font-size:1.4em;
}
h4 {
font-size:1.0em;
}
h5 {
font-size:0.9em;
}
h6 {
font-size:0.8em;
}
a:visited {
color: rgb(50%, 0%, 50%);
}
pre, img {
max-width: 100%;
}
pre {
overflow-x: auto;
}
pre code {
display: block; padding: 0.5em;
}
code {
font-size: 92%;
border: 1px solid #ccc;
}
code[class] {
background-color: #F8F8F8;
}
table, td, th {
border: none;
}
blockquote {
color:#666666;
margin:0;
padding-left: 1em;
border-left: 0.5em #EEE solid;
}
hr {
height: 0px;
border-bottom: none;
border-top-width: thin;
border-top-style: dotted;
border-top-color: #999999;
}
@media print {
* {
background: transparent !important;
color: black !important;
filter:none !important;
-ms-filter: none !important;
}
body {
font-size:12pt;
max-width:100%;
}
a, a:visited {
text-decoration: underline;
}
hr {
visibility: hidden;
page-break-before: always;
}
pre, blockquote {
padding-right: 1em;
page-break-inside: avoid;
}
tr, img {
page-break-inside: avoid;
}
img {
max-width: 100% !important;
}
@page :left {
margin: 15mm 20mm 15mm 10mm;
}
@page :right {
margin: 15mm 10mm 15mm 20mm;
}
p, h2, h3 {
orphans: 3; widows: 3;
}
h2, h3 {
page-break-after: avoid;
}
}
</style>
</head>
<body>
<p>Paula Carrio Cordo - 24/10/2016 </p>
<h3>Assignment RNA sequence work flow for Week 5</h3>
<p><strong>Option 1. Download the fastq file(s) for one publicly available sample.
Briefly describe the sample, run FastQC on the file(s) and comment on the results.</strong></p>
<p>This analysis is based on a fastq file downloaded from the European Nucleotide Archive (ENA). The selected study is PRJNA28911 (sample accession SAMN00001622). It is part of the 1000 Genomes Project Pilot 1 (low-coverage sequencing of 180 HapMap individuals from multiple populations), a study in <em>Homo sapiens</em>.</p>
<p>More details were obtained by running FastQC on the file. FastQC is a quality control tool for high-throughput sequencing data that checks for low-quality data, remaining adapters, contamination and similar problems. Overall, the sample is of reasonably good quality: most modules of the FastQC report pass, the exceptions being Per Base Sequence Quality and Kmer Content (failures) and Overrepresented Sequences (warning).</p>
<h4>Basic Statistics</h4>
<p>The file ERR000051_1.fastq.gz has the file type "Conventional base calls" and uses the Sanger / Illumina 1.9 quality encoding. It contains 8145207 sequences in total, and 0 sequences are flagged as poor quality. The sequence length is 45 bases, and the GC content is 44%.</p>
<h4>Per Base Sequence Quality</h4>
<p>The overview of the range of quality values across all reads at each position in the FastQ file shows a failure. For more than half of the positions the median quality is below 20, and for some positions the lower quartile is below 5; FastQC reports a failure when either condition holds at any position.<br>
The quality scores show an unusual pattern that may come from a general degradation of quality over the duration of long runs. </p>
<p><img src="%22C:/Users/TOSHIBA/Desktop/R%20stuff/imagesFastQC/1%22" alt="Figure 1"> </p>
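<p>The failure criterion can be made concrete with a short sketch. The Python below is not FastQC's own code: it assumes Phred+33 quality strings of equal length (the Sanger / Illumina 1.9 encoding reported in Basic Statistics), uses hypothetical helper names (<code>phred33_scores</code>, <code>per_base_quality_status</code>), and applies the thresholds documented for this module (failure if any position has a lower quartile below 5 or a median below 20, warning below 10 or 25). The quartile method is an assumption and may differ slightly from the one FastQC uses.</p>
<pre><code class="python">
```python
from statistics import median, quantiles


def phred33_scores(quality_line):
    """Decode one FASTQ quality line (Phred+33 offset) into integer scores."""
    return [ord(ch) - 33 for ch in quality_line]


def per_base_quality_status(quality_lines):
    """Classify equal-length reads as 'pass', 'warn' or 'fail'.

    Needs at least two reads, because quartiles of one value are undefined.
    """
    # Transpose the reads so that each column holds the scores of one position.
    columns = zip(*(phred33_scores(q) for q in quality_lines))
    medians, lower_quartiles = [], []
    for column in columns:
        medians.append(median(column))
        # quantiles(..., n=4) returns the three quartile cut points;
        # the first one is the lower quartile.
        lower_quartiles.append(quantiles(column, n=4, method="inclusive")[0])
    # Thresholds are written as "limit > value" on purpose.
    if 5 > min(lower_quartiles) or 20 > min(medians):
        return "fail"
    if 10 > min(lower_quartiles) or 25 > min(medians):
        return "warn"
    return "pass"
```
</code></pre>
<p>For example, three reads whose quality line is <code>IIII</code> (Q40 at every position) are classified as <code>pass</code>, whereas the quality lines <code>II#</code>, <code>II#</code> and <code>III</code> are classified as <code>fail</code>, because the median at the last position is 2.</p>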
<h4>Per Sequence Quality Scores</h4>
<p>The plot shows the distribution of mean quality scores over all sequences. In this case there is no subset of sequences with universally low quality values. The average quality per read was 35. </p>
<p><img src="%22C:/Users/TOSHIBA/Desktop/R%20stuff/imagesFastQC/2%22" alt="Figure 2"> </p>
<h4>Per Base Sequence Content</h4>
<p>The proportion of each of the four DNA bases called at each position looks normal for our sample file.
A and T stay roughly constant at around 30% each, whereas G and C stay at around 20% each.</p>
<p><img src="%22C:/Users/TOSHIBA/Desktop/R%20stuff/imagesFastQC/3%22" alt="Figure 3"> </p>
<h4>Per sequence GC content</h4>
<p>The observed distribution of GC content per read is close to the theoretical distribution and is approximately normal. The central peak corresponds to the overall GC content of the genome of the sample.</p>
<p><img src="%22C:/Users/TOSHIBA/Desktop/R%20stuff/imagesFastQC/4%22" alt="Figure 4"> </p>
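<p>As an illustration of what this module measures, the sketch below computes the GC percentage of each read and the mean over a set of reads. The function names are hypothetical, and ignoring N when counting called bases is an assumption of this sketch, not a documented FastQC behaviour.</p>
<pre><code class="python">
```python
def gc_percent(sequence):
    """Percentage of G and C among the called (A, C, G, T) bases of one read."""
    seq = sequence.upper()
    called = sum(seq.count(base) for base in "ACGT")
    if called == 0:
        # A read made only of N has no called bases; report 0.0 by convention.
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / called


def mean_gc_percent(sequences):
    """Mean of the per-read GC percentages (expects a non-empty list)."""
    values = [gc_percent(s) for s in sequences]
    return sum(values) / len(values)
```
</code></pre>
<p>A histogram of the <code>gc_percent</code> values over all reads corresponds to the kind of curve shown in the figure; its peak would be expected near the 44% GC content reported in Basic Statistics.</p>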
<h4>Per Base N Content</h4>
<p>A very low proportion of Ns appears at every position. This is a good indication that the sequencer was able to make base calls with sufficient confidence throughout the sequencing run.</p>
<p><img src="%22C:/Users/TOSHIBA/Desktop/R%20stuff/imagesFastQC/5%22" alt="Figure 5"></p>
<h4>Sequence Length Distribution</h4>
<p>This high-throughput sequencer generated sequence fragments of uniform length. The graph shows that most of the fragments had a sequence length of 45 bp.</p>
<p><img src="%22C:/Users/TOSHIBA/Desktop/R%20stuff/imagesFastQC/6%22" alt="Figure 6"> </p>
<h4>Sequence Duplication Levels</h4>
<p>The duplication levels are good. In both the red and the blue line, the sequences fall at the far left of the plot, which indicates that the sample had a properly diverse library.</p>
<p><img src="%22C:/Users/TOSHIBA/Desktop/R%20stuff/imagesFastQC/7%22" alt="Figure 7"></p>
<h4>Overrepresented Sequences</h4>
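<p>A warning is reported in this section. FastQC found that the sequence "NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN..." represents more than 0.1% of the total, which is the threshold for a warning (a failure is raised above 1%). An overrepresented sequence can indicate that the library is contaminated or not as diverse as expected. Here, however, the sequence consists only of Ns, which suggests reads for which no base could be called rather than a contaminant.</p>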
<p><img src="%22C:/Users/TOSHIBA/Desktop/R%20stuff/imagesFastQC/8%22" alt="Figure 8"></p>
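<p>The idea behind this module can be sketched as a simple count of identical reads. The code below is a simplified illustration, not FastQC's implementation: the function name <code>overrepresented</code> and its <code>threshold_percent</code> parameter are hypothetical, and it counts every read passed to it.</p>
<pre><code class="python">
```python
from collections import Counter


def overrepresented(sequences, threshold_percent=0.1):
    """Return (sequence, count, percent) for sequences above the threshold."""
    counts = Counter(sequences)
    total = sum(counts.values())
    hits = []
    # most_common() yields the sequences from most to least frequent.
    for seq, count in counts.most_common():
        percent = 100.0 * count / total
        if percent > threshold_percent:
            hits.append((seq, count, percent))
    return hits
```
</code></pre>
<p>With the default threshold of 0.1%, a read made only of Ns that occurs 3 times among 2000 otherwise distinct reads (0.15%) would be the only sequence returned.</p>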
<h4>Adapter Content</h4>
<p>The Adapter Content module searches the reads for a set of known adapter sequences and plots, for each position, the cumulative proportion of reads in which each adapter has been seen. A positive result is reported, since no adapter sequence is present in more than 5% of all reads.</p>
<p><img src="%22C:/Users/TOSHIBA/Desktop/R%20stuff/imagesFastQC/9%22" alt="Figure 9"></p>
<h4>Kmer Content</h4>
<p>The Kmer Content module shows a failure, meaning that at least one Kmer is unevenly distributed along the length of the reads. This module can reveal problems that the Overrepresented Sequences module misses: in long sequences with poor sequence quality, random sequencing errors can dramatically reduce the counts of exactly duplicated sequences.</p>
<p><img src="%22C:/Users/TOSHIBA/Desktop/R%20stuff/imagesFastQC/10%22" alt="Figure 10"></p>
</body>
</html>