Skip to content

Bioinformaticsnl/A50-plot

 
 

Repository files navigation

These R scripts create metrics and graphs from genome assemblies.

To use, first install the dependancies with R CMD BATCH install.R. After that, edit the input.csv file to match your situation. Then you can run the image and stats generating script with:

Rscript AssemblyStats.R input.csv

The CSV (comma separated) columns are: Seq_Name Seq_File

Spaces are allowed. The order in the file will define the order in the graphics' legend. You can comment lines with # First line HAS to be the title: Seq_Name,Seq_File

You can edit AssemblyStats.R to change the maximum X limit (number of contigs): max_count <- 20000 Any assembly with more than 20000 contigs will be trimmed to this size and the total sum after this will be shown as 20001.

We have defined a new concept called "lookup value" which is defined as 75% of the harmonic mean of the sequences sizes. Once this values is set, the Nlookup (contig size) and Ilookup (contig index) is calculate for each sequence. Having the harmonic mean among all sizes as a constant lookup value seems more comparable than using 50% of the length of each assembly individualy (N50). E.G.: assemblies size: 1Mbp, 10Mbp, 20Mbp Hmean = 1 / (( 1/1M + 1/10M + 1/20M ) / 3) = 2.6Mbp Mean = ( 1M + 10M + 20M ) / 3) = 10.3Mbp

Harmonic means, usually, gives a lower value than the Mean, allowing the better analysis of discrepant values easier. For this dataset, the lookup value would be 75% of 2.6Mbp (1.95Mbp). This fixed value will be used to calculate Nl and Il for the tree individuals. A line is drawn showing the lookup value (Nl). In the top of the graph tick marks shows the Il. The legend has the values numerically.

The result of the script looks like this :

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published