---
title: 'gffpandas: A Python library to process gff3 files as pandas data frames'
tags:
  - bioinformatics
  - computational biology
  - data analyses
  - gff3 format
  - pandas
  - genome annotation
author:
  - name: Vivian A. Monzon
    affiliation: 1
    orcid: 0000-0001-7125-6212
  - name: Konrad U. Förstner
    affiliation: "1, 2, 3, 4"
    orcid: 0000-0002-1481-2996
affiliations:
  - name: Institute for Molecular Infection Biology, Julius-Maximilian-University Würzburg, Würzburg, Germany
    index: 1
  - name: Core Unit Systemmedizin, Julius-Maximilian-University Würzburg, Würzburg, Germany
    index: 2
  - name: ZB MED - Information Centre for Life Sciences, Cologne, Germany
    index: 3
  - name: TH Köln - University of Applied Sciences, Cologne, Germany
    index: 4
date: XX June 2019
bibliography: paper.bib
---

Summary

The GFF3 format (general feature format version 3) is a widely used plain text file format for the representation of genome annotations. gffpandas is a Python library that aims to make the processing and integration of such GFF3 files easy and efficient. For this purpose it builds upon the popular pandas [@mckinney-proc-scipy-2010] library, inherits its design principles and extends the DataFrame class. This has the advantage that the general (and widely known) methods of pandas' DataFrames, as well as specific methods for the manipulation and filtering of genome annotations, can be used to process annotation data. Filter and processing methods can easily be combined to process annotations in a few lines of code.
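
As a rough illustration of this idea, the sketch below uses plain pandas rather than the gffpandas API. It loads a small, made-up GFF3 snippet into a DataFrame and chains two filters, one on the feature type and one on the feature length. The column names follow the GFF3 specification; the sample data and the helper names `read_gff3_body`, `filter_type` and `filter_min_length` are assumptions made for this example and are not functions of gffpandas.

```python
import io

import pandas as pd

# The nine GFF3 columns, named as in the GFF3 specification.
GFF3_COLUMNS = ["seq_id", "source", "type", "start", "end",
                "score", "strand", "phase", "attributes"]

# Made-up example annotation (tab-separated, with a "##" header line).
SAMPLE_GFF3 = (
    "##gff-version 3\n"
    "chr1\texample\tgene\t1\t1000\t.\t+\t.\tID=gene1\n"
    "chr1\texample\tCDS\t1\t300\t.\t+\t0\tID=cds1;Parent=gene1\n"
    "chr1\texample\tgene\t2000\t2200\t.\t-\t.\tID=gene2\n"
)


def read_gff3_body(handle):
    # Hypothetical helper: read only the feature lines into a DataFrame.
    # comment="#" skips the header lines; it would also truncate feature
    # lines containing "#", which is acceptable for this small sample.
    return pd.read_csv(handle, sep="\t", comment="#", names=GFF3_COLUMNS)


def filter_type(df, feature_type):
    # Hypothetical helper: keep only rows of one feature type.
    return df[df["type"] == feature_type]


def filter_min_length(df, min_length):
    # Hypothetical helper: GFF3 coordinates are 1-based and inclusive,
    # so the feature length is end - start + 1.
    length = df["end"] - df["start"] + 1
    return df[length >= min_length]


annotation = read_gff3_body(io.StringIO(SAMPLE_GFF3))

# Each filter returns a DataFrame, so filters can be chained.
long_genes = annotation.pipe(filter_type, "gene").pipe(filter_min_length, 500)
print(long_genes["attributes"].tolist())
```

Because every step returns a DataFrame, any ordinary pandas method (sorting, grouping, merging) can follow a filter directly.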

Requirements

The library requires Python >= 3.4 and the pandas library.

Availability

gffpandas is available under the ISC license.

References

""" gffpandas is a Python library for working with genome annotation data. It simplifies working with gff3 (general feature format version 3) files by filtering them for the desired annotation entries. This makes gffpandas an easy-to-use and time-saving library.

A gff3 file contains information about the location and attributes of genomic features, for example a gene or an exon. It always follows the same format: a header with meta-information, followed by nine columns of feature information [@gff3-The-Sequence-Ontology].
If only entries with specific characteristics are needed, there is no simple tool to extract these from the whole gff3 file. With the gffpandas library it is possible to return the desired entries of a gff3 file, for example all entries of a specific feature type or of a given feature length.
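
The file layout described above can be made concrete with a short sketch that uses only the Python standard library. It splits a made-up GFF3 snippet into header lines (starting with `#`) and nine-column feature records. The sample data and the function name `split_gff3` are assumptions for illustration and are not part of gffpandas.

```python
# The nine GFF3 columns, named as in the GFF3 specification.
GFF3_COLUMNS = ("seq_id", "source", "type", "start", "end",
                "score", "strand", "phase", "attributes")

# Made-up example file content.
SAMPLE_GFF3 = (
    "##gff-version 3\n"
    "##sequence-region chr1 1 3000\n"
    "chr1\texample\tgene\t1\t1000\t.\t+\t.\tID=gene1\n"
    "chr1\texample\texon\t1\t300\t.\t+\t.\tID=exon1;Parent=gene1\n"
)


def split_gff3(text):
    # Hypothetical helper: return (header_lines, feature_records).
    header, features = [], []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip empty lines
        if line.startswith("#"):
            header.append(line)  # meta-information and comments
            continue
        fields = line.split("\t")
        if len(fields) != len(GFF3_COLUMNS):
            raise ValueError(f"expected 9 columns, got {len(fields)}: {line!r}")
        # All values are kept as strings in this sketch.
        features.append(dict(zip(GFF3_COLUMNS, fields)))
    return header, features


header, features = split_gff3(SAMPLE_GFF3)
print(len(header), "header lines;", len(features), "features")
for feature in features:
    print(feature["type"], feature["start"], feature["end"])
```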

The gffpandas library is an alternative to gffutils or bcbio-gff, but it is modelled on the Python library pandas. Building on pandas, gffpandas reads a gff3 file into a data frame and uses this structure for all further functions. One big advantage is that several filter functions can be combined to select exactly the required annotation entries. Furthermore, the annotation data can be saved again as a gff3 file or as a csv or tsv file.
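
The export step can be sketched in the same spirit with plain pandas; this is not the gffpandas API. The example below builds a tiny, made-up annotation table, keeps the CDS entries, and serializes them as csv, as tsv, and as GFF3-style text with a version header. All data and variable names are assumptions for illustration.

```python
import pandas as pd

# Made-up annotation table with the nine GFF3 columns.
annotation = pd.DataFrame({
    "seq_id": ["chr1", "chr1"],
    "source": ["example", "example"],
    "type": ["gene", "CDS"],
    "start": [1, 1],
    "end": [1000, 300],
    "score": [".", "."],
    "strand": ["+", "+"],
    "phase": [".", "0"],
    "attributes": ["ID=gene1", "ID=cds1;Parent=gene1"],
})

# Select the entries of interest before exporting.
cds_only = annotation[annotation["type"] == "CDS"]

# Without a path argument, to_csv returns the output as a string.
csv_text = cds_only.to_csv(index=False)
tsv_text = cds_only.to_csv(sep="\t", index=False)

# GFF3-style output: version header plus tab-separated rows without
# column names.
gff3_text = "##gff-version 3\n" + cds_only.to_csv(
    sep="\t", index=False, header=False
)
print(gff3_text)
```

Passing a file name as the first argument of `to_csv` writes the same text to disk instead of returning it.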

Further options of the gffpandas library are described in the project documentation [@Git-Repository].

The library requires Python 3 or higher and depends on the Python libraries pandas and itertools. """
