title | tags | author | affiliations | date | bibliography
---|---|---|---|---|---
gffpandas: A Python library to process gff3 files as pandas data frames | | | | XX June 2019 | paper.bib

The GFF3 format (general feature format version 3) is a widely used plain-text file format for the representation of genome annotations. gffpandas is a Python library that aims to make the processing and integration of such GFF3 files easy and efficient. For this purpose it builds upon the popular pandas library [@mckinney-proc-scipy-2010], inherits its design principles and extends the DataFrame class. This has the advantage that the general (and widely known) methods of pandas DataFrames as well as specific methods for the manipulation and filtering of genome annotations can be used to process annotation data. Filter and processing methods can easily be combined to process annotations in a few lines of code.
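The idea of combining filters on a data frame can be sketched with plain pandas. This is a minimal illustration and not the gffpandas API: the column names and the helpers `read_annotation`, `filter_by_type` and `filter_by_length` are hypothetical names chosen for this example, and the annotation lines are made up.

```python
import io

import pandas as pd

# The nine GFF3 columns in specification order; the names are this sketch's own choice.
GFF3_COLUMNS = ["seq_id", "source", "type", "start", "end",
                "score", "strand", "phase", "attributes"]

# Made-up example annotation (one header line, four features).
GFF3_LINES = [
    "##gff-version 3",
    "chr1\texample\tgene\t1\t1000\t.\t+\t.\tID=gene1",
    "chr1\texample\tCDS\t1\t900\t.\t+\t0\tID=cds1;Parent=gene1",
    "chr1\texample\tgene\t2000\t2300\t.\t-\t.\tID=gene2",
    "chr1\texample\ttRNA\t3000\t3075\t.\t+\t.\tID=trna1",
]


def read_annotation(lines):
    # Drop empty lines and header/comment lines, then parse the tab-separated body.
    body = [line for line in lines if line and not line.startswith("#")]
    return pd.read_csv(io.StringIO("\n".join(body)), sep="\t",
                       names=GFF3_COLUMNS, header=None)


def filter_by_type(df, feature_types):
    # Keep only rows whose feature type is in the given list.
    return df[df["type"].isin(feature_types)]


def filter_by_length(df, min_length=None, max_length=None):
    # GFF3 coordinates are 1-based and inclusive, hence the +1.
    length = df["end"] - df["start"] + 1
    mask = pd.Series(True, index=df.index)
    if min_length is not None:
        mask &= (length >= min_length)
    if max_length is not None:
        mask &= (length <= max_length)
    return df[mask]


annotation = read_annotation(GFF3_LINES)

# Each filter returns a data frame, so filters can be chained.
long_genes = (
    annotation
    .pipe(filter_by_type, ["gene"])
    .pipe(filter_by_length, min_length=500)
)
print(long_genes[["type", "start", "end", "attributes"]])
```

Because every step takes and returns a data frame, the general pandas methods (sorting, grouping, merging) remain available before and after each filter.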
The library requires Python >= 3.4 and the pandas library.
gffpandas is available under the ISC license.
""" gffpandas is a Python library that can be used to work with genome annotation data. It simplifies working with gff3 (general feature format version 3) files by filtering for the desired annotation entries. This makes gffpandas an easy-to-use and time-saving library.
A gff3 file contains information about the location and attributes of genomic features, for example a gene or an exon. It always follows the same format: a header with meta-information, followed by nine columns with the feature information [@gff3-The-Sequence-Ontology].
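The layout described above (header lines followed by nine tab-separated columns) can be illustrated with a small standard-library parser. This is only a sketch under simplifying assumptions: `parse_gff3`, `parse_attributes` and the `Feature` tuple are hypothetical names, the example lines are made up, and details of the specification such as URL-escaped characters and embedded FASTA sections are ignored.

```python
from collections import namedtuple

# The nine GFF3 columns in specification order; field names are this sketch's own choice.
Feature = namedtuple(
    "Feature",
    ["seq_id", "source", "type", "start", "end",
     "score", "strand", "phase", "attributes"],
)


def parse_gff3(lines):
    # Split the input into header lines (starting with "#") and feature rows.
    header, features = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("#"):
            header.append(line)
            continue
        fields = line.split("\t")
        if len(fields) != 9:
            raise ValueError("expected 9 columns, got %d" % len(fields))
        # Start and end are 1-based integer coordinates.
        fields[3], fields[4] = int(fields[3]), int(fields[4])
        features.append(Feature(*fields))
    return header, features


def parse_attributes(attributes):
    # "ID=gene1;Name=abc" -> {"ID": "gene1", "Name": "abc"}
    pairs = (pair.split("=", 1) for pair in attributes.split(";") if "=" in pair)
    return dict(pairs)


# Made-up example input.
example = [
    "##gff-version 3",
    "##sequence-region chr1 1 5000",
    "chr1\texample\tgene\t1\t1000\t.\t+\t.\tID=gene1;Name=abc",
    "chr1\texample\texon\t1\t400\t.\t+\t.\tID=exon1;Parent=gene1",
]

header, features = parse_gff3(example)
for feature in features:
    print(feature.type, feature.start, feature.end,
          parse_attributes(feature.attributes))
```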
If only entries with specific characteristics are needed, there is no simple tool to extract them from the whole gff3 file. With the gffpandas library it is possible to return the desired entries of a gff3 file, for example all entries of a specific feature type or of a given feature length.
The gffpandas library is an alternative to gffutils or bcbio-gff, but it is inspired by the Python library pandas. Building on pandas, gffpandas reads a gff3 file into a data frame and uses this structure for all further functions. One big advantage is that several filter functions can be combined to select the required annotation entries. Furthermore, the annotation data can be saved again as a gff3 file or as a csv or tsv file.
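Writing a data frame back out as gff3, csv or tsv text can be sketched with the pandas `DataFrame.to_csv` method. The helpers `to_gff3_text` and `to_delimited_text`, the column names and the example rows are assumptions made for this illustration and do not show how gffpandas implements its own export.

```python
import pandas as pd

# The nine GFF3 columns in specification order; the names are this sketch's own choice.
GFF3_COLUMNS = ["seq_id", "source", "type", "start", "end",
                "score", "strand", "phase", "attributes"]

# Made-up annotation rows and header.
annotation = pd.DataFrame(
    [
        ["chr1", "example", "gene", 1, 1000, ".", "+", ".", "ID=gene1"],
        ["chr1", "example", "exon", 1, 400, ".", "+", ".", "ID=exon1;Parent=gene1"],
    ],
    columns=GFF3_COLUMNS,
)
header = "##gff-version 3\n"


def to_gff3_text(df, header):
    # GFF3 body: tab-separated, without column names and without the index.
    return header + df.to_csv(sep="\t", header=False, index=False)


def to_delimited_text(df, sep):
    # csv/tsv export keeps the column names as the first line.
    return df.to_csv(sep=sep, index=False)


gff3_text = to_gff3_text(annotation, header)
csv_text = to_delimited_text(annotation, ",")
tsv_text = to_delimited_text(annotation, "\t")
print(gff3_text)
```

Passing a file path as the first argument of `to_csv` writes to disk instead of returning a string.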
Further options of the gffpandas library are described in the project documentation [@Git-Repository].
The library should be used with Python 3 or higher and depends on the pandas library; itertools, which it also uses, is part of the Python standard library. """