Skip to content

wisteriaztttt/GO_KEGG_enrichment_analysis_for_rice_genes

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

License: GPL v3

GO/KEGG Enrichment Analysis on Gene Lists from Rice (Oryza Sativa)

Overview

This is a workflow for using clusterProfiler software to perform GO and KEGG enrichment analysis for gene lists from rice. The gene annotations are the key to these analyses. For GO annotaion, two types of sources are used here: 1. self-curated annotations derived from public rice databases, i.e. RAP-DB and OryzaBase, and 2. GO annotatons from a bioconductor package, AnnotationHub. For KEGG annotation, the gene annotation is directly retrieved from the KEGG database when using the clusterProfiler enrichment function.

Introduction

RNA-seq data analysis has been streamlined, and functional enrichment analysis is a critical step to provide biological insights on the results. Enrichment analysis, or over-representative analysis, is to examine whether a gene ontology or a biological pathway is enriched in the target gene list more than is expected by chance. Many tools were developed to contain both annotation files and enrichment test functions to streamline this process. However, some plant species may still lack of gene annotation information, which could be an obstacle for the functional enrichment analysis. For instance, only 20 GO annotation databases were available under OrgDb from Bioconductor, where only one is the plant species, Arabidopsis. In this protocol, I focus on performing functional enrichment analysis on genes of rice, a model organism for the grass family, using one of the most commonly used enrichment analysis R software clusterProfiler (Yu et al., 2012). I provide a step-by-step instruction using annotation information obtained from two different ways. The scripts are mainly the R scripts, with some Bash command lines for curating a GO annotation file.

Installation

Install required R packages through Biocondutor. Skip the packages that have already been installed.

if (!require("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
packages <- c("clusterProfiler", "GO.db", "AnnotationHub", "dplyr", "data.table", "ggplot2")
BiocManager::install(setdiff(packages, rownames(installed.packages())))

Input Data

  1. Target gene list (genes.txt), background gene list (bkgd.txt, optional but recommended). The gene IDs are the RAP IDs in this protocol, e.g. Os01g0102500, Os01g0106300. See the /input folder.
  2. The gene annotation file obtained from The Rice Annotation Project Database (RAP-DB), including both the GO annotation information and RAP gene ID to RAP transcript ID conversion information. This file is a large data table, where each row is an individual transcript ID, and each column is a gene annotation information, and "GO" is the column which contains the GO annotations which are extracted into the self-curated annotation files. Link.
  3. The gene annotation file from the OryzaBase website. This file is also a large data table, where each row is for a "Trait Gene ID", with annotations of "RAP ID" and "Gene Ontology", which are used for generating self-curated annotation files. Link.
  4. RAP ID to Entrez ID conversion table from the He Lab at Fujian Agriculture and Forestry University, China. Link.

Major steps

Step 1: GO enrichment analysis using annotations from self-curated annotation files

  • 1.1 Prepare rice gene GO annotation files curated from public annotation databases.

Run the command in Linux.

sh workflow/1_curate_rice_GO_annotation.sh

After this step, work on a local computer and run the following scripts in R.

Also, create two new folders: data and output. /data: to save intermediate result files. /output: to save the final files. Save the files from Step 1.1 into the /data folder.

  • 1.2 Prepare the GO ID to GO name mapping files.
source("workflow/2_prep_GO_annotation_files.R")

Here are the resulting GO annotation files, TERM2NAME (optional but highly recommended), and TERM2GENE (required).

  • 1.3 Run universal enrichment function, enricher.
source("workflow/3_run_enricher.R")

Step 2: GO enrichment analysis using annotations from AnnotationHub package

source("workflow/4_run_enrichGO.R")

Step 3: KEGG enrichment analysis using annotations from KEGG database

source("workflow/5_run_enrichKEGG.R")

Expected results

With the input gene lists, twenty-nine GO terms were enriched in the GO enrichment analysis using the self-curated annotation files, while none when using annoations from the AnnotationHub package. And eleven KEGG terms were enriched in the KEGG enrichment analysis using the annotation from the KEGG database. The result tables of the enriched GO/KEGG terms are in the /output folder, and the screenshots of the tables are shown below. The top 10 most statisitcally significant enriched GO and KEGG terms are visualized by dot plots, which are saved in the /figures folder and are shown below.

License

This code is free and open source, licensed under GPLv3.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • R 83.7%
  • Shell 16.3%