Abstract

A plethora of R packages exist on CRAN and Bioconductor to perform over-representation analysis (ORA) and gene set enrichment analysis (GSEA). However, consistency in the underlying nomenclature for specific analyses and user friendly implementation is still lacking. prora aims at unifying ID mapping and enrichment analysis in a syntactically coherent and intuitive way, while ensuring reproducibility of results. prora primarily consists of wrapper functions around the sigora and WebGestaltR packages from CRAN and rmarkdown based reports for visualisation and contextualisation of analysis results.

Introduction

Methods to identify differentially regulated pathways among a vast background of quantified genes or proteins differ in complexity of the underlying algorithms and computations, ranging from the application of Fisher’s exact test in ORA and extensions to gene pairs (Foroushani, Brinkman, and Lynn 2013) to more complex statistical procdures in gene set enrichment analysis (Subramanian et al. 2005) and network based procedures (Glaab et al. 2012). Almost all methods come with their own implementation as an R package, i.e. ORA in both sigora and WebGestaltR, network based gene enrichment analysis in enrichnet and the list keeps going on. Naturally, different implementations come with different – and often inconsistent – features, such as ID mapping or demand a lot of knowledge from the user in setting up appropriate data structures and other preconditions. prora aims at providing a consistent and user friendly interface, which will be introduced in the following section. Elaborating on the methodological and statistical aspects of the underlying methods is beyond the scope of this vignette and the reader is referred to the publications in the reference section.

Installation and loading

After installing the packages from Bioconductor as shown below, the package can be loaded using library(prora).

BiocManager::install("prora")
library(prora)

Example workflow

We shall analyse proteomic data with the following format:

data("exampleContrastData")
#glimpse(exampleContrastData)

The first column protein_Id contains FASTA formatted human protein accessions and the second column fold change estimates. e.g. as obtained by a limma::toptable() call. The first step is now to translate the FASTA headers into Uniprot accessions for subsequent ID mapping.

dd <- prora::get_UniprotID_from_fasta_header(exampleContrastData)
#glimpse(dd)

We see that we have lost 0 accessions which did not correspond to a FASTA format. Now we apply ORA, sigORA and GSEA to the obtained ranked protein list. To apply sigORA and ORA one must first compute the background GPS repository using makeGPS_wrappR(). The advantage is that backgrounds can be specified separately for each experiment or combined for several batches of experiments. However, computing the accession pair signatures and calculating their respective weights is computationally intensive and potentially time consuming.

data("idmap", package = "sigora")
myGPSrepo <- makeGPS_wrappR(ids = dd$UniprotID, target = "GO")


res <- sigoraWrappR(data =  dd,
                    threshold = 0.2,
                    score_col = "estimate",
                    GPSrepos = myGPSrepo$gps,
                    greater_than = TRUE)

The result of sigoraWrappR() is a list containing the sigORA and ORA results along with user specified inputs that will be used in the reports described in section @ref(reports).

names(res)

## [1] "sigora"                 "ora"                    "threshold"             
## [4] "GPSrepository"          "database"               "data"                  
## [7] "proteinsAfterFiltering" "direction"

Under the hood a lot of ID mapping has taken place of which the user is unaware. To check the ID mapping efficiency when mapping Uniprot IDs to GO, KEGG or ENTREZ accessions the package comes with a checkIDmappingEfficiency() function to check the number of lost IDs.

checkIDmappingEfficiency(dd$UniprotID, keytype = "UNIPROT") %>% 
  round(2)

##                InputIDs ENTREZIDs KEGGIDs   GOIDs
## Number             3701   3680.00 1534.00 3660.00
## Percent Mapped      100     99.43   41.45   98.89

Evidently, the mapping from Uniprot to KEGG IDs seems to be fairly inefficient and results should be interpreted with care.

Reports

Good reporting and reproducibility of all intermediate and final results is key to good scientific practice. In this spirit, prora provides a reproducible, rmarkdown based reports for each method. To highlight the most important targets, settings and parameters, a walkthrough is given in this part of the vignette.

To run all methods and produce the reports on the example data from the prior section, the following parameters must be set in advance.

organism <- "hsapiens" # Organism to which the FASTA headers correspond
ID_col <- "UniprotID" # Column containing the IDs
fc_col <- "estimate" # Column containing the estimates (in this case log fold changes)
target_GSEA <- c( # all possible targets for GSEA
  "geneontology_Biological_Process",
  "geneontology_Cellular_Component",
  "geneontology_Molecular_Function"
)
target_SIGORA <- # all possible targets for sigORA
  c("GO", "KEGG", "reactome")
fc_threshold <- 0.5 # threshold for ORA methods
greater <- TRUE # direction of the threshold
nperm <- 10 # number of permutations used to compute enrichment scores in GSEA
odir <- "tmp" # output directory to write all reports

To iterate over all possible target directly, the following workflow functions can be used. They take the above specified parameters as inputs and facilitate the enrichment analysis workflows tremendously. The results will be saved to the specified output directory with a subfolder for each method and target.

The following code chunk iterates the runGSEA() function over all possible GSEA target databases. Internally, runGSEA() calls WebGestaltR() for the analysis and compiles prora's in the same output directory.

sapply(target_GSEA, function(x) {
  runWebGestaltGSEA(
    data = dd,
    fpath = "",
    ID_col = ID_col,
    score_col = fc_col,
    organism = organism,
    target = x,
    nperm = nperm,
    outdir = file.path(odir, "WebGestaltGSEA")
  )
})

Analogously, runWebGestaltORA() and runSIGORA() embody a whole workflow and save nicely formatted results to the output directory.

sapply(target_GSEA, function(x) {
  runWebGestaltORA(
    data = dd,
    fpath = "",
    organism = organism,
    ID_col = ID_col,
    target = x,
    threshold = fc_threshold,
    greater = greater,
    nperm = nperm,
    score_col = fc_col,
    outdir = file.path(odir, "WebGestaltORA")
  )
})

runSIGORA() calls the sigoraWrappR() function which was introduced in section @ref(Example workflow) and compiles the built-in report.

sapply(target_SIGORA, function(x) {
  runSIGORA(
    data = dd,
    target = x,
    score_col = fc_col,
    threshold = fc_threshold,
    greater = greater,
    outdir = file.path(odir, "sigORA")
  )
})

In the following the runGSEA() function’s inner workings will be discussed in more detail. After creating the output directory outdir the provided data is fed into WebGestaltR’s WebGestaltR(..., enrichMethod = "GSEA"). The results are then used to generate the report provided by the prora package¹. All WebGestaltR and prora results can then be found in the output directory.

Command line tools

The enrichment methods in this package (ORA, GSEA sigORA) come with a docopt based command line tool to facilitate analysing batches of files². The main advantage of docopt is that all tools have to come with a thorough documentation where it is also possible to provide default values. For running GSEA, the command line tool is:

"WebGestaltR GSEA for multigroup reports

Usage:
  lfq_multigroup_gsea.R <grp2file> [--organism=<organism>] [--outdir=<outdir>] [--idtype=<idtype>] [--ID_col=<ID_col>]  [--nperm=<nperm>] [--score_col=<score_col>] [--contrast=<contrast>]

Options:
  -o --organism=<organism> organism [default: hsapiens]
  -r --outdir=<outdir> output directory [default: results_gsea]
  -t --idtype=<idtype> type of id used for mapping [default: uniprotswissprot]
  -i --ID_col=<ID_col> Column containing the UniprotIDs [default: UniprotID]
  -n --nperm=<nperm> number of permutations to calculate enrichment scores [default: 500]
  -e --score_col=<score_col> column containing fold changes [default: pseudo_estimate]
  -c --contrast=<contrast> column containing fold changes [default: contrast]

Arguments:
  grp2file  input file
" -> doc

library(docopt)

opt <- docopt(doc)

It can be run using the following command in a UNIX/Linux terminal, specifying the organism as mouse:

Rscript path/to/lfq_multigroup_gsea.R path/to/group2file.xlsx -o mmusculus

The help file can also be accessed via the command line using:

Rscript path/to/lfq_multigroup_gsea.R -h

Session info

## R version 4.1.1 (2021-08-10)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19044)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] GO.db_3.13.0         org.Hs.eg.db_3.13.0  AnnotationDbi_1.54.1
##  [4] IRanges_2.26.0       S4Vectors_0.30.2     Biobase_2.52.0      
##  [7] BiocGenerics_0.38.0  prora_0.1.0          WebGestaltR_0.4.4   
## [10] forcats_0.5.1        stringr_1.4.0        dplyr_1.0.7         
## [13] purrr_0.3.4          readr_2.0.1          tidyr_1.1.3         
## [16] tibble_3.1.4         ggplot2_3.3.5        tidyverse_1.3.1     
## 
## loaded via a namespace (and not attached):
##  [1] colorspace_2.0-2       ellipsis_0.3.2         rprojroot_2.0.2       
##  [4] XVector_0.32.0         fs_1.5.0               rstudioapi_0.13       
##  [7] bit64_4.0.5            fansi_0.5.0            lubridate_1.8.0       
## [10] xml2_1.3.2             codetools_0.2-18       doParallel_1.0.16     
## [13] cachem_1.0.6           knitr_1.36             jsonlite_1.7.2        
## [16] apcluster_1.4.8        broom_0.7.10           dbplyr_2.1.1          
## [19] png_0.1-7              compiler_4.1.1         httr_1.4.2            
## [22] backports_1.2.1        assertthat_0.2.1       Matrix_1.3-4          
## [25] fastmap_1.1.0          cli_3.1.0              htmltools_0.5.2       
## [28] tools_4.1.1            igraph_1.2.7           gtable_0.3.0          
## [31] glue_1.4.2             GenomeInfoDbData_1.2.6 doRNG_1.8.2           
## [34] Rcpp_1.0.7             slam_0.1-48            cellranger_1.1.0      
## [37] jquerylib_0.1.4        pkgdown_1.6.1          vctrs_0.3.8           
## [40] Biostrings_2.60.2      svglite_2.0.0          iterators_1.0.13      
## [43] sigora_3.99.5          xfun_0.26              rvest_1.0.2           
## [46] lifecycle_1.0.1        rngtools_1.5.2         zlibbioc_1.38.0       
## [49] scales_1.1.1           reactome.db_1.76.0     ragg_1.2.0            
## [52] hms_1.1.1              yaml_2.2.1             memoise_2.0.0         
## [55] gridExtra_2.3          UpSetR_1.4.0           sass_0.4.0            
## [58] stringi_1.7.4          RSQLite_2.2.8          desc_1.4.0            
## [61] foreach_1.5.1          GenomeInfoDb_1.28.4    rlang_0.4.11          
## [64] pkgconfig_2.0.3        systemfonts_1.0.3      bitops_1.0-7          
## [67] evaluate_0.14          lattice_0.20-44        bit_4.0.4             
## [70] tidyselect_1.1.1       plyr_1.8.6             magrittr_2.0.1        
## [73] R6_2.5.1               generics_0.1.1         DBI_1.1.1             
## [76] pillar_1.6.4           haven_2.4.3            whisker_0.4           
## [79] withr_2.4.3            KEGGREST_1.32.0        RCurl_1.98-1.5        
## [82] modelr_0.1.8           crayon_1.4.2           utf8_1.2.2            
## [85] tzdb_0.1.2             rmarkdown_2.11         grid_4.1.1            
## [88] readxl_1.3.1           blob_1.2.2             reprex_2.0.1          
## [91] digest_0.6.28          textshaping_0.3.6      munsell_0.5.0         
## [94] bslib_0.3.1

References

Foroushani, Amir B. K., Fiona S. L. Brinkman, and David J. Lynn. 2013. “Pathway-Gps and Sigora: Identifying Relevant Pathways Based on the over-Representation of Their Gene-Pair Signatures.” PeerJ 1 (December): e229. https://doi.org/10.7717/peerj.229.

Glaab, Enrico, Anais Baudot, Natalio Krasnogor, Reinhard Schneider, and Alfonso Valencia. 2012. “EnrichNet: Network-Based Gene Set Enrichment Analysis.” Bioinformatics 28 (18): i451–i457.

Subramanian, Aravind, Pablo Tamayo, Vamsi K. Mootha, Sayan Mukherjee, Benjamin L. Ebert, Michael A. Gillette, Amanda Paulovich, et al. 2005. “Gene Set Enrichment Analysis: A Knowledge-Based Approach for Interpreting Genome-Wide Expression Profiles.” Proceedings of the National Academy of Sciences 102 (43): 15545–50. https://doi.org/10.1073/pnas.0506580102.

The source files for all reports can be found in system.file("rmarkdown_reports", package = "fgczgseaora")↩
Those scripts can be found in system.file("run_scripts", package = "prora")↩

prora: Unifying methods on gene set enrichment

Lucas Heinrich Kook

Witold Wolski

08 December 2021