The prozor package creates standardized fasta protein databases. It can be used to:

 * merge several fasta files into a single fasta file
 * add reverse sequences to the fasta file
 * add contaminants to the fasta file

Contaminants

This package provides two sets of typical contaminant proteins and peptides, one with and one without contaminants of human origin, which can be accessed by the functions loadContaminantsFasta and loadContaminantsNoHumanFasta. At the FGCZ we always add one of these two contaminant files to the database. To databases that already contain human proteins, we add the ContaminantsNoHumanFasta set. The contaminants are easy to distinguish from other entries because their identifiers start with zz| followed by an FGCZCont accession, for example zz|Y-FGCZCont00001|.

## [1] "zz|fgczContaminants2021|" "zz|Y-FGCZCont00001|"     
## [3] "zz|Y-FGCZCont00002|"      "zz|Y-FGCZCont00003|"     
## [5] "zz|Y-FGCZCont00004|"      "zz|Y-FGCZCont00005|"
## [1] 499
## [1] 350
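
Because the contaminant identifiers share a common prefix, they can be separated from the remaining entries with a simple pattern match. The following is a minimal base R sketch, not part of prozor: the identifiers in `ids` are invented for illustration and the helper `isContaminant` is hypothetical.

```r
# Hypothetical identifiers, mimicking the entry names of a merged database.
ids <- c("zz|Y-FGCZCont00001|", "sp|P12345|EXAMPLE_MOUSE",
         "zz|Y-FGCZCont00002|", "REV_zz|Y-FGCZCont00001|")

# Hypothetical helper: TRUE for entries that carry the contaminant prefix,
# optionally preceded by a decoy label such as "REV_".
isContaminant <- function(x, revLab = "REV_") {
  grepl(paste0("^(", revLab, ")?zz\\|"), x)
}

contaminant <- isContaminant(ids)
ids[contaminant]   # contaminant entries (forward and decoy)
ids[!contaminant]  # all other entries
```

The optional `revLab` group is there so that reversed contaminant entries are recognized as contaminants too.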

Creating a fasta protein amino acid sequence database for searching

To merge several fasta databases into a single file, place them into a single folder and give the folder the name of the database. At the FGCZ the database name starts with the project number (e.g. p1000), followed by a consecutive number (e.g. db1) and a descriptive name (e.g. example), which gives p1000_db1_example.

Also add an annotation.txt file to the folder. The annotation file should contain a single line formatted like a fasta entry header with the following content: AA|p<project_number>_<database_name>|<YYYYMMDD> <detailed_description>.

Example: AA|p1000_db1_example|20180119_Example https://github.com/protViz/prozor
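
The annotation line can also be assembled and sanity-checked programmatically. The base R sketch below rebuilds the example line from its parts; the variable names and the regular expression are assumptions made for illustration, and the check is deliberately loose (it only verifies the leading AA|p<digits>_<name>|<8-digit date> part).

```r
# Assumed components of the annotation header (illustrative values only).
project     <- "p1000"
database    <- "db1_example"
dbdate      <- "20180119"
tag         <- "Example"
description <- "https://github.com/protViz/prozor"

annotationLine <- paste0("AA|", project, "_", database, "|",
                         dbdate, "_", tag, " ", description)

# Loose sanity check: AA|p<digits>_<name>|<8-digit date>, then free text.
pattern <- "^AA\\|p[0-9]+_[^|]+\\|[0-9]{8}"
stopifnot(grepl(pattern, annotationLine))

# Write the single line to a temporary annotation.txt and read it back.
annotationFile <- file.path(tempdir(), "annotation.txt")
writeLines(annotationLine, annotationFile)
readLines(annotationFile)
```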

The package provides an example of such a folder with the fasta files. Based on this folder a database can be created.

databasedirectory <- system.file("p1000_db1_example", package = "prozor")
#databasedirectory <- file.path(find.package("prozor"), "p1000_db1_example")
dbname <- basename(databasedirectory)
fasta <- grep("fasta", dir(databasedirectory),value = TRUE)
files1 <- file.path(databasedirectory,fasta)
annot <- grep("annotation",dir(databasedirectory), value = TRUE)
annotation <- readLines(file.path(databasedirectory,annot))
annotation
## [1] "AA|p2569_db1_mouse_phosPXS|20171029_UNIPROT http://www.uniprot.org/proteomes/UP000000589"

Create non-decoy database

resDB <- createDecoyDB(files1, useContaminants = loadContaminantsFasta2021(),
                       annot = annotation, revLab = NULL)
## reading db :C:/Users/wolski/R/win-library/4.1/prozor/p1000_db1_example/Annotation_allSeq.fasta.gz
## reading db :C:/Users/wolski/R/win-library/4.1/prozor/p1000_db1_example/Annotation_canSeq.fasta.gz
length(resDB)
## [1] 1954

Based on the directory name, we build the name of the fasta file by appending the current date.

dirname(databasedirectory)
## [1] "C:/Users/wolski/R/win-library/4.1/prozor"
xx <- file.path(dirname(databasedirectory),
                paste0(dbname, "_", format(Sys.time(), "%Y%m%d"), ".fasta"))
print(xx)
## [1] "C:/Users/wolski/R/win-library/4.1/prozor/p1000_db1_example_20211207.fasta"
writeFasta(resDB, file=xx)

Create decoy database

To add decoy entries based on reversed sequences, specify the revLab parameter of the createDecoyDB function. The resulting database contains about twice as many entries as the non-decoy database (here 3907 versus 1954).
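
The idea behind reversed decoys can be sketched in base R as follows. This is only an illustration under stated assumptions: the helper `reverseSequence` is hypothetical, the toy entries are invented, and the sketch does not claim to reproduce how createDecoyDB is implemented internally.

```r
# Hypothetical helper: reverse an amino acid sequence given as one string.
reverseSequence <- function(sequence) {
  paste(rev(strsplit(sequence, "")[[1]]), collapse = "")
}

# Toy entries; names and sequences are invented for illustration.
forward <- c("sp|P00001|TOY1" = "MKTAYIAK", "sp|P00002|TOY2" = "MSDEQ")

# Build decoys: reversed sequence, identifier prefixed with the label.
decoy <- vapply(forward, reverseSequence, character(1))
names(decoy) <- paste0("REV_", names(forward))

combined <- c(forward, decoy)
length(combined)  # twice the number of forward entries
```
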

resDBDecoy <- createDecoyDB(files1,
                            useContaminants = loadContaminantsFasta2021(),
                            annot = annotation,
                            revLab = "REV_")
## reading db :C:/Users/wolski/R/win-library/4.1/prozor/p1000_db1_example/Annotation_allSeq.fasta.gz
## reading db :C:/Users/wolski/R/win-library/4.1/prozor/p1000_db1_example/Annotation_canSeq.fasta.gz
resDBDecoy[[length(resDBDecoy) - 1]]
## [1] "ADAFGLESLKQHAEAYDAFFADEDAAYKDVLPRFVPDSLLAKDSPLQLLGEKEGSLLETFYSNDFILPNSTWPGEFGSREKHAAGITHGGSLAVIDQDTLGMAKGFVDRLHDSGKTADPLRGEPPPEPKDERGPHFPVEPGGTVEVAVVGALQYFDAYSLIPFEAKLPELLRVAIDLGNNASHALEAPHKITGFPGGTKTGKDFTGASHWALRLMLPACRKEAIIGRLKKKAKEVAKQYDASVTPYSKGM"
## attr(,"name")
## [1] "REV_zz|Y-FGCZCont00497|"
## attr(,"Annot")
## [1] ">REV_zz|Y-FGCZCont00497|  tr|A0A445I1L0|A0A445I1L0_GLYSO L-ascorbate peroxidase OS=Glycine soja OX=3848 GN=D0Y65_029960 PE=3 SV=1"
## attr(,"class")
## [1] "SeqFastaAA"
length(resDBDecoy)
## [1] 3907
sum(duplicated(names(resDBDecoy)))
## [1] 764
sum(duplicated(resDBDecoy))
## [1] 764
dbname_decoy <- unlist(strsplit(dbname,"_"))
dbname_decoy <- paste(c(dbname_decoy[1],"d",dbname_decoy[2:length(dbname_decoy)]),collapse = "_")
dbname_decoy
## [1] "p1000_d_db1_example"
xx <- file.path(dirname(databasedirectory),
                paste0(dbname_decoy, "_", format(Sys.time(), "%Y%m%d"), ".fasta"))
print(xx)
## [1] "C:/Users/wolski/R/win-library/4.1/prozor/p1000_d_db1_example_20211207.fasta"
writeFasta(resDBDecoy, file = xx)

Session Info

## R version 4.1.1 (2021-08-10)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19044)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] prozor_0.3.1
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.7            pillar_1.6.4          bslib_0.3.1          
##  [4] compiler_4.1.1        jquerylib_0.1.4       tools_4.1.1          
##  [7] digest_0.6.28         docopt_0.7.1          tibble_3.1.4         
## [10] lifecycle_1.0.1       jsonlite_1.7.2        evaluate_0.14        
## [13] memoise_2.0.0         AhoCorasickTrie_0.1.2 lattice_0.20-44      
## [16] pkgconfig_2.0.3       rlang_0.4.11          Matrix_1.3-4         
## [19] yaml_2.2.1            pkgdown_1.6.1         xfun_0.26            
## [22] fastmap_1.1.0         stringr_1.4.0         knitr_1.36           
## [25] desc_1.4.0            fs_1.5.0              sass_0.4.0           
## [28] vctrs_0.3.8           systemfonts_1.0.3     hms_1.1.1            
## [31] rprojroot_2.0.2       ade4_1.7-18           grid_4.1.1           
## [34] R6_2.5.1              textshaping_0.3.6     fansi_0.5.0          
## [37] rmarkdown_2.11        tzdb_0.1.2            purrr_0.3.4          
## [40] readr_2.0.1           seqinr_4.2-8          magrittr_2.0.1       
## [43] htmltools_0.5.2       ellipsis_0.3.2        MASS_7.3-54          
## [46] ragg_1.2.0            utf8_1.2.2            stringi_1.7.4        
## [49] cachem_1.0.6          crayon_1.4.2