The Food Allergy Research and Resource Program (FARRP) AllergenOnline.org database was updated to version 15 on January 12, 2015. Version 15 contains a comprehensive list of 1897 protein (amino acid) sequence entries that are categorized into 744 taxonomic-protein groups of unique proven or putative allergens (food, airway, venom/salivary and contact). Some of the allergenic wheat gliadins or glutenins may also cause celiac disease; however, they are listed on the allergen site if there is evidence of IgE binding.
The annual update process includes collecting new sequences designated as “allerg*” in reference files from the NCBI protein database (compiled from the GenBank, RefSeq and TPA databases as well as protein sequences from the SwissProt, PIR, PRF and PDB databases). In a few instances, sequences are taken directly from a peer-reviewed publication because they have not been entered into the NCBI or other available databases. Duplicate and inappropriate sequences are removed by a process described below. The sequences are categorized by taxonomic group (genus/species) and protein sequence identity (close homology). The new draft dataset is compared with sequences contained in the previous version of AllergenOnline.org and integrated into existing groups if appropriate, or classified into new groups. Peer-reviewed publications are identified from PubMed and other resources, then collected and reviewed for evidence of allergenicity of the source organism and the specific protein. Additional information is gathered from the Allergen Nomenclature Committee website of the IUIS (International Union of Immunological Societies), Allergome and occasionally Wikipedia. This information is reviewed for each group of sequences as described below to classify the entries as likely “allergenic” (either proven, with absolute proof including challenge testing, or putative, with specific IgE binding using sera from individuals with allergies to the source organism), or as having “insufficient proof” of allergy due to a lack of convincing evidence of allergenicity. During the review process, an attempt is made to identify new publications demonstrating proof of allergy for groups of potential allergens that were designated as having insufficient proof of allergenicity in previous versions. A consensus decision by the whole peer review panel is normally reached for each group regarding the designation as an “allergen” or as having “insufficient proof”. However, in a few instances a majority decision is taken.
The criteria used to reach a decision to include or exclude each sequence or allergen group are described below.
REMOVAL OF "FALSE" ENTRIES
A keyword search of the NCBI protein sequence database restricted to “allergen” identified 58,901 peptide sequences (on 7 June, 2014). A few true allergens might be missed because those identifying the sequence may not have associated the protein with allergy. Searches with less restrictive terms, e.g. “allerg*”, retrieved 2,860,882 sequences (on 7 June, 2014). Some of those are retrieved because the investigator was associated with an institute of allergy, or because the sequence was merely designated as a hypothetical translation product or "homologue" of an allergen. In some cases the protein is associated with the allergic immune response, not with causing allergy. The lists are becoming longer because sequences are now often auto-annotated by computer programs simply on the basis of some degree of similarity to a protein whose functional property has, in a number of cases, been listed as allergenicity or allergen. Compilation of a list of sequences for review by the entire expert panel includes screening to remove sequences that are included only because they are "similar to", or homologous with, an allergen. Many peptides are from an organism that is associated with allergy (e.g. Aspergillus sp., Alternaria sp.). To reduce the list to a manageable size without excluding likely true allergens, we exclude sequences from genome model organisms (e.g. Drosophila melanogaster, Arabidopsis thaliana, Caenorhabditis elegans, etc.). However, proteins from allergenic species that are also genetic models (rice, mice and corn) and that have information suggesting allergenicity are included in the initial list, but are excluded if no published references related to allergenicity are found. Proteins that are obviously merely associated with an allergic response (e.g. cytokines, chemokines, immunoglobulins and transcription factors) are also excluded. Sequences are then screened and grouped based on sequence identity and taxonomic identity to those already in the AllergenOnline.org database from previous versions (e.g. allergens, putative allergens or sequences with insufficient evidence to demonstrate allergy, see below). Relevant publications are collected for the review panel, using references from the NCBI sequence entry as well as separate searches of the PubMed database, based on keyword searches of the taxa, common name and sequence authors. The information for each allergen group is triaged to gather more specific information and reviewed by the expert panel in a three-stage process as described below.
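The exclusion rules above can be illustrated with a minimal Python sketch. It is not the curators' actual procedure, which relies on manual review. The record layout (`Candidate`), the keyword lists and the function name are hypothetical, and the species names for rice, mice and corn are filled in only as examples of the allergenic genetic models named in the text.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical, simplified view of one retrieved sequence record."""
    accession: str
    organism: str
    description: str
    has_allergy_reference: bool = False


# Illustrative lists only, built from the examples named in the text.
MODEL_ORGANISMS = {
    "Drosophila melanogaster",
    "Arabidopsis thaliana",
    "Caenorhabditis elegans",
}
ALLERGENIC_MODEL_SPECIES = {"Oryza sativa", "Mus musculus", "Zea mays"}
IMMUNE_RESPONSE_TERMS = ("cytokine", "chemokine", "immunoglobulin",
                         "transcription factor")
SIMILARITY_ONLY_TERMS = ("similar to", "homolog")


def keep_for_review(c: Candidate) -> bool:
    """Return True if the record should go forward to the expert panel."""
    text = c.description.lower()
    # Genome model organisms are dropped outright.
    if c.organism in MODEL_ORGANISMS:
        return False
    # Allergenic genetic models are kept only with a published reference.
    if c.organism in ALLERGENIC_MODEL_SPECIES and not c.has_allergy_reference:
        return False
    # Proteins merely associated with the allergic response are dropped.
    if any(term in text for term in IMMUNE_RESPONSE_TERMS):
        return False
    # Records annotated only by similarity to an allergen are dropped.
    if any(term in text for term in SIMILARITY_ONLY_TERMS):
        return False
    return True
```

Keyword matching of this kind is crude and would only pre-sort records; every retained record would still need the literature review described in this section.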
PEER REVIEW PROCESS FOR CATEGORIZING SEQUENCE GROUPS: PROOF OF ALLERGENICITY
Goal: To update and curate the list of sequences included in the AllergenOnline.org (FARRP) database on an annual basis to include only protein sequences that are supported by evidence demonstrating that the protein is a proven allergen or that there is substantial proof of allergy to the source of the protein as well as immunoglobulin E (IgE) binding to the specific protein using sera from individuals with allergies to the source. Nearly identical sequences in the same taxonomic / protein group are included if it is clear there are variants of the protein that might contribute to allergy.
Rationale: The AllergenOnline database is intended for use as a tool for evaluating the safety of proteins included in foods through processing or genetic modification. The Codex Alimentarius Guidelines (2003) established a process for evaluating potential allergenicity based on evidence that the protein is likely to cause allergic reactions in consumers. A key component in the evaluation process is comparison of candidate products (proteins) with those of known allergens using a bioinformatics approach such as FASTA or BLASTP local alignment tools to identify proteins that would require further testing by serum IgE binding and/or clinical testing to evaluate safety. It is therefore important to have scientific evidence that the database entries are allergens or probable (putative) allergens in order to maximize the reliability of bioinformatics searches.
Peer Review Process: In 2005, FARRP brought together a panel of seven food allergy experts to define criteria for inclusion in future versions of the database. A protocol was developed for including sequences for consideration, for classifying sequences into groups (allergen, putative allergen, insufficient evidence to classify as an allergen or putative allergen), collecting publications for review, providing information to reviewers and finally for voting to accept sequences as allergens or putative allergens. In general, we have included sequences in the taxonomic allergen group that are at least 67% identical to the protein that is the subject of a peer-reviewed published study supporting IgE binding to the protein, using sera from clinically defined subjects allergic to the source. The identity limit was initially suggested by the IUIS Allergen Nomenclature Subcommittee as a limit for defining isoallergen groups. Information regarding the individual proteins should demonstrate that the protein is actually expressed in the source material that causes allergic reactions.
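The 67% identity limit for group membership can be sketched as follows. This is a minimal illustration that assumes the two sequences have already been aligned (gaps written as "-") by some external tool. The function names are hypothetical, and counting identity over all alignment columns, including gap columns, is an assumption; alignment programs differ in how they define the denominator.

```python
GROUP_IDENTITY_THRESHOLD = 67.0  # percent, the limit named in the text


def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of alignment columns in which both residues are identical."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    if not aligned_a:
        return 0.0
    matches = sum(
        1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-"
    )
    return 100.0 * matches / len(aligned_a)


def belongs_to_group(aligned_candidate: str, aligned_reference: str,
                     same_taxon: bool) -> bool:
    """True if the candidate is from the same taxon and meets the limit."""
    identity = percent_identity(aligned_candidate, aligned_reference)
    return same_taxon and identity >= GROUP_IDENTITY_THRESHOLD
```

For example, a candidate matching 7 of 10 aligned positions (70%) would join the group, while one matching 6 of 10 (60%) would not.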
Criteria for three classes of assignment were agreed to. An Allergen is a protein that has been demonstrated to specifically bind IgE using sera from individuals with clear allergies to the source of the gene/protein and, further, that causes basophil activation or histamine release, skin test reactivity or challenge test reactivity using subjects allergic to the source. A Putative allergen is a protein that has met most of the criteria of an allergen, but has a missing component, usually biological activity (basophil activation or in vivo reactivity), a less well defined clinical population or a lack of data demonstrating that the specific protein was used in reported testing. Both Allergens and Putative Allergens are retained in the list of sequence searchable protein entries in AllergenOnline.org. The third category, those with Insufficient Evidence of Allergenicity, are not included in the sequence searchable protein list because they were judged to be lacking critical evidence of specific IgE binding, the serum donors were not demonstrated to be allergic to the source and there was no allergic biological activity demonstrated for the protein. The proteins categorized as "Insufficient evidence" are maintained in a list for future annual reviews as new candidate "allergens" are identified from NCBI and the published literature. If new evidence supports reclassification in the opinion of the reviewers, they would be included in future versions of the database. In rare instances after 2007, individual sequence entries that were previously included in the searchable allergen list have been removed after more detailed reviews failed to identify published evidence that the protein is expressed in allergenic material, or found that the original review had misinterpreted the data in the available publications.
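As a rough illustration of how the three classes relate to the evidence, the sketch below reduces the criteria to three yes/no flags. This is a deliberate simplification and an assumption: the actual assignment is a judgment made by the review panel from the published literature, and the names `Evidence`, `classify` and `is_searchable` are hypothetical.

```python
from enum import Enum


class Evidence(Enum):
    ALLERGEN = "allergen"
    PUTATIVE = "putative allergen"
    INSUFFICIENT = "insufficient evidence"


def classify(ige_binding: bool,
             donors_allergic_to_source: bool,
             biological_activity: bool) -> Evidence:
    """Map three simplified evidence flags to one of the three classes."""
    # Specific IgE binding is treated here as the indispensable criterion.
    if not ige_binding:
        return Evidence.INSUFFICIENT
    # All criteria met: IgE binding, defined donors, biological activity.
    if donors_allergic_to_source and biological_activity:
        return Evidence.ALLERGEN
    # One component missing (assumed reading of "most of the criteria").
    if donors_allergic_to_source or biological_activity:
        return Evidence.PUTATIVE
    return Evidence.INSUFFICIENT


def is_searchable(evidence: Evidence) -> bool:
    """Allergens and putative allergens stay in the searchable list."""
    return evidence in (Evidence.ALLERGEN, Evidence.PUTATIVE)
```

A protein with IgE binding shown using sera from well-defined allergic donors, but without basophil, skin test or challenge data, would come out as a putative allergen and remain searchable.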
The amount and quality of published objective data supporting the classification of various proteins as allergens varies remarkably. For many food, airway or contact allergens there is unquestionable objective data of the identity, characterization and purity of the protein and clear evidence that human subjects with relevant allergic histories and symptoms were tested to demonstrate reactions upon challenge, or at least clear evidence of specific IgE binding. However, there are also a number of proteins labeled as allergens in the literature or in the NCBI sequence database (or in UniProt) for which there is not sufficient objective data characterizing the protein used in testing, or data to demonstrate human reactivity or specific IgE binding. Our peer review process is designed to review the collective literature for individual proteins and classify the individual allergen groups based on our stated criteria.
The review process includes triage and an initial evaluation summary by Dr. Goodman at FARRP. Often additional references are identified and added for further review. Then each sequence group is assigned to two other reviewers from the expert panel. The detailed review comments from all three reviewers are compiled and presented to the entire group of seven experts for a final round of reviews. Comments and votes are recorded in the database files as an archive file. Later changes in status and reasons for changes are also included in the archive. A list of relevant references that were included in the review process is included in the public view of each version of the database.
Before release of a new version to the public, the sequences, GI numbers, taxonomy of the source and reference lists are compiled and checked. The public website shows relevant information for each sequence.
PEER REVIEW PANEL
Baumert, Joe, PhD, FARRP, University of Nebraska, USA
Bohle, Barbara, PhD, Division of Immunopathology, Medical University of Vienna, Austria
Ebisawa, Motohiro, MD, Pediatric Allergy, National Sagamihara Hospital, Japan
Ferreira, Fatima, PhD, University of Salzburg, Austria
Goodman, Rick, PhD, FARRP, University of Nebraska, USA
Sampson, Hugh, MD, Pediatric Allergy, Mount Sinai Medical Center, New York, USA
Taylor, Steve, PhD, FARRP, University of Nebraska, USA
van Ree, Ronald, PhD, University of Amsterdam, The Netherlands
Vieths, Stefan, PhD, Paul-Ehrlich-Institut, Germany (2005-2012)
Hefle, Sue, PhD, FARRP, University of Nebraska, USA (2005-2006)
ALLERGEN DATABASE SEARCH ROUTINES
This website includes a sequence comparison routine, FASTA (Pearson and Lipman, 1988), which may be used to compare a protein sequence (the query sequence) to entries in the allergen database. This version of the FASTA search interface utilizes the FASTA3 (Pearson, 2000) algorithm. The purpose of the comparison routine is to evaluate whether the query protein sequence is identical to, or homologous with, known or putative allergens in the database. Alignments with high identity scores may indicate a potential for allergenic cross-reactions. However, there is not sufficient scientific data to establish a simple scoring boundary (E-score or percent identity) beyond which cross-reactivity is certain, or below which cross-reactivity is not possible. Based on historical data, cross-reactivity is not likely for proteins with less than 50% identity over the entire protein sequence, and is fairly common above 70% identity (Aalberse, 2000). Through experience we find that sequences of two proteins having published evidence of cross-reactivity will align in AllergenOnline.org with a relatively high percent identity (>50% over nearly the full length) and have an E score (statistical expectation score) smaller than 1e-7 (0.0000001). Thus, if a query protein matches a sequence in AllergenOnline.org with higher identity and a smaller E score than these values, the protein should be considered likely to be cross-reactive in the absence of extensive testing (IgE binding and possibly clinical challenges). Proteins sharing lower identity matches by FASTA alignment and having higher E scores are not likely to share IgE binding. Experimental studies would be needed to confirm that proteins sharing identities lower than 50% and having E scores larger than 1e-4 share IgE binding and clinical reactivity. Evaluation of literature regarding the matched allergen would help to identify appropriately allergic study subjects.
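The rules of thumb in the paragraph above can be written down as a small Python helper. This is only a sketch for reading full-length FASTA results: the function name and the three labels are hypothetical, and, as the text stresses, the cut-offs are experience-based guides and not validated decision boundaries.

```python
def interpret_full_length(identity_pct: float, evalue: float) -> str:
    """Label a full-length alignment using the guide values in the text."""
    # Identity above 50% with an E score below 1e-7: treat as likely
    # cross-reactive unless extensive testing shows otherwise.
    if identity_pct > 50.0 and evalue < 1e-7:
        return "likely cross-reactive"
    # Identity below 50% with an E score above 1e-4: shared IgE binding
    # is not likely and would need experimental confirmation.
    if identity_pct < 50.0 and evalue > 1e-4:
        return "cross-reactivity unlikely"
    # Everything in between is left open for case-by-case evaluation.
    return "indeterminate"
```

An alignment with 72% identity and an E score of 1e-30 falls in the first class, one with 30% identity and an E score of 0.5 in the second, and one with 45% identity and an E score of 1e-10 is left as indeterminate.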
Sliding 80mer FASTA
In addition to the full-length FASTA search, we have added an option to automatically scan each possible 80 amino acid segment (1-80, 2-81, 3-82, etc.) of the entered search protein against the AllergenOnline database, looking for matches of at least 35% identity. The 35% identity for 80 amino acid segments was suggested in a scientific advisory to regulators for evaluating proteins in genetically modified crops (see FAO/WHO 2001, and Codex 2003). This short segment matching routine evaluating segments of 80 amino acids appears to be quite conservative and precautionary, as discussed in Goodman et al. (2005) and Goodman and Hefle (2005). However, the 80 amino acid segment search appears to be far more likely to be informative than a search for shorter identical segments of 6 or 8 contiguous amino acids as originally recommended by Metcalfe et al. (1996) or the FAO/WHO 2001 approach, based on evaluations by Hileman et al. (2002) and Silvanovich et al. (2006). See also the summary report from the bioinformatics workshop on evaluating potential allergenicity (Goodman, 2006).
In the past, AllergenOnline.org employed an E()-value (E-score) threshold of 100 as a statistical cutoff limit in the 80-mer search for identifying alignments with >35% identity matches that should be evaluated further. However, we have determined that the very large E score allows alignments with multiple gaps and leads in some cases to alignments that do not make sense when compared to full-length alignments. Reexamination of publications by Pearson from 2004 and earlier clearly supports the use of the default E = 10 as a limit for FASTA; in exceptional cases with specialized, small databases or sequences, the limit could be set lower (e.g. E = 0.01). We have therefore modified the search parameters to evaluate only alignments with E scores of 10 or less in the release of AllergenOnline.org version 15 (12 January, 2015).
It is important to keep in mind that the default E()-value is simply a starting threshold used to allow alignments to be observed and then investigated using 35% identity and 80 amino acid overlap as the criteria. In cases where the sliding 80mer search identifies matches of >35% identity, additional bioinformatics comparisons may be useful to evaluate likely biological significance, or specific serum testing may prove useful if appropriate specifically allergic serum donors can be identified to evaluate the potential cross-reactivity suggested by the match.
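The windowing and screening criteria of the sliding 80mer search can be sketched in Python as below. The sketch does not run FASTA. It only generates the 80 amino acid segments and applies the stated criteria to alignment summaries that some search tool is assumed to have produced. The `Hit` record, the function names, the handling of queries shorter than 80 residues and the requirement that the overlap span the full window are all assumptions made for illustration. Note that 35% of 80 positions corresponds to 28 identical residues.

```python
from typing import Iterator, NamedTuple, Tuple

WINDOW = 80          # segment length named in the text
MIN_IDENTITY = 35.0  # percent identity criterion named in the text
MAX_EVALUE = 10.0    # E-value limit described for version 15


def sliding_windows(seq: str,
                    size: int = WINDOW) -> Iterator[Tuple[int, str]]:
    """Yield (1-based start, segment) for windows 1-80, 2-81, 3-82, ..."""
    if len(seq) < size:
        # Assumption: a query shorter than one window is used as a whole.
        yield 1, seq
        return
    for start in range(len(seq) - size + 1):
        yield start + 1, seq[start:start + size]


class Hit(NamedTuple):
    """Hypothetical summary of one alignment reported by a search tool."""
    allergen_id: str
    identity_pct: float  # percent identity over the aligned region
    overlap: int         # number of aligned positions
    evalue: float


def needs_follow_up(hit: Hit) -> bool:
    """Apply the E-value limit first, then the identity/overlap criteria."""
    return (hit.evalue <= MAX_EVALUE
            and hit.overlap >= WINDOW
            and hit.identity_pct >= MIN_IDENTITY)
```

A query of 82 residues yields three windows (starting at positions 1, 2 and 3). A hit that passes `needs_follow_up` is only a prompt for the further bioinformatics comparison or serum testing described above, not a conclusion about cross-reactivity.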
8mer Identity Match
Although the Codex guideline mentions using short segment (6 or 8 amino acid) sequence matches, it also indicates that searches must be based on scientific proof. We searched for, but were unable to find, examples where an isolated identity match of 6 or 8 amino acids was found between cross-reactive proteins unless there was also at least a 35% identity match over 80 amino acids; we therefore previously did not include that search routine on our database (Goodman et al. 2008). However, since some countries still require an eight amino acid identity search, even in the absence of evidence demonstrating a positive predictive value, we now provide that as an option.
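An exact match of eight contiguous amino acids can be found with a simple set lookup, as in the minimal Python sketch below. The function names are hypothetical and this is not the code behind the website's option. It only illustrates what an 8mer identity match means.

```python
from typing import List, Set, Tuple


def kmers(seq: str, k: int = 8) -> Set[str]:
    """Return the set of all contiguous k-residue segments of seq."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_kmers(query: str, allergen: str,
                 k: int = 8) -> List[Tuple[int, str]]:
    """List (1-based query position, segment) for every exact k-mer match."""
    query = query.upper()
    allergen_kmers = kmers(allergen, k)
    return [
        (i + 1, query[i:i + k])
        for i in range(len(query) - k + 1)
        if query[i:i + k] in allergen_kmers
    ]
```

For instance, `shared_kmers("MKTAYIAKQRQISFVK", "GGTAYIAKQRGG")` reports the single shared segment `TAYIAKQR` starting at query position 3. Both sequences here are made-up strings used only to show the mechanics.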
For bioinformatic analysis:
- Pearson, W.R. and Lipman, D.J. 1988. Improved tools for biological sequence comparison. Proc. Natl. Acad. Sci. USA 85:2440-2448.
- Pearson, W.R. 2000. Flexible sequence similarity searching with the FASTA3 program package. Methods Mol. Biol. 132:185-219.
- Pearson, W.R. 2003. Finding protein and nucleotide similarities with FASTA. Current Protocols in Bioinformatics. Section 3.9.1 to 3.9.23.
- Siruguri V, Bharatraj DK, Vankudavath RN, Rao Mendu VV, Gupta V, Goodman RE. 2015. Evaluation of Bar, Barnase, and Barstar recombinant proteins expressed in genetically engineered Brassica juncea (Indian mustard) for potential risks of food allergy using bioinformatics and literature searches. Food Chem Toxicol 83:93-102. doi: 10.1016/j.fct.2015.06.003. PubMed PMID: 26079618.
For protein sequence (structure) and allergenicity:
- Aalberse, R.C. 2000. Structural biology of allergens. J. Allergy Clin. Immunol. 106:228-238.
- Aalberse, R. C., and Stapel, S. O. 2001. Structure of food allergens in relation to allergenicity. Pediatr Allergy Immunol 12:10-4.
- Codex Alimentarius Commission. 2003. Alinorm 03/34: Appendix III. Guideline for the conduct of food safety assessment of foods derived from recombinant DNA plants. Annex IV. Annex on the assessment of possible allergenicity, Rome, Italy.
- Doolittle, R.F. in Methods in Enzymology Vol. 183. Molecular evolution: Computer analysis of protein and nucleic acid sequences, R. F. Doolittle, Ed. (Academic Press, Inc., San Diego, 1990), chap. 6.
- FAO/WHO 2001. Evaluation of allergenicity of genetically modified foods derived from biotechnology. Rome, Italy.
- Goodman, RE, Hefle, SL. 2005. Gaining perspective on the allergenicity assessment of genetically modified food crops. Expert Rev. Clin. Immunol. 1(4):561-578.
- Goodman, RE, Hefle, SL, Taylor SL, van Ree, R. 2005. Assessing genetically modified crops to minimize the risk of increased food allergy. Int. Arch. Allergy Immunol. 137(2):153-166.
- Goodman RE. 2006. Practical and predictive bioinformatics methods for the identification of potentially cross-reactive protein matches. Mol Nutr Food Res 50:655-660.
- Goodman RE, Vieths S, Sampson HA, Hill D, Ebisawa M, Taylor SL, van Ree R. 2008. Allergenicity assessment of genetically modified crops - what makes sense? Nat Biotech 26(1):73-81.
- Hileman, R.E., Silvanovich, A., Goodman, R.E., Rice, E.A., Holleschak, G., Astwood, J.D. and Hefle, S.L. 2002. Bioinformatic methods of allergenicity assessment using a comprehensive allergen database. Int. Arch. Allergy Immunol. 128:280-291.
- Ladics GS, Bannon GA, Silvanovich A, Cressman RF. 2007. Comparison of conventional FASTA identity searches with the 80 amino acid sliding window FASTA search for the elucidation of potential identities to known allergens. Mol Nutr Food Res 51(8):985-998.
- Metcalfe, D. D., Astwood, J. D., Townsend, R., Sampson, H. A., Taylor, S. L., and Fuchs, R. L. 1996. Assessment of the allergenic potential of foods derived from genetically engineered crop plants. Crit Rev Food Sci Nutr 36 Suppl:S165-86.
- Silvanovich A, Nemeth MA, Song P, Herman R, Tagliani, L, Bannon, GA. 2006. The value of short amino acid sequence matches for prediction of protein allergenicity. Toxicol. Sci. 90(1):252-258.
- Thomas K, Bannon G, Hefle S, Herouet C, Holsapple M, Ladics G, MacIntosh S, Privalle L. 2005. In silico methods for evaluating human allergenicity of novel proteins. Toxicol Sci 88(2):307-310.
Additional or alternative bioinformatics tools and databases may also be useful for the evaluation of potential allergens (see also LINKS):
- Brusic, V., Petrovsky, N., Gendel, S.M., Millot, M., Gigonzac, O., Stelman, S.J. 2003. Computational tools for the study of allergens. Allergy 58:1083-1092.
- Brusic, V., Petrovsky, N., Gendel, S.M., Millot, M., Gigonzac, O., Stelman, S.J. 2003. Allergen databases. Allergy 58:1093-1100.
- Kleter, G.A. and Peijnenburg, A.A.C.M. 2002. Screening of transgenic proteins expressed in transgenic food crops for the presence of short amino acid sequences identical to potential, IgE-binding linear epitopes of allergens. BMC Structural Biology 2:8.
- Ivanciuc, O., Schein, C.H., Braun, W. 2003. SDAP: database and computational tools for allergenic proteins. Nuc. Acids Res. 31:359-362.
- Malandain, H. 2004. Basic immunology, allergen prediction and bioinformatics. Allergy 59:1011-1012.
- Martinez Barrio, A., Soeria-Atmadja, D., Nister, A., Gustafsson, M.G., Hammerling, U., Bongcam-Rudloff, E. 2007. EVALLER: a web server for in silico assessment of potential protein allergenicity. Nuc. Acids Res. 35(Web Server Issue): W694-W700.
- Saha, S., Raghava, G.P.S. 2006. Algpred: prediction of allergenic proteins and mapping of IgE epitopes. Nuc. Acids Res. 34(Web Server Issue): W202-W209.
- Stadler, M.B. and Stadler, B.M. 2003. Allergenicity prediction by protein sequence. FASEB J. 17:1141-1143.
- Zhang, L., Huang, Y., Zou, Z., He, Y., Chen, X., Tao, A. 2012. SORTALLER: predicting allergens using substantially optimized algorithm on allergen family featured peptides. Bioinformatics 28(16):2178-2179.