About Search Routines
FASTA program information
Released: Jan. 15, 2009
Scoring matrix: BLOSUM 50
The FASTA bioinformatics tool, developed by W.R. Pearson at the University of Virginia (Pearson and Lipman, 1988), provides a rapid search and local alignment of a query sequence against the sequences in a specified database. The results list the best individual alignment between your query sequence and each matched database sequence, scored with a scoring matrix. The scoring matrix used on the AllergenOnline website is BLOSUM 50 (Henikoff and Henikoff, 1992), which gives the most weight to identical matches of amino acids that are likely to significantly affect the overall protein structure, less weight to identical matches of amino acids unlikely to affect the structure, and less weight to "similar" amino acids. The scoring allows only minimal gaps in matched regions. Researchers have used various scoring matrices, gap penalties and other criteria to calculate the similarity of proteins with the FASTA (or BLAST) algorithm, depending on the purpose of the search.
Default values on AllergenOnline are a "word size" of 2 and an expectation value (E-value) cutoff of 1, the highest E-value reported, representing the least similarity. The researcher can raise or lower the E-value cutoff, which can give very different results. Based on experience, a cutoff of 1 is large enough that important alignments, including remote homologies, are not missed. BLOSUM 50 is intended to identify highly similar proteins that are likely to have similar overall structure and function, whether the sequences are closely related or of more distant evolutionary origin (Pearson, 2000). Highly similar proteins are considered probable homologues, similar because of the evolutionary relationship of the source organisms.
The output of FASTA includes a list of aligned sequences ordered from most to least similar, a histogram showing the distribution of matches relative to a theoretical distribution, a statistical score for each alignment (the Expectation value, or E-value), the percent identity of the overlapping alignment, and the best alignment of the query and the matched protein. Interpreting the FASTA search results requires evaluating both the E-values and the percent identity of the amino acid sequences to understand the importance of each alignment. The steps described below should help users evaluate the potential homology and the potential relevance of matches of the query protein with the allergens and putative allergens listed in the AllergenOnline database. The goal is to identify matches that may indicate potential allergenic cross-reactivity.
1. The E-value (expectation value) is a calculated value that reflects the degree of similarity of the query protein to each match. Within an alignment, matches of identical amino acids are scored as most significant, while matches of similar amino acids (similar charge and size) are given lower scores. The E-value depends on the overall length of the joined (gapped) local sequence alignments, the quality (percent identity, similarity) of the overlap, and the size of the database. The E-value is inversely related to the similarity of two proteins: a very low E-value (e.g. 1e-30) indicates a high degree of similarity between the query sequence and the matching database sequence, while a value of 1 or higher indicates that the proteins are not likely to be related in evolution or structure. In general, for a database the size of AllergenOnline, which contains many unrelated as well as related proteins, two sequences might be considered related in evolutionary terms (i.e. diverged from a common ancestor and sharing a common three-dimensional structure) when the E-value of the FASTA query is less than 0.02 (Pearson, 1996). The statistical distribution of the search results should be checked to confirm that the sequence matches were not highly biased by an unusual distribution of amino acids (the histogram should approximately follow the theoretical asterisk line in the FASTA results). However, an E-value of 0.02 does not mean that the overall structures are likely to be sufficiently similar for antibodies (e.g. IgE from an allergic individual) against one protein to recognize the other. If the goal of the comparison is to identify proteins that may share immunologic or allergic cross-reactivity, matches with E-values larger than 1e-7 are not likely to be relevant, while matches with E-values smaller than 1e-30 are much more likely to be cross-reactive in at least some allergic individuals (Hileman et al., 2002).
Since E-values depend to a great degree on the scoring matrix, the size of the database and many other factors, and have not been used by many authors in evaluating potentially cross-reactive allergens, their immunological significance should be interpreted with caution. Some authors have suggested relying more on the E-value than on percent identity (Ladics et al., 2007), but they have not reconciled the broad range of significant E-values possible for cross-reactive allergens among proteins of very different sizes and types (e.g. 2S albumins and LTPs compared to vicilins or legumins). A more common comparison is the percent identity (below). It would be worth considering a conservative E-value threshold (e.g. 1e-7) as an additional data point to complement the percent identity.
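The E-value bands described in step 1 can be summarized in a short Python sketch. The function name and the labels are illustrative and are not part of AllergenOnline or FASTA. The text does not say how to treat E-values between 0.02 and 1, or between 1e-30 and 1e-7, so the labels for those two ranges are assumptions.

```python
def interpret_e_value(e_value: float) -> str:
    """Map a FASTA E-value to the interpretation bands described in the text.

    Labels for the 0.02 to 1 range and the 1e-30 to 1e-7 range are
    assumptions; the text only describes the outer bands explicitly.
    """
    if e_value < 0:
        raise ValueError("E-values cannot be negative")
    if e_value >= 1.0:
        # "a value of 1 or higher indicates the proteins are not likely related"
        return "not likely related"
    if e_value >= 0.02:
        # Not discussed in the text; treated here as uncertain (assumption).
        return "relationship uncertain"
    if e_value > 1e-7:
        # Below 0.02 suggests homology, but above 1e-7 is not likely to be
        # relevant for allergenic cross-reactivity.
        return "possible homologue, cross-reactivity unlikely"
    if e_value >= 1e-30:
        # Intermediate range; label is an assumption.
        return "probable homologue, cross-reactivity possible"
    return "probable homologue, cross-reactivity much more likely"


if __name__ == "__main__":
    for e in (2.2, 0.0034, 3.9e-38):
        print(e, "->", interpret_e_value(e))
```

As the text cautions, these bands depend on the scoring matrix and database size, so a label from this function is a prompt for closer inspection and not a conclusion.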
2. The percent identity of the amino acid sequences of aligned proteins is the comparison that has historically been discussed in evaluating proteins that are clinically cross-reactive or that share IgE binding properties. Aalberse (2000) reviewed potentially cross-reactive structures of known allergens and noted that proteins sharing greater than 70% primary amino acid sequence identity with an allergen throughout the length of the protein are commonly cross-reactive, while those with less than 50% identity are unlikely to be cross-reactive. These observations suggest that much more conservation of sequence and structure is needed for allergenic cross-reactivity (typically > 50% identity) than the level that indicates probable homology (at least 25% identity over 200 or more amino acids; Pearson, 1996). While it is clear that short matching segments of 20-40 amino acids with roughly 50% identity can occur by random chance or by conservation of functional motifs (Pearson, 1996), there is little evidence that short stretches of shared identity lead to allergic cross-reactivity (Aalberse, 2000). Even if an IgE epitope is present in a short region of high identity, an IgE-mediated allergic reaction requires cross-linking of high affinity IgE receptors on the surface of mast cells. Cross-linking requires at least two spatially distinct IgE binding epitopes on one protein, or strong linkage (e.g. disulfide bonds) of peptides that each have at least one IgE epitope (Bannon et al., 2002; Kane et al., 1986). In addition, many IgE epitopes involve binding to amino acid residues that are separated in the linear sequence but adjacent in space when the protein is naturally folded. These are referred to as conformational epitopes. Two proteins that are not highly similar over a major portion of their length (i.e. are not true homologues) are quite unlikely to share two IgE epitopes, either linear or conformational.
The more closely related the species and the higher the identity of the proteins, the greater the likelihood of allergenic cross-reactivity.
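As a minimal sketch of how the percent identity in step 2 is obtained, the Python below counts identical positions in a pair of already-aligned sequences. It reports identity over all alignment columns and over ungapped columns only, which is consistent with the two percentages printed in the FASTA outputs on this page. The function names, the "-" gap character and the toy sequences are assumptions for illustration. The bands follow Aalberse (2000) and are meant for identity over the length of the protein, not for short segments.

```python
from typing import Tuple


def percent_identity(aligned_query: str, aligned_subject: str,
                     gap: str = "-") -> Tuple[float, float]:
    """Return (identity over all columns, identity over ungapped columns)."""
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must have the same length")
    columns = len(aligned_query)
    ungapped = 0
    identical = 0
    for q, s in zip(aligned_query, aligned_subject):
        if q == gap or s == gap:
            continue  # a gap column can never be an identity
        ungapped += 1
        if q == s:
            identical += 1
    if columns == 0 or ungapped == 0:
        return 0.0, 0.0
    return 100.0 * identical / columns, 100.0 * identical / ungapped


def aalberse_band(identity_percent: float) -> str:
    """Bands from Aalberse (2000); the middle label is an assumption."""
    if identity_percent > 70.0:
        return "commonly cross-reactive"
    if identity_percent < 50.0:
        return "unlikely to be cross-reactive"
    return "indeterminate"


if __name__ == "__main__":
    # Toy alignment (hypothetical sequences): 5 identities in 7 columns.
    gapped, ungapped = percent_identity("ACDE-FG", "ACDEHFA")
    print(round(gapped, 3), round(ungapped, 3), aalberse_band(gapped))
```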
Based on the level of understanding of likely cross-reactivity, the Codex Alimentarius Commission (2003) developed guidelines for the evaluation of the potential allergenicity of novel proteins. They recommended a bioinformatics search using a FASTA or a BLASTP algorithm, and suggested that matches of greater than 35% identity over segments of at least 80 amino acids may indicate the possibility of cross-reactivity. The Codex guideline also states that, if scientifically justified, an additional indicator, for example the occurrence of short identical matches (e.g. 8 or more contiguous amino acids) between a protein and an allergen, may be useful in predicting potential cross-reactivity. Where a protein matches an allergen with a significant level of identity, further evaluation of potential IgE reactivity or clinical cross-reactivity may be warranted.
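The Codex criterion reduces to a simple comparison, sketched below in Python. The function name and parameter names are hypothetical. The strict inequality on identity follows the ">35%" wording used for the 80-mer search on this page, so how a match at exactly 35% is handled should be read as an assumption.

```python
def codex_flag(identity_percent: float, overlap_length: int,
               min_identity: float = 35.0, min_overlap: int = 80) -> bool:
    """True if a match meets the Codex sequence criterion described above.

    A match is flagged when identity exceeds `min_identity` over an
    aligned segment of at least `min_overlap` amino acids.
    """
    return overlap_length >= min_overlap and identity_percent > min_identity


if __name__ == "__main__":
    # (identity %, overlap) pairs taken from the three example outputs
    # shown later on this page.
    for identity, overlap in ((70.625, 160), (75.0, 24), (22.4, 125)):
        print(identity, overlap, codex_flag(identity, overlap))
```

A match that fails this test can still deserve attention, as the short chestnut example later on this page illustrates.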
Scanning 80mer windows with FASTA.
A specific algorithm was added to AllergenOnline in May 2005 to perform a search with every possible 80 amino acid segment of the query protein. The rationale is the recommendation of the FAO/WHO 2001 expert panel to use a criterion of >35% identity over any segment of 80 or more amino acids as an indication of possible cross-reactivity with allergens. The Codex Alimentarius Commission (2003) adopted this as the primary sequence search criterion for flagging proteins in genetically modified plants that might raise some concern of cross-reactivity (Codex Alinorm 03/34, 2003). The comparison is done by sequential FASTA3 searches of amino acid segments 1-80, then 2-81, 3-82, etc., until the end of the query sequence is reached. The identity score is adjusted to compensate for aligned segments shorter than 80 amino acids, for example because of inserted gaps, so that such matches are evaluated as if calculated over 80 amino acids total. The query sequence may be entered with or without spacing gaps and numbers in the sequence, or in a typical FASTA format. While there are no good data demonstrating that proteins sharing only 35% identity over 80 amino acids are actually cross-reactive, there are some in vitro data to suggest specific binding at this relatively low percent identity. Further, the regulatory criterion seems to have been set unless studies are performed to justify a new criterion.
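The window enumeration (1-80, 2-81, 3-82, ...) can be sketched in Python as follows. This is not the AllergenOnline implementation. The function names are hypothetical, and the identity adjustment shown (dividing the number of identical residues by 80 whenever the aligned overlap is shorter than 80) is one plausible reading of the adjustment described above, stated here as an assumption.

```python
from typing import Iterator, Tuple

WINDOW = 80  # segment length used by the 80-mer search


def sliding_windows(sequence: str,
                    size: int = WINDOW) -> Iterator[Tuple[int, int, str]]:
    """Yield (start, end, segment) using 1-based inclusive coordinates."""
    if not sequence:
        return
    if len(sequence) <= size:
        # A query shorter than one window is searched as a single segment.
        yield 1, len(sequence), sequence
        return
    for start in range(len(sequence) - size + 1):
        yield start + 1, start + size, sequence[start:start + size]


def adjusted_identity(identical: int, overlap: int,
                      size: int = WINDOW) -> float:
    """Percent identity with the denominator raised to `size` if shorter.

    Assumption: a short overlap is scored as if it spanned `size` columns.
    """
    if overlap <= 0 or identical < 0 or identical > overlap:
        raise ValueError("need 0 <= identical <= overlap and overlap > 0")
    return 100.0 * identical / max(overlap, size)


if __name__ == "__main__":
    query = "A" * 100  # placeholder sequence, 100 residues
    windows = list(sliding_windows(query))
    print(len(windows), windows[0][:2], windows[-1][:2])
    # 30 identities in a 70-column overlap, evaluated over 80 columns.
    print(adjusted_identity(30, 70))
```

Each segment would then be submitted as its own FASTA search, and any hit whose adjusted identity exceeds 35% would be reported.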
Scanning 8mer exact match.
At this time there are no data demonstrating any predictive power in using an exact match of 8 amino acids in sequence comparisons to identify potentially cross-reactive proteins (Goodman et al., 2005, Goodman et al., 2008). For that reason AllergenOnline does not have an exact short-mer matching algorithm.
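For readers who want to see what the short exact-match criterion would compute, here is a minimal Python sketch. It is not a feature of AllergenOnline. The function name and the toy sequences are hypothetical, and, as noted above, a hit from such a scan has not been shown to predict cross-reactivity.

```python
from typing import Set


def shared_kmers(query: str, allergen: str, k: int = 8) -> Set[str]:
    """Return every contiguous k-residue segment present in both sequences."""
    if k <= 0:
        raise ValueError("k must be positive")
    # Index all k-mers of the allergen once, then test each query k-mer.
    allergen_kmers = {allergen[i:i + k] for i in range(len(allergen) - k + 1)}
    return {
        query[i:i + k]
        for i in range(len(query) - k + 1)
        if query[i:i + k] in allergen_kmers
    }


if __name__ == "__main__":
    # Toy sequences (hypothetical) sharing a 9-residue stretch.
    hits = shared_kmers("MKTAYIAKQRQISFVK", "GGTAYIAKQRQGG")
    print(sorted(hits))
```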
An example of a match representing proteins with probable cross-reactivity. The following FASTA output shows both the E-value and percent identity for a match between a query protein from the European hazelnut (nut), gi 5726304 (top sequence), and a birch pollen allergen (Bet v 1), gi 1321714. The query is the known food allergen Cor a 1.0401, used here as an example of a known cross-reactive protein. The search identified matches with a number of birch pollen allergen (Bet v 1) isoforms, as well as homologues in other species, including the hazelnut pollen allergen Cor a 1.0201, gi 1321731. The E-value of the match between Cor a 1.0401 (nut allergen) and the specifically indicated Bet v 1 allergen is 1.9e-37. This low value is highly suggestive that the two proteins are homologues. The identity is 70% over 160 amino acids, indicating a good chance of allergic cross-reactivity. In fact, considerable clinical data indicate that many birch pollen allergic individuals have demonstrable allergic reactions if they eat hazelnuts, and many are also sensitized to hazelnut tree pollen. Furthermore, IgE binding to the hazelnut protein (Cor a 1.0401) or the hazelnut pollen protein (Cor a 1.01, 1.02 or 1.03) attached to a solid surface (e.g. an ELISA plate) can be inhibited by adding soluble birch pollen extract or purified recombinant Bet v 1 protein (Rohac et al., 1991; Luttkopf et al., 2001). The in vitro IgE inhibition results demonstrated that the same antibodies bind to these proteins, providing evidence for the mechanism of clinical cross-reactivity. If you did not already know that Cor a 1.0401 was an allergen, the FASTA result would lead you to test in vitro IgE binding to Cor a 1.0401 using sera from a number of individuals allergic to hazelnut pollen or birch pollen.
initn: 535 init1: 535 opt: 742 Z-score: 772.3 bits: 149.1 E(): 3.9e-38
Smith-Waterman score: 742; 70.625% identity (71.069% ungapped) in 160 aa overlap (1-160:1-159)
(Pairwise alignment of query residues 1-160 with Bet v 1 residues 1-159; sequence lines not shown.)
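The two summary lines at the top of a FASTA alignment carry the values discussed in steps 1 and 2. The Python sketch below pulls them out with regular expressions. The patterns are written against the lines quoted on this page only, so other FASTA versions or output options may need different patterns. The function and key names are hypothetical.

```python
import re

# E-value from the "initn: ... E(): 3.9e-38" line.
SCORE_RE = re.compile(r"E\(\):\s*(?P<evalue>[0-9.]+(?:[eE][-+]?\d+)?)")

# Identity, ungapped identity, overlap and coordinates from the
# "Smith-Waterman score: ..." line.
IDENTITY_RE = re.compile(
    r"(?P<identity>[\d.]+)% identity \((?P<ungapped>[\d.]+)% ungapped\) "
    r"in (?P<overlap>\d+) aa overlap "
    r"\((?P<q_start>\d+)-(?P<q_end>\d+):(?P<s_start>\d+)-(?P<s_end>\d+)\)"
)


def parse_fasta_summary(score_line: str, identity_line: str) -> dict:
    """Extract E-value, identities, overlap and ranges from two summary lines."""
    score = SCORE_RE.search(score_line)
    ident = IDENTITY_RE.search(identity_line)
    if score is None or ident is None:
        raise ValueError("summary lines do not match the expected layout")
    return {
        "e_value": float(score.group("evalue")),
        "identity": float(ident.group("identity")),
        "ungapped_identity": float(ident.group("ungapped")),
        "overlap": int(ident.group("overlap")),
        "query_range": (int(ident.group("q_start")), int(ident.group("q_end"))),
        "subject_range": (int(ident.group("s_start")), int(ident.group("s_end"))),
    }


if __name__ == "__main__":
    result = parse_fasta_summary(
        "initn: 535 init1: 535 opt: 742 Z-score: 772.3 bits: 149.1 E(): 3.9e-38",
        "Smith-Waterman score: 742; 70.625% identity (71.069% ungapped) "
        "in 160 aa overlap (1-160:1-159)",
    )
    print(result)
```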
An example of a short match of potential significance. This FASTA search match is between the example query protein (Cor a 1.0401) and a short protein segment (25 amino acids) that was published from N-terminal sequencing of an allergenic protein isolated from European chestnuts. Even though the E-value is 0.0034 and the match is less than 80 amino acids in length, the identity is very high (75%) and there is a 100% match over a segment of 11 amino acids, indicating a high degree of similarity for at least the N-terminus of the proteins. If that were the only match of the query sequence to an allergen, it would be reasonable to search all public sequence databases and the scientific literature for newly published sequences of this protein, or possible allergenic homologues. If the source of the protein (European chestnuts) has been identified as a common cause of allergy, it might be possible to obtain sera from individuals allergic to chestnuts and to test for IgE binding.
initn: 119 init1: 119 opt: 119 Z-score: 145.0 bits: 30.3 E(): 0.0034
Smith-Waterman score: 119; 75.000% identity (75.000% ungapped) in 24 aa overlap (2-25:1-24)
(Pairwise alignment of query residues 2-25 with chestnut peptide residues 1-24; sequence lines not shown. Match line, ":" identical and "." similar:)
::: .:.. :::::::::::.:::
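The figures quoted for this short match (75% identity and an 11-residue exact stretch) can be read directly off the match line in the output, where ":" marks an identical pair and "." a similar pair. The Python sketch below does that counting. It assumes the 24-character match line is intact and that the alignment has no gaps, which the equal gapped and ungapped percentages in the summary line suggest. The function name is hypothetical.

```python
def match_line_stats(match_line: str) -> dict:
    """Count identities, similarities and the longest identical run."""
    columns = len(match_line)
    identical = match_line.count(":")
    similar = match_line.count(".")
    longest = current = 0
    for symbol in match_line:
        # Extend the run on an identity, otherwise reset it.
        current = current + 1 if symbol == ":" else 0
        longest = max(longest, current)
    return {
        "columns": columns,
        "identical": identical,
        "similar": similar,
        "longest_identical_run": longest,
        "percent_identity": 100.0 * identical / columns if columns else 0.0,
    }


# Match line from the chestnut example, built piecewise so the run of
# eleven identities is explicit.
MATCH_LINE = "::: .:.. " + ":" * 11 + ".:::"

if __name__ == "__main__":
    print(match_line_stats(MATCH_LINE))
```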
An example of a match between proteins that are unlikely to be homologous. The query sequence was the hazelnut Cor a 1.0401 protein, gi 5726304. The alignment below is to an allergen from Holcus lanatus (velvet grass) pollen. The alignment is limited to a 125 amino acid region of the 296 amino acid pollen allergen and indicates an identity of only 22%, with an E-value of 2.2. These values indicate that the proteins are not likely to be homologous and are quite unlikely to be cross-reactive.
initn: 66 init1: 44 opt: 81 Z-score: 94.4 bits: 24.6 E(): 2.2
Smith-Waterman score: 81; 22.400% identity (24.779% ungapped) in 125 aa overlap (10-129:67-184)
(Pairwise alignment of query residues 10-129 with Holcus lanatus allergen residues 67-184; sequence lines not shown.)
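The summary line of this alignment also shows how much of the 125-column overlap consists of gaps. As a worked check in Python, the sketch below recovers the number of identical residues and gap columns from the reported percentages and coordinate ranges. The function name is hypothetical, and rounding to the nearest integer is an assumption needed because the output prints percentages to three decimals.

```python
from typing import Tuple


def alignment_counts(identity_pct: float, ungapped_pct: float, overlap: int,
                     query_range: Tuple[int, int],
                     subject_range: Tuple[int, int]) -> dict:
    """Recover residue and gap counts from a FASTA summary line."""
    identical = round(identity_pct * overlap / 100)
    ungapped_columns = round(100 * identical / ungapped_pct)
    query_residues = query_range[1] - query_range[0] + 1
    subject_residues = subject_range[1] - subject_range[0] + 1
    return {
        "identical": identical,
        "ungapped_columns": ungapped_columns,
        "gap_columns": overlap - ungapped_columns,
        # Columns not filled by a residue must hold a gap character.
        "gaps_in_query": overlap - query_residues,
        "gaps_in_subject": overlap - subject_residues,
    }


if __name__ == "__main__":
    # Velvet grass example: 22.400% identity (24.779% ungapped)
    # in 125 aa overlap (10-129:67-184).
    print(alignment_counts(22.4, 24.779, 125, (10, 129), (67, 184)))
```

For this example the sketch gives 28 identical residues and 12 gap columns (5 in the query row and 7 in the allergen row), so nearly a tenth of the alignment is gaps. The hazelnut and Bet v 1 alignment, by contrast, works out to 113 identities with a single gap column.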
References
- Aalberse, R.C. 2000. Structural biology of allergens. J. Allergy Clin. Immunol. 106:228-238.
- Bannon GA, Goodman RE, Leach JN, Rice E, Fuchs RL, Astwood JD. 2002. Digestive stability in the context of assessing the potential allergenicity of food proteins. Comments on Toxicology 8:271-285.
- Codex Alimentarius Commission, 2003. Alinorm 03/34: Joint FAO/WHO Food Standard Programme, Codex Alimentarius Commission, Twenty-Fifth Session, Rome, Italy, 30 June-5 July, 2003. Appendix III, Guideline for the conduct of food safety assessment of foods derived from recombinant-DNA plants and Appendix IV, Annex on the assessment of possible allergenicity, pp. 47-60.
- Goodman RE. 2006. Practical and predictive bioinformatics methods for the identification of potentially cross-reactive protein matches. Mol Nutr Food Res 50:655-660.
- Goodman RE, Vieths S, Sampson HA, Hill D, Ebisawa M, Taylor SL, van Ree R. 2008. Allergenicity assessment of genetically modified crops -- what makes sense? Nat Biotech 26(1):73-81.
- Henikoff, S. and Henikoff, J.G. 1992. Amino acid substitution matrices from protein blocks. Proc. Natl. Acad. Sci. USA 89:10915-10919.
- Hileman RE, Silvanovich A, Goodman RE, Rice EA, Holleschak G, Astwood JD, Hefle S. 2002. Bioinformatic methods for allergenicity assessment using a comprehensive allergen database. Int. Arch. Allergy Immunol. 128:280-291.
- Kane P, Erickson J, Fewtrell C, Baird B, Holowka D. 1986. Cross-linking IgE-receptor complexes at the cell surface: synthesis and characterization of a long bivalent hapten that is capable of triggering mast cells and rat basophilic leukemia cells. Mol. Immunol. 23(7):783-790.
- Ladics GS, Bannon GA, Silvanovich A, Cressman RF. 2007. Comparison of conventional FASTA identity searches with the 80-amino acid sliding window FASTA search for the elucidation of potential identities to known allergens. Mol. Nutr. Food Res. 51(8):985-998.
- Luttkopf D, Muller U, Skov PS, Ballmer-Weber BK, Wuthrich B, Skamstrup Hansen K, Poulsen LK, Kastner M, Haustein D, Vieths S. 2001. Comparison of four variants of a major allergen in hazelnut (Corylus avellana) Cor a 1.04 with the major hazel pollen allergen Cor a 1.01. Mol. Immunol. 38:515-525.
- Pearson, W.R. and Lipman, D.J. 1988. Improved tools for biological sequence comparison. Proc. Natl. Acad. Sci. USA 85:2444-2448.
- Pearson, W.R. 1996. Effective protein sequence comparison. Methods Enzymol. 266:227-58.
- Pearson, W.R. 2000. Flexible sequence similarity searching with the FASTA3 program package. Methods Mol. Biol. 132:185-219.
- Rohac, M., Birkner, T., Bohle, B., Steiner, R., Breitenbach, M., Kraft, O., Gabl F., Rumpold, H. 1991. The immunological relationship of epitopes on major tree pollen allergens. Mol. Immunol. 28(8):897-906.