Data analysis techniques are required when protein samples are measured with TOF-SIMS because it is difficult to obtain large intact molecular ions with sufficient intensity using current TOF-SIMS techniques. Since all proteins consist of the same 20 amino acids, TOF-SIMS spectra of protein-adsorbed films cannot be readily differentiated by the simple presence or absence of unique peaks. Proteins are large molecules, approximately 1-20 nm in size, and the secondary ions from a protein necessarily originate from only part of the molecule. With analysis techniques suited to TOF-SIMS spectra, such as multivariate analysis and information theory, TOF-SIMS nevertheless enables direct qualification and quantification of a protein sample from this partial information about the protein surface.

Although the latest SIMS techniques, which use cluster ions as primary ion sources, make it possible to obtain larger molecular ions from proteins, the resulting SIMS spectra still require data analysis methods because the fragment ions are so similar. In addition, the orientation of immobilized proteins can be evaluated by means of TOF-SIMS spectra, because fragment ions provide useful information about the partial chemical structures of immobilized proteins.

Multivariate analysis techniques such as PCA (Huberty 1994; Jackson 1980; Wagner et al. 2004; Wold 1976, 1987) and LDA provide useful tools for extracting important information from large data sets (Mantus et al. 1993; Lhoest et al. 1998; Wagner et al. 2003b). PCA is a mathematical procedure that transforms a number of correlated variables into a smaller number of uncorrelated variables called principal components (PCs). PCA can reduce the dimensionality of a multidimensional space while retaining a large amount of the original information in the data. For example, two-dimensional data may be transformed into one-dimensional data, as shown in Fig. 5. The first PC accounts for as much of the variability in the data as possible, and each succeeding component accounts for as much of the remaining variability as possible. Moreover, PCA is an unsupervised pattern-recognition technique and therefore provides results that are unbiased by human input.

Fig. 5. Schematic of the principal component analysis (PCA) concept; concentration of the original information
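As a rough illustration of the reduction sketched in Fig. 5, the following NumPy snippet projects correlated two-dimensional points onto their first principal component. The data are synthetic and the variable names are invented for this sketch; nothing here is taken from a TOF-SIMS measurement.

```python
import numpy as np

# Hypothetical, strongly correlated 2-D data (e.g. two peak intensities).
rng = np.random.default_rng(0)
x = rng.normal(size=200)
data = np.column_stack([x, 0.8 * x + 0.1 * rng.normal(size=200)])

# Mean-center, then diagonalize the 2x2 covariance matrix.
centered = data - data.mean(axis=0)
cov = np.cov(centered, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order

# The eigenvector with the largest eigenvalue is the first PC.
pc1 = eigvecs[:, -1]
scores_1d = centered @ pc1               # one number per sample

explained = eigvals[-1] / eigvals.sum()
print(f"PC1 retains {explained:.1%} of the total variance")
```

Because the two variables are nearly collinear, a single score per sample keeps most of the variance, which is the sense in which the original information is "concentrated" in the first PC.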

PCA has been employed to characterize the SIMS mass spectra of polymers (Vanden Eynde and Bertrand 1999). PCA has also been applied to the interpretation of the TOF-SIMS spectra of protein samples and has extended the application of TOF-SIMS measurement to biomaterials. The TOF-SIMS data analysis steps carried out in PCA are as follows. Prior to analysis, several peaks are selected from each spectrum. One selection criterion can be based on SIMS studies of amino acid homopolymers, as reported by Mantus et al. (1993). The intensities of the secondary ion peaks are normalized to the total ion count before PCA in order to correct for differences in total secondary ion yield from spectrum to spectrum. Moreover, before applying PCA to a data set, it is necessary to pretreat the data properly to ensure that the variance patterns highlighted are truly related to chemical differences between the samples and not to mathematical differences in peak intensities (Wagner et al. 2004).
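The normalization and pretreatment steps might look like the sketch below. The function names, the counts, and the choice of mean-centering as the pretreatment are assumptions made for illustration; the source does not prescribe a particular scaling.

```python
import numpy as np

def normalize_to_total(peak_counts, total_ion_counts):
    """Divide each spectrum's selected peaks by that spectrum's total ion count."""
    peaks = np.asarray(peak_counts, dtype=float)
    totals = np.asarray(total_ion_counts, dtype=float).reshape(-1, 1)
    return peaks / totals

def mean_center(matrix):
    """Subtract the column mean so every peak varies around zero."""
    return matrix - matrix.mean(axis=0)

# Hypothetical counts: 3 spectra x 4 selected peaks.
# Spectrum 2 mimics spectrum 1 acquired at twice the secondary ion yield.
peak_counts = [[120, 30, 50, 200],
               [240, 60, 100, 400],
               [90, 80, 40, 150]]
total_ion_counts = [10000, 20000, 9000]

normalized = normalize_to_total(peak_counts, total_ion_counts)
pretreated = mean_center(normalized)
```

After normalization the first two rows coincide, so the yield difference between them no longer contributes variance to a subsequent PCA.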

A typical data set X with m samples and n peaks can then be written as a matrix with m rows and n columns. These data are transformed into PCs by multiplication with the appropriate matrix. The scores on each PC indicate the character of each sample, and one advantage of PCA is that the analyst can easily determine how unspecified data fit within a particular data category by considering the scores and loadings. A plot of the first two or three scores reveals the groupings, outliers, and other strong patterns in the data. Since PCA can be applied to any data matrix and is an unsupervised and hence unbiased analysis, it is recommended as an initial step in any multivariate analysis so as to obtain a first look at the structure of the data, to help identify outliers, to delineate classes, and so on. However, when the objective is classification or relating one set of variables to another, there are extensions of PCA that are more efficient for these purposes.
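A minimal sketch of how scores and loadings could be obtained from an m x n matrix, and how an unspecified spectrum could be placed in the same PC space, is given below. The helper names `fit_pca` and `project` and the data values are hypothetical; the singular value decomposition is one common way to compute PCA, not necessarily the one used in the cited studies.

```python
import numpy as np

def fit_pca(X, n_components=2):
    """Return column means, loadings (n x k) and scores (m x k) for X (m x n)."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    centered = X - mean
    # Rows of vt are the principal axes, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    loadings = vt[:n_components].T
    scores = centered @ loadings
    return mean, loadings, scores

def project(x_new, mean, loadings):
    """Scores of a new spectrum in the PC space of the fitted data."""
    return (np.asarray(x_new, dtype=float) - mean) @ loadings

# Hypothetical normalized intensities: 6 spectra x 3 peaks, two loose groups.
X = [[0.50, 0.30, 0.20],
     [0.52, 0.28, 0.20],
     [0.49, 0.31, 0.20],
     [0.20, 0.30, 0.50],
     [0.22, 0.29, 0.49],
     [0.19, 0.31, 0.50]]
mean, loadings, scores = fit_pca(X)

# An unspecified spectrum is placed on the same axes as the known ones.
unknown = project([0.51, 0.29, 0.20], mean, loadings)
```

Comparing `unknown` with the rows of `scores` shows which group the new spectrum falls closest to, while the columns of `loadings` indicate which peaks drive each PC.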

Discriminant analysis techniques such as LDA discriminate between several groups by means of discriminant functions, and LDA is a supervised analysis technique. The origin of each data point (the group to which the sample belongs) is taken into account during the LDA calculation so that data from samples of the same group are related to one another. LDA is applied in particular to find specific secondary ion peaks that can be used to characterize the samples.
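To make the supervised character of LDA concrete, here is a minimal two-class Fisher discriminant in NumPy. It is a simplified sketch: the function name `fisher_lda`, the synthetic data, and the use of a pseudo-inverse are assumptions for illustration, and real TOF-SIMS data sets with many peaks and few spectra would need more care.

```python
import numpy as np

def fisher_lda(class_a, class_b):
    """Two-class Fisher discriminant: weight vector w and midpoint threshold."""
    a = np.asarray(class_a, dtype=float)
    b = np.asarray(class_b, dtype=float)
    mean_a, mean_b = a.mean(axis=0), b.mean(axis=0)
    # Pooled within-class scatter matrix (uses the known group labels).
    s_w = (a - mean_a).T @ (a - mean_a) + (b - mean_b).T @ (b - mean_b)
    # pinv guards against a singular scatter matrix when spectra are few.
    w = np.linalg.pinv(s_w) @ (mean_a - mean_b)
    threshold = 0.5 * (mean_a + mean_b) @ w
    return w, threshold

# Synthetic intensities: 5 spectra x 3 peaks per group (hypothetical values).
rng = np.random.default_rng(1)
sample = np.array([0.50, 0.30, 0.20]) + 0.02 * rng.normal(size=(5, 3))
reference = np.array([0.20, 0.30, 0.50]) + 0.02 * rng.normal(size=(5, 3))

w, threshold = fisher_lda(sample, reference)

# Peaks ordered by the magnitude of their discriminant weight.
ranking = np.argsort(-np.abs(w))
print("peak indices ranked by |weight|:", ranking)
```

A spectrum `x` is assigned to the first group when `x @ w` exceeds `threshold`. The weights give a rough indication of which peaks separate the groups, although their magnitudes depend on how the peaks were scaled.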

Mutual information, which is defined by information theory (Shannon and Weaver 1947), was employed to select peaks from numerous candidates in the TOF-SIMS spectra of proteins (Aoyagi et al. 2003, 2004a, b). Mutual information (Eckschlager 1990; Shannon and Weaver 1947) is obtained by subtracting the a posteriori entropy (uncertainty) from the a priori entropy (uncertainty). In this formulation, the a posteriori entropy is defined as the information entropy that remains after an event.

The calculation steps are as follows. Suppose the number of TOF-SIMS spectra is N and they are classified into two categories, the sample and the reference sample. The number of spectra belonging to the sample is n(a1) and that belonging to the reference sample is n(a2). In terms of sample categories, the information entropy S(A) is defined by the following equation:

$$S(A) = -\sum_{i=1}^{2} p(a_i)\,\log_2 p(a_i)$$

where the probability p(ai) = n(ai)/N (i = 1, 2) and S(A) is the amount of information needed to determine the a priori category of a spectrum. With a certain peak-intensity threshold V, the set of spectra is split into two subsets, B1 and B2. Spectra whose peak intensity is greater than V are assigned to B1, and their number is n(b1); spectra whose peak intensity is less than V are assigned to B2, and their number is n(b2). Therefore, the information entropy of the splitting induced by V, S(B), is defined by the equation:

$$S(B) = -\sum_{j=1}^{2} p(b_j)\,\log_2 p(b_j), \qquad p(b_j) = \frac{n(b_j)}{N}$$

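Mutual information I(A;B) is defined by the equation:

$$I(A;B) = S(A) - S(A \mid B), \qquad S(A \mid B) = -\sum_{j=1}^{2} \frac{n(b_j)}{N} \sum_{i=1}^{2} p(a_i \mid b_j)\,\log_2 p(a_i \mid b_j)$$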

where the probability p(ai|bj) = n(ai|bj)/n(bj), S(A) is the a priori uncertainty, and S(A|B) is the a posteriori uncertainty. The term n(ai|bj) is the number of spectra belonging to sample category i among the spectra in subset Bj (for j = 1, those with peak intensity greater than V). The best value of V is the one that gives the largest I(A;B). When I(A;B) = S(A), the peak intensity classifies every spectrum into the correct category.
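The calculation described above can be sketched in plain Python as follows. The function names, the base-2 logarithm, the choice of midpoints between observed intensities as candidate thresholds, and the example intensities are all assumptions for illustration rather than details given in the cited work.

```python
from math import log2

def entropy(counts):
    """Shannon entropy in bits for a list of category counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def mutual_information(intensities, labels, threshold):
    """I(A;B) = S(A) - S(A|B) for the split 'intensity > threshold'."""
    n = len(intensities)
    categories = sorted(set(labels))
    s_a = entropy([labels.count(c) for c in categories])
    s_a_given_b = 0.0
    for above in (True, False):
        subset = [lab for x, lab in zip(intensities, labels)
                  if (x > threshold) == above]
        if subset:
            weight = len(subset) / n  # p(b_j) = n(b_j) / N
            s_a_given_b += weight * entropy([subset.count(c) for c in categories])
    return s_a - s_a_given_b

def best_threshold(intensities, labels):
    """Scan midpoints between distinct intensities; return (V, I) with largest I."""
    values = sorted(set(intensities))
    candidates = [(lo + hi) / 2 for lo, hi in zip(values, values[1:])]
    return max(((v, mutual_information(intensities, labels, v)) for v in candidates),
               key=lambda pair: pair[1], default=(None, 0.0))

# Hypothetical normalized intensities of one peak in six spectra.
labels = ["sample"] * 3 + ["reference"] * 3
peak = [0.9, 0.8, 0.85, 0.2, 0.1, 0.15]
v, info = best_threshold(peak, labels)
```

Repeating `best_threshold` for every candidate peak and ranking the peaks by the returned mutual information gives one way to select the peaks that best separate the two categories.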

For example, suppose there are TOF-SIMS spectra of two samples, A and B, and we compare the intensities of a certain peak (Fig. 6). In case 1, the peak intensities cannot be separated by any threshold V. Therefore, the a posteriori entropy equals the a priori entropy. This a posteriori entropy is the information entropy after evaluation of the peak intensity with threshold V. In this case, the mutual information is zero. In other words, nothing is clarified by evaluating the peak intensity with V. In case 2, the peak intensities are completely classified by the threshold V. Therefore, the a posteriori entropy is zero, and the mutual information equals the a priori entropy, which is 1 when both samples contribute the same number of spectra and the logarithm is taken to base 2. In other words, when the mutual information equals the a priori entropy, this peak is among the most important peaks for classifying the samples.

Fig. 6. Classification concept based on mutual information. The panels compare a peak in Sample 1 and Sample 2 for the two cases: mutual information → 0 and mutual information → 1
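As a hedged worked example of the two limiting cases and one in between, take hypothetical counts N = 6 with n(a1) = n(a2) = 3 and logarithms to base 2. These numbers are illustrative and do not come from the source. The a priori entropy is the same in all three situations:

$$S(A) = -\tfrac{3}{6}\log_2\tfrac{3}{6} - \tfrac{3}{6}\log_2\tfrac{3}{6} = 1$$

If the threshold leaves both categories evenly mixed in each subset (B1 holds two spectra of each category, B2 holds one of each), nothing is learned:

$$S(A \mid B) = \tfrac{4}{6}\cdot 1 + \tfrac{2}{6}\cdot 1 = 1, \qquad I(A;B) = 1 - 1 = 0$$

If the threshold separates the categories completely (B1 holds the three sample spectra, B2 the three reference spectra):

$$S(A \mid B) = \tfrac{3}{6}\cdot 0 + \tfrac{3}{6}\cdot 0 = 0, \qquad I(A;B) = 1 - 0 = 1$$

For a partial separation (B1 holds three sample spectra and one reference spectrum, B2 holds two reference spectra):

$$S(A \mid B) = \tfrac{4}{6}\left(-\tfrac{3}{4}\log_2\tfrac{3}{4} - \tfrac{1}{4}\log_2\tfrac{1}{4}\right) + \tfrac{2}{6}\cdot 0 \approx 0.541, \qquad I(A;B) \approx 0.459$$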

Partial least squares regression has also been applied to the interpretation of TOF-SIMS spectra (Wagner et al. 2004b), and an artificial neural network (ANN) has been employed to chemically classify the SIMS spectra of adsorbed protein films. Sanni et al. (2002) compared PCA and ANN in the characterization of protein spectra and reported that the ANN technique was superior in distinguishing the spectra of all of the adsorbed protein films using the entire mass spectrum. Other chemometric analysis techniques (Eckschlager et al. 1990; Gallagher et al. 2004; Vogt and Mizaikoff 2003) should also prove helpful for the correct interpretation of the TOF-SIMS spectra of complicated samples.
