Probabilistic machine learning


Unlike other methods, probabilistic machine learning rests on one consistent principle that is applied throughout the entire inference procedure. Probabilistic methods approach inference of latent variables, model coefficients, nuisance parameters and model order by applying Bayesian theory. Hence all unknown variables are treated identically, which is mathematically appealing. For computational reasons a fully probabilistic model may not be feasible; in such situations we have to resort to approximations. Obviously, for a methodology that has to stand the test in an empirical discipline, a mathematical consistency argument alone is not too convincing. So why should one use probabilistic methods?
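
As a minimal illustration of this single principle, write D for the data, t for the quantity we wish to predict and θ for all remaining unknowns (latent variables, model coefficients, nuisance parameters, model order). Every inference task then reduces to Bayes' theorem followed by marginalization:

    p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{\int p(D \mid \theta')\, p(\theta')\, \mathrm{d}\theta'},
    \qquad
    p(t \mid D) = \int p(t \mid \theta, D)\, p(\theta \mid D)\, \mathrm{d}\theta .

The approximations mentioned above (e.g. MCMC or variational Bayes) only change how these integrals are computed, not the principle itself.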

Advantages of probabilistic models

Disadvantages of probabilistic models

Bayesian sensor fusion

The basic idea of Bayesian sensor fusion is to take the uncertainty of information into account. In machine learning the seminal work is that of (MacKay 1992), which discussed the effects of model uncertainty. These ideas were later extended to input uncertainty in (Wright 1999). Related ideas have been used by (Dellaportas & Stephens 1995), who discuss models for errors in observables. I got interested in these issues in the context of hierarchical models in which the model parameters of a feature extraction stage are used for diagnosis or segmentation purposes. Such models are used, for example, in sleep analysis and in the context of brain computer interfaces (BCI). In a Bayesian sense these features are latent variables and should be treated as such. Again this is a consistency argument that has to be examined for its practical relevance.

In order to obtain a hierarchical model that performs sensor fusion, we simply regard the feature extraction stage as a latent space and integrate (marginalize) over all its uncertainties. The figure on the left (a sensor fusing DAG) compares this with current practice in many applications of probabilistic modeling, which regards extracted features as observations. I reported a first attempt to approach this problem in (Sykacek 1999), which is described in more detail in section 4 of my Ph.D. thesis.

In order to see that a latent feature space has practical advantages, we consider a very simple case where two sensors are fused in a naive Bayes manner to predict the posterior probability in a two class problem. The model is similar to the one in the graph above, but with two latent feature stages that are assumed independent conditional on the state of interest t. The plot on the right (an illustration of sensor fusion) shows the effect of knowing one of the latent features with varying precision. Conditioning on a best estimate obviously results in probabilities that are independent of the precision; we hence obtain a flat line with a probability of class "2" of about 0.27. Marginalization changes this probability. Depending on how much the uncertainties differ, we can, as the figure illustrates, also obtain a different predicted state. We may thus expect improvements in cases where the precision of the distributions in the latent feature space varies. We have successfully applied an HMM-based latent feature space model to the classification of segments of multi-sensor time series. Such problems arise in clinical diagnosis (sleep classification) and in the context of brain computer interfaces (BCI). An MCMC implementation and an evaluation on synthetic and BCI data have been published in (Sykacek & Roberts 2002). This work was also the topic of a talk I gave at the NCRG in Aston in July 2002; a pdf version of the slides is available here. Recently (Beal et al. 2002) applied similar ideas to sensor fusion of audio and video data.
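
The marginalization effect described above is easy to reproduce. The MatLab/Octave sketch below uses made-up numbers and is an illustration only, not the published model: the plug-in rule evaluates the Gaussian class likelihoods at the point estimate of the latent feature, whereas marginalization convolves them with the Gaussian uncertainty of that estimate, so the predicted probability depends on the feature precision and the predicted class may change.

    % Illustrative sketch (hypothetical numbers): one noisy latent feature with
    % estimate xhat and precision lam; Gaussian class conditionals N(x; mu_t, s2_t).
    % Marginalization replaces N(xhat; mu_t, s2_t) by N(xhat; mu_t, s2_t + 1/lam).
    gauss = @(x, m, v) exp(-0.5 * (x - m).^2 ./ v) ./ sqrt(2*pi*v);
    mu   = [0, 2];               % class conditional means
    s2   = [0.5, 4.0];           % class conditional variances (deliberately unequal)
    pri  = [0.5, 0.5];           % class priors
    xhat = 1.2;                  % best estimate of the latent feature
    lam  = logspace(-1, 2, 50);  % precision of that estimate
    p_plug = zeros(size(lam)); p_marg = zeros(size(lam));
    for i = 1:numel(lam)
      lik_plug  = gauss(xhat, mu, s2);              % conditioning on the estimate
      lik_marg  = gauss(xhat, mu, s2 + 1/lam(i));   % marginalizing the uncertainty
      p_plug(i) = pri(2)*lik_plug(2) / sum(pri .* lik_plug);
      p_marg(i) = pri(2)*lik_marg(2) / sum(pri .* lik_marg);
    end
    plot(lam, p_plug, 'k--', lam, p_marg, 'b-');
    xlabel('precision of the latent feature'); ylabel('P(class 2 | data)');
    legend('plug-in (flat line)', 'marginalized');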

Adaptive classification based on variational Kalman filtering

The figure on the right shows a directed acyclic graph (DAG) describing the adaptive BCI classifier. Every time instance n represents one second of EEG and the corresponding cognitive state. At instance n, we are given a distribution over the parameter vector wn-1 and a new observation yn. Assuming a two state BCI (e.g. we want to discriminate between cortical activity in the left and right motor cortex), yn ∈ {0,1} and the BCI classifier predicts the probability P(yn|D). We use D to denote all past observations, including previous sessions. The data D result in a Gaussian distribution over the previous parameter, p(wn-1|D), with mean ûn-1 and precision matrix Λn-1. As soon as we observe yn, we can update the distribution to obtain p(wn-1|D, yn). The hyperparameter λ can be regarded as a forgetting factor that controls how strongly the algorithm tracks changes. Rather than setting its value directly, we use a hierarchical setup: we specify a Gamma prior p(λ|α, β) and infer the posterior distribution p(λ|yn-N,...,yn-1, α, β). This assumes a stationary adaptation rate λ within a window of size N.

The adaptive classifier requires values for α, β and the window length N. Based on investigations with stationary data and with data with known non-stationary behavior, we suggest α = 0.01, β = 10^-4 and N = 10. Results on a synthetic data set with these and similar settings are shown in the figure on the left. The upper plot shows the expectation of λ under the posterior; the second plot shows the instantaneous generalization accuracy. The data set is stationary, apart from a non-stationarity at sample 500, where we flip the labels. Although different values of N lead to different estimates of the optimal adaptation rate, the effect on the generalization accuracy is small, and the best compromise between rapid tracking and high stationary accuracy is obtained with a window of size 10. As a final comment we would like to mention that these settings allow the algorithm to run in real time, a vital property for a BCI.
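
To make the recursion concrete, the sketch below shows one plausible update step. It is not the published variational Kalman filter: the probit-type observation model and the variational treatment of λ are replaced by a logistic link with a single Laplace (Newton) step, and the forgetting factor is assumed to enter as an additive covariance term 1/λ.

    % Sketch of one adaptive update of a Gaussian posterior over classifier
    % weights (a simplification for illustration, not the published algorithm).
    % m, P .. mean and covariance of p(w|D); x .. feature vector; y in {0,1}.
    function [m, P] = adapt_step(m, P, x, y, lambda)
      d = numel(m);
      P = P + (1/lambda) * eye(d);              % diffuse the old posterior (tracking)
      p = 1 / (1 + exp(-(m' * x)));             % predicted P(y=1) at the prior mean
      prec = inv(P) + p * (1 - p) * (x * x');   % Laplace posterior precision
      P = inv(prec);                            % updated covariance
      m = m + P * ((y - p) * x);                % single Newton step from the prior mean
    end

Sweeping such an update over a stream of observations, while λ is re-estimated on a sliding window of the last N labels, gives the kind of adaptive behavior shown in the figure.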

References

(Beal et al. 2002)
M. J. Beal, H. Attias and N. Jojic. A Self-Calibrating Algorithm for Speaker Tracking Based on Audio-Visual Statistical Models. In Proc. Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), May 2002.
(Bernardo & Smith 1994)
J. M. Bernardo and A. F. M. Smith. Bayesian Theory. John Wiley & Sons, Chichester UK, 1994.
(Dellaportas & Stephens 1995)
P. Dellaportas and S. A. Stephens. Bayesian analysis of errors-in-variables regression models. Biometrics, 51:1085-1095, 1995.
(MacKay 1992)
D. J. C. MacKay. Bayesian interpolation. Neural Computation, 4(3):415-447, 1992.
(Sykacek 1999)
P. Sykacek. Learning from uncertain and probably missing data. OEFAI, 1999. An abstract is available on the workshop homepage.
(Sykacek et al. 2003 b)
P. Sykacek, I. Rezek and S. J. Roberts. Bayes Consistent Classification of EEG Data by Approximate Marginalisation. In S. J. Roberts and R. Dybowski, editors, Applications of Probabilistic Models for Medical Informatics and Bioinformatics, to appear, Springer Verlag, 2003.
(Sykacek & Roberts 2003)
P. Sykacek and S. J. Roberts. Adaptive classification by variational Kalman filtering. In S. Thrun, S. Becker and K. Obermayer, editors, Advances in Neural Information Processing Systems 15, pages 737-744, MIT Press, 2003.
(Sykacek & Roberts 2002)
P. Sykacek and S. Roberts. Bayesian time series classification. In T. G. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems 14, pages 937-944, MIT Press, 2002.
(Wright 1999)
W. A. Wright. Bayesian approach to Neural-Network modeling with input uncertainty. IEEE Trans. Neural Networks, 10:1261-1270, 1999.

Online Supplement to “Bayesian Modelling of Shared Gene Function”

This work was done by the authors while contributing to the BBSRC funded "Shared Genetic Pathways in Cell Number Control" research program, which was awarded to the Department of Pathology, University of Cambridge, UK. As the title suggests, the project investigates the molecular biological processes that control development cycles in different biological systems. The search for the underlying genetic markers requires a principled approach that can infer which genes are of shared importance across several microarray experiments. For that purpose we propose a fully Bayesian model for the analysis of shared gene function. The approach assumes that several microarray experiments with known cross annotations between transcripts (genes) are to be analyzed for common genetic markers. A particular advantage of the implementation described in this work is that it combines data sets before applying thresholds, so the result is independent of that choice. For more information on the method we refer to the original paper (Sykacek et al 2007 a) and the pdf supplement (Sykacek et al 2007 b).

Experimental Supplement

Biological Details

The analysis of development processes in many tissues is complicated by several interacting biological processes and a mix of various cell types. As an example, we investigate in this work the shared biological activity at gene level in a mouse mammary gland development cycle (Clarkson et al 2004) and in a human endothelial cell culture with apoptosis induced by serum withdrawal (Johnson et al 2004). The biological complexity of the experiments is best visualized if we mark, for each development stage, which biological processes are active at a macro level. For the mouse mammary time course we get the following table of active processes. During lactation time is given in days, during involution in hours.

Biological Processes During Lactation and Involution in Mouse Mammary Development
Biological Process    L0     L5     L10    I12    I24    I48    I72    I96
Type I Apoptosis      -      -      -      +      +      ?      -      -
Type II Apoptosis     -      -      -      -      -      ?      +      +
Apoptosis             -      -      -      +      +      +      +      +
Differentiation       +      +      +      ?      -      -      -      -
Inflammation          ?      -      -      +      +      ?      -      -
Remodeling            -/(?)  -      -      -      -      ?      +      +
Acute Phase           +      -      -      -      +      +      +      +

We use "+" to indicate that a process is active and "-" to indicate its inactivity. A "?" marks epochs where we are uncertain about the process activity. A similar, though simpler, classification can be obtained for the second experiment, which studies human endothelial cells under serum deprivation. The duration of serum deprivation is given in hours.

Biological Processes in Endothelial Cells Under Serum Withdrawal
Biological Process    control (t0)    t28
Type II Apoptosis     -               +
Apoptosis             -               +
Differentiation       +               -

Results

In addition to the results presented in the original paper and in the pdf supplement, we provide here the top 20 genes that we find contribute to both data sets.

Ranking From Shared Analysis of Mammary Development and Endothelial Cells
Gene Symbol    P(It|D)    P(G=t|D)    Co-Regulation
SAT 0.99951 0.047597 anti
ODC1 0.99921 0.029237 co
GRN 0.99921 0.029125 co
BSCL2 0.99919 0.028601 anti
MLF2 0.99884 0.019988 anti
IFRD2 0.99867 0.017425 co
BTG2 0.99843 0.014688 co
CCNG2 0.99826 0.013274 co
TNK2 0.99789 0.010943 anti
C9orf10 0.99783 0.010614 co
HAGH 0.99764 0.0097747 co
PPP2CB 0.99759 0.0095567 anti
SSR1 0.99748 0.0091528 co
MUT 0.99747 0.0091039 co
DHRS3 0.99746 0.0090926 co
PSMA1 0.99741 0.0089018 anti
HBLD2 0.99732 0.0086073 co
SYPL1 0.99724 0.0083639 co
C2F 0.99723 0.0083374 co
ATP6V1B2 0.99706 0.0078419 anti

The full gene list is available in comma separated format as a zipped archive. To check which biological processes are attributed to this list, we follow the suggestion in (Lewin et al. 2006) and use Fisher's exact test to infer significance levels of active gene ontology (GO) categories from the probabilistic rank list (see also (Al-Shahrour et al.)). The resulting GO categories for the gene list of this shared analysis are available as comb_apo_all.xml in xml format. This file is compatible with Treemap - (C) University of Maryland - and preserves the parent-child relationships of the directed acyclic GO graph. Note that the treeml.dtd file is part of the Treemap package and not available here. Treemap is distributed under a non-commercial license; if it is unavailable, the xml file can also be inspected with any reasonable web browser.
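
As a rough sketch of this enrichment step (the counts are invented and the actual tool follows the cited references), the one-sided Fisher exact test for a single GO category reduces to a hypergeometric tail probability over a 2x2 table of top-ranked versus remaining genes and in-category versus out-of-category genes:

    % Hypothetical counts for one GO category.
    N = 5000;    % annotated genes in total
    K = 120;     % genes annotated with this category
    n = 200;     % genes taken from the top of the probabilistic rank list
    k = 15;      % of those, genes falling into the category
    x = k:min(K, n);                                  % at least k in-category hits
    logC = @(a, b) gammaln(a+1) - gammaln(b+1) - gammaln(a-b+1);  % log binomial coeff.
    logp = logC(K, x) + logC(N-K, n-x) - logC(N, n);  % hypergeometric log pmf
    pval = sum(exp(logp));                            % one-sided Fisher exact p-value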

Software

The software for calculating the indicator probabilities that capture shared gene function comes as a collection of MatLab libraries. The package consists of the main code, which implements the variational Bayesian approach described in (Sykacek et al 2007 a), and additional functions for data handling, output generation and an EM implementation of regularized probit link regression that is used to initialize all Q-distributions. To keep the software distribution flexible, the MatLab functions are collected in archives, each containing functions of a particular type. The software is available under the GPL 2 license and comes without any warranty.
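
The probit EM used for initialization follows a standard construction; the sketch below is our own illustration of that idea, not the distributed code. Each label is treated as a thresholded latent Gaussian utility, the E-step takes the posterior mean of that utility, and the M-step solves a ridge regression.

    % Sketch of penalized (MAP) EM for probit link regression in base MatLab/Octave.
    % X .. n x d design matrix, y .. labels in {0,1}, alpha .. ridge penalty.
    function w = probit_em(X, y, alpha, niter)
      d = size(X, 2);
      w = zeros(d, 1);
      npdf = @(a) exp(-0.5 * a.^2) / sqrt(2*pi);
      ncdf = @(a) 0.5 * (1 + erf(a / sqrt(2)));
      for it = 1:niter
        eta = X * w;                                 % linear predictor
        % E-step: posterior means of the truncated latent utilities
        z = eta + (y == 1) .* (npdf(eta) ./ max(ncdf(eta), eps)) ...
                - (y == 0) .* (npdf(eta) ./ max(1 - ncdf(eta), eps));
        % M-step: ridge regression of the expected utilities on X
        w = (X' * X + alpha * eye(d)) \ (X' * z);
      end
    end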

Installation

To install the package, one has to download all required archives, provided as *.zip files or tarred gzip archives (*.tar.gz), unpack them and set the appropriate MatLab paths. Scripts using a hypothetical experiment derived from a mouse testis time course, kindly provided by R. Furlong, demonstrate how to use the library.

Library Files for Shared Analysis (each archive is also available as a .zip file instead of .tar.gz)
Library File Description
helpers.tar.gz generic helper functions
statsgen.tar.gz generic statistics functions
mca_base.tar.gz basic microarray file handling (loading various microarray data formats)
mca_fuse.tar.gz generic handling in connection with shared analysis (cross annotation and output generation)
probitem.tar.gz Penalized maximum likelihood (MAP) for probit link regression via an EM algorithm.
combanalysis.glb.hphp.tar.gz Variational Bayes for shared analysis of subset probabilities in probit regression.
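
If the archives are unpacked by hand rather than with the setup script described below, the MatLab path can be set along the following lines; the directory names are assumptions and should be adjusted to wherever the archives were actually extracted.

    % Hypothetical path setup after manually unpacking the library archives.
    libroot = '/path/to/shared_analysis';    % adjust to your local installation
    libs = {'helpers', 'statsgen', 'mca_base', 'mca_fuse', 'probitem', 'combanalysis'};
    for i = 1:numel(libs)
      addpath(fullfile(libroot, libs{i}));   % assumes one sub-directory per archive
    end
    savepath;                                % optional: keep the paths across sessions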

All components required to run the experiment are installed automatically if one creates a new directory and then downloads and runs the setup script in that directory. Linux (Unix) users should use combsyssetup.sh; after downloading it, you might have to set the executable permission with the command chmod +x combsyssetup.sh. Windows users should either do the same after installing a cygwin environment, or install the Wget and Unzip packages from GnuWin32 and then download and run combsyssetup.bat. Note that the script installs all required packages and, if run again at a later time, installs updates.

Tutorial

After running the script, the installation directory contains MatLab scripts and data which illustrate the approach discussed in (Sykacek et al 2007 a). The data are extracted from a subset of a mouse testis development time course, kindly provided by R. Furlong. The data consist of 7 time points: adult, day 1, day 5, day 10, day 15, day 23 and day 35, with differential expression measured against the adult generation.

Data Files for the Tutorial

To illustrate all steps from cross annotation to the generation of gene lists, we divided these data artificially into two "experiments". The first experiment contains the samples of the adult generation and of days 1 and 15; here we use the original gene ids. We assume that the biological state change is between the adult generation and the two other development periods. The corresponding data file is called "exp1mca.tsv" and is formatted like FSPMA normalized raw output: gene ids are used as column headers and all samples follow as rows below. We also have a corresponding effects description, "exp1eff.tsv", which is used to generate the labels. The second experiment contains days 35, 23, 5 and 10 and artificially modified gene ids, to mimic a situation that requires between-species annotation. Here we assume that the biological states correspond to days 35 and 23 versus days 5 and 10. The files are "exp2mca.tsv" for the microarray data and "exp2eff.tsv" for the labels. Note that the assumption is that each experiment provides information about differences between late and early stages of testis development. The analysis goal is thus similar to a problem where we attempt to combine two experiments obtained from different platforms or species. This requires "cross annotation", which is here done according to the tab delimited file "crossann.tsv". In general, each row in this file contains a tuple that provides a unique mapping between all the different unique gene ids found in a shared analysis. To complete the list of files, we also provide the tab delimited file "genespec.csv", which maps the unique gene ids in the target genome to gene symbols and descriptions.

Data Files - Summary Table
File Name Description
exp1mca.tsv, exp2mca.tsv Normalized log ratios (location and scale adjustment)
exp1eff.tsv, exp2eff.tsv (default) Labels
crossann.tsv cross annotations between the different gene ids found in the gene lists
genespec.csv mapping from unique gene ids (for the cross annotation target) to standardized symbols and gene descriptions
shareanalysis.def specification of the cross annotated experiments that enter shared analysis

Matlab Scripts for the Tutorial

Script files, to be run in the order listed in the table.
File Name Description
crossann.m cross annotation of microarray experiments and preparation of shared analysis
runsim.m, calccoreg.m calculation of gene indicator probabilities of shared gene function by variational Bayesian inference
combres2csv.m extraction of gene ranking w.r.t. shared gene function as a tab delimited file

Both artificial experiments have to be cross annotated. This step aligns the gene ids of the different experiments and produces two raw data files and a gene id to symbol and description annotation in MatLab 6 format. Cross annotation is done with the MatLab script crossann.m found in the installation folder.

After cross annotation, we have to prepare the shared analysis. This requires a tab delimited text file ("shareanalysis.def") which controls the process. The minimal requirement is to specify in this control file which (previously cross annotated) data files should be analyzed for shared gene function. In addition, one can specify a different set of labels, which is useful for analyzing the same data under different biological classifications. We may also provide independent test data, which will be used to obtain generalization errors. The analysis is started with the script combanalysis.m in the installation folder. The simulation will, depending on the size of the problem and the mode of analysis, take up to several hours (this example, however, finishes in less than one minute, or in a few minutes if fold results are required). As a result we get all simulation output in MatLab format. Interpreting the calculated results in detail requires looking into the code and analyzing the variables stored in the MatLab output file (see the short example after the result file table below).

The last step in an analysis of shared function is to generate a rank table of shared gene function. This is done with the script combres2csv.m, which is also found in this folder. The result is a rank list with a structure similar to the one provided in the supplement of the original paper.

All intermediate results generated during a shared analysis are stored in MatLab 6 format (for Octave compatibility). The final rank table is a comma separated file.

Summary of result files as generated from this simulation.
File Name Description
exp1.mat, exp2.mat cross annotated and normalized raw data
crossanngenespec.mat reordered gene specifications (id, symbol and description)
state.mat, crosslog.mat internal log files (see code)
sharetestres.mat inference result about shared gene function. This file contains all results including probabilities, predictions and all Q-distributions found from variational Bayesian inference.
share_test_rank.csv rank table as comma separated file.
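
Since all output is stored as plain MatLab 6 files, the results can be inspected directly; the snippet below simply loads the main result file and lists the stored variables (their names are not documented here, so we only enumerate them).

    % Load the inference results into a struct and list what was stored.
    res = load('sharetestres.mat');
    disp(fieldnames(res));      % variable names as saved by the analysis scripts
    % individual variables are then accessible as fields of res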

Adapting Shared Analysis to Different Experiments

To run such an analysis on a different experiment, one must provide data files structured like those listed in the data files table. The structure of the microarray data and of the default labels is identical to the output generated by FSPMA, which can thus be used as a preprocessing tool. In addition, one has to generate a file which allows cross annotation between all gene id sets that appear in any of the data sets. If there is only one set of gene ids, cross annotation should be done anyway, to obtain the data in the format expected by runsim.m. In this case all parts of runsim.m that specify the shared analysis will refer to the same gene id column. Inference of shared gene function additionally requires a control file similar to shareanalysis.def. Finally, one has to adjust all script files in the script files summary to meet the new requirements.

Acknowledgments

This page describes joint work with R. Clarkson, C. Print, R. Furlong and G. Micklem. We also thank David MacKay for advice. The project was mostly carried out at the Departments of Pathology and Genetics, University of Cambridge, as part of the project "Shared Genetic Pathways in Cell Number Control", ref. 8/EGH16106, funded by the BBSRC within their Exploiting Genomics initiative. During completion of this work, Peter Sykacek moved to the Bioinformatics group at BOKU University, Vienna, which is funded by the WWTF, ACBT, Baxter AG, and ARC Seibersdorf.

References

(Al-Shahrour et al.)
F. Al-Shahrour, R. Díaz-Uriarte, and J. Dopazo. FatiGO: A web tool for finding significant associations of Gene Ontology terms with groups of genes. Bioinformatics, 20, pages 578--580, 2004.
(Attias 1999)
H. Attias. Inferring parameters and structure of latent variable models by variational Bayes. in Proceedings of the Fifteenth Annual Conference on Uncertainty in Artificial Intelligence (UAI--99), pages 21-30, 1999.
(Clarkson et al 2004)
R. W. E. Clarkson, M. T. Wayland, J. Lee, T. Freeman and C. J. Watson. Gene expression profiling of mammary gland development reveals putative roles for death receptors and immune mediators in post-lactational regression. in Breast Cancer Res., 6(2), pages 92--109, 2004.
(Johnson et al 2004)
N. A. Johnson, S. Sengupta, S. A. Saidi, K. Lessan, S. D. Charnock-Jones, L. Scott, R. Stephens, T. C. Freeman, B. D. Tom, M. Harris, G. Denyer, M. Sundaram, S. K. Smith and C. G. Print. Endothelial cells preparing to die by apoptosis initiate a program of transcriptome and glycome regulation. in The FASEB Journal, 18, pages 188--190, 2004.
(Lewin et al. 2006)
A. Lewin, S. Richardson, C. Marshall, A. Glazier, and T. Aitman. Bayesian Modelling of Differential Gene Expression. in Biometrics, 62(1), pages 10--18, 2006.
(Sykacek et al 2007 a)
P. Sykacek, R. Clarkson, C. Print, R. Furlong and G. Micklem. Bayesian Modeling of Shared Gene Function. in Bioinformatics, 2007, pages 1936--1944; doi: 10.1093/bioinformatics/btm280. An abstract and a pdf preprint are available from Bioinformatics online; a local draft version is available in gzipped postscript and pdf.
(Sykacek et al 2007 b)
P. Sykacek, R. Clarkson, C. Print, R. Furlong, and G. Micklem. Supplement to: "Bayesian Modeling of Shared Gene Function". Technical report, Department of Biotechnology, BOKU University, Vienna, 2007, available in gzipped postscript and pdf.