
# Gaussian Processes for Machine Learning

C. E. **Rasmussen** & C. K. I. Williams, *Gaussian Processes for Machine Learning*, the MIT Press, 2006, ISBN 026218253X. © 2006 Massachusetts Institute of Technology.


### Mathematical Background

#### Joint, Marginal and Conditional Probability

Let the $n$ (discrete or continuous) random variables $y_1,\ldots,y_n$ have a joint probability $p(y_1,\ldots,y_n)$, or $p(\mathbf{y})$ for short. Strictly speaking, one ought to distinguish between probabilities (for discrete variables) and probability densities (for continuous variables); throughout the book we commonly use the term probability to refer to both.

Let us partition the variables in $\mathbf{y}$ into two groups, $\mathbf{y}_A$ and $\mathbf{y}_B$, where $A$ and $B$ are two disjoint sets whose union is the set $\{1,\ldots,n\}$, so that $p(\mathbf{y}) = p(\mathbf{y}_A, \mathbf{y}_B)$. Each group may contain one or more variables. The marginal probability of $\mathbf{y}_A$ is given by

$$p(\mathbf{y}_A) = \int p(\mathbf{y}_A, \mathbf{y}_B)\, d\mathbf{y}_B.$$

The integral is replaced by a sum if the variables are discrete valued. Notice that if the set $A$ contains more than one variable, then the marginal probability is itself a joint probability; whether it is referred to as one or the other depends on the context. If the joint distribution is equal to the product of the marginals, then the variables are said to be independent; otherwise they are dependent.

The conditional probability is defined as

$$p(\mathbf{y}_A \mid \mathbf{y}_B) = \frac{p(\mathbf{y}_A, \mathbf{y}_B)}{p(\mathbf{y}_B)},$$

defined for $p(\mathbf{y}_B) > 0$, as it is not meaningful to condition on an impossible event. If $\mathbf{y}_A$ and $\mathbf{y}_B$ are independent, then the marginal $p(\mathbf{y}_A)$ and the conditional $p(\mathbf{y}_A \mid \mathbf{y}_B)$ are equal. (One can deal with more general cases, where the density function does not exist, by using the distribution function.)

#### Bayes' Rule

Using the definitions of both $p(\mathbf{y}_A \mid \mathbf{y}_B)$ and $p(\mathbf{y}_B \mid \mathbf{y}_A)$ we obtain Bayes' theorem,

$$p(\mathbf{y}_A \mid \mathbf{y}_B) = \frac{p(\mathbf{y}_A)\, p(\mathbf{y}_B \mid \mathbf{y}_A)}{p(\mathbf{y}_B)}.$$

Since conditional distributions are themselves probabilities, one can use all of the above also when further conditioning on other variables.
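As a concrete check of these definitions, here is a small NumPy sketch (not from the book; the joint table values are illustrative only) that computes marginals and conditionals from a discrete joint probability table and verifies Bayes' rule:

```python
import numpy as np

# Toy joint probability table p(yA, yB) for two binary variables:
# rows index yA, columns index yB. The numbers are illustrative only.
joint = np.array([[0.10, 0.30],
                  [0.20, 0.40]])

p_A = joint.sum(axis=1)          # marginal p(yA): sum out yB
p_B = joint.sum(axis=0)          # marginal p(yB): sum out yA

cond_A_given_B = joint / p_B              # p(yA|yB) = p(yA, yB) / p(yB)
cond_B_given_A = joint / p_A[:, None]     # p(yB|yA) = p(yA, yB) / p(yA)

# Bayes' rule: p(yA|yB) = p(yA) p(yB|yA) / p(yB)
bayes = p_A[:, None] * cond_B_given_A / p_B
```

Each column of `cond_A_given_B` sums to one, and `bayes` reproduces it, as the rule requires.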
For example, in supervised learning one often conditions on the inputs throughout, which would lead to a version of Bayes' rule with additional conditioning on $X$ in all four probabilities in the equation above.

#### Gaussian Identities

The multivariate Gaussian (or Normal) distribution has a joint probability density given by

$$p(\mathbf{x} \mid \mathbf{m}, \Sigma) = (2\pi)^{-D/2}\,|\Sigma|^{-1/2}\exp\!\Big(-\tfrac{1}{2}(\mathbf{x}-\mathbf{m})^\top \Sigma^{-1}(\mathbf{x}-\mathbf{m})\Big),$$

where $\mathbf{m}$ is the mean vector (of length $D$) and $\Sigma$ is the (symmetric, positive definite) covariance matrix (of size $D \times D$). As a shorthand we write $\mathbf{x} \sim \mathcal{N}(\mathbf{m}, \Sigma)$.

Let $\mathbf{x}$ and $\mathbf{y}$ be jointly Gaussian random vectors,

$$\begin{bmatrix}\mathbf{x}\\\mathbf{y}\end{bmatrix} \sim \mathcal{N}\left(\begin{bmatrix}\boldsymbol{\mu}_x\\\boldsymbol{\mu}_y\end{bmatrix},\begin{bmatrix}A & C\\ C^\top & B\end{bmatrix}\right) = \mathcal{N}\left(\begin{bmatrix}\boldsymbol{\mu}_x\\\boldsymbol{\mu}_y\end{bmatrix},\begin{bmatrix}\tilde A & \tilde C\\ \tilde C^\top & \tilde B\end{bmatrix}^{-1}\right);$$

then the marginal distribution of $\mathbf{x}$ and the conditional distribution of $\mathbf{x}$ given $\mathbf{y}$ are

$$\mathbf{x} \sim \mathcal{N}(\boldsymbol{\mu}_x, A), \qquad \mathbf{x} \mid \mathbf{y} \sim \mathcal{N}\big(\boldsymbol{\mu}_x + CB^{-1}(\mathbf{y}-\boldsymbol{\mu}_y),\; A - CB^{-1}C^\top\big),$$

or equivalently

$$\mathbf{x} \mid \mathbf{y} \sim \mathcal{N}\big(\boldsymbol{\mu}_x - \tilde A^{-1}\tilde C(\mathbf{y}-\boldsymbol{\mu}_y),\; \tilde A^{-1}\big).$$

See also von Mises [1964].

The product of two Gaussians gives another (un-normalized) Gaussian,

$$\mathcal{N}(\mathbf{x}\mid \mathbf{a}, A)\,\mathcal{N}(\mathbf{x}\mid \mathbf{b}, B) = Z^{-1}\,\mathcal{N}(\mathbf{x}\mid \mathbf{c}, C),$$

where $\mathbf{c} = C(A^{-1}\mathbf{a} + B^{-1}\mathbf{b})$ and $C = (A^{-1} + B^{-1})^{-1}$. Notice that the resulting Gaussian has a precision (inverse variance) equal to the sum of the precisions, and a mean equal to the convex sum of the means, weighted by the precisions. The normalizing constant looks itself like a Gaussian (in $\mathbf{a}$ or $\mathbf{b}$),

$$Z^{-1} = (2\pi)^{-D/2}\,|A+B|^{-1/2}\exp\!\Big(-\tfrac{1}{2}(\mathbf{a}-\mathbf{b})^\top (A+B)^{-1}(\mathbf{a}-\mathbf{b})\Big).$$

To prove this, simply write out the (lengthy) expressions by introducing the Gaussian densities into the product, and expand the terms inside the exp to verify equality. Hint: it may be helpful to expand $C$ using the matrix inversion lemma.
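The marginalization and conditioning identities above are easy to verify numerically. The following sketch (illustrative block values, not code from the book) computes the conditional of $\mathbf{x}$ given $\mathbf{y}$ both ways, from the covariance blocks and from the precision blocks:

```python
import numpy as np

# Blocks of the joint covariance [[A, C], [C^T, B]] and the means
# (toy numbers chosen so the joint covariance is positive definite).
mu_x = np.array([0.0, 1.0])
mu_y = np.array([-1.0, 2.0])
A = np.array([[2.0, 0.5], [0.5, 1.0]])
B = np.array([[1.5, 0.2], [0.2, 1.0]])
C = np.array([[0.3, 0.1], [0.0, 0.4]])
y = np.array([0.0, 1.0])                 # observed value of y

# Conditional from the covariance blocks:
# x | y ~ N(mu_x + C B^{-1} (y - mu_y), A - C B^{-1} C^T)
Binv = np.linalg.inv(B)
cond_mean = mu_x + C @ Binv @ (y - mu_y)
cond_cov = A - C @ Binv @ C.T

# The same conditional from the precision blocks [[At, Ct], [Ct^T, Bt]]:
# x | y ~ N(mu_x - At^{-1} Ct (y - mu_y), At^{-1})
Sigma = np.block([[A, C], [C.T, B]])
Prec = np.linalg.inv(Sigma)
At, Ct = Prec[:2, :2], Prec[:2, 2:]
cond_mean2 = mu_x - np.linalg.inv(At) @ Ct @ (y - mu_y)
cond_cov2 = np.linalg.inv(At)
```

The two routes agree, which is exactly the equivalence of the two conditional forms stated above.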
Using the matrix inversion lemma,

$$C = (A^{-1}+B^{-1})^{-1} = A - A(A+B)^{-1}A = B - B(A+B)^{-1}B.$$

#### Generating Multivariate Gaussian Samples

To generate samples $\mathbf{x} \sim \mathcal{N}(\mathbf{m}, K)$ with arbitrary mean $\mathbf{m}$ and covariance matrix $K$ using a scalar Gaussian generator (which is readily available in many programming environments) we proceed as follows: first, compute the Cholesky decomposition (also known as the "matrix square root") $L$ of the positive definite symmetric covariance matrix, $K = LL^\top$, where $L$ is a lower triangular matrix. Then generate $\mathbf{u} \sim \mathcal{N}(\mathbf{0}, I)$ by multiple separate calls to the scalar Gaussian generator. Compute $\mathbf{x} = \mathbf{m} + L\mathbf{u}$, which has the desired distribution with mean $\mathbf{m}$ and covariance $L\,\mathbb{E}[\mathbf{u}\mathbf{u}^\top]L^\top = LL^\top = K$ (by the independence of the elements of $\mathbf{u}$).

In practice it may be necessary to add a small multiple of the identity matrix $\epsilon I$ to the covariance matrix for numerical reasons. This is because the eigenvalues of the matrix $K$ can decay very rapidly, and without this stabilization the Cholesky decomposition fails. The effect on the generated samples is to add additional independent noise of variance $\epsilon$. From the context $\epsilon$ can usually be chosen to have inconsequential effects on the samples, while ensuring numerical stability.

#### Matrix Identities

The matrix inversion lemma, also known as the Woodbury, Sherman & Morrison formula (see Press et al. [1992, p. 75]), states that

$$(Z + UWV^\top)^{-1} = Z^{-1} - Z^{-1}U(W^{-1} + V^\top Z^{-1}U)^{-1}V^\top Z^{-1},$$

assuming the relevant inverses all exist. Here $Z$ is $n\times n$, $W$ is $m\times m$, and $U$ and $V$ are both of size $n\times m$; consequently, if $Z^{-1}$ is known and a low-rank ($m < n$) perturbation is made to $Z$ as on the left-hand side, considerable speedup can be achieved. A similar equation exists for determinants:

$$|Z + UWV^\top| = |Z|\,|W|\,|W^{-1} + V^\top Z^{-1}U|.$$
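Both identities are straightforward to check on a small example. The following sketch (illustrative matrices only) verifies the matrix inversion lemma and its determinant counterpart, inverting only an $m\times m$ matrix on the right-hand side:

```python
import numpy as np

# A rank-m (m = 2) perturbation of an easily inverted n x n (n = 4) matrix.
# All numbers are illustrative only.
Z = np.diag([2.0, 3.0, 4.0, 5.0])
W = np.array([[1.0, 0.0],
              [0.0, 2.0]])
U = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0],
              [0.5, 0.5]])
V = np.array([[0.5, 0.0],
              [0.0, 0.5],
              [0.5, 0.0],
              [0.0, 0.5]])

Zinv = np.linalg.inv(Z)    # cheap: Z is diagonal
Winv = np.linalg.inv(W)

# Left-hand side: direct n x n inverse.
lhs = np.linalg.inv(Z + U @ W @ V.T)
# Right-hand side: only an m x m inverse is needed once Zinv is known.
inner = np.linalg.inv(Winv + V.T @ Zinv @ U)
rhs = Zinv - Zinv @ U @ inner @ V.T @ Zinv

# Determinant version of the lemma.
det_lhs = np.linalg.det(Z + U @ W @ V.T)
det_rhs = np.linalg.det(Z) * np.linalg.det(W) * np.linalg.det(Winv + V.T @ Zinv @ U)
```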
Let the invertible $n\times n$ matrix $A$ and its inverse $A^{-1}$ be partitioned as

$$A = \begin{pmatrix}P & Q\\ R & S\end{pmatrix}, \qquad A^{-1} = \begin{pmatrix}\tilde P & \tilde Q\\ \tilde R & \tilde S\end{pmatrix},$$

where $P$ and $\tilde P$ are $n_1\times n_1$ matrices and $S$ and $\tilde S$ are $n_2\times n_2$ matrices, with $n = n_1 + n_2$. The submatrices of $A^{-1}$ are given in Press et al. [1992, p. 77] as

$$\begin{aligned}
\tilde P &= P^{-1} + P^{-1}QMRP^{-1}\\
\tilde Q &= -P^{-1}QM\\
\tilde R &= -MRP^{-1}\\
\tilde S &= M
\end{aligned}
\qquad\text{where } M = (S - RP^{-1}Q)^{-1},$$

or equivalently

$$\begin{aligned}
\tilde P &= N\\
\tilde Q &= -NQS^{-1}\\
\tilde R &= -S^{-1}RN\\
\tilde S &= S^{-1} + S^{-1}RNQS^{-1}
\end{aligned}
\qquad\text{where } N = (P - QS^{-1}R)^{-1}.$$

#### Matrix Derivatives

The derivatives of the elements of an inverse matrix are given by

$$\frac{\partial}{\partial\theta}K^{-1} = -K^{-1}\frac{\partial K}{\partial\theta}K^{-1},$$

where $\frac{\partial K}{\partial\theta}$ is a matrix of element-wise derivatives. For the log determinant of a positive definite symmetric matrix we have

$$\frac{\partial}{\partial\theta}\log|K| = \operatorname{tr}\Big(K^{-1}\frac{\partial K}{\partial\theta}\Big).$$

#### Matrix Norms

The Frobenius norm $\|A\|_F$ of an $n_1\times n_2$ matrix $A$ is defined as

$$\|A\|_F^2 = \sum_{i=1}^{n_1}\sum_{j=1}^{n_2}|a_{ij}|^2 = \operatorname{tr}(AA^\top)$$

[Golub and Van Loan, 1989, p. 56].

#### Cholesky Decomposition

The Cholesky decomposition of a symmetric, positive definite matrix $A$ decomposes $A$ into a product of a lower triangular matrix $L$ and its transpose,

$$LL^\top = A,$$

where $L$ is called the Cholesky factor. The Cholesky decomposition is useful for solving linear systems with a symmetric, positive definite coefficient matrix $A$. To solve $A\mathbf{x} = \mathbf{b}$ for $\mathbf{x}$, first solve the triangular system $L\mathbf{y} = \mathbf{b}$ by forward substitution and then the triangular system $L^\top\mathbf{x} = \mathbf{y}$ by back substitution. Using the backslash operator, we write the solution as $\mathbf{x} = L^\top\backslash(L\backslash \mathbf{b})$, where the notation $A\backslash \mathbf{b}$ denotes the vector $\mathbf{x}$ which solves $A\mathbf{x} = \mathbf{b}$. Both the forward and backward substitution steps require $n^2/2$ operations when $A$ is of size $n\times n$. The computation of the Cholesky factor $L$ is considered numerically extremely stable and takes time $n^3/6$, so it is the method of choice when it can be applied.
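The two main uses of the Cholesky factor described above, drawing samples from $\mathcal{N}(\mathbf{m}, K)$ and solving symmetric positive definite systems by forward and back substitution, can be sketched as follows (a minimal NumPy/SciPy illustration with arbitrary toy matrices, not code from the book):

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

rng = np.random.default_rng(0)

# Symmetric positive definite test matrix (toy numbers).
A = np.array([[4.0, 1.0, 0.5],
              [1.0, 3.0, 0.2],
              [0.5, 0.2, 2.0]])
b = np.array([1.0, 2.0, 3.0])

# --- Solving A x = b, i.e. x = L^T \ (L \ b) ---
L = cholesky(A, lower=True)                 # A = L L^T, L lower triangular
y = solve_triangular(L, b, lower=True)      # forward substitution: L y = b
x = solve_triangular(L.T, y, lower=False)   # back substitution:   L^T x = y

# --- Drawing samples x ~ N(m, K) as described earlier: x = m + L u ---
m = np.zeros(3)
K = A.copy()
jitter = 1e-10                              # small ridge for numerical stability
Lk = np.linalg.cholesky(K + jitter * np.eye(3))
u = rng.standard_normal((3, 100_000))       # u ~ N(0, I), scalar Gaussian draws
samples = m[:, None] + Lk @ u
```

The empirical mean and covariance of `samples` approach `m` and `K` as the number of draws grows.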
Note also that the determinant of a positive definite symmetric matrix can be calculated efficiently via

$$|A| = \prod_{i=1}^{n}L_{ii}^2, \qquad\text{or}\qquad \log|A| = 2\sum_{i=1}^{n}\log L_{ii},$$

where $L$ is the Cholesky factor of $A$.

#### Entropy and Kullback-Leibler Divergence

The entropy $H[p(\mathbf{x})]$ of a distribution $p(\mathbf{x})$ is a non-negative measure of the amount of uncertainty in the distribution, and is defined as

$$H[p(\mathbf{x})] = -\int p(\mathbf{x})\log p(\mathbf{x})\,d\mathbf{x}.$$

The integral is substituted by a sum for discrete variables. Entropy is measured in bits if the log is to the base 2, and in nats in the case of the natural log. The entropy of a Gaussian in $D$ dimensions, measured in nats, is

$$H[\mathcal{N}(\boldsymbol{\mu}, \Sigma)] = \tfrac{1}{2}\log|\Sigma| + \tfrac{D}{2}\log(2\pi e).$$

The Kullback-Leibler (KL) divergence (or relative entropy) $\mathrm{KL}(p\|q)$ between two distributions $p(\mathbf{x})$ and $q(\mathbf{x})$ is defined as

$$\mathrm{KL}(p\|q) = \int p(\mathbf{x})\log\frac{p(\mathbf{x})}{q(\mathbf{x})}\,d\mathbf{x}.$$

It is easy to show that $\mathrm{KL}(p\|q) \geq 0$, with equality if $p = q$ (almost everywhere). For the case of two Bernoulli random variables $p$ and $q$ this reduces to

$$\mathrm{KL}_{\mathrm{Ber}}(p\|q) = p\log\frac{p}{q} + (1-p)\log\frac{1-p}{1-q},$$

where we use $p$ and $q$ both as the name and the parameter of the Bernoulli distributions. For two Gaussian distributions $\mathcal{N}(\boldsymbol{\mu}_0, \Sigma_0)$ and $\mathcal{N}(\boldsymbol{\mu}_1, \Sigma_1)$ we have [Kullback, 1959]

$$\mathrm{KL}(\mathcal{N}_0\|\mathcal{N}_1) = \tfrac{1}{2}\log|\Sigma_1\Sigma_0^{-1}| + \tfrac{1}{2}\operatorname{tr}\,\Sigma_1^{-1}\big((\boldsymbol{\mu}_0-\boldsymbol{\mu}_1)(\boldsymbol{\mu}_0-\boldsymbol{\mu}_1)^\top + \Sigma_0 - \Sigma_1\big).$$

Consider a general distribution $p(\mathbf{x})$ on $\mathbb{R}^D$ and a Gaussian distribution $q(\mathbf{x}) = \mathcal{N}(\boldsymbol{\mu}, \Sigma)$. Then

$$\mathrm{KL}(p\|q) = \int \tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^\top\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu})\,p(\mathbf{x})\,d\mathbf{x} + \tfrac{1}{2}\log|\Sigma| + \tfrac{D}{2}\log 2\pi + \int p(\mathbf{x})\log p(\mathbf{x})\,d\mathbf{x}.$$

This expression can be minimized with respect to $\boldsymbol{\mu}$ and $\Sigma$ by differentiating with respect to these parameters and setting the resulting expressions to zero.
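As a numerical sketch (the helper name and numbers below are illustrative, not from the book), the Gaussian KL formula and the Cholesky-based log determinant can be checked directly:

```python
import numpy as np

def gauss_kl(mu0, S0, mu1, S1):
    """KL(N(mu0, S0) || N(mu1, S1)) in nats, per the formula in the text."""
    d = mu0 - mu1
    return 0.5 * np.log(np.linalg.det(S1 @ np.linalg.inv(S0))) \
         + 0.5 * np.trace(np.linalg.inv(S1) @ (np.outer(d, d) + S0 - S1))

mu0, S0 = np.zeros(2), np.eye(2)
mu1, S1 = np.array([1.0, 0.0]), 2.0 * np.eye(2)
kl = gauss_kl(mu0, S0, mu1, S1)   # = 0.5*log 4 + 0.5*(0.5 + 1 - 2) in nats

# log|A| = 2 sum_i log L_ii, with L the Cholesky factor of A.
L = np.linalg.cholesky(S1)
logdet = 2.0 * np.sum(np.log(np.diag(L)))
```

The divergence is non-negative and vanishes when the two Gaussians coincide, as required.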
The optimal $q$ is the one that matches the first and second moments of $p$. The KL divergence can be viewed as the extra number of nats needed on average to code data generated from a source $p(\mathbf{x})$ under the distribution $q(\mathbf{x})$ as opposed to $p(\mathbf{x})$.

#### Limits

The limit of a rational quadratic is a squared exponential:

$$\lim_{\alpha\to\infty}\Big(1 + \frac{x^2}{2\alpha}\Big)^{-\alpha} = \exp\Big(-\frac{x^2}{2}\Big).$$

#### Measure and Integration

Here we sketch some definitions concerning measure and integration; fuller treatments can be found in Doob [1994] and Bartle [1995].

Let $\Omega$ be the set of all possible outcomes of an experiment. For example, for a $D$-dimensional real-valued variable, $\Omega = \mathbb{R}^D$. Let $\mathcal{F}$ be a $\sigma$-field of subsets of $\Omega$ which contains all the events in whose occurrences we may be interested. (The restriction to a $\sigma$-field of subsets is important technically to avoid paradoxes such as the Banach-Tarski paradox. Informally, we can think of the $\sigma$-field as restricting consideration to reasonable subsets.) Then $\mu$ is a countably additive measure if it is real and non-negative and for all mutually disjoint sets $A_1, A_2, \ldots \in \mathcal{F}$ we have

$$\mu\Big(\bigcup_{i=1}^{\infty}A_i\Big) = \sum_{i=1}^{\infty}\mu(A_i).$$

If $\mu(\Omega) < \infty$ then $\mu$ is called a finite measure, and if $\mu(\Omega) = 1$ it is called a probability measure. The Lebesgue measure defines a uniform measure over subsets of Euclidean space. Here an appropriate $\sigma$-algebra is the Borel $\sigma$-algebra $\mathcal{B}^D$, where $\mathcal{B}$ is the $\sigma$-algebra generated by the open subsets of $\mathbb{R}$. For example, on the line $\mathbb{R}$ the Lebesgue measure of the interval $(a, b)$ is $b - a$.

We now restrict $\Omega$ to be $\mathbb{R}^D$ and wish to give meaning to the integration of a function $f: \mathbb{R}^D \to \mathbb{R}$ with respect to a measure $\mu$,

$$\int f(\mathbf{x})\,d\mu(\mathbf{x}).$$

We assume that $f$ is measurable, i.e. that for any Borel-measurable set $A \subset \mathbb{R}$, $f^{-1}(A) \in \mathcal{B}^D$. There are two cases that will interest us: (i) when $\mu$ is the Lebesgue measure, and (ii) when $\mu$ is a probability measure. In the first case the expression above reduces to the usual integral notation $\int f(\mathbf{x})\,d\mathbf{x}$.
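The rational-quadratic limit above is easy to confirm numerically; the sketch below (illustrative grid and $\alpha$ values) shows the maximum gap to the squared exponential shrinking as $\alpha$ grows:

```python
import numpy as np

# Check that (1 + x^2/(2*alpha))^(-alpha) approaches exp(-x^2/2) as alpha grows.
x = np.linspace(-3.0, 3.0, 61)
se = np.exp(-x**2 / 2.0)

errs = []
for alpha in [1.0, 10.0, 1000.0]:
    rq = (1.0 + x**2 / (2.0 * alpha)) ** (-alpha)
    errs.append(np.max(np.abs(rq - se)))
```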
For a probability measure $\mu$ on $\mathbf{x}$, the non-negative function $p(\mathbf{x})$ is called the density of the measure if for all $A \in \mathcal{B}^D$ we have

$$\mu(A) = \int_A p(\mathbf{x})\,d\mathbf{x}.$$

If such a density exists it is uniquely determined almost everywhere, i.e. except for sets with measure zero. Not all probability measures have densities; only distributions that assign zero probability to individual points in $\mathbf{x}$-space can have densities.³ If the density $p(\mathbf{x})$ exists then we have

$$\int f(\mathbf{x})\,d\mu(\mathbf{x}) = \int f(\mathbf{x})\,p(\mathbf{x})\,d\mathbf{x}.$$

If $\mu$ does not have a density the expression on the left still has meaning by the standard construction of the Lebesgue integral.

For $\Omega = \mathbb{R}^D$ the probability measure $\mu$ can be related to the distribution function $F: \mathbb{R}^D \to [0,1]$, which is defined as $F(\mathbf{z}) = \mu(x_1 \leq z_1, \ldots, x_D \leq z_D)$. The distribution function is more general than the density, as it is always defined for a given probability measure. A simple example of a random variable which has a distribution function but no density is obtained by the following construction: a coin is tossed and with probability $p$ it comes up heads; if it comes up heads, $x$ is chosen from $U(0,1)$ (the uniform distribution on $[0,1]$); otherwise (with probability $1-p$) $x$ is set to $1/2$. This distribution has a point mass (or atom) at $x = 1/2$.

Let $\mu$ be a measure on an input set $\mathcal{X}$. For some function $f: \mathcal{X} \to \mathbb{R}$ and $1 \leq p < \infty$, we define

$$\|f\|_{L_p(\mathcal{X},\mu)} \triangleq \Big(\int |f(\mathbf{x})|^p\,d\mu(\mathbf{x})\Big)^{1/p},$$

if the integral exists. For $p = \infty$ we define

$$\|f\|_{L_\infty(\mathcal{X},\mu)} = \operatorname{ess\,sup}_{\mathbf{x}\in\mathcal{X}}|f(\mathbf{x})|,$$

where $\operatorname{ess\,sup}$ denotes the essential supremum, i.e. the smallest number that upper bounds $|f(\mathbf{x})|$ almost everywhere. The function space $L_p(\mathcal{X},\mu)$ is defined for any $p$ in $1 \leq p \leq \infty$ as the space of functions for which $\|f\|_{L_p(\mathcal{X},\mu)} < \infty$.

³A measure $\mu$ has a density if and only if it is absolutely continuous with respect to the Lebesgue measure on $\mathbb{R}^D$, i.e. every set that has Lebesgue measure zero also has $\mu$-measure zero.
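The point-mass construction above can be simulated directly; a small sketch (illustrative parameter values) shows the empirical mass of the atom at $x = 1/2$ matching $1-p$:

```python
import numpy as np

# Simulate the coin construction: with probability p, x ~ U(0, 1);
# otherwise x = 1/2, giving an atom of mass 1 - p at that point.
rng = np.random.default_rng(0)
p, n = 0.3, 100_000
heads = rng.random(n) < p
x = np.where(heads, rng.random(n), 0.5)

atom_mass = np.mean(x == 0.5)    # empirical mass of the atom, ~ 1 - p
```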
#### Fourier Transforms

For sufficiently well-behaved functions on $\mathbb{R}^D$ we have

$$f(\mathbf{x}) = \int \tilde f(\mathbf{s})\,e^{2\pi i\,\mathbf{s}\cdot\mathbf{x}}\,d\mathbf{s}, \qquad \tilde f(\mathbf{s}) = \int f(\mathbf{x})\,e^{-2\pi i\,\mathbf{s}\cdot\mathbf{x}}\,d\mathbf{x},$$

where $\tilde f(\mathbf{s})$ is called the Fourier transform of $f(\mathbf{x})$; see Bracewell [1986]. We refer to the equation on the left as the synthesis equation, and the equation on the right as the analysis equation. There are other conventions for Fourier transforms, particularly those involving $\omega = 2\pi s$. However, this tends to destroy the symmetry between the analysis and synthesis equations, so we use the definitions given above.

Above we have defined Fourier transforms for $f(\mathbf{x})$ being a function on $\mathbb{R}^D$. There are also related transforms for periodic functions, for functions defined on the integer lattice, and for functions on the regular $N$-polygon.

#### Convexity

Below we state some definitions and properties of convex sets and functions, taken from Boyd and Vandenberghe [2004]. A set $\mathcal{C}$ is convex if the line segment between any two points in $\mathcal{C}$ lies in $\mathcal{C}$, i.e. if for any $\mathbf{x}_1, \mathbf{x}_2 \in \mathcal{C}$ and for any $\lambda$ with $0 \leq \lambda \leq 1$, we have

$$\lambda\mathbf{x}_1 + (1-\lambda)\mathbf{x}_2 \in \mathcal{C}.$$

A function $f: \mathcal{X} \to \mathbb{R}$ is convex if its domain $\mathcal{X}$ is a convex set and if for all $\mathbf{x}_1, \mathbf{x}_2 \in \mathcal{X}$ and $\lambda$ with $0 \leq \lambda \leq 1$, we have

$$f(\lambda\mathbf{x}_1 + (1-\lambda)\mathbf{x}_2) \leq \lambda f(\mathbf{x}_1) + (1-\lambda)f(\mathbf{x}_2),$$

where $\mathcal{X}$ is a (possibly improper) subset of $\mathbb{R}^D$; $f$ is concave if $-f$ is convex. A twice-differentiable function $f$ is convex if and only if its domain $\mathcal{X}$ is a convex set and its Hessian is positive semidefinite for all $\mathbf{x} \in \mathcal{X}$.
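For twice-differentiable functions, the Hessian criterion above gives a mechanical convexity check. A sketch for quadratics $f(\mathbf{x}) = \mathbf{x}^\top A\mathbf{x}$ (illustrative matrices; the Hessian is the constant matrix $A + A^\top$):

```python
import numpy as np

def quadratic_is_convex(A, tol=1e-12):
    """Convexity test for f(x) = x^T A x: the Hessian is A + A^T,
    so f is convex iff all eigenvalues of A + A^T are non-negative."""
    return bool(np.all(np.linalg.eigvalsh(A + A.T) >= -tol))

A_psd = np.array([[2.0, 0.5], [0.5, 1.0]])     # positive definite -> convex
A_indef = np.array([[1.0, 0.0], [0.0, -1.0]])  # indefinite -> not convex

# The chord inequality f(l*x1 + (1-l)*x2) <= l*f(x1) + (1-l)*f(x2)
# holds in the convex case:
f = lambda x, A: float(x @ A @ x)
x1, x2, lam = np.array([1.0, 2.0]), np.array([-1.0, 0.5]), 0.3
mid = lam * x1 + (1 - lam) * x2
```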