Random Matrices and Big Data Sets

Lead CI: Peter Forrester

In classical statistics, a big data set is one for which the population size (p) and/or the number of observations (n) is large. The corresponding data matrix has one column for each member of the population and is therefore of size n by p. Each member of the population can be thought of as a variable, and thus as an extra dimension. A celebrated example is the case of p = 95 genes at n = 400 locations across Europe. This data set was analysed to support the hypothesis that the spread of farming across Europe thousands of years ago was the result of a migration of people rather than a transfer of technology [Cavalli-Sforza, 2000]. The question is to quantify the number of different populations, as inferred from the data. The classical solution is to perform a principal component analysis of the corresponding covariance matrix: the eigenvector corresponding to the largest eigenvalue tells us which linear combination of the variables is responsible for the largest variation in the data, and the relative sizes of the eigenvalues measure the relative importance of the other linear combinations. What happens if the data matrix is contaminated by high-dimensional noise? Such a situation is typical of, for example, time series of the values of various international currencies against the US dollar. Correlation between the variables is then thought of as a signal, and the objective is to devise inference methods to detect the signal. The study of such questions is the domain of random matrix theory, which is the field of expertise of CI Forrester. Much attention has been focused on the so-called spiked covariance model of Johnstone [2001], in which the covariance matrix is a low rank perturbation of the identity. It is the low rank case which will be studied first. With great international attention on random matrix products at present, the most natural model for the high-dimensional noise is a product of standard complex Gaussian matrices. Thus my specific suggestion is to study low rank perturbations of products of complex Gaussian matrices.
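As a minimal numerical illustration of the kind of model I have in mind (a sketch, not the proposed analysis), the following Python snippet forms a product of independent standard complex Gaussian matrices whose input covariance carries a rank-one spike, and inspects whether the largest squared singular value separates from the bulk. The matrix size, number of factors, spike strength and the N^m scaling are all illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def complex_gaussian(rows, cols, rng):
    """Standard complex Gaussian matrix: entries with independent
    N(0, 1/2) real and imaginary parts."""
    return (rng.standard_normal((rows, cols))
            + 1j * rng.standard_normal((rows, cols))) / np.sqrt(2)

def spiked_product(N, num_factors, theta, rng):
    """Product of num_factors standard complex Gaussian matrices,
    right-multiplied by the square root of the rank-one spiked
    covariance Sigma = I + theta * v v^*."""
    v = np.zeros(N)
    v[0] = 1.0                                   # spike direction (illustrative)
    sqrt_sigma = np.eye(N) + (np.sqrt(1 + theta) - 1) * np.outer(v, v)
    X = sqrt_sigma
    for _ in range(num_factors):
        X = complex_gaussian(N, N, rng) @ X
    return X

N, num_factors, theta = 200, 3, 20.0
X = spiked_product(N, num_factors, theta, rng)
# Squared singular values of the product, scaled by N^m, play the role of
# sample covariance eigenvalues; look for a separated outlier at the top.
evals = np.sort(np.linalg.svd(X, compute_uv=False) ** 2) / N**num_factors
print("largest:", evals[-1], " second largest:", evals[-2])
```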

In mathematics, a celebrated data set is Odlyzko's [1989-present] computation of the Riemann zero number 10^22 and one billion of its neighbours. One should mention that the Riemann zeta function is intimately related to the prime numbers, and that the Riemann hypothesis, which asserts that all the zeros of the Riemann zeta function lie either on the negative real axis (the trivial zeros) or on the critical line Re(s) = 1/2, is regarded as the most important unsolved problem in mathematics. The Montgomery-Odlyzko principle asserts that the scaled statistical properties of the large Riemann zeros coincide with those of the bulk eigenvalues of a large complex Hermitian Gaussian random matrix. For large but finite N there are corrections. Work of Keating, Snaith, and Bogomolny et al [2006] suggests that these corrections relate to the circular ensemble of random unitary matrices with Haar measure. I have written to Odlyzko and obtained this data set. My proposal is to further probe the veracity of this observation with respect to a variant of the nearest neighbour spacing statistic considered by Bogomolny et al [2006]: first delete each zero in Odlyzko's data set independently with probability (1 - z), then compute the nearest neighbour spacing distribution of the thinned sequence. This is an appealing statistic since for the circular unitary ensemble the 1/N corrections can be systematically computed using the Painlevé characterisation due to Jimbo et al. [1980], further developed by myself and N.S. Witte [2000]. Working out the details of the latter will be of independent interest.
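The thinning statistic itself is straightforward to compute. The Python sketch below illustrates it on numerically generated CUE eigenangles standing in for the zeros; the thinning probability z, the matrix size and the number of samples are illustrative, and the actual study would of course be carried out on Odlyzko's data after unfolding to unit mean density.

```python
import numpy as np

rng = np.random.default_rng(1)

def cue_eigenangles(N, rng):
    """Eigenangles of a Haar-distributed unitary matrix (CUE), generated
    via the QR-based algorithm of Mezzadri."""
    Z = (rng.standard_normal((N, N)) + 1j * rng.standard_normal((N, N))) / np.sqrt(2)
    Q, R = np.linalg.qr(Z)
    Q = Q * (np.diag(R) / np.abs(np.diag(R)))   # phase correction for Haar measure
    return np.sort(np.angle(np.linalg.eigvals(Q)))

def thinned_spacings(angles, z, rng):
    """Delete each point independently with probability 1 - z, then return
    the nearest neighbour spacings rescaled to unit mean density."""
    kept = angles[rng.random(angles.size) < z]
    spacings = np.diff(kept)
    return spacings * kept.size / (2 * np.pi)   # mean spacing is ~ 2*pi / (z*N)

N, z = 500, 0.5
samples = [thinned_spacings(cue_eigenangles(N, rng), z, rng) for _ in range(50)]
s = np.concatenate(samples)
print("mean spacing after rescaling:", s.mean())   # should be close to 1
```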

Over the past decade there have been eye-catching applications of random matrix theory in the theory of algorithms and machine learning. Random matrices have been used as examples of large matrices with bounded restricted isometry constants, an idea due to Candes and Tao [2005], and in the proof of theorems in the field of compressed sensing (Donoho [2006]). Random matrices have also been used to give new proofs of, and insights into, the celebrated Johnson-Lindenstrauss lemma concerning the distortion of the embedding of points from high-dimensional into low-dimensional Euclidean space (Dasgupta and Gupta, 2003), a result of major theoretical importance in the field of machine learning. Very recently I have written my first paper relating to this general research area, in the form of a (mostly) review article 'The Golden-Thompson inequality – historical aspects and random matrix applications' (JMP, vol. 55, 023503 (2014)), co-authored with Colin Thompson. Colin Thompson is the same Thompson as in the Golden-Thompson inequality, and is an emeritus professor in the Department of Mathematics and Statistics here at the University of Melbourne. My reading to date indicates that I am well placed to contribute to the topic of concentration of measure (in fact I have written 3 papers on this topic since 2012), and also to the condition number of a large linear system with random coefficients, which from the work of Edelman [1988] is known to relate to integrable structures. In addition to researching these topics, I plan to broaden my knowledge base by further study of the existing literature.
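To indicate the flavour of the random projection construction underlying the Johnson-Lindenstrauss lemma, the short Python sketch below embeds n points from a high-dimensional space into dimension k of order log(n)/eps^2 using a Gaussian matrix and checks the pairwise-distance distortion; the constant 8 in the choice of k and the other parameters are illustrative rather than the sharpest known.

```python
import numpy as np

rng = np.random.default_rng(2)

# Embed n points from dimension d into dimension k with a random Gaussian
# projection; the JL lemma guarantees pairwise distances are preserved to
# within a factor (1 +/- eps) provided k is of order log(n) / eps^2.
n, d, eps = 200, 10_000, 0.25
k = int(np.ceil(8 * np.log(n) / eps**2))      # illustrative constant

X = rng.standard_normal((n, d))               # original points (rows)
P = rng.standard_normal((d, k)) / np.sqrt(k)  # random projection matrix
Y = X @ P                                     # embedded points

def pairwise_sq_dists(A):
    """Matrix of squared Euclidean distances between the rows of A."""
    G = A @ A.T
    sq = np.diag(G)
    return sq[:, None] + sq[None, :] - 2 * G

orig = pairwise_sq_dists(X)
emb = pairwise_sq_dists(Y)
mask = ~np.eye(n, dtype=bool)                 # ignore the zero diagonal
ratios = emb[mask] / orig[mask]
print("distortion range:", ratios.min(), ratios.max())   # expect roughly [1-eps, 1+eps]
```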