ACEMS team develops new method to speed up MCMC algorithms in big data problems

Powerful simulation algorithms for analysing complex data problems were developed as early as the late 1940s. It took another half century, however, before computers were fast enough to make such Markov chain Monte Carlo (MCMC) algorithms practical in applications. Starting in the 1990s, these algorithms quickly became the workhorse of Bayesian statistical inference and have since been used in thousands of high-profile applications.

The future of MCMC simulation has recently been questioned, however, since datasets have grown faster in size than the computing power to analyse them. As a result, practitioners have started to look elsewhere for approximate methods that scale better for big data, but MCMC remains the ideal.

[Photo: Dr Matias Quiroz]

A team of ACEMS researchers has recently proposed a novel method to speed up MCMC algorithms in big data problems. The paper, which has been accepted for publication in the Journal of the American Statistical Association (JASA), brings together several fields of statistics. The team consisted of Dr Matias Quiroz and Professor Robert Kohn from the University of New South Wales (UNSW), Professor Mattias Villani from Linköping University in Sweden, and Dr Minh Ngoc Tran from The University of Sydney.

Simply put, the research focuses on using a subsample of the data in each iteration of the algorithm, rather than the full data set, which makes each iteration much faster to compute.

“So instead of evaluating the whole data set, we use a subsample and construct an estimate of the likelihood of the full data set,” said Dr Quiroz, one of the team members.

“The challenge is to do this in such a way that we don't lose any information contained in the full dataset. In other words, the aim is to get the same inference from a small subsample as you would if the full dataset was used instead.”
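To make the subsampling idea concrete, here is a minimal Python sketch. It is not the team's estimator, which this article does not describe; it only illustrates the basic survey-sampling step of scaling up a subsample total. The Normal model, the function names and the subsample size are all assumptions chosen for illustration.

```python
import math
import random


def normal_loglik(y, mu, sigma=1.0):
    """Log-density of one observation under an assumed Normal(mu, sigma^2) model."""
    return -0.5 * math.log(2 * math.pi * sigma**2) - 0.5 * ((y - mu) / sigma) ** 2


def full_loglik(data, mu):
    """Exact log-likelihood: one term per observation, so the cost grows with n."""
    return sum(normal_loglik(y, mu) for y in data)


def subsample_loglik(data, mu, m, rng):
    """Estimate the full log-likelihood from m observations drawn with replacement.

    The subsample total is scaled by n / m, which makes the estimate
    unbiased for the full-data sum while touching only m observations.
    """
    n = len(data)
    idx = [rng.randrange(n) for _ in range(m)]
    return n / m * sum(normal_loglik(data[i], mu) for i in idx)


# Hypothetical data: 100,000 draws from a Normal(1, 1) distribution.
rng = random.Random(0)
data = [rng.gauss(1.0, 1.0) for _ in range(100_000)]

exact = full_loglik(data, mu=1.0)
estimate = subsample_loglik(data, mu=1.0, m=1_000, rng=rng)
rel_error = abs(estimate - exact) / abs(exact)
```

In this sketch the estimate uses 1% of the data per evaluation. A plain estimator like this one is noisy, and that noise affects the resulting inference, which hints at why the quote above describes preserving the full-data answer as the real challenge.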

“The problem formulation sounds very easy. But it turns out there are a lot of issues encountered, so we needed to develop a really solid framework to address the issues. I think that what made the research so successful is that it was a real team effort drawing on the research strengths of all the team to combine two quite separate fields in Statistics: The first is the well-developed field of Survey Sampling; the second is MCMC simulation methods and in particular the new and fast developing field of pseudo marginal methods. The team showed the correctness of the methods in an MCMC context, demonstrated empirically that the methods worked well, and provided a rigorous mathematical justification of their large sample properties.”

Dr Quiroz says these types of methods are getting a lot of attention in the field of machine learning.

“Our paper shows that our method outperforms those machine learning methods by a lot. We have some very interesting ongoing work on scaling these methods to a large number of parameters by using principles from physics,” said Dr Quiroz.

That work also includes Khue-Dung Dang, a Ph.D. student supervised by Prof Kohn and co-supervised by Dr Quiroz.