Stochastic spatial random forest (SS-RF) for interpolating probabilities of missing land cover data

How do you measure something you can’t see?

That’s the aim of new research just published by ACEMS researchers in the Journal of Big Data. Led by Jacinta Holloway Brown from ACEMS at QUT, the researchers developed a new statistical method to predict forest cover in satellite images where portions of the image are blocked by cloud cover. Not only that, the new method also calculates a probability that shows how confident each prediction is.

Forests are a global environmental priority and need to be monitored frequently and at large scales. Free satellite images are a key data source for monitoring forest cover globally. However, in tropical regions these images often have missing data because of persistent, climate-driven cloud cover. Remote sensing and statistical approaches for filling these gaps exist and can be highly accurate, but they do not provide measures of uncertainty.

The new method is accurate (accuracy of up to 0.90), fast and scalable to millions of observations. Importantly, it answers two questions: what do the missing data look like, and how confident can the researchers be in their answers? The stochastic spatial random forest (SS-RF) is a two-step method that uses random forest algorithms to construct Beta distributions for interpolating missing data. In their case study, the researchers use the method to fill missing data gaps caused by simulated clouds in satellite images. They classify each pixel as forest or not forest, with an associated probability. These outputs can be used to produce spatial maps of probabilities that highlight areas of uncertainty. Such maps can inform decisions about where it could be beneficial to invest in field data collection.
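
As a rough illustration of how a Beta distribution can carry both a prediction and its uncertainty, the sketch below summarises a handful of forest probabilities for a single pixel. It is not the researchers' implementation: the article does not describe the two steps in detail, so the method-of-moments fit, the function names fit_beta_moments and summarise_pixel, the 0.5 threshold and the example probabilities (imagined here as coming from repeated random forest fits) are all assumptions made for illustration.

```python
from statistics import mean, pvariance


def fit_beta_moments(probs, eps=1e-6):
    """Fit Beta(alpha, beta) to probabilities by the method of moments.

    Assumption: this fitting rule is illustrative only; the article does
    not say how the SS-RF constructs its Beta distributions.
    """
    m = min(max(mean(probs), eps), 1 - eps)
    v = pvariance(probs)
    # A Beta variance must stay below m * (1 - m); clamp so alpha, beta > 0.
    v = min(max(v, eps * eps), m * (1 - m) * (1 - eps))
    common = m * (1 - m) / v - 1
    return m * common, (1 - m) * common


def summarise_pixel(probs, threshold=0.5):
    """Return a label, a forest probability and a spread for one pixel."""
    alpha, beta = fit_beta_moments(probs)
    total = alpha + beta
    p_forest = alpha / total  # mean of the fitted Beta
    sd = (alpha * beta / (total ** 2 * (total + 1))) ** 0.5  # its std. dev.
    label = "forest" if p_forest >= threshold else "not forest"
    return {"label": label, "p_forest": p_forest, "sd": sd}


# Hypothetical probabilities for three cloud-covered pixels.
confident = summarise_pixel([0.90, 0.92, 0.88, 0.91, 0.89])
uncertain = summarise_pixel([0.30, 0.90, 0.60, 0.45, 0.75])
cleared = summarise_pixel([0.10, 0.12, 0.08])

for name, result in [("confident", confident),
                     ("uncertain", uncertain),
                     ("cleared", cleared)]:
    print(name, result["label"],
          round(result["p_forest"], 3), round(result["sd"], 3))
```

In this sketch every pixel receives a label, but the second pixel has a much larger standard deviation than the first. Plotted across an image, that spread is the kind of signal a probability map could use to highlight where field data collection might be worthwhile.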

In addition, the new SS-RF method can be applied to other big data problems, for example predicting the probability of disease in health, species presence and absence, and crop identification in agricultural monitoring. Because the method is built on the random forest, which will ‘learn’ from any training data set, the SS-RF can be used in many applications. The method can also be used for more than two classes, which will be shown in a future publication.