Fellowship - Sparse, Rank-Reduced and General Smooth Modelling

  • Wood, Simon, (PI)

Project: Research council

Description

Smooth regression models are useful when some variable of interest is related to a number of predictor variables in a complex manner, and we want to understand that relationship. In many cases the complexity of the dependence between the variables makes it impractical to follow the traditional statistical approach of writing down a simple statistical model describing the relationship, in which only a few unknown parameters are to be estimated. Instead the statistical model is specified in terms of unknown smooth functions of predictors, for example 'log blood pressure is given by a smooth function of age plus a smooth function of weight and height plus a smooth function of hours of exercise per week'. The statistical challenge is then to estimate the smooth functions. Given decades of work on the theory and computation of these smooth models, their use is now widespread and almost as routine as that of traditional regression models. However, there remain several practical obstacles to their use, in exactly the complex data situations in which they should be most appealing.

1. Current methods allow either the effective modelling of short range spatial, temporal or spatio-temporal correlation, via sparse computational methods, OR the modelling of complex relationships involving many variables, via reduced rank methods, but not both. However, complex models with short range residual correlation are exactly where such smooth models are most practically appealing.
2. In the reduced rank setting, which allows feasible computation with highly complex models, the most reliable and efficient computational methods are so far restricted to situations where the variable of interest comes from the exponential family of distributions (normal, Poisson, binomial etc.). But given the proven wide utility of such methods, there would also be many applications for similarly reliable methods for models where the variable of interest follows a distribution very different from those in the exponential family (for example it might be the waiting time to an event, or the occurrence of an event at a spatial location).
3. Increasingly, researchers and companies are seeking to analyze very large datasets, for which analysis is simply infeasible with current smooth modelling technology.

This project aims to address these challenges, thereby massively increasing the practical scope and utility of this class of models. In particular the project will seek to find novel ways to hybridize the sparse and reduced rank approaches to smooth modelling, to resolve issue 1; to build on experience with the exponential family methods to develop reliable and efficient methods for variables from a much more general class of distributions, to resolve issue 2; and to develop novel and efficient algorithms for handling large and complex models that can be readily parallelized on cheap standard computer hardware, to address issue 3. The methods developed will be implemented in free open source software, building on the PI's successful mgcv package for generalized additive modelling, in the R statistical computing environment. The methods will also be disseminated via a textbook, short courses and the provision of web resources.
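
To make the idea of a model built from unknown smooth functions concrete, the following is a minimal sketch of a two-term additive model fitted by penalized least squares with a small (reduced rank) basis for each smooth. Everything in it is an illustrative assumption rather than the project's or mgcv's method: the simulated data, the piecewise-linear "tent" basis, the second-difference penalty, the function names and the hand-picked smoothing parameter `lam` (which in practice would be estimated from the data).

```python
import numpy as np


def tent_basis(x, knots):
    """Piecewise-linear ("tent") basis on evenly spaced knots (illustrative choice)."""
    h = knots[1] - knots[0]
    return np.clip(1.0 - np.abs(x[:, None] - knots[None, :]) / h, 0.0, None)


def second_diff_penalty(k):
    """Matrix D such that ||D b||^2 penalizes wiggliness in the coefficients b."""
    return np.diff(np.eye(k), n=2, axis=0)


def fit_additive(x1, x2, y, k=20, lam=1.0):
    """Fit y ~ f1(x1) + f2(x2) with k basis functions per smooth (k << n)."""
    knots1 = np.linspace(x1.min(), x1.max(), k)
    knots2 = np.linspace(x2.min(), x2.max(), k)
    X = np.hstack([tent_basis(x1, knots1), tent_basis(x2, knots2)])

    # Block-diagonal penalty: each smooth is penalized separately.
    D = second_diff_penalty(k)
    Z = np.zeros_like(D)
    P = np.sqrt(lam) * np.block([[D, Z], [Z, D]])

    # Penalized least squares written as an augmented least-squares problem.
    # lstsq returns the minimum-norm solution, which copes with the fact that
    # the constant is shared between the two smooths; fitted values are unique.
    A = np.vstack([X, P])
    rhs = np.concatenate([y, np.zeros(P.shape[0])])
    beta, *_ = np.linalg.lstsq(A, rhs, rcond=None)
    return X @ beta, beta


# Simulated data (hypothetical): two smooth effects plus normal noise.
rng = np.random.default_rng(0)
n = 500
x1 = rng.uniform(0.0, 1.0, n)
x2 = rng.uniform(0.0, 1.0, n)
truth = np.sin(2.0 * np.pi * x1) + 4.0 * (x2 - 0.5) ** 2
y = truth + rng.normal(0.0, 0.3, n)

fitted, beta = fit_additive(x1, x2, y, k=20, lam=1.0)
rmse = float(np.sqrt(np.mean((fitted - truth) ** 2)))
print("RMSE of fitted values against the true smooth signal:", rmse)
```

Here 500 observations are described by only 40 coefficients, which is the sense in which the representation is reduced rank; the normal-noise, least-squares setting is the simplest exponential family case mentioned above.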

Key findings

* Initial parallelization methods developed and released in R package mgcv.
* Improved p-values for random effect terms.
* General framework developed for smooth models beyond the exponential family; paper and software nearing completion.
Status: Finished
Effective start/end date: 1/02/13 to 30/11/15

Funding

  • Engineering and Physical Sciences Research Council

Fingerprint

Computational methods
Textbooks
Blood pressure
Normal distribution
Computer hardware
Industry
Statistical Models
Open source software