Pehong Chen Distinguished Professor, Department of Electrical Engineering and Computer Science and Department of Statistics, California University, Berkeley

Members of NAS, NAE and AAAS

Divide-and-Conquer and Statistical Inference for Big Data

Divide-and-conquer is a natural computational paradigm for approaching Big Data problems, particularly given recent developments in distributed and parallel computing, but some interesting challenges arise when applying divide-and-conquer algorithms to statistical inference problems. One interesting issue is that of obtaining confidence intervals in massive datasets. The bootstrap principle suggests resampling data to obtain fluctuations in the values of estimators, and thereby confidence intervals, but this is infeasible with massive data. Subsampling the data yields fluctuations on the wrong scale, which have to be corrected to provide calibrated statistical inferences. The new procedure, the “bag of little bootstraps,” circumvents this problem, inheriting the favorable theoretical properties of the bootstrap but also having a much more favorable computational profile. Another issue is the problem of large-scale matrix completion. Here divide-and-conquer is a natural heuristic that works well in practice, but new theoretical problems arise when attempting to characterize the statistical performance of divide-and-conquer algorithms. Here the theoretical support is provided by concentration theorems for random matrices, and a new approach to this problem bases on Stein's method.

[Joint work with Ariel Kleiner, Lester Mackey, Purna Sarkar, Ameet Talwalkar, Richard Chen, Brendan Farrell and Joel Tropp]