Interplays between Numerical Approximation, Gaussian Process Regression and Learning.

Broadly speaking, my research concerns the exploration, from a Game Theoretic perspective, of interplays between Numerical Approximation, Gaussian Process Regression and Learning, with applications to Numerical Homogenization, Operator Adapted Wavelets, Fast Solvers, Dense Kernel Matrices, Machine Learning and Uncertainty Quantification. For more see the following [ paper ] and the following [ slides ] (no videos), [ slides ] (with videos). A few highlights are described below.

[ pdf | arXiv:1406.6668 ] and [ pdf | arXiv:1503.03467 ] introduced the idea of solving deterministic PDEs as learning problems with GPs/kernel methods and of informing the underlying kernels/GPs about the underlying physics. The motivation for this work was to use the Gaussian Process perspective as a pathway to the discovery of numerical methods. Indeed, as discussed in [ pdf | arXiv:1406.6668 ], while the discovery of numerical solvers for PDEs is usually based on a combination of insight and guesswork, this process can be facilitated, to the point of being automated, using the GP perspective. For instance, for nonsmooth PDEs, physics-informed Gaussian priors can be identified by (a) filtering uninformed Gaussian priors through the inverse operator or (b) turning the process of computing fast with partial information into repeated zero-sum games with physics-informed losses (whose optimal mixed strategies are physics-informed Gaussian priors). Solving deterministic PDEs as learning problems is now a popular topic in Computational Sciences and Engineering (a minimal sketch of this formulation is given below). See also this [ 2015 SIAM CSE talk ] for a presentation of the vision of solving CSE problems as learning/decision theory problems. The interface between data sciences, ML and Computational Sciences and Engineering is now an increasingly popular field.

[ pdf ] introduced the idea of combining compressed sensing with concentration inequalities to learn SPDEs. Employing sparse approximation to learn differential equations is now a popular topic in Computational Sciences and Engineering. [ pdf | arXiv:1304.7046 ] introduced the idea of automating the discovery of new mathematical identities (Selberg identities).

[ pdf | arXiv:1009.0679 ] developed a worst-case approach to Uncertainty Quantification with unknown functions and measures, leading to computable optimal (min and max) bounds on quantities of interest. These bounds correspond to infinite-dimensional optimization problems over spaces of functions and measures; very generally, they can be reduced to finite-dimensional problems and are hence amenable to computation. The underlying calculus was generalized in [ pdf | arXiv:1304.6772 | blog ] and [ pdf | arXiv:1308.6306 | blog 1 | blog 2 ] to analyze the robustness of Bayesian Inference under finite information. This analysis uncovered the extreme sensitivity (brittleness) of Bayesian inference and suggested that robust inference, in a continuous world under finite information, should be done with reduced/coarse models rather than highly sophisticated/complex models (with a level of coarseness/reduction depending on the available finite information). The mechanism causing brittleness/robustness (1) demonstrated that accuracy and robustness are antagonistic requirements and (2) raised the question of a missing stability condition for using Bayesian Inference in a continuous world under finite information.
The study of this lack of robustness of learning algorithms, and the design of attacks exploiting it, is now a popular topic in Machine Learning, known as adversarial machine learning.
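As a concrete, purely illustrative instance of the PDE-as-learning-problem viewpoint sketched above, the toy example below solves a 1D Poisson problem by conditioning a Gaussian process prior on observations of the PDE at interior collocation points and of the boundary values (symmetric kernel collocation). The squared-exponential kernel, the lengthscale, the collocation points and the JAX-based implementation are assumptions made for this sketch, not the constructions of the papers cited above.

```python
# A minimal sketch (toy construction, not the papers' general method):
# the 1D Poisson problem -u'' = f on [0, 1] with u(0) = u(1) = 0, solved as
# Gaussian process regression. The GP prior u ~ GP(0, k) is conditioned on
# observations of -u'' at interior collocation points and of u at the boundary.
import jax
import jax.numpy as jnp

jax.config.update("jax_enable_x64", True)

def k(x, y, ell=0.2):
    # Squared-exponential kernel; an illustrative, not prescribed, choice.
    return jnp.exp(-0.5 * (x - y) ** 2 / ell ** 2)

# Derivatives of the kernel needed to apply L = -d^2/dx^2 in each argument.
k_xx = jax.grad(jax.grad(k, argnums=0), argnums=0)       # d^2 k / dx^2
k_yy = jax.grad(jax.grad(k, argnums=1), argnums=1)       # d^2 k / dy^2
k_xxyy = jax.grad(jax.grad(k_yy, argnums=0), argnums=0)  # d^4 k / dx^2 dy^2

f = lambda x: jnp.pi ** 2 * jnp.sin(jnp.pi * x)          # exact solution: sin(pi x)

x_int = jnp.linspace(0.05, 0.95, 19)                     # interior collocation points
x_bnd = jnp.array([0.0, 1.0])                            # boundary points

# Gram matrix Theta[i, j] = (L_i^x L_j^y k)(x_i, x_j), with L the PDE operator
# (-d^2/dx^2) at interior points and the identity at boundary points.
vmap2 = lambda fn: jax.vmap(jax.vmap(fn, (None, 0)), (0, None))
Theta = jnp.block([
    [vmap2(k_xxyy)(x_int, x_int), -vmap2(k_xx)(x_int, x_bnd)],
    [-vmap2(k_yy)(x_bnd, x_int),   vmap2(k)(x_bnd, x_bnd)],
])
rhs = jnp.concatenate([f(x_int), jnp.zeros(2)])
coef = jnp.linalg.solve(Theta + 1e-10 * jnp.eye(len(rhs)), rhs)

def u(x):
    # Posterior mean of the conditioned GP = kernel-collocation approximation.
    row = jnp.concatenate([-jax.vmap(k_yy, (None, 0))(x, x_int),
                           jax.vmap(k, (None, 0))(x, x_bnd)])
    return row @ coef

x_test = jnp.linspace(0.0, 1.0, 11)
print(jnp.max(jnp.abs(jax.vmap(u)(x_test) - jnp.sin(jnp.pi * x_test))))
```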

Kernel Flows

Learning can be formulated as solving an interpolation problem. Although Kriging offers a solution, it does not scale well with the number of training data points and it requires the prior specification of a kernel. Kernel Flows (KF) is an algorithm that learns kernels by simulating a data-driven dynamical system (a flow in the input space). The motivation for developing this algorithm is that it is amenable to some degree of analysis (which could help in the formulation of a rigorous theory for deep learning) and is scalable to large datasets. When applied to datasets such as MNIST or fashion-MNIST, KF produces a kernel capable of generalization from 1 sample per class. The gifs below and above show the application of KF to the fashion-MNIST dataset and to the swiss roll cheesecake classification problem. The bottom left image shows the instantaneous flow generated by KF; the bottom right image shows the time-averaged flow. For more see the following [ slides ] (no videos), [ slides ] (with videos), the following [ paper ] and the following [ videos ].
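The sketch below illustrates the parametric variant of this idea under simplifying assumptions (a Gaussian kernel with a single log-lengthscale parameter, synthetic data, and plain stochastic gradient descent; this is not the paper's exact setup): the kernel parameter is moved along the gradient of rho, the relative RKHS-norm loss incurred when the interpolant is computed from a random half of the training data instead of all of it.

```python
# A minimal sketch of the parametric variant of Kernel Flows (illustrative
# assumptions: one kernel parameter, toy data, plain SGD).
import jax
import jax.numpy as jnp
import numpy as np

def kernel(X, Y, theta):
    # Gaussian (RBF) kernel with log-lengthscale theta (illustrative choice).
    d2 = jnp.sum((X[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    return jnp.exp(-d2 / (2.0 * jnp.exp(2.0 * theta)))

def rho(theta, X, y, idx):
    # rho = 1 - ||v||_K^2 / ||u||_K^2: relative RKHS-norm loss when the
    # interpolant v uses only the subsample idx instead of all the data (u).
    K = kernel(X, X, theta) + 1e-8 * jnp.eye(len(X))
    Kc = kernel(X[idx], X[idx], theta) + 1e-8 * jnp.eye(len(idx))
    num = y[idx] @ jnp.linalg.solve(Kc, y[idx])
    den = y @ jnp.linalg.solve(K, y)
    return 1.0 - num / den

grad_rho = jax.jit(jax.grad(rho))

# Toy regression data; theta is the only learned parameter in this sketch.
rng = np.random.default_rng(0)
X = jnp.asarray(rng.uniform(-1.0, 1.0, size=(200, 2)))
y = jnp.sin(3 * X[:, 0]) * jnp.cos(2 * X[:, 1])
theta, lr = jnp.array(0.0), 0.1
for step in range(200):
    idx = jnp.asarray(rng.choice(len(X), size=len(X) // 2, replace=False))
    theta = theta - lr * grad_rho(theta, X, y, idx)  # flow on the kernel parameter
print("learned lengthscale:", float(jnp.exp(theta)))
```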

Operator adapted wavelets.

The application of the game theoretic approach to the problem of approximating the solutions of linear operators also leads to the identification of operator adapted wavelets (gamblets) satisfying three properties: (1) scale orthogonality in the operator scalar product (which leads to a block-diagonalization of the operator), (2) well-conditioned linear systems (the blocks have uniformly bounded condition numbers), and (3) localization or exponential decay (the blocks are sparse). These gamblets can also be interpreted as Wannier functions, i.e. linear combinations of eigenfunctions concentrated around given eigenvalues that also remain concentrated in space. They are identified by conditioning the Gaussian field emerging from the game described above on a hierarchy of non-adapted wavelets (e.g. Haar). Although wavelets satisfying one of these three properties were known before, this simple Gaussian Process Regression approach leads to the first identification of wavelets satisfying all three.
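A minimal two-level sketch of this construction on a toy 1D operator is given below (the rough coefficient, the grid and the block-average, Haar-type measurements are illustrative assumptions, not the papers' general setting); it numerically checks property (1), scale orthogonality in the operator scalar product, and prints the condition number of the corresponding block as an illustration of property (2).

```python
# A minimal two-level sketch (toy 1D operator, Haar-type block averages).
# Gamblets Psi^(k)[:, i] = E[xi | Phi^(k) xi = e_i] with xi ~ N(0, A^{-1});
# the level-2 wavelets chi^(2) are A-orthogonal to the level-1 gamblets.
import numpy as np
from scipy.linalg import null_space

n = 256
h = 1.0 / (n + 1)
a = np.exp(np.sin(40 * np.pi * np.arange(n + 1) * h))   # rough coefficient a(x)
# A = finite-difference discretization of -d/dx(a(x) d/dx) with Dirichlet BCs.
A = (np.diag(a[:-1] + a[1:]) - np.diag(a[1:-1], 1) - np.diag(a[1:-1], -1)) / h**2

def averages(q):
    # q block-average measurements of the n unknowns (Haar-type, non-adapted).
    return np.kron(np.eye(q), np.ones((1, n // q)) / (n // q))

def gamblets(Phi):
    # Conditioning xi ~ N(0, A^{-1}) on Phi xi: Psi = A^{-1} Phi^T (Phi A^{-1} Phi^T)^{-1}.
    AinvPhiT = np.linalg.solve(A, Phi.T)
    return AinvPhiT @ np.linalg.inv(Phi @ AinvPhiT)

Phi1, Phi2 = averages(4), averages(16)                  # nested: Phi1 = pi @ Phi2
pi = np.kron(np.eye(4), np.ones((1, 4)) / 4)
Psi1, Psi2 = gamblets(Phi1), gamblets(Phi2)

W2 = null_space(pi).T                                   # rows span ker(pi)
chi2 = Psi2 @ W2.T                                      # level-2 operator-adapted wavelets
print(np.abs(chi2.T @ A @ Psi1).max())                  # ~ 0: scale orthogonality (property 1)
print(np.linalg.cond(chi2.T @ A @ chi2))                # condition number of the block (property 2)
```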

Numerical homogenization.

Numerical homogenization can be formalized as the problem of identifying a finite number of basis functions (for a given operator) that are (1) as accurate as possible (in approximating the solution space of that operator) and (2) as localized as possible. This problem is nontrivial because (1) and (2) are conflicting requirements, and its solution generalizes results of classical homogenization obtained under assumptions such as scale separation or ergodicity. The adversarial game approach to numerical approximation provides a simple solution to this problem: the resulting basis functions can be identified as elementary bets for playing a game whose task is the recovery of an unknown function from partial measurements. The optimal strategy for this game is a Gaussian field whose conditioning (on elementary measurements) produces these elementary bets (gamblets)/basis functions. This interplay between Gaussian Process Regression and Numerical Approximation provides the first rigorous proof of the screening effect with quantitative exponential decay estimates.
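Reusing the toy operator of the sketch above, the one-level construction reads as follows (again an illustrative, assumption-laden sketch rather than the papers' general statement): conditioning the Gaussian field with covariance A^{-1} on block-average measurements yields the basis functions, and their decay away from their block illustrates the localization/screening effect.

```python
# One-level gamblets/basis functions on the same toy operator A as above,
# obtained by conditioning xi ~ N(0, A^{-1}) on m block-average ("elementary")
# measurements. The printed values illustrate the decay of a basis function
# away from its block (the screening effect) on this illustrative example.
import numpy as np

n, m = 256, 16
h = 1.0 / (n + 1)
a = np.exp(np.sin(40 * np.pi * np.arange(n + 1) * h))        # rough coefficient a(x)
A = (np.diag(a[:-1] + a[1:]) - np.diag(a[1:-1], 1) - np.diag(a[1:-1], -1)) / h**2

Phi = np.kron(np.eye(m), np.ones((1, n // m)) / (n // m))    # elementary measurements
AinvPhiT = np.linalg.solve(A, Phi.T)
Psi = AinvPhiT @ np.linalg.inv(Phi @ AinvPhiT)               # Psi[:, i] = E[xi | Phi xi = e_i]

psi = np.abs(Psi[:, m // 2])                                 # basis function of the middle block
print("peak value:", psi.max(), " far-field value:", psi[:n // m].max())
```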

Game theoretic approach to numerical approximation and algorithm design.

The three main approaches to uncertainty quantification (worst case, Bayesian, adversarial game/decision theory) can be translated into three main approaches to numerical analysis and algorithm design. The worst case approach is the classical approach of numerical analysis. The Bayesian approach is the average case approach of Information Based Complexity and Bayesian numerical analysis, which is currently reemerging as probabilistic numerics. The game theoretic approach stems from the observation that to compute fast one must compute with partial information over hierarchies of levels of complexity. This process of computing with partial information can be turned into a process of playing repeated adversarial games against the missing information. The optimal (mixed/randomized) strategies for playing such games can be used to automate the process of discovery in numerical analysis and algorithm design (i.e. learn numerical solvers through a machine learning approach to approximating the solutions of ODEs/PDEs).

Fast solvers, theory and algorithms.

Scalable linear solvers play an important role in science and industry and are characterized by a large degree of diversity. Although, as affirmed by Sard, no numerical approximation method can be universal, one may wonder whether some degree of universality could be achieved for a large class of linear operators. The game theoretic approach leads to scalable numerical solvers, of provable near-linear complexity, for arbitrary linear elliptic operators on Sobolev spaces and for dense kernel matrices. The complexity of these direct solvers is O(N log^{2d+1} N), which is the state of the art for PDEs with rough coefficients. These solvers also perform an approximate PCA/eigen-subspace decomposition corresponding to a diagonalization into uniformly well-conditioned and sparse blocks.

Stochastic and Multiscale analysis.

My older research work concerns (1) the study of anomalous diffusion in systems characterized by an infinite number of scales [ paper ], (2) homogenization and inverse homogenization with a continuum of scales, (3) numerical homogenization [ paper ], (4) the geometric integration of stochastic Hamiltonian systems [ paper ], (5) the structure-preserving integration of (possibly stochastic) multiscale systems with hidden slow variables [ paper ], (6) super-resonance and control via parametric resonance, and (7) the proof of Davies’ and periodization conjectures in homogenization [ paper ] [ paper ].

Stochastic and derandomization calculus in infinite dimensional spaces.

The worst case, Bayesian and decision theory approaches to Uncertainty Quantification require the development of a form of calculus enabling the manipulation of infinite dimensional information structures, involving the resolution of optimization (min, max and minimax) problems over spaces of measures and functions and over spaces of measures over measures and functions. Another motivation for the development of this calculus is its potential application to (1) the design of fast, robust and scalable numerical solvers and the analysis of partial differential equations, (2) model form uncertainty quantification and design, and (3) model reduction. Although these applications may appear, at first glance, dissimilar, they are all based on the efficient processing of incomplete information with limited computational resources (i.e., constraints on complexity-management capabilities): (1) fast and robust computation requires computation with partial information (this concept forms the core of Information Based Complexity, where it is also augmented by concepts of contaminated and priced information associated with, for example, truncation errors and the cost of numerical operations), (2) model form UQ and design require the management and processing of epistemic uncertainties and limited data, and (3) model reduction requires the approximation of the full state of a complex system through operations performed on a few (coarse/reduced) variables. Examples of applications include Optimal Uncertainty Quantification and the reduction of infinite dimensional optimization problems over measures and functions to finite dimensional ones, the identification of new Selberg identities [ paper ], the quantification of the robustness of Bayesian inference under finite information, and the identification of Gaussian fields as optimal mixed strategies for the game theoretic approach to numerical approximation.
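As a toy illustration of this reduction (a discretized sketch recovering Markov's inequality; the actual reduction theorems are exact and do not require discretizing the support): the worst-case probability P[X >= a] over all probability measures on [0, 1] with mean at most m becomes, after restricting the support to a grid, a finite linear program over the weights, and its optimizer is a discrete measure with very few atoms.

```python
# Toy, discretized illustration of the reduction of an infinite-dimensional
# optimization over measures to a finite-dimensional problem (Markov's inequality).
import numpy as np
from scipy.optimize import linprog

a, m = 0.8, 0.2
x = np.linspace(0.0, 1.0, 401)                      # candidate support points
c = -(x >= a).astype(float)                         # maximize P[X >= a] (minimize -P)
res = linprog(c, A_ub=[x], b_ub=[m],                # mean constraint E[X] <= m
              A_eq=[np.ones_like(x)], b_eq=[1.0],   # total mass = 1
              bounds=(0, None), method="highs")
p = res.x
print("worst-case P[X >= a]:", -res.fun, " Markov bound m/a:", m / a)
print("support of the optimizer:", x[p > 1e-9])     # a discrete measure with few atoms
```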

Research supported by

AFOSR BEYOND LIMITS DARPA DOE LANL NASA/JPL NNSA NSF ONR UTRC

Uncertainty Quantification

The work of my group in this area has been focused on adversarial approaches to Uncertainty Quantification (worst case min max and game theoretic minimax). The following [ slides ] (Caltech Nov. 2013) provide a high level description of this research. The following extract from this [ paper ] describes our general vision (based on Decision/Game theory):

During the past century the need to solve large complex problems in applications such as fluid dynamics, neutron transport or ballistic prediction drove the parallel development of computers and numerical methods for solving ODEs and PDEs. It is now clear that this development led to a paradigm shift. Before: each new PDE required the development of new theoretical methods and the employment of large teams of mathematicians and physicists; in most cases, information on solutions was only qualitative and based on general analytical bounds on fundamental solutions. After: mathematical analysis and computer science worked in synergy to give birth to robust numerical methods (such as finite element methods) capable of solving a large spectrum of PDEs without requiring the level of expertise of an A. L. Cauchy or the level of insight of an R. P. Feynman. This transformation can be traced back to sophisticated calculations performed by arrays of human computers organized as parallel clusters, such as in the pioneering work of Lewis Fry Richardson, who in 1922 had a room full of clerks attempt to solve finite-difference equations for the purposes of weather forecasting, and the 1947 paper by John von Neumann and Herman Goldstine on Numerical Inverting of Matrices of High Order. Although Richardson's predictions failed due to the use of unfiltered data/initial conditions/equations and large time-steps not satisfying the CFL stability condition, his vision was shared by von Neumann in his proposal of the Meteorology Research Project to the U.S. Navy in 1946, qualified by Platzman as "perhaps the most visionary prospectus for numerical weather prediction since the publication of Richardson's book a quarter-century earlier."

The past century has also seen a steady increase in the need to estimate and predict complex systems and to make (possibly critical) decisions with limited information. Although computers have made possible the numerical evaluation of sophisticated statistical models, these models are still designed by humans through the employment of multi-disciplinary teams of physicists, computer scientists and statisticians. Contrary to the original human computers (such as the ones pioneered by L. F. Richardson or overseen by R. P. Feynman at Los Alamos), these human teams do not follow a specific algorithm (such as the one envisioned in Richardson's Forecast Factory, where 64,000 human computers would have been working in parallel and at high speed to compute world weather charts), because there is currently no known recipe or algorithm for dividing the design of a statistical model into a sequence of arithmetic operations. Furthermore, while human computers were given a specific PDE or ODE to solve, these human teams are not given a well-posed problem with a well-defined notion of solution. As a consequence, different human teams come up with different solutions to the design of the statistical model, along with different estimates on uncertainties.
Indeed, enabling computers to think as humans do when faced with uncertainty is challenging in several major ways: (1) there is currently no known recipe or algorithm for dividing the design of a statistical model into a sequence of arithmetic operations; (2) formulating the search for an optimal statistical estimator/model as a well-posed problem is not obvious when information on the system of interest is incomplete and comes in the form of a complex combination of sample data, partial knowledge of constitutive relations and a limited description of the distribution of input random variables; (3) the space of admissible scenarios, along with the space of relevant information, assumptions, and/or beliefs, tends to be infinite dimensional, whereas calculus on a computer is necessarily discrete and finite. One purpose of UQ is to explore and develop the foundations of a rigorous/rational framework for the scientific computation of optimal statistical estimators/models and the optimal quantification of uncertainties for complex systems. The emerging questions lead to many challenging problems in Decision Theory, Machine Learning, Bayesian Inference, Stochastic Optimization, Robust Optimization, and Information Based Complexity, the most fundamental of these being the simultaneous emphasis on computation and performance, as in the Machine Learning framework initiated by Valiant. For more see the following [ slides ], the following [ paper 1, paper 2 ] and the following interview: HPC Wire, The Masters of Uncertainty (09/2013).
“Technology, in common with many other activities, tends toward avoidance of risks by investors. Uncertainty is ruled out if possible. People generally prefer the predictable. Few recognize how destructive this can be, how it imposes severe limits on variability and thus makes whole populations fatally vulnerable to the shocking ways our universe can throw the dice.” Frank Herbert (Heretics of Dune)

Brittleness/Robustness of machine learning algorithms

Under AFOSR grant FA9550-12-1-0389 we derived brittleness (non-robustness) results for Bayesian estimators (see the following blog posts: [09/2013], [01/2015], [01/2016]). Since this lack of robustness is caused by a mechanism that (1) is inherent to doing inference in a continuous world with finite information and (2) implies that consistency and robustness are conflicting requirements, we predicted that deep/machine learning algorithms could be non-robust and could lead to increased confidence in incorrect solutions (see the [ talk ] by Mike McKerns at SciPy 2013, published in July 2013). Google engineers who were present at the talk tested these predictions for neural networks and observed the non-robustness of these algorithms to adversarial examples (Szegedy et al., Dec. 2013, Intriguing properties of neural networks). This area has grown into the field known as adversarial machine learning, which studies (1) how those algorithms can be attacked by exploiting their brittleness and (2) how to protect them against those attacks. Addressing these vulnerabilities is now recognized as critical to the safety of Machine Learning algorithms.

Kernel mode decomposition and programmable/interpretable regression networks

Kernel mode decomposition networks (KMDNets) are interpretable regression networks for pattern recognition. KMDNets are programmable and amenable to analysis, whereas artificial neural networks are trainable but neither interpretable nor programmable. The programming of KMDNets is achieved by assembling elementary modules that decompose and recompose kernels and data. These elementary modules are repeated across levels of abstraction and interpreted from the equivalent perspectives of optimal recovery, game theory and Gaussian process regression (GPR). The mode decomposition module produces an approximation of a vector (whose entries are elements of linear subspaces of a common Hilbert space) from the observation of the sum of its entries. The second module recomposes modes and kernels based on alignments between the signal and the recovered modes. As a prototypical application, KMDNets have been programmed to recover (to near-machine precision) the modes of a (possibly noisy) signal when those modes are unknown periodic functions that are slowly modulated in amplitude and frequency (when the periodic waveforms are trigonometric, these modes are known as intrinsic mode functions in the setting of empirical mode decomposition). For more see the following [ paper ] and the following [ slides ].
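The toy example below illustrates the mode decomposition module under simplifying assumptions (two hand-picked kernels, each concentrated around a known frequency; KMDNets align and learn such kernels rather than fixing them): each mode is modeled as a GP with its own kernel and is recovered from the observed sum as a conditional expectation.

```python
# Toy illustration of the mode decomposition module with two hypothetical
# kernels. Each mode is a GP with its own kernel and is recovered from the sum
# y = v1 + v2 via E[v_i | v1 + v2 = y] = K_i (K_1 + K_2 + s^2 I)^{-1} y.
import numpy as np

t = np.linspace(0.0, 1.0, 400)
v1 = np.sin(2 * np.pi * 3 * t) * (1 + 0.3 * t)       # low-frequency mode
v2 = 0.5 * np.sin(2 * np.pi * 40 * t)                # high-frequency mode
y = v1 + v2 + 0.01 * np.random.default_rng(0).normal(size=t.size)

def band_kernel(t, freq, ell):
    # PSD kernel concentrated around the given frequency: cosine kernel
    # modulated by a squared-exponential envelope (illustrative choice).
    d = t[:, None] - t[None, :]
    return np.cos(2 * np.pi * freq * d) * np.exp(-0.5 * d**2 / ell**2)

K1, K2 = band_kernel(t, 3.0, 0.3), band_kernel(t, 40.0, 0.1)
w = np.linalg.solve(K1 + K2 + 1e-4 * np.eye(t.size), y)
v1_hat, v2_hat = K1 @ w, K2 @ w                      # recovered modes
print("max errors:", np.max(np.abs(v1_hat - v1)), np.max(np.abs(v2_hat - v2)))
```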