Interplays between Numerical Approximation, Gaussian Process Regression and Learning.
Broadly speaking, my research concerns the exploration, from a Game Theoretic perspective, of interplays
between Numerical Approximation, Gaussian Process Regression and Learning, with applications to Numerical Homogenization,
Operator Adapted Wavelets, Fast Solvers, Dense Kernel Matrices, Machine Learning and Uncertainty Quantification.
For more, see the following [ paper ] and the following [ slides ] (no videos) and [ slides ] (with videos).
A few highlights are described below.
[ pdf | arXiv:1406.6668 ] and [ pdf | arXiv:1503.03467 ] introduced the idea of solving deterministic PDEs as learning problems
with GPs/kernel methods and of informing the underlying kernels/GPs with the physics of the problem. The motivation for this work
was to use the Gaussian Process perspective as a pathway to the discovery of numerical
methods. Indeed, as discussed in [ pdf | arXiv:1406.6668 ], while the discovery of numerical solvers
for PDEs is usually based on a combination of insight and guesswork, this process can be facilitated, to the point of being
automated, using this GP perspective. For instance, for nonsmooth PDEs, physics-informed Gaussian priors can be
identified by (a) filtering uninformed Gaussian priors through the inverse operator or (b) turning the process of computing fast
with partial information into repeated zero-sum games with physics-informed losses (whose optimal mixed strategies are
physics-informed Gaussian priors). Solving deterministic PDEs as learning problems is now
a popular topic in Computational Sciences and Engineering.
See also this [ 2015 SIAM CSE talk ] for a presentation of the vision of solving CSE problems as learning/decision theory problems.
The interface between data science, ML and Computational Sciences and Engineering is now an increasingly popular field.
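As a minimal illustration of this point of view (a sketch under assumptions of my own: a 1D Poisson problem, a Gaussian kernel with a fixed lengthscale, and plain unsymmetric collocation rather than the specific constructions of the papers above), a deterministic PDE can be solved by expanding its solution in a kernel basis and collocating the equation and the boundary conditions:

```python
# Sketch: -u'' = f on (0,1), u(0) = u(1) = 0, solved as a kernel/collocation problem.
# Kernel choice, lengthscale and the number of collocation points are assumptions.
import numpy as np

s = 0.2                                            # kernel lengthscale (assumed)
k   = lambda x, y: np.exp(-(x - y) ** 2 / (2 * s ** 2))              # Gaussian kernel
d2k = lambda x, y: (((x - y) ** 2 / s ** 4) - 1 / s ** 2) * k(x, y)  # d^2 k / dx^2

f = lambda x: np.pi ** 2 * np.sin(np.pi * x)       # exact solution: u(x) = sin(pi x)
xc = np.linspace(0, 1, 30)                         # collocation points / kernel centres
X, Y = np.meshgrid(xc, xc, indexing="ij")

A = d2k(X, Y)                                      # interior rows enforce u'' = -f
A[[0, -1], :] = k(X, Y)[[0, -1], :]                # boundary rows enforce u = 0
b = -f(xc)
b[[0, -1]] = 0.0

c = np.linalg.lstsq(A, b, rcond=None)[0]           # coefficients of the kernel expansion
u = lambda x: k(x[:, None], xc[None, :]) @ c       # approximate solution

xt = np.linspace(0, 1, 200)
print("max error vs exact solution:", np.abs(u(xt) - np.sin(np.pi * xt)).max())
```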
[ pdf ] introduced the idea of combining compressed sensing with concentration inequalities to learn
SPDEs. Employing sparse approximation to learn differential equations is now a popular topic in Computational
Sciences and Engineering.
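A schematic instance of this general idea (not the algorithm of the paper above; the candidate library, the thresholding rule and the toy ODE dx/dt = -2x + 0.5x^3 are assumptions made for illustration) regresses a numerically differentiated trajectory on a dictionary of candidate terms and keeps only the terms with non-negligible coefficients:

```python
# Toy sparse identification of an ODE right-hand side from a single trajectory.
import numpy as np

t = np.linspace(0, 1, 400)
x = 1.0 / np.sqrt(0.25 + 0.75 * np.exp(4 * t))    # solution of dx/dt = -2x + 0.5x^3, x(0) = 1
dx = np.gradient(x, t)                            # numerical time derivative

names = ["1", "x", "x^2", "x^3", "x^4"]
library = np.column_stack([np.ones_like(x), x, x ** 2, x ** 3, x ** 4])

coef = np.linalg.lstsq(library, dx, rcond=None)[0]
for _ in range(5):                                # sequentially thresholded least squares
    small = np.abs(coef) < 0.1
    coef[small] = 0.0
    keep = ~small
    coef[keep] = np.linalg.lstsq(library[:, keep], dx, rcond=None)[0]

print({n: round(c, 3) for n, c in zip(names, coef) if c != 0.0})
# the surviving terms should be approximately {'x': -2.0, 'x^3': 0.5}
```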
[ pdf | arXiv:1304.7046 ] introduced the idea of automating the discovery of mathematical identities (new Selberg identities).
[ pdf | arXiv:1009.0679 ] developed a worst-case approach to Uncertainty Quantification with unknown functions and measures,
leading to computable optimal (min and max) bounds on quantities of interest. These bounds correspond to infinite-dimensional
optimization problems over spaces of functions and measures; very generally, they can be reduced to finite-dimensional problems
and are hence amenable to computation (a minimal worked instance is given just after this paragraph). The underlying calculus was generalized
in [ pdf | arXiv:1304.6772 | blog ] and [ pdf | arXiv:1308.6306 | blog 1 | blog 2 ] to analyze the robustness of
Bayesian Inference under finite information. This analysis uncovered the extreme sensitivity (brittleness) of Bayesian inference
and suggested that robust inference, in a continuous world under finite information, should be done with
reduced/coarse models rather than highly sophisticated/complex models (with a
level of coarseness/reduction depending on the available finite information). The identification of the mechanism causing brittleness/robustness
(1) demonstrated that accuracy and robustness are antagonistic requirements and (2) raised the question of a missing
stability condition for using Bayesian Inference in a continuous world under finite information.
The study of the lack of robustness of learning algorithms and the design of attacks exploiting this lack of robustness is now
a popular topic in Machine Learning (known as adversarial machine learning).
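As a minimal worked instance of the finite-dimensional reduction mentioned above (with a single mean constraint chosen for illustration): given only that a random variable X takes values in [0, 1] and has mean m ≤ a, the optimal bound on the probability of exceeding a is obtained by optimizing over measures supported on at most two points,

```latex
\sup\Bigl\{\, \mu[X \ge a] \;:\; \mu \in \mathcal{P}([0,1]),\ \mathbb{E}_{\mu}[X] = m \,\Bigr\}
  \;=\; \frac{m}{a},
\qquad \text{attained at } \mu^{\star} = \Bigl(1 - \tfrac{m}{a}\Bigr)\delta_{0} + \tfrac{m}{a}\,\delta_{a},
```

i.e., Markov's inequality is recovered as the optimal bound; with k moment-type constraints, the reduction is to measures supported on at most k + 1 points.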
Kernel Flows
Learning can be formulated as solving an interpolation problem. Although Kriging offers a solution, it does not scale well with
the number of training data points and it requires the prior specification of a kernel.
Kernel Flows (KF) is an algorithm that learns kernels by simulating a data-driven dynamical system (a flow in the input space).
The motivation for developing this algorithm is that it is amenable to some degree of analysis (which could help in the formulation
of a rigorous theory for deep learning) and is scalable to large datasets.
When applied to datasets such as MNIST or Fashion-MNIST, KF produces a kernel capable of generalizing from one sample
per class. The gifs below and above show the application of KF to the Fashion-MNIST dataset and to the Swiss roll cheesecake
classification problem. The bottom left image shows the instantaneous flow generated by KF; the bottom right image shows the
time-averaged flow.
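A minimal sketch of the parametric variant of this idea (my own simplified illustration, not the reference implementation: a toy two-dimensional dataset, a single Gaussian-kernel lengthscale and a finite-difference gradient are all assumptions) adapts the kernel by stochastic descent of the KF loss rho, the relative error incurred when half of a random batch is used in place of the whole batch:

```python
# Kernel Flows loss for one batch, written in its quadratic-form expression:
# rho = 1 - (y_c^T K_c^{-1} y_c) / (y^T K^{-1} y), minimized over the lengthscale.
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, (400, 2))
y = np.sign(X[:, 0] * X[:, 1])                   # toy XOR-like labels (assumed dataset)

def K(A, B, s):                                  # Gaussian kernel matrix
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * s ** 2))

def rho(s, Xb, yb, sub):
    Kb = K(Xb, Xb, s) + 1e-8 * np.eye(len(yb))
    Ks = K(Xb[sub], Xb[sub], s) + 1e-8 * np.eye(len(sub))
    return 1.0 - (yb[sub] @ np.linalg.solve(Ks, yb[sub])) / (yb @ np.linalg.solve(Kb, yb))

log_s, lr, eps = 0.0, 0.5, 1e-3
for _ in range(300):                             # stochastic "flow" on the kernel parameter
    idx = rng.choice(len(X), 64, replace=False)  # random batch
    sub = rng.choice(64, 32, replace=False)      # random half of the batch
    g = (rho(np.exp(log_s + eps), X[idx], y[idx], sub)
         - rho(np.exp(log_s - eps), X[idx], y[idx], sub)) / (2 * eps)
    log_s -= lr * g                              # gradient step on the log-lengthscale
print("learned lengthscale:", np.exp(log_s))
```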
For more, see the following [ slides ] (no videos) and [ slides ] (with videos), the following [ paper ] and the following [ videos ].
Operator adapted wavelets.
The application of the game theoretic approach to the problem of approximating the solutions of linear operators
also leads to the identification of operator adapted wavelets (gamblets) satisfying three properties:
(1) scale orthogonality (in the operator scalar product, which leads to a block-diagonalization of the operator),
(2) well-conditioned linear systems (blocks with uniformly bounded condition numbers), and
(3) localization or exponential decay (which leads to sparse blocks).
These gamblets can also be interpreted as Wannier functions, i.e., linear combinations of eigenfunctions concentrated
around given eigenvalues that also remain concentrated in space. They are identified by conditioning the Gaussian field emerging
from the game described below with respect to a hierarchy of non-adapted wavelets (e.g., Haar wavelets). Although wavelets satisfying
one of these three conditions were known before, this simple Gaussian Process Regression approach led to the first identification
of wavelets satisfying all three properties.
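A small numerical sketch of this construction (with assumptions of mine: a 1D divergence-form operator with a rough coefficient on a fine grid, and piecewise-constant measurement functions at a single coarse scale rather than a full Haar hierarchy) computes gamblets as conditional expectations of the Gaussian field with covariance A^{-1} and exhibits their decay away from the associated coarse interval:

```python
# Gamblets psi_i = E[ xi | Phi xi = e_i ] for xi ~ N(0, A^{-1}), computed as
# Psi = A^{-1} Phi^T (Phi A^{-1} Phi^T)^{-1}; each column decays away from its cell.
import numpy as np

n, m = 256, 16                                   # fine grid size, number of coarse cells
h = 1.0 / n
rng = np.random.default_rng(1)
a = np.exp(rng.uniform(-1, 1, n))                # rough (log-uniform) coefficient

# stiffness matrix of -d/dx( a(x) d/dx ) with zero Dirichlet boundary conditions
A = np.zeros((n - 1, n - 1))
for i in range(n - 1):
    A[i, i] = (a[i] + a[i + 1]) / h
    if i > 0:
        A[i, i - 1] = A[i - 1, i] = -a[i] / h

# measurement functions: indicators of the m coarse intervals (coarsest Haar level)
Phi = np.zeros((m, n - 1))
for j in range(m):
    Phi[j, j * (n // m): (j + 1) * (n // m)] = 1.0

Ainv = np.linalg.inv(A)
Psi = Ainv @ Phi.T @ np.linalg.inv(Phi @ Ainv @ Phi.T)   # one gamblet per column
print("decay of |psi_8| away from its interval:",
      np.round(np.abs(Psi[:, 8])[::32], 6))
```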
Numerical homogenization.
Numerical homogenization can be formalized as the problem of identifying a finite number of basis functions (for a given operator)
that are (1) as accurate as possible (in approximating the solution space of that operator) and (2) as localized as possible.
This problem is non-trivial because (1) and (2) are conflicting requirements, and its solution provides a generalization of results from
classical homogenization obtained under assumptions such as scale separation or ergodicity. The adversarial game approach
to numerical approximation provides a simple solution to this problem, and the resulting basis functions can be identified
as elementary bets in a game whose task is to recover an unknown function from partial measurements.
The optimal strategy for this game is a Gaussian field whose conditioning (on elementary measurements) produces these
elementary bets (gamblets)/basis functions. This interplay between Gaussian Process Regression and Numerical Approximation
provides the first rigorous proof of the screening effect with quantitative exponential decay estimates.
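In the discrete notation of the sketch above (stiffness matrix A, measurement functions collected in the rows of Phi), one way to make the accuracy/localization statement precise is the variational characterization

```latex
\psi_i \;=\; \operatorname*{arg\,min}_{\psi}\bigl\{\, \psi^{T} A\, \psi \;:\; \Phi\psi = e_i \,\bigr\}
      \;=\; \bigl[\, A^{-1}\Phi^{T}\bigl(\Phi A^{-1}\Phi^{T}\bigr)^{-1} \bigr]_{\,\cdot\, i}\,,
```

i.e., the energy-minimizing (accuracy-oriented) basis functions coincide with the conditional expectations of the Gaussian field, and the exponential decay of these minimizers is what makes localization possible.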
Game theoretic approach to numerical approximation and algorithm design.
The three main approaches to uncertainty quantification (worst case, Bayesian, adversarial game/decision theory)
can be translated into three main approaches to numerical analysis and algorithm design. The worst-case
approach is the classical approach of numerical analysis. The Bayesian approach is the average-case approach
of Information Based Complexity and Bayesian numerical analysis, which is currently re-emerging as probabilistic numerics.
The game theoretic approach stems from the observation that to compute fast one must compute with partial information
over hierarchies of levels of complexity. This process of computing with partial information can be turned into a process of
playing repeated adversarial games against the missing information. The optimal (mixed/randomized) strategies for playing such
games can be used to automate the process of discovery in numerical analysis and algorithm design (i.e., to learn
numerical solvers through a machine learning approach to approximating the solutions of ODEs/PDEs).
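One way to state this game (in the discrete notation of the gamblet sketch above, as an illustration rather than the general infinite-dimensional formulation): an adversary picks the unknown u, the numerical analyst only sees the measurements Phi u and picks a recovery map v, and the loss is the relative error in the operator norm,

```latex
\min_{v}\; \max_{u \neq 0}\; \frac{\|u - v(\Phi u)\|_{A}}{\|u\|_{A}},
\qquad \|u\|_{A}^{2} := u^{T} A\, u .
```

Lifting the game to mixed (randomized) strategies, an optimal mixed strategy for the adversary is the Gaussian field xi ~ N(0, A^{-1}), and the optimal recovery map is the conditional expectation v(y) = E[xi | Phi xi = y], i.e., exactly the gamblet-based recovery described above.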
Stochastic and Multiscale analysis.
My older research work concerns
(1) The study of anomalous diffusion in systems characterized by an infinite number of scales. [ paper ]
(2) Homogenization and inverse homogenization with a continuum of scales.
(3) Numerical Homogenization. [ paper ]
(4) The geometric integration of stochastic Hamiltonian systems. [ paper ]
(5) The structure preserving integration of (possibly stochastic) multiscale systems with hidden slow variables. [ paper ]
(6) Super-resonance. Control via parametric resonance.
(7) The proof of Davies’ and periodization conjectures in homogenization. [ paper ] [ paper ]
Stochastic and derandomization calculus in infinite dimensional spaces.
The worst case, Bayesian and decision theory approaches to Uncertainty Quantification require the development of a form of
calculus enabling the manipulation of infinite-dimensional information structures, involving the resolution of optimization
(min, max and minimax) problems over spaces of measures and functions, and over spaces of measures over measures and functions.
Another motivation for the development of this calculus is its potential application to (1) the design of fast, robust and
scalable numerical solvers and the analysis of partial differential equations, (2) model-form uncertainty quantification and design,
and (3) model reduction. Although these applications may appear, at first glance, dissimilar, they are all based on the efficient
processing of incomplete information with limited computational resources (i.e., under constraints on complexity-management
capabilities): (1) fast and robust computation requires computation with partial information (this concept forms the core
of Information Based Complexity, where it is also augmented by the concepts of contaminated and priced information associated
with, for example, truncation errors and the cost of numerical operations); (2) model-form UQ and design require the
management and processing of epistemic uncertainties and limited data; and (3) model reduction requires the approximation
of the full state of a complex system through operations performed on a few (coarse/reduced) variables.
Examples of applications include Optimal Uncertainty Quantification and the reduction of infinite-dimensional optimization
problems over measures and functions to finite-dimensional ones, the identification of new Selberg identities [ paper ],
the quantification of the robustness of Bayesian inference under finite information, and the identification of
Gaussian fields as optimal mixed strategies for the game theoretic approach to numerical approximation.
Research supported by
• AFOSR
• BEYOND LIMITS
• DARPA
• DOE
• LANL
• NASA/JPL
• NNSA
• NSF
• ONR
• UTRC
Uncertainty Quantification
The work of my group in this area has been focused on adversarial approaches to Uncertainty Quantification (worst-case
min-max and game-theoretic minimax). The following [ slides ] (Caltech, Nov. 2013) provide a high-level description of this research.
The following extract from this [ paper ] describes our general vision (based on Decision/Game theory):
During the past century, the need to solve large complex problems in applications such as fluid dynamics, neutron
transport or ballistic prediction drove the parallel development of computers and numerical methods for solving ODEs and PDEs.
It is now clear that this development led to a paradigm shift. Before: each new PDE required the development of new theoretical
methods and the employment of large teams of mathematicians and physicists; in most cases, information on
solutions was only qualitative and based on general analytical bounds on fundamental solutions.
After: mathematical analysis and computer science worked in synergy to give birth to robust numerical methods
(such as finite element methods) capable of solving a large spectrum of PDEs without requiring the level of
expertise of an A. L. Cauchy or the level of insight of an R. P. Feynman. This transformation can be traced back to sophisticated
calculations performed by arrays of human computers organized as parallel clusters, such as in the pioneering work of
Lewis Fry Richardson, who in 1922 had a room full of clerks attempt to solve finite-difference equations for the purposes of
weather forecasting, and the 1947 paper by John von Neumann and Herman Goldstine on Numerical Inverting of Matrices of
High Order. Although Richardson's predictions failed due to the use of unfiltered data/initial conditions/equations and large
time-steps not satisfying the CFL stability condition, his vision was shared by von Neumann in his proposal of the Meteorology
Research Project to the U.S. Navy in 1946, qualified by Platzman as "perhaps the most visionary prospectus for numerical
weather prediction since the publication of Richardson's book a quarter-century earlier."
The past century has also seen a steady increase in the need to estimate and predict complex systems and to
make (possibly critical) decisions with limited information. Although computers have made possible the numerical evaluation
of sophisticated statistical models, these models are still designed by humans through the employment of
multi-disciplinary teams of physicists, computer scientists and statisticians. Contrary to the original human computers
(such as the ones pioneered by L. F. Richardson or overseen by R. P. Feynman at Los Alamos), these human
teams do not follow a specific algorithm (such as the one envisioned in Richardson's Forecast Factory, where 64,000 human
computers would have been working in parallel and at high speed to compute world weather charts), because there is currently
no known recipe or algorithm for dividing the design of a statistical model into a sequence of arithmetic operations.
Furthermore, while human computers were given a specific PDE or ODE to solve, these human teams are not given a
well-posed problem with a well-defined notion of solution. As a consequence, different human teams come
up with different designs for the statistical model, along with different estimates of the associated uncertainties.
Indeed, enabling computers to think as humans do when faced with uncertainty
is challenging in several major ways: (1) there is currently no known recipe or algorithm for dividing the design of a statistical
model into a sequence of arithmetic operations; (2) formulating the search for an optimal statistical estimator/model as a well-posed
problem is not obvious when information on the system of interest is incomplete and comes in the form of a
complex combination of sample data, partial knowledge of constitutive relations and a limited description of the
distribution of input random variables; and (3) the space of admissible scenarios, along with the space of relevant information,
assumptions, and/or beliefs, tends to be infinite-dimensional, whereas calculus on a computer is necessarily discrete and finite.
One purpose of UQ is to explore and develop the foundations of a rigorous/rational framework for the
scientific computation of optimal statistical estimators/models and the optimal quantification of uncertainties
for complex systems. The emerging questions lead to many challenging problems in Decision Theory,
Machine Learning, Bayesian Inference, Stochastic Optimization, Robust Optimization, and Information Based Complexity,
the most fundamental of these being the simultaneous emphasis on computation and performance, as in the Machine Learning
framework initiated by Valiant.
For more, see the following [ slides ], the following [ paper 1, paper 2 ] and the following interview:
HPC Wire: The Masters of Uncertainty (09/2013)
“Technology, in common with many other activities, tends toward avoidance of risks
by investors. Uncertainty is ruled out if possible. People generally prefer the
predictable. Few recognize how destructive this can be, how it imposes severe
limits on variability and thus makes whole populations fatally vulnerable to the
shocking ways our universe can throw the dice.”
Frank Herbert (Heretics of Dune)
Brittleness/Robustness of machine learning algorithms
Under grant FA9550-12-1-0389, we derived brittleness (non-robustness) results for Bayesian estimators
(see the following blog posts: [09/2013], [01/2015], [01/2016]).
Since this lack of robustness is caused by a mechanism that (1) is inherent to doing inference in a continuous world with
finite information, and (2) implies that consistency and robustness are conflicting requirements, we predicted
that deep/machine learning algorithms could be non-robust and could lead to increased confidence in incorrect solutions
(see the [ talk ] by Mike McKerns at SciPy 2013, published in July 2013). Google engineers who were present at the talk tested these
predictions for neural networks and observed (Szegedy et al., Dec. 2013, Intriguing Properties of Neural Networks) the non-robustness
of these algorithms to adversarial examples. This area has grown into the field known as adversarial machine learning,
which studies (1) how these algorithms can be attacked by exploiting their brittleness and (2) how to protect them
against such attacks. Addressing these vulnerabilities is now recognized as critical to the safety of machine learning algorithms.
Kernel mode decomposition and programmable/interpretable regression networks
Kernel mode decomposition networks (KMDNets) are interpretable regression networks for pattern recognition. KMDNets are
programmable and amenable to analysis, whereas artificial neural networks are trainable but neither interpretable nor programmable.
The programming of KMDNets is achieved by assembling elementary modules that decompose and recompose kernels and data.
These elementary modules are repeated across levels of abstraction and interpreted from the equivalent perspectives of optimal
recovery, game theory and Gaussian process regression (GPR). The mode decomposition module produces an approximation of a
vector (whose entries are elements of linear subspaces of a common Hilbert space) from the observation of the sum of its entries.
The second module recomposes modes and kernels based on alignments between the signal and the recovered modes. As a
prototypical application, KMDNets have been programmed to recover (to near-machine precision) the modes of a (possibly noisy)
signal when those modes are unknown periodic functions that are slowly modulated in amplitude and frequency (when the periodic
waveforms are trigonometric, these modes are known as intrinsic mode functions in the setting of empirical mode decomposition).
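A minimal sketch of the elementary mode-decomposition module in its GPR interpretation (the toy signal, the two Gaussian kernels and the noise level are assumptions of mine; the alignment-based recomposition module and the near-machine-precision recovery of modulated waveforms require the full construction of the paper):

```python
# Elementary mode decomposition: observe v = v1 + v2 (+ noise) and recover each mode
# as the GPR posterior mean vi = Ki (K1 + K2 + sigma^2 I)^{-1} v, with one kernel per mode.
import numpy as np

t = np.linspace(0, 1, 300)[:, None]
v1_true = np.sin(2 * np.pi * t[:, 0])                     # slow mode
v2_true = 0.3 * np.sin(40 * np.pi * t[:, 0])              # fast mode
v = v1_true + v2_true + 0.01 * np.random.default_rng(0).normal(size=300)

gauss = lambda s: np.exp(-(t - t.T) ** 2 / (2 * s ** 2))  # Gaussian kernel matrix on t
K1, K2 = gauss(0.2), gauss(0.01)                          # one kernel per mode (assumed)
w = np.linalg.solve(K1 + K2 + 1e-4 * np.eye(300), v)
v1, v2 = K1 @ w, K2 @ w                                   # recovered modes

print("mode recovery errors:", np.abs(v1 - v1_true).max(), np.abs(v2 - v2_true).max())
```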
For more, see the following [ paper ] and the following [ slides ].