H. Akaike, A new look at the statistical model identification, IEEE Transactions on Automatic Control, vol.19, pp.716-723, 1974.

Z. Allen-Zhu, Y. Li, and Z. Song, A convergence theory for deep learning via over-parameterization, Proceedings of the 36th International Conference on Machine Learning, vol.97, pp.242-252, 2019.

N. Alon, Y. Matias, and M. Szegedy, The space complexity of approximating the frequency moments, Journal of Computer and System Sciences, vol.58, issue.1, pp.137-147, 1999.

P. Alquier, Density estimation with quadratic loss: a confidence intervals method, ESAIM: Probability and Statistics, vol.12, pp.438-463, 2008.
URL : https://hal.archives-ouvertes.fr/hal-00020740

P. Alquier, PAC-Bayesian bounds for randomized empirical risk minimizers, Mathematical Methods of Statistics, vol.17, issue.4, pp.279-304, 2008.
URL : https://hal.archives-ouvertes.fr/hal-00354922

P. Alquier and J. Ridgway, Concentration of tempered posteriors and of their variational approximations, 2017.

P. Alquier, J. Ridgway, and N. Chopin, On the properties of variational approximations of Gibbs posteriors, The Journal of Machine Learning Research, vol.17, issue.1, pp.8374-8414, 2016.
URL : https://hal.archives-ouvertes.fr/hal-02403354

A. Ambroladze, E. Parrado-Hernández, and J. Shawe-Taylor, Tighter PAC-Bayes bounds, Advances in Neural Information Processing Systems, pp.9-16, 2007.

N. Amenta, M. Bern, D. Eppstein, and S. H. Teng, Regression depth and center points, Discrete and Computational Geometry, vol.23, issue.3, pp.305-323, 2000.

D. Andrews, Non-strong mixing autoregressive processes, Journal of Applied Probability, vol.21, issue.4, pp.930-934, 1984.

C. Andrieu, N. de Freitas, A. Doucet, and M. I. Jordan, An introduction to MCMC for machine learning, Machine Learning, vol.50, issue.1-2, pp.5-43, 2003.

S. Arlot, A. Celisse, and Z. Harchaoui, A kernel multiple change-point algorithm via model selection, 2012.
URL : https://hal.archives-ouvertes.fr/hal-00671174

J. Audibert, Fast learning rates in statistical inference through aggregation, The Annals of Statistics, vol.37, issue.4, pp.1591-1646, 2009.
URL : https://hal.archives-ouvertes.fr/hal-00139030

S. Ayer and H. Sawhney, Layered representation of motion video using robust maximum-likelihood estimation of mixture models and MDL encoding, International Conference on Computer Vision, 1995.

A. Bacharoglou, Approximation of probability distributions by convex mixtures of Gaussian measures, Proceedings of the American Mathematical Society, vol.138, pp.2619-2628, 2010.

P. Baldi and K. Hornik, Neural networks and principal component analysis: Learning from examples without local minima, Neural Networks, vol.2, pp.53-58, 1989.

A. Banerjee, On Bayesian bounds, Proceedings of ICML, pp.81-88, 2006.

S. Banerjee, I. Castillo, and S. Ghosal, Bayesian inference in high-dimensional models, 2020.

Y. Baraud and L. Birgé, Rho-estimators revisited: general theory and applications, 2016.
URL : https://hal.archives-ouvertes.fr/hal-01314781

Y. Baraud, L. Birgé, and M. Sart, A new method for estimation and model selection: rho-estimation, Inventiones mathematicae, vol.207, issue.2, pp.425-517, 2017.
URL : https://hal.archives-ouvertes.fr/hal-00966808

A. Barron, Universal approximation bounds for superpositions of a sigmoidal function, IEEE Transactions on Information Theory, vol.39, pp.930-945, 1993.

A. Barron, M. J. Schervish, and L. Wasserman, The consistency of posterior distributions in nonparametric problems, The Annals of Statistics, vol.27, issue.2, pp.536-561, 1999.

A. R. Barron, Approximation and estimation bounds for artificial neural networks, Machine Learning, vol.14, pp.115-133, 1994.

P. L. Bartlett, D. J. Foster, and M. J. Telgarsky, Spectrally-normalized margin bounds for neural networks, Advances in Neural Information Processing Systems, vol.30, pp.6240-6249, 2017.

L. E. Baum and T. Petrie, Statistical inference for probabilistic functions of finite state Markov chains, Ann. Math. Statist., vol.37, issue.6, pp.1554-1563, 1966.

T. Bayes, An essay towards solving a problem in the doctrine of chances, Philosophical Transactions of the Royal Society of London, issue.53, pp.370-418, 1763.

G. Behrens, N. Friel, and M. Hurn, Tuning tempered transitions, Statistics and computing, vol.22, issue.1, pp.65-78, 2012.

G. Bellec, D. Kappel, W. Maass, and R. Legenstein, Deep rewiring: Training very sparse deep networks, International Conference on Learning Representations, 2018.

Y. Bengio and O. Delalleau, On the expressive power of deep architectures, Proceedings of the 22nd International Conference on Algorithmic Learning Theory, ALT'11, pp.18-36, 2011.

R. Beran, Minimum Hellinger distance estimates for parametric models, The Annals of Statistics, vol.5, issue.3, pp.445-463, 1977.

S. Bernstein, Theory of probability, 1917.

E. Bernton, P. E. Jacob, M. Gerber, and C. P. Robert, Inference in generative models using the Wasserstein distance, 2017.
URL : https://hal.archives-ouvertes.fr/hal-01517550

A. Bhattacharya, D. Pati, and Y. Yang, Bayesian fractional posteriors, 2016.

A. Bhattacharya, D. Pati, and Y. Yang, On statistical optimality of variational Bayes, PMLR: Proceedings of AISTATS, vol.84, 2018.

G. Biau, B. Cadre, M. Sangnier, and U. Tanielian, Some theoretical properties of GANs, 2018.
URL : https://hal.archives-ouvertes.fr/hal-01737975

P. J. Bickel, Another look at robustness: A review of reviews and some new developments, Scandinavian Journal of Statistics, vol.3, pp.145-168, 1976.

C. Biernacki, G. Celeux, and G. Govaert, An improvement of the NEC criterion for assessing the number of clusters in a mixture model, Pattern Recognition Letters, vol.20, issue.3, pp.267-272, 1999.

L. Birgé, Approximation dans les espaces métriques et théorie de l'estimation, Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete, vol.65, pp.181-237, 1983.

L. Birgé, Model selection via testing: an alternative to (penalized) maximum likelihood estimators, 2006.

C. Bishop, Variational principal components, Proceedings Ninth International Conference on Artificial Neural Networks, ICANN'99, vol.1, pp.509-514, 1999.

C. M. Bishop, Pattern Recognition and Machine Learning, 2006.

P. G. Bissiri, C. C. Holmes, and S. G. Walker, A general framework for updating belief distributions, Journal of the Royal Statistical Society: Series B, vol.78, issue.5, pp.1103-1130, 2016.

D. M. Blei, A. Kucukelbir, and J. D. Mcauliffe, Variational inference: A review for statisticians, Journal of the American Statistical Association, vol.112, issue.518, pp.859-877, 2017.

D. M. Blei and J. D. Lafferty, Dynamic topic models, Proceedings of the 23rd International Conference on Machine Learning, pp.113-120, 2006.

D. M. Blei, A. Y. Ng, and M. I. Jordan, Latent Dirichlet allocation, The Journal of Machine Learning Research, vol.3, pp.993-1022, 2003.

C. Blundell, J. Cornebise, K. Kavukcuoglu, and D. Wierstra, Weight uncertainty in neural networks, Proceedings of the 32nd International Conference on International Conference on Machine Learning, vol.37, pp.1613-1622, 2015.

S. Boucheron, G. Lugosi, and P. Massart, Concentration inequalities using the entropy method, Ann. Probab, vol.31, issue.3, pp.1583-1614, 2003.

S. Boucheron, G. Lugosi, and P. Massart, Concentration inequalities, 2012.
URL : https://hal.archives-ouvertes.fr/hal-00751496

C. Bouveyron and C. Brunet-Saumard, Model-based clustering of high-dimensional data: a review, Computational Statistics and Data Analysis, vol.71, pp.52-78, 2014.
URL : https://hal.archives-ouvertes.fr/hal-00750909

L. Breiman, Statistical modeling: The two cultures (with comments and a rejoinder by the author), Statistical science, vol.16, issue.3, pp.199-231, 2001.

L. Breiman, L. Le Cam, and L. Schwartz, Consistent estimates and zero-one sets, The Annals of Mathematical Statistics, vol.35, issue.1, pp.157-161, 1964.

F.-X. Briol, A. Barp, A. B. Duncan, and M. Girolami, Statistical inference for generative models via maximum mean discrepancy, 2019.

T. Broderick, N. Boyd, A. Wibisono, A. C. Wilson, and M. I. Jordan, Streaming variational Bayes, NIPS, pp.1727-1735, 2013.

S. Bubeck, Introduction to online optimization. Lecture notes (Princeton University), 2011.

A. Buchholz, F. Wenzel, and S. Mandt, Quasi-Monte Carlo variational inference, Proceedings of the 35th International Conference on Machine Learning, vol.80, pp.668-677, 2018.

F. Bunea, A. B. Tsybakov, and M. H. Wegkamp, Sparse density estimation with ℓ1 penalties, Conference on Computational Learning Theory, 2007.
URL : https://hal.archives-ouvertes.fr/hal-00160850

F. Bunea, A. B. Tsybakov, and M. H. Wegkamp, SPADES and mixture models, The Annals of Statistics, vol.38, issue.4, pp.2525-2558, 2010.
URL : https://hal.archives-ouvertes.fr/hal-00514124

T. Cai, Z. Ma, and Y. Wu, Optimal estimation and rank detection for sparse spiked covariance matrices. Probability Theory and Related Fields, vol.161, pp.781-815, 2015.

T. Campbell and X. Li, Universal boosting variational inference, 2019.

P. Carbonetto and M. Stephens, Scalable variational inference for Bayesian variable selection in regression, and its accuracy in genetic association studies, Bayesian analysis, vol.7, issue.1, pp.73-108, 2012.

L. Carel and P. Alquier, Simultaneous dimension reduction and clustering via the NMF-EM algorithm, 2017.

I. Castillo, Bayesian nonparametrics, convergence and limiting shape of posterior distributions. Habilitation à diriger des recherches, 2014.
URL : https://hal.archives-ouvertes.fr/tel-01096755

I. Castillo, J. Schmidt-Hieber, and A. van der Vaart, Bayesian linear regression with sparse priors, The Annals of Statistics, vol.43, issue.5, pp.1986-2018, 2015.
URL : https://hal.archives-ouvertes.fr/hal-01226832

O. Catoni, Statistical Learning Theory and Stochastic Optimization. Saint-Flour Summer School on Probability Theory, Lecture Notes in Mathematics, 2001.
URL : https://hal.archives-ouvertes.fr/hal-00104952

O. Catoni, PAC-Bayesian supervised classification: the thermodynamics of statistical learning, Institute of Mathematical Statistics Lecture Notes-Monograph Series, vol.56, 2007.
URL : https://hal.archives-ouvertes.fr/hal-00206119

O. Catoni, Challenging the empirical mean and empirical variance: a deviation study, Annales de l'IHP Probabilités et statistiques, vol.48, issue.4, pp.1148-1185, 2012.
URL : https://hal.archives-ouvertes.fr/hal-00517206

O. Catoni and I. Giulini, Dimension free PAC-Bayesian bounds for the estimation of the mean of a random vector. PAC-Bayesian trends and insights, NIPS-2017 Workshop (Almost) 50 Shades of Bayesian Learning, 2017.

G. Celeux, S. Frühwirth-Schnatter, and C. P. Robert, Handbook of Mixture Analysis, 2018.
URL : https://hal.archives-ouvertes.fr/hal-01928103

N. Cesa-Bianchi and G. Lugosi, Prediction, learning, and games, 2006.

E. Challis and D. Barber, Gaussian Kullback-Leibler approximate inference, The Journal of Machine Learning Research, vol.14, issue.1, pp.2239-2286, 2013.

A. Chambaz, A. Garivier, and E. Gassiat, A minimum description length approach to hidden Markov models with Poisson and Gaussian emissions. Application to order identification, Journal of Statistical Planning and Inference, vol.139, issue.3, pp.962-977, 2009.

T. M. Chan, An optimal randomized algorithm for maximum Tukey depth, Proceedings of the fifteenth annual ACM-SIAM symposium on Discrete algorithms, 2004.

M. Chen, C. Gao, and Z. Ren, Robust covariance and scatter matrix estimation under Huber's contamination model, The Annals of Statistics, vol.46, issue.5, pp.1932-1960, 2018.

Y. Cheng, I. Diakonikolas, and R. Ge, High-dimensional robust mean estimation in nearly-linear time, Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, pp.2755-2771, 2019.

Y. Cherapanamjeri, N. Flammarion, and P. L. Bartlett, Fast mean estimation with sub-Gaussian rates, 2019.

B. Chérief-Abdellatif, Consistency of ELBO maximization for model selection, Proceedings of The 1st Symposium on Advances in Approximate Bayesian Inference, vol.96, pp.11-31, 2019.

B. Chérief-Abdellatif, Convergence rates of variational inference in sparse deep learning, 2019.

B. Chérief-Abdellatif and P. Alquier, Consistency of variational Bayes inference for estimation and model selection in mixtures, Electronic Journal of Statistics, vol.12, issue.2, pp.2995-3035, 2018.

B. Chérief-Abdellatif and P. Alquier, Finite sample properties of parametric MMD estimation: robustness to misspecification and dependence, 2019.

B. Chérief-Abdellatif and P. Alquier, MMD-Bayes: Robust Bayesian estimation via Maximum Mean Discrepancy, Proceedings of The 2nd Symposium on Advances in Approximate Bayesian Inference, vol.118, pp.1-21, 2020.

B. Chérief-Abdellatif, P. Alquier, and M. E. Khan, A generalization bound for online variational inference, Proceedings of The Eleventh Asian Conference on Machine Learning, vol.101, pp.662-677, 2019.

G. Chinot, G. Lecué, and M. Lerasle, Robust statistical learning with Lipschitz and convex loss functions, Probability Theory and Related Fields, pp.1-44, 2019.
URL : https://hal.archives-ouvertes.fr/hal-01923033

O. Collier and A. S. Dalalyan, Minimax estimation of a p-dimensional linear functional in sparse Gaussian models and robust estimation of the mean, 2017.

V. Cottet and P. Alquier, 1-bit matrix completion: PAC-Bayesian analysis of a variational approximation, Machine Learning, vol.107, issue.3, pp.579-603, 2018.

N. V. Cuong, L. S. Ho, and V. Dinh, Generalization and robustness of batched weighted average algorithm with V-geometrically ergodic Markov data, International Conference on Algorithmic Learning Theory, pp.264-278, 2013.

G. Cybenko, Approximation by superpositions of a sigmoidal function, Mathematics of Control, Signals, and Systems (MCSS), vol.2, pp.303-314, 1989.

B. Dai, N. He, H. Dai, and L. Song, Provable Bayesian inference via particle mirror descent, AISTATS, pp.985-994, 2016.

A. S. Dalalyan, E. Grappin, and Q. Paris, On the exponentially weighted aggregate with the Laplace prior, The Annals of Statistics, vol.46, issue.5, pp.2452-2478, 2018.

A. S. Dalalyan and M. Sebbar, Optimal Kullback-Leibler aggregation in mixture density estimation by maximum likelihood, 2017.

A. S. Dalalyan and A. B. Tsybakov, Aggregation by exponential weighting and sharp oracle inequalities, Learning Theory, vol.4539, pp.97-111, 2007.
URL : https://hal.archives-ouvertes.fr/hal-00160857

P. Deb, W. Gallo, P. Ayyagari, J. Fletcher, and J. Sindelar, The effect of job loss on overweight and drinking, Journal of Health Economics, 2011.

J. Dedecker, P. Doukhan, G. Lang, J. R. León, S. Louhichi, and C. Prieur, Weak dependence: With examples and applications, 2007.
URL : https://hal.archives-ouvertes.fr/hal-00141567

A. Dempster, N. Laird, and D. Rubin, Maximum likelihood from incomplete data via the EM algorithm, Journal of the Royal Statistical Society. Series B (Methodological), vol.39, issue.1, pp.1-38, 1977.

J. Depersin and G. Lecué, Robust sub-Gaussian estimation of a mean vector in nearly linear time, 2019.

L. Devroye, M. Lerasle, G. Lugosi, and R. I. Oliveira, Sub-Gaussian mean estimators, The Annals of Statistics, vol.44, issue.6, pp.2695-2725, 2016.
URL : https://hal.archives-ouvertes.fr/hal-01204519

L. Devroye and G. Lugosi, Combinatorial methods in density estimation, 2001.

I. Diakonikolas, G. Kamath, D. M. Kane, J. Li, A. Moitra, and A. Stewart, Robustly learning a Gaussian: Getting optimal error, efficiently, Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, pp.2683-2702, 2018.

I. Diakonikolas, G. Kamath, D. M. Kane, J. Li, A. Moitra, and A. Stewart, Robust estimators in high dimensions without the computational intractability, 2016 IEEE 57th Annual Symposium on Foundations of Computer Science (FOCS), 2016.

I. Diakonikolas and D. M. Kane, Recent advances in algorithmic high-dimensional robust statistics, 2019.

I. Diakonikolas, D. M. Kane, and A. Stewart, Statistical query lower bounds for robust estimation of high-dimensional Gaussians and Gaussian mixtures, 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS), pp.73-84, 2017.

I. Diakonikolas, D. M. Kane, and A. Stewart, List-decodable robust mean estimation and learning mixtures of spherical Gaussians, Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, 2018.

I. Diakonikolas, W. Kong, and A. Stewart, Efficient algorithms and lower bounds for robust linear regression, 2018.

M. N. Do, Fast approximation of Kullback-Leibler distance for dependence trees and hidden Markov models, IEEE Signal Processing Letters, vol.10, issue.4, pp.115-118, 2003.

J. Domke, Provable gradient variance guarantees for black-box variational inference, Advances in Neural Information Processing Systems, pp.328-337, 2019.

D. L. Donoho and R. C. Liu, The automatic robustness of minimum distance functionals, The Annals of Statistics, pp.552-586, 1988.

D. L. Donoho and R. C. Liu, Pathologies of some minimum distance estimators. The Annals of Statistics, pp.587-608, 1988.

J. L. Doob, Application of the theory of martingales. Le Calcul des Probabilités et ses Applications, Colloques Internationaux du CNRS, issue.13, pp.23-27, 1949.

A. Doucet, N. de Freitas, and N. Gordon, Sequential Monte Carlo Methods in Practice, 2001.

A. Doucet and A. Johansen, A tutorial on particle filtering and smoothing: Fifteen years later. Handbook of Nonlinear Filtering, p.12, 2009.

P. Doukhan, Mixing: properties and examples, Lecture Notes in Statistics, vol.85, 1994.

P. Doukhan and S. Louhichi, A new weak dependence condition and applications to moment inequalities, Stochastic Processes and their Applications, vol.84, 1999.

S. Du, J. Lee, H. Li, L. Wang, and X. Zhai, Gradient descent finds global minima of deep neural networks, Proceedings of the 36th International Conference on Machine Learning, vol.97, pp.1675-1685, 2019.

S. S. Du, S. Balakrishnan, and A. Singh, Computationally efficient robust estimation of sparse functionals, 2017.

G. K. Dziugaite, D. M. Roy, and Z. Ghahramani, Training generative neural networks via maximum mean discrepancy optimization, 2015.

S. Forth, P. Hovland, E. Phipps, J. Utke, and A. Walther, Recent Advances in Algorithmic Differentiation, 2014.

K. Fukumizu, L. Song, and A. Gretton, Kernel Bayes' rule: Bayesian inference with positive definite kernels, Journal of Machine Learning Research, vol.14, pp.3002-3048, 2013.

F. Futami, I. Sato, and M. Sugiyama, Variational inference based on robust divergences, Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, vol.84, pp.813-822, 2018.

Y. Gal, Uncertainty in Deep Learning, 2016.

C. Gao, J. Liu, Y. Yao, and W. Zhu, Robust estimation and generative adversarial nets, 2019.

E. Gassiat, J. Rousseau, and E. Vernet, Efficient semiparametric estimation and model selection for multidimensional mixtures, Electronic Journal of Statistics, vol.12, issue.1, pp.703-740, 2018.
URL : https://hal.archives-ouvertes.fr/hal-01345919

S. Gerchinovitz, Sparsity regret bounds for individual sequences in online linear regression, The Journal of Machine Learning Research, vol.14, issue.1, pp.729-769, 2013.
URL : https://hal.archives-ouvertes.fr/inria-00552267

P. Germain, F. Bach, A. Lacoste, and S. Lacoste-Julien, PAC-Bayesian theory meets Bayesian inference, Advances in Neural Information Processing Systems, pp.1884-1892, 2016.
URL : https://hal.archives-ouvertes.fr/hal-01324072

S. Ghosal, J. K. Ghosh, and R. Ramamoorthi, Consistency issues in Bayesian nonparametrics, Asymptotics, Nonparametrics and Time Series: A Tribute to Madan Lal Puri, pp.639-667, 1999.

S. Ghosal, J. K. Ghosh, and A. W. van der Vaart, Convergence rates of posterior distributions, Annals of Statistics, pp.500-531, 2000.

S. Ghosal and A. van der Vaart, Fundamentals of nonparametric Bayesian inference, vol.44, 2017.

S. Ghosal and A. van der Vaart, Convergence rates of posterior distributions for non-i.i.d. observations, The Annals of Statistics, vol.35, issue.1, pp.192-223, 2007.

S. Ghosal and A. W. van der Vaart, Entropies and rates of convergence for maximum likelihood and Bayes estimation for mixtures of normal densities, Annals of Statistics, pp.1233-1263, 2001.

A. Ghosh and A. Basu, Robust Bayes estimation using the density power divergence, Annals of the Institute of Statistical Mathematics, vol.68, pp.413-437, 2016.

I. Giulini, Robust PCA and pairs of projections in a Hilbert space, Electronic Journal of Statistics, vol.11, issue.2, pp.3903-3926, 2017.

I. Giulini, Robust dimension-free Gram operator estimates, Bernoulli, vol.24, issue.4B, pp.3864-3923, 2018.

I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning, 2016.

I. Goodfellow, J. Pouget-abadie, M. Mirza, B. Xu, D. Warde-farley et al., Generative adversarial nets, Advances in Neural Information Processing Systems, vol.27, pp.2672-2680, 2014.

A. Graves, Practical variational inference for neural networks, Advances in Neural Information Processing Systems, vol.24, pp.2348-2356, 2011.

A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola, A kernel two-sample test, Journal of Machine Learning Research, vol.13, pp.723-773, 2012.

A. Gretton, K. Fukumizu, Z. Harchaoui, and B. K. Sriperumbudur, A fast, consistent kernel two-sample test, Advances in neural information processing systems, pp.673-681, 2009.

P. Grohs, D. Perekrestenko, D. Elbrächter, and H. Bölcskei, Deep neural network approximation theory, 2019.

P. Grünwald, Model selection based on minimum description length, Journal of Mathematical Psychology, vol.44, issue.1, pp.133-152, 2000.

P. Grünwald and T. van Ommen, Inconsistency of Bayesian inference for misspecified linear models, and a proposal for repairing it, Bayesian Analysis, vol.12, issue.4, pp.1069-1103, 2017.

B. Guedj, A primer on PAC-Bayesian learning, 2019.
URL : https://hal.archives-ouvertes.fr/hal-01983732

R. Guhaniyogi, R. M. Willett, and D. B. Dunson, Approximated Bayesian inference for massive streaming data, 2013.

L. P. Hansen, Large sample properties of generalized method of moments estimators, Econometrica: Journal of the Econometric Society, pp.1029-1054, 1982.

M. H. Hansen and B. Yu, Model selection and the principle of minimum description length, Journal of the American Statistical Association, vol.96, issue.454, pp.746-774, 2001.

S. Hayakawa and T. Suzuki, On the minimax optimality and superiority of deep neural network learning over sparse parameter spaces, 2019.

E. Hazan, Introduction to online convex optimization, Foundations and Trends in Optimization, vol.2, issue.3-4, pp.157-325, 2016.

J. Hershey and P. Olsen, Approximating the Kullback-Leibler divergence between Gaussian mixture models, IEEE International Conference on Acoustics, Speech and Signal Processing, vol.4, 2007.

G. E. Hinton and D. Van-camp, Keeping the neural networks simple by minimizing the description length of the weights, Proceedings of the Sixth Annual Conference on Computational Learning Theory, COLT '93, pp.5-13, 1993.

M. Hoffman, F. R. Bach, and D. M. Blei, Online learning for latent Dirichlet allocation, Advances in Neural Information Processing Systems, pp.856-864, 2010.

M. D. Hoffman, D. M. Blei, C. Wang, and J. Paisley, Stochastic variational inference, The Journal of Machine Learning Research, vol.14, issue.1, pp.1303-1347, 2013.

M. J. Holland, Distribution-robust mean estimation via smoothed random perturbations, 2019.

M. J. Holland, PAC-Bayes under potentially heavy tails, 2019.

G. Hooker and A. N. Vidyashankar, Bayesian model robustness via disparities, Test, vol.23, issue.3, pp.556-584, 2014.

S. B. Hopkins, Sub-Gaussian mean estimation in polynomial time, 2019.

D. Hsu and S. Sabato, Loss minimization and parameter estimation with heavy tails, JMLR, vol.17, pp.1-40, 2016.

P. J. Huber, Robust estimation of a location parameter, The Annals of Mathematical Statistics, vol.35, pp.73-101, 1964.

J. H. Huggins, T. Campbell, M. Kasprzak, and T. Broderick, Practical bounds on the error of Bayesian posterior approximations: A nonasymptotic approach, 2018.

M. Imaizumi and K. Fukumizu, Deep neural networks learn non-smooth functions effectively, Proceedings of Machine Learning Research, vol.89, pp.869-878, 2019.

P. Jaiswal, H. Honnappa, and V. A. Rao, Risk-sensitive variational Bayes: Formulations and bounds, 2019.

P. Jaiswal, V. A. Rao, and H. Honnappa, Asymptotic consistency of α-Rényi-approximate posteriors, 2019.

G. Jerfel, An information theoretic interpretation of variational inference based on the MDL principle and the bits-back coding scheme, 2017.

M. Jerrum, L. Valiant, and V. Vazirani, Random generation of combinatorial structures from a uniform distribution, Theoretical Computer Science, vol.43, pp.169-188, 1986.

J. Jewson, J. Smith, and C. Holmes, Principles of Bayesian inference using general divergence criteria, Entropy, vol.20, issue.6, 2018.

W. Jitkrittum, W. Xu, Z. Szabó, K. Fukumizu, and A. Gretton, A linear-time kernel goodness-of-fit test, Advances in Neural Information Processing Systems, pp.262-271, 2017.
URL : https://hal.archives-ouvertes.fr/hal-01527717

M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul, An introduction to variational methods for graphical models, Machine Learning, vol.37, pp.183-233, 1999.

A. Kalai and S. Vempala, Efficient algorithms for online decision problems, Journal of Computer and System Sciences, vol.71, issue.3, pp.291-307, 2005.

R. E. Kalman, A new approach to linear filtering and prediction problems, Transactions of the ASME-Journal of Basic Engineering, vol.82, pp.35-45, 1960.

K. Kawaguchi, Deep learning without poor local minima, Advances in Neural Information Processing Systems, vol.29, pp.586-594, 2016.

K. Kawaguchi, J. Huang, and L. P. Kaelbling, Effect of depth and width on local minima in deep learning, Neural Computation, vol.31, issue.6, pp.1462-1498, 2019.

M. Khan, D. Nielsen, V. Tangkaratt, W. Lin, Y. Gal, and A. Srivastava, Fast and scalable Bayesian deep learning by weight-perturbation in Adam, Proceedings of the 35th International Conference on Machine Learning, vol.80, pp.2611-2620, 2018.

M. E. Khan and W. Lin, Conjugate-computation variational inference: Converting variational inference in non-conjugate models to inference in conjugate models, Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, vol.54, pp.878-887, 2017.

M. E. Khan and D. Nielsen, Fast yet simple natural-gradient descent for variational inference in complex models, 2018.

D. P. Kingma and M. Welling, Auto-encoding variational Bayes, International Conference on Learning Representations, 2013.

J. Knoblauch, J. Jewson, and T. Damoulas, Generalized variational inference, 2019.

P. K. Kothari, J. Steinhardt, and D. Steurer, Robust moment estimation and improved clustering via sum of squares, Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, 2018.

W. Kruijer, J. Rousseau, and A. van der Vaart, Adaptive Bayesian density estimation with location-scale mixtures, Electronic Journal of Statistics, vol.4, pp.1225-1257, 2010.

A. Laforgia and P. Natalini, On some inequalities for the gamma function, Advances in Dynamical Systems and Applications, vol.8, pp.261-267, 2013.

K. A. Lai, A. B. Rao, and S. Vempala, Agnostic estimation of mean and covariance, 2016 IEEE 57th Annual Symposium on Foundations of Computer Science (FOCS), 2016.

P. Laplace, Mémoire sur les approximations des formules qui sont fonctions de très grands nombres et sur leur applications aux probabilités, 1810.

P. S. Laplace, Mémoire sur la probabilité des causes par les évènements. Mémoires de Mathematique et de Physique, Presentés à l'Académie Royale des Sciences, Par Divers Savans & Lus Dans ses Assemblées, pp.621-656, 1774.

L. Le Cam, On the assumptions used to prove asymptotic normality of maximum likelihood estimates, 1970.

L. Le Cam, Convergence of estimates under dimensionality restrictions, The Annals of Statistics, vol.1, pp.38-53, 1973.

L. Le Cam, On local and global properties in the theory of asymptotic normality of experiments, Stochastic processes and related topics (Proc. Summer Res. Inst. Statist. Inference for Stochastic Processes), vol.1, pp.13-54, 1974.

G. Lecué, M. Lerasle, and T. Mathieu, Robust classification via MOM minimization, 2018.

Y. LeCun, Y. Bengio, and G. Hinton, Deep learning, Nature, vol.521, issue.7553, pp.436-444, 2015.

Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, Gradient-based learning applied to document recognition, Proceedings of the IEEE, pp.2278-2324, 1998.

M. Lerasle, Z. Szabó, T. Mathieu, and G. Lecué, MONK: outlier-robust mean embedding estimation by median-of-means, International Conference on Machine Learning, 2019.
URL : https://hal.archives-ouvertes.fr/hal-01705881

G. Leung and A. Barron, Information theory and mixing least-squares regressions, IEEE Transactions on Information Theory, vol.52, pp.3396-3410, 2006.

Y. Li, K. Swersky, and R. Zemel, Generative moment matching networks, International Conference on Machine Learning, pp.1718-1727, 2015.

N. Littlestone and M. K. Warmuth, The weighted majority algorithm, Information and Computation, vol.108, pp.212-261, 1994.

J. Liu, Y. Huang, R. Singh, J. Vert, and W. Noble, Jointly embedding multiple single-cell omics measurements, BioRxiv, p.644310, 2019.
URL : https://hal.archives-ouvertes.fr/hal-02444746

S. Louhichi, Théorèmes limites pour des suites positivement ou faiblement dépendantes, 1998.

C. Louizos, M. Welling, and D. P. Kingma, Learning sparse neural networks through L0 regularization, International Conference on Learning Representations, 2018.

G. Lugosi and S. Mendelson, Risk minimization by median-of-means tournaments, 2016.

G. Lugosi and S. Mendelson, Mean estimation and regression under heavy-tailed distributions-a survey, 2019.

J. Lv and J. S. Liu, Model selection principles in misspecified models, Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol.76, issue.1, pp.141-167, 2013.

D. J. C. MacKay, Bayesian methods for adaptive models, 1992.

D. J. C. MacKay, A practical Bayesian framework for backpropagation networks, Neural Computation, vol.4, issue.3, pp.448-472, 1992.

P. Massart, Concentration inequalities and model selection. Lectures from the 33rd Summer School on Probability Theory, Lecture Notes in Mathematics, vol.1896, 2003.

D. A. Mcallester, Some PAC-Bayesian theorems, Machine Learning, vol.37, pp.355-363, 1999.

C. McDiarmid, On the method of bounded differences, Surveys in Combinatorics, London Mathematical Society Lecture Note Series, vol.141, 1989.

P. D. Mcnicholas, Model-based clustering, Journal of Classification, vol.33, issue.3, pp.331-373, 2016.

P. W. Millar, Robust estimation via minimum distance methods, Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete, vol.55, pp.73-89, 1981.

T. P. Minka, Expectation propagation for approximate Bayesian inference, Proceedings of the 17th Conference in Uncertainty in Artificial Intelligence, UAI '01, pp.362-369, 2001.

S. Minsker, Geometric median and robust estimation in Banach spaces, Bernoulli, vol.21, pp.2308-2335, 2015.

A. Mishkin, F. Kunstner, D. Nielsen, M. Schmidt, and M. E. Khan, SLANG: Fast structured covariance approximations for Bayesian deep learning with natural gradient, Advances in Neural Information Processing Systems, vol.31, pp.6245-6255, 2018.

K. Moridomi, K. Hatano, and E. Takimoto, Online linear optimization with the log-determinant regularizer, IEICE Transactions on Information and Systems, vol.101, issue.6, pp.1511-1520, 2018.

K. Muandet, K. Fukumizu, B. Sriperumbudur, and B. Schölkopf, Kernel mean embedding of distributions: A review and beyond, Foundations and Trends in Machine Learning, vol.10, pp.1-141, 2017.

T. Nakagawa and S. Hashimoto, Robust Bayesian inference via γ-divergence, Communications in Statistics - Theory and Methods, pp.1-18, 2019.

N. Nasios and A. Bors, Variational learning for Gaussian mixture models, IEEE Transactions on Systems, Man, and Cybernetics, vol.36, pp.849-862, 2006.

R. M. Neal, Bayesian learning for neural networks, 1995.

R. M. Neal, Sampling from multimodal distributions using tempered transitions, Statistics and computing, vol.6, issue.4, pp.353-366, 1996.

A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro, Robust stochastic approximation approach to stochastic programming, SIAM Journal on optimization, vol.19, issue.4, pp.1574-1609, 2009.
URL : https://hal.archives-ouvertes.fr/hal-00976649

A. Nemirovski and B. Yudin, Problem complexity and method efficiency in optimization, 1983.

B. Neyshabur, S. Bhojanapalli, and N. Srebro, A PAC-Bayesian approach to spectrally-normalized margin bounds for neural networks, International Conference on Learning Representations, 2018.

C. V. Nguyen, T. D. Bui, Y. Li, and R. E. Turner, Online variational Bayesian inference: Algorithms for sparse Gaussian processes and theoretical bounds, Time Series Workshop, 2017.

C. V. Nguyen, Y. Li, T. D. Bui, and R. E. Turner, Variational continual learning, 2017.

Q. Nguyen, M. C. Mukkamala, and M. Hein, On the loss landscape of a class of deep neural networks with no bad local valleys, International Conference on Learning Representations, 2019.

A. O'hagan, T. B. Murphy, and I. C. Gormley, Computational aspects of fitting mixture models via the expectation-maximization algorithm, Computational Statistics & Data Analysis, vol.56, issue.12, pp.3843-3864, 2012.

M. Opper and C. Archambeau, The variational Gaussian approximation revisited, Neural Computation, vol.21, issue.3, pp.786-792, 2009.

K. Osawa, S. Swaroop, M. E. Khan, A. Jain, R. Eschenhagen et al., Practical deep learning with Bayesian principles, Advances in Neural Information Processing Systems, pp.4289-4301, 2019.

W. Pan, J. Lin, and C. T. Le, A mixture model approach to detecting differentially expressed genes with microarray data, Functional & Integrative Genomics, vol.3, pp.117-124, 2003.

M. Park, W. Jitkrittum, and D. Sejdinovic, K2-ABC: Approximate Bayesian computation with kernel embeddings, Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, vol.51, 2016.

W. Parr, Minimum distance estimation: a bibliography, Communications in Statistics: Theory and Methods, vol.10, issue.12, pp.1205-1224, 1981.

W. C. Parr and W. R. Schucany, Minimum distance and robust estimation, Journal of the American Statistical Association, vol.75, issue.371, pp.616-624, 1980.

E. Parrado-Hernández, A. Ambroladze, J. Shawe-Taylor, and S. Sun, PAC-Bayes bounds with data dependent priors, Journal of Machine Learning Research, vol.13, pp.3507-3531, 2012.

P. Petersen and F. Voigtländer, Optimal approximation of piecewise smooth functions using deep ReLU neural networks, Neural Networks, 2017.

G. Peyré and M. Cuturi, Computational optimal transport, Foundations and Trends in Machine Learning, vol.11, pp.355-607, 2019.

C. R. Rao and Y. Wu, On model selection, Lecture Notes-Monograph Series, vol.38, pp.1-57, 2001.

J. Ridgway, Probably approximate Bayesian computation: nonasymptotic convergence of ABC under misspecification, 2017.

J. Ridgway, P. Alquier, N. Chopin, and F. Liang, PAC-Bayesian AUC classification and scoring, Advances in Neural Information Processing Systems, vol.27, pp.658-666, 2014.

L. Rigouste, O. Cappé, and F. Yvon, Inference and evaluation of the multinomial mixture model for text clustering, Information Processing & Management, vol.43, pp.1260-1280, 2007.
URL : https://hal.archives-ouvertes.fr/hal-00080133

E. Rio, On McDiarmid's concentration inequality, Electronic Communications in Probability, vol.18, 2013.

E. Rio, Asymptotic theory of weakly dependent random processes, 2017.
URL : https://hal.archives-ouvertes.fr/hal-02063543

E. Rio, Inégalités de Hoeffding pour les fonctions lipschitziennes de suites dépendantes, Comptes Rendus de l'Académie des Sciences - Series I - Mathematics, vol.330, issue.10, pp.905-908, 2000.

J. Rissanen, Modeling by shortest data description, Automatica, vol.14, issue.5, pp.465-471, 1978.

V. Rivoirard and J. Rousseau, Posterior concentration rates for infinite dimensional exponential families, Bayesian Analysis, vol.7, issue.2, pp.311-334, 2012.
URL : https://hal.archives-ouvertes.fr/hal-00634432

C. Robert, The Bayesian choice: from decision-theoretic foundations to computational implementation, 2007.

C. Robert and G. Casella, Monte Carlo Statistical Methods, 2013.

V. Rockova and N. Polson, Posterior concentration for sparse deep learning, Advances in Neural Information Processing Systems, vol.31, pp.930-941, 2018.

D. Rolnick and M. Tegmark, The power of deeper networks for expressing natural functions, 6th International Conference on Learning Representations, 2018.

M. Rosenblatt, A central limit theorem and a strong mixing condition, Proceedings of the National Academy of Sciences, vol.42, pp.43-47, 1956.

J. Rousseau, On the frequentist properties of Bayesian nonparametric methods, Annual Review of Statistics and Its Application, vol.3, pp.211-231, 2016.
URL : https://hal.archives-ouvertes.fr/hal-01252919

D. E. Rumelhart, G. E. Hinton, and R. J. Williams, Learning Representations by Back-propagating Errors, Nature, vol.323, issue.6088, pp.533-536, 1986.

J. Salmon and A. Dalalyan, Optimal aggregation of affine estimators, Proceedings of the 24th Annual Conference on Learning Theory, vol.19, pp.635-660, 2011.
URL : https://hal.archives-ouvertes.fr/hal-00654251

M. Sato, Online model selection based on the variational Bayes, Neural computation, vol.13, issue.7, pp.1649-1681, 2001.

J. Schmidt-Hieber, Nonparametric regression using deep neural networks with ReLU activation function, arXiv, 2017.

L. Schwartz, On Bayes procedures, Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete, vol.4, pp.10-26, 1965.

G. Schwarz, Estimating the dimension of a model, The Annals of Statistics, vol.6, issue.2, pp.461-464, 1978.

Y. Seldin, P. Auer, J. Shawe-Taylor, R. Ortner, and F. Laviolette, PAC-Bayesian analysis of contextual bandits, Advances in Neural Information Processing Systems, pp.1683-1691, 2011.

Y. Seldin and N. Tishby, PAC-Bayesian analysis of co-clustering and beyond, Journal of Machine Learning Research, vol.11, pp.3595-3646, 2010.

S. Shalev-Shwartz, Online learning and online convex optimization, Foundations and Trends in Machine Learning, vol.4, pp.107-194, 2012.

J. Shawe-Taylor and R. C. Williamson, A PAC analysis of a Bayesian estimator, Proceedings of the Tenth Annual Conference on Computational Learning Theory, COLT '97, pp.2-9, 1997.

X. Shen, Asymptotic normality of semiparametric and nonparametric posterior distributions, Journal of the American Statistical Association, vol.97, issue.457, pp.222-235, 2002.

R. Sheth and R. Khardon, Excess risk bounds for the Bayes risk using variational inference in latent Gaussian models, Advances in Neural Information Processing Systems, vol.30, pp.5151-5161, 2017.

D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang et al., Mastering the game of go without human knowledge, Nature, vol.550, p.354, 2017.

Y. Singer and M. K. Warmuth, Batch and on-line parameter estimation of Gaussian mixtures based on the joint entropy, Advances in Neural Information Processing Systems 11, 1999.

L. Song, Learning via Hilbert Space Embedding of Distributions, 2008.

L. Song, A. Gretton, D. Bickson, Y. Low, and C. Guestrin, Kernel belief propagation, Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp.707-715, 2011.

D. Soudry and Y. Carmon, No bad local minima: Data independent training error guarantees for multilayer neural networks, 2016.

N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, Dropout: A simple way to prevent neural networks from overfitting, Journal of Machine Learning Research, vol.15, pp.1929-1958, 2014.

J. Stanford, K. Giardina, G. Gerhardt, K. Fukumizu, and S. Amari, Local minima and plateaus in hierarchical structures of multilayer perceptrons, Neural Networks, vol.13, 2000.

C. J. Stoneking, Bayesian inference of Gaussian mixture models with noninformative priors, 2014.

E. Sudderth and M. Jordan, Shared segmentation of natural scenes using dependent Pitman-Yor processes, Advances in Neural Information Processing Systems, pp.1585-1592, 2009.

T. Suzuki, PAC-Bayesian bound for Gaussian process regression and multiple kernel additive model, Conference on Learning Theory, pp.8-9, 2012.

T. Suzuki, Fast generalization error bound of deep learning from a kernel perspective, Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, vol.84, pp.1397-1406, 2018.

T. Suzuki, Adaptivity of deep ReLU network for learning in Besov and mixed smooth Besov spaces: optimal rate and curse of dimensionality, International Conference on Learning Representations, 2019.

N. Syring and R. Martin, Calibrating general posterior credible regions, Biometrika, vol.106, issue.2, pp.479-486, 2018.

Y. T. Lee, Z. Song, and S. S. Vempala, Algorithmic Theory of ODEs and Sampling from Well-conditioned Logconcave Densities, arXiv e-prints, 2018.

M. K. Titsias and M. Lázaro-Gredilla, Spike and slab variational inference for multi-task and multiple kernel learning, Advances in Neural Information Processing Systems, vol.24, pp.2339-2347, 2011.

I. Tolstikhin, B. K. Sriperumbudur, and K. Muandet, Minimax estimation of kernel mean embeddings, Journal of Machine Learning Research, vol.18, pp.1-47, 2017.

F. Tonolini, B. S. Jensen, and R. Murray-Smith, Variational sparse coding, Conference on Uncertainty in Artificial Intelligence, 2019.

Y. Tsuzuku, I. Sato, and M. Sugiyama, Normalized flat minima: Exploring scale invariant definition of flat minima for neural networks using PAC-Bayesian analysis, 2019.

A. B. Tsybakov, Introduction to Nonparametric Estimation, 2008.

J. W. Tukey, Mathematics and the picturing of data, Proceedings of the International Congress of Mathematicians, 1975.

A. W. van der Vaart, Asymptotic Statistics, Cambridge Series in Statistical and Probabilistic Mathematics, 2000.

T. van Erven and P. Harremoës, Rényi divergence and Kullback-Leibler divergence, IEEE Transactions on Information Theory, vol.60, issue.7, pp.3797-3820, 2014.

V. Vapnik, Principles of risk minimization for learning theory, Advances in neural information processing systems, pp.831-838, 1992.

A. Vehtari, V. Tolvanen, T. Mononen, and O. Winther, Bayesian leave-one-out cross validation approximations for Gaussian latent variable models, Journal of Machine Learning Research, vol.17, 2014.

M. Vladimirova, J. Verbeek, P. Mesejo, and J. Arbel, Understanding priors in Bayesian neural networks at the unit level, Proceedings of the 36th International Conference on Machine Learning, vol.97, pp.6458-6467, 2019.
URL : https://hal.archives-ouvertes.fr/hal-02177151

R. von Mises, Wahrscheinlichkeitsrechnung, Vienna: Deuticke, 1931.

V. G. Vovk, Aggregating strategies, Proceedings of the Third Annual Workshop on Computational Learning Theory, 1990.

S. Walker and N. L. Hjort, On Bayesian consistency, Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol.63, issue.4, pp.811-821, 2001.

C. Wang, J. Paisley, and D. Blei, Online variational inference for the hierarchical Dirichlet process, Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp.752-760, 2011.

Y. Wang and D. M. Blei, Frequentist consistency of variational Bayes, Journal of the American Statistical Association, pp.1-85, 2018.

L. Watier, S. Richardson, and P. Green, Using Gaussian mixtures with unknown number of components for mixed model estimation, 14th International Workshop on Statistical Modeling, 1999.

J. Wolfowitz, The minimum distance method, The Annals of Mathematical Statistics, vol.28, issue.1, pp.75-88, 1957.

Y. Wu and P. Yang, Optimal estimation of Gaussian mixtures via denoised method of moments, 2018.

Y. Yang, Can the strengths of AIC and BIC be shared? A conflict between model identification and regression estimation, Biometrika, vol.92, issue.4, pp.937-950, 2005.

D. Yarotsky, Error bounds for approximations with deep ReLU networks, Neural Networks, vol.94, 2016.

Y. G. Yatracos, Rates of convergence of minimum distance estimators and Kolmogorov's entropy, The Annals of Statistics, pp.768-774, 1985.

C. Zeno, I. Golan, E. Hoffer, and D. Soudry, Bayesian gradient descent: Online variational Bayes learning with increased robustness to catastrophic forgetting and weight pruning, 2018.

C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals, Understanding deep learning requires rethinking generalization, International Conference on Learning Representations, 2017.

F. Zhang and C. Gao, Convergence rates of variational posterior distributions, 2017.

T. Zhang, From ε-entropy to KL-entropy: Analysis of minimum information complexity density estimation, The Annals of Statistics, vol.34, issue.5, pp.2180-2210, 2006.

S. Zhao, J. Song, and S. Ermon, InfoVAE: Information maximizing variational autoencoders, arXiv preprint arXiv:1706.02262, 2017.

M. Zinkevich, Online convex programming and generalized infinitesimal gradient ascent, Proceedings of the Twentieth International Conference on International Conference on Machine Learning, ICML'03, pp.928-935, 2003.