Making Friends with Probability

probability
bayes
philosophy
Author

Bizovi Mihai

Published

August 7, 2017

Modified

April 30, 2023

Foreword from 2023

Thinking deeply about uncertainty and probability theory changed my worldview and eventually became an essential part of my job. In 2023, I’m looking back at the game-changing encounters which instilled a passion for probability, followed by some passages from my bachelor’s thesis1.

  • 1 Bizovi Mihai - Stochastic Modeling and Bayesian Inference (2017)

    • In 2013, it was Richard Feynman’s “The pleasure of finding things out”, which planted the question of uncertainty at the back of my mind
    • In 2014, Nassim Taleb’s “Fooled by randomness” and “Black Swan” deeply resonated with me. There were many instances when I asked myself: “What would the academic do?”, “What would Taleb and Fat Tony do?”, “What would Hume and Popper say about that?”
    • In 2015, I was obsessed with system dynamics and nonlinear differential equations for modeling the stocks and flows in an economy. The big question arose: how do we actually represent the uncertainty and do inference on those parameters? It was also when the “Visual guide to Bayesian thinking” by Julia Galef popped up in my YouTube recommendations.
    • In 2016, it was clear that the best job I could get was in ML and Data Science. Probabilistic ML and Probabilistic Graphical Models were what I wanted to master, after Zoubin Ghahramani’s brilliant lectures at the Tübingen MLSS.
      • A bit earlier, we had our probability and mathematical statistics class – which I skipped, studying instead from Joe Blitzstein’s lectures and book. Brilliant story proofs, relatable examples!
    • In 2017, after the brilliant Multidimensional Data Analysis lectures by Gheorghe Ruxanda at our university, I realized I knew nothing about probability. There was so much more depth, nuance, philosophy, and history to it: Kolmogorov’s breakthrough2, measure-theoretic foundations, and stochastic processes.
  • 2 Andrey Kolmogorov

    In the thesis, I was interested in probabilistic approaches to modeling complex, dynamic economic phenomena. What has stayed the same over all these years is a preference for developing custom models for specific applications – where domain and statistical assumptions can be declared explicitly – over out-of-the-box Machine Learning solutions.

    Since then, after years of experience and practice, you can find an evolving presentation of the subject on the course website. It is a work in progress, meant to remind students why they studied probability and what really matters.

    Thoughts from the thesis

    “I can live with doubt and uncertainty and not knowing. We absolutely must leave room for doubt or there is no progress and there is no learning. There is no learning without having to pose a question. And a question requires doubt. People search for certainty, but there is no certainty.” — Richard Feynman, The pleasure of finding things out

    The need for stochastic modeling and Bayesian thinking

    Current challenges in economics call for an interdisciplinary, multidimensional approach and a diversity of perspectives. In order to have more powerful models in our toolbox, we’ll investigate statistical modeling based on three schools of thought – Stochastic Modeling3, Machine Learning, and Bayesian Analysis – with applications ranging from economics and finance to genetics, linguistics, and artificial intelligence.

  • 3 Stochastic, meaning probabilistic, specifically in the Fisherian and Neyman-Pearson frameworks.

    This approach is extremely general and allows us to make inferences about latent quantities and relationships, to identify what is hidden beneath appearances, and to account for the uncertainty and nonlinearities of economic phenomena. The idea of learning from data, the flexibility of representing uncertainty, and the parsimony principle lead to adaptive and robust modeling, able to capture non-trivial aspects of the problem. The focus will be very much on models and on a deep understanding of the concepts on which they are built.

    The ugly truth is that humans are not good at prediction, nor are we consistent with probability theory. We “suffer” from multiple evolutionary mechanisms and cognitive biases; even statisticians, formally trained in the language of uncertainty, are not guarded against the pitfalls that arise in practice and everyday life.

    The same story holds for reconciling new evidence with prior beliefs, processing multidimensional data, and ignoring feedback loops and dynamics. The reason we build models is to better understand the world and the phenomena around us, and, in the end, to improve our mental representations4. Predictions that are less wrong usually take into account a variety of perspectives – a well-established fact in ensemble modeling, illustrated by the sketch after the note below.

  • 4 The only thing I would add here in 2023 is an appreciation for Action and Decisions. Back then, knowledge was foregrounded for me.
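As a minimal, illustrative sketch of that ensemble claim (synthetic data and arbitrary noise levels, nothing from the thesis itself): averaging several independently noisy predictors of the same quantity yields a markedly lower error than any single one.

```python
import numpy as np

rng = np.random.default_rng(42)
truth = np.sin(np.linspace(0, 3, 200))  # the hypothetical quantity to predict

# five "perspectives": each predictor sees the truth through its own noise
predictions = [truth + rng.normal(0, 0.5, truth.size) for _ in range(5)]

def mse(p):
    return np.mean((p - truth) ** 2)

print("individual MSEs:", [round(mse(p), 3) for p in predictions])
print("ensemble MSE:   ", round(mse(np.mean(predictions, axis=0)), 3))
# independent errors partly cancel in the average, so the ensemble's
# MSE comes out close to 1/5 of a single predictor's
```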

    Three perspectives

    Machine learning models have become an essential part of online services and are taking over new fields, bringing firms and people a lot of value. The multidimensional approach is what makes all of this possible. In contrast with classical statistics and NHST (Null Hypothesis Significance Testing)5, our focus will be on generalization performance and on the idea of a model. Even though these models have their roots in statistics, the field can be viewed as one in its own right.

  • 5 In 2023, I wouldn’t conflate NHST with Neyman-Pearson, action-oriented frequentism, which can be done well

    An important extension, for which there is an acute need, is dynamical models. A lot of interesting problems are multidimensional time series, which may invalidate the assumptions behind classical data mining models. Coupled with the need to account for uncertainty, this brings us to the idea of introducing stochastic elements and extending the idea of learning and generalization.

    The third perspective, which brings together stochastic elements and machine learning, is Bayesian inference: a method for updating beliefs and knowledge based on data from different sources. Bayesian modeling enables us to build hierarchical models with latent structures, while encoding the uncertainty in parameters and operating with probability distributions (a minimal sketch of such an update follows the note below). On the shoulders of these three “giants”, a stochastic perspective on Machine Learning has emerged over the last two decades – one which is flexible, general, more transparent than neural networks, and works on “thin”6 data.

  • 6 Perhaps I meant small samples at the group level, which would make sense in the context of multilevel models.
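To make the idea of updating beliefs with data concrete, here is a minimal conjugate Beta-Binomial sketch; the prior and the counts are made up purely for illustration.

```python
from scipy import stats

# prior belief about an unknown success probability: Beta(2, 2)
a, b = 2, 2

# hypothetical evidence: 7 successes out of 10 trials
successes, trials = 7, 10

# conjugacy: the posterior is again a Beta, with updated counts
posterior = stats.beta(a + successes, b + trials - successes)

print(f"posterior mean: {posterior.mean():.3f}")            # ~0.643
print(f"90% credible interval: {posterior.interval(0.9)}")
```

The same update expressed through Bayes’ rule would require integrating over all values of the parameter; conjugate pairs like Beta-Binomial are the rare cases where the posterior has a closed form.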

    There are several reasons the field is not yet mainstream: the scalability and the complexity of the models, in the sense of the knowledge and skill needed to start modeling. Once software implementations, approximation methods, and probabilistic programming languages become better, Bayesian methods will see another rise in popularity.

    The next chapters will be dedicated to each of these fundamental perspectives, trying to get to their essence. The last ones will cover supervised and unsupervised probabilistic models such as Gaussian Processes and Mixture Models. Essential concepts will be ARD (Automatic Relevance Determination) and kernels which encourage sparse solutions.

    The goal

    The motivation behind the choice of probabilistic machine learning models is that there are a lot of opportunities to extend them and to reuse them as parts of more complicated models.

    For example, Gaussian Processes7 can be transformed into a Relevance Vector Machine (the probabilistic version of the SVM), or used for Bayesian Optimization of expensive functions, for learning complex dynamical phenomena, and for spatio-temporal modeling (also known as Kriging); a minimal GP regression sketch follows the note below. Probabilistic Principal Component Analysis can be used for a latent representation of a multidimensional time series and for intrinsic dimensionality determination. On the other hand, Mixture Models are a powerful model-based clustering technique, especially when coupled with nonparametric methods.

  • 7 This is still true, although they have fallen a bit out of fashion. However, Bayesian Additive Regression Trees have risen to the challenge, especially in the field of Causal Inference.
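A minimal Gaussian Process regression sketch, assuming an RBF kernel, synthetic data, and untuned hyperparameters; it implements the standard GP posterior-predictive equations with a Cholesky factorization for numerical stability, not any particular model from the thesis.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0, variance=1.0):
    """Squared-exponential (RBF) covariance between the rows of A and B."""
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return variance * np.exp(-0.5 * sq / lengthscale**2)

rng = np.random.default_rng(0)
X = np.linspace(-3, 3, 20)[:, None]                 # training inputs
y = np.sin(X).ravel() + 0.1 * rng.normal(size=20)   # noisy observations
Xs = np.linspace(-4, 4, 100)[:, None]               # test inputs
noise = 0.1**2                                      # assumed noise variance

# GP posterior predictive: mean = Ks' K^{-1} y, cov = Kss - Ks' K^{-1} Ks
K = rbf_kernel(X, X) + noise * np.eye(len(X))
L = np.linalg.cholesky(K)
alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # K^{-1} y

Ks = rbf_kernel(X, Xs)
mu = Ks.T @ alpha                                    # posterior mean
v = np.linalg.solve(L, Ks)
var = np.diag(rbf_kernel(Xs, Xs)) - np.sum(v**2, 0)  # posterior variance
```

The posterior mean interpolates the data, and the posterior variance grows outside the training range – the “honest uncertainty” that makes GPs useful for Bayesian Optimization.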

    This approach is extremely powerful, intellectually fascinating, and vibrant. It is also in the spirit of economic complexity, with ideas which bridge different branches of data analysis. The catch is that the barriers for beginners who want to develop such models are enormous. First, a qualitative jump in the understanding of probability theory and stochastic processes, Bayesian inference, and statistical learning theory is needed. Another barrier is going from an understanding of the models to their implementation in actual software products.

    So, the main goal is to find connections between the three perspectives described above and to recognize Bayesian generalizations of classical machine learning models. Most important is the process of exploration, understanding, and internalization of models, which will suggest new areas of study and practical opportunities.