Speaking as a 30+ year veteran researcher in AI and ML, I should first make a confession: as much as I wish it were otherwise, I learned very little from the classes I took, both as an undergraduate and later as a PhD student. A rare exception was a beautiful course on graph theory that I took at IIT Kanpur, taught by Professor Mohanty, which opened my eyes to the beauty of discrete math. In the same department, I took a linear algebra class from which I learned almost nothing. That was rather typical of many classes I took. I don't want to lay the blame on the instructors; I'm sure I was not all that attentive as a student! Also, like Richard Feynman, I believe that Gibbon was right when he said:
“the power of instruction is seldom of much efficacy except in those happy dispositions when it is almost superfluous!”
So, almost everything I subsequently learned about AI and ML was through self-learning, which enabled me to spend 30+ years doing research in this field, publishing 150+ papers and getting elected Fellow of AAAI. (The "Fellow" distinction in any professional field is usually limited to about 1% of the researchers in that field, and it is based on a competitive evaluation of one's lifetime of research against candidates nominated that year from all over the world; trust me, it's hard to earn this distinction!)
So, almost everything I know, including a great deal of the math needed for ML, I learned from textbooks. For this, I give great thanks to the writers of these wonderful books, a few of which I have listed below. Writing a textbook can be a thankless task, but to people like me, who didn’t learn what they were supposed to learn in a traditional student role, it can be a real blessing.
Part of my lack of education was due to my inability to see what the foundations of ML were; in the early 1980s, when I began research in the field, almost no one knew that statistics was at the core of ML. The irony is that my wife was training to be a statistician and was on her way to earning her PhD in the field. She wisely told me to spend some time learning statistics in graduate school. Mr. Know It All ignored her wise words, and paid dearly for his ignorance later in his professional career, when I went begging to her on my knees, asking her to teach me a little stats. (Lesson: when your better half gives you a suggestion like this, it's worth paying attention!) So, I had to learn basically all of stats on my own as well, the hard way!
But there was a silver lining. I came to believe firmly that the best way to learn is on one's own, plodding through a set of great textbooks at one's own pace, rather than at a pace dictated by the vagaries of an instructor catering to hundreds of students. Even though I spent the bulk of my life as a professor (rising from an untenured assistant professor to a tenured full professor), and ended up teaching dozens of courses, I have never felt that classroom learning is a better mode of learning than self-learning. It may be more efficient, but one always learns better when one learns on one's own.
So, with this long prelude out of the way, let's go through a list of textbooks that had the most influence on me as a machine learning researcher. It goes without saying that such a list is bound to be idiosyncratic, and what I found to be a great book to learn from may not appeal to some of you. But I can only speak from my own experience. I will steal a bit of the material below from an earlier Quora reply that I gave on the top 20 math textbooks to learn ML from. If you didn't see that reply (which was upvoted around 2,400 times), you might want to look at that long reply as well.
If you could buy 20 math books for machine learning, what books would you buy?
- Introduction to Linear Algebra by Strang. He writes math like few folks do: no endless parade of definitions and theorems. He tells you why something is important. He wears his heart on his sleeve. If you want to spend a lifetime doing ML, sleep with this book under your pillow. Read it when you go to bed and when you wake up in the morning. Repeat to yourself: "eigen do it if I try". This is the book I wish I had read at IIT Kanpur, where I failed to learn any linear algebra almost 40 years ago (I won't mention the text that was used, which was also written by authors at the same institution, but which was much less approachable and far less relevant to ML).
- In All Likelihood by Yudi Pawitan. Fisher's concept of likelihood is the most important idea in statistics you need to understand, and no book I've read explains this core idea better than this gem of a book by Yudi. Likelihoods are not probabilities. Repeat that to yourself. (A tiny numerical illustration of this point appears after this list.) Yudi wisely avoids complex examples and sticks to simple one-dimensional examples for the most part. You'll come away with a much deeper appreciation of statistics from this fine book. I wish every author of an ML textbook would read this little gem and see why it is so important to stick to simple examples when explaining basic concepts.
- Optimization by Vector Space Methods by Luenberger. At some point in reading ML papers, you'll start encountering phrases like "inner product spaces" or "Hilbert spaces". The latter were popularized by John von Neumann, one of the founders of computer science, to formalize quantum mechanics. The joke is that he gave a talk at Göttingen on Hilbert spaces while the great mathematician David Hilbert was in the audience; Hilbert asked a colleague after the talk: what in the world are these so-called Hilbert spaces? Luenberger covers optimization in infinite dimensional spaces. He explains the most important and profound theorem in optimization: the Hahn-Banach theorem. Why can neural nets with sigmoid activations approximate any continuous function? The Hahn-Banach theorem is at the heart of the classic proof. (A small numerical illustration of sigmoid approximation appears after this list.) Slim book, but a tough one to master. Warning: Luenberger's book is far more advanced in its treatment than the above two. Don't be disheartened if you struggle with it; go away and come back to it later. Unlike iPhones and computer software, which seem to get upgraded every month, Banach and Hilbert spaces have been around for a hundred years or more and aren't going to change. Nothing you learn here will become obsolete!
- The Symmetric Group by Sagan. Group theory comes in two flavors: finite groups and continuous infinite groups. Sagan digs deep into finite groups and their linear algebraic representations in this slim, beautiful volume. Think you really understand linear algebra? Reading the first few pages of this book will have you scurrying back to Strang when you realize what you haven't yet mastered. The beautiful concept of the character of a group is explained here. Unlike the matrices of a representation, which depend on the chosen basis, characters are basis independent (like the trace of a matrix, which is the same in any basis; a quick numerical check appears after this list). If you can master even Chapter 1 of Sagan, congratulations! That's a sign that you are well on your way to "getting" abstract math.
- Analysis of Incomplete Multivariate Data by J. L. Schafer. The book to learn EM from: the famous expectation-maximization algorithm presented the way statisticians developed it, not the confusing way it is presented in ML textbooks using mixture models and HMMs. (A minimal EM sketch in the statisticians' style appears after this list.) General advice: the statistics you need for ML is best learned from statistics books, not ML textbooks. Learn statistics from the folks who developed the ideas, not from secondhand regurgitation by ML authors.
- Best Approximation in Inner Product Spaces by Deutsch. If you want to see how mathematicians think of machine learning, you need to read this book. Mathematicians tend to think in generalities. This book captures beautifully the way mathematicians think of learning from data, e.g. least squares methods as projections in Hilbert spaces. Even more beautiful ideas are explained here, such as von Neumann's famous alternating projections algorithm, perhaps the most rediscovered and reinvented algorithm in history (a toy demonstration appears after this list). Yes, you'll find that many ideas you thought came from ML or statistics can all be viewed as special cases of von Neumann's work (EM, non-negative matrix approximation, and a dozen other ideas). This book teaches you the power of abstraction.
- Computational Science and Engineering by Strang. You’ll need to understand differential equations at some point, even to understand the dynamics of deep learning models, so you’ll benefit from Strang’s tour de force of a survey through a vast landscape of ideas, from numerical analysis to Fourier transforms. Strang teaches math in a way completely different from traditional books, so every chapter is a compelling read. You get the rare feeling that the author is sitting in your living room, lecturing to you personally.
- Neuro-Dynamic Programming by Tsitsiklis and Bertsekas. Not a traditional textbook, but it covers very important mathematical material that is hard to get from other sources. Still the most authoritative treatment of reinforcement learning. Valuable in many other ways, including a superb treatment of nonlinear function approximation by neural network models. The most enjoyable bus ride of my life was in the company of these two eminent MIT professors a decade ago, on the way to a workshop in a remote region of Mexico. If you really want to understand why Q-learning works, this is your salvation (a bare-bones Q-learning sketch appears after this list). You'll quickly discover how weak your math background is, and why you need to understand the deep concept of martingales, which capture the notion of a fair betting game.
- Linear Statistical Inference and its Applications by C. R. Rao. For most of you who haven’t heard of this “living god” of statistics, your statistics professor’s PhD advisor likely learned statistics from this book. The famous Rao-Blackwell theorem is at the heart of the foundational concept of sufficient statistics, one of the cornerstones of machine learning. The equally famous Rao-Cramer theorem relates the ability to learn effectively from samples to the curvature of the likelihood function. In a dazzling paper written in his 20s, he showed that the space of probability distributions was not Euclidean, but a curved Riemannian manifold. This idea shows up in machine learning in a hundred different ways currently. Rao invented multivariate statistics as a young postdoctoral researcher at Cambridge. Hard to believe, but this “Gauss” of statistics is still alive, in his 90s, teaching at a university in India named after him. No living person has had a bigger influence on statistics or on machine learning than Rao. Short of inventing time travel, and being able to visit Newton at Cambridge or Darwin at his house, your best bet is to travel to India to meet this “living God of statistics”, who’s still doing research and teaching!
- Causal Inference in Statistics by Judea Pearl. For the past 25 years, Pearl has single-handedly pursued this problem. To anyone who will listen, he will tell you that causality is, after likelihood, the most important idea in statistics, and yet one that cannot be expressed in the language of probabilities alone. For all its power, probability theory cannot express a concept as basic as "diseases cause symptoms, and not the other way around." Correlation is symmetric; causality is fundamentally asymmetric. Pearl explains when and whether one can go from the former to the latter (a toy simulation of this asymmetry appears after this list). Pearl is the Isaac Newton of modern AI.
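Before wrapping up, here are the few minimal sketches I promised above. They are my own toy illustrations, not code from the books, and every dataset, parameter value, and variable name in them is made up purely for demonstration. First, the point from the Pawitan entry that likelihoods are not probabilities: for a made-up coin-flip dataset, the likelihood curve over the parameter peaks at the MLE but does not integrate to one, so it is not a probability density over the parameter.

```python
import numpy as np

heads, flips = 7, 10                    # hypothetical data: 7 heads in 10 flips
theta = np.linspace(0.001, 0.999, 999)  # candidate values of the unknown coin bias

# Likelihood of theta given the data (ignoring the constant binomial coefficient)
likelihood = theta**heads * (1 - theta)**(flips - heads)

mle = theta[np.argmax(likelihood)]
area = np.trapz(likelihood, theta)      # integrate the curve over theta

print(f"MLE of theta is about {mle:.2f}")               # about 0.70, where the curve peaks
print(f"Area under the likelihood curve: {area:.4f}")   # about 0.0008, nowhere near 1
```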
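Next, the approximation point from the Luenberger entry. This sketch assumes nothing from the book; it simply fits a linear combination of randomly placed sigmoids to a smooth target by least squares and reports the worst-case error on a grid, a crude numerical taste of the kind of approximation the Hahn-Banach argument guarantees is possible.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 200)
target = np.sin(2 * np.pi * x)            # a smooth target function to approximate

# Randomly placed sigmoid "bumps"; only the output weights are fit by least squares.
n_units = 50
w = rng.normal(0.0, 20.0, n_units)        # random slopes
b = rng.normal(0.0, 10.0, n_units)        # random offsets
features = 1.0 / (1.0 + np.exp(-(np.outer(x, w) + b)))   # shape (200, n_units)

coef, *_ = np.linalg.lstsq(features, target, rcond=None)
approx = features @ coef

print(f"worst-case |error| over the grid: {np.max(np.abs(approx - target)):.4f}")
```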
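A quick check of the basis-independence point from the Sagan entry: conjugating a matrix by any invertible change of basis leaves its trace, and hence the character of a representation, unchanged. The matrices here are arbitrary random choices.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))      # a representation matrix in one basis
P = rng.normal(size=(4, 4))      # a random (almost surely invertible) change of basis
B = np.linalg.inv(P) @ A @ P     # the same linear map written in the new basis

print(np.trace(A))
print(np.trace(B))
print(np.allclose(np.trace(A), np.trace(B)))   # True: the trace is basis independent
```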
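A minimal EM sketch in the statisticians' style, for the Schafer entry. The example is the classic genetic-linkage multinomial from Dempster, Laird and Rubin (1977), not an example taken from Schafer's book: the observed counts fall into four cells with probabilities (1/2 + t/4, (1-t)/4, (1-t)/4, t/4), the first cell hides a latent split, and EM alternates between imputing that split and re-maximizing.

```python
# Observed counts in four cells with probabilities
# (1/2 + t/4, (1-t)/4, (1-t)/4, t/4); cell 1 is a sum of two latent sub-cells.
y1, y2, y3, y4 = 125, 18, 20, 34
theta = 0.5                                   # arbitrary starting guess

for _ in range(20):
    # E-step: expected portion of y1 that falls in the hidden theta/4 sub-cell
    x2 = y1 * (theta / 4) / (0.5 + theta / 4)
    # M-step: closed-form maximizer of the complete-data likelihood
    theta = (x2 + y4) / (x2 + y2 + y3 + y4)

print(f"EM estimate of theta: {theta:.4f}")   # converges to roughly 0.627
```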
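A toy demonstration of von Neumann's alternating projections, for the Deutsch entry: projecting back and forth between two planes through the origin in R^3 converges to the projection of the starting point onto their intersection (here, the z-axis). The planes and the starting point are my own arbitrary choices.

```python
import numpy as np

def project_onto_plane(x, normal):
    """Orthogonal projection of x onto the plane through the origin with the given normal."""
    normal = normal / np.linalg.norm(normal)
    return x - np.dot(x, normal) * normal

n1 = np.array([1.0, 0.0, 0.0])    # the plane x = 0
n2 = np.array([1.0, 1.0, 0.0])    # the plane x + y = 0; the two planes meet in the z-axis

x = np.array([3.0, 2.0, 1.0])     # an arbitrary starting point
for _ in range(50):
    x = project_onto_plane(x, n1)
    x = project_onto_plane(x, n2)

print(x)   # approaches (0, 0, 1), the projection of the start onto the intersection
```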
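A bare-bones tabular Q-learning sketch, for the Tsitsiklis and Bertsekas entry. The environment is a made-up five-state chain with a reward only at the right end; the behavior policy is purely random, which is fine because Q-learning is off-policy, and the learned greedy policy comes out as "always move right".

```python
import numpy as np

n_states = 5                      # states 0..4; state 4 is terminal and rewarding
Q = np.zeros((n_states, 2))       # action 0 = move left, action 1 = move right
alpha, gamma = 0.1, 0.9
rng = np.random.default_rng(0)

def step(s, a):
    s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    reward = 1.0 if s_next == n_states - 1 else 0.0
    return s_next, reward, s_next == n_states - 1

for _ in range(500):              # episodes under a purely random behavior policy
    s = 0
    for _ in range(50):           # cap the episode length
        a = int(rng.integers(2))  # random action; Q-learning is off-policy, so this is fine
        s_next, r, done = step(s, a)
        # Q-learning update: bootstrap from the greedy value of the next state
        target = r + (0.0 if done else gamma * np.max(Q[s_next]))
        Q[s, a] += alpha * (target - Q[s, a])
        s = s_next
        if done:
            break

print(Q[:4].argmax(axis=1))       # greedy policy in states 0..3: all 1s, i.e. "go right"
```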
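Finally, a toy simulation of the asymmetry Pearl emphasizes, for the last entry; again my own illustration, not an example from his book. In the model disease -> symptom, intervening on the disease changes the symptom rate, but intervening on the symptom leaves the disease rate untouched, even though the observational association runs both ways.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

def simulate(do_disease=None, do_symptom=None):
    """Draw from the toy model disease -> symptom, with optional interventions."""
    disease = rng.random(n) < 0.1 if do_disease is None else np.full(n, do_disease)
    p_symptom = np.where(disease, 0.9, 0.05)     # the symptom rate depends on the disease
    symptom = rng.random(n) < p_symptom if do_symptom is None else np.full(n, do_symptom)
    return disease, symptom

_, s = simulate()
print("observational P(symptom):   ", round(s.mean(), 3))     # about 0.135
_, s_do = simulate(do_disease=True)
print("P(symptom | do(disease=1)): ", round(s_do.mean(), 3))   # jumps to about 0.9
d_do, _ = simulate(do_symptom=True)
print("P(disease | do(symptom=1)): ", round(d_do.mean(), 3))   # stays about 0.1
```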
OK, there you have it: my shortlist of the 10 most influential textbooks from which I have learned much, and continue to learn from. What's amazing about these books is that even if you have read them a dozen times, you can still learn something from them the 13th time! There's something timeless and bottomless about foundational concepts. I've said this before, and I will say it again: spend less time mastering the vagaries of the latest whizbang deep learning toolbox, and more time with these master textbooks. It will make you a better ML researcher, and a better human being!