For us Data Scientists, machine learning is the arsenal we use to do our job. I am pretty sure that in modern times, everyone employed as a Data Scientist uses machine learning to analyze their data and extract valuable patterns. But why do we need to learn math for machine learning? There are several arguments I could give, including:

- Math helps you select the correct machine learning algorithm. Understanding math gives you insight into how the model works, including choosing the **right model parameters and validation strategies**.
- Estimating how confident we are in the model result by producing the **right confidence interval and uncertainty measurements** needs an understanding of math.
- Choosing the right model means weighing many aspects, such as **metrics, training time, model complexity, number of parameters, and number of features**, all of which need math to understand.
- You could **develop a customized model** that fits your own problem by knowing the machine learning model’s math.

The main problem is: which math subjects do you need in order to understand machine learning? Math is a vast field, after all. That is why in this article, I want to outline the math subjects you need for machine learning and a few important points for starting to learn those subjects.

# Machine Learning Math

We could learn many topics within math, but if we want to focus on the math used in machine learning, we need to narrow it down. In this case, I like to use the math references explained in the book Mathematics for Machine Learning by M. P. Deisenroth, A. A. Faisal, and C. S. Ong, 2021.

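Their book lays out the math foundations that are important for machine learning. There are six subjects: linear algebra, analytic geometry, matrix decomposition, vector calculus, probability and distributions, and optimization.

These six math subjects become the foundation for machine learning. Each subject is intertwined with the others to develop our machine learning model and reach the “best” model for generalizing the dataset.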

Let’s dive deeper into each subject to see what it covers.

# Linear Algebra

What is Linear Algebra? It is a branch of mathematics concerned with the study of vectors and certain rules for manipulating them. When we formalize intuitive concepts, the common approach is to construct a set of objects (symbols) and a set of rules to manipulate these objects. This is what we know as *algebra*.

When we talk about Linear Algebra in machine learning, we mean the part of mathematics that uses vector spaces and matrices to represent **linear equations**.

When talking about vectors, people might flash back to their high school lessons about vectors with a direction, just like the image below.

This is a vector, but not the kind of vector discussed in Linear Algebra for machine learning. Instead, the image below shows the kind we will talk about.

What we have above is also a vector, just another kind. You might be familiar with the matrix form (the image below). A vector is a matrix with only one column, which is known as a column vector. In other words, we can think of a matrix as a group of column vectors or row vectors. In summary, vectors are special objects that can be added together and multiplied by scalars to produce another object of the same kind. Many different kinds of objects can be called vectors.
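To make the "added together and multiplied by scalars" idea concrete, here is a minimal sketch in plain Python. The helper names `add` and `scale` are my own, chosen for illustration, and a vector is represented simply as a list of numbers.

```python
def add(u, v):
    """Add two vectors of equal length, component by component."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return [a + b for a, b in zip(u, v)]


def scale(c, v):
    """Multiply every component of the vector v by the scalar c."""
    return [c * a for a in v]


u = [1, 2, 3]
v = [4, 5, 6]

print(add(u, v))    # [5, 7, 9]
print(scale(2, u))  # [2, 4, 6]
```

Both results are again lists of three numbers, which is the "another object of the same kind" part of the definition.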

Linear algebra itself is a systematic representation of data that computers can understand, and all the operations in linear algebra are systematic rules. That is why linear algebra is an important subject in modern machine learning.

One example of how linear algebra is used is the linear equation. Linear algebra is the tool of choice here because so many problems can be expressed systematically in a linear way. A typical linear equation is presented in the form below.

To solve the linear equation problem above, we use Linear Algebra to present the linear equation in a systematic representation. This way, we can use the properties of the matrix to look for the best solution.
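As a rough illustration of what "presenting the equations systematically" can look like, the sketch below writes a small system as a matrix `A` and a vector `b`, then solves `A x = b` with Gaussian elimination. The system itself (2x + y = 5, x - y = 1) and the function name `solve_linear_system` are assumptions of mine for the example. In practice you would most likely call a library routine instead.

```python
def solve_linear_system(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    # Build the augmented matrix [A | b] so row operations touch both sides.
    M = [[float(value) for value in A[i]] + [float(b[i])] for i in range(n)]

    for col in range(n):
        # Use the row with the largest entry in this column as the pivot row.
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular or nearly singular")
        M[col], M[pivot] = M[pivot], M[col]

        # Eliminate the entries below the pivot.
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]

    # Back substitution, from the last row up.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        known = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - known) / M[i][i]
    return x


# The system 2x + y = 5 and x - y = 1, written as A x = b.
A = [[2, 1],
     [1, -1]]
b = [5, 1]

solution = solve_linear_system(A, b)
print(solution)  # x = 2 and y = 1, up to floating-point rounding
```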

To summarize the Linear Algebra subject, there are three terms you might want to learn more about as a starting point:

- Vector
- Matrix
- Linear Equation

# Analytic Geometry (Coordinate Geometry)

**Analytic geometry** is a study in which we learn the position of data (points) using an ordered pair of coordinates. This study is concerned with defining and representing geometrical shapes numerically and extracting numerical information from those numerical definitions and representations. In simpler terms, we project the data onto a plane and read numerical information from there.

Above is an example of how we acquire information from data points by projecting the dataset onto a plane. How we acquire the information from this representation is the heart of Analytic Geometry. To help you start learning this subject, here are some important terms you might need.

**Distance Function**

A **distance function** is a function that provides numerical information about the distance between the elements of a set. If the distance is zero, the elements are equivalent; otherwise, they are different from each other.

An example of a distance function is the Euclidean distance, which calculates the straight-line distance between two data points.
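A short sketch of the Euclidean distance in plain Python may help here. The function name `euclidean_distance` and the sample points are my own choices for illustration.

```python
import math


def euclidean_distance(p, q):
    """Straight-line distance between two points with the same number of coordinates."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of coordinates")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


print(euclidean_distance([0, 0], [3, 4]))  # 5.0
print(euclidean_distance([1, 2], [1, 2]))  # 0.0, the two points are equivalent
```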

**Inner Product**

The inner product is a concept that introduces intuitive geometrical concepts, such as the **length of a vector** and the **angle or distance between two vectors**. It is often denoted as ⟨x,y⟩ (or occasionally (x,y) or ⟨x|y⟩).
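The sketch below uses the dot product, which is one specific (and the most common) inner product, to compute a length and an angle. The helper names `dot`, `norm`, and `angle` are mine, not something taken from the book.

```python
import math


def dot(x, y):
    """Dot product: multiply matching components and add them up."""
    return sum(a * b for a, b in zip(x, y))


def norm(x):
    """Length of a vector, defined through the inner product of x with itself."""
    return math.sqrt(dot(x, x))


def angle(x, y):
    """Angle between two non-zero vectors, in radians."""
    cos_theta = dot(x, y) / (norm(x) * norm(y))
    # Clamp to [-1, 1] to guard against tiny floating-point overshoots.
    cos_theta = max(-1.0, min(1.0, cos_theta))
    return math.acos(cos_theta)


x = [1, 0]
y = [0, 2]

print(dot(x, y))                   # 0, the vectors are perpendicular
print(norm([3, 4]))                # 5.0
print(math.degrees(angle(x, y)))   # 90.0
```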

# Matrix Decomposition

Matrix Decomposition is the study of ways to reduce a matrix into its constituent parts. It aims to simplify complex matrix operations by performing them on the decomposed matrix rather than on the original matrix.

A common analogy for matrix decomposition is factoring numbers, such as factoring 8 into 2 x 4. This is why matrix decomposition is synonymous with matrix factorization. There are many ways to decompose a matrix, so there is a range of different matrix decomposition techniques. An example is the LU Decomposition in the image below.
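To show what the factors of an LU Decomposition look like, here is a minimal sketch of the Doolittle variant, which writes a square matrix `A` as a lower-triangular `L` (with ones on the diagonal) times an upper-triangular `U`. It assumes no row swaps are needed, so it fails on a zero pivot. The example matrix and the helper `matmul` are my own.

```python
def lu_decompose(A):
    """Doolittle LU decomposition without pivoting, so that A = L U."""
    n = len(A)
    L = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    U = [[0.0] * n for _ in range(n)]

    for i in range(n):
        # Fill row i of U.
        for j in range(i, n):
            U[i][j] = A[i][j] - sum(L[i][k] * U[k][j] for k in range(i))
        if U[i][i] == 0:
            raise ValueError("zero pivot: this sketch does not do row swaps")
        # Fill column i of L, below the diagonal.
        for j in range(i + 1, n):
            L[j][i] = (A[j][i] - sum(L[j][k] * U[k][i] for k in range(i))) / U[i][i]
    return L, U


def matmul(X, Y):
    """Plain matrix product, used here only to check the factors."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))]
            for i in range(len(X))]


A = [[4, 3],
     [6, 3]]
L, U = lu_decompose(A)

print(L)             # [[1.0, 0.0], [1.5, 1.0]]
print(U)             # [[4.0, 3.0], [0.0, -1.5]]
print(matmul(L, U))  # [[4.0, 3.0], [6.0, 3.0]], the original matrix
```

Just as 8 = 2 x 4, the matrix `A` is recovered by multiplying its two factors.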

# Vector Calculus

**Calculus** is a mathematical study concerned with continuous change, built mainly on functions and limits. **Vector calculus** itself is concerned with the differentiation and integration of **vector fields**. Vector Calculus is often called **multivariate calculus**, although the two have slightly different scopes. Multivariate calculus deals with the calculus of functions of multiple independent variables.

There are a few important terms I feel people need to know when starting to learn Vector Calculus:

**Derivative** and **Differentiation**

The **derivative** of a function of real numbers measures the change of the function value (output value) with respect to a change in its argument (input value). **Differentiation** is the action of computing a derivative.
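One way to get a feel for the derivative is to approximate it numerically: nudge the input a little and see how much the output changes. The sketch below uses a central difference. The function name `derivative`, the step size `h`, and the example function are assumptions of mine.

```python
def derivative(f, x, h=1e-5):
    """Approximate f'(x) with a central difference of step size h."""
    return (f(x + h) - f(x - h)) / (2 * h)


def square(x):
    return x * x


# The exact derivative of x^2 is 2x, so the slope at x = 3 should be close to 6.
slope = derivative(square, 3.0)
print(slope)
```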

**Partial Derivative**

The **partial derivative** is the derivative of a function of several variables with respect to one of those variables, where that variable is allowed to vary and the other variables are held constant (as opposed to the **total derivative**, in which all variables are allowed to vary).
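A small worked example may help. The function below is one I picked for illustration. To take the partial derivative with respect to x, treat y as a constant, and vice versa:

```latex
f(x, y) = x^2 y + 3y

\frac{\partial f}{\partial x} = 2xy
\qquad
\frac{\partial f}{\partial y} = x^2 + 3

\text{At } (x, y) = (1, 2): \quad
\frac{\partial f}{\partial x} = 2 \cdot 1 \cdot 2 = 4,
\qquad
\frac{\partial f}{\partial y} = 1^2 + 3 = 4
```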

**Gradient**

The **gradient** is a word related to the derivative, or the rate of change of a function; you might consider gradient a fancy word for derivative. The term gradient is typically used for functions with several inputs and a single output (scalar). The **gradient has a direction to move** from the current location, e.g., up, down, right, left: it points in the direction in which the function increases fastest.
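As a rough sketch, the gradient can be approximated by collecting one numerical partial derivative per input. The function name `gradient` and the bowl-shaped example `bowl` are my own choices, not something from the book.

```python
def gradient(f, point, h=1e-5):
    """Approximate the gradient of f at `point`, one central difference per input."""
    grad = []
    for i in range(len(point)):
        forward = list(point)
        backward = list(point)
        forward[i] += h    # nudge only the i-th input up
        backward[i] -= h   # and down, holding the others constant
        grad.append((f(forward) - f(backward)) / (2 * h))
    return grad


def bowl(p):
    """f(x, y) = x^2 + 3y^2, whose exact gradient is (2x, 6y)."""
    x, y = p
    return x ** 2 + 3 * y ** 2


g = gradient(bowl, [1.0, 2.0])
print(g)  # close to [2.0, 12.0]
```

The result has one entry per input, and each entry tells us how steeply the function rises if we move along that input.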

# Probability and Distribution

**Probability** is, loosely speaking, the study of uncertainty. Probability can be thought of as the fraction of times an event occurs, or as the degree of belief about an event’s occurrence. A **probability distribution** is a function that gives the probability that a particular outcome (or set of outcomes) associated with a random variable will occur. A common probability distribution function is shown in the image below.

**Probability theory** and **statistics** are often presented together, but they concern different aspects of uncertainty:
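To see a probability distribution as "a function from outcomes to probabilities", here is a minimal sketch using the binomial distribution, which I chose purely as an example. It gives the probability of seeing k heads in n coin flips. The function name `binomial_pmf` is my own.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials with success rate p."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


n, p = 10, 0.5
distribution = [binomial_pmf(k, n, p) for k in range(n + 1)]

for k, probability in enumerate(distribution):
    print(k, round(probability, 4))

# The probabilities over all possible outcomes add up to 1 (up to rounding).
print(sum(distribution))
```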

- In probability theory, we define a model of some process where random variables capture the underlying uncertainty, and we use the rules of probability to summarize what happens.
- In statistics, we observe that something has happened and try to figure out the underlying process that explains the observations.

Machine learning is close to statistics because its goal is to construct a model that adequately represents the process that generated the data.

# Optimization

From a learning perspective, training a machine learning model is all about finding a **good** set of parameters. What we consider “good” is determined by the objective function or the probabilistic model. This is what **optimization algorithms** are for; given an objective function, we try to find the best value.

Commonly, objective functions in machine learning are ones we try to **minimize**, which means the best value is the minimum value. Intuitively, finding the best value is like finding the valleys of the objective function, while the gradients point us uphill. That is why we want to move downhill (opposite to the gradient) and hope to find the lowest (deepest) point. This is the concept of gradient descent.
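The "move downhill, opposite to the gradient" idea can be sketched in a few lines. The objective f(x) = (x - 3)^2, the learning rate, and the number of steps below are all assumptions I picked for illustration.

```python
def gradient_descent(grad, x0, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient, starting from x0."""
    x = x0
    for _ in range(steps):
        x = x - learning_rate * grad(x)
    return x


def grad_f(x):
    """Derivative of f(x) = (x - 3)^2, whose minimum is at x = 3."""
    return 2 * (x - 3)


x_min = gradient_descent(grad_f, x0=0.0)
print(x_min)  # very close to 3
```

Each step shrinks the distance to the valley floor at x = 3, so the loop settles there.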

There are a few terms to use as a starting point when learning optimization:

**Local Minima** and **Global Minima**

The point at which a function takes its minimum value is called the **global minimum**. However, when the goal is to minimize the function and we solve it using an optimization algorithm such as **gradient descent**, the function could have minimum-looking values at several different points. Those points that appear to be minima, but are not where the function actually takes its minimum value, are called **local minima**.
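Here is a small sketch of how gradient descent can land in different minima depending on where it starts. The function f(x) = x^4 - 3x^2 + x is my own example. It has a local minimum near x = 1.13 and its global minimum near x = -1.30.

```python
def f(x):
    return x ** 4 - 3 * x ** 2 + x


def grad_f(x):
    """Derivative of f."""
    return 4 * x ** 3 - 6 * x + 1


def descend(x0, learning_rate=0.01, steps=2000):
    """Plain gradient descent from the starting point x0."""
    x = x0
    for _ in range(steps):
        x = x - learning_rate * grad_f(x)
    return x


from_right = descend(2.0)    # rolls into the shallower valley
from_left = descend(-2.0)    # rolls into the deeper valley

print(from_right, f(from_right))  # local minimum, near x = 1.13
print(from_left, f(from_left))    # global minimum, near x = -1.30
```

Both runs stop at a point where the gradient is zero, but only the one started from the left finds the lowest value of the function.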

**Unconstrained Optimization** and **Constrained Optimization**

**Unconstrained optimization** is an optimization problem where we find a minimum of a function under the assumption that the parameters can take any possible value (no parameter limitation). **Constrained optimization** simply limits the possible values by introducing a set of constraints.

Gradient descent is an unconstrained optimization if there is no parameter limitation. If we set some limit, for example, x > 1, it becomes a constrained optimization.
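The difference can be sketched by minimizing f(x) = x^2 twice, once freely and once with a lower bound. To handle the bound I use projected gradient descent, which is a technique I am swapping in for illustration: after every step, x is pushed back into the allowed region. I also use x ≥ 1 rather than x > 1 so that the minimum is actually attained. The function name `minimize` and its parameters are hypothetical.

```python
def minimize(grad, x0, lower_bound=None, learning_rate=0.1, steps=200):
    """Gradient descent, optionally projected back onto x >= lower_bound."""
    x = x0
    for _ in range(steps):
        x = x - learning_rate * grad(x)
        if lower_bound is not None:
            # Projection step: if we stepped outside the allowed region, move back to its edge.
            x = max(x, lower_bound)
    return x


def grad_f(x):
    """Derivative of f(x) = x^2, whose unconstrained minimum is at x = 0."""
    return 2 * x


unconstrained = minimize(grad_f, x0=5.0)
constrained = minimize(grad_f, x0=5.0, lower_bound=1.0)

print(unconstrained)  # very close to 0
print(constrained)    # 1.0, the best value the constraint allows
```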

# Conclusion

Machine learning is an everyday tool that data scientists use to obtain the valuable patterns we need. Learning the math behind machine learning could give you an edge in your work. There are many math subjects out there, but six subjects matter the most when we are starting to learn machine learning math:

- Linear Algebra
- Analytic Geometry
- Matrix Decomposition
- Vector Calculus
- Probability and Distribution
- Optimization

If you are starting to learn math for machine learning, you could read my other article to avoid the study pitfalls. I also provide math material you might want to check out in that article.

**Critics:**

**Machine learning** (**ML**) is the study of computer algorithms that improve automatically through experience and by the use of data. It is seen as a part of artificial intelligence. Machine learning algorithms build a model based on sample data, known as “training data”, in order to make predictions or decisions without being explicitly programmed to do so. Machine learning algorithms are used in a wide variety of applications, such as in medicine, email filtering, speech recognition, and computer vision, where it is difficult or unfeasible to develop conventional algorithms to perform the needed tasks.

*A subset of machine learning is closely related to computational statistics, which focuses on making predictions using computers; but not all machine learning is statistical learning. The study of mathematical optimization delivers methods, theory and application domains to the field of machine learning. Data mining is a related field of study, focusing on exploratory data analysis through unsupervised learning. In its application across business problems, machine learning is also referred to as predictive analytics.*

*Machine learning approaches are traditionally divided into three broad categories, depending on the nature of the “signal” or “feedback” available to the learning system:*

- *Supervised learning: The computer is presented with example inputs and their desired outputs, given by a “teacher”, and the goal is to learn a general rule that maps inputs to outputs.*
- *Unsupervised learning: No labels are given to the learning algorithm, leaving it on its own to find structure in its input. Unsupervised learning can be a goal in itself (discovering hidden patterns in data) or a means towards an end (feature learning).*
- *Reinforcement learning: A computer program interacts with a dynamic environment in which it must perform a certain goal (such as driving a vehicle or playing a game against an opponent). As it navigates its problem space, the program is provided feedback that’s analogous to rewards, which it tries to maximize.*
