(sec:introduction)=
# Introduction
Why machine learning for the sciences?
--------------------------------------
Machine learning and artificial neural networks are everywhere and
change our daily life more profoundly than we might be aware of.
However, these concepts are not a particularly recent invention. Their
foundational principles emerged already in the 1940s. The *perceptron*,
the predecessor of the artificial neuron, the basic unit of many neural
networks to date, was invented by Frank Rosenblatt in 1958, and even
cast into a hardware realization by IBM.
It then took half a century for these ideas to become technologically
relevant. Now, artificial intelligence based on neural-network
algorithms has become an integral part of data processing with
widespread applications. The reason for its tremendous success is
twofold. First, the availability of big and structured data caters to
machine learning applications. Second, the realization that deep
(feed-forward) networks (made from many “layers” of artificial neurons)
with many variational parameters are tremendously more powerful than
few-layer ones was a big leap, the “deep learning revolution”. Machine
learning refers to algorithms that infer information from data in an
implicit way. If the algorithms are inspired by the functionality of
neural activity in the brain, the term *cognitive* or *neural* computing
is used. *Artificial neural networks* refer to a specific, albeit most
broadly used, ansatz for machine learning. Another field that concerns
iteself with inferring information from data is statistics. In that
sense, both machine learning and statistics have the same goal. However,
the way this goal is achieved is markedly different: while statistics
uses insights from mathematics to extract information, machine learning
aims at optimizing a variational function using available data through
learning. The mathematical foundations of machine learning with neural
networks are poorly understood: we do not know why deep learning works.
Nevertheless, there are some exact results for special cases. For
instance, certain classes of neural networks are a complete basis of
smooth functions, that is, when equipped with enough variational
parameters, they can approximate any smooth high-dimensional function
with arbitrarily precision. Other variational functions with this
property we commonly use are Taylor or Fourier series (with the
coefficients as “variational” parameters). We can think of neural
networks as a class or variational functions, for which the parameters
can be efficiently optimized with respect to a desired objective.
As an example, this objective can be the classification of handwritten
digits from ‘0’ to ‘9’. The input to the neural network would be an
image of the number, encoded in a vector of grayscale values. The output
is a probability distribution saying how likely it is that the image
shows a ‘0’, ‘1’, ‘2’, and so on. The variational parameters of the
network are adjusted until it accomplishes that task well. This is a
classical example of *supervised learning*. To perform the network
optimization, we need data consisting of input data (the pixel images)
and labels (the integer number shown on the respective image).
Our hope is that the optimized network also recognizes handwritten
digits it has not seen during the learning. This property of a network
is called *generalization*. It stands in opposition to a tendency called
*overfitting*, which means that the network has learned specificities of
the data set it was presented with, rather than the abstract features
necessary to identify the respective digit. An illustrative example of
overfitting is fitting a polynomial of degree $9$ to $10$ data points,
which will always be a perfect fit. Does this mean that this polynomial
best characterizes the behavior of the measured system? Of course not!
Fighting overfitting and creating algorithms that generalize well are
key challenges in machine learning. We will study several approaches to
achieve this goal.
```{figure} ../../_static/lecture_specific/introduction/mnist_digits.png
---
height: 150px
name: fig:MNIST
---
Examples of the digits from the handwritten
MNIST dataset.
```
Handwritten digit recognition has become one of the standard benchmark
problems in the field. Why so? The reason is simple: there exists a very
good and freely available data set for it, the MNIST database [^1], see
{numref}`fig:MNIST`. This curious fact highlights an important aspect of
machine learning: it is all about data. The most efficient way to
improve machine learning results is to provide more and better data.
Thus, one should keep in mind that despite the widespread applications,
machine learning is not the hammer for every nail. It is most beneficial
if large and **balanced** data sets, meaning roughly that the algorithm
can learn all aspects of the problem equally, in a machine-readable way
are available.
This lecture is an introduction specifically targeting the use of
machine learning in different domains of science. In scientific
research, we see a vastly increasing number of applications of machine
learning, mirroring the developments in industrial technology. With
that, machine learning presents itself as a universal new tool for the
exact sciences, standing side-by-side with methods such as calculus,
traditional statistics, and numerical simulations. This poses the
question, where in the scientific workflow, summerized in
{numref}`fig:scientific_workflow`, these novel methods are best
employed.
Once a specific task has been identified, applying machine learning to
the sciences does, furthermore, hold its very specific challenges: (i)
scientific data has often very particular structure, such as the nearly
perfect periodicity in an image of a crystal; (ii) typically, we have
specific knowledge about correlations in the data which should be
reflected in a machine learning analysis; (iii) we want to understand
why a particular algorithm works, seeking a fundamental insight into
mechanisms and laws of nature; (iv) in the sciences we are used to
algorithms and laws that provide deterministic answers while machine
learning is intrinsically probabilistic - there is no absolute
certainty. Nevertheless, quantitative precision is paramount in many
areas of science and thus a critical benchmark for machine learning
methods.
```{figure} ../../_static/lecture_specific/introduction/scientific_workflow.png
---
height: 200px
name: fig:scientific_workflow
---
From observations, via
abstraction to building and testing hypothesis or laws, to finally
making predictions.
```
**A note on the concept of a model**\
In both machine learning and the sciences, models play a crucial role.
However, it is important to recognize the difference in meaning: In the
natural sciences, a model is a conceptual representation of a
phenomenon. A scientific model does not try to represent the whole
world, but only a small part of it. A model is thus a simplification of
the phenomenon and can be both a theoretical construct, for example the
ideal gas model or the Bohr model of the atom, or an experimental
simplification, such as a small version of an airplane in a wind
channel.
In machine learning, on the other hand, we most often use a complicated
variational function, for example a neural network, to try to
approximate a statistical model. But what is a model in statistics?
Colloquially speaking, a statistical model comprises a set of
statistical assumptions which allow us to calculate the probability
$P(x)$ of *any* event $x$. The statistical model does not correspond to
the true distribution of all possible events, it simply approximates the
distribution. Scientific and statistical models thus share an important
property: neither claims to be a representation of reality.
Overview and learning goals
---------------------------
This lecture is an introduction to basic machine learning algorithms for
scientists and students of the sciences. We will cover
- the most fundamental machine learning algorithms,
- the terminology of the field, succinctly explained,
- the principles of supervised and unsupervised learning and why it is
so successful,
- various architectures of artificial neural networks and the problems
they are suitable for,
- how we find out what the machine learning algorithm uses to solve a
problem.
The field of machine learning is full of lingo which to the uninitiated
obscures what is at the core of the methods. Being a field in constant
transformation, new terminology is being introduced at a fast pace. Our
aim is to cut through slang with mathematically precise and concise
formulations in order to demystify machine learning concepts for someone
with an understanding of calculus and linear algebra.
As mentioned above, data is at the core of most machine learning
approaches discussed in this lecture. With raw data in many cases very
complex and extremely high dimensional, it is often crucial to first
understand the data better and reduce their dimensionality. Simple
algorithms that can be used before turning to the often heavy machinery
of neural networks will be discussed in the next section,
{ref}`sec:structuring_data`.
The machine learning algorithms we will focus on most can generally be
divided into two classes of algorithms, namely *discriminative* and
*generative* algorithms as illustrated in {numref}`fig:overview`.
Examples of discriminative tasks include classification problems, such
as the aforementioned digit classification or the classification into
solid, liquid and gas phases given some experimental observables.
Similarly, regression, in other words estimating relationships between
variables, is a discriminative problem. More specifically, we try to
approximate the conditional probability distribution $P(y|x)$ of some
variable $y$ (the label) given some input data $x$. As data is provided
in the form of input and target data for most of these tasks, these
algorithms usually employ supervised learning. Discriminative algorithms
are most straight-forwardly applicable in the sciences and we will
discuss them in Secs. {ref}`sec:linear-methods-for-supervised-learning`
and {ref}`sec:supervised`.
```{figure} ../../_static/lecture_specific/introduction/overview.png
:name: fig:overview
**Overview over the plan of the lecture from the perspective of learning
probability
distributions.**
```
Generative algorithms, on the other hand, model a probability
distribution $P(x)$. These approaches are—once trained—in principle more
powerful, since we can also learn the joint probability distribution
$P(x,y)$ of both the data $x$ and the labels $y$ and infer the
conditional probability of $y$. Still, the more targeted approach of
discriminative learning is better suited for many problems. However,
generative algorithms are useful in the natural sciences, as we can
sample from a known probability distribution, for example for image
denoising, or when trying to find new compounds/molecules resembling
known ones with given properties. These algorithms are discussed in
Sec. {ref}`sec:unsupervised`. The promise of artificial *intelligence* may
trigger unreasonable expectations in the sciences. After all, scientific
knowledge generation is one of the most complex intellectual processes.
Computer algorithms are certainly far from achieving anything on that
level of complexity and will in the near future not formulate new laws
of nature independently. Nevertheless, researchers study how machine
learning can help with individual segments of the scientific workflow
({numref}`fig:scientific_workflow`). While the type of abstraction
needed to formulate Newton’s laws of classical mechanics seems
incredibly complex, neural networks are very good at *implicit knowledge
representation*. To understand precisely how they achieve certain tasks,
however, is not an easy undertaking. We will discuss this question of
*interpretability* in Sec. {ref}`sec:interpretability`.
A third class of algorithms, which does not neatly fit the framework of
approximating a statistical model and thus the distinction into
discriminative and generative algorithms is known as reinforcement
learning. Instead of approximating a statistical model, reinforcement
learning tries to optimize strategies (actions) for achieving a given
task. Reinforcement learning has gained a lot of attention with Google’s
AlphaGo Zero, a computer program that beat the best Go players in the
world. As an example for an application in the sciences, reinforcement
learning can be used to decide on what experimental configuration to
perform next. While the whole topic is beyond the scope of this lecture,
we will give an introduction to the basic concepts of reinforcement
learning in Sec. {ref}`sec:RL`.
A final note on the practice of learning. While the machine learning
machinery is extremely powerful, using an appropriate architecture and
the right training details, captured in what are called
*hyperparameters*, is crucial for its successful application. Though
there are attempts to learn a suitable model and all hyperparameters as
part of the overall learning process, this is not a simple task and
requires immense computational resources. A large part of the machine
learning success is thus connected to the experience of the scientist
using the appropriate algorithms. We thus strongly encourage solving the
accompanying exercises carefully and taking advantage of the exercise
classes.
Resources
---------
While it may seem that implementing ML tasks is computationally
challenging, actually almost any ML task one might be interested in can
be done with relatively few lines of code simply by relying on external
libraries or mathematical computing systems such as Mathematica or
Matlab. At the moment, most of the external libraries are written for
the Python programming language. Here are some useful Python libraries:
1. **TensorFlow.** Developed by Google, Tensorflow is one of the most
popular and flexible library for machine learning with complex
models, with full GPU support.
2. **PyTorch.** Developed by Facebook, Pytorch is the biggest rival
library to Tensorflow, with pretty much the same functionalities.
3. **Scikit-Learn.** Whereas TensorFlow and PyTorch are catered for
deep learning practitioners, Scikit-Learn provides much of the
traditional machine learning tools, including linear regression and
PCA.
4. **Pandas.** Modern machine learning is largely reliant on big
datasets. This library provides many helpful tools to handle these
large datasets.
Prerequisites
-------------
This course is aimed at students of the (natural) sciences with a basic
mathematics education and some experience in programming. In particular,
we assume the following prerequisites:
- Basic knowledge of calculus and linear algebra.
- Rudimentary knowledge of statistics and probability theory
(advantageous).
- Basic knowledge of a programming language. For the teaching
assignments, you are free to choose your preferable one. The
solutions will typically be distributed in Python in the form of
Jupyter notebooks.
Please, don’t hesitate to ask questions if any notions are unclear.
References
----------
For further reading, we recommend the following books:
- **ML without neural networks**: *The Elements of Statistical
Learning*, T. Hastie, R. Tisbshirani, and J. Friedman (Springer)
- **ML with neural networks**: *Neural Networks and Deep
Learning*, M. Nielson ()
- **Deep Learning Theory**: *Deep Learning*, I. Goodfellow, Y.
Bengio and A. Courville ()
- **Reinforcement Learning**: *Reinforcement Learning*, R. S.
Sutton and A. G. Barto (MIT Press)
[^1]: http://yann.lecun.com/exdb/mnist