(sec:supervised)=
# Computational neurons

The basic building block of a neural network is the neuron. Let us consider a single neuron which we assume to be connected to $k$ neurons in the preceding layer, see {numref}`fig:NN_act` left side. The neuron corresponds to a function $f:\mathbb{R}^k\to \mathbb{R}$ which is a composition of a linear function $q:\mathbb{R}^k\to \mathbb{R}$ and a non-linear function $g:\mathbb{R}\to \mathbb{R}$, the so-called *activation function*. Specifically,

```{math}
f(z_1,\ldots,z_k) = g(q(z_1,\ldots,z_k)),
```

where $z_1, z_2, \dots, z_k$ are the outputs of the neurons from the preceding layer to which the neuron is connected.

The linear function is parametrized as

```{math}
q(z_1,\ldots,z_k) = \sum_{j=1}^k w_jz_j + b.
```

Here, the real numbers $w_1, w_2, \dots, w_k$ are called *weights* and can be thought of as the “strength” of each respective connection between neurons in the preceding layer and this neuron. The real parameter $b$ is known as the *bias* and is simply a constant offset[^1]. The weights and biases are the variational parameters we will need to optimize when we train the network.

The activation function $g$ is crucial for the neural network to be able to approximate any smooth function, since without it the neuron would merely perform a linear transformation. For this reason, $g$ has to be non-linear. In analogy to biological neurons, $g$ represents the property of the neuron that it “spikes”, i.e., it produces a noticeable output only when the input potential grows beyond a certain threshold value. The most common choices for activation functions, shown in {numref}`fig:NN_act`, include:

```{figure} ../../_static/lecture_specific/supervised-ml_w_NN/act_functions.png
:name: fig:NN_act

**Left: schematic of a single neuron and its functional form.
Right: examples of the commonly used activation functions: ReLU, sigmoid function and hyperbolic tangent.**
```

- *ReLU*: ReLU stands for rectified linear unit. It is zero for all negative inputs and equal to its input for all positive ones, $g(q)=\max(0,q)$.
- *Sigmoid*: The sigmoid function, usually taken as the logistic function $g(q)=1/(1+e^{-q})$, is a smoothed version of the step function.
- *Hyperbolic tangent*: The hyperbolic tangent function behaves similarly to the sigmoid but takes both positive and negative values, ranging over $(-1,1)$ instead of $(0,1)$.
- *Softmax*: The softmax function is a common activation function for the last layer in a classification problem (see below).

The choice of activation function is part of the neural network architecture and is therefore not changed during training (in contrast to the variational parameters, weights and biases, which are adjusted during training). Typically, the same activation function is used for all neurons in a layer, while the activation function may vary from layer to layer. Determining a good activation function for a given layer of a neural network is typically a heuristic rather than a systematic task.

Note that the softmax is a special case of an activation function, as it explicitly depends on the output of the $q$ functions in the other neurons of the same layer. Let us label by $l=1,\ldots,n$ the $n$ neurons in a given layer and by $q_l$ the output of their respective linear transformation. Then, the *softmax* is defined as

```{math}
g_l(q_1,\ldots, q_n)= \frac{e^{q_{l}}}{\sum_{l'=1}^ne^{q_{l'}}}
```

for the output of neuron $l$. A useful property of softmax is that $\sum_l g_l(q_1,\ldots, q_n)=1,$ so that the layer output can be interpreted as a probability distribution. The softmax function is thus a continuous generalization of the argmax function introduced in the previous chapter.
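To make the functional form of a neuron concrete, the following is a minimal sketch in plain Python. The function names (`neuron`, `relu`, `sigmoid`, `softmax`) and the numerical values of the inputs, weights and bias are illustrative assumptions and not part of the lecture material. The softmax is written in the common convention with $e^{q_l}$ in the numerator, so that the largest $q_l$ receives the largest probability.

```python
import math


def relu(q):
    """Rectified linear unit: zero for negative input, identity otherwise."""
    return max(0.0, q)


def sigmoid(q):
    """Logistic function, a smooth version of the step function."""
    return 1.0 / (1.0 + math.exp(-q))


def neuron(z, w, b, g):
    """Single neuron: f(z) = g(q(z)) with q(z) = sum_j w_j z_j + b."""
    q = sum(w_j * z_j for w_j, z_j in zip(w, z)) + b
    return g(q)


def softmax(qs):
    """Softmax over the linear outputs q_1..q_n of one layer.

    Subtracting max(qs) leaves the result unchanged but avoids overflow
    in exp for large inputs.
    """
    q_max = max(qs)
    exps = [math.exp(q - q_max) for q in qs]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical example values: k = 3 inputs from the preceding layer.
z = [1.0, -2.0, 0.5]
w = [0.5, 0.25, -1.0]
b = 0.1

for name, g in [("relu", relu), ("sigmoid", sigmoid), ("tanh", math.tanh)]:
    print(name, neuron(z, w, b, g))

probs = softmax([1.0, 2.0, 3.0])
print(probs, sum(probs))
```

The same linear function $q$ is reused for every activation; only $g$ changes, which mirrors the statement that the activation function is an architectural choice while $w_j$ and $b$ are the trainable parameters.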
A simple network structure
--------------------------

Now that we understand how a single neuron works, we can connect many of them together and create an artificial neural network. The general structure of a simple (feed-forward) neural network is shown in {numref}`fig:simple_network`. The first and last layers are the input and output layers (blue and violet, respectively, in {numref}`fig:simple_network`) and are called *visible layers* as they are directly accessed. All the layers in between are neither accessible for input nor provide any direct output, and are thus called *hidden layers* (green layer in {numref}`fig:simple_network`).

```{figure} ../../_static/lecture_specific/supervised-ml_w_NN/simple_network.png
:name: fig:simple_network

**Architecture and variational parameters.**
```

Assuming we can feed the input to the network as a vector, we denote the input data with ${\boldsymbol{x}}$. The network then transforms this input into the output ${\boldsymbol{F}}({\boldsymbol{x}})$, which in general is also a vector. As a simple and concrete example, we write the complete functional form of a neural network with one hidden layer as shown in {numref}`fig:simple_network`,

```{math}
:label: eq:2-layer NN
{\boldsymbol{F}}({\boldsymbol{x}}) = {\boldsymbol{g}}^{[2]}\left( W^{[2]}{\boldsymbol{g}}^{[1]} \left(W^{[1]}{\boldsymbol{x}}+{\boldsymbol{b}}^{[1]}\right)+{\boldsymbol{b}}^{[2]} \right).
```

Here, $W^{[n]}$ and ${\boldsymbol{b}}^{[n]}$ are the weight matrix and bias vector of the $n$-th layer. Specifically, $W^{[1]}$ is the $l\times k$ weight matrix of the hidden layer, with $k$ and $l$ the number of neurons in the input and hidden layer, respectively. $W_{ij}^{[1]}$ is the $j$-th entry of the weight vector of the $i$-th neuron in the hidden layer, while $b_i^{[1]}$ is the bias of this neuron. The $W_{ij}^{[2]}$ and $b_i^{[2]}$ are the respective quantities for the output layer.
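The functional form of a network with one hidden layer can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the helper names `dense` and `forward`, the layer sizes, the numerical values of the weights and biases, and the choice of ReLU for the hidden layer and the identity for the output layer are all hypothetical. Each weight matrix is stored with one row per neuron of its layer and one column per neuron of the preceding layer, matching the meaning of $W_{ij}$ as the $j$-th weight of the $i$-th neuron.

```python
def relu(q):
    """Rectified linear unit."""
    return max(0.0, q)


def identity(q):
    """Trivial activation, used here only to keep the output easy to check."""
    return q


def dense(W, b, x, g):
    """One fully connected layer: g applied element-wise to W x + b.

    Row i of W holds the weights of neuron i; b[i] is its bias.
    """
    return [
        g(sum(W_ij * x_j for W_ij, x_j in zip(row, x)) + b_i)
        for row, b_i in zip(W, b)
    ]


def forward(x, layers):
    """Evaluate the network layer by layer, feeding each output forward."""
    a = x
    for W, b, g in layers:
        a = dense(W, b, a, g)
    return a


# Hypothetical network: k = 2 inputs, l = 3 hidden neurons, 1 output neuron.
W1 = [[1.0, -1.0], [0.5, 0.5], [-1.0, 2.0]]  # shape l x k = 3 x 2
b1 = [0.0, -1.0, 0.5]
W2 = [[1.0, 2.0, -1.0]]                      # shape 1 x l = 1 x 3
b2 = [0.25]

layers = [(W1, b1, relu), (W2, b2, identity)]
x = [2.0, 1.0]

hidden = dense(W1, b1, x, relu)
output = forward(x, layers)
print(hidden, output)
```

Because `forward` reuses the hidden-layer values for every neuron of the next layer, each neuron is evaluated exactly once per input.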
This network is called *fully connected* or *dense*, because each neuron in a given layer takes as input the output from all the neurons in the previous layer; in other words, all weights are allowed to be non-zero.

Note that for the evaluation of such a network, we first calculate the values of all the neurons of the first hidden layer, which feed into the neurons of the second hidden layer, and so on until we reach the output layer. This procedure, which is possible only for feed-forward neural networks, is obviously much more efficient than evaluating the nested function of each output neuron independently.

[^1]: Note that this bias is unrelated to the bias we learned about in regression.