**Output:** Visible vector $\mathbf{v}(r)$

**for** $n = 1, \dots, r$ **do**

$\quad$ sample $\mathbf{h}(n)$ from $P_{\rm rbm}(\mathbf{h}|\mathbf{v}=\mathbf{v}(n-1))$

$\quad$ sample $\mathbf{v}(n)$ from $P_{\rm rbm}(\mathbf{v}|\mathbf{h}=\mathbf{h}(n))$

**end**
```

With sufficiently many steps $r$, the vector $\mathbf{v}(r)$ is an unbiased sample drawn from $P_{\textrm{rbm}}(\mathbf{v})$. By repeating the procedure, we can obtain multiple samples to estimate the summation. Note that this is still rather computationally expensive, as it requires many evaluations of the model. The key innovation that makes training an RBM computationally feasible was proposed by Geoffrey Hinton (2002): instead of obtaining multiple samples, we simply perform Gibbs sampling with $r$ steps and estimate the summation with a single sample. In other words, we replace the second summation in Eq. [](eqn:RBM-derivatives) with

```{math}
\sum_{\mathbf{v}} v_{i} P_{\textrm{rbm}}(h_{j}=1|\mathbf{v}) P_{\textrm{rbm}}(\mathbf{v}) \rightarrow v'_{i} P_{\textrm{rbm}}(h_{j}=1|\mathbf{v}'),
```

where $\mathbf{v}' = \mathbf{v}(r)$ is simply the sample obtained from $r$-step Gibbs sampling. With this modification, the gradient, Eq. [](eqn:RBM-derivatives), can be approximated as

```{math}
\frac{\partial\log P_{\textrm{rbm}}(\mathbf{x})}{\partial W_{ij}} \approx x_{i}P_{\textrm{rbm}}(h_{j}=1|\mathbf{x}) - v'_{i} P_{\textrm{rbm}}(h_{j}=1|\mathbf{v}').
```

This method is known as *contrastive divergence*. Although the quantity computed is only a biased estimator of the gradient, the approach is found to work well in practice. The complete algorithm for training an RBM with $r$-step contrastive divergence can be summarised as follows:

```{admonition} Contrastive divergence
:name: alg:contrastive-divergence
**Input:** Dataset $\mathcal{D} = \lbrace \mathbf{x}_{1}, \mathbf{x}_{2}, \dots, \mathbf{x}_{M} \rbrace$ drawn from a distribution $P(\mathbf{x})$

Initialize the RBM weights $\lbrace \mathbf{a},\mathbf{b},W \rbrace$

**while** not converged **do**

$\quad$ select a random batch $S$ of samples from the dataset $\mathcal{D}$

$\quad$ reset $\Delta W_{ij} = \Delta a_{i} = \Delta b_{j} = 0$

$\quad$ **forall** $\mathbf{x} \in S$ **do**

$\quad\quad$ Obtain $\mathbf{v}'$ by $r$-step Gibbs sampling starting from $\mathbf{x}$

$\quad\quad$ $\Delta W_{ij} \leftarrow \Delta W_{ij} - x_{i}P_{\textrm{rbm}}(h_{j}=1|\mathbf{x}) + v'_{i} P_{\textrm{rbm}}(h_{j}=1|\mathbf{v}')$

$\quad$ **end**

$\quad$ $W_{ij} \leftarrow W_{ij} - \eta\Delta W_{ij}$

$\quad$ (and similarly for $\mathbf{a}$ and $\mathbf{b}$)

**end**
```

Having trained the RBM to represent the underlying data distribution $P(\mathbf{x})$, there are a few ways one can use the trained model:

1. **Pretraining:** We can use $W$ and $\mathbf{b}$ as the initial weights and biases of a deep network (cf. Chapter 4), which is then fine-tuned with gradient descent and backpropagation.

2. **Generative modelling:** As a generative model, a trained RBM can be used to generate new samples via Gibbs sampling. Potential uses of this generative aspect include *recommender systems* and *image reconstruction*. In the following subsection, we provide an example where an RBM is used to reconstruct a noisy signal.

## Example: signal or image reconstruction/denoising

A major drawback of simple RBMs is that they only take binary data as input. As an example, we thus look at simple periodic waveforms with 60 sample points, in particular sawtooth, sine, and square waveforms. In order to have quasi-continuous data, we use eight bits for each point, such that our signal can take integer values from 0 to 255. Finally, we generate the training samples with small variations in the maximum value, the periodicity, and the center point of each waveform.

After training the RBM using the contrastive divergence algorithm, we have a model which represents the data distribution of the binarized waveforms. Consider now a signal which has been corrupted, meaning some parts of the waveform have not been received, in other words they are set to 0. By feeding this corrupted data into the RBM and performing a few iterations of Gibbs sampling, we can obtain a reconstruction of the signal, where the missing part has been repaired, as can be seen at the bottom of {numref}`fig:RBM_reconstruction`. Note that the same procedure can be used to reconstruct or denoise images.
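The eight-bit encoding used in this example can be sketched as follows. This is a minimal NumPy illustration; the bit ordering and the choice of corrupted segment are our own assumptions, not specified in the text.

```python
import numpy as np

def binarize(signal):
    """Encode an integer signal with values in 0..255 as binary units.

    Each sample point becomes 8 bits, so a 60-point waveform maps to a
    480-dimensional binary vector suitable as RBM visible units.
    (Least-significant-bit-first ordering is an arbitrary choice.)
    """
    return ((signal[:, None] >> np.arange(8)) & 1).astype(float).reshape(-1)

def debinarize(bits):
    """Decode the binary visible vector back to integer values 0..255."""
    return (bits.reshape(-1, 8).astype(int) * (1 << np.arange(8))).sum(axis=1)

# A sine waveform with 60 sample points, scaled to the range 0..255.
t = np.linspace(0, 2 * np.pi, 60)
wave = np.round((np.sin(t) + 1) / 2 * 255).astype(int)

v = binarize(wave)                          # 480 binary visible units
assert np.array_equal(debinarize(v), wave)  # lossless round trip

# Corrupt the signal: a (hypothetical) missing segment is set to 0.  A
# trained RBM would repair it with a few steps of Gibbs sampling
# starting from binarize(corrupted).
corrupted = wave.copy()
corrupted[20:30] = 0
```

Note the 8-fold blow-up in input size: this binarization is what makes the visible layer 480 units wide for a mere 60-point signal.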
Due to the limitation to binary data, however, the picture has to either be binarized, or the input size to the RBM becomes fairly large for high-resolution pictures. It is thus not surprising that while RBMs have been popular in the mid-2000s, they have largely been superseded by more modern and architectures such as *generative adversarial networks* which we shall explore later in the chapter. However, they still serve a pedagogical purpose and could also provide inspiration for future innovations, in particular in science. A recent example is the idea of using an RBM to represent a quantum mechanical state. ```{figure} ../../_static/lecture_specific/unsupervised-ml/rbm_reconstr.png :name: fig:RBM_reconstruction **Signal reconstruction.** Using an RBM to repair a corrupted signal, here a sine and a sawtooth waveform. ```
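To make the contrastive divergence algorithm above concrete, here is a minimal NumPy sketch of a binary RBM trained with CD-$r$. The layer sizes, learning rate, and random toy data in the usage lines are illustrative assumptions, not values from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class RBM:
    """Minimal binary RBM trained with r-step contrastive divergence."""

    def __init__(self, n_visible, n_hidden):
        self.W = rng.normal(0.0, 0.01, size=(n_visible, n_hidden))
        self.a = np.zeros(n_visible)  # visible biases
        self.b = np.zeros(n_hidden)   # hidden biases

    def p_h_given_v(self, v):
        # P_rbm(h_j = 1 | v)
        return sigmoid(self.b + v @ self.W)

    def p_v_given_h(self, h):
        # P_rbm(v_i = 1 | h)
        return sigmoid(self.a + h @ self.W.T)

    def gibbs(self, v, r):
        # r alternating sampling steps: v -> h -> v
        for _ in range(r):
            h = (rng.random(self.b.shape) < self.p_h_given_v(v)).astype(float)
            v = (rng.random(self.a.shape) < self.p_v_given_h(h)).astype(float)
        return v

    def cd_step(self, batch, r=1, eta=0.1):
        # Accumulate the CD-r gradient estimate over the batch,
        # following the update rule in the algorithm box.
        dW = np.zeros_like(self.W)
        da = np.zeros_like(self.a)
        db = np.zeros_like(self.b)
        for x in batch:
            v_prime = self.gibbs(x, r)
            ph_x = self.p_h_given_v(x)
            ph_v = self.p_h_given_v(v_prime)
            dW += -np.outer(x, ph_x) + np.outer(v_prime, ph_v)
            da += -(x - v_prime)
            db += -(ph_x - ph_v)
        # Gradient-descent update (note the minus sign, as in the algorithm).
        self.W -= eta * dW / len(batch)
        self.a -= eta * da / len(batch)
        self.b -= eta * db / len(batch)

# Illustrative usage on random binary toy data (sizes are arbitrary).
rbm = RBM(n_visible=6, n_hidden=3)
data = (rng.random((20, 6)) < 0.5).astype(float)
for _ in range(50):
    rbm.cd_step(data, r=1, eta=0.1)
new_sample = rbm.gibbs(data[0], r=10)  # generate a sample via Gibbs sampling
```

Per-sample Gibbs chains are used here for clarity; practical implementations typically vectorize the chain over the whole batch.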