Boltzmann machines can be seen as the stochastic, generative counterpart of Hopfield networks. They were among the first neural networks capable of learning internal representations, and are able to represent and (given sufficient time) solve difficult combinatorial problems.
They are theoretically intriguing because of the locality and Hebbian nature of their training algorithm (being trained by Hebb's rule), and because of their parallelism and the resemblance of their dynamics to simple physical processes. Boltzmann machines with unconstrained connectivity have not proven useful for practical problems in machine learning or inference, but if the connectivity is properly constrained, the learning can be made efficient enough to be useful for practical problems.
They are named after the Boltzmann distribution in statistical mechanics, which is used in their sampling function; for this reason they are also called "energy-based models" (EBM). They were invented in 1985 by Geoffrey Hinton, then a Professor at Carnegie Mellon University, and Terry Sejnowski, then a Professor at Johns Hopkins University.
A Boltzmann machine, like a Hopfield network, is a network of units with an "energy" (Hamiltonian) defined for the overall network. Its units produce binary results. Unlike Hopfield nets, Boltzmann machine units are stochastic. The global energy $E$ in a Boltzmann machine is identical in form to that of a Hopfield network as well as the Ising model:
$$E = -\left(\sum_{i<j} w_{ij}\, s_i\, s_j + \sum_i \theta_i\, s_i\right)$$
where:
- $w_{ij}$ is the connection strength between unit $j$ and unit $i$.
- $s_i$ is the state, $s_i \in \{0, 1\}$, of unit $i$.
- $\theta_i$ is the bias of unit $i$ in the global energy function. ($-\theta_i$ is the activation threshold for the unit.)
Often the weights $w_{ij}$ are represented as a symmetric matrix $W = [w_{ij}]$ with zeros along the diagonal.
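As a minimal sketch, the energy can be evaluated directly from its definition. The 3-unit weight matrix and biases below are made-up values for illustration, and `global_energy` is a hypothetical helper name rather than part of any library.

```python
def global_energy(weights, biases, states):
    """Energy E = -(sum_{i<j} w_ij s_i s_j + sum_i theta_i s_i)."""
    n = len(states)
    # Each unordered pair is counted once (i < j), matching the formula.
    pair_term = sum(
        weights[i][j] * states[i] * states[j]
        for i in range(n)
        for j in range(i + 1, n)
    )
    bias_term = sum(biases[i] * states[i] for i in range(n))
    return -(pair_term + bias_term)


# Hypothetical 3-unit machine: symmetric weights with a zero diagonal.
W = [
    [0.0, 1.0, -2.0],
    [1.0, 0.0, 0.5],
    [-2.0, 0.5, 0.0],
]
theta = [0.1, -0.3, 0.2]

energy_110 = global_energy(W, theta, [1, 1, 0])  # units 0 and 1 on
energy_101 = global_energy(W, theta, [1, 0, 1])  # units 0 and 2 on
print(energy_110, energy_101)
```

With these values the state in which the positively coupled units 0 and 1 are both on has lower energy than the state in which the negatively coupled units 0 and 2 are both on.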
Unit state probability
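The difference in the global energy that results from a single unit $i$ equaling 0 (off) versus 1 (on), written $\Delta E_i$, assuming a symmetric matrix of weights, is given by:
$$\Delta E_i = \sum_{j>i} w_{ij}\, s_j + \sum_{j<i} w_{ji}\, s_j + \theta_i$$
This can be expressed as the difference of energies of two states:
$$\Delta E_i = E_{i=\text{off}} - E_{i=\text{on}}$$
Substituting the energy of each state with its relative probability according to the Boltzmann factor (the property of a Boltzmann distribution that the energy of a state is proportional to the negative log probability of that state) gives:
$$\Delta E_i = -k_B T \ln(p_{i=\text{off}}) - \left(-k_B T \ln(p_{i=\text{on}})\right)$$
where $k_B$ is Boltzmann's constant and is absorbed into the artificial notion of temperature $T$. We then rearrange terms and consider that the probabilities of the unit being on and off must sum to one:
$$\begin{aligned}
\frac{\Delta E_i}{T} &= \ln(p_{i=\text{on}}) - \ln(p_{i=\text{off}}) \\
\frac{\Delta E_i}{T} &= \ln(p_{i=\text{on}}) - \ln(1 - p_{i=\text{on}}) \\
\frac{\Delta E_i}{T} &= \ln\left(\frac{p_{i=\text{on}}}{1 - p_{i=\text{on}}}\right) \\
-\frac{\Delta E_i}{T} &= \ln\left(\frac{1}{p_{i=\text{on}}} - 1\right) \\
\exp\left(-\frac{\Delta E_i}{T}\right) &= \frac{1}{p_{i=\text{on}}} - 1
\end{aligned}$$
Solving for $p_{i=\text{on}}$, the probability that the $i$-th unit is on gives:
$$p_{i=\text{on}} = \frac{1}{1 + \exp\left(-\frac{\Delta E_i}{T}\right)}$$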
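The result can be checked numerically. The sketch below computes the energy gap of one unit, confirms that it equals the energy with the unit off minus the energy with the unit on, and converts it to an on-probability. The weights, biases, and function names are illustrative assumptions.

```python
import math


def global_energy(weights, biases, states):
    """E = -(sum_{i<j} w_ij s_i s_j + sum_i theta_i s_i)."""
    n = len(states)
    pairs = sum(
        weights[i][j] * states[i] * states[j]
        for i in range(n)
        for j in range(i + 1, n)
    )
    return -(pairs + sum(biases[i] * states[i] for i in range(n)))


def energy_gap(weights, biases, states, i):
    """Delta E_i = sum_{j != i} w_ij s_j + theta_i (symmetric weights assumed)."""
    return biases[i] + sum(
        weights[i][j] * states[j] for j in range(len(states)) if j != i
    )


def prob_on(gap, temperature=1.0):
    """Logistic function of the energy gap: 1 / (1 + exp(-gap / T))."""
    return 1.0 / (1.0 + math.exp(-gap / temperature))


# Hypothetical 3-unit machine.
W = [[0.0, 1.0, -2.0], [1.0, 0.0, 0.5], [-2.0, 0.5, 0.0]]
theta = [0.1, -0.3, 0.2]

gap = energy_gap(W, theta, [0, 1, 0], 0)    # the state of unit 0 itself is ignored
e_off = global_energy(W, theta, [0, 1, 0])  # unit 0 off
e_on = global_energy(W, theta, [1, 1, 0])   # unit 0 on
p = prob_on(gap)
print(gap, e_off - e_on, p)
```

Raising the temperature flattens the logistic curve, so `prob_on(gap, temperature=100.0)` is close to 0.5 regardless of the gap.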
The network runs by repeatedly choosing a unit and resetting its state. After running for long enough at a certain temperature, the probability of a global state of the network depends only upon that global state's energy, according to a Boltzmann distribution, and not on the initial state from which the process was started. This means that log-probabilities of global states become linear in their energies. This relationship is true when the machine is "at thermal equilibrium", meaning that the probability distribution of global states has converged. If the network is started at a high temperature and the temperature is gradually lowered until thermal equilibrium is reached at a lower temperature, it may converge to a distribution where the energy level fluctuates around the global minimum. This process is called simulated annealing.
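The procedure can be sketched as a loop of single-unit updates under a decreasing temperature schedule. This is a minimal illustration, not a tuned implementation: the schedule, the number of updates per temperature, the random unit selection, and the example weights are all assumptions made for the sketch.

```python
import math
import random


def gibbs_update(weights, biases, states, i, temperature, rng):
    """Resample unit i from its conditional distribution given the other units."""
    gap = biases[i] + sum(
        weights[i][j] * states[j] for j in range(len(states)) if j != i
    )
    p_on = 1.0 / (1.0 + math.exp(-gap / temperature))
    states[i] = 1 if rng.random() < p_on else 0


def anneal(weights, biases, schedule, updates_per_temperature, rng):
    """Run single-unit updates while stepping through the temperature schedule."""
    n = len(biases)
    states = [rng.randint(0, 1) for _ in range(n)]  # arbitrary initial state
    for temperature in schedule:
        for _ in range(updates_per_temperature):
            gibbs_update(weights, biases, states, rng.randrange(n), temperature, rng)
    return states


rng = random.Random(0)
# Hypothetical 3-unit machine.
W = [[0.0, 1.0, -2.0], [1.0, 0.0, 0.5], [-2.0, 0.5, 0.0]]
theta = [0.1, -0.3, 0.2]
final_state = anneal(
    W, theta, schedule=[4.0, 2.0, 1.0, 0.5], updates_per_temperature=200, rng=rng
)
print(final_state)
```

At high temperature the units flip almost at random; as the temperature falls, updates increasingly favour the lower-energy setting of each unit.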
To make the network converge to global states according to an external distribution over these states, the weights must be set so that the global states with the highest probabilities get the lowest energies. This is achieved by training.
The units in the Boltzmann machine are divided into 'visible' units, $V$, and 'hidden' units, $H$. The visible units are those that receive information from the 'environment', i.e. the training set is a set of binary vectors over the set $V$. The distribution over the training set is denoted $P^{+}(V)$.
As is discussed above, the distribution over global states converges as the Boltzmann machine reaches thermal equilibrium. We denote this distribution, after we marginalize it over the hidden units, as $P^{-}(V)$.
Our goal is to approximate the "real" distribution $P^{+}(V)$ using the $P^{-}(V)$ produced (eventually) by the machine. The similarity of the two distributions is measured by the Kullback–Leibler divergence, $G$:
$$G = \sum_{v} P^{+}(v) \ln\left(\frac{P^{+}(v)}{P^{-}(v)}\right)$$
where the sum is over all the possible states of $V$. $G$ is a function of the weights, since they determine the energy of a state, and the energy determines $P^{-}(v)$, as promised by the Boltzmann distribution. Hence, we can use a gradient descent algorithm over $G$, so a given weight, $w_{ij}$, is changed by subtracting the partial derivative of $G$ with respect to the weight.
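For a machine small enough to enumerate every global state, the divergence can be computed exactly. The sketch below assumes a hypothetical machine with two visible units followed by one hidden unit, all weights zero, and a made-up target distribution; the function names are illustrative.

```python
import itertools
import math


def equilibrium_distribution(weights, biases, temperature=1.0):
    """Exact Boltzmann distribution over all 2^n global states (tiny n only)."""
    n = len(biases)
    unnormalized = {}
    for s in itertools.product([0, 1], repeat=n):
        energy = -(
            sum(weights[i][j] * s[i] * s[j] for i in range(n) for j in range(i + 1, n))
            + sum(biases[i] * s[i] for i in range(n))
        )
        unnormalized[s] = math.exp(-energy / temperature)
    z = sum(unnormalized.values())
    return {s: u / z for s, u in unnormalized.items()}


def marginal_over_visible(distribution, n_visible):
    """Sum out the hidden units, taken here to be the trailing entries of each state."""
    marginal = {}
    for s, p in distribution.items():
        v = s[:n_visible]
        marginal[v] = marginal.get(v, 0.0) + p
    return marginal


def kl_divergence(p_plus, p_minus):
    """G = sum_v P+(v) ln(P+(v) / P-(v)); terms with P+(v) = 0 contribute nothing."""
    return sum(p * math.log(p / p_minus[v]) for v, p in p_plus.items() if p > 0.0)


# Hypothetical target: the two visible units always agree.
p_plus = {(0, 0): 0.5, (1, 1): 0.5}
W = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
theta = [0.0, 0.0, 0.0]
p_minus = marginal_over_visible(equilibrium_distribution(W, theta), n_visible=2)
G = kl_divergence(p_plus, p_minus)
print(G)
```

An untrained machine with zero weights is uniform over the visible states, so here the divergence equals ln 2. Exact enumeration scales as 2^n and is only feasible for toy examples.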
There are two alternating phases to Boltzmann machine training. One is the "positive" phase where the visible units' states are clamped to a particular binary state vector sampled from the training set (according to $P^{+}$). The other is the "negative" phase where the network is allowed to run freely, i.e. no units have their state determined by external data. Surprisingly enough, the gradient with respect to a given weight, $w_{ij}$, is given by the simple equation:
$$\frac{\partial G}{\partial w_{ij}} = -\frac{1}{R}\left[p_{ij}^{+} - p_{ij}^{-}\right]$$
where:
- $p_{ij}^{+}$ is the probability that units $i$ and $j$ are both on when the machine is at equilibrium on the positive phase.
- $p_{ij}^{-}$ is the probability that units $i$ and $j$ are both on when the machine is at equilibrium on the negative phase.
- $R$ denotes the learning rate.
This result follows from the fact that at thermal equilibrium the probability of any global state when the network is free-running is given by the Boltzmann distribution (hence the name "Boltzmann machine").
Remarkably, this learning rule is fairly biologically plausible because the only information needed to change the weights is provided by "local" information. That is, the connection (or synapse biologically speaking) does not need information about anything other than the two neurons it connects. This is far more biologically realistic than the information needed by a connection in many other neural network training algorithms, such as backpropagation.
The training of a Boltzmann machine does not use the EM algorithm, which is heavily used in machine learning. Minimizing the KL-divergence is equivalent to maximizing the log-likelihood of the data, so the training procedure performs gradient ascent on the log-likelihood of the observed data. This is in contrast to the EM algorithm, where the posterior distribution of the hidden nodes must be calculated before the maximization of the expected value of the complete data likelihood during the M-step.
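The equivalence can be seen by splitting the divergence into two sums. In the short derivation below, $P^{+}$ stands for the data distribution over visible vectors and $P^{-}$ for the model's equilibrium distribution over the same vectors:

```latex
\begin{aligned}
G &= \sum_{v} P^{+}(v) \ln \frac{P^{+}(v)}{P^{-}(v)} \\
  &= \underbrace{\sum_{v} P^{+}(v) \ln P^{+}(v)}_{\text{independent of the weights}}
     \;-\;
     \underbrace{\sum_{v} P^{+}(v) \ln P^{-}(v)}_{\text{expected log-likelihood of the data}}
\end{aligned}
```

Only the second sum depends on the weights, so any weight change that lowers the divergence raises the expected log-likelihood of the data by the same amount.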
Training the biases is similar, but uses only single node activity:
$$\frac{\partial G}{\partial \theta_i} = -\frac{1}{R}\left[p_{i}^{+} - p_{i}^{-}\right]$$
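The weight and bias updates can be illustrated on a toy machine whose statistics are computed exactly instead of by sampling. The sketch below makes several simplifying assumptions: both units are visible (so the positive-phase statistics are just averages under the data distribution), the temperature is 1, the equilibrium distribution is found by enumeration, and `step_size` is a hypothetical scaling of the update.

```python
import itertools
import math


def model_distribution(weights, biases):
    """Exact equilibrium distribution at temperature 1 (enumeration, tiny n only)."""
    n = len(biases)
    unnormalized = {}
    for s in itertools.product([0, 1], repeat=n):
        neg_energy = sum(biases[i] * s[i] for i in range(n)) + sum(
            weights[i][j] * s[i] * s[j] for i in range(n) for j in range(i + 1, n)
        )
        unnormalized[s] = math.exp(neg_energy)
    z = sum(unnormalized.values())
    return {s: u / z for s, u in unnormalized.items()}


def co_activation(distribution, n):
    """stats[i][j] = P(s_i = 1 and s_j = 1); the diagonal holds P(s_i = 1)."""
    return [
        [sum(p for s, p in distribution.items() if s[i] and s[j]) for j in range(n)]
        for i in range(n)
    ]


def train_step(weights, biases, data_distribution, step_size):
    """Move each parameter along (positive-phase minus negative-phase) statistics."""
    n = len(biases)
    positive = co_activation(data_distribution, n)
    negative = co_activation(model_distribution(weights, biases), n)
    for i in range(n):
        biases[i] += step_size * (positive[i][i] - negative[i][i])
        for j in range(n):
            if i != j:  # keep the diagonal at zero
                weights[i][j] += step_size * (positive[i][j] - negative[i][j])


def kl_divergence(p_plus, p_minus):
    return sum(p * math.log(p / p_minus[v]) for v, p in p_plus.items() if p > 0.0)


# Hypothetical data: the two units always agree.
data = {(0, 0): 0.5, (1, 1): 0.5}
W = [[0.0, 0.0], [0.0, 0.0]]
theta = [0.0, 0.0]
kl_before = kl_divergence(data, model_distribution(W, theta))
for _ in range(200):
    train_step(W, theta, data, step_size=0.5)
kl_after = kl_divergence(data, model_distribution(W, theta))
print(kl_before, kl_after, W[0][1])
```

Each update uses only the co-activation statistics of the two units that a weight connects, which is the locality property described above. In a realistic machine both sets of statistics would have to be estimated by running the network to equilibrium.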
The Boltzmann machine would theoretically be a rather general computational medium. For instance, if trained on photographs, the machine would theoretically model the distribution of photographs, and could use that model to, for example, complete a partial photograph.
Unfortunately, there is a serious practical problem with the Boltzmann machine: it seems to stop learning correctly when it is scaled up to anything larger than a trivial size. This is due to a number of effects, the most important of which are:
- the time the machine must be run in order to collect equilibrium statistics grows exponentially with the machine's size, and with the magnitude of the connection strengths
- connection strengths are more plastic when the units being connected have activation probabilities intermediate between zero and one, leading to a so-called variance trap. The net effect is that noise causes the connection strengths to follow a random walk until the activities saturate.
Restricted Boltzmann machine
Although learning is impractical in general Boltzmann machines, it can be made quite efficient in an architecture called the "restricted Boltzmann machine" or "RBM", which does not allow intralayer connections: hidden units are not connected to other hidden units, and visible units are not connected to other visible units. After training one RBM, the activities of its hidden units can be treated as data for training a higher-level RBM. This method of stacking RBMs makes it possible to train many layers of hidden units efficiently and is one of the most common deep learning strategies. As each new layer is added the overall generative model gets better.
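Because units within a layer are not connected to each other, all units of one layer can be sampled in a single step given the other layer. The sketch below illustrates this and the stacking idea; it does not train anything, and the weights, biases, and function names are made-up values for illustration.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def hidden_probabilities(visible, weights, hidden_bias):
    """P(h_j = 1 | v); weights[i][j] couples visible unit i to hidden unit j."""
    return [
        sigmoid(hidden_bias[j] + sum(v * weights[i][j] for i, v in enumerate(visible)))
        for j in range(len(hidden_bias))
    ]


def visible_probabilities(hidden, weights, visible_bias):
    """P(v_i = 1 | h), using the same weights in the other direction."""
    return [
        sigmoid(visible_bias[i] + sum(h * weights[i][j] for j, h in enumerate(hidden)))
        for i in range(len(visible_bias))
    ]


def sample(probabilities, rng):
    """Draw an independent binary value for every unit of a layer."""
    return [1 if rng.random() < p else 0 for p in probabilities]


rng = random.Random(0)
W1 = [[2.0, -2.0], [0.0, 1.0], [-1.0, 0.5]]  # 3 visible units x 2 hidden units
b_hidden = [0.0, 0.0]
b_visible = [0.0, 0.0, 0.0]

v = [1, 0, 0]
h_probs = hidden_probabilities(v, W1, b_hidden)
h = sample(h_probs, rng)
v_reconstructed = sample(visible_probabilities(h, W1, b_visible), rng)

# Stacking: hidden activities of the first RBM serve as input to a second RBM.
W2 = [[0.5], [-0.5]]  # 2 inputs x 1 hidden unit
top_probs = hidden_probabilities(h_probs, W2, [0.0])
print(h_probs, top_probs)
```

In a general Boltzmann machine the same step would require updating units one at a time, since each unit's conditional distribution depends on its neighbours within the layer.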
Deep Boltzmann machine
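A deep Boltzmann machine (DBM) is a type of binary pairwise Markov random field (undirected probabilistic graphical model) with multiple layers of hidden random variables. It is a network of symmetrically coupled stochastic binary units. It comprises a set of visible units $\boldsymbol{\nu} \in \{0,1\}^{D}$ and layers of hidden units $\boldsymbol{h}^{(1)} \in \{0,1\}^{F_1}, \boldsymbol{h}^{(2)} \in \{0,1\}^{F_2}, \ldots, \boldsymbol{h}^{(L)} \in \{0,1\}^{F_L}$. No connection links units of the same layer (as in an RBM). For a DBM with three hidden layers, the probability assigned to vector $\boldsymbol{\nu}$ is
$$p(\boldsymbol{\nu}) = \frac{1}{Z} \sum_{h} \exp\left(\sum_{ij} W_{ij}^{(1)} \nu_i h_j^{(1)} + \sum_{jl} W_{jl}^{(2)} h_j^{(1)} h_l^{(2)} + \sum_{lm} W_{lm}^{(3)} h_l^{(2)} h_m^{(3)}\right)$$
where $\boldsymbol{h} = \{\boldsymbol{h}^{(1)}, \boldsymbol{h}^{(2)}, \boldsymbol{h}^{(3)}\}$ are the set of hidden units, $Z$ is the normalizing constant, and $\theta = \{\boldsymbol{W}^{(1)}, \boldsymbol{W}^{(2)}, \boldsymbol{W}^{(3)}\}$ are the model parameters, representing visible-hidden and hidden-hidden interactions. In a deep belief network (DBN) only the top two layers form a restricted Boltzmann machine (which is an undirected graphical model), while lower layers form a directed generative model. In a DBM all layers are symmetric and undirected.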
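For a very small network the probability of a visible vector can be evaluated by brute force, summing over every hidden configuration and normalizing over every visible vector. The sketch below assumes a layered model without bias terms, made-up layer sizes and weights, and hypothetical function names; the cost grows exponentially with the number of units.

```python
import itertools
import math


def neg_energy(layers, weights):
    """Sum of lower^T W upper over adjacent layer pairs (no bias terms)."""
    total = 0.0
    for lower, upper, w in zip(layers, layers[1:], weights):
        total += sum(
            lower[i] * w[i][j] * upper[j]
            for i in range(len(lower))
            for j in range(len(upper))
        )
    return total


def hidden_configurations(hidden_sizes):
    """Every joint binary assignment to all hidden layers."""
    per_layer = [list(itertools.product([0, 1], repeat=size)) for size in hidden_sizes]
    return list(itertools.product(*per_layer))


def visible_probability(v, hidden_sizes, weights):
    """p(v) = (1/Z) * sum over hidden configurations of exp(negative energy)."""
    configs = hidden_configurations(hidden_sizes)

    def unnormalized(visible):
        return sum(math.exp(neg_energy([visible, *h], weights)) for h in configs)

    z = sum(unnormalized(u) for u in itertools.product([0, 1], repeat=len(v)))
    return unnormalized(tuple(v)) / z


# Hypothetical DBM: 2 visible units and hidden layers of sizes 2, 2, and 1.
W1 = [[1.0, -0.5], [0.5, 0.5]]
W2 = [[0.8, 0.0], [-0.6, 0.3]]
W3 = [[0.4], [-0.2]]
sizes = [2, 2, 1]
p_11 = visible_probability((1, 1), sizes, [W1, W2, W3])
total = sum(
    visible_probability(u, sizes, [W1, W2, W3])
    for u in itertools.product([0, 1], repeat=2)
)
print(p_11, total)
```

The need to sum over all hidden configurations, and to compute the normalizing constant over all visible vectors as well, is what makes exact evaluation impractical beyond toy sizes.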
Like DBNs, DBMs can learn complex and abstract internal representations of the input in tasks such as object or speech recognition, using limited, labeled data to fine-tune the representations built using a large supply of unlabeled sensory input data. However, unlike DBNs and deep convolutional neural networks, DBMs carry out inference and training in both directions, with bottom-up and top-down passes, which allows the DBM to better unveil the representations of the input structures.
However, the slow speed of DBMs limits their performance and functionality. Because exact maximum likelihood learning is intractable for DBMs, only approximate maximum likelihood learning is possible. Another option is to use mean-field inference to estimate data-dependent expectations and approximate the expected sufficient statistics by using Markov chain Monte Carlo (MCMC). This approximate inference, which must be done for each test input, is about 25 to 50 times slower than a single bottom-up pass in DBMs. This makes joint optimization impractical for large data sets, and restricts the use of DBMs for tasks such as feature representation.
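One common form of mean-field inference replaces each hidden unit by its mean activation and iterates until the means stop changing. The sketch below is an illustration under stated assumptions, not the procedure of any particular paper: it uses two hidden layers, no bias terms, a fixed number of iterations, and made-up weights.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def mean_field(v, w1, w2, iterations=50):
    """Approximate posterior means of two hidden layers given a visible vector.

    w1[i][j] couples visible unit i to first-layer unit j;
    w2[j][l] couples first-layer unit j to second-layer unit l.
    """
    n1, n2 = len(w2), len(w2[0])
    mu1 = [0.5] * n1
    mu2 = [0.5] * n2
    for _ in range(iterations):
        # First hidden layer receives bottom-up input from v and top-down input from mu2.
        mu1 = [
            sigmoid(
                sum(v[i] * w1[i][j] for i in range(len(v)))
                + sum(w2[j][l] * mu2[l] for l in range(n2))
            )
            for j in range(n1)
        ]
        # Second hidden layer receives bottom-up input from mu1 only.
        mu2 = [sigmoid(sum(mu1[j] * w2[j][l] for j in range(n1))) for l in range(n2)]
    return mu1, mu2


# Hypothetical weights for a 2-2-1 network.
W1 = [[1.0, -0.5], [0.5, 0.5]]
W2 = [[0.8], [-0.6]]
mu1, mu2 = mean_field([1, 0], W1, W2)
print(mu1, mu2)
```

Every test input requires its own run of this iteration, in contrast to a single bottom-up pass, which is the extra cost at inference time that the text refers to.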
The need for deep learning with real-valued inputs, as in Gaussian restricted Boltzmann machines (GRBMs), led to the spike-and-slab RBM (ssRBM), which models continuous-valued inputs with strictly binary latent variables. Like basic RBMs and their variants, a spike-and-slab RBM is a bipartite graph, and like GRBMs, its visible units (input) are real-valued. The difference is in the hidden layer, where each hidden unit has a binary spike variable and a real-valued slab variable. A spike is a discrete probability mass at zero, while a slab is a density over a continuous domain; their mixture forms a prior.
An extension of ssRBM called µ-ssRBM provides extra modeling capacity using additional terms in the energy function. One of these terms enables the model to form a conditional distribution of the spike variables by marginalizing out the slab variables given an observation.
The idea of using stochastic weights in Ising models was considered by:
- David Sherrington and Scott Kirkpatrick, Solvable Model of a Spin-Glass, Phys. Rev. Lett. 35, 1792 (1975)
However, in cognitive sciences, it is often thought to have been first described by:
- Geoffrey E. Hinton and Terrence J. Sejnowski, Analyzing Cooperative Computation. In Proceedings of the 5th Annual Congress of the Cognitive Science Society, Rochester, New York, May 1983.
- Geoffrey E. Hinton and Terrence J. Sejnowski, Optimal Perceptual Inference. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 448–453, IEEE Computer Society, Washington, D.C., June 1983.
However, these articles appeared after the seminal publication by John Hopfield, where the connection to physics and statistical mechanics was made in the first place, mentioning spin glasses:
- John J. Hopfield, Neural networks and physical systems with emergent collective computational abilities, Proc. Natl. Acad. Sci. USA, vol. 79 no. 8, pp. 2554–2558, April 1982.
The idea of applying the Ising model with annealing also appears in Douglas Hofstadter's Copycat project:
- Hofstadter, Douglas R., The Copycat Project: An Experiment in Nondeterminism and Creative Analogies. MIT Artificial Intelligence Laboratory Memo No. 755, January 1984.
- Hofstadter, Douglas R., A Non-Deterministic Approach to Analogy, Involving the Ising Model of Ferromagnetism. In E. Caianiello, ed. The Physics of Cognitive Processes. Teaneck, New Jersey: World Scientific, 1987.
Similar ideas (with a change of sign in the energy function) are also found in Paul Smolensky's "Harmony Theory".
The explicit analogy drawn with statistical mechanics in the Boltzmann Machine formulation led to the use of terminology borrowed from physics (e.g., "energy" rather than "harmony"), which has become standard in the field. The widespread adoption of this terminology may have been encouraged by the fact that its use led to the importation of a variety of concepts and methods from statistical mechanics. However, there is no reason to think that the various proposals to use simulated annealing for inference described above were not independent. (Helmholtz made a similar analogy during the dawn of psychophysics.)
Ising models are now considered to be a special case of Markov random fields, which find widespread application in various fields, including linguistics, robotics, computer vision, and artificial intelligence.
- Restricted Boltzmann machine
- Markov Random Field
- Ising Model
- Hopfield network
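- A learning rule that uses conditional "local" information can be derived from the reversed form of the Kullback–Leibler divergence, in which the roles of the data distribution and the model distribution are exchanged.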