# Understanding Mental Categories with Markov Chain Monte Carlo

Categorization is fundamental to human cognition. As we learn and communicate with others, we start carving the world around us into meaningful chunks that we can then make inferences about. Learning basic categories like “food,” “water,” and “friend” is necessary for survival, while higher-order categories like “computer” and “variational auto-encoder” enable the abstractions that underpin the modern world.

Some categories might be clear-cut, like the definitions of mathematical objects, but it is much more common for boundaries to be fuzzy and for some members of a category to feel more representative than others. Sparrows and ostriches are both birds, but the sparrow somehow feels birdier. Asking “Is a hot dog a sandwich?” could lead to a fistfight, but nobody would even think to ask such a question about a ham and cheese. Different degrees of membership or representativeness seems to be a key features of human categorization, and psychologists' theories reflect that.

Psychologists have proposed that we represent categories either as probability distributions over a space of concepts or as non-negative representativeness functions that map from objects to representativeness values (Rosch, 1983; Nosofsky, 1986). If we want, we can think of representativeness functions as un-normalized probability distributions. For instance, here is a contour plot illustrating what a representativeness distribution for birds might look like: Now suppose we want to learn more about the specifics people’s mental categories. We can’t just ask people “What does your representativeness function for sandwiches look like?” because then we’d get confused looks instead of useful data, so we need some way of uncovering this distribution without having an explicit representation of it.

This sounds like a job for Markov chain Monte Carlo! MCMC is sometimes used in cognitive psychology to investigate people’s mental representations of categories. The remainder of this blog post will describe the method, give a few examples of its use, and describe some recent advances in the area.

## The Metropolis-Hastings Algorithm

The Metropolis-Hastings algorithm (MH) is a method that lets us sample from arbitrary probability distributions. Most distributions we see in nature are ugly and have no straightforward way of drawing samples. Some aren’t even normalized! MH is a Markov chain Monte Carlo method, meaning it relies on constructing a Markov chain to generate samples.

A Markov chain is a sequence of random variables, also called states, each of which depends only on the state that came immediately before it. Mathematically, we can write the probability of the variables in a Markov chain taking on values $x_1,x_2,\ldots,x_n$ as:

$$p(x_1, x_2, …, x_n) = p(x_1)p(x_2|x_1)…p(x_n|x_{n-1})$$

In designing a Markov chain, we need to specify the transition function, which defines $p(x_n|x_{n-1})$, the probability distribution of the $n$th element in the sequence given the $n-1$st element. Under some basic assumptions, the variables in the Markov chain eventually converge to a stationary distribution. This means that after many states are generated, the distribution of states will get closer and closer to following some fixed distribution $\pi$. Our goal, then, is to construct a Markov chain that converges to a particular stationary distribution: the distribution we want to sample from. If we do this, we can treat the states of the Markov chain after it has reasonably converged as samples from our distribution.

The Metropolis method is one way of constructing Markov chains that converge to specified distributions. It breaks up transition probabilities into a proposal distribution and an acceptance function. In generating the next state in the Markov chain, we first sample from the proposal distribution, then use the acceptance function to decide whether to accept or reject the proposed state. The acceptance function is usually probabilistic, accepting a state with a certain probability. If it accepts the proposed state, congratulations! That’s the next state of our Markov chain. If it rejects the state, then we need to draw another sample from our proposal distribution, apply the acceptance function to the new state, and repeat until we get a sample that the function accepts. Proposal distributions and acceptance functions tend to be defined with respect to the most recent state in the chain and assign higher probability density to states around it.

Metropolis et al. (1952) proved that any Markov chain whose proposal distribution and acceptance function have the property of detailed balance will converge to the desired stationary distribution. If $q(x';x)$ is the proposal distribution, $a(x';x)$ is the acceptance function, and $\pi(x)$ is the stationary distribution we want the Markov chain to converge to, then a Markov chain obeys detailed balance if

$$\pi(x)q(x';x)a(x';x) = \pi(x')q(x;x')a(x;x')$$

Where $x'$ is a possible next state and $x$ is the current state. Essentially, this condition says that the ratio of the probabilities of going from state x to state x' in the Markov chain should be proportional to the ratio of $\pi(x)$ and $\pi(x')$.

Hastings (1970) (A UofT alumnus and former professor!) showed that if our acceptance function comes from a particular class of functions, our Markov chain is guaranteed to obey detailed balance. One special case of his finding is that if the proposal distribution is symmetric (i.e. $q(x';x) = q(x;x')$ for all $x, x'$) then using the following acceptance function guarantees convergence to the right stationary distribution:

$$a(x'|x) = \frac{\pi(x')}{\pi(x) + \pi(x')}$$

This called the Barker acceptance function and it favours proposals that are more likely in the stationary distribution than the current state (Barker, 1965).

If we just set up a Markov chain using this acceptance function and run it for a while, eventually its states will be samples from $\pi$. That’s super handy!

## MCMC with People

Now that we have a general framework for sampling from mental distributions, we just need to figure out how to design an appropriate acceptance function based on human’s . Luce’s choice rule might be of help! It is a common assumption in mathematical modeling of human behaviour, borne out by previous experiments, which states that people probability match in decision tasks (Luce, 1963). If someone has to choose between two options, $a$ and $b$, their probability of choosing $a$ is

$$\frac{f(a)}{f(a) + f(b)}$$

Where $f$ is a value function that maps an option to the latent value that the person assigns to it. This looks a lot like the Barker acceptance function! In fact, if $f$ is a representativeness function that assigns values to inputs based on how well they represent a given category, then we can use people’s decisions in a pairwise comparison task as an acceptance function. For instance, if we have a current state and a proposed state, with corresponding images, we can show both images to a person and ask “which of these is a better example of a [category]?” If they choose the current item, we reject the sample. If they choose the proposed item, we accept the sample. Repeating this over and over again with a symmetric proposal distribution gives us a distribution over the mental category of apples. Sanborn et al. (2009) did exactly this, using MCMC to get samples of concepts like fruit and fish.

Below, you can see an example of what one of these comparisons looks like, from León-Villagrá et al. (2020): The setup of the state space and proposal distribution vary from study to study. Older work has involved hand-designed state spaces that correspond to the properties of simple images, while more recent work has used similarity data with complex stimuli. Proposal distributions also differ, but we generally seek to balance favouring proposals that are similar to the current state with exploring the state space to find regions of high representativeness.

## Some Interesting Results

From the paper that first introduced this method, here are samples of fruits that people described as resembling an apple, projected from a hand-designed six-dimensional space to two dimensions via multi-dimensional scaling (Sanborn et al. 2009). Note that the distribution consists of two clusters: one for red apples and one for green apples. A more recent paper by Hsu et al. (2019) applies this method with more complex images. The authors gathered 4,000 images of seasons, quantified their similarity using various metrics over images, and applied this method to discover which images are most typical of each season. Below are their most representative images from each season. Each season has distinctive colours and items. Red and falling leaves are most typical of autumn, while green and flowers are most typical of spring. One drawback of this approach is that it can take a long time for Markov chains to converge to the stationary distribution. This means we often need to get study participants to sit in front of a computer clicking on the item that looks more like an apple for about four hours straight. Unsurprisingly, it can be hard to find people willing to do that. León-Villagrá et al. (2020) ran a version of the fruit experiment where instead of having the same participant complete an entire chain, they ran many participants for 15 minutes each, passing off the last state of the chain from one participant to the next until they collectively generated the desired number of samples. They then compared the results from their linked approach against those of the original fruit experiments. They were fairly similar to the non-linked results from the original study, suggesting that people’s mental representations of cartoon apples, oranges, and grapes are quite similar to each other.

We often hear about applications of statistical machine learning in flashy areas like medicine and autonomous vehicles, though the influence of these methods streches much wider. It is a testament to its power and generality that Markov chain Monte Carlo, a method which originated in statistical physics, has informed fields as distant as psychological method design.

Barker, A. A. (1965). Monte Carlo calculations of the radial distribution functions for a proton–electron plasma. Australian Journal of Physics, 18, 119–133.

Hastings, W. K. (1970). Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57, 97–109.

Hsu, A. S., Martin, J. B., Sanborn, A. N., & Griffiths, T. L. (2019). Identifying category representations for complex stimuli using discrete Markov chain Monte Carlo with people. Behavior research methods, 51(4), 1706-1716.

Lakoff, G. (1987). Women, fire, and dangerous things: What categories reveal about the mind. Chicago: University of Chicago Press.

León-Villagrá, P., Otsubo, K., Lucas, C. G., & Buchsbaum, D. (2020). Uncovering Category Representations with Linked MCMC with people.

Luce, R. D. (1963). Detection and recognition. In R. D. Luce, R. R. Bush, & E. Galanter (Eds.), Handbook of mathematical psychology, volume 1(p. 103-190). New York and London: John Wiley and Sons, Inc.

Metropolis, A. W., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., & Teller, E. (1953). Equations of state calculations by fast computing machines. Journal of Chemical Physics, 21, 1087–1092.

Nosofsky, R. M. (1988). Exemplar-based accounts of relations between classification, recognition, and typicality. Journal of Experimental Psychology: Learning, Memory, and Cognition, 14, 700–708.

Rosch, E. (1983). Prototype classification and logical classification: The two systems. New trends in conceptual representation: Challenges to Piaget’s theory, 73-86.

Sanborn, A. N., Griffiths, T. L., & Shiffrin, R. M. (2010). Uncovering mental representations with Markov chain Monte Carlo. Cognitive psychology, 60(2), 63-106.