Elections, Opinion Polling, and Bayesian Statistics

Paddy Alton
13 min read · Jul 1, 2024


Introduction

The year 2024 is a big one for elections — nearly half the world will have cast a ballot by December! Some of these elections will be more free-and-fair than others, but taken together they are immensely consequential.

Given these stakes, everyone wants to answer one question:

What’s going to happen?

How do we even start to answer that? Well, psephology — the science of elections and opinion polling — is a venerable field. I’ve been interested in it for over a decade, seeing it as a perfect combination of my interests in data and in politics.

The key insight is this: in order to get an accurate sense of what the result will be, the number of people you need to ask is surprisingly small.

There’s a catch, though. The people you ask need to be randomly selected from the electorate. Fail at that and it all goes wrong.

The foundational legend of opinion polling was forged in 1936, when George Gallup successfully predicted the result of the American presidential election after asking thousands of people their voting intentions. Famously, his more scientific approach triumphed over the efforts of the Literary Digest magazine, which recorded responses from over two million people … responses that were hopelessly biased.

It hasn’t been a century of glorious victories for the pollsters, though. They’ve had their fair share of mishaps along the way, from calling the 1992 UK election the wrong way to failing to predict Donald Trump’s 2016 victory in the USA. Even Gallup himself made mistakes. Polling is difficult. But it’s a worthy endeavour — to understand what people think … and perhaps even why they think it.

How to predict the future

First up, let’s explain why pollsters only need to ask a few people their views.

In an election there are many voters and few choices they can make. What’s important is which choice will garner the most votes — which choice will attract the highest percentage of the available voters.

Let’s imagine that 20% of voters will vote for the Red Party, 30% for the Blue Party, and 50% for the Yellow Party (I know, I know — exciting, descriptive names). Then, if we choose a voter at random we have a 20% chance of finding a Red voter, a 30% chance of finding a Blue voter, etc.

Let’s try it! A little bit of simulation goes a long way:

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def count_choices(sim: np.ndarray) -> dict[str, int]:
    # map each party to the number of simulated voters who chose it
    choices, counts = np.unique(sim, return_counts=True)
    result = {choice: count for choice, count in zip(choices, counts)}
    return result


def make_simulations(
    poll_size: int,
    parties: list[str],
    truths: list[float],
    n_simulations: int,
) -> pd.DataFrame:
    """Returns a DataFrame containing one row per simulation result."""
    sims = np.random.choice(
        parties,
        size=(n_simulations, poll_size),
        p=truths,
    )
    poll_results = pd.DataFrame(map(count_choices, sims)) * 100 / poll_size
    return poll_results


# poll 1000 voters. Do it 10000 times.
poll_size = 1000
n_sims = 10000

parties = ["R", "B", "Y"]
ground_truth = [0.2, 0.3, 0.5]

poll_results = make_simulations(poll_size, parties, ground_truth, n_sims)

# plot the results
colours = ["#DC241F", "#0087DC", "#FAA61A"]

fig, ax = plt.subplots()

for party, colour, truth in zip(parties, colours, ground_truth):
    res = poll_results[party]
    ax.hist(res, bins=30, color=colour, label=party, alpha=0.4)
    ax.axvline(truth*100, color=colour, linewidth=3)

ax.set_xlim(0, 100)
ax.legend(ncols=3)
ax.set_title("Results of simulated polls")
ax.set_xlabel("%")

plt.show()

And here is what we get:

The vertical bars are the true percentages. The histograms show the results of the 10,000 simulated polls. Thanks to randomness, you can see that asking 1,000 people their view rarely gets you the exact right answer — but it usually gets you pretty close.

Importantly, you can get reliably closer to the truth by asking more people … but there are diminishing returns. Ask four times as many people in each poll and the width of the histogram only halves.
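To see that square-root behaviour directly, here's a quick check reusing the make_simulations helper from above (the poll sizes are chosen arbitrarily, and I've dialled the number of simulations down to keep it fast):

# quadrupling the poll size should roughly halve the spread of the results
for size in (1000, 4000):
    results = make_simulations(size, parties, ground_truth, 1000)
    spread = results["Y"].std()
    print(f"Poll size {size}: spread of the Yellow share ≈ {spread:.2f} points")

# expect roughly 1.6 and 0.8 percentage points respectively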

All of this is highly idealised. The biggest problem with polling is that it’s nigh-on impossible to pick voters at random and get a representative sample. How do you find them? How do you get them to talk to you? How can you be sure they told you the truth?

Managing bias

These difficulties lead to bias.

Remember the Literary Digest story? They polled their readers (and some other easily contactable groups) who were very much not random voters — they were more affluent types who tended to vote a certain way.

Bias is hard to manage, because it creeps in via every real-world data collection method. Whether you’re accosting people on the street, calling landlines at random, building up a mailing list, or asking questions in online advertisements, you’re never going to have a truly representative sample. Over time, though, psephologists (great word) have developed techniques to mitigate it.

But you do know what a representative sample ought to look like. Of your 1000 respondents you should have a certain number of pensioners and students, manual labourers and captains of industry, country dwellers and urbanites.

This suggests a way forward. If you should have — in your sample of 1000 — 50 retired professionals from small villages, but you actually have 25, you can rejig things so their responses count for twice as much. You can do this across all the different combinations of characteristics that strongly correlate with voting behaviour — you can construct a representative sample.
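As a rough sketch of the idea (the group labels, vote splits, and target shares here are entirely made up for illustration), the ‘rejigging’ amounts to weighting each respondent by how under- or over-represented their group is:

import pandas as pd

# hypothetical sample: 25 retired villagers (there should be 50 in a
# representative sample of 1000) and 975 of everyone else
respondents = pd.DataFrame(
    {
        "group": ["retired_villager"] * 25 + ["everyone_else"] * 975,
        "vote": ["B"] * 20 + ["Y"] * 5 + ["B"] * 300 + ["Y"] * 675,
    }
)

# the share of the electorate each group *should* represent ...
target_shares = {"retired_villager": 0.05, "everyone_else": 0.95}

# ... versus the share of the sample it actually represents
sample_shares = respondents["group"].value_counts(normalize=True)

# weight each respondent so their group counts for its true share
respondents["weight"] = respondents["group"].map(
    lambda g: target_shares[g] / sample_shares[g]
)

# weighted vote shares: each response counts in proportion to its weight
weighted_shares = (
    respondents.groupby("vote")["weight"].sum() / respondents["weight"].sum()
)
print(weighted_shares)  # each retired villager now counts twice, as intended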

The tradeoff

There is, as always, a price.

Your final result now depends more on a small subsample, which (because of its small size) could vary quite a lot. This is a classic bias-variance tradeoff: reduce the bias, increase your statistical uncertainty. So you should still try to get as representative a sample as possible in the first place.

There are still some problems. What if your retired-professional-villagers are strange? Maybe most such people don’t have landlines anymore. Maybe the ones that still do have a landline tend to like the Blue Party much more than their peers. There are no easy answers to this — it was among the causes of the 2015 polling miss in the UK. Data collection will always be one of the hardest parts of opinion polling.

For now we’ll leave these issues behind; we’ll assume you collected and/or constructed a representative sample. Now what should you do with it?

Opinion polls: the Bayesian way

It’s a real bugbear of mine that the media often abuse opinion polls.

We saw earlier that there will always be natural variation between the results of polls, even if they are taken at exactly the same time. We’ve seen that this variation can be reduced by collecting a bigger sample. But we also know that this raises costs, has diminishing returns, and there will always be a small amount of bias that can’t be eliminated. In the end it makes more sense to have pollsters run lots of smaller polls using a variety of techniques.

This being so, ideally the media would consider the polls mostly in aggregate, being careful to understand statistical uncertainty and to be cautious about individual results. Instead we frequently see headlines making a great fuss out of a two percentage point week-on-week change, along with attempts to link those changes to whatever news stories have broken in the last few days.

The default assumption should be that this level of variation is pure randomness.
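A quick back-of-the-envelope calculation shows why (assuming an idealised random sample of 1,000 voters):

# rough 95% margin of error for a party polling around 30% in a sample of 1000
p, n = 0.3, 1000
standard_error = (p * (1 - p) / n) ** 0.5
print(f"±{2 * 100 * standard_error:.1f} points")  # about ±2.9 points

A two-point week-on-week wobble sits comfortably inside that margin.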

So what is the proper way to treat a single opinion poll? In a previous article I wrote about the advantages of Bayesian statistics — and in fact I promised to write something about how to use this approach in practice. It took me a while, but here we are!

Let’s go back to our idealised poll, in which 1000 randomly selected voters are asked which parties they intend to vote for. We can treat this as a set of samples drawn from a Multinomial distribution with parameters X, which are to be estimated from the data. A Multinomial distribution is analogous to a weighted die; when we roll the die, there are a number of possible outcomes that occur with different probabilities. X represents these probabilities. It is best written as a vector that sums to one, such as (0.2, 0.3, 0.5).
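(If you want to play with this, numpy can draw from a Multinomial distribution directly; the probabilities below are just our running example.)

import numpy as np

# one simulated poll of 1,000 voters with X = (0.2, 0.3, 0.5)
counts = np.random.multinomial(1000, [0.2, 0.3, 0.5])
print(counts)  # e.g. [207 301 492] - Red, Blue, Yellow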

This distribution represents the ‘ground truth’ — i.e. the result if the election were held in the near future.

Recalling my introductory article, we seek P(X|data), the ‘probability that X has a particular value, given the data’, i.e. given the opinion poll results. The data will be consistent with a range of values of X, so P(X|data) is also a probability distribution — one that tells us which values of X are more or less likely.

Using Bayes’ theorem, we can write this as

P(X|data) = P(data|X) × P(X) / P(data)

Remember that P(X) is the prior probability distribution representing your beliefs about X before you conduct your opinion poll, P(data|X) is the likelihood of observing these particular poll results for a given value of X, and P(data) is just a normalising constant.

How should we describe this probability distribution? Usefully, the Multinomial distribution has a conjugate prior, the Dirichlet distribution. This means there is a simple way to update from P(X) to P(X|data) for every randomly selected voter you poll.

(See the introductory article if you’re unsure what a conjugate prior is.)

We’ll write the Dirichlet distribution as Dir(α), where α is a vector of positive, real numbers that parameterise the distribution. When using the Dirichlet distribution as a conjugate prior, this α vector has a straightforward interpretation: its length is the number of categories (parties), and each element is ‘1 + the number of times we observe the category’ (i.e. the number of voters who say they’ll vote for each party).

Concretely, if we poll 997 people and observe 199 Red, 299 Blue, and 499 Yellow voters, then α = (200, 300, 500). In fact, I am going to write it as α = (200, 300, 500, 1); the final element represents an unobserved ‘Other’ category. I’m doing this to illustrate how our non-observation of another category in nearly 1000 samples constrains the maximum level of support for other parties.
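In code, assembling α is just element-wise arithmetic on the observed counts:

import numpy as np

counts = np.array([199, 299, 499, 0])  # observed R, B, Y and (unobserved) Other voters
alpha = counts + 1                     # '1 + the number of observations' per category
print(alpha)                           # [200 300 500   1]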

Dir(α) is being used to represent a probability distribution over the parameters of the Multinomial distribution, X (the ‘weights of the different sides of the die’). The easiest way to demonstrate this is with another simulation.

Once again we’ll take the ground truth as being known. We’ll simulate a single opinion poll of 1000 people and use the result to define a Dirichlet distribution. This distribution will represent our beliefs about how the wider population would vote:

import matplotlib.pyplot as plt
import numpy as np

from scipy.stats import dirichlet

def count_choices(sim: np.ndarray) -> dict[str, int]:
    # map each party to the number of simulated voters who chose it
    choices, counts = np.unique(sim, return_counts=True)
    result = {choice: count for choice, count in zip(choices, counts)}
    return result

parties = ["R", "B", "Y", "O"]
ground_truth = [0.2, 0.3, 0.5, 0.0]

# let's make this simulation reproducible:
np.random.seed(321)

sim = np.random.choice(parties, size=1000, p=ground_truth)
poll = count_choices(sim)

dist = dirichlet([poll["R"] + 1, poll["B"] + 1, poll["Y"] + 1, 1])

# easiest way to map out the probability distribution function
# is to draw samples from it:
draws = dist.rvs(size=10000)

# plot the samples:
colours = ["#DC241F", "#0087DC", "#FAA61A", "#C68FF3"]

fig, axarr = plt.subplots(ncols=2, nrows=2)

for component, ax in enumerate(axarr.ravel()):

    plot_colour = colours[component]
    truth = ground_truth[component]
    party = parties[component]

    # we'll multiply by 100 to make it a percentage:
    ax.hist(draws[:, component]*100, bins=50, color=plot_colour, alpha=0.7)
    ax.axvline(truth*100, color=plot_colour, linewidth=3)

    ax.set_title(party)
    ax.set_xlabel("%")

plt.show()

The results are gratifying:

The way to read this plot is as follows: the random draws from Dir(α) (the histograms) represent possible values of X. The true value of X (the real level of support for each party) is marked by the vertical lines.

What if this was the real world, where we don’t know the true value? Well, as I’ve said all along, we need to get comfortable with uncertainty and start talking in terms of probability.

We should define some credible intervals. We can generate these as follows:

# ... following previous Python code

for component in range(4):
    party = parties[component]
    quantiles = [0.025, 0.5, 0.975]  # 2.5th, 50th, and 97.5th percentile
    examples = draws[:, component]

    # once again, express as a percentage
    low, med, high = 100 * np.quantile(examples, quantiles)

    print(f"Party {party}: {med:.1f}% ({low:.1f}-{high:.1f})")


# MY RESULTS - median % estimate and 95% credible interval
#
# Party R: 21.1% (18.7-23.7)
# Party B: 28.9% (26.1-31.7)
# Party Y: 49.9% (46.8-52.9)
# Party O: 0.1% (0.0-0.4)

The advantage of this particular method is that for each party you get an estimate and a credible interval where:

  • it’s 50/50 whether the true value is above or below the estimate
  • there’s a 95% chance the true value is within the stated range
  • there’s an equal, outside chance (2.5% on each side) that it’s beyond the stated range

This isn’t the only choice, and it’s important to remember that summary statistics aren’t a true substitute for looking at the distribution. For example, given that the ‘Other’ category has a modal value of zero (that is, the peak in the probability distribution is at zero), you might prefer to do this:

upper_limit = 100 * np.quantile(draws[:, -1], [0.95])[0]

print(f"Other parties: <{upper_limit:.1f}% (with 95% probability)")

# MY RESULT:
# Other parties: <0.3% (with 95% probability)

Finally, with scipy it’s possible to generate the ‘standard’ summary statistics directly from the probability distribution (without drawing random samples):

for component in range(4):
    party = parties[component]
    mean = 100 * dist.mean()[component]  # distribution mean, as a percentage
    variance = dist.var()[component]     # distribution variance (of the vote share)
    std_dev = 100 * variance ** 0.5      # standard deviation, in percentage points

    print(f"({party}) Mean: {mean:.1f}%")
    print(f"({party}) Standard Deviation: {std_dev:.1f} percentage points")

# MY RESULTS
#
# (R) Mean: 21.2%
# (R) Standard Deviation: 1.3 percentage points
# (B) Mean: 28.9%
# (B) Standard Deviation: 1.4 percentage points
# (Y) Mean: 49.9%
# (Y) Standard Deviation: 1.6 percentage points
# (O) Mean: 0.1%
# (O) Standard Deviation: 0.1 percentage points

As you can see, whether you estimate with the mean or the median you end up in the same ballpark.

On the other hand, be cautious about blindly using the standard deviation as a measure of uncertainty! For a Normal distribution, the mean plus-or-minus two standard deviations (a range of four standard deviations) would encompass about 95% of the probability density. And indeed here we see that, for most of the categories, the 95% credible interval is about 4x the standard deviation. However, for the ‘Other’ category the probability distribution is extremely skewed, so a symmetric ‘mean plus-or-minus two standard deviations’ range would be misleading: its lower end dips below zero (impossible for a vote share) and it understates the upper tail.
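Continuing from the sampling code above, a quick comparison makes the point (the exact numbers will depend on your draws):

# compare a naive mean ± 2 standard deviations range with the quantile-based
# credible interval for the skewed 'Other' component
other = draws[:, -1] * 100
naive_low, naive_high = other.mean() - 2 * other.std(), other.mean() + 2 * other.std()
quantile_low, quantile_high = np.quantile(other, [0.025, 0.975])

print(f"mean ± 2 sd: {naive_low:.2f}% to {naive_high:.2f}%")  # the lower bound is negative!
print(f"2.5th-97.5th percentile: {quantile_low:.2f}% to {quantile_high:.2f}%")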

Don’t forget the prior!

So far I’ve breezed past the most important aspect of Bayesian statistics: the fact that you need to specify a ‘prior’, P(X), representing your beliefs about X before you conduct your opinion poll. I did specify one, but until now I haven’t drawn attention to it.

The prior I selected above was a Dirichlet distribution with α = (1, 1, 1, 1). I mentioned that the ‘number of observations plus one’ was to be used to find α, so you may have already guessed that Dir((1, 1, 1, 1)) was the distribution I was using for P(X) — our beliefs about X before any observations have been made.

The special thing about Dir((1, 1, …, 1)) — all elements of α set to 1, for any number of categories — is that it represents a uniform distribution. In other words: “any possible combination of votes shares is equally likely” (provided they sum to 100%, which is a built-in property of the Dirichlet distribution). This is what we call ‘an uninformative prior’.

At this point you might quibble. After all, that seems like a bad choice; probably you’ve observed a bunch of elections and previous opinion polls, and while — sure — public opinion changes over time, there’s no way that it’s just as likely that 100% of voters vote for the Red party as it is to get a less extreme result.

You would have a point. However, it’s important to realise that ~1000 data points is a lot of data. You could pick a different, more informative prior, but it wouldn’t make much difference to the final estimate.
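Don’t take my word for it. Here’s a quick check using the poll counts from earlier; the informative prior below is just an example for comparison:

import numpy as np
from scipy.stats import dirichlet

counts = np.array([199, 299, 499, 0])

# posterior means under an uninformative prior versus an informative one
for name, prior in [("uniform", [1, 1, 1, 1]), ("informative", [21, 31, 51, 1])]:
    posterior = dirichlet(np.array(prior) + counts)
    print(name, np.round(100 * posterior.mean(), 1))

# both print something very close to [20. 30. 50. 0.1]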

On the other hand, if you only had time to poll a small number of voters (say, ten) your choice of prior would become pretty important! I modified my earlier code to simulate this, and here’s what I got with the uninformative prior:

If I were to instead recognise that

  • in polls taken last week, the parties were on roughly 20%, 30%, and 50% (to the nearest ten percent, let’s say)
  • public opinion tends to move only a few percentage points at most in a week
  • nothing spectacular happened last week that might make this a bad assumption

… then I might feel comfortable using a prior like Dir((3, 4, 6, 1)). This is a weakly informative prior.

(Using something like α = (21, 31, 51, 1) would be more strongly informative … but unjustifiably so, given that public opinion changes over time.)

I can use α = (3, 4, 6, 1) like so:

# the modified section of my earlier code:
sim = np.random.choice(parties, size=10, p=ground_truth)
poll = count_choices(sim)

dist = dirichlet([poll["R"] + 3, poll["B"] + 4, poll["Y"] + 6, 1])

If I do that, here’s the result:

I’m sure it’s clear how big an impact specifying an informative prior had here — in fact, in my simulation (I promise I didn’t fiddle it!) we got no observations of the Blue party out of ten.

Since our uninformative prior didn’t encode any information suggesting this was somewhat unlikely, we got a very bad estimate of the Blue party’s support. By contrast, the informative prior at least allows us to distinguish the Blue party from ‘others’, even though we had hardly any data and used an only weakly informative prior.

In short, priors are important. If you aren’t thinking about them, you’re probably doing something unreasonable without realising it. A truckload of data can cover all manner of sins, but when you’re data-poor you can’t afford to ignore the prior.

Are pollsters data-poor?

You might be thinking this is overkill!

I’m not so sure. Yes, pollsters usually collect data from thousands of people … but don’t forget what I said earlier about that bias-variance tradeoff. Real polling results are more like a weighted sum of lots of small polls covering different demographics. If any of these mini-polls are too small they can have an outsized effect on the overall statistical uncertainty. In these scenarios, using a Bayesian approach could be very rewarding.
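One rough way to quantify the effect of weighting is the Kish effective sample size (a standard rule of thumb, not something we’ve derived here). A sketch with deliberately exaggerated, made-up weights:

import numpy as np

# hypothetical: a group that makes up 40% of the electorate but only 10% of the
# sample gets weight 4; everyone else is down-weighted so weights sum to 1000
weights = np.concatenate([np.full(100, 4.0), np.full(900, 600 / 900)])

# Kish effective sample size: how many unweighted respondents this is 'worth'
n_effective = weights.sum() ** 2 / (weights ** 2).sum()
print(f"{len(weights)} weighted respondents ≈ {n_effective:.0f} unweighted ones")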

Wrapping up

I hope you’ve enjoyed this article. I wanted to demonstrate a very practical application of Bayesian statistics — and in this year of many elections it seemed timely to tackle opinion polling.

If this has piqued your interest in psephology, I also have a blog — The Limits of Data — in which I often write about politics and polling. I tend to use Medium to write articles that are more technical; that way everyone can follow only the bits they are interested in!

I’d love to hear from you. What did you think of this article? Is there anything you’re keen to hear more about? Let me know in the comments.
