ELECTORATE 2024

A machine-learning-based prediction model

By Andrew Park (2021-Present)

andrewjparkus@gmail.com 

Our proprietary election model predicts nationwide shifts based on a shift in one or more counties. It draws on gigabytes of election data from 1932-2020 and uses multiple linear regression and machine learning to predict shifts in the other ~3,000 counties.

This is a prototype of the app that users can use for their own predictions. It takes multiple user inputs and can be adapted to any base election from 1932 to 2020. Although its main use is predicting elections, it can also display historical shifts and demographic changes in counties.

The shift shown here was produced using 2020 as the base election year, but it displays 1960s-1980s voting coalitions, with a Democratic Solid South.

What I’m going to be building over the next couple of months is a 2024 primary model. I was originally planning on looking into the 2023 elections or a 2024 congressional model, but the former doesn’t have enough data, and the latter is already heavily covered by predictive models. On top of that, building a primary model is really challenging! (Nate Silver’s spot on here). 

Since I don’t have the foresight to give you an overarching scope of this project, let’s just dive right into what I’m working on!

The first thing I did was read, especially articles from 538, with some statistics sprinkled in (538 copiously cites statistical studies, especially in their older articles). They go into satisfying detail about their methodology, so you should check them out.

Next, I borrowed one of their ideas: CANTOR. CANTOR models the similarity of states to each other based on their demographics, geographic region, and more. As a proof of concept, I created a simple version that uses 31 variables (ranging from home ownership to education levels) and outputs a similarity matrix, rating each state's similarity to every other state.

To do this, we can use the Euclidean distance formula in 31-dimensional space (Descartes is probably rolling in his grave!). Specifically, each variable becomes an axis (x, y, z, …), so the formula looks like this:
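Writing two states as vectors a = (a₁, …, a₃₁) and b = (b₁, …, b₃₁), the standard Euclidean distance is

$$d(a, b) = \sqrt{(a_1 - b_1)^2 + (a_2 - b_2)^2 + \cdots + (a_{31} - b_{31})^2} = \sqrt{\sum_{i=1}^{31} (a_i - b_i)^2}$$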

This outputs a number representing the 31-D distance. Comparing a state to itself gives 0, as expected, but other values range from as low as 1,070 (the smallest I noticed at a glance) to nearly 500,000. That's alright, though, since we can manipulate the data to make it more interpretable.

Using Google Sheets, we can modify the matrix so that each value represents the percentile rank of that distance within the entire dataset. However, a higher percentile means the states are farther apart, so the data's flipped around! To get our similarity score, I just subtracted this value from 1.
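If you'd rather script this than build it in Sheets, a minimal Python sketch of the same pipeline might look like the following (the CSV name, column layout, and choice of pandas/NumPy are placeholders for illustration, not my actual setup):

```python
import numpy as np
import pandas as pd

# Hypothetical input: one row per state (+DC), one column per variable
# (51 rows x 31 columns -- home ownership, education levels, and so on).
df = pd.read_csv("state_variables.csv", index_col="state")
X = df.to_numpy(dtype=float)

# Pairwise 31-D Euclidean distances between every pair of states.
diffs = X[:, None, :] - X[None, :, :]        # shape (51, 51, 31)
dist = np.sqrt((diffs ** 2).sum(axis=-1))    # shape (51, 51)

# Replace each distance with its percentile rank over the whole matrix,
# then flip it so that 1 means "most similar" and 0 means "least similar".
flat = dist.ravel()
percentile = np.array([(flat < d).mean() for d in flat]).reshape(dist.shape)
similarity = 1 - percentile

sim = pd.DataFrame(similarity, index=df.index, columns=df.index)
print(sim["Alabama"].sort_values(ascending=False).head(6))
```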

This strategy is most likely not mathematically sound, and I'd love recommendations and tips on how to fix it or make it more representative. It seems okay, but it isn't rigorous. Anyway, here's the similarity map for Alabama.

Surprisingly, besides Oklahoma, its most similar states actually lie in the Midwest, especially Kentucky, Indiana, and Ohio. Unsurprisingly, states like California and Hawaii have nothing to do with Alabama.

So what is this for? 538's CANTOR was built to fill in the gaps. In an election, we never have polling for all 50 states. Instead, pollsters focus on key swing states or the nation as a whole to see which way the race is going. Unfortunately for us, that leaves states like North Dakota and Idaho out of the running (no hate to them though!). That's where this comes in. If we have polling for a particular state, e.g. Alabama, we can use its similarity to an unpolled state to infer how that unpolled state would vote. Theoretically, the more states we have polling for, the more accurate our predictive model should become (converging to some “true” value?).

For example, let’s say we want to figure out how North Dakota is leaning. 

According to this similarity map, we should infer how North Dakota would vote based on its most similar states, which are Illinois, Maine, Wyoming, Wisconsin, and Pennsylvania. My idea is to weight all 50 states (+DC) on a steeply decaying scale, so that while the most similar states get large weights, the least similar ones have virtually no say.

First, we take the 5th power (arbitrarily chosen, not rigorously justified) of each percent similarity. This keeps values near 1 relatively large while forcing smaller values towards zero. Then we take the sum of these modified similarities and, for each state, divide its modified similarity by that sum to get its weight.
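Here's a minimal sketch of that weighting step, including the top-5 cutoff I apply below (the similarity values are made-up placeholders, not actual output from the similarity matrix):

```python
# Made-up similarity scores for North Dakota vs. a handful of states --
# placeholders to illustrate the weighting, not real model output.
similarity = {
    "Illinois": 0.98, "Maine": 0.96, "Wyoming": 0.95,
    "Wisconsin": 0.93, "Pennsylvania": 0.90, "Delaware": 0.55,
}

# Raise each similarity to the 5th power (an arbitrary choice) so values
# near 1 stay relatively large while smaller ones collapse towards zero.
powered = {state: s ** 5 for state, s in similarity.items()}

# Divide by the sum so the weights add up to 1.
total = sum(powered.values())
weights = {state: p / total for state, p in powered.items()}

# Keep only the top 5 weights and re-normalize them to sum to 100%.
top5 = dict(sorted(weights.items(), key=lambda kv: kv[1], reverse=True)[:5])
top5_sum = sum(top5.values())
top5 = {state: w / top5_sum for state, w in top5.items()}

for state, w in top5.items():
    print(f"{state}: {w:.1%}")
```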

On the real data, this leaves Illinois with a 12.2% weight, Pennsylvania with a 5.2% weight, and Delaware with a 0.6% weight. For simplicity's sake, and to save some time, our example will only use the top 5 weights. Once we re-weight them so they sum to 100%, we get the following weights:

Now, let’s get some test data. Specifically, let’s look at the 2016 Democratic primaries.

(Yellow is Clinton, while green is Sanders. Electoral maps are pretty interesting!)

2016 Results:

Taking the weighted average of these results (with our weights), we expect Clinton to win 41.9% of the vote in North Dakota, while Sanders takes 57.5%. 
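Mechanically, that projection is just a weighted average. A tiny sketch with placeholder numbers (not the real weights or results) looks like this:

```python
# Placeholder top-5 weights and 2016 primary results (Clinton %, Sanders %).
# Illustrative numbers only -- the real calculation uses the weights and
# results tables above.
weights = {"Illinois": 0.35, "Maine": 0.25, "Wyoming": 0.15,
           "Wisconsin": 0.15, "Pennsylvania": 0.10}
results = {"Illinois": (50.0, 49.0), "Maine": (36.0, 64.0),
           "Wyoming": (44.0, 56.0), "Wisconsin": (43.0, 57.0),
           "Pennsylvania": (56.0, 43.0)}

# Weighted average of each candidate's vote share across the five states.
clinton = sum(weights[s] * results[s][0] for s in weights)
sanders = sum(weights[s] * results[s][1] for s in weights)
print(f"North Dakota projection: Clinton {clinton:.1f}%, Sanders {sanders:.1f}%")
```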

In reality, it's not as simple as it looks. North Dakota actually uses a caucus system: instead of a popular vote, each state legislative district elects delegates, who vote of their own accord and are then sent to the state convention, where they nominate delegates to the national convention. Sanders ended up winning 64.2% of these district delegates, while Clinton won 25.6% and 10.2% were uncommitted. So we predicted the winner, but the margins weren't great.

So where does that leave us?

First: Predicting caucuses is hard. Our North Dakota projection was off, and we also used results from two other caucuses, Maine and Wyoming. Since they don't use a popular vote, similarity scores might not fully apply (do those delegates represent the state well?). I haven't figured out a way to fix this, but I'll be consulting 538, as usual.

Second: What other metrics of similarity can we use? Although the 31 variables we used were definitely useful, they leave a lot to be desired. For example, I find it hard to believe that North Dakota, a Great Plains state, has a lot in common with Maine. Some ways I've thought of improving the similarity score are weighting geographic region heavily (e.g. Great Plains states correlate with each other) and including political lean as a variable.

Third: What's next? Probably continuing this state similarity model. My first checkpoint is to create a polling tracker that can “predict” the results of all 50 states in the primaries, instead of just the ones that have been polled. Filling in the gaps is what I've committed to doing first, but building the polling average itself is another hurdle, so plenty of challenges lie ahead.

Unfortunately, it's more likely than not that I won't be able to post every day. I'm just hoping that incremental updates will help me track my own progress and share my work with others. Until next time!