2016 US Presidential Election

US 2016 Presidential Election

Let's see which counties voted which way and why.


During the presidential election campaign, I heard many predictions for how different people will vote. I heard that factors like a person's age, income, race and so on would correlate well with how they would vote. Frustratingly, these different predictions would usually only mention one or two of these possible variables, so I couldn't learn how people thought they compared with each other. I bet that whether or not someone prefers orangutans or gorillas has some predictive power for how that person will vote, but I think one could do better by inspecting something else.

So I want to grab a bunch of demographic data, and the results of the election, and try to work out which information is the most useful in predicting how someone will vote. Later, I'd like to do this for other elections, to see how much of the change in election outcomes can be explained by changes in these demographic features.

Everything I do here is measuring correlations. I don't know how direct the causality is between, for example, a person's income and their vote. It may be that income affects some other variable that I haven't got data on, like their preference to keep laws the same, and this may affect yet another variable, like their opinion on some policy, which eventually affects their vote. On the other hand, it could be that income more directly, causally affects someone's vote, and there are fewer causal links to follow. This analysis will not distinguish cases like these.

As an aside, I don't think causality is a binary property of a pair of predictive and predicted variables. Outside of fundamental physics, few pairs of variables are directly causally linked. A predictive variable can be more or less strongly causally linked to some predicted variable, but to only say one is 'causal' of the other is not specific. I would prefer the saying to be re-phrased to something like 'correlation does not imply any degree of causation', to make clearer that even if any causation is present, this is a matter of degree, not of kind.

Load the labels: the presidential results

Firstly, I want a data-set containing the voting outcomes. This will be the variable I try to predict, often called the 'label'. I can't get the vote per person, but I want the smallest voting unit possible, that I can also find good demographic data for. The best I can find is voting outcomes by county, which Michael W. Kearney made available here.

import pandas as pd
import numpy as np
import requests
from io import BytesIO

results_url = 'https://raw.githubusercontent.com/mkearney/presidential_election_county_results_2016/master/pres.elect16.results.wide.votes.dec9.csv'
results_response = requests.get(results_url)
# This BytesIO class lets me make a byte-string look like a file stream,
# which is what Pandas expects.
dp_raw = pd.read_csv(BytesIO(results_response.content))

A convenience of the US two-party dominance, is that a county's voting outcome can be represented quite well by one number: the difference between the vote share of the Democrat and Republican candidates. This number is the label I will try to predict. More specifically, the label I use is,

$$\text{Democrat advantage} := \left( \text{Democrat votes} - \text{Republican votes} \right)\ / \ \text{Total votes} \; .$$

The analysis and outcome would be the same if the reverse definition, the 'Republican advantage', were used. This number ranges from -1, or -100%, if all votes are for the Republican candidate, to 0, or 0%, if both dominant parties receive equal votes, to 1, or 100%, if all votes are for the Democrat candidate.

# Remove rows that aren't at a per-county level (like per-state and per-country)
dp = dp_raw[dp_raw.county.notnull()].copy()
# Cast the country identifier to an int so we can join to features correctly later.
dp.loc[:, 'fips'] = dp.fips.astype(int)
# Compute the difference in vote share, which will be our target label.
dp['dem_adv'] = (dp['Hillary Clinton'] - dp['Donald Trump']) / dp['total_votes']
# Remove extra information I don't need.
dp = dp.loc[:, ('county', 'fips', 'dem_adv', 'total_votes')]
from bokeh.plotting import figure, show
from bokeh.io import output_notebook
from bokeh.charts import Bar, Scatter, Histogram, Line
from bokeh.palettes import brewer
from bokeh.models import FuncTickFormatter

palette = brewer['Set3'][12]

Let's look at how this number is distributed across counties. All plots here are made using Bokeh, a fantastic plotting library for Python. The plots are interactive, so you can pan and zoom around as you like.

show(Histogram(100 * dp['dem_adv'],
               legend=False, palette=palette,
               ylabel='Number of counties',
               xlabel='(Democrat vote share - Republican vote share) / percent',
               title='Distribution of county vote share'))

It looks like there's a large bias towards the Republican candidate, more than one might expect, knowing that the overall vote was quite close. But this is because people from counties with smaller populations are more likely to vote Republican, so although there are more counties with a Republican majority, they contribute less to the national vote count, to the extent that in fact the total number of Democrat votes was higher. (This wasn't reflected in who was elected President because of how the Electoral college works.)

dtmp = dp.copy()
dtmp['logvotes'] = np.log10(dtmp['total_votes'])
dtmp['dem_adv_pc'] = 100 * dtmp['dem_adv']
p = Scatter(dtmp, x='logvotes', y='dem_adv_pc',
            legend=False, palette=palette,
            title='Approximate correlation of vote share with county population',
            xlabel='Total votes (rough proxy for population)',
            ylabel='Excess democrat vote share / percent',
p.xaxis.formatter = FuncTickFormatter(code='return "10^" + tick')

Load the features: demographic data

Now that I have a target number associated with each county, I need some attributes of these counties, so that I can build a model to predict this target. Even though I won't actually use this model to predict anything, I can inspect the trained model to analyse what features are important in predicting how a county votes.

I got this data from a Kaggle competition page, which aggregates data from the US Census Bureau. If you want to run this notebook yourself, you'll have to get the file, county_facts.csv, from the Kaggle page yourself, because Kaggle requires authentication before you can download the file, so I can't just fetch from a URL.

dcf = pd.read_csv('county_facts.csv')
# Remove data not at the per-county level.
dcf = dcf[dcf.state_abbreviation.notnull()]
# Remove extra information I don't need.
# dcf = dcf.drop('state_abbreviation', axis=1)

The table's columns are named according to some coding scheme, which doesn't make much sense to humans. There's another table where human-readable names for these codes can be looked up, which is also on the Kaggle page, called county_facts_dictionary.csv. These names are actually a bit too verbose to use as column names though, so I renamed the columns myself, to something in the middle: meaningful, but short.

# Rename the columns to make more sense.
dcf = dcf.rename(columns={
    'PST045214': 'people_2014',
    'PST040210': 'people_2010',
    'PST120214': 'people_percent_change',
    'POP010210': 'people_2010_two',
    'AGE135214': 'percent_people_under_5',
    'AGE295214': 'percent_people_under_18',
    'AGE775214': 'percent_people_over_65',
    'SEX255214': 'percent_people_female',
    'RHI125214': 'percent_people_white_only',
    'RHI225214': 'percent_people_black_only',
    'RHI325214': 'percent_people_native_mainland',
    'RHI425214': 'percent_people_asian_only',
    'RHI525214': 'percent_people_native_pacific',
    'RHI625214': 'percent_people_multiple_race',
    'RHI725214': 'percent_people_hispanic',
    'RHI825214': 'percent_people_white_non_hispanic',
    'POP715213': 'percent_people_living_in_same_house',
    'POP645213': 'percent_people_foreign_born',
    'POP815213': 'percent_people_non_english_at_home',
    'EDU635213': 'percent_people_high_school_graduate',
    'EDU685213': 'percent_people_bachelors',
    'VET605213': 'veterans',
    'LFE305213': 'mean_commute_time_minutes',
    'HSG010214': 'housing_units',
    'HSG445213': 'home_ownership_rate',
    'HSG096213': 'percent_housing_units_in_multi_unit_structures',
    'HSG495213': 'median_value_owner_occupied_housing_units',
    'HSD410213': 'households',
    'HSD310213': 'people_per_household',
    'INC910213': 'money_income_per_capita',
    'INC110213': 'median_household_income',
    'PVY020213': 'percent_people_in_poverty',
    'BZA010213': 'private_nonfarm_places',
    'BZA110213': 'private_nonfarm_employment',
    'BZA115213': 'private_nonfarm_employment_percent_change',
    'NES010213': 'nonemployer_places',
    'SBO001207': 'number_firms',
    'SBO315207': 'percent_firms_black_owned',
    'SBO115207': 'percent_firms_native_mainland_owned',
    'SBO215207': 'percent_firms_asian_owned',
    'SBO515207': 'percent_firms_pacific_owned',
    'SBO415207': 'percent_firms_hispanic_owned',
    'SBO015207': 'percent_firms_women_owned',
    'MAN450207': 'manufacturer_shipments_thousands',
    'WTN220207': 'merchant_wholesaler_sales_thousands',
    'RTN130207': 'retail_sales_thousands',
    'RTN131207': 'retail_sales_per_capita',
    'AFN120207': 'accommodation_and_food_services_sales_thousands',
    'BPS030214': 'building_permits',
    'LND110210': 'land_area_square_miles',
    'POP060210': 'people_per_square_mile',

Combine the features with the labels

Now we need to merge the labels and features together. Each county has a unique 'FIPS County Code', that is present in both datasets, that I can use to join them.

# Merge the feature and label data, as an outer join so we can inspect the missing rows.
# An outer join means any rows in either table that don't exist in the other will be kept.
dcfp = dcf.merge(dp, on='fips', how='outer').drop('fips', axis=1)
# Count the counties that are missing in the feature data.
# Count the counties that are missing in the label data.

There are 32 rows in the demographic data that aren't in the voting data. Let's see which,

dcfp[dcfp.county.isnull()].loc[:, ('area_name', 'state_abbreviation')]
area_name state_abbreviation
67 Aleutians East Borough AK
68 Aleutians West Census Area AK
69 Anchorage Municipality AK
70 Bethel Census Area AK
71 Bristol Bay Borough AK
72 Denali Borough AK
73 Dillingham Census Area AK
74 Fairbanks North Star Borough AK
75 Haines Borough AK
76 Hoonah-Angoon Census Area AK
77 Juneau City and Borough AK
78 Kenai Peninsula Borough AK
79 Ketchikan Gateway Borough AK
80 Kodiak Island Borough AK
81 Lake and Peninsula Borough AK
82 Matanuska-Susitna Borough AK
83 Nome Census Area AK
84 North Slope Borough AK
85 Northwest Arctic Borough AK
86 Petersburg Borough AK
87 Prince of Wales-Hyder Census Area AK
88 Sitka City and Borough AK
89 Skagway Municipality AK
90 Southeast Fairbanks Census Area AK
91 Valdez-Cordova Census Area AK
92 Wade Hampton Census Area AK
93 Wrangell City and Borough AK
94 Yakutat City and Borough AK
95 Yukon-Koyukuk Census Area AK
548 Kalawao County HI
2417 Shannon County SD
2916 Bedford city VA

29 of the missing counties are in Alaska, with an extra three from Hawaii, South Dakota and Virginia. These are few enough, about 1% of the counties, that I'm just going to ignore them.

# Ignore the counties that are missing from the label data.
dcfp = dcfp[dcfp.county.notnull()]

As further validation, I'll check that the county names from both sets match (since I joined on the FIPS code, this need not be true).

# Find cases where the county names do not match between the two data-sets.
dcfp[dcfp.area_name != dcfp.county].loc[:, ('area_name', 'county')]
area_name county
1802 Do±a Ana County Dona Ana County

There's a single mismatch, which is just a data entry mistake in the demographics table, so I can safely say that each row's data represents a single geographical area.

# Remove one of the duplicate columns containing the counties' names.
dcfp = dcfp.drop('area_name', axis=1)
dcfp = dcfp.drop(['state_abbreviation', 'total_votes'], axis=1)

Scale the features

One last technical step before training the model, is to scale some of the demographic features so that the model can distinguish different scales easily. For example, county population spans many orders of magnitude,

from bokeh.models import FuncTickFormatter

# Ignore the one county with apparently zero population,
# to stop the logarithm operation breaking.
p = Histogram(np.log10(dcf['people_2014'][dcf['people_2014'] > 0]),
              legend=False, palette=palette,
              ylabel='Number of counties', xlabel='Population',
              title='Distribution of county populations')
# Format the x-axis tick labels as a power of ten.
p.xaxis.formatter = FuncTickFormatter(code='return "10^" + tick')


For the Random Forest model I use here, I don't need to scale features linearly, meaning that numbers that are distributed between zero and one hundred can stay as they are, they don't need to be scaled to lie between zero and one. But numbers that are distributed between one and ten million are typically very unevenly distributed on a linear scale, so putting them on a logarithmic scale makes it easier for the model to distinguish between different magnitudes of values.

For the current data-set, this scaling is appropriate for raw counts of things like people, businesses and land area.

log_cols = (
dcfp_scaled = dcfp.copy()
for log_col in log_cols:
    # Replace each raw feature with a log-scaled equivalent.
    # Add one to each feature, because the numbers are never
    # small enough that this would matter,
    # but there are zeros, presumably for 'missing data', that would otherwise 
    # break the logarithm operation.
    dcfp_scaled.loc[:, '{}_log'.format(log_col)] = np.log10(dcfp[log_col] + 1)
    dcfp_scaled.drop(log_col, axis=1, inplace=True)

Train a model

Finally I can train the model. I use scikit-learn, a popular machine learning library in the Python ecosystem. I've chosen to use a Random Forest regressor, as it allows ranking features by their predictive power, which is the point of this analysis.

A simple linear regression would allow the same analysis, but it could not easily capture interactions between the different features, so it would explain less of the variation between counties, so I would have less confidence in its output. (I tested a linear regression and indeed the explained variance was significantly lower.)

from sklearn.ensemble import RandomForestRegressor

# Get the feature columns: all but the label and county name.
cols = [c for c in dcfp_scaled.columns
        if c not in ('county', 'dem_adv')]

# Shuffle the rows, to ensure the training and test data is similar.
inds = np.arange(len(dcfp))
rng = np.random.RandomState(seed=1)
dcfp_shuff = dcfp_scaled.iloc[inds]

X = dcfp_shuff.loc[:, cols].values
y = dcfp_shuff.loc[:, 'dem_adv'].values

# Split the data into training and test data.
i_split = int(round(4 * X.shape[0] / 5))
X_train, X_test = X[:i_split], X[i_split:]
y_train, y_test = y[:i_split], y[i_split:]

# Specify the model.
regr = RandomForestRegressor(n_estimators=50, n_jobs=3, random_state=rng)

# Train the model.
regr.fit(X_train, y_train)
from IPython.core.display import display, HTML

# This wrapper is because of how I render the notebook to HTML.
# I hide text output by default, so when I *do* want to show it,
# I have to pretend it's HTML output.
display(HTML('<p>Explained variance: {:.0f}%</p>'
             .format(100 * regr.score(X_test, y_test))))

Explained variance: 81%

The trained model can explain around 80% of the variation in voting outcomes between counties. We can now see which features are most important in explaining this variation,

# Get the most important features
importances = pd.DataFrame({'importance': regr.feature_importances_},
importances.sort_values('importance', inplace=True)
percent_people_under_18 0.013466
percent_people_high_school_graduate 0.016210
percent_people_black_only 0.025043
people_per_square_mile 0.029189
building_permits_log 0.030706
median_value_owner_occupied_housing_units 0.050702
percent_people_bachelors 0.053230
percent_people_white_only 0.090893
percent_people_white_non_hispanic 0.222278
percent_housing_units_in_multi_unit_structures 0.255692

The most important feature is the percentage of housing units in multi-unit structures. This presumably measures how urban a county is, since apartments presumably count as multi-unit structures.

The second and third most important features are the percentage of people who are white, and white but non-hispanic, respectively. These are presumably well correlated, so I consider them to measure approximately the same thing.

The fourth is the percentage of people with at least a Bachelor's degree, and the fifth is the median house value.

These outcomes seem sensible to me. As extra validation, let's see a few of the least important features. Hopefully, they will seem intuitively like variables that would not provide much information.

# Get the least important features
percent_firms_pacific_owned 0.000137
percent_firms_native_mainland_owned 0.000981
percent_firms_asian_owned 0.001158
percent_people_native_pacific 0.001189
people_2010_log 0.001676

It seems reasonable to me that two of these features relate to Pacific islander and Hawaiian groups, as these numbers are likely not predictive outside of a few counties where these groups are common.

I am a bit surprised that the county population is in this list, since I showed above that it correlates well with the vote share, but this type of feature is actually present in the dataset in three different forms: two measurements of population in 2010, and another in 2014. It is possible the the other two are adding all of the information that this feature contains, so it gives no extra insight.

Investigate some important features

Finally, let's validate these results by looking directly at how well correlated the most important features are with the vote share.

             'percent_housing_units_in_multi_unit_structures', 'dem_adv',
             xlabel='Share of housing units in multi-unit structures',
             ylabel='Excess democrat vote share',
             title='Correlation of multi-unit accommodation (apartments?) with vote share'))
show(Scatter(dcfp_scaled, 'percent_people_white_non_hispanic', 'dem_adv',
             xlabel='Share of people white and non-hispanic',
             ylabel='Excess democrat vote share',
             title='Correlation of white ethnicity with vote share'))
show(Scatter(dcfp_scaled, 'median_value_owner_occupied_housing_units', 'dem_adv',
             xlabel='Median house value of owner-occupied housing units',
             ylabel='Excess democrat vote share (normalized)',
             title='Correlation of house value with vote share'))

Indeed, it does look like there is some correlation in these quantities, which could help a prediction.

Let's look at some of the least important features, to see if this agrees with our intuition.

show(Scatter(dcfp_scaled, 'percent_firms_pacific_owned', 'dem_adv',
             xlabel='Percentage of firms owned by Pacific islanders',
             ylabel='Excess democrat vote share (normalized)',
             title='Correlation of Pacific islander firm ownership with vote share'))
show(Scatter(dcfp_scaled, 'percent_people_native_pacific', 'dem_adv',
             xlabel='Share of people Pacific islanders',
             ylabel='Excess democrat vote share (normalized)',
             title='Correlation of Pacific island ethnicity with vote share'))

Indeed, for these variables it seems like, except for a few counties, the range of values is low, and not very well correlated.

By the way, if you are thinking at this point, "Boy, I wish the author had clipped the x-axis to exclude those outliers, so I can see the distribution of the main body", then you're thinking in the wrong century: there's a zoom tool on that figure, go do it yourself! I don't want to bias your interpretation by putting my opinions on the axis limits.


I wanted to do this analysis because I was curious about how much changes in demographics like income, age, population density and education could explain changes in voting patterns. For example, I think it's interesting that the outcome of a vote can change as the percentage of older people increases, even if the typical political belief of a person in any given age group stays constant. Although seemingly obvious, this isn't an observation that I see made very often, and I think it might sometimes be able to explain some trends better than more ephemeral and hard-to-measure societal shifts.


Thanks to,

  • Jake Vanderplas for writing the plugin allowing me to host this notebook on my blog
  • Michael W. Kearney for gathering the voting data
  • The US Census Bureau, for gathering and providing demograpic data
  • Toby Searle for providing common sense
  • Find this notebook here