# UK Government spending

Let's investigate what the UK government spends its money on.

## Introduction

At ukpublicspending.co.uk, someone called Christopher Chantrill has compiled an impressively detailed set of data to do with the finances of the UK government. This includes data on revenue, debt, interest payments and spending, in some cases back to 1692.

The data on the website is impressive, but the web design and visualization are ugly. Let's get some of that data, and make some analyses that are more pleasant, and easier to understand.

```python
from enum import Enum
from io import StringIO
import json

import requests
from bs4 import BeautifulSoup
import pandas as pd
```

```python
class UnitType(Enum):
    """Little convenience to remember the URL argument that sets the units."""

    billion = 'b'
    percent_gdp = 'p'
```

```python
def year_to_data(year, unit_type=UnitType.billion):
    url = uk_pub_spend_url_template.format(year, unit_type.value, unit_type.value)
    html_doc = requests.get(url).content
    soup = BeautifulSoup(html_doc, 'html.parser')
    # Get the part of the page with the data as tab-separated values.
    s = soup.find(attrs={'name': 'text1'}).contents[0]

    # Massage the data into a form that can be parsed by Pandas.
    lns = s.split('\n')
    lns = lns[4:15]
    s = '\n'.join(lns)

    dt = pd.read_csv(StringIO(s), sep='\t', thousands=',')
    dt.loc[:, 'Year'] = year
    dt.rename(columns={'Unnamed: 0': 'Sector'}, inplace=True)
    return dt
```

```python
years_of_interest = list(range(1990, 2018))

dts = [year_to_data(year) for year in years_of_interest]
d = pd.concat(dts).loc[:, ('Sector', 'Total', 'Year')]
d.sort_values(['Year', 'Sector'], inplace=True)
```


The reported numbers are not adjusted for inflation, so I used an API at statbureau.org to grab inflation figures; the values I report are all adjusted for inflation.
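The adjustment itself is simple arithmetic: the API effectively reports what £1000 of reference-year money is worth in the target year, so dividing by 1000 gives a per-pound multiplier to apply to each raw total. A minimal sketch, with no network access and made-up figures purely for illustration:

```python
def adjust(raw_total, inflated_value_of_1000):
    """Scale a raw total by the per-pound inflation multiplier.

    `inflated_value_of_1000` is what £1000 of reference-year money is
    worth in the target year (the shape of the API's response).
    """
    return raw_total * (inflated_value_of_1000 / 1000.0)

# Hypothetical figures: if £1000 of reference-year money were worth £600
# in 1990 terms, then £500bn of 1990 spending is £300bn adjusted.
print(adjust(500.0, 600.0))  # 300.0
```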

```python
def year_to_inflation_value(year):
    # Inflation-adjust £1000, to keep a reasonable amount of numerical
    # accuracy, since the values are rounded to the nearest penny.
    factor = 1000.0
    response = requests.get(
        'https://www.statbureau.org/calculate-inflation-value-json',
        params={'country': 'united-kingdom',
                'amount': factor,
                'start': '{}-01-01'.format(years_of_interest[-1]),
                'end': '{}-01-01'.format(year)})
    price_raw = response.content.decode()
    price_raw = price_raw.replace(' ', '').replace('£', '').replace('"', '')
    price = float(price_raw)
    price /= factor
    return price

year_data = pd.DataFrame({'Year': d.Year.unique()})
year_data.loc[:, 'inflation_price'] = year_data.Year.apply(year_to_inflation_value)
d = pd.merge(d, year_data, on='Year')
d.loc[:, 'Total_adj'] = d.Total * d.inflation_price
```


## Spending by sector in 2015

2015 has been and gone, so let's see what got spent. Although it's exciting to see figures from the distant past, I'm going to stick to figures from the past few decades. Since 1983, the data I use comes ultimately from the Public Expenditure Statistical Analyses (PESA), carried out by the UK Treasury. However, the categorization is not the Treasury's own; Christopher has processed the data into the categories shown below.

I'm going to make my charts using Bokeh, a fantastic, new-ish plotting library designed to output interactive visualizations for viewing in the browser. Its high-level 'charts' API is a joy to use, and I thoroughly recommend it. Matplotlib had its day, but I really think Bokeh is now a viable successor. I haven't tried as hard as I might otherwise to set sensible axis limits, legend locations and font sizes, because the interactivity makes this less of a burden: just pan and zoom about!

I think this interactivity becoming the norm would be good not only for usability, but also for openness and reproducibility: from my time in academia, I know of many occasions when, even if only half-consciously, people framed their plots to better support their position.

```python
from bokeh.plotting import figure, show
from bokeh.io import output_notebook
from bokeh.charts import Bar, Scatter, defaults, Line
from bokeh.charts.attributes import CatAttr
from bokeh.palettes import brewer

# Bokeh comes bundled with [colorbrewer](colorbrewer2.org) palettes, which I like.
palette = brewer['Set3'][12]
# We can output interactive plots right in the Jupyter notebook.
output_notebook(hide_banner=True)
```


Anyway, here's the breakdown,

```python
label_tot_adj = 'Spending / inflation-adjusted billions of pounds'
label_tot = 'Spending / raw billions of pounds'
label_tot_gdp = 'Spending / percentage of year\'s GDP'
label = CatAttr(df=d, columns='Sector', sort=False)

show(Bar(d[d.Year == 2015].sort_values('Total_adj', ascending=False),
         label=label, values='Total_adj', legend=False, palette=palette,
         ylabel=label_tot_adj, title='UK government spending by sector in 2015'))
```


I was surprised by how much is spent on pensions; health-care was less of a surprise. That 'other spending' is a bit frustrating; I'll come back to it later. Let's see how this spending has changed over time.

## Spending trends, by sector

```python
show(Line(d, x='Year', y='Total_adj', color='Sector',
          legend='top_left', palette=palette,
          ylabel=label_tot_adj, title='Spending trends, by sector'))
```


Looks like spending on pensions is increasing inexorably. The sectors that seem to show short-term effects, whether from government policy or world events, are welfare, education and interest payments.

However, it's hard to compare these raw numbers over time when the total is also changing, even when adjusting for inflation. There are two effects shown here: total spending, and how it is distributed between sectors; it would be nice to decouple them. Let's get total spending out of the way; then we can feel comfortable normalizing by it to show the distribution.
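To illustrate the decoupling idea on a toy table (with hypothetical numbers), each sector's share of its year's total can be computed with a grouped `transform`:

```python
import pandas as pd

# Two made-up years, two made-up sectors.
toy = pd.DataFrame({'Year': [2000, 2000, 2001, 2001],
                    'Sector': ['A', 'B', 'A', 'B'],
                    'Total_adj': [10.0, 30.0, 20.0, 20.0]})
# Divide each row by its year's total spending to get a percentage share.
toy['pct'] = 100 * toy.Total_adj / toy.groupby('Year').Total_adj.transform('sum')
print(toy.pct.tolist())  # [25.0, 75.0, 50.0, 50.0]
```

Sector A's share rises from 25% to 50% even though both years' totals differ, which is exactly the distribution-versus-total separation we want.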

```python
d_tot = d.groupby('Year').agg({'Total_adj': sum, 'Total': sum}).rename(
    columns={'Total_adj': 'Inflation-adjusted', 'Total': 'Raw'})
d_tot = pd.merge(d_tot, year_data, left_index=True, right_on='Year')
show(Line(d_tot, x='Year', y=['Inflation-adjusted', 'Raw'],
          legend='top_left', palette=palette, ylabel='Spending / billions of pounds',
          title='Trend in spending'))
```


On the adjusted line there is a sharp change from 2011 to 2012. Even ignoring inflation, there is a clear change in the raw total in that year, and inspecting the inflation factor does not show extreme changes in inflation (until around 2014, which looks to my eye like a distinct story).

```python
show(Line(d_tot, x='Year', y='inflation_price',
          palette=palette,
          title='Effective value of one 2016 pound sterling in other years',
          ylabel='Pounds sterling'))
```


I'm interested in what was cut to bring about this reduction in spending from 2011 to 2012, so let's look at the change year-to-year in each sector.

```python
def sector_diff(group):
    group = group.sort_values('Year')
    pct_diffs = 100 * group.Total_adj.pct_change()
    abs_diffs = group.Total_adj.diff()
    new_group = group.merge(pct_diffs.to_frame('Total_adj_pct_change'),
                            left_index=True, right_index=True)
    new_group = new_group.merge(abs_diffs.to_frame('Total_adj_abs_change'),
                                left_index=True, right_index=True)
    return new_group

d_with_diffs = d.groupby(d.Sector).apply(sector_diff)

d_2012 = d_with_diffs[d_with_diffs.Year == 2012].set_index('Sector')
d_2012 = d_2012.rename(
    columns={'Total_adj': 'Spending in 2012 / billions of pounds',
             'Total_adj_pct_change': 'Percentage change from 2011',
             'Total_adj_abs_change': 'Absolute change from 2011 / billions of pounds'})
d_2012[['Spending in 2012 / billions of pounds',
        'Absolute change from 2011 / billions of pounds',
        'Percentage change from 2011']].astype(int)
```

| Sector | Spending in 2012 / billions of pounds | Absolute change from 2011 / billions of pounds | Percentage change from 2011 |
|---|---:|---:|---:|
| Education | 92 | -9 | -9 |
| Other Spending | 112 | -6 | -5 |
| Health Care | 129 | -3 | -2 |
| Transport | 20 | -3 | -13 |
| Defence | 46 | -2 | -5 |
| Protection | 33 | -2 | -7 |
| General Government | 14 | -1 | -10 |
| Welfare | 125 | 1 | 0 |
| Interest | 53 | 1 | 2 |
| Pensions | 137 | 2 | 1 |

It seems that the biggest absolute decreases in spending were in Education and (frustratingly) 'Other spending'. From the sectors' own perspectives, the hardest hit were Transport, 'General Government' and Education.

Now let's look at how the government has split up its spending, while accounting for the fact that absolute spending amounts have changed. This might highlight government priorities.

```python
def sector_pct(year_group):
    year_pct = (100 * year_group.Total_adj / year_group.Total_adj.sum()).to_frame(
        'Total_adj_pct')
    new_group = year_group.merge(year_pct, left_index=True, right_index=True)
    return new_group

d_with_pcts = d.groupby(d.Year).apply(sector_pct)

show(Line(d_with_pcts, x='Year', y='Total_adj_pct', color='Sector',
          legend='bottom_left', palette=palette,
          ylabel='Percentage of year\'s total spending',
          title='Distribution of spending by sector'))
```


From this perspective, it's possible to see that defence has been in long-term decline as a priority; it's also interesting to note that interest payments were becoming gradually less of an issue until 2003.

The winners in terms of government priority appear to be health-care, along with 'Other spending'.

## Accounting for the size of the economy

Along with inflation, there is another adjustment that should be made to these numbers. Government spending will naturally scale with the size of the economy. In one obvious sense, spending should be expected to increase if the country's population increases; in perhaps a slightly less obvious sense, spending should also be expected to increase if the typical person is more productive: some of their extra money will be worth spending on government services like health-care.

Both of these are captured (albeit imperfectly) by the country's Gross Domestic Product (GDP), which roughly scales with the number and productivity of people in the country. So let's revisit some of these charts, but showing the fraction of the country's product that is spent. Within a given year, the GDP is just a constant number, like £1 billion; the analysis is only changed when comparing different years.

As a silly example to illustrate that this accounts for something beyond inflation: if an economy were to double in size in one year, prices need not also double; however, we would still expect government spending to increase significantly simply because the economy has grown, and we want to normalize for this fact.
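The arithmetic of that silly example, with entirely made-up numbers: if spending doubles along with GDP, the GDP share is flat even though the raw total has doubled.

```python
# Hypothetical figures, in billions: the economy doubles in one year,
# and spending doubles with it.
spend_2010, gdp_2010 = 700.0, 1800.0
spend_2011, gdp_2011 = 1400.0, 3600.0

# Spending as a percentage of that year's GDP.
share_2010 = 100 * spend_2010 / gdp_2010
share_2011 = 100 * spend_2011 / gdp_2011
print(round(share_2010, 2), round(share_2011, 2))  # 38.89 38.89
```

The raw totals tell a story of dramatic growth; the GDP-normalized shares tell a story of no change at all.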

```python
dts_gdp = [year_to_data(year, unit_type=UnitType.percent_gdp)
           for year in years_of_interest]
d_gdp = pd.concat(dts_gdp).loc[:, ('Sector', 'Total', 'Year')]
d = pd.merge(d, d_gdp, on=['Year', 'Sector'], suffixes=('', '_gdp'))

show(Line(d, x='Year', y='Total_gdp', color='Sector',
          legend='top_left', palette=palette,
          ylabel=label_tot_gdp,
          title='Spending by sector, normalized by GDP'))
```


This shows much the same story as the distribution of spending above, apart from a few interesting details since 2010, where increasing GDP has made some apparent increases less marked. The increase in spending on pensions looks much less dramatic, and spending on health-care is in fact decreasing in this representation.

```python
d_tot_gdp = d.groupby('Year').Total_gdp.sum()
show(Line(d_tot_gdp, legend=False, ylabel=label_tot_gdp,
          title='Trend in spending, normalized by GDP',
          palette=palette))
```


In this representation, the story of total government spending changes completely: the increase in inflation-adjusted spending since 2014 looks to be a result of increasing GDP, and spending as a fraction of GDP has in fact been decreasing continually since 2011.

## Getting more granular data: to the source

As good as Christopher's data is, I'd like to minimise the number of steps between the data source and my analysis. The original data is kindly provided categorised according to the 'Classification of the Functions of Government' (COFOG), a system designed by the United Nations (UN). Sticking to this system, unaware as it may be of the exact makeup of the British government, will allow me to compare with other countries in future, and is robust to changes in departmental structure.
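For reference, COFOG defines ten top-level 'divisions' (the functions referred to below); a lookup table like this, transcribed from the UN classification, is handy for labelling:

```python
# The ten top-level COFOG divisions, as defined by the UN classification.
COFOG_DIVISIONS = {
    1: 'General public services',
    2: 'Defence',
    3: 'Public order and safety',
    4: 'Economic affairs',
    5: 'Environmental protection',
    6: 'Housing and community amenities',
    7: 'Health',
    8: 'Recreation, culture and religion',
    9: 'Education',
    10: 'Social protection',
}
print(len(COFOG_DIVISIONS))  # 10
```

Each division is further split into the numbered 'groups' (sub-functions, like 7.3) that PESA reports against.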

So let's grab the Excel table from PESA, and parse it into a format suitable for analysis. The code below is very hacky and not at all beautiful, because the table itself does not communicate well with machines.

```python
from io import BytesIO

pesa_response = requests.get(pesa_url)
# This BytesIO class lets me make a byte-string look like a file stream,
# which is what Pandas expects.
d_pesa = pd.read_excel(BytesIO(pesa_response.content),
                       sheetname='5.2', index_col=None, skiprows=4)

# Remove pointless empty first column.
d_pesa = d_pesa.drop('Unnamed: 0', axis=1)
# Drop a silly row beneath the year indicator.
d_pesa = d_pesa.drop(0)
# Give the Section column a meaningful name.
d_pesa = d_pesa.rename(columns={'Unnamed: 1': 'Section'})
# Rename year columns from 'start-end' to just the starting year.
for c in d_pesa.columns[1:]:
    d_pesa.rename(columns={c: int(c.split('-')[0])}, inplace=True)
```

```python
import re

def strip_notes(s):
    # Strip trailing footnote markers like ' [1]' from a name.
    m = re.search(r' \[[0-9]+\]', s)
    if m is not None:
        s = s[:m.start()]
    return s

# Parse out function-level names and numbers from section dividers.
is_function_head = ((d_pesa.Section.str.contains(r'^[0-9]+\. ') == True) &
                    ~(d_pesa.Section.str.contains('continued') == True))

def get_first(x):
    function_name = ' '.join(x[1:])
    function_name = strip_notes(function_name)
    return pd.Series({'function_nr': int(x[0][:-1]),
                      'function_name': function_name})

function_names = d_pesa.Section[is_function_head].str.split(' ').apply(get_first)
```

```python
# Unpivot each year as one column, into one column indicating the year.
d_pesa_melt = pd.melt(d_pesa, id_vars=['Section'],
                      var_name='Year', value_name='Amount')
d_pesa_melt.loc[:, 'Year'] = d_pesa_melt.Year.astype(int)

# This is a bag of poo because of the government's poo data.
# The health sub-function rows are not labelled with COFOG categories,
# so match them with these, insensitive of case.
cofog_map = {
    'medical services': [7, 3],
    'health services': [7, 4],
    'medical research': [7, 5],
}

# Get sub-function level amount rows, where the actual amounts are.
is_sub_function = d_pesa_melt.Section.str.contains(r'^[0-9]+\.[0-9]+ ') == True
# Account for bad health rows.
for k in cofog_map:
    is_sub_function |= d_pesa_melt.Section.str.contains(k, case=False) == True
# Copy, so that the column assignments below don't warn about a chained slice.
d_pesa_body = d_pesa_melt.loc[is_sub_function].copy()
```

```python
def get_cofog_entry(s):
    for k in cofog_map:
        if k in s.lower():
            return cofog_map[k]

# Parse the function and sub-function numbers from the 'x.y ' section start,
# and the sub-function name from the remainder.
def section_to_function_nr(s):
    cofog_entry = get_cofog_entry(s)
    if cofog_entry is not None:
        return cofog_entry[0]
    else:
        return int(s.split('.')[0])

def section_to_sub_function_nr(s):
    cofog_entry = get_cofog_entry(s)
    if cofog_entry is not None:
        return cofog_entry[1]
    else:
        s_cofog = s.split(' ')[0]
        return int(s_cofog.split('.')[1])

def section_to_sub_function_name(s):
    cofog_entry = get_cofog_entry(s)
    if cofog_entry is not None:
        return s
    else:
        name = ' '.join(s.split(' ')[1:])
        name = strip_notes(name)
        return name

d_pesa_body.loc[:, 'function_nr'] = d_pesa_body.Section.apply(section_to_function_nr)
d_pesa_body.loc[:, 'sub_function_nr'] = d_pesa_body.Section.apply(section_to_sub_function_nr)
d_pesa_body.loc[:, 'sub_function_name'] = d_pesa_body.Section.apply(section_to_sub_function_name)

# Drop the Section column, as its information is contained in other columns,
# and merge in the function name information from earlier.
d_pesa_body = d_pesa_body.drop('Section', axis=1).merge(function_names,
                                                        on='function_nr')

# Replace '-' entries for zero with actual zero values, and then convert to float.
# I'd like to do this earlier, but only now can we trust the rows are fairly sane.
d_pesa_body.loc[d_pesa_body.Amount == '-', 'Amount'] = 0.0
d_pesa_body.Amount = d_pesa_body.Amount.astype(float)
# Convert to billions, because that is enough detail for what we want.
d_pesa_body.loc[:, 'Amount'] /= 1000.0

# Make later referencing easier.
dp = d_pesa_body
```


To give you an idea of the COFOG top-level functions, here's a breakdown of 2015.

```python
label_tot_alt = 'Billions of pounds'

show(Bar(dp[dp.Year == 2015], label='function_name',
         values='Amount', agg='sum', legend=False, palette=palette,
         ylabel=label_tot_alt, xlabel='Government function'))
```


We can delve into so-called 'sub-functions'. Here are a few breakdowns. Be careful to note the axis limits: they aren't kept the same, as there are big variations in total spending, so remember that some functions, such as 'Recreation, culture and religion', amount to comparatively little spending.

```python
interesting_subs = [1, 3, 6, 8, 9, 10]
for function_nr in interesting_subs:
    function_name = dp[dp.function_nr == function_nr].function_name.iloc[0]
    show(Bar(dp[(dp.Year == 2015) & (dp.function_nr == function_nr)],
             label='sub_function_name', values='Amount',
             legend=False, title=function_name + ' breakdown', palette=palette,
             xlabel='Government sub-function', ylabel=label_tot_alt))
```


Within the sub-functions, we can see which are most important. Here, you can see the sub-functions which together make up 95% of total spending.

```python
dp_2015 = dp[dp.Year == 2015].sort_values('Amount', ascending=False)
frac_cum_sum = dp_2015.Amount.cumsum() / dp_2015.Amount.sum()
dp_big = dp_2015.loc[frac_cum_sum[frac_cum_sum < 0.95].index].copy()
dp_big['full_name'] = dp_big.function_name + ' -- ' + dp_big.sub_function_name
label = CatAttr(df=dp_big, columns='full_name', sort=False)

show(Bar(dp_big.sort_values('Amount', ascending=False), label=label, values='Amount',
         legend=False, palette=palette,
         title='Government functions accounting for 95% of all spending',
         xlabel='', ylabel=label_tot_alt))
```


That's all the analysis I want to show for now. I hope to delve into other aspects at a later date.

## Acknowledgments

Thanks to:

• Jake Vanderplas, for writing the plugin allowing me to host this notebook on my blog
• Christopher Chantrill, for his curation of the data
• The Bureau of Statistics, for providing inflation data and a nice API
• The UK Treasury, for providing government spending data in a reasonably accessible form
• Find this notebook here