This confidential government report proves that we need to liberate the Postcode Address File
This is the Pentagon Papers for the UK address database
It turns out that back in 2016, the British government attempted to reconstruct the Ship of Theseus.
Or if you prefer, it attempted to reconstruct Trigger’s Broom.
The problem it faced was that a few years prior, in 2013, it did something rather silly. When Royal Mail was privatised, it did not just put the responsibility for postal deliveries into private hands, but the ownership and management of the UK Postcode Address File (PAF) too.
The PAF is a critically important national dataset. It’s not personal data – it’s a database that lists literally every physical postal address in the country, and it’s a very useful thing to have access to.
For example, it means that shopping sites can autocomplete your address on the checkout page when given just a postcode, and more importantly, researchers and developers can use it to learn more about our world, and build the tools, services and platforms of tomorrow.
Long-term readers will know that I might have mentioned it before a few times. The problem is it’s not an easy dataset to get hold of, as it costs a lot of money. This is because the data has to be licensed from Royal Mail – so it could cost you upwards of £6000 per year if you want to use address data inside, say, a website or app.
I’ve said before that this is crazy. It’s tantamount to a tax on innovation. It’s a major barrier facing bedroom coders and small businesses that makes it harder to build the next big thing. And for the government, it’s a roadblock to economic growth.
That’s why in my view, the PAF should be released for free as open data, so that anyone who wants to can build cool stuff with accurate address data.
And I’m not the only one who thinks like this.
What makes the PAF situation maddening is that politicians already know the status quo isn’t working.
For example, in George Osborne’s last budget in 2016, he rustled up £5m to “develop options for an authoritative address register that is open and freely available,” and argued that “making wider use of more precise address data and ensuring it is frequently updated will unlock opportunities for innovation”.
But despite this commitment, eight years later, the PAF remains locked away, in private hands, behind a prohibitively expensive paywall.
So, I know what you’re thinking: What does the Ship of Theseus have to do with this?
Though it’s not clear exactly what the entirety of that £5m was spent on, thanks to the Freedom of Information Act1, I’ve managed to get my hands on the results of a key research paper that it paid for.
What it shows is a study where the (government owned) Ordnance Survey (OS) was instructed to attempt to recreate the PAF dataset – but crucially without using any intellectual property owned by Royal Mail.
In other words, the OS was asked to build a new UK Address database from scratch, so that it could conceivably be freely released as open data, with no licensing fees attached, for anyone to use.
It sounds like a great idea then, except for one thing: The result was, well, pretty disastrous.
As far as I can tell, until now the full report has never been published. And that’s probably because it illustrates beyond reasonable doubt why there is no other way around it. If Britain wants to unlock those opportunities for growth and innovation, then the only option is to take back control of the Postcode Address File and release it as open data.
Now let’s dig into it and I’ll explain why.
If you like ultra-nerdy politics and policy stuff, then you will like my newsletter. Subscribe (for free!) to get more of This Sort Of Thing directly in your inbox. I swear not all of it is about postcodes.
I hope you like acronyms
So how do you build an address database without using expensive Royal Mail data? To start with, OS went back to one of the sources that Royal Mail uses to feed into the PAF, called the National Address Gazetteer (NAG)2.
The NAG is a different database of addresses that is compiled using data supplied by local authorities, government departments and the Ordnance Survey itself. When something new is built locally, your council will submit changes to the NAG.
And by “something” I really do mean “something” – as unlike the PAF, which just contains building addresses, the NAG contains pretty much everything, including the addresses of utilities infrastructure (like electric substations), bus stops and parks.
Then to make matters even trickier, because the NAG is compiled by hundreds of different organisations sending in updates in a slightly freewheeling way, the data inside is also inconsistent and not standardised like it is in the PAF.
In other words, unlike the PAF, the NAG is messy. It’s the raw ore that comes out of the mine, where the useful minerals are mixed in with other rocks and dirt. What Royal Mail does with the PAF is basically refine it into solid gold bars of useful data.
So the first thing the OS team needed to do for their new database was a similar process. They had to figure out how to remove all of the unnecessary stuff in the NAG. And the way they did this first pass was by applying an algorithm to figure out which entries were useful buildings and which were other street junk.
To do this, they compared the NAG data with the graphic “Master Map” that the OS maintains of the entire country3.
This filtering was more technically complex than it sounds. For example, if I’m understanding the report correctly, it appears that the OS literally analysed the vectors – the shapes of different objects on its map – to try and work out which entries in the NAG corresponded to buildings and individual addresses, and which were just, say, telegraph poles and postboxes.
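To make that a bit more concrete, here is a minimal sketch of one way a shape-based filter could work. To be clear, this is my illustration and not the OS's actual method: it assumes each NAG entry has a coordinate and each building on the map is a simple polygon, and it labels an entry as a building if its point falls inside any footprint. The function names and the coordinates are made up.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: is the point (x, y) inside the polygon?

    `polygon` is a list of (x, y) vertices in order.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through y?
        if (y1 > y) != (y2 > y):
            # x-coordinate where the edge crosses that line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def classify_entries(entries, footprints):
    """Label each entry 'building' if its point sits inside any footprint."""
    labels = {}
    for entry_id, (x, y) in entries.items():
        hit = any(point_in_polygon(x, y, shape) for shape in footprints)
        labels[entry_id] = "building" if hit else "other"
    return labels


# Made-up coordinates, purely for illustration
footprints = [[(0, 0), (10, 0), (10, 10), (0, 10)]]
entries = {"house": (5, 5), "postbox": (15, 5)}
print(classify_entries(entries, footprints))
# {'house': 'building', 'postbox': 'other'}
```

A real pipeline would need far more than this (multiple addresses per building, entries sitting on a boundary, and so on), which is rather the point of what follows.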
Needless to say, this was quite a fuzzy matching process. In some cases, OS was forced to intuit addresses it couldn’t identify for certain. For example, it assumed that if it saw a group of three houses on a street, with number 10 and number 14 on either side, that number 12 sits between them.
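The number 12 guess can be sketched in a few lines. This is a hypothetical illustration rather than anything taken from the report: it assumes houses are listed in order along one side of a street, that unknown numbers are marked as `None`, and that numbering on one side goes up or down in steps of two.

```python
def fill_house_numbers(row):
    """Fill in unknown house numbers (None) between two known neighbours.

    Only guesses when the known numbers on either side are consistent
    with the usual step of 2 along one side of a street.
    """
    filled = list(row)
    known = [i for i, n in enumerate(row) if n is not None]
    for left, right in zip(known, known[1:]):
        gap = right - left  # number of steps between the two known houses
        step, remainder = divmod(row[right] - row[left], gap)
        if remainder == 0 and abs(step) == 2:
            for k in range(1, gap):
                filled[left + k] = row[left] + k * step
    return filled


print(fill_house_numbers([10, None, 14]))        # [10, 12, 14]
print(fill_house_numbers([10, None, None, 14]))  # unchanged: the gap doesn't fit
```

Note that the second case is left alone: if there are two unlabelled houses between 10 and 14, the simple rule has no sensible answer, which is exactly the sort of record that ends up needing a human.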
And then there were some cases where the algorithm couldn’t work out what was in a given location at all. In those cases, the OS first attempted to manually fix the database by having a human eyeball the map and the associated data, and in some cases, the OS even sent staffers out to survey buildings in person.
The result of all of this hard work is pretty impressive: Over the course of the pilot, OS managed to assemble a dataset of over 21 million fully complete address records – with a further ten million partially complete. In theory, that makes the new dataset roughly the same size as the PAF. So could it be that the government discovered a way to replicate the PAF without paying Royal Mail a penny?
There was just one problem. The data was crap.
Garbage in, garbage out
To test how good the new database was, OS compared its creation to another address database – called AddressBase. This is another, separate database that OS owns, and it was built in a similar way, using data from NAG and other sources.
The crucial difference, however, is that AddressBase, which OS sells to other companies and organisations who want to mash up address data, also pays money to Royal Mail to license the PAF, to include its data as part of the mix4.
I know, bloody hell these acronyms are confusing, so here’s the worst diagram you’ve ever seen to explain the relationship between all of these datasets.
So the critical question then is… can the new database be anywhere near as accurate without the PAF data?
According to the OS’s own assessment… No.
When compared to AddressBase, the new dataset was only 90.8% accurate, which doesn’t sound too bad until you realise that translates to 2.9 million dud addresses – and that if this dataset was used, there would be 4.2 million addresses that are either missing from the new dataset, or would be rogue entries within it.
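If you want to sanity-check those numbers, here is a rough sketch. The comparison function shows what "missing" and "rogue" mean when you line two datasets up against each other, using made-up identifiers. The arithmetic underneath assumes the total is roughly the 21 million complete plus ten million partial records mentioned earlier; the report may well have used a different denominator, so treat the total as my assumption.

```python
def compare_datasets(reference, candidate):
    """Return (missing, rogue) relative to a trusted reference dataset."""
    reference, candidate = set(reference), set(candidate)
    missing = reference - candidate  # in the reference, absent from the new data
    rogue = candidate - reference    # in the new data, unknown to the reference
    return missing, rogue


# Toy example with made-up identifiers
missing, rogue = compare_datasets({"A", "B", "C", "D"}, {"A", "B", "E"})
print(sorted(missing), sorted(rogue))  # ['C', 'D'] ['E']

# The article's figures
accuracy = 0.908
total_records = 21_000_000 + 10_000_000  # assumption: complete + partial records
dud = (1 - accuracy) * total_records
missing_or_rogue = 4_200_000
share = missing_or_rogue / total_records

print(f"Dud addresses: about {dud / 1e6:.1f} million")  # about 2.9 million
print(f"Missing or rogue: {share:.1%} of records")       # a bit over 1 in 10
```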
That… is not very accurate.
What the OS found was that it was pretty easy to recreate address data where circumstances were relatively simple – when one house only had one address attached, for example. And apparently separating out the irrelevant stuff, like the utilities infrastructure, was actually pretty straightforward too.
The problems instead emerged when addresses became slightly more complicated. For example, when there were buildings with multiple addresses inside – such as a block of flats, where every resident has a distinct address.
And one particularly tricky example was large shopping centres and airports, which on OS Maps are just recorded as one blob with one address – but which in reality contain hundreds of individual shops and businesses, each with their own address.
For example, here’s how the OS Map sees Westfield shopping centre in East London – despite it containing dozens of shops and it feeling like a circle of hell on a Saturday afternoon with thousands of people inside, as far as the new database was concerned, the enormous building had just one singular postbox.
There were also other smaller problems matching address data, such as on houses that are accessed from a private shared drive, and addresses where the house only has a name, and not a number5.
The upshot of the research, then, is that building an accurate database is really hard. OS concludes that it would have to check the 4.2m bad addresses manually to make its PAF-less database a viable dataset that would actually be useful. As the document puts it:
“Without this, confidence in the product would be seriously undermined by over 1 in 10 addresses being missing / erroneous, affecting usability, especially by organisations such as emergency services and other customers where accuracy is key.”
However, even if the government still wanted to plough ahead and make this new PAF-free database a thing, reaching the point of viability would take five and a half years. And how much it would cost isn’t clear – because the government chose to blank it out from my FOI request6. Bah.
So that’s why, as you can see above, the OS basically rules out recreating the PAF… without the PAF.
Right the ship
In a sense, as a government study this was actually pretty unambitious.
For example, it fails to consider the proactive opportunities that emerge from opening up the Postcode Address File – like the inevitable explosion in creativity and innovation.
But it is also very narrow in how it views the problem of address data. It could have actually built something more accurate with a slightly different approach.
For example, OS chose to not include other sources of “third party” address data in its new PAF-free dataset. It could have conceivably bought millions of addresses from companies like Experian, and mashed those addresses in too.
Or it could have done something really clever, like propose that whenever someone renews their driving licence or files their taxes at HMRC, that the government could collect all of the submitted addresses together, and fold them into the new address dataset. Over a long enough time, this would build up to contain every address in the country7.
And zooming out even further, when considering an alternative to the PAF, the problem isn’t just the completeness of the data. Accuracy is important (see above!), but equally the people who want to use address data might value other factors, such as how regularly the database is updated, how quickly errors can be fixed, or whether specific categories of address are accurate.
But I think that to an extent, none of this really matters at this point. What this study proves is something much more important: That recreating the PAF, without using Royal Mail data, is really, really hard. It would be expensive, time consuming and result in a worse product overall. Oh, and it wouldn’t even be legally allowed to include, er, postcodes, as they are specifically owned by Royal Mail – which would render the data useless for a significant number of potential use-cases.
So really the Ordnance Survey has demonstrated that if we actually want the widely-recognised benefits of open address data, there is only one option: The government needs to take back control of this critical national dataset from Royal Mail. For growth and innovation, we need to liberate the Postcode Address File, because there is no other choice. Rebuilding the Ship of Theseus just isn’t a viable option.
Huge, huge thanks to my comrades-in-PAF: Peter Wells for explaining the implications of this study to me, and whose analysis I’m mostly stealing above, and Anna Powell-Smith for confirming that I’m not talking rubbish.
If you want to see the FOI response in full, you can find the docs here.
If you enjoy ultra-nerdy politics, policy, tech and media takes, then subscribe (for free) to get more stuff like this direct to your inbox – it’s not always about postcodes.
And if you *really* like my work, please consider upgrading to a paid subscription, as it is only with your support that I can do journalism like this. Because let’s face it, no one else is going to pay for this much postcode content.
1. I know, I wish some shadowy figure had handed me a brown envelope in a car park too.
2. Really sorry, there’s going to be a tonne of annoying acronyms in this, because this shit is complicated.
3. Remember? The OS are the mapping people first and foremost.
4. And as a result, if you’re a company wanting to use AddressBase data, you need to pay OS a fee, so they can in turn pay Royal Mail.
5. Confirming my pre-existing prejudice that posh people are basically the problem with everything.
6. There are a number of exemptions that can be applied to FOI requests to stop the release of documents, such as whether it would prejudice current policy making, be commercially sensitive, help criminals commit crimes, or basically has anything to do with the royal family. Long before the Queen died, I once tried to FOI the plans for Operation London Bridge (the Queen’s funeral), and I think I got a full-house with literally every possible legal exemption cited as DCMS denied my request.
7. This is, incidentally, why Amazon and Google probably don’t even need the PAF, because they can just mine data from people buying stuff and from Google Maps, and build their own databases of practically every address in the country.
Every morning I visit Hacker News, which is an aggregator of tech stories. Incredibly this article was #1. You can see the discussion around this here: https://news.ycombinator.com/item?id=41326604
fantastic piece. thank you for writing!