Spoiler alert: I found some intriguing patterns…
As one of the makers of The Coronavirus App (if the link doesn’t work, check this), I’ve received countless emails asking me whether the data that the app displays could be trusted. My answer has always been the same:
“We can’t really know. But that’s the best data we have at our disposal.”
With the Coronavirus App, the job was to synthesize data reported by official sources as quickly and accurately as possible, and to make it visually accessible. We’re app developers, not epidemiologists or government officials. So when governments report their counts, we have a duty to display them as reported, and to link back to the original source.
That being said…
I was always curious. How true is that data, really?
How it works
You take a set of random numbers. You look at the first digit of each of these numbers. If you have enough numbers and if there isn’t any particular reason why they should behave otherwise*, then the distribution should look like this.
Counter-intuitive, you say? Naturally, one would think the chart would show an equal proportion of 1s, and 2s, and 3s... Except, it doesn’t. And here is why.
Benford’s law is used in many contexts, e.g. to find tax cheaters, detect election fraud, or in drug discovery data. Above is what the natural distribution of leading digits tends to look like when numbers are truly random.
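For the curious, the shape of that distribution comes from a one-line formula: the expected share of leading digit d is log10(1 + 1/d). Here is a minimal Python sketch that prints the expected share for each digit, from roughly 30.1% for 1 down to roughly 4.6% for 9. The function name `benford_expected` is ours, purely for illustration.

```python
import math

def benford_expected():
    """Expected share of each leading digit (1 to 9) under Benford's law."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}

# Print the expected share of each leading digit as a percentage.
for digit, share in benford_expected().items():
    print(f"{digit}: {share:.1%}")
```

The nine shares add up to 1, which is a handy sanity check when you compare them with an observed distribution.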
But when supposedly random numbers are actually fabricated by humans, boy… the distribution tends to look very different. We humans are absolutely terrible at creating random numbers.
Does it actually work?
I didn’t have any particular expectations going into this. In fact, I was just really curious to see if Benford’s law even applied to these COVID-19 datasets we’ve spent the past year putting together. In other words — are these numbers really random enough?
Below is an analysis of the leading digits of the numbers for the whole world (i.e. the sum of all countries) for each day since this whole thing started (taking into account cumulative cases, new cases, cumulative deaths and new deaths).
Cool! It looks pretty good. World numbers are, as the sum of numbers independently reported by each government, probably the most unbiased dataset we have in the context of COVID-19. If we assume that there are more countries that report real numbers than countries that report doctored numbers (hope so!), then the sum should be random enough for Benford’s magic to be visible.
Benford’s law seems to work pretty well here.
What should we analyze, exactly?
The next step is figuring out exactly what datasets we should analyze. We essentially have four at our disposal:
- A) Daily new cases
- B) Daily new deaths
- C) Daily cumulative cases
- D) Daily cumulative deaths
In the world chart above, we’ve combined all four. But going forward in this article (as well as in all the Benford’s law charts we display on The Coronavirus App), we’ll group A with B, and C with D. We’ll run two tests for each country. These tests will tell us: how natural are the cumulative numbers reported by governments? And how natural are the new numbers reported by governments?
Note: Some countries do not report every day, so we deduplicate identical values in C and D and exclude 0 values in A and B.
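The preparation steps above can be sketched in a few lines of Python. This is an illustration under assumptions, not the app's actual code: the function names and the sample numbers are made up, ignoring the sign of negative corrections is our choice, and so is dropping zeros from the cumulative series (a zero has no leading digit).

```python
from collections import Counter

def leading_digit(n):
    """First digit of a non-zero integer (sign ignored, which is an assumption)."""
    return int(str(abs(int(n)))[0])

def prepare_new(values):
    """Datasets A and B: drop the days where 0 was reported."""
    return [v for v in values if v != 0]

def prepare_cumulative(values):
    """Datasets C and D: keep each distinct value once, in order, and drop zeros."""
    return [v for v in dict.fromkeys(values) if v != 0]

def digit_shares(values):
    """Observed share of each leading digit (1 to 9)."""
    counts = Counter(leading_digit(v) for v in values)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no usable values")
    return {d: counts[d] / total for d in range(1, 10)}

# Made-up numbers for one hypothetical country.
new_cases = [0, 12, 0, 35, 118]
new_deaths = [0, 0, 1, 3, 0]
cumulative_cases = [12, 12, 47, 165, 165]

# Group A with B, as described above.
new_shares = digit_shares(prepare_new(new_cases) + prepare_new(new_deaths))
cumulative_values = prepare_cumulative(cumulative_cases)
print(new_shares)
print(cumulative_values)
```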
Why 113 countries?
We’ve chosen to analyze only the countries that have more than 10,000 COVID-19 cases, which leaves us with 113. A dataset that spans several orders of magnitude typically produces more reliable results in a Benford’s law test.
Let’s do some math
The chart alone gives us a pretty decent idea whether the distribution kinda follows Benford’s law or not. But for our approach to be mathematically sound, we must introduce a measure called Mean Absolute Deviation (or MAD).
MAD essentially quantifies by how much our observed distribution deviates from Benford’s law. The lower the MAD, the more natural the distribution seems. The greater the MAD, the madder our friend Benford (get it?).
So, for all the math wizards reading this article, below is how MAD is computed, with K the number of possible leading digits (so… well, 9, right?).
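Written out from the description in this section (a reconstruction, not necessarily the exact notation of the original figure), with O_d the observed share of leading digit d and B_d the share Benford’s law predicts for it:

```latex
\mathrm{MAD} = \frac{1}{K} \sum_{d=1}^{K} \left| O_d - B_d \right|,
\qquad B_d = \log_{10}\!\left(1 + \frac{1}{d}\right),
\qquad K = 9
```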
It’s actually really simple. For each digit (1 to 9), calculate the difference between the observed distribution (O) and what Benford predicts (B)*. The MAD is simply the mean of these 9 values.
* The result must be an absolute (so positive) number. Otherwise positive and negative differences cancel each other out and the MAD is always 0 (= we have learned nothing).
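Here is the same computation as a minimal Python sketch. The names `benford_expected` and `mad` are ours, and reporting the result in percentage points is an assumption on our part (it matches the scale of the MAD values quoted in this article). The example also shows why the absolute value matters: without it, the differences sum to zero.

```python
import math

def benford_expected():
    """Expected share of each leading digit (1 to 9) under Benford's law."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}

def mad(observed, in_percentage_points=True):
    """Mean absolute deviation between observed digit shares and Benford's law."""
    expected = benford_expected()
    deviations = [abs(observed[d] - expected[d]) for d in range(1, 10)]
    value = sum(deviations) / len(deviations)
    # Assumption: scale to percentage points to match the values in the article.
    return value * 100 if in_percentage_points else value

# A perfectly uniform distribution: every digit leads 1 time out of 9.
uniform = {d: 1 / 9 for d in range(1, 10)}

# Without the absolute value, the differences cancel out (up to rounding).
signed_sum = sum(uniform[d] - benford_expected()[d] for d in range(1, 10))
print(f"signed differences sum to {signed_sum:.6f}")
print(f"MAD of a uniform distribution: {mad(uniform):.2f}")
```

For reference, a perfectly uniform distribution (the one most people intuitively expect) lands at a MAD of roughly 6 on this scale.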
And the winners are…
So remember, the winners are those with the lowest MAD. The higher the MAD, the more it deviates from the natural distribution, and therefore the greater the likelihood that the distribution is, in fact… not natural.
Top 10 best countries
Top 10 worst countries
Top 10 best countries
Top 10 worst countries
Some interesting findings
- With cumulative cases, the spread in MAD (0.4 for Ukraine to 14.6 for China) is much greater than with new cases (0.68 for Germany to 6.91 for Tajikistan). It’s fairly logical, actually. New cases are more random than cumulative cases because they don’t build directly on the numbers from previous days.
- Italy has the 2nd worst MAD when it comes to total cases. But the 2nd… best MAD when it comes to new cases.
- China has a crazy MAD of 14.1 for total cases. But a very good 2 for new cases.
- Tajikistan is in the bottom 10 for both measurements.
This is where it gets interesting…
The best way to understand what a normal MAD is in the context of our dataset is to plot every country on a chart.
There are pretty clear trends here: normal seems to be between 1 and 3 for new cases and between 1 and 6 for cumulative cases. So the further from the bottom-left, the more abnormal the numbers.
And you can see three groups of anomalies.
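If you want to apply the rough "normal" bands read off the chart to a given country, a tiny helper is enough. This is only a sketch: the function name and labels are made up, the thresholds (3 for new numbers, 6 for cumulative numbers) are eyeballed from the chart rather than derived statistically, and the example uses the two China figures quoted earlier in this article.

```python
def classify(mad_new, mad_cumulative, new_max=3.0, cumulative_max=6.0):
    """Label a country by which of its two MAD values exceeds the rough bands."""
    high_new = mad_new > new_max
    high_cumulative = mad_cumulative > cumulative_max
    if high_new and high_cumulative:
        return "unusual on both axes"
    if high_new:
        return "unusual new numbers"
    if high_cumulative:
        return "unusual cumulative numbers"
    return "typical"

# China, with the values quoted above: 14.1 for total cases, 2 for new cases.
print(classify(mad_new=2.0, mad_cumulative=14.1))
```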
Do keep in mind that abnormal doesn’t necessarily mean fraudulent. Anything drastic that a country does to slow down the virus (a lockdown, increased testing, decreased testing…) will mess with the randomness of the data, and therefore offset it further to the right.
For example, if, like China, you eventually eradicate the virus when you have about 80,000 cases, you may end up with hundreds of days with values starting with 8. The chart illustrates this extraordinarily well: it’s very unordinary. No other country has done anything even close, at least as of November 2020. (Note: to be fair, China’s position on the vertical axis does seem to indicate that the distribution of new cases does follow Benford’s law.)
To a lesser degree, that’s also what’s happening with all the countries in the green circle. Each in their own way, they’ve done something to mess with the randomness of their cumulative cases (voluntarily or not, fraudulently or not).
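To see how strongly a plateau skews the leading digits, here is a small simulation. Everything in it is synthetic: the growth rate, the 300-day plateau and the daily increments are invented for illustration and are not China’s actual figures.

```python
from collections import Counter

def leading_digit(n):
    """First digit of a positive integer."""
    return int(str(n)[0])

# Synthetic growth phase: 12% daily growth from 100, kept below 80,000.
growth = [v for v in (int(100 * 1.12 ** day) for day in range(60)) if v < 80_000]

# Synthetic plateau: 300 days creeping up slowly from just above 80,000.
plateau = [80_000 + 15 * day for day in range(1, 301)]

series = growth + plateau
counts = Counter(leading_digit(v) for v in series)
share_of_8 = counts[8] / len(series)
print(f"{len(series)} values, share starting with 8: {share_of_8:.0%}")
```

Benford’s law expects only about 5% of the values to start with 8, so a series like this one ends up with a very large MAD without any number having been doctored.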
The vertical axis is more revealing, though. While today’s cumulative death count is highly dependent on yesterday’s (it should theoretically never be lower, for example), today’s count of new deaths isn’t as directly tied to yesterday’s.
Countries that are high on the vertical axis have a number distribution that deviates widely from Benford’s law. Not only on their own, but also in comparison to the rest of the countries.
The chart doesn’t tell us why.
But it sure does tell us that it’s… curious.