
Several days ago I caught the tail end of a television interview with a doctor who was being asked whether there is a way to estimate the number of people in the U.S. who have contracted Covid-19, given that only 0.6% of the total population had been tested (at the time of this publication that number is around 1%). The doctor didn’t really have an answer to this question, but the answer is yes: it is possible to say something about the actual number even if we can’t say exactly what that number is. Providing this answer isn’t quite within the range of what most medical doctors would normally be expected to do, but as a data scientist, answering this sort of question does fall within my wheelhouse. Today I’ll go over two approaches that I hope provide some useful insight. The first gives a broad answer that reflects how much we don’t know without widespread testing, and the second may help narrow down where within that broad range a more specific answer lies.
To provide some context, the larger discussion in this interview concerned the potential risks of lifting the shelter-in-place policies currently in effect across the country in order to get the economy back on more stable footing. The heart of the question was really this: has enough of the population been exposed to the virus, and thus developed immunity to it, that any second wave of outbreaks would be significantly slowed by the number of people who are already effectively immune through exposure? While I don’t have the medical knowledge to say how much of the population would have to develop immunity in order to significantly slow a viral spread, I can say something about how far the spread has gone, even if only to give a window of likely infection counts.
The tricky part about this question is that only a proportion of those who have contracted the virus have been tested for it, and until antibody testing becomes much more widely available, we have no idea how big this proportion is. Current testing could have identified 90% of everyone who has been infected, or 10%; at this point we just don’t know. This is especially true since, according to the CDC, an estimated 25% to 30% of people who contract the virus don’t show symptoms.
While we do have data on the small (relative to the total U.S. population) sample of those who have been tested, I would be very hesitant to assume that this slice is representative of the total population. For one thing, different regions within the U.S. are at different stages of outbreak, so if most of the tests have been conducted in areas like New York that have been hit hard, it wouldn’t be accurate to generalize these results to places in the Midwest, for example, which haven’t been hit as hard yet. The other bias that I expect to be present here is oversampling of positive cases. Without widespread testing, I think it likely that a large proportion of those who have been tested fall into one of two groups: people who have experienced symptoms serious enough to need medical help, and people who have been exposed to someone infected and who either have pre-existing conditions that could make them susceptible or are in regular contact with others who do. Basically, I don’t expect this sample to be at all random, so generalizing its results to the whole population would be misleading and less than useful.
The first approach I take to addressing this uncertainty is to calculate a range of possible outcomes from what we do know. We know that some proportion of positive Covid-19 cases have been identified, and we have that number, but we don’t know how many people have actually been infected for every one person who has been identified as having the virus. The broadest approach is to calculate a wide range of possibilities for this ratio. While this won’t provide a specific figure for how many have contracted the virus in total, it will give us a best and worst case scenario, provided that the range of possibilities is wide enough to capture the actual number of infections. To find this range I start with the assumption that 95% of the total cases have been successfully identified by testing and work my way down from there, calculating the potential total for each percentage identified until I get to 5%. In the best case scenario of 95% of cases identified, for every 95 people who have tested positive, 5 more have had the virus but have gone untested. In the worst case scenario of 5% identification, for every one person who has tested positive, 19 more have contracted the virus without being tested or accounted for. Both of these scenarios seem a little extreme, but that is what we want: to make sure the actual number falls within this interval, whatever that number may be. The following figure shows the curve of possible total infections calculated from the current number of positive cases identified through testing, which as of this writing on April 16th is 645,936 within the U.S.
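The calculation behind that curve can be sketched in a few lines of numpy. This is a minimal illustration rather than the exact code used for the figure: the variable names are my own, and the population figure of roughly 330 million is an assumption used only to express each total as a share of the country.

```python
import numpy as np

confirmed_cases = 645_936      # positive tests in the U.S. as of April 16th
us_population = 330_000_000    # assumption: approximate U.S. population

# Share of all infections assumed to have been caught by testing:
# 95%, 90%, ..., 5% (best case down to worst case).
detection_rates = np.arange(95, 0, -5) / 100

# If a fraction r of all infections has been identified, then
# total infections = confirmed cases / r.
estimated_totals = confirmed_cases / detection_rates
population_share = estimated_totals / us_population

for rate, total, share in zip(detection_rates, estimated_totals, population_share):
    print(f"{rate:>4.0%} identified -> {total:>12,.0f} infected ({share:.1%} of population)")
```

Dividing by the detection rate, rather than multiplying, is what makes the curve bend upward so sharply at the low end: halving the assumed detection rate doubles the estimated total.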

At the low end of this graph, when we assume that we have already identified 95% of the total Covid-19 cases, is 679,933 total people infected, which is only 0.2% of the total U.S. population. At the high end is 12,918,720 total people infected, which still only accounts for 3.9% of the total U.S. population. This is a big window, and while it does give us some useful information, the next question becomes whether or not there is a way to get a better sense of where within this spectrum of possibilities the actual figure may lie. To shed some more light on where within this range the actual number may fall, I turn to Python simulation.
We have identified only a portion of the total Covid-19 cases, and in the absence of widespread random testing of the population, these cases are going to disproportionately represent the patients who are most susceptible to the virus, whether because of pre-existing conditions or because they are in an older age group. Given that, we can simulate via random number generation how many out of every hundred people who have contracted the virus would have been identified through testing. The assumption is that with a lockdown in place, people who have healthier immune systems or display only mild symptoms have no incentive to go to a medical facility to get tested, so the bulk of those getting tested are people who are affected enough to need medical help. This is starting to change as more testing becomes available, but given that this has been slow to happen, I am confident that these assumptions still hold for the time being.
Using Python’s numpy library I can generate a series of random numbers between 1 and 100. To simulate an individual who has contracted the virus, I generate several of these random numbers and use them to assign the individual to categories that determine whether or not their symptoms would be severe enough for them to seek medical help and get tested. For example, the CDC has reported that 25% to 30% of people with Covid-19 are asymptomatic, meaning they have no outward symptoms at all. Assuming that people fall into this category regardless of age or other conditions, I can say that if the first randomly generated number is lower than 30, the person will not get tested. The second random number per person determines whether or not they are older than 50. Since 34% of the population is over 50, if this number is greater than 66, the person is classified as being over 50 and counted as someone who would seek care and get tested. I use this same method to classify individuals as having pre-existing conditions such as diabetes, obesity, and chronic lung disease, based on the percentage of the U.S. population who have these conditions, and I count anyone who has one of them and isn’t in the asymptomatic 30% as tested and accounted for.
Having done this, it is easy to take a count and see how many of these simulated individuals would have gone untested. The proportion of those counted to uncounted then gives us a general clue as to what the ratio may be like in the actual world.
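A rough sketch of this classification scheme is shown below. It is not the original simulation code: the function name, the seed, and above all the condition prevalence percentages are placeholders I chose for illustration, so the rate it prints should not be expected to match the figure reported in this post. The asymptomatic and over-50 thresholds follow the description above.

```python
import numpy as np

def simulate_detection_rate(n_people, condition_pct, seed=0):
    """Estimate the share of infected people who would get tested.

    condition_pct maps a condition name to its assumed prevalence in
    whole percent (placeholder values, not the post's actual inputs).
    """
    rng = np.random.default_rng(seed)

    def roll():
        # One random integer from 1 to 100 (inclusive) per simulated person.
        return rng.integers(1, 101, size=n_people)

    asymptomatic = roll() < 30   # no symptoms, so never tested
    over_50 = roll() > 66        # 34% of the population is over 50

    # A person is flagged if any one of the listed conditions applies.
    has_condition = np.zeros(n_people, dtype=bool)
    for pct in condition_pct.values():
        has_condition |= roll() <= pct

    # Tested = shows symptoms AND (older OR has a pre-existing condition).
    tested = ~asymptomatic & (over_50 | has_condition)
    return tested.mean()

# Placeholder prevalence figures, for illustration only.
placeholder_conditions = {"diabetes": 10, "obesity": 40, "chronic lung disease": 6}

rate = simulate_detection_rate(1_000_000, placeholder_conditions)
print(f"Simulated share of infections identified by testing: {rate:.1%}")
```

Drawing each trait from its own independent roll treats age and the various conditions as unrelated to one another, which is a simplification; in practice they are correlated, and accounting for that would change the result.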
When I run this simulation one million times to represent one million individuals who have contracted the virus, I find that roughly 46 out of every hundred infectees are identified through testing, meaning that the other 54 or so go unaccounted for. This is near the middle of the range that I estimated values for in the graph above, and it would indicate that with the 645,936 cases that have been identified through testing, the actual number of people in the U.S. who have been infected is 1,421,201. This figure represents only 0.43% of the total population.
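The step from a detection rate to a total is the same division used for the range earlier, and can be checked directly. In the sketch below the function name and the 330 million population figure are my own assumptions. Note that dividing by a rate of exactly 0.46 gives about 1.40 million, while the 1,421,201 figure corresponds to a rate of about 45.45%, so the reported total was presumably computed from an unrounded simulation output; that reading is an inference, not something stated in the analysis.

```python
confirmed_cases = 645_936
us_population = 330_000_000   # assumption: approximate U.S. population

def estimated_total(confirmed, detection_rate):
    """Total infections implied by a given share of cases being identified."""
    return confirmed / detection_rate

# Using the rounded figure of 46 identified per hundred infections.
total_at_46_pct = estimated_total(confirmed_cases, 0.46)

# Working backwards: the detection rate implied by the reported total.
reported_total = 1_421_201
implied_rate = confirmed_cases / reported_total

print(f"Total at a 46% detection rate: {total_at_46_pct:,.0f}")
print(f"Rate implied by the reported total: {implied_rate:.2%}")
print(f"Reported total as a share of the population: {reported_total / us_population:.2%}")
```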
This simulation is not too complicated. I could go much more in-depth here if I wanted to, but given that there will be diminishing returns on accuracy, I don’t think it is worth spending the extra time at this point. This is to some extent a shot in the dark, meant to provide a possible starting point for thinking about roughly where the real number of cases could be, and as such it should be taken with a grain of salt. While I would be excited to learn that this method ended up producing accurate results, I don’t expect the figures generated by simulation to be spot on, especially given the assumptions that I made in using random numbers to assign people to categories. In reality, the proportion of people over 50 who contracted the virus and received medical help is not likely to be exactly 70%, for example.
So what does all of this tell us about the original question? It tells us that we can reasonably expect that the total percentage of our workforce who may have developed immunity to the virus is likely less than 1%, and at most just shy of 4%. I’m not an immunologist, but I would be surprised if this were enough of the population to significantly slow the spread of the virus if we reopened everything tomorrow. Nor do I think it likely that it would be a large enough proportion to prevent our hospital systems from being overwhelmed by the number of new patients from the remaining 96% to 99% of the populace, who would still be susceptible to contracting the virus if the shelter-in-place policies adopted across the nation were lifted.
Have an idea of how I can improve on either of the methods I used to calculate estimates here? Feel free to leave me a comment; I’m always open to improvement.
In conducting this analysis I used Python’s numpy and matplotlib libraries as well as information from the Centers for Disease Control and Prevention, which can be viewed here and here, as well as outbreak information from Covidly.