Technical Appendix: Estimating COVID Caseloads in the States

The Johns Hopkins Center for Systems Science and Engineering deserve kudos for providing daily statistics of the spread of the novel coronavirus known as COVID-19. Data on confirmed cases, deaths, tests conducted, and hospitalizations are available for a variety of geographic units. For the US, there are data for counties and aggregates for states. I’m going to focus on the state-level measures and present a few “regression experiments” using various predictors for the number of cases reported by each state.

The Baseline Model

The dependent variable in all the models I will present is the base-10 logarithm of total number of cases confirmed for each state on April 24, 2020.  These range from a high of 271,590 cases in New York state to a low of 339 cases confirmed in Alaska. In my initial model (1) below I include a state’s area and population size as predictors for the number of cases.  By using logs on both sides of the equation, the coefficient estimates are “elasticities,” measuring the proportional effect of a one-percent increase in a predictor.

COVID’s spread is much more determined by the size of a state’s population than its area. Moreover the coefficient of 1.26 means that states with larger populations have disproportionately more cases, no doubt a consequence of the contagion effect.

At the bottom of the column for model (1) is the coefficient for a “dummy” variable representing New York state.  In this simple size-based model, New York has (10^0.84), or 6.9, times the number of cases that its population and area would predict.  The reason for this will become clear in a moment.

Testing, Testing Testing

In model (2) I add the estimated proportion of the population that has been tested for the virus as of April 17th, a week before the caseload figures. The testing numbers also come from Johns Hopkins. For this measure, and all the proportions that follow, I calculate the “logit” of the estimated proportion. For the testing measure this works out to:

logit(testing) = ln(number_tested/(total_population – number_tested))

The quantity number_tested/(total_population – number_tested) measures the odds that a random person in the state’s population has been tested for the virus. Taking the logarithm of this quantity produces a measure that ranges over the entire number line.

Testing has a strong statistical relationship to the number of identified coronavirus cases in a state. Moreover the coefficient has a plausible magnitude.  If we increase testing by one percent, the expected number of cases will grow by 0.4 percent.  In other words, increasing testing at the margin identifies an infection in about forty percent of those newly tested.

Notice how the effect for a state’s physical area declines when testing is accounted for. One apparent reason why large states have fewer cases is because it is more difficult to test for the virus over a larger area.

Finally, when testing is accounted for, the caseload for the state of New York is no different from any other state with its population size and physical area.

We can simulate the effects of testing by imagining a fairly typical state with five million people living on 50,000 square miles of land area, then using the coefficients from model (1) to see how the estimated number of confirmed cases varies with the testing rate. This chart shows how the infection rate, the proportion of the population found to have the virus, increases with the rate at which the population is tested.

If we test only one percent of the state’s population, we will find about 0.1 percent of the population with a COVID infection. If we test five percent of the population, about 0.6 percent of that state’s people will be identified as having the virus.*

Old Folks in Homes

Now lets turn to some demographic factors that are thought to increase caseloads. First is the age of the population. In general, it is thought that older people have more susceptibility to the virus. However, model (3) shows there is little evidence that states with larger proportions of elderly have greater caseloads. What does matter, as model (4) shows, is the proportion of a state’s 75 and older population living in nursing facilities. When the virus gets into one of these facilities, it can run rampant throughout the resident population and the staff.

Race, Ethnicity, and Location

Reports of higher rates of infection among black and Hispanic Americans appear in these data as well.  In model (5), it appears the effect of larger Hispanic populations is twice that of equivalent black populations.  If we also adjust for the size distribution of a state’s population in model (6), the effect of its proportion Hispanic declines. This pattern suggests that Hispanics are more likely to live in smaller communities than other ethnic groups.

It is important to remember that these analyses apply to states. Finding no relationship between the proportion of a state’s population that is Native American and the state’s number of coronavirus cases does not imply that native populations are more or less at risk.  For that we need data at the individual level where we find that Native populations are more at risk.

I’ve also said nothing about deaths arising from the novel coronavirus.  That is the subject of my next report.



*We have no way of knowing what the “true” number of cases are; we have only the Johns Hopkins figures for “confirmed” cases.