Technical Appendix: Estimating COVID Caseloads in the States

The Johns Hopkins Center for Systems Science and Engineering deserves kudos for providing daily statistics on the spread of the novel coronavirus disease known as COVID-19. Data on confirmed cases, deaths, tests conducted, and hospitalizations are available for a variety of geographic units. For the US, there are data for counties and aggregates for states. I’m going to focus on the state-level measures and present a few “regression experiments” using various predictors for the number of cases reported by each state.

The Baseline Model

The dependent variable in all the models I will present is the base-10 logarithm of the total number of cases confirmed in each state as of April 24, 2020. These range from a high of 271,590 cases in New York state to a low of 339 cases in Alaska. In my initial model (1) below, I include a state’s area and population size as predictors of the number of cases. Because I use logs on both sides of the equation, the coefficient estimates are “elasticities,” measuring the percentage change in cases associated with a one-percent increase in a predictor.
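To make the log-log setup concrete, here is a minimal sketch of how such a regression could be fit. The data are synthetic and the "true" coefficients are made up purely for illustration; they are not the Johns Hopkins figures or the estimates reported in this appendix. The helper pct_change_in_cases is a hypothetical name.

```python
import numpy as np

# Synthetic, illustrative data only (NOT the Johns Hopkins figures).
rng = np.random.default_rng(0)
n = 50
log_pop = rng.uniform(5.8, 7.6, n)     # log10 of population
log_area = rng.uniform(3.0, 5.8, n)    # log10 of area in square miles
true_b = np.array([-4.0, 1.2, -0.1])   # made-up intercept and elasticities

X = np.column_stack([np.ones(n), log_pop, log_area])
log_cases = X @ true_b + rng.normal(0, 0.05, n)

# Ordinary least squares on the log-log specification.
coef, *_ = np.linalg.lstsq(X, log_cases, rcond=None)
intercept, pop_elasticity, area_elasticity = coef

def pct_change_in_cases(elasticity, pct_change_in_predictor):
    """Exact percent change in cases implied by a log-log coefficient."""
    return ((1 + pct_change_in_predictor / 100) ** elasticity - 1) * 100

print(round(pop_elasticity, 2), round(area_elasticity, 2))
```

Because both sides are in logs, the base of the logarithm does not affect the slope coefficients as long as the same base is used throughout.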


COVID’s spread is determined much more by the size of a state’s population than by its area. Moreover, the coefficient of 1.26 means that states with larger populations have disproportionately more cases, no doubt a consequence of the contagion effect.

At the bottom of the column for model (1) is the coefficient for a “dummy” variable representing New York state.  In this simple size-based model, New York has (10^0.84), or 6.9, times the number of cases that its population and area would predict.  The reason for this will become clear in a moment.
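The conversion from the dummy coefficient to a multiplier can be checked directly. This sketch assumes only what the text states: the dependent variable is a base-10 logarithm and the New York coefficient is 0.84. The function name dummy_multiplier is hypothetical.

```python
def dummy_multiplier(coefficient, base=10.0):
    """Multiplicative effect of a 0/1 dummy when the outcome is a logarithm."""
    return base ** coefficient

ny_multiplier = dummy_multiplier(0.84)   # 0.84 is the New York coefficient in model (1)
print(round(ny_multiplier, 1))           # 6.9
```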

Testing, Testing, Testing

In model (2) I add the estimated proportion of the population that has been tested for the virus as of April 17th, a week before the caseload figures. The testing numbers also come from Johns Hopkins. For this measure, and all the proportions that follow, I calculate the “logit” of the estimated proportion. For the testing measure this works out to:

logit(testing) = ln(number_tested/(total_population – number_tested))

The quantity number_tested/(total_population – number_tested) measures the odds that a random person in the state’s population has been tested for the virus. Taking the logarithm of this quantity produces a measure that ranges over the entire number line.
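As a small illustration, the logit transformation described above can be written as follows. The function names are hypothetical, and the example counts are chosen only to show the arithmetic.

```python
import math

def logit_tested(number_tested, total_population):
    """Natural log of the odds that a random resident has been tested."""
    if not 0 < number_tested < total_population:
        raise ValueError("number_tested must lie strictly between 0 and total_population")
    odds = number_tested / (total_population - number_tested)
    return math.log(odds)

def inverse_logit(x):
    """Map a logit back to a proportion between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))

# Example: 50,000 people tested out of five million (one percent).
x = logit_tested(50_000, 5_000_000)
print(round(x, 3), round(inverse_logit(x), 4))
```

A proportion of one half maps to a logit of zero; proportions below one half give negative values and those above give positive values.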

Testing has a strong statistical relationship to the number of identified coronavirus cases in a state. Moreover, the coefficient has a plausible magnitude. If we increase testing by one percent, the expected number of cases will grow by about 0.4 percent. In other words, the number of confirmed cases rises with testing, but less than proportionally.

Notice how the effect of a state’s physical area declines when testing is accounted for. One apparent reason why large states have fewer cases is that it is more difficult to test for the virus over a larger area.

Finally, when testing is accounted for, the caseload for the state of New York is no different from any other state with its population size and physical area.

We can simulate the effects of testing by imagining a fairly typical state with five million people living on 50,000 square miles of land area, then using the coefficients from model (2), the first model to include testing, to see how the estimated number of confirmed cases varies with the testing rate. This chart shows how the infection rate, the proportion of the population found to have the virus, increases with the rate at which the population is tested.


If we test only one percent of the state’s population, we will find about 0.1 percent of the population with a COVID infection. If we test five percent of the population, about 0.6 percent of that state’s people will be identified as having the virus.*
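The two scenarios above can be turned into head counts for the hypothetical five-million-person state, along with the share of tests that come back positive. This is plain arithmetic on the figures quoted in the text, not output from the fitted model; implied_positivity is a hypothetical helper name.

```python
def implied_positivity(share_tested, share_confirmed):
    """Confirmed cases per person tested, both given as shares of population."""
    return share_confirmed / share_tested

population = 5_000_000  # the "fairly typical" state described in the text

for share_tested, share_confirmed in [(0.01, 0.001), (0.05, 0.006)]:
    tested = population * share_tested
    confirmed = population * share_confirmed
    rate = implied_positivity(share_tested, share_confirmed)
    print(f"tested {tested:,.0f}, confirmed {confirmed:,.0f}, positivity {rate:.0%}")
```

In this example the implied share of positive tests is roughly ten percent in the first scenario and twelve percent in the second.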

Old Folks in Homes

Now let’s turn to some demographic factors that are thought to increase caseloads. First is the age of the population. In general, older people are thought to be more susceptible to the virus. However, model (3) shows there is little evidence that states with larger proportions of elderly residents have greater caseloads. What does matter, as model (4) shows, is the proportion of a state’s population aged 75 and older living in nursing facilities. When the virus gets into one of these facilities, it can run rampant through the resident population and the staff.

Race, Ethnicity, and Location

Reports of higher rates of infection among black and Hispanic Americans appear in these data as well.  In model (5), it appears the effect of larger Hispanic populations is twice that of equivalent black populations.  If we also adjust for the size distribution of a state’s population in model (6), the effect of its proportion Hispanic declines. This pattern suggests that Hispanics are more likely to live in smaller communities than other ethnic groups.

It is important to remember that these analyses apply to states. Finding no relationship between the proportion of a state’s population that is Native American and the state’s number of coronavirus cases does not imply that Native populations are more or less at risk. For that we need data at the individual level, where we find that Native populations are more at risk.

I’ve also said nothing about deaths arising from the novel coronavirus.  That is the subject of my next report.

 

____________________

*We have no way of knowing what the “true” number of cases is; we have only the Johns Hopkins figures for “confirmed” cases.

The Politics of Stay-At-Home Orders

On her blog, journalist Marcy Wheeler helpfully tallied the twenty-seven states whose governors have imposed stay-at-home orders during the COVID-19 pandemic. Virginia joined this group late Monday. I have used her data, and figures from Johns Hopkins University on the number of identified cases, to do a quick analysis of the political forces driving the decision to impose such orders.

I used simple binary logit models for these tests. The predictors include whether each state’s governor and legislature are controlled by Democrats, the February net job approval rating (approve – disapprove) for Donald Trump in each state from Morning Consult, and the number of reported cases in each state as of March 15th and March 30th. Model (1) below includes all these factors; model (2) includes just the two that proved significant.

As you can see, only two factors proved nominally “significant”: whether the governor is a Democrat, and Trump’s approval rating in the state. States with Democratic governors, and those where Trump’s net job approval is negative (“underwater”), are more likely to have instituted a stay-at-home policy. Surprisingly, the number of COVID-19 cases did not seem to matter. (Using the logarithms of the number of cases did not improve things, nor did looking at rates of growth.)

Using these results, I have generated the predicted probability that each state will have instituted a stay-at-home order and compared those predictions to the actual policies.
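A comparison of this kind might be sketched as follows. The 0.5 cutoff for calling a state "predicted" to have an order is an assumption, as are the function names; the four entries are invented placeholders, not the fitted probabilities or policies of any real state.

```python
import math

def predicted_probability(linear_predictor):
    """Inverse logit: turn a fitted linear predictor into a probability."""
    return 1.0 / (1.0 + math.exp(-linear_predictor))

def compare_to_actual(probabilities, actual, threshold=0.5):
    """Return (predicted but not imposed, imposed but not predicted)."""
    predicted_not_imposed = sorted(
        s for s, p in probabilities.items() if p >= threshold and not actual[s]
    )
    imposed_not_predicted = sorted(
        s for s, p in probabilities.items() if p < threshold and actual[s]
    )
    return predicted_not_imposed, imposed_not_predicted

# Placeholder values for illustration only.
probabilities = {"State A": 0.93, "State B": 0.71, "State C": 0.22, "State D": 0.08}
actual = {"State A": True, "State B": False, "State C": True, "State D": False}

over, under = compare_to_actual(probabilities, actual)
print(over, under)   # ['State B'] ['State C']
```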

There are ten states where the predicted policy does not match the actual decision. Thirty-three states are predicted to have imposed stay-at-home orders, but only twenty-eight have done so. Democratic strongholds like California and New York all have predicted probabilities above 0.9. However, Nevada, Maine, Pennsylvania, Massachusetts, and Kentucky are all predicted to have instituted stay-at-home policies but have yet to do so. In contrast, the governors of West Virginia, Idaho, Indiana, Alaska, and Ohio have all instituted such policies despite the political context of their states.

We can use the same set of predictors to estimate the duration of a state quarantine. Here I use a “Tobit” model, which is designed for dependent variables censored at a lower bound, in this case zero. States without a quarantine are coded zero on the duration variable.
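For readers unfamiliar with the Tobit model, the sketch below writes out the standard log-likelihood for a dependent variable censored at zero. It is a generic textbook formulation, not the code used for this analysis; the function names and the inputs in the example call are hypothetical.

```python
import math

def normal_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def normal_cdf(z):
    """Standard normal cumulative distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def tobit_loglik(durations, linear_predictors, sigma):
    """Log-likelihood of a Tobit model censored from below at zero.

    durations: observed quarantine lengths (0 for states with no quarantine)
    linear_predictors: fitted values of the latent duration for each state
    sigma: standard deviation of the latent error term
    """
    total = 0.0
    for y, mu in zip(durations, linear_predictors):
        if y <= 0:
            # Censored case: probability that the latent duration is at or below zero.
            total += math.log(normal_cdf(-mu / sigma))
        else:
            # Uncensored case: normal density of the observed duration.
            total += math.log(normal_pdf((y - mu) / sigma) / sigma)
    return total

# Hypothetical inputs, for illustration only.
print(tobit_loglik([0.0, 14.0, 30.0], [-5.0, 12.0, 28.0], 10.0))
```

Maximizing this function over the coefficients and sigma gives the Tobit estimates; states coded zero contribute through the first branch rather than being treated as true zero-day quarantines.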

The general pattern here is the same as for whether a quarantine was imposed. However, the growth in cases between 3/15 and 3/30 has a weak statistical relationship with duration. Because the case figures are expressed as base-10 logarithms, the coefficient of 26.8 implies that a state whose caseload grew by a factor of ten during the latter half of March would impose a quarantine about 26.8 days longer than an otherwise identical state whose caseload did not grow.
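Because the predictor is a base-10 logarithm, the coefficient multiplies the log of the growth factor. A short sketch of that arithmetic, using the 26.8 coefficient from the text (the function name is hypothetical, and in a Tobit model the effect strictly applies to the latent, uncensored duration):

```python
import math

def duration_effect(growth_factor, coefficient=26.8):
    """Change in predicted (latent) quarantine days for a given caseload growth factor."""
    return coefficient * math.log10(growth_factor)

for factor in (1, 2, 10, 100):
    print(factor, round(duration_effect(factor), 1))
```

A tenfold increase corresponds to 26.8 days, while a doubling corresponds to about 8 days.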