The Lag Between COVID Cases and Deaths

Observers often point to the lag between COVID cases and COVID deaths to explain the current situation of rapidly rising caseloads but no corresponding spike in deaths. Still, after accounting for caseloads 14, 28, and 42 days prior, the growth in the number of deaths seems to have leveled off starting around July 1st.

Recent data on the expansion of the coronavirus pandemic in the United States show two somewhat contradictory trends. The number of diagnosed cases has skyrocketed, driven by states like Florida, Texas, Arizona, and California. While the rest of the developed world is bringing the virus under control, cases in the US are growing exponentially.

Yet even as cases are rising, the death toll attributed to the virus has leveled off.  These apparently contradictory trends can occur because of the lag between when someone is diagnosed with the virus and the time when he or she dies.  Today’s death count does not reflect today’s caseload, but the number of cases some weeks back.  To study the effects of this lag, I am using the daily reported numbers of cases and deaths for the US as a whole from Johns Hopkins.  The data begin on January 22, 2020, when the first case was reported, and continue daily through July 6th.

I tried a number of lag specifications in a simple regression model predicting total deaths from total cases.  A specification with sixty individual lags explained nearly all the variance in deaths but, unsurprisingly, none of the individual terms was significant.  Eventually I settled on a model where today’s deaths depend on the number of cases 14, 28, and 42 days prior.
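As a sketch of what this kind of lag model looks like in code, here is a minimal version using numpy’s least-squares solver on a synthetic series (the case counts, coefficients, and noise below are invented for illustration, not the Johns Hopkins data):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic daily cumulative case series standing in for the real data.
days = 170
daily_new = rng.uniform(0, 100, days) * (1 + 0.5 * np.sin(np.arange(days) / 7))
cases = np.cumsum(daily_new)

# Design matrix: today's deaths regressed on cases 14, 28, and 42 days prior.
lags = [14, 28, 42]
max_lag = max(lags)
X = np.column_stack([np.ones(days - max_lag)] +
                    [cases[max_lag - lag: days - lag] for lag in lags])

# Fake death series generated from the same lag structure plus noise,
# so the regression has a known answer to recover.
deaths = X @ np.array([20.0, 0.10, -0.03, -0.02]) + rng.normal(0, 5, days - max_lag)

beta, *_ = np.linalg.lstsq(X, deaths, rcond=None)
fitted = X @ beta
r_squared = 1 - ((deaths - fitted) ** 2).sum() / ((deaths - deaths.mean()) ** 2).sum()
```

The same design-matrix trick extends to any set of lags; with sixty columns the fit improves mechanically while the individual coefficients lose significance, which is the pattern described above.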

The model predicts that ten percent of people contracting COVID will die fourteen days later, though that effect is tempered by the number of cases at longer lags.  This could reflect “learning” by medical providers.  As they have gained experience treating an ever greater number of cases, the effectiveness of treatments and procedures has improved.

More interesting perhaps is this chart showing the model’s predictions for the number of deaths and the actual number.

In the first half of April, this model based solely on lagged case counts tended to under-predict the death toll, but the predicted and actual lines merge later that month and remained remarkably in lockstep through May and June.  Since July began though, the actual death count has slowed relative to the predictions based on the case count fourteen, twenty-eight, and forty-two days ago.

Since this model relies on past caseloads to predict contemporary deaths, we can extrapolate the death rate out fourteen days.  The future looks bleak, with the model projecting that we could reach a total of 200,000 deaths before the end of July.  We have to hope that the slower-growing trend in observed deaths persists.
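Because every predictor is lagged at least fourteen days, the cases already on record fully determine the model’s forecast for the next fourteen days. A toy version of that extrapolation (the coefficients and case series here are placeholders, not my fitted estimates):

```python
import numpy as np

# Placeholder coefficients for deaths on cases 14, 28, and 42 days prior.
b0, b14, b28, b42 = 50.0, 0.10, -0.03, -0.02

# Pretend `cases` is the cumulative case series observed through "today".
cases = np.linspace(1_000_000, 2_900_000, 200)

# For horizons h = 0..13, all three lagged predictors are already observed,
# so the forecast requires no assumptions about future case growth.
forecast = [b0 + b14 * cases[-14 + h] + b28 * cases[-28 + h] + b42 * cases[-42 + h]
            for h in range(14)]
```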

Technical Appendix: Estimating COVID Caseloads in the States

The Johns Hopkins Center for Systems Science and Engineering deserves kudos for providing daily statistics on the spread of the novel coronavirus that causes COVID-19. Data on confirmed cases, deaths, tests conducted, and hospitalizations are available for a variety of geographic units. For the US, there are data for counties and aggregates for states. I’m going to focus on the state-level measures and present a few “regression experiments” using various predictors for the number of cases reported by each state.

The Baseline Model

The dependent variable in all the models I will present is the base-10 logarithm of the total number of cases confirmed for each state on April 24, 2020.  These range from a high of 271,590 cases in New York state to a low of 339 cases confirmed in Alaska. In my initial model (1) below I include a state’s area and population size as predictors for the number of cases.  By using logs on both sides of the equation, the coefficient estimates are “elasticities,” measuring the proportional effect of a one-percent increase in a predictor.

COVID’s spread is determined much more by the size of a state’s population than by its area. Moreover, the coefficient of 1.26 means that states with larger populations have disproportionately more cases, no doubt a consequence of the contagion effect.

At the bottom of the column for model (1) is the coefficient for a “dummy” variable representing New York state.  In this simple size-based model, New York has (10^0.84), or 6.9, times the number of cases that its population and area would predict.  The reason for this will become clear in a moment.
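A sketch of how a model of this shape can be estimated, using invented state-level numbers (only the 1.26 elasticity and the 0.84 New York dummy come from the text; everything else is made up):

```python
import numpy as np

rng = np.random.default_rng(1)

# Invented state-level data: 50 states, log10 population and log10 area.
n = 50
log_pop = rng.uniform(5.5, 7.6, n)      # roughly 300 thousand to 40 million
log_area = rng.uniform(3.0, 5.8, n)     # log10 square miles
ny = np.zeros(n)
ny[0] = 1.0                             # "dummy" variable for New York

# Simulate log10 caseloads with a population elasticity of 1.26, a small
# area effect, and a New York bump of 0.84, as reported in the text.
log_cases = -4.0 + 1.26 * log_pop + 0.05 * log_area + 0.84 * ny \
            + rng.normal(0, 0.15, n)

X = np.column_stack([np.ones(n), log_pop, log_area, ny])
beta, *_ = np.linalg.lstsq(X, log_cases, rcond=None)

# Because both sides are logged, beta[1] is an elasticity; the dummy
# converts to a multiplicative factor of 10**beta[3] (10**0.84 is about 6.9).
```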

Testing, Testing, Testing

In model (2) I add the estimated proportion of the population that has been tested for the virus as of April 17th, a week before the caseload figures. The testing numbers also come from Johns Hopkins. For this measure, and all the proportions that follow, I calculate the “logit” of the estimated proportion. For the testing measure this works out to:

logit(testing) = ln(number_tested/(total_population – number_tested))

The quantity number_tested/(total_population – number_tested) measures the odds that a random person in the state’s population has been tested for the virus. Taking the logarithm of this quantity produces a measure that ranges over the entire number line.
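As a concrete check on the formula, here is a direct translation into code (the 250,000-of-5,000,000 example is hypothetical):

```python
import math

def logit_tested(number_tested, total_population):
    """Log-odds that a randomly chosen resident has been tested."""
    odds = number_tested / (total_population - number_tested)
    return math.log(odds)

# A hypothetical state that has tested 250,000 of 5,000,000 residents:
# odds = 250000 / 4750000 = 1/19, so the logit is ln(1/19), about -2.94.
```

Small testing shares produce large negative logits, a fifty-percent share maps to zero, and shares above one half go positive, which is what lets the measure range over the entire number line.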

Testing has a strong statistical relationship to the number of identified coronavirus cases in a state. Moreover the coefficient has a plausible magnitude.  If we increase testing by one percent, the expected number of cases will grow by 0.4 percent.  In other words, increasing testing at the margin identifies an infection in about forty percent of those newly tested.

Notice how the effect for a state’s physical area declines when testing is accounted for. One apparent reason why large states have fewer cases is that it is more difficult to test for the virus over a larger area.

Finally, when testing is accounted for, the caseload for the state of New York is no different from any other state with its population size and physical area.

We can simulate the effects of testing by imagining a fairly typical state with five million people living on 50,000 square miles of land area, then using the coefficients from model (2) to see how the estimated number of confirmed cases varies with the testing rate. This chart shows how the infection rate, the proportion of the population found to have the virus, increases with the rate at which the population is tested.

If we test only one percent of the state’s population, we will find about 0.1 percent of the population with a COVID infection. If we test five percent of the population, about 0.6 percent of that state’s people will be identified as having the virus.*
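That simulation is easy to reproduce in code. The coefficients below are illustrative stand-ins: only the 0.4 testing elasticity and 1.26 population elasticity come from the text, and the intercept is tuned so that testing one percent of the population yields roughly 0.1 percent confirmed infected, matching the chart.

```python
import math

# Illustrative coefficients in the spirit of model (2): log10(cases) as a
# function of log10(population), log10(area), and logit(testing share).
B0, B_POP, B_AREA, B_TEST = -3.14, 1.26, 0.05, 0.4

def infection_rate(test_share, pop=5_000_000, area=50_000):
    """Predicted share of the population confirmed infected, given a testing rate."""
    logit_test = math.log(test_share / (1 - test_share))
    log_cases = (B0 + B_POP * math.log10(pop)
                 + B_AREA * math.log10(area) + B_TEST * logit_test)
    return 10 ** log_cases / pop
```

For the hypothetical five-million-person state, infection_rate(0.01) comes out near 0.001 and infection_rate(0.05) around 0.005: confirmed infections rise less than proportionally with testing because the elasticity is below one.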

Old Folks in Homes

Now let’s turn to some demographic factors that are thought to increase caseloads. First is the age of the population. In general, older people are thought to be more susceptible to the virus. However, model (3) shows there is little evidence that states with larger proportions of elderly residents have greater caseloads. What does matter, as model (4) shows, is the proportion of a state’s 75-and-older population living in nursing facilities. When the virus gets into one of these facilities, it can run rampant through the resident population and the staff.

Race, Ethnicity, and Location

Reports of higher rates of infection among black and Hispanic Americans appear in these data as well.  In model (5), it appears the effect of larger Hispanic populations is twice that of equivalent black populations.  If we also adjust for the size distribution of a state’s population in model (6), the effect of its proportion Hispanic declines. This pattern suggests that Hispanics are more likely to live in smaller communities than other ethnic groups.

It is important to remember that these analyses apply to states. Finding no relationship between the proportion of a state’s population that is Native American and the state’s number of coronavirus cases does not imply that native populations are more or less at risk.  For that we need data at the individual level where we find that Native populations are more at risk.

I’ve also said nothing about deaths arising from the novel coronavirus.  That is the subject of my next report.



*We have no way of knowing what the “true” number of cases is; we have only the Johns Hopkins figures for “confirmed” cases.

The Politics of Stay-At-Home Orders

On her blog, journalist Marcy Wheeler helpfully tallied the twenty-seven states whose governors have imposed stay-at-home orders during the COVID-19 pandemic. Virginia joined this group late Monday. I have used her data, and figures from Johns Hopkins University on the number of identified cases, to do a quick analysis of the political forces driving the decision to impose such orders.

I used simple binary logit models for these tests.  The predictors include whether each state’s governor is a Democrat, whether its legislature is controlled by Democrats, the February net job approval rating (approve – disapprove) for Donald Trump in each state from Morning Consult, and the number of reported cases in each state as of March 15th and March 30th.  Model (1) below includes all these factors; model (2) includes just the two that proved significant.

As you can see, only two factors proved nominally “significant”: whether the governor is a Democrat, and Trump’s approval rating in the state.  States with Democratic governors, and those where Trump’s net job approval is negative (“underwater”), are more likely to have instituted a stay-at-home policy. Surprisingly, the number of COVID-19 cases did not seem to matter.  (Using the logarithms of the number of cases did not improve things, nor did looking at rates of growth.)
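A minimal sketch of a binary logit fit of this shape, using Newton-Raphson in numpy on invented data (the two predictors mirror the significant ones in the text; the coefficient values and the fifty simulated “states” are made up):

```python
import numpy as np

rng = np.random.default_rng(2)

# Invented data: Democratic-governor dummy and Trump net approval by state.
n = 50
dem_gov = rng.integers(0, 2, n).astype(float)
net_approval = rng.normal(0.0, 15.0, n)
X = np.column_stack([np.ones(n), dem_gov, net_approval])

# Simulate stay-at-home orders from a logit model of the same shape.
true_beta = np.array([0.3, 1.5, -0.08])
p_true = 1.0 / (1.0 + np.exp(-X @ true_beta))
order = (rng.uniform(size=n) < p_true).astype(float)

# Fit by Newton-Raphson (equivalently, iteratively reweighted least squares).
beta = np.zeros(3)
for _ in range(25):
    mu = 1.0 / (1.0 + np.exp(-X @ beta))
    grad = X.T @ (order - mu)
    hess = X.T @ (X * (mu * (1.0 - mu))[:, None])
    beta += np.linalg.solve(hess, grad)

# Predicted policy for each "state": an order is expected where fitted p > 0.5.
pred = (1.0 / (1.0 + np.exp(-X @ beta)) > 0.5).astype(float)
```

Comparing pred with order gives the same kind of predicted-versus-actual comparison the text turns to next.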

Using these results, I have generated the predicted probability that each state will have instituted a stay-at-home order and compared those predictions to the actual policies.

There are ten states where the predicted policy does not match the actual decision.  Thirty-three states are predicted to have imposed stay-at-home orders, but only twenty-eight have done so. Democratic strongholds like California and New York all have predicted probabilities above 0.9. However, Nevada, Maine, Pennsylvania, Massachusetts, and Kentucky should all have instituted stay-at-home policies but have yet to do so.  In contrast, the governors of West Virginia, Idaho, Indiana, Alaska, and Ohio have all instituted such policies despite the political context of their states.

We can use the same set of predictors to estimate the duration of a state quarantine. Here I use a “Tobit” model, which handles dependent variables with zero lower bounds. States without a quarantine are coded zero on the duration variable.

The general pattern here is the same as for whether a quarantine was imposed.  However, the growth in cases between 3/15 and 3/30 has a weak statistical relationship with duration. Because the case figures are expressed as base-10 logarithms, the coefficient of 26.8 implies that a state whose caseload grew by a factor of ten during the latter half of March would impose a quarantine of 26.8 days, other things equal.
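One way to fit a model of this kind in Python is to maximize the censored-normal log-likelihood directly with scipy. The sketch below uses invented data; only the idea of a zero lower bound and the rough size of the 26.8-day case-growth effect come from the text.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(3)

# Invented data: quarantine duration in days, zero for states with no order.
n = 50
dem_gov = rng.integers(0, 2, n).astype(float)
case_growth = rng.normal(1.0, 0.4, n)          # log10 case growth, 3/15 to 3/30
X = np.column_stack([np.ones(n), dem_gov, case_growth])

# Latent durations; observed durations are censored from below at zero.
latent = X @ np.array([-20.0, 15.0, 26.8]) + rng.normal(0.0, 8.0, n)
duration = np.clip(latent, 0.0, None)

def neg_loglik(params):
    """Negative Tobit log-likelihood, left-censored at zero."""
    beta, log_sigma = params[:-1], params[-1]
    sigma = np.exp(log_sigma)                  # parameterize so sigma > 0
    xb = X @ beta
    censored = duration == 0.0
    ll = np.where(censored,
                  norm.logcdf(-xb / sigma),
                  norm.logpdf((duration - xb) / sigma) - log_sigma)
    return -ll.sum()

start = np.array([0.0, 0.0, 0.0, np.log(duration.std() + 1.0)])
res = minimize(neg_loglik, start, method="BFGS")
beta_hat = res.x[:-1]
```

The Tobit treats the zeros as censoring rather than as true zero-day quarantines, which is why it suits a duration variable bounded below; the coefficient on case growth then carries the extra-days-per-factor-of-ten interpretation given above.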