Polls come in three flavors depending on the population being sampled. Many polls sample all adults, but election polls typically sample either self-reported registered voters, or some subset of the registered voters they call “likely voters.” Reputable pollsters identify likely voters with a scoring method based on a variety of respondent characteristics known to correlate with voting. Pew presents detailed descriptions of its methodologies, including how likely voters are determined.
Polls of registered voters are well known to show a pro-Democratic bias. Democrats come from less participatory cohorts like young people and often face greater institutional challenges when trying to exercise their franchise. In a recent post at 538, Nate Silver estimated the difference between polls of registered voters and polls of likely voters at a swing of about 1.5 percentage points against the Democrats.
Meanwhile I have been looking into the dynamics of the races by estimating the relationship between the candidates’ relative standings in a poll and time before Election Day. I designate a “polling day” as the date when fieldwork was completed. All polls ending on the same polling day are averaged together, though separate averages are computed for polls of likely voters and polls of registered voters. As before, I will limit my analysis to polls whose fieldwork ended after the Fourth of July holiday. At this point in the campaign both candidates have been determined while the conventions are over a month away.
I’m going to use my dataset of daily aggregates since that lets us visualize the results on a day-by-day basis. I will come back to the poll-by-poll data momentarily. That gives us a sample of 55 days so far in 2012, between July 4th and October 6th, on which at least one poll was taken. Of these, 32 (58%) include only polls of “likely voters.” The remaining data points are based on polls of the larger population of registered voters. In 2008, there were 85 qualifying polling days, again with 58% (49) of the daily averages based on polls of likely voters. I begin with a simple model that predicts the size of the President’s lead over his Republican challenger based simply on whether the data point represents only likely voters (LV), and how many days in advance of the election the poll was taken (DaysBefore).
OLS, using observations 1-140
Dependent variable: Lead for Obama in voting intention

              coefficient    std. error    t-ratio    p-value
  ------------------------------------------------------------
  const        7.01470       0.795205       8.821     4.64e-15  ***
  LV          −1.58633       0.577574      −2.747     0.0068    ***
  DaysBefore  −0.0421944     0.00892010    −4.730     5.51e-06  ***

Mean dependent var   3.416643    S.D. dependent var   3.418385
Sum squared resid    1383.196    S.E. of regression   3.177471
R-squared            0.148417    Adjusted R-squared   0.135985
The constant shows the estimated size of Obama’s lead on Election Day when the “DaysBefore” variable is, by definition, zero. Among registered voters a poll taken on Election Day in either 2008 or 2012 should show a seven-point lead for Obama. If a poll includes only likely voters, Obama’s lead will shrink by about 1.6 percentage points, essentially the same as the 1.5-point figure reported by Nate Silver using a different method.
The regression shows the President’s fortunes improving every day. The negative coefficient of -0.04 for “DaysBefore” means that polls taken closer to Election Day show a larger lead for Barack Obama than polls taken earlier in the summer. On average, Obama’s lead over his Republican opponent has increased a full percentage point every 25 days (since 25*0.04=1.00) since the Fourth of July in both 2008 and 2012.
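The arithmetic in the last two paragraphs can be checked directly from the reported coefficients. The sketch below is only an illustration: the function names are my own, and the numbers are copied from the regression table above rather than re-estimated. Note that the unrounded DaysBefore coefficient implies a point gained a little faster than the rounded 25-day figure.

```python
# Coefficients copied from the first regression table above (not re-estimated).
CONST = 7.01470
LV_COEF = -1.58633
DAYS_COEF = -0.0421944


def predicted_lead(days_before: float, likely_voters: bool) -> float:
    """Fitted Obama lead in percentage points under the parallel-lines model."""
    lv = 1.0 if likely_voters else 0.0
    return CONST + LV_COEF * lv + DAYS_COEF * days_before


def days_per_point(slope: float) -> float:
    """Days needed for the fitted lead to move by one percentage point."""
    return 1.0 / abs(slope)


if __name__ == "__main__":
    for days in (90, 30, 0):
        rv = predicted_lead(days, likely_voters=False)
        lv = predicted_lead(days, likely_voters=True)
        print(f"{days:>3} days out: RV {rv:5.2f}  LV {lv:5.2f}  gap {rv - lv:4.2f}")
    print(f"days per point: {days_per_point(DAYS_COEF):.1f}")
```

Because LV enters only as a dummy, the gap between the two fitted leads is the same on every day.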
Most studies find a pro-Democratic bias in polls of registered voters. Pew, for instance, shows much larger Democratic margins among its samples of registered voters than among likely voters. But a little further digging into this data reveals a more complex relationship than a simple difference of 1.5 points.
The model above “bakes in” the assumption of a flat likely-voter effect by including just the LV dummy variable. That produces a graph with two parallel lines like these:
The President’s lead grows over the entire four months by the same amount each day. Polls of registered voters show a lead for Obama about 1.6 points larger than polls of likely voters on any given day.
Suppose we removed the constraint that registered voter polls and likely voter polls show the same trajectory over the final four months of the campaign. What if we estimate separate trends for the two groups? We can test whether the trends differ by including an “interaction” term that multiplies together the zero/one “dummy” variable LV and the continuous variable DaysBefore. The product (DBLV below) measures how the trend effect for likely voters differs from that found for registered voters. Including that variable gives us a much different picture of how the dynamics of the campaign play out in the two different types of polling samples:
Model 6: OLS, using observations 1-140
Dependent variable: Lead for Obama in voting intention

              coefficient    std. error    t-ratio    p-value
  ------------------------------------------------------------
  const        4.65136       1.13537        4.097     7.16e-05  ***
  LV           1.85189       1.33160        1.391     0.1666
  DaysBefore  −0.0111532     0.0139398     −0.8001    0.4251
  DBLV        −0.0508243     0.0178371     −2.849     0.0051    ***

Mean dependent var   3.416643    S.D. dependent var   3.418385
Sum squared resid    1305.274    S.E. of regression   3.098000
R-squared            0.196390    Adjusted R-squared   0.178664
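As a reading aid, here is a minimal sketch of how the interaction term is built and how group-specific trends fall out of the Model 6 coefficients. The function and variable names are hypothetical, and the numbers are simply copied from the table above. The slope for registered voter polls is the DaysBefore coefficient alone, while the slope for likely voter polls is the sum of DaysBefore and DBLV.

```python
# Coefficients copied from the Model 6 table above (not re-estimated).
CONST = 4.65136
LV = 1.85189
DAYS_BEFORE = -0.0111532
DBLV = -0.0508243


def interaction(lv: int, days_before: float) -> float:
    """DBLV regressor: equals DaysBefore for likely-voter rows, zero otherwise."""
    return lv * days_before


def fitted_lead(lv: int, days_before: float) -> float:
    """Fitted Obama lead in percentage points under Model 6."""
    return (CONST
            + LV * lv
            + DAYS_BEFORE * days_before
            + DBLV * interaction(lv, days_before))


# Implied daily trends for each kind of poll.
rv_slope = DAYS_BEFORE         # registered-voter polls
lv_slope = DAYS_BEFORE + DBLV  # likely-voter polls

if __name__ == "__main__":
    print(f"RV slope per day: {rv_slope:.4f}")
    print(f"LV slope per day: {lv_slope:.4f}")
    print(f"LV days per point: {1.0 / abs(lv_slope):.1f}")
```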
Now we see a very different picture of the campaign dynamics. The trend effect appears to happen only in the likely voter polls. The effect for DaysBefore in registered voter polls is -0.01, which is statistically indistinguishable from zero. Likely voter polls showed a substantial pro-Obama trend. In likely voter polls the President gains a full percentage point roughly every twenty days (1/0.05), or closer to every sixteen if the small DaysBefore coefficient is added in. Omitting the pure DaysBefore variable and leaving only LV and DBLV produces a rather surprising result:
Model 7: OLS, using observations 1-140
Dependent variable: Lead for Obama in voting intention

              coefficient    std. error    t-ratio    p-value
  ------------------------------------------------------------
  const        3.80220       0.402795       9.440     1.33e-16  ***
  LV           2.70105       0.803157       3.363     0.0010    ***
  DBLV        −0.0619776     0.0111139     −5.577     1.26e-07  ***

Mean dependent var   3.416643    S.D. dependent var   3.418385
Sum squared resid    1311.418    S.E. of regression   3.093929
R-squared            0.192608    Adjusted R-squared   0.180821
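One way to read Model 7 is to write out the two fitted lines it implies. The sketch below uses only the coefficients in the table above, with function names of my own choosing. It treats the registered voter line as flat (there is no trend term for that group in this specification) and computes the number of days before the election at which the two fitted lines would meet.

```python
# Coefficients copied from the Model 7 table above (not re-estimated).
CONST = 3.80220
LV = 2.70105
DBLV = -0.0619776


def fitted_lead(likely_voters: bool, days_before: float) -> float:
    """Fitted Obama lead in percentage points under Model 7."""
    if likely_voters:
        return CONST + LV + DBLV * days_before
    return CONST  # registered-voter polls: no trend term in this model


def crossover_days() -> float:
    """Days before the election at which the two fitted lines meet."""
    return LV / (-DBLV)


if __name__ == "__main__":
    for days in (90, 60, 30, 0):
        rv = fitted_lead(False, days)
        lv = fitted_lead(True, days)
        print(f"{days:>3} days out: RV {rv:5.2f}  LV {lv:5.2f}")
    print(f"lines cross about {crossover_days():.1f} days before the election")
```

Earlier than the crossover, the fitted likely voter lead sits below the registered voter lead; later than it, the ordering reverses.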