In the past few weeks, Pollster has begun reporting multiple results for a single poll. Some polling organizations have been reporting separate results for Democratic, Republican, and independent respondents, as well as the aggregated data for all respondents. They have also begun providing detailed information on the question(s) asked to determine voting intention. Pollster reports separate results for each question wording.
Since all my analyses use just one entry per poll, I have begun removing this extra data before analysis. Unless specifically stated, I am using only the first “question iteration” for each poll (coded “1” at Pollster) and only data for the entire population. Using the first iteration helps ensure consistency across all the polls from a single organization.
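As a rough illustration of that filtering step, here is a minimal Python sketch. The field names ("poll_id", "question_iteration", "subpopulation") and the label "all" are hypothetical stand-ins, not Pollster's actual schema; only the code "1" for the first iteration comes from the description above.

```python
def one_entry_per_poll(rows):
    """Keep one row per poll: first question iteration, whole population.

    `rows` is a list of dicts. The keys used here are hypothetical
    and would need to be matched to the real data file.
    """
    kept = {}
    for row in rows:
        if row["question_iteration"] != 1:   # first iteration only
            continue
        if row["subpopulation"] != "all":    # hypothetical label for all respondents
            continue
        # If a poll still has duplicates, keep the first one encountered.
        kept.setdefault(row["poll_id"], row)
    return list(kept.values())
```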
I have expanded the list of states that might play a role in determining the outcome of the Presidential vote in the fall. For each state in the list below, I have compiled all the available polls at Huffington Post Pollster and calculated the percent of polls in which Clinton held a lead. For each state I then calculated a statistic called “chi-squared” to see whether her lead was sufficiently consistent to conclude she was truly ahead in the state. Here are the results through today:
In Wisconsin, Hillary Clinton has led in every poll conducted in the state dating back to last fall. She has nearly as impressive a lead in both Michigan and Pennsylvania, both states typically mentioned as targets for Donald Trump’s “rust-belt” strategy. In those two states there is less than one chance in twenty that Clinton is truly behind given the number of polls in which she held the lead. In the remaining states the results are still too mixed to draw any conclusions about which candidate is in the lead. Clinton does especially poorly in the traditionally Republican states of Arizona and Georgia, but there haven’t been enough polls taken to draw any conclusions there. The other states remain toss-ups.
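For readers who want to see the arithmetic, here is a minimal Python sketch of one way to run such a test: a chi-squared goodness-of-fit test of the lead counts against an even split, with one degree of freedom. Treating "no true lead" as a 50/50 split and leaving tied polls out of the count are my assumptions for this sketch, and the function name is made up for illustration.

```python
import math

def lead_chi_squared(leads, trails):
    """Chi-squared test of lead counts against an even 50/50 split.

    leads  -- number of polls in which the candidate was ahead
    trails -- number of polls in which the candidate was behind
    Returns (chi-squared statistic, two-sided p-value) with 1 degree of freedom.
    """
    n = leads + trails
    if n == 0:
        raise ValueError("need at least one poll with a leader")
    expected = n / 2.0  # under an even split, half the polls go each way
    chi2 = ((leads - expected) ** 2 + (trails - expected) ** 2) / expected
    # With 1 degree of freedom, the chi-squared tail probability
    # equals erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value
```

A p-value under 0.05 corresponds to the "one chance in twenty" threshold. With only a handful of polls the chi-squared approximation is rough, which is one reason states with few polls cannot support a conclusion.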
Standard ordinary least squares regression assumes that the error term has the same variance across all the observations. When the units are polls, we know immediately that this assumption will be violated. The error in a poll is inversely proportional to the square root of its sample size. The “margin of error” that pollsters routinely report is roughly twice the standard error of estimate evaluated at 50%, the worst case with the largest possible variance. That comes from the well-known statistical formula
SE(p) = sqrt(p[1-p]/N)
where N is the sample size. This formula reaches its maximum at p=0.5 (50%), making the standard error 0.5/sqrt(N).
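The formula is easy to check numerically. The short Python sketch below computes the standard error and the worst-case margin of error; the function names are mine, and the default multiplier of 2 is the rounded value used in the text (1.96 is the more exact figure for a 95% interval).

```python
import math

def standard_error(p, n):
    """Standard error of a proportion p from a sample of size n."""
    return math.sqrt(p * (1.0 - p) / n)

def margin_of_error(n, z=2.0):
    """Worst-case margin of error: z standard errors evaluated at p = 0.5."""
    return z * standard_error(0.5, n)
```

For a poll of 1,000 respondents this gives 2 × 0.5/sqrt(1000), or about 3.2 percentage points.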
Weighted least squares adjusts for these situations where the error term has a non-constant variance (technically called “heteroskedasticity”). To even out the variance across observations, each one is weighted by the reciprocal of its estimated standard error. For polling data, then, the weights should be proportional to the reciprocal of 1/sqrt(N), or just sqrt(N) itself. I thus weight each observation by the square root of its sample size.
More intuitively, we are weighting polls by their sample sizes. Because we first take the square roots of the sample sizes, however, the weights grow more slowly as samples get larger, just as the accuracy of the estimate does.
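Here is a minimal Python sketch of the weighting for a simple straight-line fit. It reads "weight each observation by sqrt(N)" as multiplying each poll's row by sqrt(N) before an ordinary least squares fit, which is the same as weighting each squared residual by N. That reading is an assumption on my part, since statistical packages define their weight options differently, and the function name is made up for illustration.

```python
def wls_line(x, y, n):
    """Fit y = a + b*x by weighted least squares.

    x, y -- predictor and outcome for each poll
    n    -- sample size of each poll
    Scaling every observation by sqrt(n_i) and then running ordinary
    least squares minimizes sum(n_i * residual_i**2), so the weight on
    each squared residual is n_i.
    """
    w = list(n)  # squared-residual weights: (sqrt(n_i))**2 = n_i
    sw = sum(w)
    # Weighted means of x and y
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    # Weighted sum of squares and cross-products
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    b = sxy / sxx
    a = ybar - b * xbar
    return a, b
```

If every poll had the same sample size, the weights would cancel and the result would match ordinary least squares.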
As in every Presidential election, the outcome will be determined by a very small number of states. As I did in 2012, I have compiled the polls in these “swing” states and counted up the number of times Hillary Clinton or Donald Trump was in the lead. I have included every poll conducted so far that includes both candidates; the oldest poll was taken in late June of 2015. I intend to update these results limiting them to only recent polls as the election nears.
Two states – Michigan and Pennsylvania – have supported Hillary Clinton consistently enough that there is just a small chance, less than one in twenty, the race is actually tied or she is behind Donald Trump in those states. The other four states remain toss-ups.
Pennsylvania tempts Republicans to compete there every election cycle, and this one is no exception. Still, the state has trended Democratic in Presidential elections since the late 1960s.