Behind the Webby Curtain
The Webby Awards’ process presents a unique challenge for voting design, particularly during the first round, when the finalists are chosen. This round is where The Center for Election Science (CES) played its largest role. Over a thousand judges evaluated over ten thousand websites across hundreds of categories. Obviously, it was impossible for every judge to look at every website. So how do we make a system where the best sites rise to the top and each judge's precious time is used effectively?
CES was brought in to help the Webbys pursue two related goals: (1) quality outcomes and (2) effective use of judges' time. These goals are common to many voting contexts. We determined that score voting would make the most sense here. Under this method, voters rate each candidate on a scale, and the candidate with the highest average rating wins.
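To make the method concrete, here is a minimal sketch of score voting in Python. The ballots, site names, and 0-to-10 scale below are hypothetical, not the Webbys' actual data.

```python
# A minimal sketch of score voting. Each ballot rates the sites a judge saw
# on a 0-to-10 scale; the site with the highest average rating wins.
# The ballots and site names below are hypothetical.
ballots = [
    {"Site A": 8, "Site B": 5, "Site C": 9},
    {"Site A": 7, "Site B": 6, "Site C": 8},
    {"Site A": 9, "Site B": 4, "Site C": 6},
]

def score_voting_winner(ballots):
    totals, counts = {}, {}
    for ballot in ballots:
        for site, score in ballot.items():
            totals[site] = totals.get(site, 0) + score
            counts[site] = counts.get(site, 0) + 1
    averages = {site: totals[site] / counts[site] for site in totals}
    return max(averages, key=averages.get), averages

winner, averages = score_voting_winner(ballots)
# Here "Site A" averages 8.0 and wins.
```

Note that averaging (rather than summing) is what lets the method cope with judges who rated different numbers of sites.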
We had to do more than just identify score voting as the right voting method because, as mentioned above, it would have been impossible for every judge to rate every site. This raised a couple of issues:
1. Variation in average ratings across judges.
Some judges would be tougher than others, giving lower ratings for the same quality, while other judges would be easier. Further, by chance alone, some judges would see more low-quality sites while others would see more high-quality ones. So when one judge gave lower average ratings, it was impossible to know whether that judge was a tough rater or just happened to be seeing inferior sites.
2. Efficiently allocating judges' attention.
We also wanted to be precise about the best sites’ quality. The difference between 1st and 3rd place matters a lot more than the difference between 101st and 103rd, so we focused judges’ attention on the top sites. On the other hand, each individual rating had some amount of idiosyncrasy and luck involved, so we didn’t want to ignore a site forever just because it got one mediocre rating.
A probabilistic approach
For both these issues, statistical thinking let us do a few things. First, it let us build a model that told us which factors might lead ratings to vary. Second, it let us fit that model using real-world data to estimate the relative importance of those factors. Finally, we could use the fitted model to estimate true site quality with minimal error.
The goal of the first round was to identify the top 20% of works entered for the shortlist. Ranking wasn’t so important in all this, but it was still crucial not to eliminate potential winners. Thus, the system focused most of its attention on sites closest to the cutoff.
To advance the right sites, we had to figure out how to prioritize when and how often judges evaluated them. We could have been really accurate if judges rated every work in every category, but with hundreds and hundreds of entries in some categories, that wasn’t realistic. A judge’s time was best spent if their ratings were key to placing a site on the correct side of the cutoff.
If a site was on the wrong side of the cutoff, then we needed to add another rating to fix it. But we couldn’t possibly know exactly which sites we were wrong about, or whether one more rating could fix that mistake. What we could do, however, was use statistics to find the probability of these fixable mistakes for each site. Then we could send judges to the sites where that probability was highest.
The resulting system
To get the probability that a site was on the correct side of the cutoff, we had to make a few assumptions. We assumed that each rating was a combination of three factors: (1) a site’s intrinsic quality, (2) the toughness of the judge who gave that rating, and (3) random error representing how much more or less that judge liked the site compared to an average judge.
We used all the ratings we had seen so far to keep track of an estimated average and standard deviation for each of these three factors. Using these parameters, we could estimate the toughness of each judge, use that estimated toughness to correct the ratings and estimate the true quality of each site, and figure out the expected error in each of our quality estimates. All this let us estimate the probability that an additional rating could place a site on the correct side of the cutoff.
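As an illustration of that pipeline, here is a deliberately simplified stand-in (not CES's actual model, and with made-up ratings): estimate each judge's toughness as their deviation from the grand mean, add it back to correct each rating, then average the corrected ratings per site and compute a standard error.

```python
import math
from collections import defaultdict

# Simplified illustration of the three-factor idea:
#   rating = site quality + judge toughness + noise.
# The (judge, site, score) triples below are hypothetical.
ratings = [
    ("judge1", "siteA", 6), ("judge1", "siteB", 4),
    ("judge2", "siteA", 9), ("judge2", "siteC", 7),
    ("judge3", "siteB", 5), ("judge3", "siteC", 8),
]

grand_mean = sum(score for _, _, score in ratings) / len(ratings)

# Judge toughness: positive when a judge rates below the grand mean.
by_judge = defaultdict(list)
for judge, _, score in ratings:
    by_judge[judge].append(score)
toughness = {j: grand_mean - sum(s) / len(s) for j, s in by_judge.items()}

# Correct each rating for its judge's toughness, then summarize per site.
by_site = defaultdict(list)
for judge, site, score in ratings:
    by_site[site].append(score + toughness[judge])

quality, std_err = {}, {}
for site, corrected in by_site.items():
    n = len(corrected)
    mean = sum(corrected) / n
    var = sum((c - mean) ** 2 for c in corrected) / n
    quality[site] = mean
    std_err[site] = math.sqrt(var / n)  # shrinks as a site gets more ratings
```

The standard error is the key output: it tells us how much an estimate might move with more ratings, which feeds the prioritization described next.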
Once that was all done, we listed the sites in descending order of that probability: the probability that the site was on the wrong side of the cutoff and that an additional rating could fix it. Each time a judge asked for a new site to look at, they got the next site from the top of the list.
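A sketch of how such a queue could be ordered, assuming hypothetical quality estimates, standard errors, and cutoff, and treating each site's true quality as normally distributed around its estimate. Sites whose side of the cutoff is most uncertain (probability nearest one half) go to the front.

```python
import math

def prob_above_cutoff(estimate, std_err, cutoff):
    # Normal CDF Phi((estimate - cutoff) / std_err), via math.erf.
    z = (estimate - cutoff) / std_err
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2)))

# Hypothetical (quality estimate, standard error) pairs and cutoff score.
sites = {
    "siteA": (7.5, 0.4),
    "siteB": (5.2, 0.6),
    "siteC": (6.6, 0.5),
}
cutoff = 6.5

def uncertainty(site):
    # Highest when the site could plausibly fall on either side of the cutoff.
    p = prob_above_cutoff(*sites[site], cutoff)
    return min(p, 1.0 - p)

queue = sorted(sites, key=uncertainty, reverse=True)
# "siteC" sits closest to the cutoff, so it is rated next.
```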
In practice, that meant that we had to first send different judges to rate each site once. That’s because knowing nothing about a site maximizes the chances of being wrong about it. After that, this system balanced between looking at the sites with the fewest ratings and focusing on sites closest to the finalist cutoff.
This system ensured that the sites closest to the cutoff got the most ratings while those with hopelessly poor ratings got the fewest. Of course, there were many sites in between those extremes that received an average number of ratings.
This was an interesting problem. And right out of the first round! Developing a sophisticated queue system was challenging, but the voting method itself, score voting, was as straightforward as it gets.
In contrast, you can imagine how strange it would have been to limit judges to picking just one of the sites they saw—an approach we see unfortunately often. With so little information, the results would have been meaningless. Ranking sites wouldn’t have worked at all, either: since judges saw different, mostly non-overlapping subsets of sites, their rankings couldn’t have been meaningfully combined. Score voting, however, was just the right fit—albeit in the context of a sophisticated (and necessary) system.
Now aren’t you excited to see the winners? To view the 2015 Webby Awards winners, go here.
There are various approaches to the statistical process of model building and fitting. None is simple, but we used the simplest available: a model involving only normal distributions (that bell-shaped curve) and a quick-and-dirty fitting process called “moment matching,” which ensures that our theoretical estimates match what we saw in real life during the Webby contest.
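For a flavor of what moment matching means under a normal model (with made-up ratings data), fitting amounts to nothing more than setting the model's mean and variance equal to the sample's:

```python
# Moment matching a normal distribution: choose the parameters so that the
# model's first two moments (mean and variance) equal the sample's.
# The ratings below are hypothetical.
data = [6, 7, 5, 8, 7, 6, 9]

mu = sum(data) / len(data)
sigma2 = sum((x - mu) ** 2 for x in data) / len(data)
# Normal(mu, sigma2) now reproduces the observed mean and variance exactly.
```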