The forces of ratings inflation
Why critics and aggregators skew high
In this article, I speculate about why critics and audiences generally rate quality highly, whether it’s the quality of movies, consumer goods, businesses or human beings. A “high” rating here is any rating above the midpoint of any scale.
I write reviews as a hobby. In a 2021 statistical analysis, I found that average movies are rated highly on IMDb, and to a lesser extent on Rotten Tomatoes and Metacritic, three sites where I am not a contributor.
51% of all ratings on Yelp are 5 out of 5 stars.1 The tendency to go high is nearly universal on consumer-facing sites where things are rated for quality. For example, without having analyzed comprehensive statistics, it looks like BoardGameGeek skews high like IMDb. Both sites collect ratings from an open pool of users, with little verification.
Off the top of my head and in no particular order, I think ratings skew high for the following reasons.
When a user hovers over a potential rating on BoardGameGeek, they see the following guidelines for interpretation of the site’s 10-point scale:
1. Awful - defies game description.
2. Very bad - won’t play ever again.
3. Bad - likely won’t play this again.
4. Not so good - but could play again.
5. Mediocre - take it or leave it.
6. Ok - will play if in the mood.
7. Good - usually willing to play.
8. Very good - enjoy playing and would suggest it.
9. Excellent - very much enjoy playing.
10. Outstanding - will always enjoy playing.
These are typical. IMDb has no official guidelines, but IMDb users describing their own personal interpretations of the scale tend to do it in similar terms.2
Notice the use of two disparaging adjectives for normality around the midpoint of the BoardGameGeek scale. The word “mediocre” appropriately denotes “average”, but it has a negative connotation. Instead of being a neutral term for a neutral part of the scale, it suggests that the average is itself worse than some other standard of measurement. The term “OK” probably originated as a humorously bad abbreviation of “all correct”, denoting something that is right in every meaningful way, but here, “OK” is obviously worse than good. “Good” appears past the midpoint of the BoardGameGeek scale, and “enjoy” even further along, at 8/10.
Such guidelines, viewed in isolation, suggest one of two things about their author. Either the author expects ratings to skew high, as in fact they do, or the author believes that good experiences are rare. Ratings suggest the latter belief to be irrationally negative, but then again, official guidelines don’t seem to influence ratings. I expect a small effect size from them, even in contexts like BoardGameGeek’s user interface, where the guidelines are readily available.
Metacritic’s stated goal is to “help consumers make an informed decision about how to spend their time and money”, a goal that is not served by skewing in either direction, nor by rating on a curve. Nonetheless, Metacritic processes ratings by the undocumented “stature” of critics to produce a weighted “Metascore”. They also make arbitrary editorial choices in the interpretation of individual critics’ scales, which reduces diversity of opinion compared to Rotten Tomatoes. Finally, Metacritic moves the window of “general meaning” based on the medium of a work.3
It is not a stated editorial policy at Metacritic to skew high. However, there is money at stake,4 and movie ratings on Metacritic are higher than on Rotten Tomatoes. It is not clear whether conscious ratings inflation on either site would be sustainable as a business practice, but as long as neither editorial policy nor raw data are public, Metacritic could get away with systematic inflation if its staff or owners wanted to. Lots of reviewers do.5
Given that Amazon owns IMDb, Prime Video, and the Amazon Original brand of streaming video, there is a monetary incentive toward corruption in the case of IMDb. Hypothetically, the aggregator would pick its secret policy of interpretation in whatever way maximizes company profit while minimizing the risk of getting caught without a plausible excuse. This is, of course, easier on Amazon Marketplace, where ratings (“Amazon Seller Rating”) are extensively weighted by policy, and skew high. Amazon itself is rated highly by Amazon, on Amazon.
Sites like IMDb and BoardGameGeek, where reviewers are poorly authenticated, use secret algorithms in part to counteract campaigns to skew ratings. One example is Ghostbusters (2016), which is marked by IMDb with the following note:
Our rating mechanism has detected unusual voting activity on this title. To preserve the reliability of our rating system, an alternate weighting calculation has been applied.
The vote breakdown page for that specific movie shows a conspicuous plurality of bottom ratings and an unusually large proportion of top ratings, on either side of what is otherwise a standard distribution centered on 6. There is no doubt that IMDb is correct in identifying that voting activity as unusual. It is dishonest; a kind of disinformation.
Trying to compensate for dishonest voting on a statistical level is a cheap way to maintain integrity. It means that IMDb does not have to take expensive or commercially risky actions such as singling out trolls, shills and other abusers. A minor downside is that the secret algorithm can skew ratings. The arithmetic mean of all votes on the Ghostbusters remake is somewhere below 6, whereas the official IMDb rating as of 2021-11 is 6.5, higher than the mean of all votes and higher than the mean of the subset of votes that follow the standard distribution. In this specific case, the secret algorithm to fight abuse skews high.
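The arithmetic behind that skew can be sketched in a few lines of Python. The vote counts below are invented for illustration and are not IMDb's data for any title. The filter is a crude, hypothetical stand-in, since IMDb's actual weighting is secret. The sketch shows only that when the surplus of bottom votes is larger than the surplus of top votes, discounting both extremes moves the result upward.

```python
# Hypothetical vote histogram (rating -> number of votes), invented for
# illustration. It mimics the shape described above: a plurality of 1s,
# a large share of 10s, and a hump centred on 6 in between.
votes = {1: 30, 2: 2, 3: 4, 4: 8, 5: 12, 6: 16, 7: 12, 8: 8, 9: 4, 10: 18}


def mean_rating(histogram):
    """Arithmetic mean of a rating -> count histogram."""
    total_votes = sum(histogram.values())
    weighted_sum = sum(rating * count for rating, count in histogram.items())
    return weighted_sum / total_votes


def without_extremes(histogram, lowest=1, highest=10):
    """Drop all votes at the bottom and top of the scale.

    Assumption: this is NOT IMDb's algorithm, only a simple example of a
    statistical filter aimed at suspicious voting.
    """
    return {r: c for r, c in histogram.items() if r not in (lowest, highest)}


raw_mean = mean_rating(votes)
filtered_mean = mean_rating(without_extremes(votes))
print(f"raw mean: {raw_mean:.2f}")
print(f"mean without 1s and 10s: {filtered_mean:.2f}")
```

With these made-up counts the filtered mean comes out higher than the raw mean, because more 1s than 10s are removed. A filter that treated the two extremes differently could move the result further in either direction.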
I have presented the possibility that editorial policy could skew ratings. On its face, this possibility is neutral. If, hypothetically, IMDb really worked to drive business to Amazon Original content through ratings—which I do not believe is the case—they could do it by deflating other providers, just as well as by inflating their own. However, there are psychological factors that would drive a cheater to skew high.
In a 1996 speech, Alan Greenspan coined the term “irrational exuberance”, referring to a stock-market bubble at that time. The phrase is now one of many names for an underlying psychological tendency in investors. Briefly put, humans are optimistic. We are vulnerable to wishful thinking, especially in a social context. It feels good to be part of a group that likes the same things and is moving onward and upward. As a corollary, the more things that seem to be well liked, the more such groups you will belong to. Unlike stocks, ratings are free, so ratings skew high.
On the stock market, irrational exuberance can be self-sustaining. A stock can come to resemble a currency, the value of which is a function of collective faith rather than the underlying material (“real”) economy. There is a parallel in movie reviews. If a strong majority seems to love The Shawshank Redemption (1994), it takes effort to disregard the balance of public opinion.
Exuberance is conformistic. Exuberance is also counterproductive. Stock markets crash because of exuberance. Consumers are ill served by generous, optimistic critics who tend to rate high, because that makes it hard to compare two of their recommendations. Like any symbol, a rating can only be meaningful in relation to other symbols. If 100% of businesses on Yelp had 5 out of 5 stars, this would not make all businesses good; instead it would put Yelp out of business.
Political campaigns can “go negative” on opponents, but corporations promoting their goods rarely do this. Marketers are happy to undermine the self-confidence of consumers to manufacture a sense of need, but it is more efficient for them to be positive about what they’re selling than to be negative about each of their competitors. IMDb advertises Prime Video because they are both owned by the same megacorp, but IMDb does not need to disparage other providers, which would risk lawsuits and bad PR. The same drivers work in an informal social context.
Most cultures value people who are exuberant and passionate, associating such people with vivacity. To be generally enthusiastic in this particular way is a signal of genetic fitness and tribal utility. To skew high in one’s individual rating on IMDb is a cheap way to send that signal. If you skew low, you send the opposite signal and you will be perceived as unkind and gloomy, a “negative ninny” who is not on the team.
The effects are exaggerated where there is a personal connection. You are more likely to say you liked a book if you got it from a friend, or that friend wrote it. To skew high feels inoffensive, even when it’s the norm, so cowardice drives up ratings.
On marketplaces like eBay, where individual buyers and sellers rate each other directly, skewing high is a strong norm. “Likes” on YouTube are far more common than “Dislikes”, and since 2021, the number of dislikes is hidden to protect creators’ feelings. There is only room for positive vibes. Even on IMDb, users may feel unkind to creators when they give a negative review, instead of feeling kind toward their fellow consumers. This is somewhat mitigated among professional critics, who can theoretically get the same amount of page views and professional respect for positive as for negative reviews. Working against irrational exuberance and keeping all ratings meaningful is part of the job description.
One analyst has speculated that BoardGameGeek is “running out of room” on its scale6 because games have gotten better. The baseline can certainly shift. IMDb displays near-constant growth in its ratings going back to the invention of cinema. That growth may come from an escalating exuberance, aided by ratings being determined nearer the height of release hype since the launch of IMDb, but it can also come from real improvements, including better technology and technique in production and presentation, as in board games and video games.
Theoretically, if IMDb scrapped all of its ratings and relaunched today, users would still rate recent works higher than old ones, but the median rating might sink by a couple of decimal points because viewers nowadays have higher expectations of the average work than they did in 1993, when IMDb moved to the web. Actively changing the scale, such as by adding more stars to IMDb, is no more attractive.
The exhaustion of quality with experience
According to an earlier study I made of another ratings site, Filmtipset, more experienced critics rate movies lower. Ratings on Filmtipset skew high for those users who’ve rated fewer than 1000 movies, and then start skewing low. They go lower than the midpoint, which is lower than the median rating on IMDb in any era, and lower even than Rotten Tomatoes or Metacritic.
This could be because more experienced viewers gradually run out of good movies. This is probably true among the extreme users, but there must be a counterbalancing effect where only those with an exuberant love of the medium—who therefore tend toward higher ratings—make the effort to watch, evaluate and rate a thousand feature films. There must also be another effect whereby dopamine levels decline with familiarity, regardless of quality. In any case, the inexperience of most amateurs is a plausible reason why ratings often skew high, but only on relatively open sites where inexperience is the norm.
Unlike IMDb, Rotten Tomatoes and Metacritic show generally higher ratings for older movies, with a lot of noise. This seems to contradict the quality hypothesis, but doesn’t. Nostalgia and didacticism are probably factors in it, but most of the discrepancy is instead a result of selection bias. The professional critics whose opinions are the basis of RT and Metacritic rarely publish about older works, and when they do, it is generally the most lauded classics which are remastered, rereleased, recirculated at revivals, etc. Mediocre older works are barely reviewed at all.
Selection bias is also big outside the circle of professional critics. IMDb, which skews higher than RT or Metacritic, also lists a broader range of movies and series for review, but it’s only a tiny fraction of all the moving pictures ever produced. About 500 hours of video are uploaded to YouTube each minute. Much more is filmed and never uploaded anywhere. Among countless amateur productions, home video and CCTV, there is something for everyone, but if all of it were rated on IMDb by all those who saw it, the median rating would not skew high.
The same is true in games. Poorly designed and obscure board games like Bamse och Dunderklockan (2018), a tie-in to Bamse and the Thunderbell (2018), do not appear on BoardGameGeek at all and would be rated poorly if they did. When I rate a game on BoardGameGeek, I still cannot avoid comparing it to all of my other memories of board games, including the ones that are so bad that nobody has bothered adding them to the database. There is a horizon of notability and effort underneath selection bias that hides a lot of chaff from view. It’s consumed, but it isn’t rated. Call it the data-entry bias.
In an even larger context, consumers actively seek pleasure. They watch those movies that they expect to enjoy, and they’re usually right in predicting their own pleasure. They never see the mediocre movies that would, hypothetically, fill out the bell curve on IMDb and send the median rating closer to the midpoint of the scale. However, it is an open question as to what extent users on IMDb actually compare the works they rate to inferior works they know about but do not rate. It is another open question whether they try to compensate for the former effect.
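The pleasure-seeking effect can be illustrated with a small simulation. This is a sketch under loudly stated assumptions, not a model fitted to any real data. It assumes that enjoyment of all available movies is centred on the midpoint of a 10-point scale, that viewers predict their own enjoyment with some error, and that they only watch and rate what they predict to be above a threshold. All of the numbers (5.5, 1.5, 1.0, 6.0) are hypothetical.

```python
import random
import statistics


def simulate(n_movies=10_000, threshold=6.0, seed=0):
    """Return (median over all movies, median over watched movies)."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    all_ratings = []
    watched_ratings = []
    for _ in range(n_movies):
        # Assumption: true enjoyment is centred on the midpoint, 5.5,
        # and clamped to the 1-10 scale.
        enjoyment = min(10.0, max(1.0, rng.gauss(5.5, 1.5)))
        # Assumption: the viewer's prediction is the truth plus some error.
        predicted = enjoyment + rng.gauss(0.0, 1.0)
        all_ratings.append(enjoyment)
        if predicted >= threshold:
            # Only movies that look promising are watched and rated.
            watched_ratings.append(enjoyment)
    return statistics.median(all_ratings), statistics.median(watched_ratings)


median_all, median_watched = simulate()
print(f"median if everything were rated: {median_all:.2f}")
print(f"median of what is actually watched: {median_watched:.2f}")
```

In this toy model the median of the watched subset lands above the median of the full catalogue, even though no individual rating is inflated. The size of the gap depends entirely on the assumed threshold and prediction error.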
Selection bias skews ratings higher through one more effect. I have shown here that on IMDb, series are rated higher than movies, and episodes of series are rated even higher than complete series. It is possible for a whole to be worse than its individual parts, but it is not likely that individual episodes are systematically better on average than the shows they constitute. It is also not likely that movies, produced at greater cost and care, are generally less enjoyable than television to audiences on IMDb, yet they are certainly rated that way.
One of the reasons why episodes skew so high is that, like viral videos, they are shorter and less complex than movies or series. They require a smaller investment per episode and their sheer simplicity provides cognitive ease, which is pleasurable. This is especially true for television up through the mid-1990s, which was usually episodic, but it is less true for modern television, which requires investment to keep up with serialized narratives.
I believe the larger effect comes from loyalty. Those IMDb users who bother to rate individual episodes are generally fans of the specific show, whereas a broader range of users rate that show as a whole. Only the better shows attract loyal per-episode reviewers. Many shows are designed to generate that loyalty, making viewers get to know and care about the characters over the long term, something that is harder to do in movies.
Another reason why series would be rated highly is precisely that they require a larger investment of time. Once having made that investment, viewers look back and ask themselves “Did I just waste 100 hours of my life?” To answer that question in the affirmative is painful. The greater the investment, the greater the tendency to double down and assert that it was all worth it.
Conversely, viewers who do not make the complete investment may decide not to give the series any rating, on the grounds of having incomplete information. I watch at least a full season of a series before I publish a review of it, and it is the same with novels. Anything I dislike too much to finish is not reviewed at all. This skews my own ratings higher for longer works.
Bias toward attitude strength
Psychologists have found that strongly expressed attitudes are more resistant to change (e.g. persuasion), more persistent over time, and more consistent with actual human behaviour.7 For example, an answer of 10 on a 10-point scale (“Agree completely”), in a survey about virtually anything, turns out to be much more practically useful information than an answer of 8.
This interpretation of strong attitudes ties into exuberance. At the broadest level, there are social contexts where any weak attitude is seen as phoney or a sign of more general personal weakness. If your weekend was “great” or you have some funny story about how it was “terrible”, you are OK. If your weekend was “OK”, you are not OK. Having neutral experiences marks you as mediocre, in the warped sense of being both normal and bad at the same time.
If this tendency were not moderated by the others, it would produce polarized ratings, with very little at the midpoint and a lot more movies at both extremes. In actuality, most individual movies on IMDb show a mostly-standard unimodal distribution of user ratings in their vote breakdowns, albeit off-centre. Conspicuous bumps at the top and bottom are common, but 80% of IMDb ratings lie between 5.0 and 8.5, not at the extremes.
Dubious neutral ground
In the evaluation of an aesthetic experience, most people view good and bad as opposites: Strong attitudes on one linear continuum. Because of that mental image, we expect some neutral ground where one’s attitude is weak. Intuitively, it speaks to a sense of symmetry to place such neutral ground on the middle of the scale, regardless of what the typical experience is. There is probably a relationship between the bias toward attitude strength and this tendency to have the neutral ground in the middle, as in the BoardGameGeek example above.
In practice, the majority of people on the Internet actively avoid using the apparently neutral part of the scale. They may be making the same interpretation of it as psychologists do: A middling rating is automatically thought to express an attitude that is itself weak, instead of an attitude that is strong in identifying a product of mixed quality. Outside of psychology, a weak attitude is disparaged, just as normality (“mediocrity”) is disparaged.
The “passionately neutral” opinion is a recurring trope of comedy because it sounds like an oxymoron, built on the premise that the strength and valence of an attitude can be disconnected. They can. It is possible to be passionately neutral, but such an opinion cannot be fully expressed in a one-dimensional model. There are more subtle factors at play.
We usually find both strengths and weaknesses in a product and we tend to pick one side or the other instead of letting them cancel out. When they do cancel out, or we can’t find them at all, or we have no interest to start with, we seem to end up on the noise floor of our own thoughts and make up a non-neutral opinion. When exposed to a “clean” input, such as Kazimir Malevich’s blank painting White on White (1918) or John Cage’s musical composition 4’33” (1952), few people seem to arrive at a neutral opinion. If we did, we would not despise normality.
Going to extremes
A bias toward attitude strength does not explain why ratings skew especially high in a general person-to-person context, but it may have something to do with peer-to-peer e-commerce (eBay, Tradera etc.), where there is a community norm of acceptable behaviour. In that context, the sheer dominance of high ratings shows the strength of the norm. Precisely because high ratings are very common, anything lower stands out. This enables a low rating to function as a warning of fraud. Such a semiotic function is barely applicable to sites like IMDb, which skew less.
The bias toward attitude strength can be oppressive, because its actual psychological basis is often misinterpreted and mixed up with prejudice. When you are asked to rate customer service in a survey, for instance at a retail store, there is a strong possibility that the employees who served you will be penalized if you use anything but the top of the scale,8 and penalized further if they tell you about the scale.
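The Net Promoter Score mentioned in the footnote to the paragraph above shows how such a scale works in practice. In the standard NPS scheme, answers of 9 or 10 count as “promoters”, 7 or 8 as “passives”, and 0 through 6 as “detractors”. The score is the percentage of promoters minus the percentage of detractors. The following is a minimal sketch of that calculation. The function names and the sample answers are hypothetical, made up for this example.

```python
def nps_category(answer):
    """Classify one answer on the 0-10 NPS scale."""
    if not 0 <= answer <= 10:
        raise ValueError("NPS answers range from 0 to 10")
    if answer >= 9:
        return "promoter"
    if answer >= 7:
        return "passive"
    return "detractor"  # includes the midpoint of the scale, 5


def net_promoter_score(answers):
    """Percentage of promoters minus percentage of detractors."""
    categories = [nps_category(a) for a in answers]
    promoters = categories.count("promoter")
    detractors = categories.count("detractor")
    return 100 * (promoters - detractors) / len(answers)


# Hypothetical survey: two delighted customers, two satisfied ones and
# four who answered exactly at the midpoint.
sample_answers = [10, 9, 8, 7, 5, 5, 5, 5]
score = net_promoter_score(sample_answers)
print(nps_category(5), score)
```

In this made-up survey nobody answers below the midpoint, yet the score is negative, because every midpoint answer is counted against the business.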
Because this is speculation and a couple of anecdotes, I have no definite answers to the question of why ratings get inflated. However, I would like to say that none of the reasons I can think of should convince a critic to skew high. My conclusion is the opposite.
As a reviewer, you can probably make up your own guidelines and your own scale. If you submit your work to Rotten Tomatoes you interpret it for them. Otherwise, the reinterpretation of your opinion by an aggregator is not something you particularly need to worry about as long as your readers can find out for themselves. If you’re honest, bias of every kind is something you want to avoid, both as a critic and as a reader of criticism.
A shifting baseline of quality is a harder problem for the individual critic. I review books and moving pictures, the contents of which get better very slowly. However, in my lifetime, TVs have gone from CRT to µLED technology, improving faster than media content shown on them. If you’re just starting out writing reviews, and especially if you’re doing it in such a fast-moving field, my advice is to skew low.
Example: Brendan Rettinger, “Cinemath: Do I Agree With IMDB Users? A Statistical Analysis of My IMDB Ratings” (2011-11-27), Collider. Online here. Rettinger glosses over ratings 2–4 as “I did not like this movie” without further distinction. ↩
Metacritic labels a movie as having generally favourable reviews with a Metascore of 61–80, but in the case of games, the same wording (“Generally Favorable”) is reserved for a different range of 75–89, with only 5 points of overlap between the two media. Source: “How We Create the Metascore Magic”, Metacritic, read 2021-11-21. Online here. ↩
The general commercial importance of a rating is not clearly established. In one famous example where it did matter to creators, game publisher Bethesda denied development studio Obsidian a bonus for Fallout: New Vegas (2010) because the game received a rating of 84 on Metacritic, lower than the target number of 85. ↩
For example, at the time of writing, of the last 30 reviews on whathifi.com, one is at the midpoint of the five-star scale, and the other 29 are all at 4 or 5 stars, with nothing below the midpoint. What Hi-Fi? would not be receiving so many free products to review if it were not so business-friendly, nor would it be selling so many ads in its magazine. It, and many operators like it, drive business by providing something in the grey area between commerce-focused publicity (promotion), and unbiased consumer-focused criticism. ↩
Richard E. Petty, Curtis P. Haugtvedt and Stephen M. Smith, “Elaboration as a determinant of attitude strength: Creating attitudes that are persistent, resistant, and predictive of behavior”, a chapter of Attitude strength: Antecedents and consequences (1995). ↩
Example: Official documentation of the Net Promoter Score (NPS) describes an 11-point scale from 0 to 10. Anyone who rates their customer experience at the midpoint of this scale is characterized as a “detractor” of the business (netpromoter.com, 2021-11-21). This offshoot of Taylorism is commonly implemented in such a way that workers are arbitrarily punished (reprimanded, demoted, paid less or fired) for the existence of neutral customers. ↩