The forces of ratings inflation
Why critics and aggregators skew high
In this article, I speculate about why critics and audiences generally rate quality highly, whether it’s the quality of movies, consumer goods, businesses or human beings. A “high” rating here is any rating above the midpoint of any scale.
I write reviews as a hobby. In a 2021 statistical analysis, I found that average movies are rated highly on IMDb, and to a lesser extent on Rotten Tomatoes and Metacritic, three sites where I am not a contributor.
51% of all ratings on Yelp are 5 out of 5 stars.1 The tendency to go high is nearly universal on consumer-facing sites where things are rated for quality. For example, without having analyzed comprehensive statistics, it looks like BGG (BoardGameGeek) skews high like IMDb. Both sites collect ratings from an open pool of users, with little verification.
Off the top of my head and in no particular order, I think ratings skew high for the following reasons.
When a user hovers over a potential rating for a board game on BGG, they see the following guidelines for interpretation of the site’s 10-point scale:
1. Awful - defies game description.
2. Very bad - won’t play ever again.
3. Bad - likely won’t play this again.
4. Not so good - but could play again.
5. Mediocre - take it or leave it.
6. Ok - will play if in the mood.
7. Good - usually willing to play.
8. Very good - enjoy playing and would suggest it.
9. Excellent - very much enjoy playing.
10. Outstanding - will always enjoy playing.
These are typical. IMDb has no official guidelines, but IMDb users describing their own personal interpretations of the scale tend to do it in similar terms.2
Notice the two adjectives around the midpoint of the BGG scale. The word “mediocre” denotes “average”, but it has a negative connotation. The next keyword, “OK”, probably originated as a humorously bad abbreviation of “all correct”, originally denoting something that is good enough in every meaningful way, but “OK” has pejorated. In modern English, it suggests reluctance, and on the BGG scale, “OK” is explicitly worse than good. “Good” appears past the midpoint of the scale, and “enjoy” even further along, at 8/10. There is no “adequate”, “satisfactory”, “neutral” or literal “average”.
If you rate a game at 5 according to these guidelines, you do not thereby indicate that it is typical or normal among the games you play, or have ever played. You instead indicate that it is significantly worse than good and that you did not enjoy it. Disparaging the middle part of the scale in this way shows how a BGG user is expected to evaluate their experiences. Going by the guidelines, the user’s average experience should not be the standard of measurement. There is no place on the scale for a neutral average.
Such guidelines, viewed in isolation, suggest one of two things about their author. Either the author expects ratings to skew high, as in fact they do, or the author believes that enjoyable experiences are rare, which is dubious. The latter possibility, in turn, splits two ways: Either the author is irrationally pessimistic, or they have espoused some philosophy that distinguishes between two different ranges of quality, such as entertainment distinct from a “true” art that brings a “true” joy. No such philosophy is apparent in the way people actually rate games or movies.
Not all sites where ratings skew high have official guidelines. Though interesting, guidelines don’t seem to influence ratings much. I expect only a small effect from them, even in contexts like BGG’s user interface, where the guidelines are readily available.
Metacritic’s stated goal is to “help consumers make an informed decision about how to spend their time and money”, a goal that is not served by skewing in either direction, nor by rating on a curve. Nonetheless, Metacritic processes ratings by the undocumented “stature” of critics to produce a weighted “Metascore”. They also make arbitrary editorial choices in the interpretation of individual critics’ scales, which reduces diversity of opinion compared to Rotten Tomatoes. Finally, Metacritic moves the window of “general meaning” based on the medium of a work.3
It is not a stated editorial policy at Metacritic to skew high. However, there is money at stake,4 and movie ratings on Metacritic are higher than on Rotten Tomatoes. It is not clear whether conscious ratings inflation on either site would be sustainable as a business practice, but as long as neither the editorial policy nor the raw data is public, Metacritic could get away with systematic inflation if its staff or owners wanted to. Lots of reviewers do get away with this.5
Given that Amazon owns IMDb, Prime Video, and the Amazon Original brand of streaming video, there is a monetary incentive toward corruption in the case of IMDb. Hypothetically, the aggregator would pick its secret policy of reinterpretation in whatever way maximizes company profit while minimizing the risk of getting caught without a plausible excuse. This is, of course, easier on Amazon Marketplace, where ratings (“Amazon Seller Rating”) are extensively weighted by policy, and skew high. Amazon itself is rated highly by Amazon, on Amazon.
Sites like IMDb and BGG, where reviewers are poorly authenticated, use secret algorithms in part to counteract campaigns to skew ratings. One example is Ghostbusters (2016), which is marked by IMDb with the following note:
Our rating mechanism has detected unusual voting activity on this title. To preserve the reliability of our rating system, an alternate weighting calculation has been applied.
The vote breakdown page for that specific movie shows a conspicuous plurality of bottom ratings and an unusually large proportion of top ratings, on either side of what is otherwise a standard distribution centered on 6. There is no doubt that IMDb is correct in identifying that voting activity as unusual. It is dishonest; a kind of disinformation.
Trying to compensate for dishonest voting on a statistical level is a cheap way to maintain integrity. It means that IMDb does not have to take expensive or commercially risky actions such as singling out trolls, shills and other abusers. A minor downside is that the secret algorithm can skew ratings. The arithmetic mean of all votes on the Ghostbusters remake is somewhere below 6, whereas the official IMDb rating as of 2021-11 is 6.5, higher than the mean of all votes and higher than the mean of the subset of votes that follow the standard distribution. In this specific case, the secret algorithm to fight abuse skews high.
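To see how a bump of bottom ratings drags down a plain average, here is a minimal Python sketch. The vote counts are invented for illustration and are not IMDb's data. The helper `trimmed_mean` is hypothetical, and discarding the extreme ratings is only one naive way to discount suspect votes. IMDb's actual weighting is secret, so this does not reproduce it.

```python
# Hypothetical vote histogram (rating -> number of votes), invented for
# illustration: a bell-shaped core centred on 6, plus bumps at 1 and 10.
votes = {1: 300, 2: 30, 3: 40, 4: 80, 5: 150,
         6: 200, 7: 150, 8: 80, 9: 40, 10: 130}


def mean_rating(histogram):
    """Arithmetic mean of all votes in a rating -> count histogram."""
    total = sum(histogram.values())
    return sum(rating * count for rating, count in histogram.items()) / total


def trimmed_mean(histogram, excluded=(1, 10)):
    """Mean after discarding the ratings in `excluded` (a naive filter)."""
    kept = {r: c for r, c in histogram.items() if r not in excluded}
    return mean_rating(kept)


raw = mean_rating(votes)
core = trimmed_mean(votes)
print(f"mean of all votes:       {raw:.2f}")
print(f"mean without 1s and 10s: {core:.2f}")
```

With these made-up numbers the plain mean lands well below the mean of the core votes, because the bottom bump outweighs the top bump. Any scheme that discounts both extremes equally will therefore raise the published figure.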
I have presented the possibility that editorial policy could skew ratings. On its face, this possibility is neutral. If, hypothetically, IMDb really worked to drive business to Amazon Original content through ratings—which I do not believe is the case—they could do it by deflating other providers, just as well as by inflating their own. However, there are psychological factors that would drive a cheater to skew high.
In a 1996 speech, Alan Greenspan coined the term “irrational exuberance”, referring to a stock-market bubble at that time. The phrase is now one of many names for an underlying psychological tendency in investors. Briefly put, humans are optimistic. We are vulnerable to wishful thinking, especially in a social context. It feels good to be part of a group that likes the same things and is moving onward and upward. Unlike stocks, ratings are free, so ratings skew high.
At a level so basic that it’s trivial, we just want to have fun. Beyond trying to find good movies on IMDb, we like to find reviews that are funny, “life-affirming” and exuberant. The more things that seem to be well liked, the greater the number of successful “in-groups” you will belong to, and the more certain seems the promise of a good life. People who don’t tend to think this way have mostly exited the gene pool.
On the stock market, irrational exuberance can be self-sustaining. A stock can come to resemble a currency, the value of which is a function of collective faith rather than the underlying material (“real”) economy. There is a parallel in movie reviews. If a strong majority seems to love The Shawshank Redemption (1994), it takes effort to disregard the balance of public opinion.
Exuberance is conformistic. Exuberance is also counterproductive. Stock markets crash because of exuberance. Consumers are ill served by generous, optimistic critics who tend to rate high, because that makes it hard to compare two of their recommendations. Like any symbol, a rating can only be meaningful in relation to other symbols. If 100% of businesses on Yelp had 5 out of 5 stars, this would not make all businesses good. Instead it would put Yelp out of business.
Political campaigns can “go negative” on opponents, but corporations promoting their goods rarely do this. Marketers are happy to undermine the self-confidence of consumers to manufacture a sense of need, but it is more efficient for them to be positive about what they’re selling than to be negative about each of their competitors. IMDb advertises Prime Video because they are both owned by the same megacorp, but IMDb does not need to disparage other providers, which would risk lawsuits and bad PR. The same drivers work in an informal social context. To skew high looks inoffensive.
Most cultures value people who are exuberant and passionate, associating such people with kindness and vivacity. To be generally enthusiastic in this particular way is a signal of genetic fitness and tribal utility. To skew high in one’s individual rating on IMDb is a cheap way to send that signal. If you skew low, you send the opposite signal and you will be perceived as unkind and gloomy: A pedant, “hater”, “bummer” or “negative ninny” who is not on the team.
The effects of projecting an inoffensive exuberance are exaggerated where there is a personal connection. You are more likely to say you liked a book if you got it from a friend, or that friend wrote it, or even if you just know your friends like it. A cowardly form of politeness requires that you voice only compliments and conceal criticism, which can transfer to higher ratings.
On the Internet, a personal connection does not require familiarity. On marketplaces like eBay, where individual buyers and sellers rate each other directly, skewing high is a strong norm. “Likes” on peer-oriented YouTube are far more common than “Dislikes”, and since 2021, the number of dislikes has been hidden to protect advertisers’ feelings. There is only room for positive, market-friendly vibes.
Even blockbuster movies are made by individual human beings hoping for a warm reception, but to give a false impression for the sake of creators’ feelings is instead mean to the readers of one’s review. The tendency to project exuberance through kindness is therefore somewhat mitigated among professional critics, who can theoretically get the same amount of page views and professional respect for positive as for negative reviews. Working against irrational exuberance and keeping all ratings meaningful is part of the job description.
Games have gotten better. The baseline of the industry has shifted, but old ratings have not. As a result, one analyst has speculated that BGG is “running out of room” on its scale.6 IMDb ratings similarly rise almost steadily with release year, going back to the invention of cinema. That growth may come from an escalating exuberance, aided by ratings being determined nearer the height of release hype since the launch of IMDb, but it can also come from real improvements, including better technology and technique in production and presentation, as in board games and video games.
Theoretically, if IMDb scrapped all of its ratings and relaunched today, users would still rate recent works higher than old ones, but the median rating might sink by a couple of tenths of a point because viewers nowadays have higher expectations of the average work than they did in 1993, when IMDb moved to the web. Actively changing the scale, such as by adding more stars to IMDb, is no more attractive.
The exhaustion of quality with experience
According to an earlier study I made of another ratings site, Filmtipset, more experienced critics rate movies lower. Ratings on Filmtipset skew high for users who have rated fewer than 1,000 movies, and skew low for users who have rated more. Those ratings fall below the midpoint, which is lower than the median rating on IMDb in any era, and lower even than Rotten Tomatoes or Metacritic.
This could be because more experienced viewers gradually run out of good movies. This is probably true among the extreme users, but there must be a counterbalancing effect where only those with an exuberant love of the medium—who therefore tend toward higher ratings—make the effort to watch, evaluate and rate a thousand feature films. There must also be another effect whereby dopamine levels decline with familiarity, regardless of quality. In any case, the inexperience of most amateurs is a plausible reason why ratings often skew high, but only on relatively open sites where inexperience is the norm.
Unlike IMDb, Rotten Tomatoes and Metacritic show generally higher ratings for older movies, with a lot of noise. This seems to contradict the hypothesis of rising quality, but doesn’t. Nostalgia and didacticism are probably factors in it, but most of the discrepancy is instead a result of selection bias. The professional critics whose opinions are the basis of RT and Metacritic rarely publish about older works, and when they do, it is generally the most lauded classics which are remastered, rereleased, recirculated at revivals, etc. Bad older works are rarely reviewed by professionals.
Selection bias is also big outside the circle of professional critics. IMDb, which skews higher than RT or Metacritic, lists a broader range of movies and series for review, but it’s only a tiny fraction of all the moving pictures ever produced. About 500 hours of video are uploaded to YouTube each minute. Much more is filmed and never uploaded anywhere. Among countless amateur productions, home video and CCTV, there is something for everyone, but if all of it were rated on IMDb by all those who saw it, the median rating would not skew high.
The same is true in games. Poorly designed and obscure board games do not appear on BGG at all and would be rated poorly if they did.7 When I rate a game on BGG, I still cannot avoid comparing it to all of my other memories of board games, including the ones that are so bad that nobody has bothered adding them to the database. There is a horizon of notability and effort underneath selection bias that hides a lot of chaff from view. It’s consumed, but it isn’t rated. Call it the data-entry bias.
In an even larger context, consumers actively seek pleasure. They watch those movies that they expect to enjoy and they’re usually right in predicting their own pleasure. Watching random feature films would, hypothetically, fill out the bell curve on IMDb and send the median rating closer to the midpoint of the scale. However, it is an open question as to what extent users on IMDb actually compare the works they rate to inferior works they know about but do not rate. It is another open question whether they try to compensate for the former effect.
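The effect of viewers choosing what to watch can be sketched as a small simulation. Everything here is an assumption made for illustration: the catalogue size, the bell curve centred on the midpoint of a 1 to 10 scale, the noise in viewers' predictions, and the threshold in the hypothetical helper `watched_and_rated`. It models no real site.

```python
import random
from statistics import median

random.seed(0)  # fixed seed so the sketch is repeatable


def clamp(x, low=1.0, high=10.0):
    """Keep a score inside the 1-10 scale."""
    return max(low, min(high, x))


# Hypothetical catalogue: each movie has a "true" enjoyment score,
# drawn from a bell curve centred on the midpoint of the scale (5.5).
movies = [clamp(random.gauss(5.5, 1.5)) for _ in range(10_000)]


def watched_and_rated(quality, threshold=6.0, noise=1.0):
    """A viewer watches, and so rates, a movie only if a noisy
    prediction of their enjoyment clears the threshold."""
    predicted = quality + random.gauss(0, noise)
    return predicted >= threshold


rated = [q for q in movies if watched_and_rated(q)]

median_all = median(movies)
median_rated = median(rated)
print(f"median of all movies:   {median_all:.2f}")
print(f"median of rated movies: {median_rated:.2f}")
```

Because the prediction is correlated with actual enjoyment, the rated subset sits above the full catalogue even though no individual rating is inflated. A noisier prediction or a lower threshold shrinks the gap.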
Selection bias skews ratings higher through one more effect. On IMDb, series are rated higher than movies, and episodes of series are rated even higher than complete series. It is possible for a whole to be worse than its individual parts, but it is not likely that individual episodes are systematically better on average than the shows they constitute. It is also not likely that movies, produced at greater cost and care, are generally less enjoyable than television to audiences on IMDb, yet they are certainly rated that way.
One of the reasons why episodes skew especially high is that, like viral videos, they are shorter and less complex than movies or series. They require a smaller investment per episode and their sheer simplicity provides cognitive ease, which is pleasurable. This is especially true for television up through the mid-1990s, which was usually episodic, but it is less true for modern television, which requires investment to keep up with serialized narratives.
I believe the larger effect comes from loyalty. Those IMDb users who bother to rate individual episodes are generally fans of the specific show, whereas a broader range of users rate that show as a whole. Only the better shows attract loyal per-episode reviewers. Many shows are designed to generate that loyalty, making viewers get to know and care about the characters over the long term, something that is harder to do in movies.
Another reason why series would be rated highly is precisely that they require a larger investment of time. Once having made that investment, viewers look back and ask themselves “Did I just waste 100 hours of my life?” To answer that question in the affirmative is painful. The greater the investment, the greater the tendency to double down and assert that it was all worth it.
Conversely, viewers who do not make the complete investment may decide not to give the series any rating, on the grounds of having incomplete information. I watch at least a full season of a series before I publish a review of it, and it is the same with novels. Anything I dislike too much to finish is not reviewed at all. This skews my own ratings higher, especially for longer works.
Bias toward attitude strength
Psychologists have found that strongly expressed attitudes are more resistant to change (e.g. persuasion), more persistent over time, and more consistent with actual human behaviour.8 For example, an answer of 10 on a 10-point scale (“Agree completely”), in a survey about virtually anything, turns out to be much more practically useful information than an answer of 8.
This interpretation of strong attitudes ties into exuberance. At the broadest level, there are social contexts where any weak attitude is seen as phoney or a sign of more general personal weakness. If your weekend was “great” or you have some funny story about how it was “terrible”, you are OK. If your weekend was “OK”, you are not OK. Having neutral experiences marks you as mediocre, in the warped sense of being both normal and bad at the same time.
If this tendency were not moderated by the others, it would produce polarized ratings, with very little at the midpoint and a lot more movies at both extremes. In actuality, most individual movies on IMDb show a mostly-standard unimodal distribution of user ratings in their vote breakdowns, albeit off-centre. Conspicuous bumps at the top and bottom are common, but 80% of IMDb ratings lie between 5.0 and 8.5, not at the extremes.
Dubious neutral ground
In the evaluation of an aesthetic experience, most people view good and bad as opposites: Strong attitudes on one linear continuum. Because of that mental image, we expect some neutral ground where one’s attitude is weak. Intuitively, it speaks to a sense of symmetry to place such neutral ground in the middle of the scale, regardless of what the typical experience is. There is probably a relationship between the bias toward attitude strength and this tendency to have the neutral ground in the middle, as in the BGG example above. In practice, however, the majority of people on the Internet actively avoid using the apparently neutral part of the scale. They may be making the same interpretation of it as psychologists do: A middling rating is automatically thought to express an attitude that is itself weak, instead of an attitude that is strong in identifying a product of mixed quality.
We do usually find both strengths and weaknesses in a product and we tend to pick one side or the other instead of letting them cancel out. When exposed to a “clean” input, such as Kazimir Malevich’s blank painting White on White (1918) or John Cage’s musical composition 4’33” (1952), few people seem to arrive at a neutral opinion. Instead, we end up on the noise floor of our own thoughts and make something up.
There is no perceived need for a “Neutral” button on YouTube, between “Like” and “Dislike”. Not many would make the effort to click that button. The “passionately neutral” opinion is a recurring trope of comedy because it sounds like an oxymoron, built on the premise that the strength and valence of an attitude can be disconnected. They can. It is possible to be passionately neutral, but such an opinion cannot be fully expressed in a one-dimensional model. There are more subtle factors at play.
Going to extremes
A bias toward attitude strength does not explain why ratings skew especially high in a general person-to-person context, but it may have something to do with peer-to-peer e-commerce (eBay, Tradera etc.), where there is a community norm of acceptable behaviour. Here, users rate high by default, in the absence of norm violations. Precisely because high ratings are very common in e-commerce, anything lower stands out as showing a strong attitude. This enables a low rating to function as a warning of fraud, which is useful. Such a semiotic function is barely applicable to sites like IMDb, which skew less. On IMDb, the norm is to rate high in the presence of entertainment, not yet by default.
The bias toward attitude strength can be oppressive, because its actual psychological basis is often misinterpreted and mixed up with prejudice. When you are asked to rate customer service in a survey, for instance at a retail store, there is a strong possibility that the employees who served you will be penalized if you use anything but the top of the scale,9 and penalized further if they tell you about the scale.
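The Net Promoter Score cited in the footnote to this paragraph is a concrete case. In its standard form, answers of 9 or 10 count as "promoters", 7 or 8 as "passives" and 0 through 6 as "detractors", and the score is the percentage of promoters minus the percentage of detractors. The Python sketch below shows that arithmetic. The function names and the survey answers are made up for illustration, and how a given employer turns the result into rewards or penalties is not modelled.

```python
def nps_category(score):
    """Classify a 0-10 answer using the standard Net Promoter bands."""
    if not 0 <= score <= 10:
        raise ValueError("score must be between 0 and 10")
    if score >= 9:
        return "promoter"
    if score >= 7:
        return "passive"
    return "detractor"


def net_promoter_score(scores):
    """Percentage of promoters minus percentage of detractors."""
    categories = [nps_category(s) for s in scores]
    promoters = categories.count("promoter")
    detractors = categories.count("detractor")
    return 100 * (promoters - detractors) / len(categories)


# Hypothetical survey answers, invented for illustration.
answers = [10, 9, 8, 7, 5, 5, 10, 6]
print(nps_category(5))              # the midpoint counts against the business
print(net_promoter_score(answers))
```

On this scale a customer who answers 5, the exact midpoint, lowers the score just as much as one who answers 0, and even a 7 or 8 does nothing to raise it.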
Because this is speculation and a couple of anecdotes, I have no definite answers to the question of why ratings get inflated. However, I would like to say that none of the reasons I can think of should convince a critic to skew high. My conclusion is the opposite.
As a reviewer, you can probably make up your own guidelines and your own scale. If you submit your work to Rotten Tomatoes you interpret it for them. Otherwise, the reinterpretation of your opinion by an aggregator is not something you particularly need to worry about as long as your readers can find out for themselves. If you’re honest, bias of every kind is something you want to avoid, both as a critic and as a reader of criticism.
A shifting baseline of quality is a harder problem for the individual critic. I review books and moving pictures, the contents of which get better very slowly. However, in my lifetime, TVs have gone from CRT to µLED technology, improving faster than media content shown on them. If you’re just starting out writing reviews, and especially if you’re doing it in such a fast-moving field, my advice is to skew low.
2. Example: Brendan Rettinger, “Cinemath: Do I Agree With IMDB Users? A Statistical Analysis of My IMDB Ratings” (2011-11-27), Collider. Rettinger glosses over ratings 2–4 as “I did not like this movie” without further distinction.
3. Metacritic labels a movie as having generally favourable reviews with a Metascore of 61–80, but in the case of games, the same wording (“Generally Favorable”) is reserved for a different range of 75–89, with only 5 points of overlap between the two media. Source: “How We Create the Metascore Magic”, Metacritic, read 2021-11-21.
4. The general commercial importance of a rating is not clearly established. In one famous example where it did matter to creators, game publisher Bethesda denied development studio Obsidian a bonus for Fallout: New Vegas (2010) because the game received a rating of 84 on Metacritic, lower than the target number of 85.
5. For example, at the time of writing, of the last 30 reviews on whathifi.com, one is at the midpoint of the five-star scale, and the other 29 are all at 4 or 5 stars, with nothing below the midpoint. What Hi-Fi? would not be receiving so many free products to review if it were not so business-friendly, nor would it be selling so many ads in its magazine. It, and many operators like it, drive business by providing something in the grey area between commerce-focused publicity (promotion), and unbiased consumer-focused criticism.
8. Richard E. Petty, Curtis P. Haugtvedt and Stephen M. Smith, “Elaboration as a determinant of attitude strength: Creating attitudes that are persistent, resistant, and predictive of behavior”, a chapter of Attitude strength: Antecedents and consequences (1995).
9. Example: Official documentation of the Net Promoter Score (NPS) describes an 11-point scale from 0 to 10. Anyone who rates their customer experience at the midpoint of this scale is characterized as a “detractor” of the business (netpromoter.com, 2021-11-21). This offshoot of Taylorism is commonly implemented in such a way that workers are arbitrarily punished (reprimanded, demoted, paid less or fired) for the existence of neutral customers.