Fixed IMDb rankings

The Internet Movie Database is a massive movie info repository that, among other things, uses a public voting system to rate movies. Unfortunately, there are a number of disturbing anomalies apparent in their system.

  1. Although they originally claimed to use a Bayesian estimation formula, their computations do not appear to fit the raw data and parameters that they claim.
  2. For some inexplicable reason, they have chosen to exclude documentaries from their top 250 list.

Their explanation for the mathematical mismatch is that they only accept votes from regular voters. However, the raw data for this set of votes is not given, and IMHO their rankings don't seem to be of particularly higher quality as a result of this filtering. Bowling for Columbine (a very controversial and highly political, but nevertheless extremely popular movie) briefly ranked among the top 50 movies on IMDb, then suddenly disappeared. This was followed by an update to their FAQs indicating that documentaries were to be excluded from the top 250 list. Prior to this they had not been excluded; in particular, I remember seeing Hoop Dreams among the top 250 at one point.

In the list below I have rectified these two major problems. Documentaries from their top 50 documentary list have been added, those with too few votes have been eliminated, and all scores have been recalculated from scratch using the exact formula originally described by IMDb (but taking everyone's votes into account). The results are startling, to say the least.
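
For the curious, here is a minimal Python sketch of what such a recalculation might look like. It assumes the formula in question is the weighted rating IMDb has published for its top 250, WR = (v / (v + m)) * R + (m / (v + m)) * C, where R is a film's mean vote, v its number of votes, m the minimum number of votes required to be listed, and C the mean vote across all films. The function names, the dictionary layout and every number in the usage note are made up for illustration; this is not the actual tool or data behind the list.

```python
def weighted_rating(mean_vote, num_votes, min_votes, global_mean):
    """Bayesian estimate: pull a film's mean vote toward the global mean.

    The fewer votes a film has relative to min_votes, the stronger the pull.
    """
    v, m = num_votes, min_votes
    return (v / (v + m)) * mean_vote + (m / (v + m)) * global_mean


def rank_films(films, min_votes, global_mean):
    """Drop films with too few votes, then sort by weighted rating, best first.

    `films` is a list of dicts with hypothetical keys "title", "mean", "votes".
    """
    eligible = [f for f in films if f["votes"] >= min_votes]
    return sorted(
        eligible,
        key=lambda f: weighted_rating(f["mean"], f["votes"], min_votes, global_mean),
        reverse=True,
    )
```

With m = 1000 and C = 7.0 (illustrative values only), a film with a mean of 9.0 from exactly 1000 votes lands halfway between its own mean and the global mean, at 8.0. That is the whole point of the estimate: a handful of enthusiastic votes cannot vault a film to the top.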

  1. Serenity (an excellent movie) is #14 here. (IMDb does not even rank it)
  2. Amelie ranked an amazing 12th! (IMDb ranks it as #27)
  3. The Godfather, which IMDb ranks as #1, actually only ranks at #17, well below The Godfather: Part II.
  4. The Empire Strikes Back ranks higher than Star Wars (actually quite a common sentiment, one that aficionados as well as myself agree with), whereas IMDb reverses this.
  5. The Professional (one of my favorite films) ranks #8 which is much higher than IMDb's rank of #43.
  6. Princess Mononoke (universally hailed as a brilliant film) ranks at #42, while IMDb ranks it at #103.
  7. The brilliant Donnie Darko earns its rightful place at #34, but only shows up on IMDb at #96.
  8. Schindler's List (a very good film, though its reputation is exaggerated) ranks #36, while IMDb unbelievably ranks it at #6. (IMHO, The Pianist is a far superior film.)
  9. American History X, a truly excellent film, also gets a much higher, and well deserved, rank here than on the IMDb list.
  10. Ghost in the Shell is an excellent film deserving of a place in the top 250; it's not ranked by IMDb.
  11. The City of the Lost Children is ranked at #206 here. Not ranked by IMDb at all.
  12. Show Me Love is ranked at #135 here. Not ranked by IMDb at all.

In short, the following list makes a lot more sense to me than IMDb's list all around. Unfortunately, this list suffers from a rather important weakness: I cannot access IMDb's whole database. This affects movies that would otherwise rank among the top 250 but which my recalculation doesn't see because they didn't make it into IMDb's top 250 (which to me seems clearly dubious). As such I have added in a few entries by hand, but I only found Hard Boiled (Lashou shentan), Grosse Pointe Blank, The Hours, When Harry Met Sally, Rocky, 25th Hour, Drunken Master (Jui kuen II), Sense and Sensibility and Kiki's Delivery Service (Majo no takkyubin), which seem to have risen to the top 250 even though they are ignored in IMDb's top 250.

I should point out that I am not a movie industry shill, nor do I have any hidden agenda. The IMDb top 250 just did not seem very accurate to me, so I wanted to fix it.

Update: From the September 30, 2003 refresh, the numeric values suggest that there should probably be a completely new set of movies beyond rank 240. I am now using an entry retention algorithm so that, as I update the list, movies which fall out of IMDb's top 250 list are not forgotten. Hopefully over time this, along with manual entries, will give more accurate entries for the lower end of this list.
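
The retention algorithm itself isn't spelled out here, so the following Python sketch is only one plausible reading of it: keep a running list of every title ever seen, and on each refresh re-fetch the current chart plus anything tracked earlier. The function name and the use of plain title strings are assumptions made for illustration.

```python
def merge_tracked(previous, current_top):
    """Return the titles to re-fetch: the current chart plus anything seen before.

    Titles that have dropped off the chart stay tracked, so their votes can
    still be recalculated and they can re-enter the list on merit.
    """
    tracked = list(current_top)
    seen = set(current_top)
    for title in previous:
        if title not in seen:
            tracked.append(title)
            seen.add(title)
    return tracked
```

For example, merge_tracked(["A", "B", "C"], ["B", "D"]) returns ["B", "D", "A", "C"]: A and C have fallen off the chart but are not forgotten.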

So as it stands, I would say that the movies ranked up to about the top 235 or so are definitely more accurate in this list than in IMDb's in all respects, while those listed after that point are more accurately ordered here but likely closer to their actual rank on IMDb (in theory, since many of these are not even ranked by IMDb).

Update: IMDb changed their server configuration to limit the number of requests per second that I or anyone else can make to their servers. Of course, it's not my intention at all to cause any kind of denial of service for them -- the tool I use to pull down their webpages observes "530 messages", which the server is supposed to issue when it's overloaded. The problem is that when IMDb gets tired of sending me 530 messages, they start just giving me a substitute HTML page (which is worse for them and me) with a message indicating that I'm hitting their site too often. There are much better ways of dealing with this -- my tool (and any proper tool that understands 530s) tries to do the right thing and decreases the retry frequency exponentially. So they really could just keep sending me 530s (which should be very low overhead for their server), and it would work out for both sides.
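
To make the "exponential" part concrete, here is a rough Python sketch of that kind of retry loop. It is not the actual tool: the function name, the delay values and the injected fetch callable (which is assumed to return a status code and a body) are all hypothetical. The busy status is left as a parameter; it defaults to the 530 mentioned above, though standard HTTP uses 503 for "Service Unavailable".

```python
import time


def fetch_with_backoff(fetch, url, busy_status=530, base_delay=1.0,
                       max_delay=300.0, max_tries=8, sleep=time.sleep):
    """Call fetch(url) -> (status, body); back off exponentially while busy."""
    delay = base_delay
    for _ in range(max_tries):
        status, body = fetch(url)
        if status != busy_status:
            return status, body
        sleep(delay)                       # wait before the next attempt
        delay = min(delay * 2, max_delay)  # double the wait, up to a cap
    raise RuntimeError(f"server still busy after {max_tries} tries: {url}")
```

Each busy response doubles the wait (1, 2, 4, 8 seconds and so on, capped at max_delay), so a struggling server sees less and less traffic from the client until it recovers.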

This is the primary reason that there was such a lag in the updating of this page. In the absence of a fix on their side, I implemented a workaround to slow down the downloads from my side (to once every 10 seconds, as they suggest), which seems to have worked just fine.
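
A client-side slowdown like this can be as simple as the Python sketch below. The class name and the injectable clock and sleep arguments are assumptions for illustration; only the 10-second interval comes from the text.

```python
import time


class Throttle:
    """Enforce a minimum interval between successive requests."""

    def __init__(self, min_interval=10.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self.clock = clock
        self.sleep = sleep
        self.last = None  # time of the previous request, if any

    def wait(self):
        """Block until at least min_interval has passed since the last call."""
        now = self.clock()
        if self.last is not None:
            remaining = self.min_interval - (now - self.last)
            if remaining > 0:
                self.sleep(remaining)
                now = self.clock()
        self.last = now
```

Calling wait() immediately before every download is enough: the first call returns at once, and each later call sleeps only for whatever is left of the 10 seconds.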

Update: Just for those geeks out there who are interested: I have significantly revamped the tool I use to pull down IMDb. The primary fix is to redownload any URL whose page cannot be parsed (sometimes IMDb returns completely corrupted HTML). This fixes an issue where movies would sometimes "disappear" off the chart. I also now retain the URL <-> title mapping, so that I can bypass the main page of each movie and just fetch the ratings pages directly (this just makes it faster for me, which means I will be more inclined to update this page). Finally, the tool now operates incrementally, so if a run fails (possibly because I am too impatient and stop it because I need to use my computer for something else) I can rerun it and it will pick up where it left off.
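
For anyone wanting to build something similar, the retry-on-bad-parse and pick-up-where-it-left-off ideas might be combined roughly as in this Python sketch. Everything here is an assumption made for illustration: the function name, the JSON state file, and the convention that the supplied parse function raises ValueError on corrupted HTML. The real tool may work quite differently.

```python
import json
import os


def run_incremental(urls, fetch, parse, state_path, max_retries=3):
    """Fetch and parse each URL once, saving progress so a rerun can resume."""
    done = {}
    if os.path.exists(state_path):
        with open(state_path) as fh:
            done = json.load(fh)          # results from an earlier, interrupted run
    for url in urls:
        if url in done:
            continue                      # already handled; skip it on this run
        for _ in range(max_retries):
            try:
                done[url] = parse(fetch(url))
                break
            except ValueError:
                continue                  # unparsable page: download it again
        with open(state_path, "w") as fh:
            json.dump(done, fh)           # checkpoint after every URL
    return done
```

A URL that still fails after max_retries attempts is simply left out of the state file, so the next run tries it again instead of silently dropping the movie.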

This site has been noticed!

The corrected list.

Last updated: 04-11-2013