Are new movies longer than they were 10, 20, 50 years ago?

As stated before, we only need three columns:

- startYear: the release year of a title
- runtimeMinutes: the primary runtime of the title, in minutes
- numVotes: the number of votes the title has received

    movies = movies[['startYear', 'runtimeMinutes', 'numVotes']]

Finally, we need to convert those columns to a numeric data type and drop rows with missing values.

    for column in movies.columns.values.tolist():
        movies[column] = pd.to_numeric(movies[column], errors='coerce')

    movies = movies.dropna()
    print(movies.shape)
    >>>(197552, 3)

After this step our number of movies dropped to 197.5k.

Before we continue with further analysis, it is good to check the descriptive statistics of our dataset to determine if everything looks all right.

    print(movies.describe())
    >>>           startYear  runtimeMinutes      numVotes
    >>>count  197552.000000   197552.000000  1.975520e+05
    >>>mean     1988.940932       94.929492  3.643819e+03
    >>>std        24.758088       29.967162  3.173653e+04
    >>>min      1894.000000        1.000000  5.000000e+00
    >>>25%      1973.000000       83.000000  1.700000e+01
    >>>50%      1996.000000       92.000000  6.500000e+01
    >>>75%      2010.000000      103.000000  3.390000e+02
    >>>max      2019.000000     5760.000000  2.029673e+06

We can notice that at least one movie is only 1 minute long, which doesn't look right. There are probably some mistakes in the database.

According to the Academy of Motion Picture Arts and Sciences, an original film needs to be 40 minutes or less to qualify as a short film, whereas a feature film is more than 40 minutes. That's a great rule for dropping movies that are too short.

    movies = movies[movies['runtimeMinutes'] > 40]

What's more important, we are only interested in popular movies. There are thousands of movies in the IMDb database which have only a few dozen votes. They can skew our results. Let's say a popular movie is one with at least 1,000 ratings. We drop all movies that don't meet this rule (goodbye, thousands of TV movies and garage productions!).

    movies = movies[movies['numVotes'] >= 1000]
    print(movies.describe())
    >>>          startYear  runtimeMinutes      numVotes
    >>>count  27951.000000    27951.000000  2.795100e+04
    >>>mean    1995.441165      104.993167  2.494047e+04
    >>>std       21.236780       22.305108  8.118090e+04
    >>>min     1911.000000       43.000000  1.000000e+03
    >>>25%     1986.000000       91.000000  1.679000e+03
    >>>50%     2003.000000      100.000000  3.440000e+03
    >>>75%     2011.000000      114.000000  1.195000e+04
    >>>max     2018.000000      450.000000  2.029673e+06

In our final dataset there are 27,951 movies. The shortest one is 43 minutes long and the longest is 450 minutes long (the prize of Iron Bladder goes to anyone who can watch it without a bathroom break!). The oldest movie in the dataset is from 1911.

On average, every movie in our dataset has almost 25k votes, but the standard deviation is 81k, which probably means that the distribution is skewed right and the mean is inflated by a minority of movies with a huge number of votes (there is at least one movie with over 2 million ratings!). The median looks closer to reality: 50% of movies have 3,440 votes or fewer.

Now we can save our data to CSV and move to a new script. This one takes a long time to execute: Python needs to download over 100 MB of data in total and process it a few times.
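If you want to sanity-check the cleaning steps without downloading the full dataset, here is a minimal sketch that runs them on a handful of made-up rows. The function name `clean_movies`, its `min_runtime` and `min_votes` parameters, the sample values, and the `.copy()` call are my own additions for illustration, not part of the original script. It also assumes missing values arrive as a non-numeric placeholder such as `\N`.

```python
import pandas as pd


def clean_movies(raw, min_runtime=40, min_votes=1000):
    """Hypothetical helper: apply the cleaning steps described above."""
    cols = ["startYear", "runtimeMinutes", "numVotes"]
    # .copy() (an addition here) avoids pandas' SettingWithCopyWarning
    movies = raw[cols].copy()
    for column in cols:
        # Anything that is not a number becomes NaN
        movies[column] = pd.to_numeric(movies[column], errors="coerce")
    movies = movies.dropna()
    # Feature films only: strictly longer than min_runtime minutes
    movies = movies[movies["runtimeMinutes"] > min_runtime]
    # Popular films only: at least min_votes ratings
    movies = movies[movies["numVotes"] >= min_votes]
    return movies


# Made-up rows for illustration only; not real IMDb data
raw = pd.DataFrame({
    "startYear": ["1999", "2005", "\\N", "2010", "2015", "1975", "2012", "2001"],
    "runtimeMinutes": ["120", "1", "95", "\\N", "101", "88", "143", "90"],
    "numVotes": ["250000", "5000", "3000", "1500", "1000", "1200", "3000", "17"],
})

movies = clean_movies(raw)
print(movies.shape)  # (4, 3)

# A mean far above the median hints at a right-skewed distribution
votes = movies["numVotes"]
print(votes.mean(), votes.median())  # 63800.0 2100.0
```

Even in this toy sample, a single heavily rated title pulls the mean far above the median, which is the same pattern the real `numVotes` column shows.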
