The most influential factor of IMDB movie rating — Part I: Data Scraping

Yuri Dai · May 15

I have always been an enthusiastic fan of movies, and I like to discover great films by browsing different rating and review websites, such as IMDb and Rotten Tomatoes.

When I look at those top-movie lists, I always wonder: what are the primary factors that influence a movie's success? Is it budget, box office, language, or genre? By chance, three other movie lovers (who are also data lovers) and I decided to conduct a statistical study of the factors that influence a movie's success.

We decided to look at IMDb's “Top 500 Greatest Movies of All Time” list; we used movies as instances, collected various quantities related to each movie, and subsequently conducted statistical analysis on the dataset.

Our goals were to:

1. Use data-scraping techniques to extract data from an IMDb movie list and create a dataset.

2. Use descriptive statistics and multiple regression modeling to visualize and analyze the data we collected.

IMDb top 500 movies

This article focuses on the first part of this research project: dataset creation through data scraping and cleaning.

Motivation

The purpose of creating this dataset is to analyze the primary factors that influence a movie's success, measured by movie rating.

Our team created this dataset by scraping, using Beautiful Soup, a page created by IMDb in 2017, “Top 500 Greatest Movies of All Time”.

Composition

Instances and Variables

The dataset comprises only movies as the type of instance.

There are 500 instances (movies) in total.

Our dataset contains all possible instances from the webpage.

Each instance consists of certificate, duration, rating, genre, vote, gross, country, language, and budget.

Our categorical variables for each instance (certificate, genre, country, and language) were collected as unprocessed text scraped directly from the webpage.

Numerical variables for each instance (duration, rating, vote, gross, and budget) were collected as unprocessed text and values, also scraped directly from the webpage.

There is no label or target associated with each instance.

Among our 500 instances, certain variables were missing for some movies, such as gross and budget.

For the instances whose missing information we could find elsewhere, we added it to the CSV file (our dataset).

However, we removed the instances for which gross or budget information was not available.

Our original dataset also included a Metascore for each instance, but we decided to drop it in the end because many movies' Metascores were missing.

These data points are missing because they were not available on the IMDb website; some movies are very old, and it is hard to collect data from sixty years ago.

We initially collected the data with no explicit links between individual instances, because we wanted to analyze the dataset objectively.

Noise

There are some sources of noise in the dataset.

For example, some foreign movies report gross and budget in foreign currencies, which we manually converted to USD at current exchange rates.

This may be inaccurate, since exchange rates fluctuate constantly and may not reflect the exact gross and budget of those foreign movies.

There are also some old movies whose gross and budget figures were recorded a long time ago.

We did not adjust gross and budget for inflation or depreciation.

Contained Information / Data from IMDb

The dataset is partly self-contained.

For variables such as certificate, duration, rating, genre, vote, and gross, the unprocessed text and values are directly scraped from the webpage, and therefore self-contained.

The other three variables (country, language, and budget) were not self-contained; we scraped them by going into the webpage of each individual movie instance on the IMDb website.

For example, to find the country, language, and budget of the first movie, Citizen Kane, we went to the Citizen Kane page to scrape the unprocessed text and values.

The dataset does not contain information that is confidential or offensive.

The data is partially related to people, as ratings are crowdsourced opinions.

Ratings can reflect how the public views the qualities of movies.

However, there are certain groups of people who are more likely to rate.

For example, people who really like or dislike a particular movie or its cast are more likely to rate it in order to express their strong opinions.

Movie lovers are also more likely to rate, though in a more objective way.

Collection Process

All of our variables are directly observable as raw text and values on the various IMDb pages.

the packages that our team utilized for data collection

We used the Python library BeautifulSoup to scrape data from the IMDb webpages.

As shown in the Python file provided, for each variable we identified patterns in the text or values presented as HTML.

We then obtained the information for each variable by identifying the tag and class of that variable on the website.
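As a rough illustration of the setup, here is a minimal sketch of the kind of script described above, assuming the requests library is used to fetch pages (the post does not show the fetching code); the list URL is a hypothetical placeholder.

```python
# Minimal setup sketch (assumed structure, not the original script):
# fetch the IMDb list page and parse it with Beautiful Soup.
import re

import requests
from bs4 import BeautifulSoup

LIST_URL = "https://www.imdb.com/list/<LIST_ID>/"  # hypothetical placeholder for the 2017 list page

response = requests.get(LIST_URL)
soup = BeautifulSoup(response.text, "html.parser")
```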

For self-contained information:

Movie Name: We identified that the movie name is under the tag “h3” and class “lister-item-header”.

We simply used getText() to obtain text data and used regular expressions to get rid of extra spaces.
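A minimal sketch of this step, reusing the soup object from the setup above (the whitespace-collapsing regular expression here is illustrative, not necessarily the exact one we used):

```python
# Movie names: <h3 class="lister-item-header"> elements; getText() plus a
# regular expression to strip the extra whitespace.
names = []
for header in soup.find_all("h3", class_="lister-item-header"):
    raw = header.getText()
    names.append(re.sub(r"\s+", " ", raw).strip())  # collapse newlines and extra spaces
```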

obtain Movie Names

Certificate: We identified that the certificate information is under the tag “p” and class “text-muted text-small”. We found that the information we need for each movie instance is in every third tag, so we appended the getText() value from every third tag to our certificate array.
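A sketch of the every-third-tag approach (the starting offset is an assumption about the 2017 page layout):

```python
# Certificates: the <p class="text-muted text-small"> blocks repeat in groups,
# and the block we need recurs every third tag, so step through with stride 3.
certificates = []
muted_blocks = soup.find_all("p", class_="text-muted text-small")
for block in muted_blocks[::3]:  # adjust the starting offset if the layout differs
    certificates.append(block.getText().strip())
```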

Obtain Certificate

Duration: We identified that the duration information we need is under the tag “span” and class “runtime”. Similar to certificate, a lot of text is returned for each instance.

The duration we need for this dataset is always the first text returned under this tag and class, so we only appended the information at index 0.
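A sketch of this step; the per-movie container class used here (“lister-item-content”) is an assumption, since the original code is not reproduced:

```python
# Duration: each movie's runtime sits in a <span class="runtime">; when more
# than one match comes back for an instance, keep only the first (index 0).
durations = []
for item in soup.find_all("div", class_="lister-item-content"):  # assumed per-movie container
    runtimes = item.find_all("span", class_="runtime")
    if runtimes:
        durations.append(runtimes[0].getText())  # e.g. "119 min"
```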

Obtain Duration

Rating: The information we need is in a section under the tag “div” and class “ipl-rating-star small”. In that section, the rating value is under the tag “span” and class “ipl-rating-star__rating”.

We used two for loops to obtain the rating values.
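A sketch of the two-loop approach:

```python
# Ratings: loop over each <div class="ipl-rating-star small"> widget, then
# over the nested <span class="ipl-rating-star__rating"> that holds the value.
ratings = []
for widget in soup.find_all("div", class_="ipl-rating-star small"):
    for value_span in widget.find_all("span", class_="ipl-rating-star__rating"):
        ratings.append(value_span.getText())  # e.g. "8.3"
```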

Obtain Rating

Genre: Obtaining the genre information was more straightforward. All genre information is under the tag “span” and class “genre”.

We simply used get_text() to acquire the text we needed.
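A one-line sketch of this step:

```python
# Genres: plain text inside <span class="genre">, stripped of surrounding whitespace.
genres = [span.get_text().strip() for span in soup.find_all("span", class_="genre")]
```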

Obtain Genre

Vote: Vote follows a similar pattern to Certificate. We first found a section containing the vote data under the tag “p” and class “text-muted text-small”.

This section includes three repeating types of information, and the information we gathered is in every third chunk.

We looped through every third chunk and looked for the information under the tag “span”, which is the text we want.

Gross: Gross information is taken from the same “span” information used for extracting the vote data.

The gross is always the 5th data value in that chunk, so we simply extracted it using get(“data-value”) on index 4 of the chunk.
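A combined sketch for vote and gross; the chunk stride and the index-4 position follow the description above and may need adjusting if the layout differs:

```python
# Vote and gross: reuse the <p class="text-muted text-small"> chunks; the
# relevant chunk recurs every third block. Within it, the vote is the text of
# the first <span>, and the gross is the "data-value" of the span at index 4.
votes, grosses = [], []
stat_blocks = soup.find_all("p", class_="text-muted text-small")
for block in stat_blocks[::3]:  # starting offset may differ from the certificate chunks
    spans = block.find_all("span")
    if spans:
        votes.append(spans[0].getText())
    if len(spans) > 4:
        grosses.append(spans[4].get("data-value"))
```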

Obtain Vote and Gross

Data from External Links:

For the following variables, we had to loop through the webpage of each movie instance.

We found that these webpages follow the pattern “https://www.imdb.com/title/tt” plus a number that represents the movie ID.

We first used Beautiful Soup's find_all function to find all the links on the main list page.

We then noticed the pattern IMDb uses to identify each movie with a series of digits.

We used the re.compile() function with the regular expression pattern ‘/title/(tt\d{5,7})/’ to find any links that match this pattern.

We then filtered out the useful links and appended them to a new list.

After printing the new list, we found that each movie title was actually repeated twice.

We then used a for loop to extract every other link (those in odd positions).
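A sketch of this link-collection step:

```python
# Collect per-movie links: grab every <a href> on the list page, keep only
# those matching /title/tt<digits>/, and drop the duplicate of each title.
title_pattern = re.compile(r"/title/(tt\d{5,7})/")

movie_links = []
for anchor in soup.find_all("a", href=True):
    match = title_pattern.search(anchor["href"])
    if match:
        movie_links.append("https://www.imdb.com/title/" + match.group(1) + "/")

movie_links = movie_links[::2]  # each movie appears twice, so keep every other link
```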

Find Movie IDs

Using for loops, for each instance we first extracted a text box containing the information we need under the tag “div” and class “txt-block”.

In the text box, to prevent mismatching information across instances, we used an if-elif structure to append 0 for movies with no budget information and “No Specific” for movies with no language information.

We recognized that each type of information follows a heading that states the information type.

For example, budget data follows the heading “Budget:”, so we used these headings to identify movies with missing information by checking whether we could find the heading for each instance.

For the majority of the instances, with all information available, we used try-except blocks and if statements to scrape the following information (a sketch follows this list):

Country: The country information is under the tag “h4”, so we used getText() to extract the text after the heading “Country:”.

Language: Similar to country, the language information is also under the tag “h4”, so we again used getText() to extract the text after the heading “Language:”.

Budget: Budget is similar, but different in that although the budget information is also under the tag “h4”, we could not use getText(), because the budget was not stored as text.

The budget is stored as a data value, so we used the attribute .next_sibling to extract the data value after the heading “Budget:”.
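The sketch below pulls these three variables from each movie page, following the txt-block / h4 / .next_sibling approach described above and assuming requests as in the setup sketch; the exact control flow and default values are an approximation of the original script.

```python
# Country, language, and budget from each movie's own page.
countries, languages, budgets = [], [], []

for link in movie_links:
    movie_soup = BeautifulSoup(requests.get(link).text, "html.parser")
    country, language, budget = None, "No Specific", 0  # defaults when a heading is missing

    for block in movie_soup.find_all("div", class_="txt-block"):
        heading = block.find("h4")
        if heading is None:
            continue
        title = heading.getText()
        if title.startswith("Country"):
            country = block.getText().replace("Country:", "").strip()
        elif title.startswith("Language"):
            language = block.getText().replace("Language:", "").strip()
        elif title.startswith("Budget"):
            budget = str(heading.next_sibling).strip()  # raw value right after the heading

    countries.append(country)
    languages.append(language)
    budgets.append(budget)
```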

Extract data from all 500 movies

Note: The dataset is not a sample from a larger dataset.

Our team was involved in the data collection without any compensation.

No ethical review process was conducted.

The dataset relates to people only insofar as they posted their ratings to IMDb.

Preprocessing / Cleaning / Labeling

To create the dataset as a CSV, we created empty arrays for every variable and appended the extracted data into these arrays for cleaning before writing the final CSV file.

Here are some methods we used to clean the data (a small sketch follows this list):

Regular expression: For movie names, we used “\n.+\n(.+)\n.+\n” to get rid of the blank space before, after, and in between the text.

rstrip: We used the string method rstrip() to remove whitespace at the end of a string for Genre and Budget.

Currency conversion: For some instances, the gross value was in another currency.

We manually converted those currencies to USD.

Converting strings to float or integer: For Duration, we first used get_text() to find the string after the tag, then converted the duration from string to float.

For Year, after extracting the years from the website with get_text(), we converted the year from string to int.

For Vote, after extracting the votes with get_text() and removing the commas (the text was formatted like a currency value, with commas), we converted the vote from string to int.

Modifying units for scaling: We divided the gross data by one million to scale it better against rating.
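A small sketch of these cleaning steps applied to raw strings of the kind the scraper returns (the sample values are made up for illustration):

```python
# Illustrative cleaning steps on made-up raw values.
raw_name = "\n1.\nCitizen Kane\n(1941)\n"
name = re.search(r"\n.+\n(.+)\n.+\n", raw_name).group(1)  # -> "Citizen Kane"

genre = "Drama, Mystery   ".rstrip()                      # drop trailing whitespace

duration = float("119 min".split()[0])                    # "119 min" -> 119.0
year = int("(1941)".strip("()"))                          # "(1941)"  -> 1941
votes = int("437,000".replace(",", ""))                   # remove thousands separators
gross_millions = 23_000_000 / 1_000_000                   # scale gross to millions of USD
```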

Uses of the Dataset

Our dataset can be accessed through this LINK.

The dataset has been used for some similar tasks analyzing movies.

There are many data analyses about movies, but some of them are outdated or not comprehensive.

For example, many analyses focus only on movies from a specific year, or specifically on genre and movie type.

Our dataset can also be used to analyze how people's taste in movies has changed over the past few years, or to predict a movie's expected gross and budget from its other attributes.

The composition of our dataset should not impact future users, and there are no tasks for which the dataset should not be used.

Conclusion

In this first part of our research, we used techniques such as data scraping and regular expressions to create the dataset.

In Part II, we will analyze the dataset we obtained from the movie list and conduct further statistical modeling and research.

This is a group project conducted by a team of four students from Cornell University: Yuri Dai, Serina Lee, Kexin Lou, and Shangzhen Wu.

The dataset was created for research purposes, and there is no associated grant or funding.
