2. Programme architecture¶
2.1. Define constants¶
The objective of constants is to update quickly behaviour of the software. For example you can display on result new kind of movie properties only by adding column name to the list CONST_KIND_PROP_LIST_COL.
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
#-----------------------------------------------
CONST_PROJECT_NAME="Project 1: TMDb movie data"
#constant used to transform year with 2 digits to 4 digits.
CONST_MAX_YEAR=2040
CONST_TEXT_MAX_LEN=25
CONST_TO_MILLION=1000000
# database column name
CONST_DATE_COL='release_date'
CONST_GENRES_COL='genres'
CONST_POPULARITY_COL='popularity'
CONST_REVENUE_COL="revenue_adj"
CONST_BUDGET_COL="budget_adj"
CONST_DIRECTOR_COL="director"
CONST_ORIGINAL_TITLE="original_title"
CONST_RELEASE_YEAR_COL='release_year'
# database column name to read for answer objectives
CONST_COL_TO_READ=[CONST_GENRES_COL,CONST_POPULARITY_COL,CONST_DATE_COL,CONST_REVENUE_COL,CONST_DIRECTOR_COL,CONST_ORIGINAL_TITLE,CONST_RELEASE_YEAR_COL,CONST_BUDGET_COL]
# computed additional column
CONST_GENRES_LIST_COL="genres_list"
CONST_DIRECTOR_LIST_COL="director_list"
CONST_SHORT_TITLE="short_title"
CONST_OCCURENCE_COL="occurence"
# List of kind properties
CONST_KIND_PROP_LIST_COL=[CONST_REVENUE_COL,CONST_GENRES_LIST_COL,CONST_DIRECTOR_LIST_COL,CONST_POPULARITY_COL,CONST_ORIGINAL_TITLE,CONST_SHORT_TITLE,CONST_RELEASE_YEAR_COL,CONST_BUDGET_COL]
# constant used to set the number of movies to display
CONST_NB_MOVIES_FOR_REVENU=10
# path to the database
CONST_DATABASE_FILEPATH=os.path.join("DATABASE","tmdb-movies.csv")
#-----------------------------------------------
2.2. Load database¶
To load database, I create dedicated function readDatabase with file path and number of rows as argument. I use the pandas function read_csv by using the folowing arguments: * encoding: to specified the encoding format: UTF-8 * parse_dates: to define the dates column * date_parser: to specify my parser to transform CSV database date format to standard format * usecols: to specify the column to load. (decrease loading time) * nrows: to specify the number of rows to load. (Used for debug)
The date format inside the database is not standard date format because the number of digit for year is only 2 ( bug of 2000 year :) ).
I have to transform 2 digits years to 4 digits year. The databae give information about movie already produced so it is impossible that the database contains movies with years upper than CONST_MAX_YEAR ( for example: 2040). So if year is higher than CONST_MAX_YEAR, I supposed that is one hundred years before.
#-----------------------------------------------
def getDate(date_texte):
"""
Tranform date format from MM/DD/YY to datetime module format
If Year is higher than CONST_MAX_YEAR, I supposed that is one hundred years before.
"""
import datetime
from dateutil.relativedelta import relativedelta
dt= datetime.datetime.strptime(date_texte, '%m/%d/%y')
if dt.year >= CONST_MAX_YEAR :
dt = dt - relativedelta(years=100)
return dt
#-----------------------------------------------
def readDatabase(filePath,nrows=None):
"""
read CSV file provided by filePath argument. the nrows allows to reduce the number of line read.
"""
return pd.read_csv(filePath,encoding = "UTF-8",parse_dates=[CONST_DATE_COL],date_parser=getDate, usecols=CONST_COL_TO_READ,nrows=nrows)
2.3. Cleaning and adapt database¶
The function adaptDatabase receive pandas dataframe and return adapted pandas dataframe.
To make easier the database analysis, I create 2 new movie properties: * release_year(CONST_YEAR_COL): contains only the movie release year. The objective is to simplified the year groupement * genres_list (CONST_GENRES_LIST_COL): contains movie genres with python list format instead text format with | separator.
#-----------------------------------------------
def splitValue(text):
"""
return list by spliting text argument based on | character.
"""
if type(text) != type(""):
return list()
else:
return text.split("|")
#-----------------------------------------------
def reduceLen(text):
"""
return string by reducing lengh and add "..." if the number of char is higher than CONST_TEXT_MAX_LEN
"""
if len(text) > CONST_TEXT_MAX_LEN:
return text[:CONST_TEXT_MAX_LEN-3]+"..."
else:
return text
#-----------------------------------------------
def adaptDatabase(database):
"""
database transformation
"""
database[CONST_GENRES_LIST_COL]=database[CONST_GENRES_COL].apply(splitValue) # create new database property for genre: convert string to list of genre
database[CONST_DIRECTOR_LIST_COL]=database[CONST_DIRECTOR_COL].apply(splitValue) # create new database property for director: convert string to list of director
database[CONST_SHORT_TITLE]=database[CONST_ORIGINAL_TITLE].apply(reduceLen) # create new database property for movie title: short title used to display it on graph
return database
2.4. Answer questions¶
2.4.1. Which genres are most popular from year to year?¶
The function searchPopularGenreByYear receive pandas dataframe and return specific dataframe .
To search the most popular genre by year, I create a new data frame by splitting genre list. For that, I split each movie by unique genre. Next, I group the data frame by year and genre by adding the popularity value.
To search the maximum popuylarity value by year, I group the data frame by year with the maximum value of popularity ( combination of idmax and loc function). The resulting data frame is returned.
The function displayPopularGenreByYear receive the data frame and display it.
#-----------------------------------------------
def createGenreDatabase(database):
"""
return new database by spliting genre.
The database parameters is: CONST_RELEASE_YEAR_COL, CONST_GENRES_COL, CONST_POPULARITY_COL, CONST_OCCURENCE_COL
"""
database= database.groupby(CONST_RELEASE_YEAR_COL)
done=0
nb_years=len(database.groups.keys())
genres_database=pd.DataFrame()
# for each movie create a data frame with by duplicate information by number of genre and merge it
for year,movies in database:
for _, movie in movies.iterrows():
tmp=pd.DataFrame({CONST_RELEASE_YEAR_COL: [str(year)]*len(movie[CONST_GENRES_LIST_COL]), CONST_GENRES_COL: movie[CONST_GENRES_LIST_COL], CONST_POPULARITY_COL: [float(movie[CONST_POPULARITY_COL])]*len(movie[CONST_GENRES_LIST_COL]), CONST_OCCURENCE_COL: [float(1.0)]*len(movie[CONST_GENRES_LIST_COL])})
genres_database=genres_database.append(tmp,ignore_index = True)
done+=1
# print("split genres data => done for ",year,percent(done,nb_years))
return genres_database
#-----------------------------------------------
def searchPopularGenreByYear(database):
"""
return new dataframe containing foreach year the genre and popularity who have the maximun popularity
"""
genres_popularity=database.groupby([CONST_RELEASE_YEAR_COL, CONST_GENRES_COL])[CONST_POPULARITY_COL].sum() # group database by year and genre and realise sum of popularity
genres_popularity=genres_popularity.loc[genres_popularity.groupby([CONST_RELEASE_YEAR_COL]).idxmax(0)] # identify the index of maximum popularity ( index 0) by year and use it to create database containing all information
return genres_popularity
#-----------------------------------------------
def displayPopularGenreByYear(data):
"""
Create a bar chart to display the maximum popularity and genre by year.
"""
print("Most popular movie genre by year:")
print(data)
fig, axs = plt.subplots(1,1)
fig.canvas.set_window_title(CONST_PROJECT_NAME)
data.unstack().plot.bar(stacked=True,title='Popular genre by year',ax=axs)
axs.set_ylabel('Popularity')
axs.set_xlabel('Years')
plt.tight_layout()
2.4.2. What is the genres division by years ?¶
The function searchGenreDivisionbyYear receive pandas dataframe and return specific dataframe containing for each year and genre the occurence.
To search the genre occurence by year, I group the dataframe by year and genre, and, I realise the sum of occurence.
To search the maximum popuylarity value by year, I group the data frame by year with the maximum value of popularity ( combination of idmax and loc function). The resulting data frame is returned.
The function displayGenreDivisionbyYear receive the data frame and display it.
#-----------------------------------------------
def displayPopularGenreByYear(data):
"""
Create a bar chart to display the maximum popularity and genre by year.
"""
print("Most popular movie genre by year:")
print(data)
fig, axs = plt.subplots(1,1)
fig.canvas.set_window_title(CONST_PROJECT_NAME)
data.unstack().plot.bar(stacked=True,title='Popular genre by year',ax=axs)
axs.set_ylabel('Popularity')
axs.set_xlabel('Years')
plt.tight_layout()
#-----------------------------------------------
def searchGenreDivisionbyYear(database):
"""
return new dataframe containing foreach year the genre occurence.
"""
genres_division=database.groupby([CONST_RELEASE_YEAR_COL,CONST_GENRES_COL])[CONST_OCCURENCE_COL].sum() # group database by year and genre and realise sum of occurence
return genres_division
#-----------------------------------------------
2.4.3. What kinds of properties are associated with movies that have high revenues?¶
The function searchPropertiesForHighRevenue receive pandas dataframe and the number of movies to record. It returns data frame.
To search kind properties for high revenue movies, I sort data frame by descending revenue. So, to get “nb_movie” high revenu, I used the data frame “head” function.
To be more powerfull, the function can receive a negative value for number of movies to get the properties of lower revenue.
The function displayPropertiesByRevenue receive the data frame and display it. The sub function addInformation allow to customize the graph to add information like: popularity, release year, director and budget. The goal is to improve analisys.
Note
This customization is dependant to the graph size and dpi. An evolution should be to adapt this function to remove this dependencies.
#-----------------------------------------------
def searchPropertiesForRevenue(database,nb_movie):
"""
return extract of data frame with nb_movie value.
If nb_movie is positive, it is the nb_movie higher revenue.
If nb_movie is negative, it is the nb_movie lower revenue.
"""
result = database.sort_values(by=[CONST_REVENUE_COL],ascending=False)
if nb_movie > 0:
return result[CONST_KIND_PROP_LIST_COL].head(nb_movie)
else:
return result[CONST_KIND_PROP_LIST_COL].tail(-1*nb_movie)
#-----------------------------------------------
def displayPropertiesByRevenue(movies,title):
"""
display on stdout CONST_KIND_PROP_LIST_COL properties for each movie
display bar chart of movies to compare movies based on revenue
"""
print(str("{}").format(title))
for _, movie in movies.iterrows():
text=list()
for col in CONST_KIND_PROP_LIST_COL:
text.append(str("{}:{}").format(col,movie[col]))
print(str(" ").join(text))
#-----------------------------------------------
def addInformation(rects,ax,pop,year,director,budget):
"""
internal function allowing to improve bar chart with more information like
popularity division, revenu value, budget value, released year and director
"""
fake_bar = [plt.bar([0], [0], color=(0,0,1,0.8)),
plt.bar([0], [0], color=(0,0,1,0.05))]
ax.legend(fake_bar, ['High popularity', 'Low popularity'])
tot_pop=sum(pop)
index=0
for rect in rects:
height = rect.get_height()
width = rect.get_width()
x = rect.get_x()
# manage color transparency of bar to inform user about the popularity.
rect.set_color((0.0, 0.0, 1.0, pop[index]/tot_pop))
# Add revenue et budget information to make easier the comparision
ax.annotate(str("{0:.0f}M\$ / {1:.0f}M\$").format(height,budget[index]),
xy=(x + width / 2, height),
xytext=(0, 5),
textcoords="offset points",
ha='center', va='bottom')
# Add Year information
if height > 0: # Add this contraint for display only year if threre are a bar chart.
ax.annotate(str("{0}").format(year[index]),
xy=(x + width / 2, height),
xytext=(0, -20),
textcoords="offset points",
ha='center', va='bottom')
# Add director information
ax.annotate(str("{0}").format(str(",").join(director[index])[:CONST_TEXT_MAX_LEN]),
xy=(x + width / 2, 20),
xytext=(0, 5),
textcoords="offset points",
ha='center', va='bottom',rotation=90)
index+=1
#-----------------------------------------------
fig, axs = plt.subplots(1,1)
fig.canvas.set_window_title(CONST_PROJECT_NAME)
mov_plot=axs.bar(movies[CONST_SHORT_TITLE], movies[CONST_REVENUE_COL]/CONST_TO_MILLION)
addInformation(mov_plot,axs,movies[CONST_POPULARITY_COL].tolist(),movies[CONST_RELEASE_YEAR_COL].tolist(),movies[CONST_DIRECTOR_LIST_COL].tolist(),(movies[CONST_BUDGET_COL]/CONST_TO_MILLION).tolist())
axs.set_ylabel('Revenue [M$]')
axs.set_xlabel('Movies')
axs.set_title(str('{} (Revenue / Budget)').format(title))
axs.set_ylim(ymin=0)
plt.setp(axs.get_xticklabels(), rotation = 20)
#-----------------------------------------------
2.5. Main program¶
The main program call step by step each function. The part with “__main__” allows to import this python file inside other programme to used function.
Note
A way to improve time comsumption is to create two thread to answer questions after loading database.
#-----------------------------------------------
def main():
print("read CSV")
DataBase=readDatabase(CONST_DATABASE_FILEPATH,nrows=None)
print("adapt database")
DataBase=adaptDatabase(DataBase)
GenreDatabase=createGenreDatabase(DataBase)
print("Search Popular Genre by year:")
popGenreByYear=searchPopularGenreByYear(GenreDatabase)
displayPopularGenreByYear(popGenreByYear)
print()
print("Search Division Genre by year:")
DivGenreByYear=searchGenreDivisionbyYear(GenreDatabase)
displayGenreDivisionbyYear(DivGenreByYear)
print()
print("Search kind properties for high revenue movies:")
popForRev=searchPropertiesForRevenue(DataBase,CONST_NB_MOVIES_FOR_REVENU)
displayPropertiesByRevenue(popForRev,str("Kind properties for {} higher revenue").format(CONST_NB_MOVIES_FOR_REVENU))
print()
print("Search kind properties for Lower revenue movies:")
popForRev=searchPropertiesForRevenue(DataBase,-1*CONST_NB_MOVIES_FOR_REVENU)
displayPropertiesByRevenue(popForRev,str("Kind properties for {} lower revenue").format(CONST_NB_MOVIES_FOR_REVENU))
plt.show()
#-----------------------------------------------
if __name__ == '__main__':
main()
#-----------------------------------------------