Indian Regional Movie Dataset (IRM Dataset) is a database of regional movies geographically attributed to the Indian Subcontinent. The database was introduced in early 2018 by IIIT Delhi in “Indian Regional Movie Dataset for Recommender Systems” [1]. The database introduces various user demographics for the native viewership. It was the first dataset provided exclusively for the Indian Subcontinent. This paper aims to design a recommender system inspired by the IRM Dataset, using Collaborative Filtering.
Collaborative Filtering is a method of making automatic predictions of a user’s choice and interests by taking into account similar metrics provided and exhibited by many other users. This is the essence of Collaboration. The Filtering process refers to the fine tuning of the collected data such that it would make a close enough prediction to the user’s actual interests. Combining these, we can create a system that collects data from ‘n’ users over ‘v’ items of interest and puts them together to get a predictive, systematic rating for any given user.
The collaborative work requires the collection of data from a dataset. We scraped and combined our data from previous work using basic web scraping. The dataset contains the details needed for a movie recommender system using CF. The movie data included in the file was restricted to several relevant fields such as genre, average critic rating, story type and language. The movies were combined into a single dataset in csv format, covering over 2900 movies.
The filtering process uses a learning algorithm based on Non-Negative Matrix Factorization (NNMF or NMF). This is a matrix factorization technique in which two factor matrices P and Q are multiplied to reach a final matrix R that is similar to the original matrix, but whose entries give a predictive rating based on the user’s rating habits and interests. The model presented uses NMF with an unsupervised, Stochastic Gradient Descent (SGD) based learning procedure to carry out the filtering over the dataset. The returned matrix is then normalized to reach a predictive rating (out of 5) for all the unrated movies ‘v’ for a user ‘u’.
The results yielded a higher predictive rating with a better sparsity combination than Singular Value Decomposition (SVD) and Cosine Similarity metrics, which are also used in CF techniques. The normalization of NMF favors high sparsity and low data density, which suits collaborative data collection, since not all users will have rated all given items and every user is expected to have at least 1 unrated item, which is predicted by the recommender system. The ratings predicted by NMF can be used to make suggestions to users, as they are a good approximation of the rating the user would have given to the movie.
Machine Learning, Collaborative Filtering, Recommender Systems, Crowd-sourcing, Data Scraping, Movie Dataset, NMF, SVD, Similarity metrics
Due to the rise of easy-access internet and high-speed connectivity, e-commerce has flourished in the past few years, making it highly dynamic in gaining users and traction [1]. E-commerce offers convenience and even doorstep delivery, which is highly time-efficient. E-commerce applications provide an opportunity to use the web as a potential business accelerator. But with the increasing traction and users, the amount of data collected is also huge. Data, no matter how abundant, is ambiguous when it can be represented in multiple variants. Without an efficient way to classify the data and extract only the useful information, a potent resource is wasted. Filtering by human intervention is tedious and hard to manage, which is why we deploy learning algorithms.
Machine learning is a domain of computer science in which a machine learns without being explicitly programmed. The idea is to train a machine to learn by itself from the data given to it and then perform a required operation. The internet is a vast source of information today, and machines can be given data from the internet to learn from. Today machine learning is used across various domains, from smartphones to large companies performing statistical analysis of their data. Machine learning has an application in almost every field: it is being used in genome mapping, weather analysis, personal assistants, search engines, stock market analysis, image and voice recognition, and the list goes on. One of the fields in which machine learning is used is recommendation systems.
Recommendation systems [1,5,10,11,12] are systems used to predict what kind of products a user is most likely to prefer, based on the previous choices made by the user. Products recommended to users include movies, e-commerce products, restaurants, articles of various kinds such as news, friend suggestions on social media and so on. Each user has different preferences, and these have to be taken into account when predicting a product for the user. This is the essence of a recommendation system.
The recommendation system has mainly two techniques that are involved in it. These techniques are :
1. Content based filtering
2. Collaborative filtering (CF) [7,13]
In a content-based recommendation system there is a user and the items or products that the user prefers. The system predicts products that the user will like based on their similarity to products that the user has selected, used and rated well. The problem with such a system is that it only makes recommendations within the same field and cannot generate predictions across a different field or domain. Collaborative filtering is a technique that tries to solve this problem. It works on the basis of the users: it uses the similarity between users’ choices and preferences to recommend new items to users. Collaborative filtering is done in two ways :
1. User based: the similarity between two users is calculated to predict the items a user is most likely to prefer, based on commonalities between the two users.
2. Item based: the similarities between items are used to make predictions for users, based on a common parameter such as ratings.
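As an illustration of the user-based approach, the sketch below predicts a rating as a similarity-weighted average of the ratings that other users gave to the same item. The function name `predict_user_based`, the toy ratings and the precomputed similarity values are assumptions made for illustration; they are not taken from the dataset or from the proposed system.

```python
def predict_user_based(ratings, sims, user, item):
    """Similarity-weighted average of other users' ratings for `item`.

    ratings: dict user -> dict item -> rating (only rated items are present)
    sims:    dict (user, other_user) -> similarity in [0, 1]
    Returns None when no similar user has rated the item.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for other, their_ratings in ratings.items():
        if other == user or item not in their_ratings:
            continue
        weight = sims.get((user, other), 0.0)
        weighted_sum += weight * their_ratings[item]
        weight_total += weight
    return weighted_sum / weight_total if weight_total > 0 else None


# Hypothetical toy data: u1 has not rated m3.
ratings = {
    "u1": {"m1": 5, "m2": 3},
    "u2": {"m1": 4, "m2": 3, "m3": 4},
    "u3": {"m1": 1, "m3": 2},
}
sims = {("u1", "u2"): 0.9, ("u1", "u3"): 0.1}

prediction = predict_user_based(ratings, sims, "u1", "m3")
print(prediction)
```

Because u2 is far more similar to u1 than u3 is, the prediction lands close to u2's rating. An item-based variant would swap the roles of users and items.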
This paper focuses on algorithmic recommendation performance in memory-based CF systems for movie recommendation. The basis of CF is similarity calculation and evaluation, using either a user-user metric or an item-item metric. The comparison metric for this paper is user-user, where generic and traditional similarity measures are the Pearson Correlation Coefficient, Cosine Similarity, Mean Squared Difference and Root Mean Squared Error. However, because of the data sparsity caused by cold users, similar users cannot be identified with a high degree of accuracy. Since CF relies on collaboration and filtering, it needs reliably similar users to perform the filtering. Lower prediction accuracy also weakens the recommendations produced through filtering, which is why NMF is used.
NMF (non-negative matrix factorisation) is a linear algebra technique. It factorises the rating matrix to obtain predictions for items that users have not previously rated. NMF factorises a given matrix into two non-negative matrices whose product approximates the original; the entries of the product at unrated positions serve as the predicted ratings. The factors are learned with stochastic gradient descent, a technique in which each iteration adjusts the factor values to reduce the error between the known ratings and their reconstruction.
We formed a matrix of users versus movies and filled it with the ratings users have given to movies. The user-to-movie matrix also acts as a 1/0 indicator, where 1 denotes that the user has watched (rated) a movie and 0 denotes an item the user has not rated. Movies which are not rated are therefore marked as ‘0’.
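The construction of the user-movie matrix can be sketched as follows. This is a minimal illustration that assumes ratings arrive as `(user_index, movie_index, rating)` triples; the function name `build_rating_matrix` and the toy triples are hypothetical.

```python
def build_rating_matrix(triples, n_users, n_movies):
    """Return an n_users x n_movies list of lists; 0 marks 'not rated'."""
    matrix = [[0] * n_movies for _ in range(n_users)]
    for user, movie, rating in triples:
        matrix[user][movie] = rating
    return matrix


# Hypothetical toy ratings: (user index, movie index, rating out of 5).
triples = [(0, 0, 4), (0, 2, 5), (1, 1, 3), (2, 0, 2)]
R = build_rating_matrix(triples, n_users=3, n_movies=3)

# Binary watched/not-watched view derived from the same matrix.
watched = [[1 if r > 0 else 0 for r in row] for row in R]
```

Keeping unrated entries at 0 makes it easy to tell observed ratings apart from the positions the recommender has to predict.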
Related Work :
Collaborative Filtering, as a personalized recommendation technique, has inter-domain applications. However, CF also has a few issues, namely the Cold Start Problem, Scalability, Data Sparsity, etc. Cold start due to new users and high data sparsity are common issues in a movie recommendation system. These issues undermine the user-user preference and experience. Using the NMF model, however, data sparsity and scalability are taken into account to provide a more accurate rating and recommendation according to preferences. Since only a limited set of ratings is available, the collected data is very sparse: each user rates only a very small number of items in the given group. There has also been research into fine-tuned accuracy models and, thereby, solutions to the cold start problem.
Recommendation Systems developed for Movie Recommendation can have multiple metrics, one of which is User similarity. The following User Similarity models have been used and compared against in the proposed system for optimum performance in terms of scaling and accuracy:
The proposed system is built entirely on NMF, with regularization. Feature selection techniques (like regularization [6]) are an exercise in finding and selecting those features that are significant in signaling a given target variable. This method is useful when the dataset offers a plethora of features, not all of which are useful [6].
MovieLens and Netflix are online web-based portals for movie recommendations. They also use user-user similarity over the film preferences of their users by maintaining a mapping of movie ratings and genres. Employing CF techniques, they are able to make movie recommendations. The Computer Science and Engineering Department at the University of Minnesota, which is home to GroupLens Research, was responsible for MovieLens’ creation in 1997 [1]. The primary goal was to collect data for research and analysis, to provide personalized recommendations over the dataset. The Indian Regional Cinema (IRC) Dataset is inspired by MovieLens. The IRC Dataset, along with an IMDb-inspired dataset, is used for testing the proposed recommendation system. MovieLens has released 4 datasets, the 100k, 1M, 10M and 20M, in which the users and movies are represented with integer IDs, while ratings run up to 5 (in whole-star steps in the 100k and 1M sets and in steps of 0.5 in the 10M and 20M sets). The inspiration behind using an IRC dataset is the fact that the MovieLens datasets largely cover Hollywood movies and TV series, and their viewers. This implies that user groups preferring Indian cinema were a different, unaccounted-for demographic. Quite a few learning techniques, both unsupervised and supervised, have been deployed on the MovieLens dataset for Collaborative Filtering. Proposed System part (c) talks more about the techniques used for CF over our collected dataset.
Proposed System : The proposed system runs over the collected IRC dataset, which contributes to the algorithm chosen for CF’s Movie Recommendation. The System can be broken down into 4 distinctive parts :
Scraping : The scraped IRC dataset is a combination of movies and their user ratings collected up to June 2018. This dataset [8] covers movies in 18 different regional languages and a variety of user ratings for them. The users span varying demographics, and the collection holds over 2900 movies across multiple genres. The dataset has over 10k ratings from over 900 users.
Metadata Information :
The data for movies has been scraped from IMDb [2]. It boasts a collection of Indian movies spanning the regional languages and multiple genres. Each movie is associated with multiple data fields, of which we have extracted only a select few for our experimental dataset:
Description : Movie’s plot description for users
Language : Language(s) used in the movie. A movie may have been released in multiple regional languages.
Rating Count : Taken from IMDb, to judge a Movie’s popularity on a scale of 10
Director : Director of the movie
Genre : It can be one or multiple out of 20 genres available on IMDb.
Name : Release Name of the movie (as per IMDb)
A distribution of languages, genres, etc. is represented in subsequent parts of the proposed model [1,2].
Pre-Processing: The given dataset contained numerous fields, not all of which are equally relevant. Pre-processing included the removal of duplicate user fields, film id, release date, etc., to extract only the metadata discussed in Scraping. This was done by splitting the csv-formatted data and extracting the relevant columns with a simple extraction script.
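A column-wise extraction of this kind might look like the sketch below, which uses Python's standard `csv` module. The column names in `KEEP`, the inline sample data and the function name `extract_columns` are assumptions for illustration; the actual file layout and script are not given in this paper.

```python
import csv
import io

# Hypothetical names for the metadata columns to keep.
KEEP = ["name", "genre", "language", "director"]


def extract_columns(csv_text, keep):
    """Keep only the `keep` columns and drop rows that repeat exactly."""
    reader = csv.DictReader(io.StringIO(csv_text))
    seen = set()
    rows = []
    for row in reader:
        record = tuple(row[col].strip() for col in keep)
        if record in seen:  # duplicate entry, skip it
            continue
        seen.add(record)
        rows.append(dict(zip(keep, record)))
    return rows


# Hypothetical sample standing in for the scraped csv file.
raw = """film_id,name,genre,language,director,release_date
1,Movie A,Drama,Hindi,Director X,2017-01-01
2,Movie B,Comedy|Romance,Tamil,Director Y,2018-03-05
3,Movie A,Drama,Hindi,Director X,2017-01-01
"""

cleaned = extract_columns(raw, KEEP)
```

For a file on disk, the same function can be fed the file contents read with `open(path, newline="").read()`.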
Model: The proposed model uses cosine similarity as basic user-user similarity calculation, and NMF at its core to provide a predictive rating and hence, recommendation. NMF is also regularized to avoid any noise. This scheme is evaluated against an SVD scheme, which is also a common CF technique for high dimensionality filtering and recommendation.
The techniques are explained below briefly :
Cosine Similarity : This is a measure of similarity between any two given vectors. CS evaluates the similarity between two non-zero vectors by measuring the cosine of the angle θ between them. Since rating vectors are non-negative, θ lies between 0 and 0.5π, so the similarity always lies between 0 and 1 (the cosine of 0 is 1 and the cosine of 0.5π is 0). This method works best when each user has an equal number of evaluations. Users are termed ‘similar’ if their CS value is close to 1; ‘dissimilar’ users have a CS value closer to 0. Cosine Similarity is calculated as :
$$\cos\theta = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}$$
Where a and b are the vectors to be compared, in this case two users’ preference lists of items [4].
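The cosine similarity between two users' rating vectors can be computed as in the sketch below. The rating vectors are hypothetical, and returning 0 for an all-zero vector is a convention assumed here, since the measure is undefined in that case.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length rating vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumed convention for users with no ratings
    return dot / (norm_a * norm_b)


# Hypothetical rating vectors over the same four movies (0 = not rated).
u1 = [5, 3, 0, 4]
u2 = [4, 3, 0, 5]
u3 = [0, 0, 5, 0]

sim_close = cosine_similarity(u1, u2)  # users with overlapping tastes
sim_far = cosine_similarity(u1, u3)    # users with no movie in common
```

Users who rated the same movies similarly score near 1, while users with no rated movie in common score 0.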
This has been used as a Baseline to develop and compress our similarity metric for user-user preference. A deeper, refined technique with Stochastic Gradient Descent, Non Negative Matrix Factorization and Regularization was employed to reach a more accurate prediction.
NMF : Non-Negative Matrix Factorization [13] is a method where a parent matrix is factorized into 2 matrices such that none of the three matrices has any negative elements. This makes inspection of the values easier.
We use regularization to avoid overfitting the data. NMF is preferred over SVD in a movie recommendation system since NMF works better when sparsity is high, and it also supports a system with a larger number of cold users. Also, to provide a direct and reasonably accurate rating for a user, we have scaled the matrix values out of 5: the highest value in the matrix is mapped to 5 and the remaining values are scaled relative to it.
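The combination of NMF, SGD, regularization and scaling described above can be sketched in a few lines of Python. The exact update rule is not spelled out in this paper, so this sketch assumes a projected SGD step (an L2-regularized gradient update followed by clipping at zero to keep the factors non-negative). The hyperparameters `k`, `steps`, `lr` and `reg`, the function names and the toy matrix are all illustrative assumptions.

```python
import random


def nmf_sgd(R, k=2, steps=300, lr=0.01, reg=0.02, seed=0):
    """Factorize R (0 = unrated) into non-negative factors P and Q."""
    rng = random.Random(seed)
    n_users, n_items = len(R), len(R[0])
    P = [[rng.random() for _ in range(k)] for _ in range(n_users)]
    Q = [[rng.random() for _ in range(k)] for _ in range(n_items)]
    observed = [(u, i) for u in range(n_users)
                for i in range(n_items) if R[u][i] > 0]
    for _ in range(steps):
        rng.shuffle(observed)  # stochastic: visit ratings in random order
        for u, i in observed:
            pred = sum(P[u][f] * Q[i][f] for f in range(k))
            err = R[u][i] - pred
            for f in range(k):
                p, q = P[u][f], Q[i][f]
                # Regularized gradient step, then clip at 0 (projection).
                P[u][f] = max(0.0, p + lr * (err * q - reg * p))
                Q[i][f] = max(0.0, q + lr * (err * p - reg * q))
    return P, Q


def predict_all(P, Q):
    """Full predicted matrix: the product of P and the transpose of Q."""
    return [[sum(pf * qf for pf, qf in zip(p, q)) for q in Q] for p in P]


def scale_to_five(M):
    """Map the largest entry to 5 and scale the rest relative to it."""
    top = max(max(row) for row in M)
    return [[5.0 * v / top for v in row] for row in M] if top > 0 else M


def observed_mae(R, P, Q):
    """Mean absolute error over the rated entries only."""
    errors = [abs(R[u][i] - sum(pf * qf for pf, qf in zip(P[u], Q[i])))
              for u in range(len(R)) for i in range(len(R[0])) if R[u][i] > 0]
    return sum(errors) / len(errors)


# Hypothetical 4-user x 4-movie rating matrix.
R = [[5, 3, 0, 1],
     [4, 0, 0, 1],
     [1, 1, 0, 5],
     [0, 1, 5, 4]]

P, Q = nmf_sgd(R)
predicted = scale_to_five(predict_all(P, Q))
print(round(observed_mae(R, P, Q), 3))
```

Only the rated entries drive the updates, so the zeros in `R` are treated as missing values rather than as low ratings. The entries of `predicted` at those positions are the recommendations.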
SVD : Singular Value Decomposition, popular in recent CF systems, is a method that decomposes a matrix into 3 other matrices [9]:
$$A = U S V^{T}$$
A is an m x n matrix, U is an m x n matrix with orthonormal columns, S is a diagonal n x n matrix and V is an n x n orthogonal matrix, in the context where m > n or m = n. This can also be expressed as a summation over the singular values s_i and the corresponding columns u_i of U and v_i of V:
$$A = \sum_{i=1}^{n} s_i \, u_i v_i^{T}$$
The singular values on the diagonal of S are normally arranged in descending order. The method is used to establish high data density and reduce dimensionality, i.e. reduce the number of features to be used in the analysis of data. A higher rank ‘M’ matrix is reduced to a lower rank ‘K’ matrix while retaining the required information. Variable dimensionalities are reduced to a single ‘K’ rank matrix. It means that we can take a list of R unique vectors and approximate them as linear combinations of K unique vectors, so that each feature of the original data is represented by a linear equation, removing the need to select multiple equations for the same inference while reducing running-time complexity.
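The rank reduction described above can be written out as a truncated sum. This is the standard low-rank (Eckart-Young) form, given here as a sketch; the symbols s_i, u_i and v_i denote the singular values and the corresponding columns of U and V, and are assumed notation where the text does not define them.

```latex
A \approx A_K = \sum_{i=1}^{K} s_i \, u_i v_i^{T},
\qquad s_1 \ge s_2 \ge \dots \ge s_K \ge 0,
\qquad K < \operatorname{rank}(A)

\lVert A - A_K \rVert_F^{2} = \sum_{i > K} s_i^{2}
```

The second line shows why ordering the singular values matters: the approximation error is exactly the energy in the discarded singular values, so keeping the K largest loses the least information.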
To better understand the data provided by the dataset, we visualise it by plotting graphs of the parameters present in the dataset. The parameters taken into consideration include genre, age, state and language. Each of these parameters helps us understand the preferences of a user; by knowing their distribution, we can better predict the movie a user would prefer. We extract the data from the dataset, which is in a csv file, and use it to form the distribution tables from which the graphs are plotted. The graphs are plotted in an IPython notebook using pyplot from matplotlib, a Python plotting library that supports various forms of graphs such as line charts, bar graphs, histograms and pie charts. The NMF error is plotted as a line chart to track its decrease, the genre distribution is plotted as a bar graph to show which kinds of movies are watched more often, and the remaining parameters are plotted as pie charts to show their distributions as percentages.
For recommendations to give us meaningful statistics, it is important to separate the individual factors that might influence a user’s rating for a particular movie.
Factors most common and most influential include :
Languages : Languages known by a user
State : Home state of the user
Another meaningful data-field is the genre distribution of movies over the data-set.
The Genres have been subdivided into 20 categories, as per IMDb’s support for categories. Also note that a movie may belong to more than one genre.
A user’s distribution can also be provided on the basis of Gender and Age, so as to create a clear category preferred by each sub-group.
Experiment and Results:
The dataset taken into consideration has about 2944 movies, with parameters that include name, genres, cast, director, writer, rating, release date, language and description. We also have a dataset of 925 users, with parameters such as gender, date of birth, state, job and language. Our dataset is compared with the MovieLens datasets using NMF together with cosine similarity, which allows us to identify people with similar preferences. This gives us a better understanding of people with similar choices and lets us suggest movies that a user has not watched but would like. We also compare SVD with our NMF method, which allows us to identify that NMF provides better results.
Each of the 2 datasets – Movies and Users – has different metadata considerations.
The Movie dataset uses data-fields for genre, rating and language. These also help in the filtering process with respect to cosine similarity: the cosine similarity for user ratings is computed over the data-fields extracted. Each movie has an average critic rating, and we have kept a table to track the number of movies along with the number of users.
The User dataset provides language, state and date of birth as data-fields. Combined with the movie dataset, these increase the filtering accuracy. The state field allows us to monitor and infer the type or genre of movies preferred by a certain group of people. The date of birth is used to derive the user’s age, so that age-restricted content can be handled appropriately for younger users.
Comparison of Datasets
Dataset | Movie Count | User Count | Rating Count | Sparsity (%) | Release Date
Our Dataset | 2945 | 925 | 10,000 | 99.96 | 2018
MovieLens 100k | 1700 | 1000 | 100,000 | 99.94 | 4/1998
MovieLens 1M | 6000 | 4000 | 1,000,000 | 99.96 | 2/2003
The genres of the movies are plotted in a bar graph, which shows that although movies may have multiple genres, the major genres are drama, romance and comedy.
Language is recorded for both movies and users. The following table gives a count of both users and movies for each language. A movie may be released in more than one language; this is tracked by incrementing the count for each language it was produced in. The number of users watching movies in a particular language is also shown in the table.
Language | Movies | Users
Hindi | 704 | 908
Bengali | 582 | 29
Urdu | 129 | 24
Bhojpuri | 26 | 22
Rajasthani | 18 | 14
Punjabi | 150 | 79
Tamil | 314 | 30
Telugu | 338 | 18
Kannada | 305 | 11
Malayalam | 346 | 16
Gujarati | 49 | 7
Haryanvi | 3 | 18
Marathi | 203 | 14
Konkani | 6 | 4
Manipuri | 8 | 4
Oriya | 98 | 6
Nepali | 51 | 9
Assamese | 22 | 10
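The per-language counting described above, where a multi-language movie increments every language it was released in, can be sketched as follows. The movie list and the function name `count_languages` are hypothetical; the figures in the table come from the actual dataset, not from this example.

```python
from collections import Counter


def count_languages(movies):
    """Count movies per language; a multi-language movie counts once per language."""
    counts = Counter()
    for languages in movies:
        for language in set(languages):  # guard against repeated entries
            counts[language] += 1
    return counts


# Hypothetical language lists, one per movie.
movies = [
    ["Hindi"],
    ["Tamil", "Telugu"],
    ["Hindi", "Punjabi"],
    ["Hindi", "Hindi"],  # repeated language is counted only once
]

language_counts = count_languages(movies)
```

Because of multi-language releases, the per-language counts can add up to more than the number of movies. The same counting applies to genres, where a movie may also belong to more than one category.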
The age distribution groups users into age ranges and helps us understand which age groups prefer to watch a specific subset of movies. The age ranges also help to infer any age restriction on movie content.
The learning algorithm used in the paper is backed by Stochastic Gradient Descent, wherein each iteration moves the error closer to its expected value. As a result, an iterative decrease in error is recorded with respect to mean absolute error. This can be seen in the plot of error against iterations. Increasing the number of iterations fine-tunes the model, but we found an iteration limit of 30 sufficient for a stable recommendation on small experimental data. A higher iteration count yields only a minimal decrease in error, which is negligible.
The proposed model uses NMF and not SVD [9], since SVD is mainly used in data mining [14] to recognize important patterns, increase scalability and maintain a dynamic and efficient database for recommendation. However, we have used feature selection techniques like regularization, which avoid overfitting and are computationally easier for a highly sparse dataset.
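The two error measures used in the evaluation, mean absolute error (MAE) and root mean squared error (RMSE), can be computed as in the sketch below. The rating lists are hypothetical and serve only to show the calculation.

```python
import math


def mae(actual, predicted):
    """Mean absolute error between paired ratings."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error between paired ratings."""
    squared = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(squared / len(actual))


# Hypothetical held-out ratings and the model's predictions for them.
actual_ratings = [4, 3, 5, 2]
predicted_ratings = [3.5, 3.0, 4.0, 3.0]

print(mae(actual_ratings, predicted_ratings))
print(rmse(actual_ratings, predicted_ratings))
```

RMSE penalizes large individual errors more heavily than MAE, so it is never smaller than MAE on the same data.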
Technique | Our Dataset MAE | Our Dataset RMSE | MovieLens 100k MAE | MovieLens 100k RMSE | MovieLens 1M MAE | MovieLens 1M RMSE
User-User Similarity | 0.5872 | 1.0139 | 0.6980 | 1.026 | 0.607 | 0.8810
Item-Item Similarity | 0.4698 | 0.9668 | 0.744 | 1.061 | 0.671 | 0.9196
NMF | 0.5485 | 1.1220 | 0.828 | 1.128 | 0.6863 | 0.8790
For the Movie Dataset, where the number of cold users is high, the sparsity increases, making NMF the better choice provided we use Regularization.
NMF also takes care of negative scaling and has bias based fine tuning, making it a better fit to the IRM Dataset for Recommendations.
1 – Indian Regional Movie Dataset for Recommender Systems, IIIT Delhi, Jan. 2018
2 – IMDb – http://www.imdb.com/
3 – Scraping – https://www.digitalocean.com/community/tutorials/how-to-scrape-web-pages-with-beautiful-soup-and-python-3
4 – Cosine Similarity – http://blog.christianperone.com/2013/09/machine-learning-cosine-similarity-for-vector-space-models-part-iii/
5 – M. F. Hornick and P. Tamayo, Extending recommender systems for disjoint user/item sets: The conference recommendation problem, IEEE Trans. Knowl. Data Eng., vol. 24, no. 8, pp. 1478-1490, Aug. 2012.
6 – Regularization: https://towardsdatascience.com/regularization-in-machine-learning-76441ddcf99a
7 – Collaborative Filtering : http://recommender-systems.org/collaborative-filtering/
8 – MovieLens : https://movielens.org/
9 – SVD : https://en.wikipedia.org/wiki/Singular-value_decomposition
10 – Q. Liu, E. Chen, H. Xiong, Y. Ge, Z. Li and X. Wu, A cocktail approach for travel package recommendation, IEEE Trans. Knowl. Data Eng., vol. 26, no. 2, pp. 278-293, Feb. 2014.
11 – Y. Koren and R. Bell, Advances in collaborative filtering, in Recommender Systems Handbook. New York, NY, USA: Springer, 2011, pp. 145-186.
12 – A. Gogna and A. Majumdar, A Comprehensive Recommender System Model: Improving Accuracy for Both Warm and Cold Start Users, in IEEE Access, vol. 3, pp. 2803-2813, 2015.
13 – Q. Gu, J. Zhou and C. Ding, Collaborative filtering: Weighted nonnegative matrix factorization incorporating user and item graphs, in Proc. SDM, 2010, pp. 199-210.
14 – S.-T. Park, D. Pennock, O. Madani, N. Good and D. DeCoste, Naive filterbots for robust cold-start recommendations, in Proc. 12th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2006, pp. 699-705.