Using the Surprise Framework in Python
One way to explore the concepts behind recommendation engines is to use the Surprise framework (http://surpriselib.com/). The framework has several handy features: built-in data sets, MovieLens (https://grouplens.org/datasets/movielens/) and Jester; SVD and other common algorithms, along with similarity measures; and tools to evaluate recommendation performance, such as root mean squared error (RMSE) and mean absolute error (MAE), as well as the time it took to train the model.
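As a quick illustration of those evaluation tools, here is a minimal sketch (adapted from Surprise's getting-started documentation) that cross-validates the built-in SVD algorithm on the MovieLens 100k data set, which Surprise offers to download on first use, and reports RMSE, MAE, and fit/test times per fold:

from surprise import SVD, Dataset
from surprise.model_selection import cross_validate

# Load the built-in MovieLens 100k data set (downloads on first use).
data = Dataset.load_builtin('ml-100k')

# Run 5-fold cross-validation and print RMSE, MAE, and timings per fold.
algo = SVD()
cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)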
Here is an example of how it can be used in a pseudo-production situation by tweaking one of the provided examples.
First come the necessary imports to load the library.
In [2]: import io
   ...: from surprise import KNNBaseline
   ...: from surprise import Dataset
   ...: from surprise import get_dataset_dir
   ...: import pandas as pd
A helper function is created to convert IDs to names.
In [3]: def read_item_names():
   ...:     """Read the u.item file from MovieLens 100-k dataset and return two
   ...:     mappings to convert raw ids into movie names and movie names into raw ids.
   ...:     """
   ...:
   ...:     file_name = get_dataset_dir() + '/ml-100k/ml-100k/u.item'
   ...:     rid_to_name = {}
   ...:     name_to_rid = {}
   ...:     with io.open(file_name, 'r', encoding='ISO-8859-1') as f:
   ...:         for line in f:
   ...:             line = line.split('|')
   ...:             rid_to_name[line[0]] = line[1]
   ...:             name_to_rid[line[1]] = line[0]
   ...:
   ...:     return rid_to_name, name_to_rid
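Note that pandas is imported above but not used by this helper. For reference, a roughly equivalent pair of mappings could be built with pandas; this is only a sketch, assuming the same pipe-delimited u.item layout (column 0 holds the raw id, column 1 the movie name):

# A rough pandas equivalent of read_item_names(); a sketch assuming
# the same u.item location, encoding, and column layout as above.
file_name = get_dataset_dir() + '/ml-100k/ml-100k/u.item'
df = pd.read_csv(file_name, sep='|', encoding='ISO-8859-1',
                 header=None, dtype=str)
rid_to_name = dict(zip(df[0], df[1]))
name_to_rid = dict(zip(df[1], df[0]))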
Next, similarities are computed between items; setting 'user_based': False in sim_options tells KNNBaseline to compute item-item rather than user-user similarities.
In [4]: # First, train the algorithm
   ...: # to compute the similarities between items
   ...: data = Dataset.load_builtin('ml-100k')
   ...: trainset = data.build_full_trainset()
   ...: sim_options = {'name': 'pearson_baseline', 'user_based': False}
   ...: algo = KNNBaseline(sim_options=sim_options)
   ...: algo.fit(trainset)

Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Out[4]: <surprise.prediction_algorithms.knns.KNNBaseline>
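With the model fit, it can also score individual user/item pairs via predict. The following sketch assumes raw ids taken from the ml-100k data itself (user '196' and item '302' appear in it); note that raw ids in this data set are strings:

# Score a single (user, item) pair with the fitted model.
# uid and iid are raw ids, which are strings in ml-100k.
pred = algo.predict(uid='196', iid='302')
print(pred.est)  # estimated rating on the data set's 1-5 scale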
Finally, 10 recommendations are generated, much like another example in this chapter.
In [5]: # Read the mappings raw id <-> movie name
   ...: rid_to_name, name_to_rid = read_item_names()
   ...:
   ...: # Retrieve inner id of the movie Toy Story
   ...: toy_story_raw_id = name_to_rid['Toy Story (1995)']
   ...: toy_story_inner_id = algo.trainset.to_inner_iid(
   ...:     toy_story_raw_id)
   ...:
   ...: # Retrieve inner ids of the nearest neighbors of Toy Story.
   ...: toy_story_neighbors = algo.get_neighbors(
   ...:     toy_story_inner_id, k=10)
   ...:
   ...: # Convert inner ids of the neighbors into names.
   ...: toy_story_neighbors = (algo.trainset.to_raw_iid(inner_id)
   ...:                        for inner_id in toy_story_neighbors)
   ...: toy_story_neighbors = (rid_to_name[rid]
   ...:                        for rid in toy_story_neighbors)
   ...:
   ...: print('The 10 nearest neighbors of Toy Story are:')
   ...: for movie in toy_story_neighbors:
   ...:     print(movie)

The 10 nearest neighbors of Toy Story are:
Beauty and the Beast (1991)
Raiders of the Lost Ark (1981)
That Thing You Do! (1996)
Lion King, The (1994)
Craft, The (1996)
Liar Liar (1997)
Aladdin (1992)
Cool Hand Luke (1967)
Winnie the Pooh and the Blustery Day (1968)
Indiana Jones and the Last Crusade (1989)
In exploring this example, consider the real-world issues of implementing it in production. Here is a pseudocode API function that someone in your company may be asked to produce.
def recommendations(movies, rec_count):
    """Return recommendations"""

movies = ["Beauty and the Beast (1991)", "Cool Hand Luke (1967)",.. ]

print(recommendations(movies=movies, rec_count=10))
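One way to flesh out that stub, as a sketch only: reuse the fitted KNNBaseline model (algo) and the rid_to_name/name_to_rid mappings from earlier, pool the nearest neighbors of each seed movie, and return the titles that co-occur most often. The Counter-based aggregation here is one choice among many:

from collections import Counter

def recommendations(movies, rec_count):
    """Return rec_count recommendations for a list of movie names.

    A sketch that assumes algo, rid_to_name, and name_to_rid from the
    session above are in scope, and that every name is in the data set.
    """
    counts = Counter()
    for name in movies:
        # Map the name to the model's inner id, then pull its neighbors.
        inner_id = algo.trainset.to_inner_iid(name_to_rid[name])
        for neighbor_id in algo.get_neighbors(inner_id, k=rec_count):
            neighbor = rid_to_name[algo.trainset.to_raw_iid(neighbor_id)]
            if neighbor not in movies:  # don't recommend the seeds back
                counts[neighbor] += 1
    return [name for name, _ in counts.most_common(rec_count)]

Pooling neighbors by raw frequency is crude; a production version might weight neighbors by similarity score and handle titles missing from the training set.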
Some questions to ask when implementing this are: What tradeoffs are you making in picking the top recommendations from a group of selections versus from just one movie? How well will this algorithm perform on a very large data set? There are no right answers, but these are issues you should think about as you deploy recommendation engines to production.