The basic structure for offline evaluation is based on the setup common in machine learning. It starts with a data set, typically consisting of a collection of user ratings or histories and possibly containing
additional information about users and/or items. The users in this data set are then split into two groups: the training set and the test set. A recommender model is built against the training set. The
users in the test set are then considered in turn, and have their ratings or purchases split into two parts, the query set and the target set. The recommender is given the query set as a user history and asked to recommend items or to predict ratings for the items in the target set; it is then evaluated on how well its recommendations or predictions
match those held out in the target set. This whole process is frequently
repeated as in k-fold cross-validation by splitting the users into k equally sized
sets and using each set in turn as the test set with the union of all other
sets as the training set. The results from each run can then be aggregated
to assess the recommender’s overall performance, mitigating the
effects of test set variation [53].
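To make the protocol concrete, the following Python sketch shows one way the user-based k-fold procedure described above could be implemented. It is only an illustrative outline, not a reference implementation: the function names (split_user_folds, split_query_target, cross_validate), the nested-dictionary rating structure, and the train/predict/metric callback interface are hypothetical conveniences, and real evaluation frameworks differ in their details.

```python
import random

def split_user_folds(user_ids, k, seed=0):
    """Partition users into k disjoint, roughly equal-sized folds."""
    users = list(user_ids)
    random.Random(seed).shuffle(users)
    return [users[i::k] for i in range(k)]

def split_query_target(user_ratings, query_fraction=0.8, seed=0):
    """Split one test user's ratings into a query part (shown to the
    recommender) and a held-out target part (used for evaluation)."""
    items = list(user_ratings.items())
    random.Random(seed).shuffle(items)
    cut = int(len(items) * query_fraction)
    return dict(items[:cut]), dict(items[cut:])

def cross_validate(ratings, k, train_fn, predict_fn, metric_fn):
    """User-based k-fold offline evaluation.

    ratings: {user_id: {item_id: rating}}
    train_fn(train_ratings) -> model
    predict_fn(model, query, target_items) -> {item_id: predicted rating}
    metric_fn(predictions, target) -> float (per-user score)
    """
    folds = split_user_folds(ratings.keys(), k)
    fold_results = []
    for test_users in folds:
        # Train on all users outside the current test fold.
        held_out = set(test_users)
        train = {u: r for u, r in ratings.items() if u not in held_out}
        model = train_fn(train)
        user_scores = []
        for u in test_users:
            query, target = split_query_target(ratings[u])
            if not target:
                continue  # nothing held out for this user
            preds = predict_fn(model, query, list(target.keys()))
            user_scores.append(metric_fn(preds, target))
        fold_results.append(sum(user_scores) / max(len(user_scores), 1))
    # Aggregate across folds (e.g. by averaging) to assess overall performance.
    return fold_results

if __name__ == "__main__":
    # Toy usage: a trivial "always predict 3.0" recommender scored with MAE.
    toy_ratings = {"u1": {"i1": 4, "i2": 2, "i3": 5}, "u2": {"i1": 1, "i4": 3}}
    scores = cross_validate(
        toy_ratings,
        k=2,
        train_fn=lambda train: None,
        predict_fn=lambda model, query, items: {i: 3.0 for i in items},
        metric_fn=lambda preds, target: sum(
            abs(preds[i] - r) for i, r in target.items()
        ) / len(target),
    )
    print(scores)
```

Splitting by user rather than by individual rating, as in this sketch, keeps a test user's entire profile out of the training data, so the query/target split then plays out the "given part of a history, predict the rest" task that the offline protocol is meant to simulate.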