Build a Recommendation Engine with ML.NET and F#

In this blog we are going to build a shopping/product recommendation service using F# and ML.NET on .NET Core.

I have watched Machine Learning (ML) tools and trends evolve in recent years, and I am impressed by the implementation work Microsoft is doing, especially on ML.NET. This tool helps developers integrate ML into any application easily, without needing a PhD in ML.

Machine learning has long been a difficult and foreign topic for developers working in the .NET space. Historically, if you had to build a machine learning model, you needed to use programming languages like Python or R, and still today they are the most popular tools for creating and training ML models. Although these are great programming languages, entering the ML area with them means taking on the double challenge of learning both ML concepts and new language tools. This is the benefit that ML.NET brings: you can adopt a tool that you are already familiar with, like .NET and .NET Core, skipping the learning curve of Python and R. ML.NET allows developers to easily integrate ML into their code without leaving the .NET space, using tools that we are already familiar with.

What is ML.NET?

ML.NET is a cross-platform, open-source framework that aims to bring the capabilities of machine learning to the .NET ecosystem, covering a variety of scenarios such as sentiment analysis, price prediction, recommendation, image classification, and so forth.

ML.NET is much more than just a machine learning library; it is an extensible platform that is able to leverage and integrate other ML infrastructure libraries and runtimes, such as TensorFlow and ONNX, in a way that you don’t have to know the TensorFlow API. ML.NET abstracts the implementation details of these libraries away from you.

In addition, there is the .NET part, which means that ML.NET allows you to train, build, and ship custom machine learning models reusing your existing experience and skillset in .NET, which is great because we don’t have to leave the .NET ecosystem.

Building an ML model involves a few steps

This diagram represents the code structure and the process of development in ML.NET:

First, we need to define the problem and isolate a good set of quality data, which has to be cleaned and normalized. Then we need to load the data, and for this step ML.NET offers different options: a text or CSV file, a database, and of course a stream to reduce memory consumption.

Following that, you need to extract the features by transforming the input data; this is because machine learning algorithms only understand featurized data (numbers).

Next, we have to train the model. For this step we need to add a learner/trainer, specifying which column is the feature and which column is the label (or goal) to predict.

Once the estimator has been defined, we can train the model by providing the already loaded training data. This returns a model which you can use for predictions. Now we can evaluate the trained model to estimate its accuracy. If we are happy with the model, we can consume it to make predictions on new inputs; otherwise, we can re-train it to get better accuracy.

Let’s build a product recommendation

You can follow along with the steps, and you can find/download the complete source code here.

The goal of a product recommendation engine is to predict the most popular product combinations given a purchased product. To build this recommendation engine we are going to use a data-set from Amazon that contains a large number of product purchases based on the “Customers Who Bought This Item Also Bought” feature, with over a million product combinations.

The file format is a simple TSV (Tab Separated Values) with two columns: one for the product-id purchased, and a second for the product-id purchased in combination (bought by customers who also bought the first product).

Here is the link to the file: SNAP Amazon Co-Purchasing Network. The source code contains this file compressed.

This is what the data-set file looks like:
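(The rows below are an illustrative sample rather than actual values from the file; each line pairs a purchased product-id with a co-purchased product-id, separated by a tab.)

0	1
0	2
0	3
1	0
1	2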

Let’s start by creating a simple Console App. I personally use Rider as an IDE, but you can use your favorite tool; more important is to import the necessary NuGet packages to exploit the ML.NET library. For the recommendation engine we need these packages:

  • Microsoft.ML
  • Microsoft.ML.Recommender
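
If you prefer the command line, they can be added to the project with the dotnet CLI:

dotnet add package Microsoft.ML
dotnet add package Microsoft.ML.Recommender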

I break the implementation of an ML.NET project into 9 steps:

  1. Define the data-model
  2. Create the MLContext
  3. Load the data
  4. Split the data
  5. Convert the data / create the pipeline
  6. Train the model
  7. Use the test data against the model
  8. Check accuracy and improve (cross-validation)
  9. Use the model

The first step consists of defining the data-model to reflect in code the data we are dealing with. In this case we have a TSV file with the product purchase information. We can use two Record-Types to define, respectively, our features and the predicted attribute. These objects have to be mutable for compatibility reasons; in F# we can achieve this by decorating the Record-Types with the CLIMutable attribute.

open Microsoft.ML.Data

[<CLIMutable>]
type Product = {
    // column 0 of the TSV file: the purchased product
    [<LoadColumn(0)>] ProductID : float32
    // column 1 of the TSV file: the product bought in combination (the label to predict)
    [<LoadColumn(1)>] [<ColumnName("Label")>] CombinedProductID : float32
}

[<CLIMutable>]
type ProductPrediction = {
    // the score the model assigns to a product combination
    [<ColumnName("Score")>] Score : float32
}

These Record-Types define how the schema of data will look (column names & column types).

The Product Record-Type represents a distinct product purchase combination. Each field of the Record-Type is decorated with the LoadColumn attribute, which notifies the data loader which column to import data from based on the index.

The second Record-Type, ProductPrediction, denotes the product combination prediction.

In the second step we need to initialize the MLContext, which is the starting point for all ML.NET operations; it is used for all aspects of creating and consuming an ML model.

At its core, MLContext exposes a set of catalogs, like Data, Model, and Transforms, that give access to the APIs that let us load data, transform and prepare the data, apply learning algorithms, make predictions, and so forth.

let context = MLContext(seed = Nullable 1)

Next, in the third step, we load the data into memory and then generate two sub-sets: one for training the model and one for testing the trained model.

let data = context.Data.LoadFromTextFile<Product>(dataPath, hasHeader = true, separatorChar = '\t')

// Step 4 - Split the data
let dataPartitions = context.Data.TrainTestSplit(data, testFraction = 0.2)
let trainSet = dataPartitions.TrainSet
let testSet = dataPartitions.TestSet

In this code, we use the method LoadFromTextFile to load the TSV data into memory.  The attribute LoadColumn instructs the method on how to store the loaded data into the Product Record-Type.

When we load data for training, we get back an object that implements the IDataView interface, which is a tabular representation of the data.  

The IDataView is designed to efficiently handle high-dimensional and large data sets, and it is the component that holds the data during data transformations and model training. In addition, it is very efficient at loading data into memory; in fact, ML.NET was designed to handle data sets of theoretically infinite size using a forward cursor, much like a cursor in SQL, where you might have a table with millions of records but touch only a few at a time.
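To make the forward cursor concrete, here is a minimal sketch that lazily streams the first few rows of the loaded IDataView through the Data catalog’s CreateEnumerable method, instead of materializing the whole set:

// lazily enumerate the IDataView; reuseRowObject = false yields a fresh
// Product record for each row instead of mutating a shared one
context.Data.CreateEnumerable<Product>(data, reuseRowObject = false)
|> Seq.truncate 5
|> Seq.iter (fun p -> printfn "%f -> %f" p.ProductID p.CombinedProductID)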

When loading the data, it is common practice to divide it into a training set and a testing set.

To partition the data, we split the data-set into a training and a testing sub-set using the TrainTestSplit method. In this case we use 80% of the data for training and 20% for testing. Subsequently, there will be multiple iterations; for each iteration the data is shuffled, so that the training set and the test set keep changing. During each iteration we adjust the model to be more accurate. The goal of a machine learning model is to accurately make predictions on data it has not seen before; therefore, making predictions using inputs that are the same as those it was trained on may provide misleading accuracy metrics.
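Cross-validation (step 8 in the list above) automates exactly this shuffle-and-re-split cycle. As a sketch, assuming the pipeline we define in the next section, the Recommendation catalog exposes a CrossValidate method that trains and evaluates the model on several folds:

// 5-fold cross-validation: the data is re-split for every fold, giving a
// more robust accuracy estimate than a single train/test split.
// F# needs an explicit cast here because IEstimator<'T> covariance is not
// applied implicitly; the runtime cast below is safe.
let estimator = box pipeline :?> IEstimator<ITransformer>
let cvResults = context.Recommendation().CrossValidate(data, estimator, numberOfFolds = 5, labelColumnName = "Label")
cvResults |> Seq.iter (fun fold -> printfn "Fold RMSE: %f" fold.Metrics.RootMeanSquaredError)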

After we have loaded the data, it is time to build the machine learning model. In general, a recommendation engine that generates predictions is based on a matrix factorization algorithm. In our case we have “only” two IDs as data fields, so we are limited to One-Class Matrix Factorization.

Matrix factorization (recommender systems)

The idea behind matrix factorization is to represent users and items in a lower-dimensional latent space. Since the initial work by Funk in 2006, a multitude of matrix factorization approaches have been proposed for recommender systems.

Matrix factorization is a class of collaborative filtering algorithms used in recommender systems. Matrix factorization algorithms work by decomposing the user-item interaction matrix into the product of two lower dimensionality rectangular matrices. This family of methods became widely known during the Netflix prize challenge due to its effectiveness as reported by Simon Funk in his 2006 blog post, where he shared his findings with the research community. The prediction results can be improved by assigning different regularization weights to the latent factors based on items’ popularity and users’ activeness.
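To give a feel for the core idea, here is a toy F# sketch (the numbers are made up, not values from our model): each product gets a small latent vector, and the predicted affinity of two products is simply the dot product of their vectors.

// toy illustration of matrix factorization: a score is approximated by
// the dot product of two latent factor vectors learned by the trainer
let dot (u: float[]) (v: float[]) =
    Array.map2 (*) u v |> Array.sum

// hypothetical 3-dimensional latent factors
let productFactors  = [| 0.8; 0.1; 0.3 |]
let combinedFactors = [| 0.7; 0.2; 0.5 |]

let score = dot productFactors combinedFactors   // 0.56 + 0.02 + 0.15 = 0.73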

Thankfully, the ML.NET library supports these algorithms; it just requires a bit of simple set-up.

let options =
    MatrixFactorizationTrainer.Options(
        MatrixColumnIndexColumnName = "ProductIDEncoded",
        MatrixRowIndexColumnName = "CombinedProductIDEncoded",
        LabelColumnName = "Label",
        LossFunction = MatrixFactorizationTrainer.LossFunctionType.SquareLossOneClass,
        Alpha = 0.01,
        Lambda = 0.025)
let pipeline =
    Common.printCyan "Create pipeline..."
    EstimatorChain()
     // map ProductID and CombinedProductID to keys
     .Append(context.Transforms.Conversion.MapValueToKey(inputColumnName = "ProductID", outputColumnName = "ProductIDEncoded"))
     .Append(context.Transforms.Conversion.MapValueToKey(inputColumnName = "Label", outputColumnName = "CombinedProductIDEncoded"))
     // find recommendations using matrix factorization
     .Append(context.Recommendation().Trainers.MatrixFactorization(options))

ML.NET uses the concept of pipelines; the data pipeline and the training pipeline ensure that all the points are connected, from generating the model to running the prediction. You can think of an ML.NET model pipeline as a chain of estimators. For example, we can load the data, then append the “transform the data” step of the pipeline, then append the “train” step, and so forth, until we can run and evaluate the model.

A pipeline in ML.NET is essentially an ordered list of operations that prepare the data to feed into the model and apply a machine learning algorithm to it before the training is done.

In the previous code, to set up the matrix factorization we provide the LossFunction; in this case we use the SquareLossOneClass function because it fits the case of building a recommendation model from positive examples only (our data contains only product combinations that were actually purchased).

The Label column is our goal; it is what we want to predict, which in this case is mapped to the CombinedProductID field, annotated with the ColumnName("Label") attribute in the Product Record-Type.

In the options, the Alpha and Lambda parameters tune the factorization: Lambda is the regularization parameter, while Alpha weighs the loss of unobserved product combinations; together they influence both training speed and the accuracy of the factorization algorithm.

Here is a more in-depth look at the ML.NET pipeline for the recommendation engine. It has the following sections:

  • MapValueToKey: this step converts the IDs to numbers that the model can understand. The method MapValueToKey reads the ProductID column and builds a dictionary of unique ID values that is used to generate an output column called ProductIDEncoded containing an encoding for each ID. The same operation happens for the Label column (mapped to the CombinedProductID column)
  • MatrixFactorization is part of the “Trainers” module, and it executes the “Matrix Factorization” on the encoded outputs of the previous step MapValueToKey to compute the predicted product combinations for every product.

With the ML.NET pipeline completely set up, we are finally able to train the model on the training partition by calling the Fit method:

let model = trainSet |> pipeline.Fit

Now that we have a trained model, we can use the test data to predict product combinations, which will help us compute the accuracy metrics of the model.

let metrics = testSet |> model.Transform |> context.Regression.Evaluate

// show the metrics
Common.printGreen (sprintf "Model metrics:")
Common.printGreen (sprintf "  RMSE:%f" metrics.RootMeanSquaredError)
Common.printGreen (sprintf "  MSE: %f" metrics.MeanSquaredError)

This code runs the Transform method to make predictions for every product combination in the test set. Then the Evaluate method compares these predictions to the actual product combinations to calculate the model metrics, such as the RMSE value, which rates the model's accuracy. RMSE represents the length of a vector in n-dimensional space, made up of the error in each individual prediction.
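To make the metric concrete, here is a minimal sketch of how an RMSE value could be computed by hand from the individual prediction errors (just the definition expressed in F#, not part of the ML.NET API):

// RMSE = square root of the mean of the squared errors,
// i.e. the length of the error vector scaled by 1/sqrt(n)
let rmse (errors: float seq) =
    errors |> Seq.averageBy (fun e -> e * e) |> sqrt

// example: rmse [ 1.0; -1.0; 2.0 ] evaluates to sqrt 2.0 ≈ 1.414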

The last step, if we are happy with the accuracy achieved, is to use the model to make predictions.

To initialize a prediction engine we use the CreatePredictionEngine method. The generic type parameters for this object are the input type Product and the output type ProductPrediction that carries the prediction. Once the prediction engine is initialized, we can run the Predict method to obtain a prediction.

As a starting test, we can run a few predictions on a specific product with ID 21 to verify whether it is frequently purchased together with product ID 77.

Here is the code to make the prediction:

let engine : PredictionEngine<Product, ProductPrediction> = context.Model.CreatePredictionEngine model

let runPrediction () =
    let productInfo = {
        ProductID = 21.f
        CombinedProductID = 77.f
    }

    let prediction = productInfo |> engine.Predict

    // show the prediction
    Common.printRed (sprintf "Score for product %f combined with %f is %f" productInfo.ProductID productInfo.CombinedProductID prediction.Score)

which outputs the score for the given product combination.

At this point we have our recommendation engine working. 

A useful case is to apply the recommendation engine to find the top 3 products to recommend, given a purchased product. To achieve this, we run the prediction against each unique product to estimate how well it fits in combination with the purchased one; then we sort by the highest prediction score and take the top 3.

Here is the code demonstrating how to obtain the 3 recommended products with the highest scores for the purchased product with id 61.

let runPredictionBestMatches productId topN =
    // score the given product against every candidate product id in the data-set
    let bestRecommendedProducts productId topN =
        seq {
            for index = 1 to 262110 do
                let product = {
                    ProductID = float32 productId
                    CombinedProductID = float32 index
                }
                let prediction = engine.Predict product
                {| Score = prediction.Score; ProductID = index |}
        }
        |> Seq.sortByDescending (fun p -> p.Score)
        |> Seq.take topN

    // print the top-N matches, ranked from the highest score
    for (index, product) in bestRecommendedProducts productId topN |> Seq.indexed do
        Common.printRed (sprintf "%d) Best match for product: %d is product: %d\t with score: %f" (index + 1) productId product.ProductID product.Score)


runPredictionBestMatches 61 3

This is the output of running the previous code:

The best 3 matches by score for product id 61 are 8, 99 and 481, with scores between 0.98 and 0.83, which are pretty good values.

As we “wrap” up this blog, I hope that with this new skill you will be able to find the most compatible recommendations for gifts to “wrap” and put under your Christmas tree!
