Machine learning models are ubiquitous in adtech today. One of the most common applications is building a predictive model to estimate the probabilities of rare events like clicks and installs. Model predictions are used to estimate the worth of an impression in a real-time bidding (RTB) scenario. Since the correctness of the model directly impacts the economic performance of media buying, it is important to validate the model's effectiveness. There are numerous metrics in the literature which try to measure what constitutes a good model. This article discusses some of these metrics and their applicability to the adtech scenario.
A model employed in adtech is usually a probability regression model, such as logistic regression, which computes the probability of a rare event like a click given a set of input variables. Predicted probabilities are very small, and even an impression with a realistically high CTR cannot be confidently classified as a click; classification metrics like precision, recall, and F1 score are therefore not suitable for measuring the goodness of the model.
The metrics in widespread use in industry are log-loss and the area under the receiver operating characteristic curve (AUC).
Since models like logistic regression use log-loss as the optimization objective, it is a natural metric for measuring generalization. Note that log-loss is simply the negative log-likelihood of the dataset on which we run the model.
The smaller the log-loss, the better the quality of the predictions. Although an ideal model would have zero log-loss, this never happens in a realistic scenario. It is hard to suggest a particular log-loss value below which a model could be deemed good; this is one of the downsides of log-loss. It also does not capture the economic performance of the model.
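To make the definition concrete, here is a minimal sketch of log-loss as the negative mean log-likelihood of binary labels under the predicted probabilities (the sample labels and predictions are illustrative, not from any real dataset):

```python
import math

def log_loss(y_true, y_pred, eps=1e-15):
    """Negative mean log-likelihood of binary labels under predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)

# A typical adtech-style sample: mostly non-clicks with small predicted probabilities.
labels = [0, 0, 0, 1, 0]
preds  = [0.01, 0.02, 0.03, 0.20, 0.01]
print(round(log_loss(labels, preds), 4))
```

Note that a confident wrong prediction (a high probability on a non-click, or a tiny probability on a click) dominates the sum, which is why the clipping above matters numerically.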
AUC is another popular metric in wide use. It is the area under the receiver operating characteristic (ROC) curve, a plot of the true positive rate versus the false positive rate as the classification threshold of the model is varied. The true positive rate is the ratio of true positives to all positive samples, and the false positive rate is the ratio of false positives to all negative samples. For random predictions, the true positive rate would equal the false positive rate at every threshold and the AUC would be 0.5. A perfect model has an AUC of 1, and a realistic model falls somewhere in between. A good AUC is a strong indicator of the discrimination power of the model, i.e., its ability to separate good impressions from bad ones. However, AUC says little about the quality of the actual probability values: we can scale all predictions by a constant multiplier and the AUC will stay exactly the same. This is one of its disadvantages.
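The scale-invariance can be demonstrated with a small sketch that computes AUC via its rank interpretation (the probability that a random positive sample is scored above a random negative one, which equals the area under the ROC curve); the data is made up for illustration:

```python
def auc(y_true, y_pred):
    """AUC as the probability that a random positive outranks a random negative
    (equivalent to the area under the ROC curve); ties count as half a win."""
    pos = [p for y, p in zip(y_true, y_pred) if y == 1]
    neg = [p for y, p in zip(y_true, y_pred) if y == 0]
    wins = sum(1.0 if pp > pn else 0.5 if pp == pn else 0.0
               for pp in pos for pn in neg)
    return wins / (len(pos) * len(neg))

labels = [0, 0, 1, 0, 1]
preds  = [0.01, 0.03, 0.08, 0.02, 0.05]
scaled = [0.5 * p for p in preds]          # scale every prediction by a constant
print(auc(labels, preds) == auc(labels, scaled))  # True: AUC ignores the scale
```

Any strictly monotonic transformation of the predictions, not just constant scaling, leaves the ranking and hence the AUC unchanged.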
The predicted probabilities are used to compute a bid value to be submitted to an auction mechanism, so it is important that the predicted rate of events turns out to be very close to the observed rate; any deviation implies monetary losses. A model for which the observed rate of events closely matches the predicted rate across different regions of the prediction space is referred to as well calibrated. One statistical test of calibration is the Hosmer–Lemeshow test; there are also more popular metrics such as the Brier score. One could further look at the calibration curve and the deviances (ratio of observed to expected event rate) across different buckets of the prediction space. Deviances close to 1 across buckets are probably as good as a model can get: they imply that the observed event rates closely follow the predicted ones, which in turn implies that we are achieving the business KPI.
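The bucketed-deviance check can be sketched as follows: sort impressions by predicted probability, split them into equal-size buckets, and compare the observed event rate to the mean predicted rate in each bucket (the bucketing scheme here is one simple choice among many, assumed for illustration):

```python
def bucket_deviances(y_true, y_pred, n_buckets=3):
    """Deviance (observed rate / predicted rate) per prediction bucket.
    Values close to 1 across buckets indicate a well-calibrated model."""
    pairs = sorted(zip(y_pred, y_true))          # order impressions by prediction
    size = len(pairs) // n_buckets
    deviances = []
    for i in range(n_buckets):
        # last bucket absorbs the remainder
        chunk = pairs[i * size:(i + 1) * size] if i < n_buckets - 1 else pairs[i * size:]
        predicted = sum(p for p, _ in chunk) / len(chunk)
        observed = sum(y for _, y in chunk) / len(chunk)
        deviances.append(observed / predicted)
    return deviances
```

In practice one would use many more buckets and far more impressions per bucket, since observed click rates in small buckets are extremely noisy.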
There is some recent literature proposing utility metrics which try to directly capture the economic performance of the model, along with work on using these utility metrics as the optimization objective while training. All the previously discussed metrics focus on the quality of the predictions but ignore the bidding system in which they are used; utility metrics instead focus on the profit that a prediction function generates, which obviously makes business sense. The naive utility function, which is simply the profit the bidder would generate with a given prediction function, may fail to correctly penalize over-predictions. To overcome this problem, recent work proposes an expected utility constructed by defining a distribution on the second-bid prices. Some of these utility functions are not convex and are not easy to optimize, but they can still be used as performance metrics. These metrics hold a lot of promise and are yet to be fully explored in the industry.
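As an illustration of the naive utility function mentioned above, here is a hedged sketch under simplifying assumptions: a second-price auction, a fixed value per click, a bid equal to value times predicted probability, and a win whenever the bid exceeds the second price (all of these modeling choices, and the variable names, are assumptions for illustration, not a definitive formulation from the literature):

```python
def naive_utility(y_true, y_pred, second_prices, value_per_click):
    """Naive utility: total profit in a second-price auction when bidding
    value_per_click * predicted probability. We win an impression when our
    bid exceeds the second price and pay that price; a click realises the value."""
    profit = 0.0
    for y, p, sp in zip(y_true, y_pred, second_prices):
        bid = value_per_click * p
        if bid > sp:                       # auction won
            profit += value_per_click * y - sp
    return profit
```

This also hints at why the naive form under-penalizes over-prediction: inflating every probability wins more auctions, and the extra cost shows up only through the second prices actually paid, not through the inflated predictions themselves.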
Working with a number of metrics is also helpful in identifying new features. More often than not, a new feature improves one of the metrics but not all of them; relying on a single metric could lead to rejecting a feature which improves the others.
With the emergence of utility metrics, it is perhaps possible that, in the future, adtech prediction models will be built and measured in terms of a single metric which completely captures the economic performance of the model. That would take machine learning in adtech to the next level.