AppLift specializes in delivering data-driven performance-optimized mobile app install campaigns. Some of the problems that we come across when solving these problems, include click-through-rate prediction (rate at which to expect clicks given that we have shown an impression), conversion-rate-prediction (rate at which to expect conversions or app-installs given that we have a click on the impression shown) and fraud-traffic classification.
A simple way to solve this problem would be to use a linear model with a logistic link function. All of the parameters, which we have from the request-advertisement context, are categorical variables. Some of the examples being, app, publisher, creative, campaign, country, carrier type, gender and others. Except for gender, and maybe creative type, all the others are of very high cardinality. Due to the long tail of apps and publishers we essentially have an unbounded number of features to train on. Apart from the cardinality, we would have billions of examples to learn from, during every training session.
Naively setting it up as a logistic-regression problem is prone to multiple problems. The resulting model would be very large. Although the model would fit the data very well, it would have very poor predictive accuracy on the unseen test data. Apart from the fact that model performance suffers, training the model would also be very hard. If the cardinality of the features lead to the problem of overfitting and poor generalization, the scale of the problem means that we cannot store all the examples in memory. This means that the only viable option is to train the model in an online fashion.
Using R to solve to train the model used to take an absurdly large amount of time. We are rarely used to seeing it finish with our dataset. This is why we resorted to Mahout’s Stochastic gradient descent-based online learning procedure. It had many parameters to be tuned and did not address the problem of large cardinality very well. This meant that we would have a model that is not bounded in size as it grows with the long tail of apps.
This led us to feature hashing. Feature hashing essentially means that all the unbounded cardinality categorical features are hashed to a bounded domain and hopefully the long tail of apps which might collide with others will not seriously degrade the performance of the model. Empirical verification has shown that feature collision due to hashing does not lead to model performance degradation. Below is the data we obtained in an empirical study we did with respect to the size of the output range of the hashing function used.
Bit-length of the hashes used
Area Under Curve
Number of non-zero coefficients
From the above data we can clearly see that the model performance did not improve much when the bit length of the hashes used was increased beyond 22.
There were very few tools which implemented all of this in an efficient fashion. This led us to Vowpal Wabbit. Vowpal Wabbit had a very efficient implement of learning linear models in an online fashion. More importantly, it had very good defaults to most of the parameters that one needs to tune to get a good performance out of an stochastic gradient descent based learning tool. Vowpal Wabbit made it very easy for us to try higher order interaction features. It addressed the problem of combinatorial explosion of cardinality due to having two features of high cardinality interacting with each other by means of feature hashing and low-rank approximation. This has lead us to easily try a variety of feature combinations in quick iterations. Vowpal Wabbit also supports mini batching to strike a good compromise between fast learning and good convergence.