Diving into data | A blog on machine learning, data mining and visualization
I'm Ando Saabas. I'm currently employed by Microsoft and work on data in the Skype team. This is a blog dedicated to general topics in applied machine learning, data mining and visualizations.
3y ago
In practical machine learning and data science tasks, an ML model is often used to quantify a global, semantically meaningful relationship between two or more values. For example, a hotel chain might want to use ML to optimize its pricing strategy and use a model to estimate the likelihood of a room being booked at a given price and day of the week. For a relationship like this the assumption is that, all other things being equal, a cheaper price is preferred by a user, so demand is higher at a lower price. However, what might easily happen is that upon building the model, the data scientist ..read more
3y ago
In two of my previous blog posts, I explained how the black box of a random forest can be opened up by tracking decision paths along the trees and computing feature contributions. This way, any prediction can be decomposed into contributions from features, such that \(prediction = bias + feature_1 contribution + … + feature_n contribution\).
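As a rough illustration of this decomposition, here is a minimal sketch for a single regression tree (a forest would average the same quantities over its trees). The tree, its feature names and its node values are all made up for the example; each node is assumed to store the mean target of the training samples that reach it, and every change in that value along the decision path is credited to the feature used for the split.

```python
# Hypothetical hand-built regression tree (not taken from any library).
# "value" is the assumed mean target of the training samples at that node.
TREE = {
    "value": 10.0, "feature": "size", "threshold": 50.0,
    "left": {
        "value": 6.0, "feature": "age", "threshold": 30.0,
        "left": {"value": 8.0},
        "right": {"value": 4.0},
    },
    "right": {"value": 14.0},
}


def decompose(tree, x):
    """Return (prediction, bias, contributions) for a single instance x."""
    bias = tree["value"]  # value at the root, before any split is applied
    contributions = {}
    node = tree
    while "feature" in node:  # walk down until a leaf is reached
        feature = node["feature"]
        child = node["left"] if x[feature] <= node["threshold"] else node["right"]
        # Credit the change in node value to the feature that was split on.
        delta = child["value"] - node["value"]
        contributions[feature] = contributions.get(feature, 0.0) + delta
        node = child
    return node["value"], bias, contributions


prediction, bias, contributions = decompose(TREE, {"size": 40.0, "age": 35.0})
print(prediction, bias, contributions)
```

By construction, the bias plus the sum of the contributions always equals the value of the leaf the instance ends up in.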
However, this linear breakdown is inherently imperfect, since a linear combination of features cannot capture interactions between them. A classic example of a relation where a linear combination of inputs cannot capture the output is exclusive or (XOR), define ..read more
3y ago
The need for anomaly and change detection will pop up in almost any data driven system or quality monitoring application. Typically, there is a set of metrics that need to be monitored and an alert raised if the values deviate from the expected. Depending on the task at hand, this can happen at individual datapoint level (anomaly detection) or population level where we want to know if the underlying distribution changes or not (change detection).
The latter is most commonly tackled in the most straightforward way: calculating some point estimates, typically the mean or median, and tracking these. This ..read more
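To make the two levels concrete, here is a minimal sketch using only the Python standard library. The function names, the z-score threshold of 3 and the sample numbers are assumptions chosen for illustration: the first function flags individual datapoints that sit far from a baseline, the second compares a point estimate (the mean) between a baseline window and a recent window.

```python
import statistics


def point_anomalies(values, baseline, z_threshold=3.0):
    """Datapoint level: flag values far from the baseline mean (z-score rule)."""
    mean = statistics.mean(baseline)
    std = statistics.stdev(baseline)
    return [v for v in values if abs(v - mean) / std > z_threshold]


def mean_shift(baseline, recent):
    """Population level: difference in means between two windows."""
    return statistics.mean(recent) - statistics.mean(baseline)


baseline = [10, 11, 9, 10, 12, 10, 9, 11]
spiky = [10, 25, 11]                        # one obvious outlier
shifted = [11, 12, 10, 11, 13, 11, 10, 12]  # every value moved up by 1

print(point_anomalies(spiky, baseline))     # the single outlier is flagged
print(point_anomalies(shifted, baseline))   # no individual point stands out
print(mean_shift(baseline, shifted))        # yet the population has moved
```

In this toy data the shifted window contains no single value extreme enough to trip the datapoint-level check, while the window-level comparison still picks up the move.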
3y ago
Like with any sport, the question of who are the best competitors of all time in Mixed Martial Arts (MMA) is something that is hotly debated among MMA fans. And unlike for tournament based sports such as tennis, or sports where results can be objectively measured such as track and field, it is a question that is much more difficult to answer in MMA. Firstly, fighters compete in different weight classes and organizations, often making direct comparison impossible. Secondly, even when competitors are in the same weight class in the same organization, comparison can be difficult simply because of ..read more
3y ago
Today, we had the first event of the Estonian Machine Learning Meetup series. I was quite baffled by the pretty massive turnout, with more than a hundred people attending, indicating that such an event series is long overdue. So props to Andre Karpištšenko for organizing. I had the honour of being a presenter in the inaugural event, talking about interpreting machine learning models in general, and random forest models in particular. Slides below.
3y ago
There is a huge number of machine learning methods, statistical tools and data mining techniques available for a given data-related task, from self organizing maps to Q-learning, from streaming graph algorithms to gradient boosted trees. Many of these methods, while powerful in specific domains and problem setups, are arcane and utilized or even understood by few.
On the other hand, there are some methods and concepts that are widely used and consistently useful (or downright irreplaceable) in a large variety of domains, problem settings and scales. Knowing and understanding them well will give ..read more
3y ago
In one of my previous posts I discussed how random forests can be turned into a “white box”, such that each prediction is decomposed into a sum of contributions from each feature i.e. \(prediction = bias + feature_1 contribution + … + feature_n contribution\).
I’ve had quite a few requests for code to do this. Unfortunately, most random forest libraries (including scikit-learn) don’t expose tree paths of predictions. The implementation for sklearn required a hacky patch for exposing the paths. Fortunately, since 0.17.dev, scikit-learn has two additions in the API that make this relatively st ..read more
3y ago
An aspect that is important but often overlooked in applied machine learning is intervals for predictions, be it confidence or prediction intervals. For classification tasks, beginning practitioners quite often conflate probability with confidence: a probability of 0.5 is taken to mean that we are uncertain about the prediction, while a prediction of 1.0 means we are absolutely certain of the outcome. But there are two concepts being mixed up here. A prediction of 0.5 could mean that we have learned very little about a given instance, due to observing no or only a few data points about it. Or it ..read more
3y ago
Hacker News is a popular social news website, mostly covering technology and startup topics. It relies on user submissions and moderation, where each submitted story can be upvoted and commented on by users, which in turn determines whether the story reaches the front page and how long it stays there.
One question that has intrigued me is whether, among the topics that are regularly on the front page, there are any that are consistently preferred over others in terms of upvotes or comments. For example, does science news get more or fewer upvotes than stories on gaming on average? Do bitcoin stories ..read more
3y ago
In my previous posts, I looked at univariate methods, linear models and regularization, and random forests for feature selection.
In this post, I’ll look at two other methods: stability selection and recursive feature elimination (RFE), which can both be considered wrapper methods. They both build on top of other (model based) selection methods such as regression or SVM, building models on different subsets of data and extracting the ranking from the aggregates.
As a wrap-up I’ll run all previously discussed methods, to highlight their pros, cons and gotchas with respect to each other.
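To give a feel for the recursive elimination idea, here is a minimal, dependency-free sketch. It is an illustration under stated assumptions rather than the implementation used in the post: the base model is a plain least-squares fit without an intercept, the synthetic data and all function names are made up, and the features share the same scale so that coefficient magnitudes are comparable. At every round the model is refit on the remaining features and the one with the smallest absolute coefficient is dropped.

```python
import random


def solve(a, b):
    """Solve the linear system a @ x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_linear(rows, y, features):
    """Least-squares coefficients (no intercept) for the chosen feature columns."""
    xtx = [[sum(r[i] * r[j] for r in rows) for j in features] for i in features]
    xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in features]
    return solve(xtx, xty)


def rfe_ranking(rows, y):
    """Return feature indices ordered from most to least important."""
    remaining = list(range(len(rows[0])))
    eliminated = []
    while len(remaining) > 1:
        coefs = fit_linear(rows, y, remaining)
        weakest = min(range(len(remaining)), key=lambda k: abs(coefs[k]))
        eliminated.append(remaining.pop(weakest))  # drop the weakest feature
    return remaining + eliminated[::-1]


# Synthetic data: feature 0 matters most, feature 2 a little, feature 1 not at all.
rng = random.Random(0)
rows = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(200)]
y = [3.0 * r[0] + 1.0 * r[2] + rng.gauss(0, 0.1) for r in rows]

ranking = rfe_ranking(rows, y)
print(ranking)
```

The ranking is read off the elimination order: the feature dropped first is considered the least useful, and the last one standing the most useful.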
Stability se ..read more