Another Datum by Yoel Zeldes - 3M ago

The Variational Autoencoder (VAE) is a paragon of neural networks that try to learn the shape of the input space. Once trained, the model can be used to generate new samples from the input space.

If we have labels for our input data, it’s also possible to condition the generation process on the label. In the MNIST case, it means we can specify which digit we want to generate an image for.

Let’s take it one step further... Could we condition the generation process on the digit without using labels at all? Could we achieve the same results using an unsupervised approach?

If we wanted to rely on labels, we could do something embarrassingly simple. We could train 10 independent VAE models, each using images of a single digit.

That would obviously work, but you're using the labels. That's cheating!

OK, let's not use them at all. Let's train our 10 models, and just, well, look at each image ourselves before passing it to the appropriate model.

Hey, you’re cheating again! While you don’t use the labels per se, you do look at the images in order to route them to the appropriate model.

Fine... If instead of doing the routing ourselves we let another model learn the routing, that wouldn’t be cheating at all, would it?

Right! :)

We can use an architecture of 11 modules as follows:

But how will the manager decide which expert to pass the image to? We could train it to predict the digit of the image, but again - we don’t want to use the labels!

Phew... I thought you were gonna cheat...

So how can we train the manager without using the labels? It reminds me of a different type of model - Mixture of Experts (MoE). Let me take a small detour to explain how MoE works. We'll need it, since it's going to be a key component of our solution.

## Mixture of Experts explained to non-experts

MoE is a supervised learning framework. You can find a great explanation by Geoffrey Hinton on Coursera and on YouTube. MoE relies on the possibility that the input might be segmented according to the $x \rightarrow y$ mapping. Have a look at this simple function:

The ground truth is defined by two different parabolas, switching from one to the other at the split point $x'$. If we were to specify by hand where $x'$ is, we could learn the mapping in each input segment independently, using two separate models.
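The hand-split idea can be sketched with a toy version of this example. The two parabolas, the split point, and the error thresholds below are all made up for illustration:

```python
import numpy as np

# Toy two-parabolas dataset: the ground truth switches at x' = 0.
split_point = 0.0
x = np.linspace(-2, 2, 400)
y = np.where(x <= split_point, (x + 1) ** 2, -(x - 1) ** 2 + 2)

# Knowing x' in advance, we fit a separate quadratic per segment.
left, right = x <= split_point, x > split_point
left_fit = np.polyfit(x[left], y[left], deg=2)
right_fit = np.polyfit(x[right], y[right], deg=2)

# Each per-segment fit is near-perfect, while one global quadratic isn't.
left_err = np.abs(np.polyval(left_fit, x[left]) - y[left]).max()
global_fit = np.polyfit(x, y, deg=2)
global_err = np.abs(np.polyval(global_fit, x) - y).max()
print(left_err < 1e-8, global_err > 0.5)  # prints: True True
```

The point of the sketch is only that splitting makes each sub-problem easy; the hard part, as the post goes on to explain, is learning the split automatically.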

In complex datasets we might not know the split points. One (bad) solution is to segment the input space by clustering the $x$'s using K-means. In the two parabolas example, we'll end up with $x''$ as the split point between two clusters. Thus, when we train a separate model on each cluster, it will learn the wrong mapping, since the clusters don't align with the true split point $x'$.

So how can we train a model that learns the split points, while at the same time learning the mapping that defines the split points? MoE does so using an architecture of multiple subnetworks - one manager and multiple experts:

The manager maps the input into a soft decision over the experts, which is used in two contexts:

• The output of the network is a weighted average of the experts' outputs, where the weights are the manager's output.
• The loss function is $\sum_i p_i(y - \bar{y_i})^2$, where $y$ is the label, $\bar{y_i}$ is the output of the i'th expert, and $p_i$ is the i'th entry of the manager's output.

When you differentiate the loss, you get these results (I encourage you to watch the video for more details):

1. The manager decides for each expert how much it contributes to the loss. In other words, the manager chooses which experts should tune their weights according to their error.
2. The manager tunes the probabilities it outputs in such a way that the experts that got it right will get higher probabilities than those that didn't.

This loss function encourages the experts to specialize in different kinds of inputs.

## The last piece of the puzzle... is $x$

Let's get back to our challenge! MoE is a framework for supervised learning. Surely we can change $y$ to be $x$ for the unsupervised case, right? MoE's power stems from the fact that each expert specializes in a different segment of the input space with a unique mapping $x \rightarrow y$. If we use the mapping $x \rightarrow x$, each expert will specialize in a different segment of the input space with unique patterns in the input itself.

We'll use VAEs as the experts.
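Before wiring up the full model, the MoE objective described above can be sketched in a few lines of numpy. The batch size, number of experts, and all values below are made up:

```python
import numpy as np

np.random.seed(0)

# Toy MoE forward pass: 3 experts, a batch of 4 scalar targets.
y = np.array([1.0, 2.0, 3.0, 4.0])       # labels
expert_outputs = np.random.randn(3, 4)    # y_bar_i for each expert i
manager_logits = np.random.randn(4, 3)

# The manager emits a soft decision over experts (softmax per example).
p = np.exp(manager_logits)
p /= p.sum(axis=1, keepdims=True)

# Per-example MoE loss: sum_i p_i * (y - y_bar_i)^2
squared_errors = (y[None, :] - expert_outputs) ** 2    # shape (3, 4)
loss_per_example = (p.T * squared_errors).sum(axis=0)  # shape (4,)
loss = loss_per_example.mean()
print(loss_per_example.shape, p.sum(axis=1))
```

Since each expert's error is weighted by its probability, gradients mostly flow to the experts the manager trusts for that example - which is exactly the specialization mechanism described above.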
Part of the VAE's loss is the reconstruction loss, where the VAE tries to reconstruct the original input image $x$:

A cool byproduct of this architecture is that the manager can classify the digit found in an image using its output vector!

One thing we need to be careful about when training this model is that the manager could easily degenerate into outputting a constant vector - regardless of the input in hand. This results in one VAE specialized in all digits, and nine VAEs specialized in nothing. One way to mitigate it, which is described in the MoE paper, is to add a balancing term to the loss. It encourages the outputs of the manager over a batch of inputs to be balanced: $\sum_\text{examples in batch} \vec{p} \approx Uniform$.

Enough talking - it's training time!

In [1]:
```python
import numpy as np
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data
import matplotlib.pyplot as plt

np.random.seed(42)
tf.set_random_seed(42)

%matplotlib inline
```

In [2]:
```python
mnist = input_data.read_data_sets('MNIST_data')

INPUT_SIZE = 28 * 28
NUM_DIGITS = 10
```

In [3]:
```python
params = {
    'manager_layers': [128],      # the manager will be implemented using a simple feed forward network
    'encoder_layers': [128],      # ... and so will be the encoder
    'decoder_layers': [128],      # ... and the decoder as well (CNN will be better, but let's keep it concise)
    'activation': tf.nn.sigmoid,  # the activation function used by all subnetworks
    'decoder_std': 0.5,           # the standard deviation of P(x|z) discussed in the first post of the series
    'z_dim': 10,                  # the dimension of the latent space
    'balancing_weight': 0.1,      # how much the balancing term will contribute to the loss
    'epochs': 100,
    'batch_size': 100,
    'learning_rate': 0.001
}
```

In [4]:
```python
class VAE(object):
    _ID = 0

    def __init__(self, params, images):
        self._id = VAE._ID
        VAE._ID += 1
        self._params = params
        encoder_mu, encoder_var = self.encode(images)
        eps = tf.random_normal(shape=[tf.shape(images)[0], self._params['z_dim']],
                               mean=0.0, stddev=1.0)
        z = encoder_mu + tf.sqrt(encoder_var) * eps
        self.decoded_images = self.decode(z)
        self.loss = self._calculate_loss(images, self.decoded_images, encoder_mu, encoder_var)

    def encode(self, images):
        with tf.variable_scope('encode_{}'.format(self._id), reuse=tf.AUTO_REUSE):
            x = images
            for layer in self._params['encoder_layers']:
                x = tf.layers.dense(x, layer, activation=self._params['activation'])
            mu = tf.layers.dense(x, self._params['z_dim'])
            var = 1e-5 + tf.exp(tf.layers.dense(x, self._params['z_dim']))
        return mu, var

    def decode(self, z):
        with tf.variable_scope('decode_{}'.format(self._id), reuse=tf.AUTO_REUSE):
            for layer in self._params['decoder_layers']:
                z = tf.layers.dense(z, layer, activation=self._params['activation'])
            mu = tf.layers.dense(z, INPUT_SIZE)
        return tf.nn.sigmoid(mu)

    def _calculate_loss(self, images, decoded_images, encoder_mu, encoder_var):
        loss_reconstruction = -tf.reduce_sum(
            tf.contrib.distributions.Normal(
                decoded_images,
                self._params['decoder_std']
            ).log_prob(images),
            axis=1
        )
        loss_prior = -0.5 * tf.reduce_sum(
            1 + tf.log(encoder_var) - encoder_mu ** 2 - encoder_var,
            axis=1
        )
        return loss_reconstruction + loss_prior
```

In [5]:
```python
class Manager(object):
    def __init__(self, params, experts, images):
        self._params = params
        self._experts = experts
        probs = self.calc_probs(images)
        self.expected_expert_loss, self.balancing_loss, self.loss = self._calculate_loss(probs)

    def calc_probs(self, images):
        with tf.variable_scope('prob', reuse=tf.AUTO_REUSE):
            x = images
            for layer in self._params['manager_layers']:
                x = tf.layers.dense(x, layer, activation=self._params['activation'])
            logits = tf.layers.dense(x, len(self._experts))
            probs = tf.nn.softmax(logits)
        return probs

    def _calculate_loss(self, probs):
        losses = tf.concat([tf.reshape(expert.loss, [-1, 1]) for expert in self._experts], axis=1)
        expected_expert_loss = tf.reduce_mean(tf.reduce_sum(losses * probs, axis=1), axis=0)
        experts_importance = tf.reduce_sum(probs, axis=0)
        _, experts_importance_var = tf.nn.moments(experts_importance, axes=[0])
        balancing_loss = experts_importance_var
        loss = expected_expert_loss + self._params['balancing_weight'] * balancing_loss
        return expected_expert_loss, balancing_loss, loss
```

In [6]:
```python
images = tf.placeholder(tf.float32, [None, INPUT_SIZE])
experts = [VAE(params, images) for _ in range(NUM_DIGITS)]
manager = Manager(params, experts, images)

train_op = tf.train.AdamOptimizer(params['learning_rate']).minimize(manager.loss)
```

In [7]:
```python
samples = []
expected_expert_losses = []
balancing_losses = []
losses = []

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for epoch in range(params['epochs']):
        # train over the batches
        for _ in range(mnist.train.num_examples // params['batch_size']):
            batch_images, batch_digits = mnist.train.next_batch(params['batch_size'])
            sess.run(train_op, feed_dict={images: batch_images})

        # keep track of the loss
        expected_expert_loss, balancing_loss, loss = sess.run(
            [manager.expected_expert_loss, manager.balancing_loss, manager.loss],
            {images: mnist.train.images}
        )
        expected_expert_losses.append(expected_expert_loss)
        balancing_losses.append(balancing_loss)
        losses.append(loss)

        # generate random samples so we can have a look later on
        sample_z = np.random.randn(1, params['z_dim'])
        gen_samples = sess.run([expert.decode(tf.constant(sample_z, dtype='float32'))
                                for expert in experts])
        samples.append(gen_samples)
```

In [8]:
```python
plt.subplot(131)
plt.plot(expected_expert_losses)
plt.title('expected expert loss', y=1.07)

plt.subplot(132)
plt.plot(balancing_losses)
plt.title('balancing loss', y=1.07)

plt.subplot(133)
plt.plot(losses)
plt.title('total loss', y=1.07)

plt.tight_layout()
```

Another Datum by Yoel Zeldes - 5M ago

So you've finished training your model, and it's time to get some insights as to what it has learned. You decide which tensor should be interesting, and go look for it in your code - to find out what its name is. Then it hits you - you forgot to give it a name. You also forgot to wrap the logical code block with a named scope. It means you'll have a hard time getting a reference to the tensor. It holds for python scripts as well as TensorBoard:

Can you see that small red circle lost in the sea of tensors? Finding it is hard... That's a bummer!

It would have been much better if it looked more like this:

That's more like it! Each set of tensors which form a logical unit is wrapped inside a named scope.

Why can't the graph be automatically constructed in a way that resembles your code? I mean, most chances are you didn't construct the model using a single function, did you? Your code base contains multiple functions - each forms a logical unit which deserves its own named scope!

Let's say you have a tensor x which was defined by the function f, which in turn was called by g. It means that while you were writing the code, you had this logical structure in mind: g -> f -> x. Wouldn't it be great if the model would automatically be constructed in a way that the name of the tensor would be g/f/x? Come to think of it, it's pretty simple to do.
All you have to do is go over all your functions and add a single line of code:

```python
def f():
    with tensorflow.name_scope('f'):
        # define tensors
```

So what's wrong with that approach?

1. The name of the function f appears twice - both in the function declaration and as an argument to tensorflow.name_scope. Maybe next week you'll change the name of the function to something more meaningful, let's say foo. Unfortunately, you might forget to update the name of the scope!
2. You have to apply indentation to the entire body of f. While it's not that bad, personally I don't like having high indentation levels. Let's say f contains a for loop which contains an if statement, which contains another for loop. Thanks to calling tensorflow.name_scope, we're already at an indentation level of 4!

We can bypass these disadvantages using simple metaprogramming - Python's decorators to the rescue!

```python
import re

def name_scope(f):
    def func(*args, **kwargs):
        name = f.__name__[re.search(r'[^_]', f.__name__).start():]
        with tensorflow.name_scope(name):
            return f(*args, **kwargs)
    return func

@name_scope
def foo():
    # define tensors
```

How does it work? The @ is syntactic sugar. It's equivalent to the following:

```python
def foo():
    # define tensors

foo = name_scope(foo)
```

name_scope gets a function as an argument (f) and returns a new function (func). func creates a named scope, and then calls f. The result? All the tensors that are defined by f will be created inside a named scope. The name of the scope will be the name of the original function ("foo") - thanks to f.__name__.

One small problem is that while function names might start with "_", tensorflow scope names can't. This is why we have to use re.

Why is it that important?

The challenge of writing clean tensorflow code is negligible compared to the research challenge of actually making the model any good. Thus, it's easy to be tempted to just focus on the research aspects of your job.
However, in the long run, it's important not to neglect the maintainability and readability of your code, including those of your graph. The decorator approach makes my job a little easier, and I hope you'll benefit from it too. Do you have other tips you'd like to share? Drop a line in the comments!

Originally published by me at engineering.taboola.com.

Another Datum by Yoel Zeldes - 5M ago

Some of the problems we tackle using machine learning involve categorical features that represent real world objects, such as words, items and categories. So what happens when at inference time we get new object values that have never been seen before? How can we prepare ourselves in advance so we can still make sense out of the input?

Unseen values, also called OOV (Out of Vocabulary) values, must be handled properly. Different algorithms have different methods to deal with OOV values, and different assumptions about the categorical features call for different treatments as well.

In this post, I'll focus on the case of deep learning applied to dynamic data, where new values appear all the time. I'll use Taboola's recommender system as an example. Some of the inputs the model gets at inference time contain unseen values - this is common in recommender systems. Examples include:

• Item id: each recommendable item gets a unique identifier. Every day thousands of new items get into the system.
• Advertiser id: sponsored content is created by advertisers. The number of new daily advertisers is much smaller compared to the number of new items. Nonetheless, it's important to handle them correctly, especially since we want to support new advertisers.

So what's the challenge with OOV values?

## Learning to handle OOV values

An OOV value is associated with values not seen by the model at training time. Hence, if we get an OOV value at inference time, the model won't know what to do with it.
One simple solution is to replace all the rare values with a special OOV token before training. Since all OOV values are the same from the model's point of view, we'll replace them with the OOV token at inference time as well. This solution has two positive outcomes:

1. The model will be exposed to the OOV token while training. In deep learning we usually embed categorical features. After training, the model will have learned a meaningful embedding for all OOV values.
2. The risk of overfitting to the rare values will be mitigated. These values appear in a small number of examples. If we learn embeddings for these values, the model might learn to use them to explain particularities or random noise found in these specific examples. Another disaster that can result from learning these embeddings is not getting enough gradient updates propagated to them. As a consequence, the random initialization will dominate the resulting embeddings over the signal learned through training.

Problem solved... Or is it?

## Handling OOV values is hard!

The model uses the item id feature to memorize different information per item, similarly to the pure collaborative filtering approach. Rare items that are injected with the OOV token can't benefit from it, so the model performs worse on them.

The interesting thing is that even if we don't use the item id at all during training, the model still performs worse on rare items! This is because they come from a distribution different than that of the general population. They have specific characteristics - maybe they performed poorly online, which caused Taboola's recommender system to recommend them less, and in turn - they became rare in the dataset. So why does this distribution difference matter? If we learn the OOV embedding using this special distribution, it won't generalize to the general population. Think about it this way - every item was a new item at some point. At that point, it was injected with the OOV token.
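Before moving on, the rare-value replacement described earlier can be sketched in a few lines. The frequency threshold and the token string below are made-up choices for illustration:

```python
from collections import Counter

OOV_TOKEN = '<OOV>'   # hypothetical token string
MIN_COUNT = 2         # made-up frequency threshold

def build_vocab(train_values):
    # keep only values frequent enough to deserve their own embedding
    counts = Counter(train_values)
    return {v for v, c in counts.items() if c >= MIN_COUNT}

def replace_oov(values, vocab):
    # rare training values and unseen inference values both map to the OOV token
    return [v if v in vocab else OOV_TOKEN for v in values]

train = ['item_a', 'item_a', 'item_b', 'item_b', 'item_c']
vocab = build_vocab(train)

print(replace_oov(train, vocab))                   # item_c is rare -> OOV
print(replace_oov(['item_a', 'item_new'], vocab))  # unseen value -> OOV
```

The same mapping is applied at training and inference time, so the model sees the OOV token during training and knows what to do with unseen values later on.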
So the OOV embedding should perform well for all possible items.

## Randomness is the data scientist's best friend

In order to learn the OOV embedding using the general population, we can inject the OOV token into a random set of examples from the dataset before we start the training process. But how many examples will suffice? The more we sample, the better the OOV embedding will be. But at the same time, the model will be exposed to fewer non-OOV values, so the performance will degrade.

How can we use lots of examples to train the OOV embedding, while at the same time using the same examples to train the non-OOV embeddings? Instead of randomly injecting the OOV token before starting to train, we chose the following approach: in each epoch the model trains using all of the available values (the OOV token isn't injected). At the end of the epoch we sample a random set of examples, inject the OOV token, and train the model once again. This way, we enjoy both worlds! As was done in the previous approach, we also inject the OOV token into rare values - to avoid overfitting.

To evaluate the new approach, we injected the OOV token into all of the examples and evaluated our offline metric (MSE). It improved by 15% compared to randomly injecting the OOV token before the model starts to train.

## Final thoughts

Our model had been used in production for a long time before we thought of the new approach. It would have been easy to miss this potential performance gain, since the model performed well overall. It just stresses the fact that you always have to look for the unexpected!

Originally published by me at engineering.taboola.com.

Another Datum by Yoel Zeldes - 5M ago

In the last couple of years deep learning (DL) has become a main enabler for applications in many domains such as vision, NLP, audio and click stream data.
Recently, researchers started to successfully apply deep learning methods to graph datasets in domains like social networks, recommender systems and biology, where data is inherently structured in a graphical way. So how do Graph Neural Networks work? Why do we need them?

## The Premise of Deep Learning

In machine learning tasks involving graphical data, we usually want to describe each node in the graph in a way that allows us to feed it into some machine learning algorithm. Without DL, one would have to manually extract features, such as the number of neighbors a node has. But this is a laborious job. This is where DL shines: it automatically exploits the structure of the graph in order to extract features for each node. These features are called embeddings.

The interesting thing is that even if you have absolutely no information about the nodes, you can still use DL to extract embeddings. The structure of the graph - that is, the connectivity patterns - holds viable information. So how can we use the structure to extract information? Can the context of each node within the graph really help us?

## Learning from Context

One well-known algorithm that extracts information about entities using context alone is word2vec. The input to word2vec is a set of sentences, and the output is an embedding for each word. Similarly to the way text describes the context of each word via the words surrounding it, graphs describe the context of each node via neighbor nodes.

While in text words appear in linear order, in graphs that's not the case. There's no natural order between neighbor nodes. So we can't use word2vec... Or can we?

## Reduction like a Badass Mathematician

We can apply a reduction from the graphical structure of our data into a linear structure, such that the information encoded in the graphical structure isn't lost. Doing so, we'll be able to use good old word2vec. The key point is to perform random walks in the graph.
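Such a walk can be sketched in plain Python. The graph below is made up, and note that node2vec's real walks are additionally biased by its return and in-out parameters; this is just the simplest, uniform flavor:

```python
import random

def random_walk(graph, start, length, rng=random):
    # graph: dict mapping each node to a list of its neighbors
    walk = [start]
    for _ in range(length - 1):
        neighbors = graph[walk[-1]]
        if not neighbors:  # dead end: stop the walk early
            break
        walk.append(rng.choice(neighbors))
    return walk

# A tiny made-up graph.
graph = {
    'A': ['B', 'C'],
    'B': ['A', 'C', 'D'],
    'C': ['A', 'B'],
    'D': ['B', 'E'],
    'E': ['D'],
}

random.seed(0)
walk = random_walk(graph, 'A', length=4)
print(walk)  # a sentence-like sequence of nodes that can be fed into word2vec
```

Collect many such walks from random starting nodes, and you have a corpus of "sentences" ready for word2vec.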
Each walk starts at a random node, and performs a series of steps, where each step goes to a random neighbor. Each random walk forms a sentence that can be fed into word2vec. This algorithm is called node2vec. There are more details in the process, which you can read about in the original paper.

## Case study

Taboola's content recommender system gathers lots of data, some of which can be represented in a graphical manner. Let's inspect one type of data as a case study for using node2vec.

Taboola recommends articles in a widget shown on publishers' websites:

Each article has named entities - the entities described by the title. For example, the item "the cutest dogs on the planet" contains the entities "dog" and "planet". Each named entity can appear in many different items. We can describe this relationship using a graph in the following way: each node will be a named entity, and there will be an edge between two nodes if the two named entities appear in the same item:

Now that we are able to describe our data in a graphical manner, let's run node2vec to see what insights we can learn out of the data. You can find the working code here.

After learning node embeddings, we can use them as features for a downstream task, e.g. CTR (Click Through Rate) prediction. Although they could benefit the model, it'll be hard to understand the qualities learned by node2vec. Another option would be to cluster similar embeddings together using K-means, and color the nodes according to their associated cluster:

Cool! The clusters captured by node2vec seem to be homogeneous. In other words, nodes that are close to each other in the graph are also close to each other in the embedding space. Take for instance the orange cluster - all of its named entities are related to basketball.

You might wonder what the benefit of using node2vec is over classical graph algorithms, such as community detection algorithms (e.g., the Girvan-Newman algorithm).
Capturing the community each node belongs to can definitely be done using such algorithms - there's nothing wrong with it. Actually, that's exactly feature engineering. And we already know that DL can save you the time of carefully handcrafting such features. So why not enjoy this benefit? We should also keep in mind that node2vec learns high dimensional embeddings. These embeddings are much richer than mere community belonging.

## Taking Another Approach

Using node2vec in this use case might not be the first idea that comes to mind. One might suggest to simply use word2vec, where each sentence is the sequence of named entities inside a single item. In this approach we don't treat the data as having a graphical structure. So what's the difference between this approach - which is valid - and node2vec?

If we think about it, each sentence we generate in the word2vec approach is a walk in the graph we've defined earlier. node2vec also defines walks on the same graph. So they are the same, right? Let's have a look at the clusters we get from the word2vec approach:

Now the "basketball" cluster is less homogenous - it contains both orange and blue nodes. The named entity "Basketball" for example was colored orange, while the basketball players "Lebron James" and "Kobe Bryant" were colored blue!

But why did this happen? In this approach each walk in the graph is composed only of named entities that appear together in a single item. It means we are limited to walks that don't go further than distance 1 from the starting node. In node2vec, we don't have that limit. Since each approach uses a different kind of walks, the learned embeddings capture a different kind of information.

To make it more concrete, consider the following example: say we have two items - one with named entities A, B, C and another with D, B, E. These items induce the following graph:

In the simple word2vec approach we'll generate the following sentences: [A, B, C] and [D, B, E].
In the node2vec approach we could also get sentences like [A, B, E]. If we feed the latter into the training process, we'll learn that E and C are interchangeable: the prefix [A, B] will be able to predict both C and E. Therefore, C and E will get similar embeddings, and will be clustered together.

## Takeaway

Using the right data structure to represent your data is important. Each data structure implies a different learning algorithm, or in other words - introduces a different inductive bias. Identifying that your data has a certain structure, so you can use the right tool for the job, might be challenging.

Since so many real world datasets are naturally represented as graphs, we think Graph Neural Networks are a must-have in our tool box as data scientists.

Originally published at engineering.taboola.com by me and Zohar Komarovsky.

Another Datum by Yoel Zeldes - 6M ago

Personal branding is a thing now. It always has been, but I believe it's been getting more and more attention recently. More people are aware of its importance, including employers. Giving you a big paycheck, assuming you're good, is obvious. Providing opportunities to flourish and build your personal brand is something an increasing number of companies are trying to seduce you with.

While working in the algorithms group at Taboola, I was encouraged by the company to share my knowledge with the data science community. It has motivated me to embark on a journey to build my personal brand as a data scientist, and I want to share how I did it with you.

## Choosing Your Path Wisely

There are many paths you can choose in your journey to build your personal brand. The problem is that your time is limited. There's a lot of uncertainty about each path and its outcomes, so it's best to make educated decisions. There are several criteria by which I judge each path:

• How likely is it that I'll get to the finish line?
This is the most important criterion when you just start building your brand. If you fail on the first path you try, it's gonna be a motivation killer, and you'll quit.
• How much time will it take? Together with the previous criterion, it enables you to better manage your time. Maybe it's OK to choose a risky path, given it's only gonna take you a small amount of time.
• Will I learn something new? Data scientists have to position themselves as experts in their domain. In order to be an expert, you must keep learning all the time.
• What will the final product be? At the beginning of your journey, the product can be mild - maybe a nice blog post for beginners. As you gain more experience, you should put more emphasis on doing more meaningful things. Maybe a blog post about an advanced aspect of ML, or even trying to win a Kaggle competition.
• Will I enjoy walking along the path? At the end of the day, we do what we do for fun, right? :)

So which paths exist?

## The Roadmap

1. Blogging

For me, it's the path of choice. The advantages of having an active blog are many. But how do you choose what to write about?

The most obvious option is to write about something you worked on as part of your job. Most people do interesting things. The sad thing is, they're not aware of it. They think their daily tasks won't be interesting to others. I claim that's not true.

Another option is to write a post about a paper you've just read. I believe providing value is important - merely summarizing a paper can be done by almost anyone. You must make your post unique. I love to write code that accompanies the post, for instance. In fact, most of my posts are Jupyter notebooks. There is plenty of ML material on the web, but code samples that demonstrate how things actually work are hard to come by...

A third option, which is for the more experienced data scientists, is to write an opinion post. Why is it for the experienced?
Inexperienced data scientists will be able to deliver their tasks; experienced data scientists will be able to state their opinion about the best way to do so. Which framework is better for the task? Is this paper any good? When should you use this visualization and not the other?

Another, somewhat harder option would be to write a post as if you were trying to teach a subject in class. You'll want to write the post in a concise, short, to-the-point manner. If you can do it well, there's a lot of value in it. In my Intro to Statistical Tests post I took a well-known concept - statistical tests - and tried to explain it to newcomers. I also added code, so it'd be even easier to comprehend. Writing this kind of post will allow you to better understand the subject yourself!

2. Working on a side project

Try to think of a cool project you can build in a couple of weeks/months, without putting too much time into it. Once you're done, you can show the project to the world! Recognizing in advance that one of the goals is to share the project, you can choose which project to do more wisely. For example, solving MNIST won't be interesting. You must choose something original. Here's an example of a post I wrote about Word Morphing which I believe was interesting because of its originality.

One of the cool things about side projects is that besides the project itself, you can write a blog post about it. If you have an idea for a project which is the material for a blog post - it's a great sign the project will be good.

3. Kaggle competitions

Try to compete using a model you've never tried before. You might not win, but you will gain knowledge and a bit of experience with that new model, which you can share in a blog post! The winners of Kaggle competitions write blog posts all the time. There's no reason why people who don't win shouldn't write blog posts as well.

4. Meetups and conferences

Strive to give talks at interesting Meetups/conferences.
The first time I gave a talk at a Meetup was because Taboola was asked to give one, and I took the bullet. Only after the talk did I realize what a cool and fun experience it could be. When I heard about the Reversim summit, the first thing I did was to submit a talk proposal. Submitting is easy, and no harm is done if your proposal gets rejected. I was lucky enough to get mine accepted. Don't get me wrong, it required preparation. But there's a lot of value in it.

Even if you just attend an interesting conference without giving a talk, you can write a blog post about it. It builds your brand as someone who goes to conferences and shares knowledge with the community. Here's an example of a post I wrote after attending an interesting conference.

5. Contributing to Open Source projects

If you become a significant contributor, your brand is gonna shine. If we're only talking about small contributions, I'm afraid it might not do much. But that's the nice thing though - you can put in a small effort, see if you like it, and if you do - invest more time in it with the goal of becoming a significant contributor.

6. Publishing a paper

If you publish a paper in a known proceeding, it'll positively affect your brand. But if you're not in academia, or if you don't work in a company where publishing papers is part of your job, then the risk is huge. You'll need to put in a lot of effort and time, and there's a high probability your paper won't be accepted. I was lucky enough to publish a paper with my coworkers at Taboola, but I'm afraid I won't go down this path in my spare time - because of the risk.

## Final thoughts

Building your personal brand is a journey. There are no right or wrong paths to take. It's all about your style, what you like more, and how much time you have to put into it. I believe all data scientists should put some effort into spreading their knowledge. It can be fun, and it's awesome for your brand, as well as the brand of the company you work at.
Do you have some other path you've taken that worked better for you? Let the world know by dropping a comment!

Another Datum by Yoel Zeldes - 7M ago

Tensorflow is great. Really, I mean it. The problem is it's great up to a point. Sometimes you want to do very simple things, but tensorflow gives you a hard time. The motivation I had behind writing TFFS (TensorFlow File System) can be shared by anyone who has used tensorflow, including you. All I wanted was to know what the name of a specific tensor is, or what its input tensors are (ignoring operations).

All of these questions can be easily answered using tensorboard. Sure, you just open the graph tab, and visually examine the graph. Really convenient, right? Well, only if you want a bird's-eye view of the graph. But if you're focused and have a specific question you want to answer, using the keyboard is the way to go. So why not load the graph inside a python shell and examine it? That's doable, but writing these lines of code every time I want to do that task? Having to remember how to load a graph, how to look for a tensor, how to get its inputs... Sure, it's only a couple of lines of code, but once you repeat the same task over and over again, it's time to write a script!

So why not write an interactive script? You mean a script that, given the path to your model, loads it for you and provides utility functions to ease your pain of writing tensorflow code? Well, we could do that, but that's not gonna be as awesome as what I'm gonna show you!

Disclaimer: if you want a solution that makes sense, stop reading here and just use the interactive script approach. Continue reading only if you want to learn something a bit different ;)

Filesystem to the Rescue!

The names of tensors have slashes - a striking resemblance to the UNIX filesystem.
Imagine a world with a tensorflow filesystem, where directories are analogous to tensorflow scopes, and files - to tensors. Given such a filesystem, we could use good old bash to do what we want. For instance, we could list all available scopes and tensors by running find ~/tf - assuming ~/tf is where the tensorflow filesystem is mounted. Want to list only tensors? No problem, just use find ~/tf -type f.

Hey, we've got files, right? What should their content be? Let's run cat ~/tf/.../tensor_name. Wow! We got the tensor's value - nice... This will work only if the tensor doesn't depend on placeholders, for obvious reasons.

What about getting the inputs to a tensor? Well, you can run ~/tf/bin/inputs -d 3 ~/tf/.../tensor_name. This special script will print a tree view of the inputs with a recursion depth of 3. Nice...

OK, I'm in. How do we implement it? That's a fine question. We can use Filesystem in Userspace (FUSE). It's a technology that allows us to implement a filesystem in user space. It saves us the trouble of going into the low-level kernel realm, which is really hard if you've never done it before (and slow to implement), and which involves writing code in C - the horror! We'll use a python binding called fusepy.

In this post I'll explain only the interesting parts of the implementation. The entire code can be found here. Including documentation, it's only 345 lines of code.

First we need to load a tensorflow model:

1. Import the graph structure using tf.train.import_meta_graph.
2. If the model was trained, load the weights using saver.restore.

In [ ]:
def _load_model(model_path):
    """
    Load a tensorflow model from the given path.
    It's assumed the path is either a directory containing a .meta file,
    or the .meta file itself.
    If there's also a file containing the weights with the same name as the
    .meta file (without the .meta extension), it'll be loaded as well.
    """
    if os.path.isdir(model_path):
        meta_filename = [filename for filename in os.listdir(model_path)
                         if filename.endswith('.meta')]
        assert len(meta_filename) == 1, \
            'expecting to get a .meta file or a directory containing a .meta file'
        model_path = os.path.join(model_path, meta_filename[0])
    else:
        assert model_path.endswith('.meta'), \
            'expecting to get a .meta file or a directory containing a .meta file'
    weights_path = model_path[:-len('.meta')]
    graph = tf.Graph()
    with graph.as_default():
        saver = tf.train.import_meta_graph(model_path)
    if os.path.isfile(weights_path):
        session = tf.Session(graph=graph)
        saver.restore(session, weights_path)
    else:
        session = None
    return graph, session

Mapping Tensors to Files

Next, I'll describe the main class. There are several interesting things in the constructor. First, we call _load_model. Then, we map each tensor in the graph to a path in the filesystem. Say a tensor has the name a/b/c:0 - it will be mapped to the path ~/tf/a/b/c:0, assuming the mount point is ~/tf. Each directory and file is created using the _create_dir and _create_tensor_file functions. These functions return a simple python dict with metadata about the directory or file. We also call _populate_bin, which I'll touch on later.

In [ ]:
class TfFs(fuse.Operations):
    def __init__(self, mount_point, model_path):
        self._graph, self._session = _load_model(model_path)
        self._files = {}
        self._bin_scripts = {}
        self._tensor_values = {}
        now = time()
        self._files['/'] = _create_dir(now)
        self._files['/bin'] = _create_dir(now)
        self._populate_bin(mount_point)

        for op in self._graph.get_operations():
            for tensor in op.outputs:
                # create a directory entry for every scope prefix of the name
                next_slash_index = 0
                while next_slash_index >= 0:
                    next_slash_index = tensor.name.find('/', next_slash_index + 1)
                    if next_slash_index >= 0:
                        key = '/' + tensor.name[:next_slash_index]
                        if key not in self._files:
                            self._files[key] = _create_dir(now)

                self._files['/' + tensor.name] = _create_tensor_file(tensor)

    # ...
Working with fusepy

fusepy requires you to implement a class that extends fuse.Operations. This class implements the operations the filesystem supports. In our case, we want to support reading files, so we'll implement the read function. When called, this function will evaluate the tensor associated with the given path.

In [ ]:
    # ...
    def _eval_tensor_if_needed(self, path):
        """
        Given a path to a tensor file, evaluate the tensor and cache the
        result in self._tensor_values.
        """
        if self._session is None:
            return None
        if path not in self._tensor_values:
            self._tensor_values[path] = self._session.run(
                self._graph.get_tensor_by_name(path[1:]))
        return self._tensor_values[path]

    def read(self, path, size, offset, fh):
        if path.startswith('/bin/'):
            return self._bin_scripts[path][offset:offset + size]

        val = self._eval_tensor_if_needed(path)
        with printoptions(suppress=True,
                          formatter={'all': _fixed_val_length},
                          threshold=sys.maxint,
                          linewidth=sys.maxint):
            return str(val)[offset:offset + size]
    # ...

Worried about the printoptions part? When implementing a filesystem you need to know the size of the files. We could evaluate all tensors in order to know their sizes, but that would take time and memory. Instead, we can examine each tensor's shape. Given the shape, and given that we use a formatter that outputs a fixed number of characters per entry in the result (this is where _fixed_val_length comes in), we can calculate the size.

Getting Inputs and Outputs

While tensorflow scopes have a structure that resembles a filesystem, tensor inputs and outputs don't. So instead of using the filesystem to get inputs and outputs, we can write a script that can be executed as follows:

~/tf/bin/outputs --depth 3 ~/tf/a:0

The result will look like this:

~/tf/a:0
├── ~/tf/b:0
└── ~/tf/c:0
    ├── ~/tf/d:0
    │   └── ~/tf/e:0
    └── ~/tf/f:0

Nice! We have the tree of outputs! To implement it, all we have to do is:

1. Get the data, which is a mapping from each tensor to its outputs.
2. Implement a recursive function that prints the tree.

It's not that complicated (it took me 6 lines of code), and can be a nice exercise. There's one challenge though... The outputs script is gonna be executed inside a new process - this is how UNIX works. It means it won't have access to the tensorflow graph, which was loaded in the main process. So how is it going to access all the inputs/outputs of a tensor? I could have implemented interprocess communication between the two processes, which is a lot of work. But I chose a different approach. I created a template python file that contains the following line:

_TENSOR_TO_DEPENDENCIES = {{TENSOR_TO_TENSORS_PLACEHOLDER}}

This is illegal python code. The _populate_bin function which we saw earlier reads this python file, and replaces {{TENSOR_TO_TENSORS_PLACEHOLDER}} with the dictionary mapping each tensor to its outputs (or inputs). The resulting file is then mapped to a path in our filesystem - ~/tf/bin/outputs (or ~/tf/bin/inputs). It means that if you run cat ~/tf/bin/outputs, you'll be able to see the (potentially) huge mapping inside the file.

cat ./final_thoughts

We did it! We mapped a tensorflow model to a filesystem. TFFS was a fun small project, and I learned about FUSE while doing so, which is neat. TFFS is a cute tool, but it's not a replacement for the good old python shell. With python, you can easily import a tensorflow model and inspect the tensors manually. I just wish I could remember how to do so, even after doing it hundreds of times...

Another Datum by Yoel Zeldes - 8M ago

In the previous post of this series I introduced the Variational Autoencoder (VAE) framework, and explained the theory behind it. In this post I'll explain the VAE in more detail, or in other words - I'll provide some code :) After reading this post, you'll understand the technical details needed to implement VAE.
As a bonus point, I'll show you how, by imposing a special role on some of the latent vector's dimensions, the model can generate images conditioned on the digit type.

In [1]:
import numpy as np
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data
import matplotlib.pyplot as plt

np.random.seed(42)
tf.set_random_seed(42)
%matplotlib inline

The model will be trained on MNIST - the handwritten digits dataset. The input is an image in $\mathbb{R}^{28 \times 28}$.

In [2]:
mnist = input_data.read_data_sets('MNIST_data')
input_size = 28 * 28
num_digits = 10

Next we'll define the hyperparameters we're going to use. Feel free to play with different values to get a feeling of how the model is affected. The notebook can be found here.

In [3]:
params = {
    'encoder_layers': [128],               # the encoder will be implemented using a simple feed forward network
    'decoder_layers': [128],               # and so will the decoder (CNN will be better, but I want to keep the code simple)
    'digit_classification_layers': [128],  # this is for the conditioning. I'll explain it later on
    'activation': tf.nn.sigmoid,           # the activation function used by all sub-networks
    'decoder_std': 0.5,                    # the standard deviation of P(x|z) discussed in the previous post
    'z_dim': 10,                           # the dimension of the latent space
    'digit_classification_weight': 10.0,   # this is for the conditioning. I'll explain it later on
    'epochs': 20,
    'batch_size': 100,
    'learning_rate': 0.001
}

The Model

The model is composed of three sub-networks:

1. Given $x$ (image), encode it into a distribution over the latent space - referred to as $Q(z|x)$ in the previous post.
2. Given $z$ in latent space (code representation of an image), decode it into the image it represents - referred to as $f(z)$ in the previous post.
3. Given $x$, classify its digit by mapping it to a layer of size 10 where the i'th value contains the probability of the i'th digit.

The first two sub-networks are the vanilla VAE framework.
The third one is used as an auxiliary task, which will enforce some of the latent dimensions to encode the digit found in an image. Let me explain the motivation: in the previous post I explained that we don't care what information each dimension of the latent space holds. The model can learn to encode whatever information it finds valuable for its task. Since we're familiar with the dataset, we know the digit type should be important. We want to help the model by providing it with this information. Moreover, we'll use this information to generate images conditioned on the digit type, as I'll explain later.

Given the digit type, we'll encode it using one-hot encoding, that is, a vector of size 10. These 10 numbers will be concatenated to the latent vector, so when decoding that vector into an image, the model will make use of the digit information.

There are two ways to provide the model with a one-hot encoding vector:

1. Add it as an input to the model.
2. Add it as a label so the model will have to predict it by itself: we'll add another sub-network that predicts a vector of size 10 where the loss is the cross entropy with the expected one-hot vector.

We'll go with the second option. Why? Well, at test time we can use the model in two ways:

1. Provide an image as input, and infer a latent vector.
2. Provide a latent vector as input, and generate an image.

Since we want to support the first option too, we can't provide the model with the digit as input, since we won't know it at test time. Hence, the model must learn to predict it.

Now that we understand all the sub-networks composing the model, we can code them. The mathematical details behind the encoder and decoder can be found in the previous post.
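Two standard VAE results from that post are worth restating, since the code leans on them directly. The reparameterization trick samples the latent vector differentiably:

$z = \mu + \sigma \odot \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)$

And since both $Q(z|x) = \mathcal{N}(\mu, \sigma^2)$ (with a diagonal covariance matrix) and the prior $\mathcal{N}(0, I)$ are Gaussians, the KL-divergence term of the loss has a closed form:

$D_{KL} = -\frac{1}{2} \sum_{i=1}^{d} \left( 1 + \log \sigma_i^2 - \mu_i^2 - \sigma_i^2 \right)$

In the code, encoder_var holds $\sigma^2$, which is why the sampling step multiplies eps by tf.sqrt(encoder_var), and why loss_prior sums exactly the four terms above.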
In [4]:
def encoder(x, layers):
    for layer in layers:
        x = tf.layers.dense(x, layer, activation=params['activation'])
    mu = tf.layers.dense(x, params['z_dim'])
    var = 1e-5 + tf.exp(tf.layers.dense(x, params['z_dim']))
    return mu, var

def decoder(z, layers):
    for layer in layers:
        z = tf.layers.dense(z, layer, activation=params['activation'])
    mu = tf.layers.dense(z, input_size)
    return tf.nn.sigmoid(mu)

def digit_classifier(x, layers):
    for layer in layers:
        x = tf.layers.dense(x, layer, activation=params['activation'])
    logits = tf.layers.dense(x, num_digits)
    return logits

In [5]:
images = tf.placeholder(tf.float32, [None, input_size])
digits = tf.placeholder(tf.int32, [None])

# encode an image into a distribution over the latent space
encoder_mu, encoder_var = encoder(images, params['encoder_layers'])

# sample a latent vector from the latent space - using the reparameterization trick
eps = tf.random_normal(shape=[tf.shape(images)[0], params['z_dim']],
                       mean=0.0, stddev=1.0)
z = encoder_mu + tf.sqrt(encoder_var) * eps

# classify the digit
digit_logits = digit_classifier(images, params['digit_classification_layers'])
digit_prob = tf.nn.softmax(digit_logits)

# decode the latent vector - concatenated to the digits classification - into an image
decoded_images = decoder(tf.concat([z, digit_prob], axis=1),
                         params['decoder_layers'])

In [6]:
# the loss is composed of how well we can reconstruct the image
loss_reconstruction = -tf.reduce_sum(
    tf.contrib.distributions.Normal(
        decoded_images,
        params['decoder_std']
    ).log_prob(images),
    axis=1
)

# and how off the distribution over the latent space is from the prior.
# Given the prior is a standard Gaussian and the inferred distribution
# is a Gaussian with a diagonal covariance matrix, the KL-divergence
# becomes analytically solvable, and we get
loss_prior = -0.5 * tf.reduce_sum(
    1 + tf.log(encoder_var) - encoder_mu ** 2 - encoder_var,
    axis=1
)

loss_auto_encode = tf.reduce_mean(
    loss_reconstruction + loss_prior,
    axis=0
)

# digit_classification_weight is used to weight between the two losses,
# since there's a tension between them
loss_digit_classifier = params['digit_classification_weight'] * tf.reduce_mean(
    tf.nn.sparse_softmax_cross_entropy_with_logits(
        labels=digits, logits=digit_logits),
    axis=0
)

loss = loss_auto_encode + loss_digit_classifier
train_op = tf.train.AdamOptimizer(params['learning_rate']).minimize(loss)

Training

We'll train the model to optimize the two losses - the VAE loss and the classification loss - using SGD. At the end of every epoch we'll sample latent vectors and decode them into images, so we can visualize how the generative power of the model improves over the epochs. The sampling method is as follows:

1. Deterministically set the dimensions which are used for digit classification according to the digit we want to generate an image for. If for example we want to generate an image of the digit 2, these dimensions will be set to $[0010000000]$.
2. Randomly sample the other dimensions according to the prior - a multivariate Gaussian. We'll use these sampled values for all the different digits we generate in a given epoch. This way we can have a feeling of what is encoded in the other dimensions, for example stroke style.

The intuition behind step 1 is that after convergence the model should be able to classify the digit in an input image using these dimensions. On the other hand, these dimensions are also used in the decoding step to generate the image.
It means the decoder sub-network learns that when these dimensions have the values corresponding to the digit 2, it should generate an image of that digit. Therefore, if we manually set these dimensions to contain the information of the digit 2, we'll get a generated image of that digit.

In [7]:
samples = []
losses_auto_encode = []
losses_digit_classifier = []
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for epoch in xrange(params['epochs']):
        for _ in xrange(mnist.train.num_examples / params['batch_size']):
            batch_images, batch_digits = mnist.train.next_batch(params['batch_size'])
            sess.run(train_op, feed_dict={images: batch_images, digits: batch_digits})

        train_loss_auto_encode, train_loss_digit_classifier = sess.run(
            [loss_auto_encode, loss_digit_classifier],
            {images: mnist.train.images, digits: mnist.train.labels})

        losses_auto_encode.append(train_loss_auto_encode)
        losses_digit_classifier.append(train_loss_digit_classifier)

        sample_z = np.tile(np.random.randn(1, params['z_dim']), reps=[num_digits, 1])
        gen_samples = sess.run(decoded_images,
                               feed_dict={z: sample_z, digit_prob: np.eye(num_digits)})
        samples.append(gen_samples)

Let's verify both losses look good, that is - decreasing:

In [8]:
plt.subplot(121)
plt.plot(losses_auto_encode)
plt.title('VAE loss')

plt.subplot(122)
plt.plot(losses_digit_classifier)
plt.title('digit classifier loss')

plt.tight_layout()
Another Datum by Yoel Zeldes - 8M ago

So you just finished designing that great neural network architecture of yours. It has a blazing number of 300 fully connected layers interleaved with 200 convolutional layers with 20 channels each, where the result is fed as the seed of a glorious bidirectional stacked LSTM with a pinch of attention. After training you get an accuracy of 99.99%, and you’re ready to ship it to production.

But then you realize the production constraints won’t allow you to run inference using this beast. You need the inference to be done in under 200 milliseconds.

In other words, you need to chop off half of the layers, give up on using convolutions, and let’s not get started about the costly LSTM...

If only you could make that amazing model faster!

Sometimes you can

Here at Taboola we did it. Well, not exactly... Let me explain.

One of our models has to predict CTR (Click Through Rate) of an item, or in other words — the probability the user will like an article recommendation and click on it.

The model has multiple modalities as input, each of which goes through a different transformation. Some of them are:

• categorical features: these are embedded into a dense representation
• image: the pixels are passed through convolutional and fully connected layers
• text: after being tokenized, the text is passed through an LSTM, which is followed by self-attention

These processed modalities are then passed through fully connected layers in order to learn the interactions between the modalities, and finally, they are passed through an MDN (Mixture Density Network) layer.
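As a rough sketch of that fusion step (hypothetical sizes, and a plain-numpy dense layer standing in for the real network - this is not Taboola's actual code), concatenating the per-modality embeddings and passing them through a fully connected layer looks like:

```python
import numpy as np

rng = np.random.RandomState(42)

def dense(x, w, b):
    # a single fully connected layer with ReLU, for illustration
    return np.maximum(0.0, x @ w + b)

# hypothetical per-modality embeddings for a batch of 4 items
categorical_emb = rng.randn(4, 16)  # embedded categorical features
image_emb = rng.randn(4, 32)        # output of the convolutional layers
text_emb = rng.randn(4, 24)         # output of the LSTM + self-attention

# learn interactions between modalities: concatenate, then fully connected
fused = np.concatenate([categorical_emb, image_emb, text_emb], axis=1)  # (4, 72)
w = rng.randn(72, 64) * 0.1
b = np.zeros(64)
interactions = dense(fused, w, b)  # (4, 64) - this is what feeds the MDN layer
```

The point is only the shape of the computation: every modality is reduced to a fixed-size vector first, which is what makes the caching trick described below possible.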

As you can imagine, this model is slow.

Instead of trimming components, we decided not to compromise on the model's predictive power, and came up with an engineering solution.

Cache me if you can

Let’s focus on the image component. The output of this component is a learned representation of the image. In other words, given an image, the image component outputs an embedding.

The model is deterministic, so the same image will always result in the same embedding. Computing that embedding is costly, so we can cache it. Let me elaborate on how we implemented it.

The architecture (of the cache, not the model)

• We used a Cassandra database as the cache which maps an image URL to its embedding.
• The service which queries Cassandra is called EmbArk (Embedding Archive, misspelled of course). It’s a gRPC server which gets an image URL from a client and retrieves the embedding from Cassandra. On cache miss EmbArk sends an async request to embed that image. Why async? Because we need EmbArk to respond with the result as fast as it can. Given it can’t wait for the image to be embedded, it returns a special OOV (Out Of Vocabulary) embedding.
• The async mechanism we chose to use is Kafka — a streaming platform used as a message queue.
• The next link is KFC (Kafka Frontend Client) — a Kafka consumer we implemented to pass messages synchronously to the embedding service, and save the resulting embeddings in Cassandra.
• The embedding service is called Retina. It gets an image URL from KFC, downloads it, preprocesses it, and evaluates the convolutional layers to get the final embedding.
• The load balancing of all the components is done using Linkerd.
• EmbArk, KFC, Retina and Linkerd run inside Docker, and they are orchestrated by Nomad. This allows us to easily scale each component as we see fit.
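Stripping away Cassandra, gRPC, Kafka and Docker, the core lookup flow can be sketched with plain in-memory Python stand-ins (all names below are mine, not the real services' APIs):

```python
import numpy as np

EMBEDDING_DIM = 128
OOV_EMBEDDING = np.zeros(EMBEDDING_DIM)  # special "not embedded yet" vector

class EmbeddingCache:
    """Minimal sketch of the EmbArk flow: return a cached embedding if we
    have one; otherwise enqueue the input for asynchronous embedding and
    return the OOV embedding immediately, without waiting."""
    def __init__(self):
        self._cache = {}  # stands in for Cassandra
        self._queue = []  # stands in for the Kafka topic

    def get(self, url):
        if url in self._cache:
            return self._cache[url]
        self._queue.append(url)  # async embed request; don't wait for it
        return OOV_EMBEDDING

    def consume(self, embed_fn):
        # stands in for KFC: drain the queue, embed, persist the results
        while self._queue:
            url = self._queue.pop(0)
            self._cache[url] = embed_fn(url)

cache = EmbeddingCache()
first = cache.get('http://example.com/cat.jpg')    # cache miss -> OOV embedding
cache.consume(lambda url: np.ones(EMBEDDING_DIM))  # "Retina" embeds the image
second = cache.get('http://example.com/cat.jpg')   # now a real embedding
```

The key design choice this captures is that get never blocks: a miss costs one queue write and returns the OOV vector, so the serving path stays fast even while embeddings are being computed in the background.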

This architecture was initially used for images. After proving its worth, we decided to use it for other components as well, such as text.

EmbArk proved to be a nice solution for transfer learning too. Let's say we believe the content of the image has a good signal for predicting CTR. Thus, a model trained for classifying the object in an image, such as Inception, would be valuable for our needs. We can load Inception into Retina, tell the model we intend to train that we want to use the Inception embeddings, and that's it.

Not only was the inference time improved, but also the training process. This is possible only when we don't want to train end to end, since gradients can't backpropagate through EmbArk.

So whenever you use a model in production you should use EmbArk, right? Well, not always...

Caveats

There are three pretty strict assumptions here.

1. OOV embedding for new inputs is not a big deal

It doesn’t hurt us that the first time we see an image we won’t have its embedding.

In our production system it's OK, since CTR is evaluated multiple times for the same item during a short period of time. We create lists of items we want to recommend every few minutes, so even if an item doesn't make it into the list because of a non-optimal CTR prediction, it will in the next cycle.

2. The rate of new inputs is low

It's true that at Taboola we get lots of new items all the time. But relative to the number of inferences we need to perform for already-known items, it's not that much.

3. Embeddings don’t change frequently

Since the embeddings are cached, we count on the fact that they don't change over time. If they do, we'll need to perform cache invalidation and recalculate the embeddings using Retina. If this happened a lot, we would lose the advantage of the architecture. For cases such as Inception or language modeling, this assumption holds, since semantics don't change significantly over time.

Some final thoughts

Sometimes using state of the art models can be problematic due to their computational demands. By caching intermediate results (embeddings) we were able to overcome this challenge, and still enjoy state of the art results.

This solution isn’t right for everyone, but if the three aforementioned assumptions hold for your application, you could consider using a similar architecture.

By using a microservices paradigm, other teams in the company were able to use EmbArk for needs other than CTR prediction. One team for instance used EmbArk to get image and text embeddings for detecting duplicates across different items. But I’ll leave that story for another post...