Yesterday I posted this tweet, which seemed to resonate with quite a few people. Ok, at least it got more likes than my tweets usually do. Of course, while writing snarky tweets is fun, it doesn’t do anything to address the underlying problem. Computational Chemistry and Cheminformatics (including applications of machine learning in drug discovery) are among the only remaining fields where publishing computational methods without source code is still deemed acceptable. I spend a lot of time reading the literature and am constantly frustrated by how difficult it is to reproduce published work. Given the ready availability of repositories for code and data, as well as tools like Jupyter notebooks that make it easy to share computational workflows, this shouldn’t be the case. I really liked this tweet from Egon Willighagen, one of the editors of the Journal of Cheminformatics (J. ChemInf.).
Shouldn’t we be able to download the code and data used to generate a paper and quickly reproduce the results? Even better, we should be able to try the code on our own data and determine whether the method is generally applicable. When code is available I do this and try to provide feedback to the authors on areas where the method or the paper could be improved.
High impact journals such as Science and Nature are now mandating the inclusion of source code with computational methods. Here’s the wording from the editorial policy at Science.
We require that all computer code used for modeling and/or data analysis that is not commercially available be deposited in a publicly accessible repository upon publication. In rare exceptional cases where security concerns or competing commercial interests pose a conflict, code-sharing arrangements that still facilitate reproduction of the work should be discussed with your Editor no later than the revision stage.
Nature is a bit less prescriptive, but still strongly urges the release of source code.
Upon publication, Nature Research Journals consider it best practice to release custom computer code in a way that allows readers to repeat the published results. We also strongly encourage authors to deposit and manage their source code revisions in an established software version control repository, such as GitHub, and to use a DOI-minting repository to uniquely identify the code and provide details of its license of use. We strongly encourage the use of a license approved by the open source initiative. The license of use and any restrictions must be described in the Code Availability statement.
At least one journal in our field is making an effort to ensure that the methods described in papers can be easily reproduced. J. ChemInf. has the following as part of its editorial policy.
If published, the software application/tool should be readily available to any scientist wishing to use it for non-commercial purposes, without restrictions (such as the need for a material transfer agreement). If the implementation is not made freely available, then the manuscript should focus clearly on the development of the underlying method and not discuss the tool in any detail.
Each J. ChemInf. paper also includes a section entitled “Availability of data and materials”, which specifically lists the availability of the code and data used in the paper. The editors of J. ChemInf. do a reasonably good job of enforcing this policy, although the occasional paper slips through without code, or with vague references to the data used in the paper. As far as I can tell, the other prominent Cheminformatics journals have no editorial policy regarding reproducibility or code release.
What can we do about this?
Pressure the journals in our field to create editorial policies that enforce the reproducibility of computational methods. Write to the editors of the journals that you read regularly and let them know that reproducibility is important to you.
As reviewers, either reject papers which aren’t reproducible or, at the very least, strongly suggest that the authors release their code with the paper. I’ve done this several times, and in many cases, the authors have released their code.
Release code with your papers! If enough people do this, it will become the norm. Ultimately we need a cultural shift so that we can reach a point where publishing a paper without code is deemed socially unacceptable. At one point in time, it was socially acceptable to smoke in public. Finally, enough people got fed up and cultural norms shifted.
This change has already occurred in other fields. In the 1990s, one could publish a protein crystal structure without depositing the coordinates in the PDB. I had a colleague who wrote software to extract 3D coordinates from stereo images in journals. Editorial policies changed and now no one would think of publishing a crystal structure without prior deposition of coordinates. Not too many years ago, it was possible to publish medicinal chemistry papers that were based on unreleased proprietary data. Due to changes in the editorial policy at the Journal of Medicinal Chemistry, this is no longer acceptable. This insistence on scientific rigor did not diminish the quality of the papers or the number of papers published. To the contrary, it made the fields stronger and made it easier to build upon the work of others.
I’ve tilted at this windmill before. In 2013, I wrote a perspective in the Journal of Chemical Information and Modeling that called for new standards for reproducibility of computational methods. As I said in a recent talk, this was the equivalent of digging a hole and yelling into it. It may have made me feel better for a few minutes, but it had no impact whatsoever. Am I missing something here? Are there legitimate reasons to not release code with a paper? Are there other ways that we can encourage people to include code with their papers? I’d appreciate any comments on how we can improve the current situation.
In this post, I'll talk about how we can use FAISS, an Open Source library from Facebook, to cluster a large set of chemical structures. As usual, the code associated with this post is on GitHub.
As I wrote in a previous post, K-means clustering can be a useful tool when you want to partition a dataset into a predetermined number of clusters. While there are a number of tricks for speeding up k-means (also mentioned in the previous post), it can still take a long time to run when the number of clusters, or the number of items being clustered, is large.
One of the great things about Cheminformatics these days is that we can take advantage of advances in other fields. One such advance comes in the form of a software library called FAISS that was released as Open Source by Facebook. FAISS is a library of routines for performing extremely efficient similarity searches. While many of us think about similarity searches in terms of chemical fingerprints, similar techniques are at the heart of modern commerce. Our characteristics, such as likes, dislikes, and purchases are represented as vectors. Similarity searches among these vectors enable websites to tailor content and point out that people who bought this also bought that. Of course, Facebook would like to do this quickly, thus the motivation behind FAISS.
In addition to similarity search, FAISS also provides routines to perform k-means clustering. As implemented, FAISS can run on either a CPU or a GPU. Having seen the performance increases that can be achieved on a GPU, I was initially drawn to the GPU implementation. Unfortunately, I found it difficult to get the data I was using into GPU memory. This is something I'll spend some more time with in the future, but for now, I have work to do.
Being pragmatic (ok, lazy), I decided to try out the CPU implementation. Fortunately, FAISS scales across multiple CPUs, and CPU memory is a lot easier to come by. On AWS, I can rent an m5.24xlarge instance with 96 CPUs for $4.61/hr. That's a lot of CPU! Let's see what we can do with it.
Ok, here's the problem we want to solve. We have a commercial library of 2.1 million molecules, and we would like to cluster it into 50K clusters. We can break this problem down into two steps. The first step is to generate the fingerprints and write them to disk.
Step 1 - Creating a Fingerprint File
In a previous post, I used the parquet library to store a Pandas dataframe containing fingerprints to disk. After some experimentation, I found that this approach was pretty slow when the dataset got bigger. For datasets containing more than a million fingerprints, I found that storing the data in an HDF5 file was much faster; the h5py library makes this pretty easy. The other nice aspect of HDF5 files is that it's easy to integrate other information, like SMILES strings and molecule names, into the same file.
The code in the GitHub repo has a Python script called gen_fp.py which will generate fingerprints and write them to an HDF5 file. I just went with a simple 256-bit Morgan fingerprint as the default. It's easy enough to change this if you want to use something else. Using the script is pretty simple.
gen_fp.py test2100K.smi test2100K.h5
It took me 11.9 minutes to generate fingerprints for 2.1 million molecules on a single core.
Step 2 - Clustering 2.1 million molecules into 50,000 clusters
The GitHub repo has another script called faiss_kmeans.py that performs the clustering. Again, the command line is pretty simple.
faiss_kmeans.py test2100K.h5 50000
This clustering will produce two output files.
centers.smi contains the SMILES for the most central molecule in each cluster
detail.csv contains the SMILES, molecule name, distance to the cluster center, and cluster number for each molecule in the input file.
Here's a screenshot of htop running on 96 cores while clustering with FAISS. It's a beautiful thing!
Clustering 2.1 million molecules took 22 minutes on 92 cores; not too bad. It's amazing what you can do with on-demand computing and a little bit of open source code.
I haven't spent a lot of time pushing the limits of this method, but a lot more may be possible. Stay tuned.
In Cheminformatics, we frequently run into situations where we want to compare more than two datasets. When comparing multiple datasets, we have a higher probability of chance correlation, so we must make a few adjustments to the ways in which we compare our data. In this post, we will examine the impact of multiple comparisons and talk about methods known as post-hoc tests that can be used to correct p-values when multiple comparisons are performed. We will also make a brief foray into non-parametric statistics, a technique that is appropriate for dealing with the skewed data distributions that we often encounter in drug discovery. As usual, all of the data and the code used to perform the analyses in this post are available on GitHub.
I always find it easier to understand a method when I’m presented with an example. For this case, we’ll look at a 2011 paper from Li and coworkers. In this paper, the authors compared the impact of nine different charge assignment methods on the performance of the docking program Autodock 4.2. The study was relatively straightforward. The authors used each of nine different methods to assign atomic partial charges to a set of ligands. They then looked at the impact of the charge assignment method on the ability of the docking program to reproduce the crystallographic pose. The resulting data is a table of RMS fit values (smaller is better) for each protein-ligand complex/charge assignment combination. To provide some context, here are the first few lines of the data table.
Once the data had been collected, the authors calculated the mean RMS fit for each of the charge assignment methods and determined that Gasteiger-Huckel and AM1-BCC charges produced the best results (hmm). Let’s take a look at this data and see if we agree with the conclusions from this paper. We’ll start out by doing something close to what was done in the paper. We’ll plot the mean values along with error bars showing the standard deviations.
As we can see from the plots above, Gasteiger-Huckel does indeed have a low RMSD but look at the error bars. It seems like there is quite a bit of overlap. In addition, the plot above is just a variant on the infamous dynamite plot. If you read my last post, you’ll know that there are better ways to look at distributions. Let’s look at the same data as a boxplot.
Based on the boxplots above, we can see that there is quite a bit of overlap in the distributions. This indicates that there probably isn’t a clear winner, but we’d like to know if there is a statistically significant difference in the mean values for the nine reported distributions. As many of us remember from our undergraduate Analytical Chemistry class, we can use a t-test to compare the means of two distributions. The thing that many Chemistry professors failed to mention (at least mine did) is that the t-test is only applicable when certain conditions are met.
The data is normally distributed
The data is drawn from independent distributions
The variances of the groups being compared are equal
The first condition is rarely met in drug discovery. We often deal with distributions that are skewed, bimodal, or just hard to describe. It’s often easy enough to just visualize the distribution and determine whether it’s normal. We can do that with the data from this paper. Looking at the figure below, we can see that the distributions don't appear to be normal.
There are also statistical tests such as the Shapiro-Wilk test that can be used to get a more quantitative description of the normality of a distribution. In the Shapiro-Wilk test, the null hypothesis is that the data is normally distributed, thus a p-value less than a chosen alpha (typically 0.05) indicates that the data is not normally distributed. If we calculate a p-value for our datasets with the Shapiro-Wilk test, we can see that none of the distributions is normal.
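With SciPy, the Shapiro-Wilk test is a one-liner. Here's a sketch using synthetic log-normal data as a stand-in for a skewed RMSD distribution (the data is made up for illustration):

```python
import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(42)
# log-normal values standing in for a skewed, RMSD-like distribution
skewed = rng.lognormal(mean=0.0, sigma=1.0, size=200)

# null hypothesis: the data is normally distributed
stat, p_value = shapiro(skewed)
print(f"W = {stat:.3f}, p = {p_value:.2e}")
# a p-value below our chosen alpha (0.05) rejects normality
```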
For more on statistical tests for normality and associated Python code, see this page by Jason Brownlee. If you’re interested in applications of statistics and machine learning in Python, check out Brownlee's books; they’re all great!
Let’s consider the second factor, “the data is drawn from independent distributions”. What does this mean? Let’s consider a classical clinical trial, where we have a treatment group and a control group. The treatment group is completely independent of the control group; there is no overlap of patients between the groups. Now let’s consider a case like the one above, where we are comparing the impact of different approaches for charge assignment on the performance of a docking program. In this case, each charge assignment method is applied to every molecule in the dataset. The same samples are used for every method, so the samples are not independent.
We’ve already violated two of the necessary conditions for applying the t-test, so it’s time to look for another method. Fortunately, we can rely on a field known as non-parametric statistics. Rather than dealing directly with the values in our data table, non-parametric statistics typically deals with the rank orders of the values. There are a number of non-parametric tests for comparing distributions, but one of the most commonly employed is the Mann-Whitney U test. This post is getting kind of long, so I won’t go into a lot of detail on how the Mann-Whitney test works. If you’re interested in the inner workings of this test, check out the corresponding Wikipedia Page.
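Running the test with SciPy is straightforward. Here's a sketch comparing two hypothetical sets of RMSD values (the numbers are made up for illustration):

```python
from scipy.stats import mannwhitneyu

# hypothetical RMSD values (in Angstroms) for two charge-assignment methods
method_a = [0.8, 1.1, 1.4, 2.0, 2.3]
method_b = [1.9, 2.5, 2.8, 3.1, 3.6]

# the test ranks the pooled values rather than using them directly
u_stat, p_value = mannwhitneyu(method_a, method_b, alternative="two-sided")
print(f"U = {u_stat}, p = {p_value:.4f}")
```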
While a non-parametric test like Mann-Whitney can be used with a non-normal distribution, we can’t use the p-values obtained by comparing pairs of distributions to directly compare the significance of the difference in the means when more than two methods are compared.
As an example, let’s assume we’re conducting a clinical trial of a blood pressure lowering agent, and that our null hypothesis is that the means of the distributions of responses for the treatment and control groups are equal. Let’s further assume that in order to compare the mean values for our control and treatment groups, we perform a t-test and obtain a p-value of 0.05. This p-value indicates that there is a 5% probability that a difference this large could have occurred by chance if the null hypothesis were true. Of course, this is for a single comparison. If we had performed 3 tests, our chance of falsely rejecting the null hypothesis at least once would be roughly 3 * 5%, or 15%.
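More precisely, the family-wise error rate for k independent tests at significance level alpha is 1 - (1 - alpha)^k, which the k * alpha figure approximates. A couple of lines of Python make the inflation obvious:

```python
def fwer(k, alpha=0.05):
    """Probability of at least one false positive in k independent tests."""
    return 1 - (1 - alpha) ** k

# 36 is the number of pairwise comparisons among nine methods (9 * 8 / 2)
for k in [1, 3, 10, 36]:
    print(f"{k:2d} tests: family-wise error rate = {fwer(k):.3f}")
```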
A number of corrections for multiple testing have been developed. One of the simplest is the Bonferroni correction, where the alpha value required for significance is divided by the number of tests being performed. So if we are performing 3 tests and we use 5% as our threshold for significance (p=0.05), we would need a p-value of 0.05/3, or 0.017, in order to conclude that the difference in the mean values is significant. The Bonferroni correction works well when 3 or 4 comparisons are considered, but becomes overly conservative when the number of comparisons becomes large. An alternate, and preferred, technique is known as Holm’s method. In Holm’s method, the correction factor is adjusted downward as the number of tests becomes larger. This adjustment allows Holm’s method to avoid the tendency of the Bonferroni correction to become overly conservative as the number of tests increases.
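For illustration, both corrections can be sketched in a few lines of plain Python (in practice a statistics package would handle this; the p-values below are made up):

```python
def bonferroni(pvals):
    """Bonferroni-adjusted p-values: multiply each p by the number of tests."""
    m = len(pvals)
    return [min(1.0, p * m) for p in pvals]

def holm(pvals):
    """Holm step-down adjusted p-values.

    Multiply the i-th smallest p-value by (m - i), so the correction factor
    shrinks as we move down the sorted list, and enforce monotonicity so
    the adjusted values never decrease.
    """
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        running_max = max(running_max, min(1.0, pvals[i] * (m - rank)))
        adjusted[i] = running_max
    return adjusted

raw = [0.01, 0.04, 0.02]
print([round(p, 4) for p in bonferroni(raw)])  # [0.03, 0.12, 0.06]
print([round(p, 4) for p in holm(raw)])        # [0.03, 0.04, 0.04]
```

Note how Holm's method leaves the larger p-values less heavily penalized than Bonferroni does.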
Fortunately, if we’re programming in Python, we don’t have to worry too much about the mechanics of correcting for multiple comparisons. There is a well-written and easily installed package, called scikit-posthocs, to do the work for us. Maksim Terpilowski, the author of this package, has implemented a number of methods to perform statistical tests when multiple comparisons are being made. In addition, the package displays a helpful graphic to aid in the interpretation of the results. In this graphic, all of the adjusted pairwise p-values are shown as a heatmap where significant p-values are colored green and non-significant p-values are colored red. Let’s take a look at our dataset using this representation.
It turns out that none of the differences is significant. That’s kind of a bummer and it doesn’t really show the power of scikit-posthocs. Let’s take a quick look at one more example. In this case, we’ll use a dataset that I’m using for a talk at the 2019 ACS National Meeting in Orlando. The ACS talk, entitled “The Utility of Acidic and Basic Compounds in Medicinal Chemistry”, is a follow-on to a paper that Paul Charifson and I published in J. Med. Chem. in 2014. The talk (and the paper) are focused on the different behaviors of acidic, basic, neutral, and zwitterionic compounds, and how these properties can be used in drug discovery. For the purposes of this example, we’ll look at a set of 1,763 compounds with measured solubility from the ChEMBL database. I calculated the major microspecies for each of these molecules and classified them as acidic, basic, neutral, or zwitterionic. The first few rows in the table look like this.
The critical columns in the table are "Sol_uM", the thermodynamic solubility in uM, and "Charge_State", which indicates whether the compound is acidic, basic, neutral, or zwitterionic at pH 7.4. Let's look at the data as a boxplot. We can see that neutral compounds are less soluble than charged compounds. It also looks like there's some difference between acidic and basic compounds.
Let's use scikit-posthocs to see if any of these differences are significant. If we look at the first row in the table below, we can see that there is a significant difference between the solubility of acidic and basic compounds (p < 0.05). The difference between acidic and neutral compounds is more significant (p < 0.001). However, the difference in the mean solubilities of the acidic and zwitterionic compounds is not significant. We can look across subsequent columns to examine the differences between basic, neutral, and zwitterionic compounds.
This post is already really long, so I should probably bring it to a close. Hopefully, I've shown you that there's quite a bit that needs to be done to properly compare computational methods. While there's a lot to this, packages like scikit-posthocs make it easy to do the comparisons. The next time you publish or review a paper, remember to consider the impact of multiple comparisons. Please check out the code on GitHub.
In Cheminformatics, we often deal with distributions. We may want to look at the distributions of biological activity in one or more datasets. If we're building predictive models, we may want to look at the distributions of prediction errors. A recent Twitter thread arguing that "dynamite plots" should be banned from all scientific discourse got me thinking about a few different ways to look at distributions. I thought it might be useful to put together some code examples showing how one can plot distributions. As usual, all of the code I used for analysis and plotting can be found on GitHub.
Dynamite Plots (don't use these) To demonstrate this concept, we'll look at the dataset published by Lee et al., based on an earlier publication by Wang et al., which I used in my last post. This dataset consists of molecules which had a range of activities against eight different targets. One way to plot this data is to use the classic dynamite plot. These plots, which were named for their similarity to the dynamite detonators depicted in Roadrunner cartoons, show the mean of a distribution as a bar plot and the standard deviation as a whisker above the plot.
We've all seen these plots or the even worse case where the authors simply display a bar plot displaying the mean with no error bars. As you'll see if you do a Google search for "dynamite plot", many statisticians and data geeks consider these plots to be an abomination. There is a lot not to like about dynamite plots.
They only show the mean and standard deviation
They don't tell you anything about how the data is distributed
They only show the upper whisker, which makes comparisons difficult
Box Plots One alternative to the dynamite plot is the box plot. Box plots show a distribution as a box, which represents the middle 50% of the distribution, with whiskers that define the limits of the data. The line in the middle of the box typically represents the median of the distribution. The distance from the bottom to the top of the box is referred to as the interquartile range (IQR). The box is extended with whiskers that define 1.5 * IQR above and below the top and bottom of the box. Any points outside the range defined by the whiskers are considered outliers and typically drawn as points. As you can see below, the box plot provides much more information on the overall activity distributions.
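As an illustration, a box plot takes only a few lines of matplotlib (synthetic Gaussian data stands in for real activity values here):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
# synthetic pIC50-like distributions for two hypothetical targets
target_a = rng.normal(7.0, 0.8, 500)
target_b = rng.normal(6.0, 1.2, 500)

fig, ax = plt.subplots()
# each box spans the IQR, the line marks the median, whiskers extend
# 1.5 * IQR, and points beyond the whiskers are drawn as outliers
bp = ax.boxplot([target_a, target_b])
ax.set_xticklabels(["target_a", "target_b"])
ax.set_ylabel("pIC50")
fig.savefig("boxplot.png")
```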
Beeswarm Plots An alternative to the box plot is the beeswarm plot. In the beeswarm plot, all of the data points are plotted in sets according to their y-values. If multiple points have the same value, they are arrayed horizontally. One advantage of the beeswarm plot is the ability to use color to show an additional property. For instance, we could use a beeswarm plot to display the pIC50 of the compounds, then use the color to display the fold error in the predicted pIC50.
Violin Plots Box plots tell us a lot more about how the data is distributed, but we could still be missing key parts of the picture. The box shows us where the middle 50% of the distribution lies, and the whiskers show its limits, but it doesn't tell us things like whether the data has a bimodal distribution. One way to look at these distributions is to use a violin plot. A violin plot is similar to a box plot, but instead of a box and whiskers, the data is plotted using a distribution known as a kernel density estimate (KDE). A KDE creates a smoothed curve that provides an estimate of the probability density of a distribution. One can think of a violin plot as a KDE that has been turned on its side and mirrored. The violin plot also shows a box plot at its center.
Ridge Plots From the plot above, we can see that the pIC50 distributions for most of the targets appear Gaussian, but the distributions for cdk2, mcl1, and tyk2 seem somewhat bimodal. While violin plots are useful, many people, including me, find them visually overwhelming. Most of us are more used to looking at distributions that are displayed horizontally rather than vertically.
One visualization that has been receiving a lot of attention, the ridge plot, attempts to address these limitations of the violin plot by arraying a set of overlapping KDEs in a vertical grid. I find the plot below the easiest to process. For me, this plot really highlights the differences in the distributions.
There is no single ideal plot for distributions. A lot of this depends on the aspects of the data that you want to examine or highlight. Hopefully, this post, and the corresponding code will provide some new ways for you to look at your data.
In Cheminformatics, we frequently run into situations where we want to select a subset from a larger set of molecules. K-means clustering, a simple, but often overlooked, technique can provide a useful solution. Let's look at a couple of situations where we might need to choose a subset from a larger set of molecules. After that, we'll briefly describe the method and look at how we can apply it using an Open Source implementation that is available on GitHub.
Use Cases
Let's say that a vendor has a set of 2 million compounds available for purchase, but we only have the budget to purchase 50,000. How would we choose the 50,000? The first thing we would probably want to do is to filter the 2 million to remove compounds that contain undesirable functionality or fall outside a desired property range. I wrote an earlier blog post on filtering chemical libraries so I won't spend any time on that here. Let's imagine that we did the filtering and we still have 1.75 million compounds remaining. How do we find a somewhat representative subset of 50,000? There are a number of ways to do this. During the heyday of Combinatorial Chemistry in the 1990s, far too many papers were published on methods for selecting subsets from larger compound libraries.
One relatively simple way to select subsets is to use a clustering method known as k-means clustering. One nice thing about k-means clustering is that it allows (actually requires) the user to specify the number of clusters that will be produced. This is also the biggest downside to the method. K-means doesn't use the overall structure of the dataset to determine the number of clusters. It partitions the data into the number of clusters specified by the user. If the number of clusters is small, we can sometimes end up with molecules that are very different in the same cluster. If the number of clusters is too large, we can end up with redundant clusters. Despite these limitations, k-means provides a reasonable, practical solution to a frequently encountered problem. Ultimately, this is like most of the techniques we use in Cheminformatics. We should never blindly use any method; we need to look at the results (or at least a subset) to ensure that they make sense.
We can run into similar situations when we have biological assays with relatively low throughput. Most pharmaceutical companies have compound libraries with hundreds of thousands or even millions of compounds. Unfortunately, many biological assays are labor intensive or can't be miniaturized to enable large numbers of compounds to be screened. Let's say that our corporate collection has 100,000 compounds, but our assay can only handle 10,000. We could use k-means clustering to select 10% of our screening database as a primary screen. Once we get the results of this primary screen, we could then use similarity searches in the larger database of 100,000 to select compounds for follow-up screens.
Virtual screening provides another interesting use case for k-means clustering. Let's say that we docked a million compounds into the active site of a target of interest. Of these million compounds, 10,000 have reasonable docking scores, but we only have the budget to screen 500 compounds. We could simply choose the 500 compounds with the highest docking scores, but this may not be the best strategy. Published results and lots of experience have shown that the relationship between docking scores and biological activity is somewhat weak. Another strategy might be to cluster the 10K compounds into 500 clusters and select the compound from each cluster with the highest docking score.
Once we've done the clustering in any of the cases described above, we still need to choose one or more molecules from each cluster. In the virtual screening example, the choice is somewhat obvious: we can simply select the molecule from each cluster with the highest docking score. What about selecting compounds for screening? How would we choose which molecule from each cluster to screen? There are a number of ways to approach this, and the strategy we choose will be influenced by the ultimate goal of the experiment being performed.
Some people choose the molecule that is closest to the cluster center. The thought is that this will increase the diversity of the library and lessen the chance of selecting similar molecules from adjacent clusters.
If a reliable QSAR model is available, we could choose the molecule from each cluster with the highest predicted activity.
If we are selecting molecules from an existing screening collection, we could select molecules based on performance in prior assays. We may want to avoid molecules that have been shown to be promiscuous, i.e., active in multiple assays.
When designing a library for screening one often wants to be able to quickly identify analogs for compounds which were hits in an experimental screen. One way to build a screening library where each molecule has multiple analogs is to select the molecule from each cluster with the largest number of analogs in the larger database.
There is no right or wrong way to cluster a set of molecules or to select representatives from clusters. Ultimately, choices should be influenced by the goals of the experiments being performed. In the implementation described below, I added a field to the output file called "Center" which is set to "Yes" for the molecule closest to the cluster center.
The K-means Clustering Method
As mentioned above, k-means is one of the simplest clustering methods. The method works something like this.
Choose the number of clusters; we will call this N.
Randomly select N molecules from the larger set of M molecules. These will be our initial N cluster centers. In many implementations of k-means, the initial selection of N molecules is not exactly random. A number of methods have been developed that try to spread the initial N molecules evenly over the space covered by the larger set of M molecules.
Determine the distance from each molecule in the larger set of M molecules to each of the N cluster centers. This distance can be calculated in a number of ways, but one of the most common is to calculate the Euclidean distance between the fingerprint for each molecule and the fingerprint for the molecule corresponding to the cluster center. Assign each molecule to the cluster corresponding to the closest cluster center.
Recalculate the cluster center as the geometric center of the fingerprints for all of the molecules in the cluster. We can do this by summing the individual fingerprint bits and dividing by the number of molecules in the cluster. At this point, the cluster center is no longer the fingerprint of any molecule; it is simply a point in the high-dimensional space defined by the fingerprints.
Now that we've redefined the cluster centers, we repeat steps 3 and 4 until either the cluster centers have converged (they stay in roughly the same place) or until we have reached a predefined number of iterations over steps 3 and 4.
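The steps above can be sketched in a few lines of NumPy. This is a toy implementation on 2D points for illustration only; the script described below uses a much faster library routine:

```python
import numpy as np

def kmeans(X, n_clusters, n_iter=100, seed=0):
    """Toy k-means following the steps described above."""
    rng = np.random.default_rng(seed)
    # pick N random points as the initial cluster centers
    centers = X[rng.choice(len(X), n_clusters, replace=False)].astype(float)
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # assign each point to the nearest center (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # recompute each center as the mean of its members
        new_centers = centers.copy()
        for k in range(n_clusters):
            members = X[labels == k]
            if len(members):
                new_centers[k] = members.mean(axis=0)
        # stop when the centers no longer move
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

# two well-separated blobs of 50 points each
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.5, (50, 2)), rng.normal(10.0, 0.5, (50, 2))])
centers, labels = kmeans(X, n_clusters=2)
```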
K-means clustering has a couple of nice properties, one of which is that we typically don't have to use the whole dataset to identify a set of cluster centers. If we have a large dataset, it can take a while to iterate through steps 2-4 above to identify the cluster centers. One way to "cheat", and speed up the clustering, is to carry out steps 2-4 above with a smaller sample of the larger set of M molecules. In the implementation I wrote, I use 10% of the molecules to identify the cluster centers when the dataset has more than 10,000 molecules. Once the cluster centers have been identified, we can calculate the distance from each molecule in the larger set to each cluster center and assign the molecule to the cluster corresponding to the closest cluster center.
A variant on this idea is a technique called "mini-batch k-means", where subsets are used in each iteration through steps 2-4 above. The code I wrote uses the mini-batch k-means implementation in the scikit-learn package.
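Calling mini-batch k-means from scikit-learn looks something like this (again with random bit vectors standing in for real fingerprints):

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

# toy stand-in for 1024-bit fingerprints; in practice these would come from
# RDKit fingerprints converted to numpy arrays
fps = np.random.default_rng(0).integers(0, 2, size=(5000, 1024)).astype(float)

# each iteration updates the centers using a random batch of 1000 molecules
km = MiniBatchKMeans(n_clusters=100, batch_size=1000, n_init=3, random_state=42)
labels = km.fit_predict(fps)
```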
An Open Source K-means Implementation
As mentioned above, I wrote a Python implementation of k-means clustering for molecules that is available on GitHub. The GitHub site has the details on installation and usage so I won't go into that here. In addition to the Python script that does the clustering, I added two Jupyter notebooks that provide examples of how the clustering results can be used. The first notebook shows a situation similar to the virtual screening use case described above. We start with a set of 10,000 molecules and cluster them into 100 clusters. We then plot a histogram to look at the cluster populations and use a nice trick in Pandas to select the molecule from each cluster with the lowest LogP.
The second Jupyter notebook shows a couple of ways that we can visualize the clustering results. In this case, we begin with a set of 1,000 molecules consisting of 10 sets of 100, each of which is reported to bind to a particular biological target. We cluster this dataset, which was taken from a paper by Cortés-Ciriano and Bender, then use the Jupyter notebook to visualize the clustering results as a heatmap. One cool aspect of this notebook, which was motivated by an earlier blog post from Greg Landrum, is the ability to interactively visualize the chemical structures of the compounds for each target and their cluster assignments.
This k-means clustering implementation is capable of handling relatively large datasets. I was able to cluster 1 million molecules into 10,000 clusters in 1.3 hours on my MacBook Pro. The memory requirements are relatively small, so larger sets should be doable. Hopefully, some of you will find the code useful. Please add a comment or GitHub issue if you have difficulties or improvements to the code.
I'd like to use this post to provide a few suggestions for those writing papers that report the performance of predictive models. This isn't meant to be a definitive checklist, just a few techniques that can make it easier for the reader to assess the performance of a method or model.
As is often the case, this post was motivated by a number of papers in the recent literature. I'll use one of these papers to demonstrate a few things that I believe should be included as well as a few that I believe should be avoided. My intent is not to malign the authors or the work, I simply want to illustrate a few points that I consider to be important. As usual, all of the code I used to create this analysis is in GitHub.
For this example, I'll use a recent paper in ChemRxiv which compares the performance of two methods for carrying out free energy calculations.
1. Avoid putting multiple datasets on the same plot, especially if the combined performance is not relevant. In the paper mentioned above, the authors display the results for 8 different datasets, representing correlations for 8 different targets, in the same plot. This practice appears to have become commonplace, and I've seen similar plots in numerous other papers. The plots below are my recreations from the supporting material available with the paper (kudos to the authors for enabling others to reanalyze their results). The plots compare the experimental ΔG (binding free energy) with the ΔG calculated using two different flavors of free energy calculations. Lines are drawn at 2 kcal/mol above and below the unity line to highlight calculated values that fall outside the range of predictions typically considered to be "good".
If we look at the plots above, we can see that the plot on the right appears to have a few more points outside the 2 kcal/mol window, but I find it hard to discern anything else. The inclusion of multiple datasets, each of which spans a portion of the overall dynamic range, can also give the illusion that the fit is more linear than it actually is.
An alternate approach is to trellis the data and plot the results side by side for each target. Note that I've included the coefficient of determination on each plot.
2. Include the appropriate metrics to assess the predictions. I've seen quite a few papers that only include the root mean squared error (RMSE) or mean absolute error (MAE) and don't report any statistics having to do with correlation. Some people will argue that correlation statistics will be unrealistically pessimistic for datasets with a small dynamic range. I can also argue that RMSE or MAE will be unrealistically optimistic for datasets with a small dynamic range. At the end of the day, people have differing views on which statistics should be included, so why not include them all? My recommendation would be to include the metrics listed below. There is open source code in scikit-learn or SciPy to calculate all of them.
Rather than reporting the Pearson r, I prefer to use its squared value, the coefficient of determination (R2). The R2 value expresses the proportion of the variance in the dependent variable that is predictable from the independent variable(s).
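As an illustration, the usual regression metrics can all be computed in a few lines with scikit-learn and SciPy (the ΔG values below are made up for the example):

```python
import numpy as np
from scipy import stats
from sklearn.metrics import mean_absolute_error, mean_squared_error

# hypothetical experimental vs. calculated ΔG values (kcal/mol)
expt = np.array([-9.1, -8.4, -7.9, -7.2, -6.8, -6.1, -5.5, -5.0])
calc = np.array([-8.7, -8.9, -7.1, -7.8, -6.2, -6.5, -5.9, -5.3])

rmse = np.sqrt(mean_squared_error(expt, calc))
mae = mean_absolute_error(expt, calc)
pearson_r, _ = stats.pearsonr(expt, calc)
spearman_rho, _ = stats.spearmanr(expt, calc)
kendall_tau, _ = stats.kendalltau(expt, calc)
r2 = pearson_r ** 2   # coefficient of determination
```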
3. Include error bars on your plots. Every month I see new papers which compare the performance of a variety of predictive models. In most cases, the performance metric, usually an RMSE or a correlation coefficient, is plotted as a bar graph. These graphs rarely have error bars. It's useful, and pretty easy, to calculate error bars for your correlations or for a measure like RMSE. This can be done by bootstrapping or by analytically calculating a confidence interval for the correlation. I talked about this a bit in a previous post, but the best discussion I know of is in this paper by Anthony Nicholls. In the plot below, I display R2 for each of the datasets as a bar graph with error bars drawn at the 95% confidence interval. Note that the 95% confidence intervals for these correlations are very large. This is because the confidence interval is a function of two factors: the correlation and the number of datapoints used to calculate it. If the correlation is low or the number of datapoints is small, the confidence interval will be broad.
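Here's a minimal sketch of the bootstrap approach (a hypothetical helper, not the code from the repository accompanying this post):

```python
import numpy as np
from scipy import stats

def bootstrap_r2_ci(x, y, n_boot=2000, ci=95, seed=42):
    """Bootstrap a confidence interval for the coefficient of determination."""
    rng = np.random.default_rng(seed)
    n = len(x)
    r2_vals = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)            # resample pairs with replacement
        r, _ = stats.pearsonr(x[idx], y[idx])
        r2_vals.append(r ** 2)
    lo, hi = np.percentile(r2_vals, [(100 - ci) / 2, 100 - (100 - ci) / 2])
    return lo, hi

# hypothetical experimental vs. calculated ΔG values (kcal/mol)
expt = np.array([-9.1, -8.4, -7.9, -7.2, -6.8, -6.1, -5.5, -5.0])
calc = np.array([-8.7, -8.9, -7.1, -7.8, -6.2, -6.5, -5.9, -5.3])
lo, hi = bootstrap_r2_ci(expt, calc)
```

With only eight datapoints, the interval comes out very wide, which is exactly the point being made above.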
4. Calculate the effect size for the difference in the correlations. As Anthony Nicholls points out in the paper mentioned above, when we compare two methods on the same dataset, the errors are not independent. As such, we cannot determine whether an overlap in the error bars indicates that the methods are not different at the 95% confidence level. One widely used method of comparing models is to calculate the effect size. A useful discussion of the effect size, along with some examples and code, can be found on this page. One of the most commonly used measures of effect size is Cohen's d, which for dependent samples like the ones we're dealing with, can simply be calculated as the mean of the differences in the metric being calculated, divided by the standard deviation of the differences.
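A sketch of this calculation for dependent samples (the per-target R2 values below are invented for illustration):

```python
import numpy as np

def cohens_d_paired(metric_a, metric_b):
    """Cohen's d for dependent samples: the mean of the paired differences
    divided by the standard deviation of those differences."""
    diff = np.asarray(metric_a) - np.asarray(metric_b)
    return diff.mean() / diff.std(ddof=1)

# hypothetical per-dataset R2 values for two methods on the same 8 targets
r2_method_a = np.array([0.61, 0.55, 0.70, 0.48, 0.66, 0.59, 0.73, 0.52])
r2_method_b = np.array([0.44, 0.41, 0.58, 0.35, 0.52, 0.47, 0.60, 0.40])
d = cohens_d_paired(r2_method_a, r2_method_b)
```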
Note that the formula above is slightly different from the one on the aforementioned web page. That particular example treats independent samples, where a pooled standard deviation is used. That page has a nice description of Cohen's d and its interpretation, which I will quote here.
Cohen’s d measures the difference between the mean from two Gaussian-distributed variables. It is a standard score that summarizes the difference in terms of the number of standard deviations. Because the score is standardized, there is a table for the interpretation of the result, summarized as:
Small Effect Size: d=0.20
Medium Effect Size: d=0.50
Large Effect Size: d=0.80
If we calculate Cohen's d for the example above, we get a value of 1.2 indicating a very large effect and a significant difference between the methods.
We can also translate Cohen's d into a probability between 0 and 1 by evaluating the normal cumulative distribution function at d/√2.
If we plug a Cohen's d of 1.2 into this equation, we end up with a probability of 0.8. Based on the 8 datasets we compared, FEP should outperform TI 8 times out of 10.
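A sketch of this conversion, assuming the common-language effect size Φ(d/√2), which reproduces the 0.8 figure quoted for d = 1.2:

```python
import numpy as np
from scipy.stats import norm

def d_to_probability(d):
    """Probability that a randomly chosen result from one method beats the
    other, assuming normally distributed results (common-language effect size)."""
    return norm.cdf(d / np.sqrt(2))

p = d_to_probability(1.2)  # ≈ 0.80
```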
This post provides an overview of a few factors that should be considered by authors and by reviewers of computational papers. There is a lot more to be discussed, and I hope to address related topics in future posts. Those wanting to know more about this very important topic are urged to read these three papers by Anthony Nicholls.
It has never been my intent to use this blog as a personal soapbox, but I feel the need to respond to a recent article on AI in drug discovery.
A recent viewpoint by Allan Jordan in ACS Medicinal Chemistry Letters suggests that we are nearing the zenith of the hype curve for Artificial Intelligence (AI) in drug discovery and that this hype will be followed by an inevitable period of disillusionment. Jordan goes on to discuss the hype created around computer-aided drug design and draws parallels to current work to incorporate AI technologies in drug discovery. While the author does make some reasonable points, he fails to highlight specific problems or to define what he means by AI. This is understandable. While the term AI is used frequently, most available definitions are still unclear. Wikipedia defines AI as “intelligence demonstrated by machines”, not a particularly helpful phrase. We wouldn’t consider a person who can walk around a room without bumping into things to be particularly intelligent. However, a robot that can do the same thing might be considered an AI. It is unfortunate that the popular press has inspired expectations of sentient computers, akin to HAL 9000 from the movie “2001: A Space Odyssey”, that will replicate human thought processes.
I prefer to focus on machine learning (ML), a relatively well-defined subfield of AI. In essence, machine learning can be thought of as “using patterns in data to label things.” In drug discovery, machine learning can take a number of different forms ranging from the classification of phenotypes from cellular images to models for predicting physical properties and biological activity. Much of the recent hype around AI has centered on the application of deep neural networks, often referred to as “deep learning”. While many refer to neural networks as motivated by the human brain, they are really just systems for fitting non-linear relationships to data. Over the last few years, advances in machine learning have impacted drug discovery in a number of areas.
Image analysis - The application of deep learning methods has brought about impressive improvements in the speed and accuracy achieved in standard image classification benchmarks. These improvements have already impacted drug discovery efforts. Many of the same algorithms used to identify cat pictures on the internet can be applied to the identification of specific cellular phenotypes.
Organic synthesis planning - Recent publications have shown that machine learning methods are capable of generating reasonable routes to synthetically challenging molecules. In his viewpoint, Jordan makes the point that continued use of ML in synthesis planning will become self-reinforcing and the same reactions will end up being overused. In practice, this will probably not end up being an issue. ML researchers have dealt with biased data in a number of other fields and have techniques for addressing unbalanced datasets. There are, however, two other areas that may end up having a more significant impact on the use of ML in synthesis planning. The first of these is data availability. The vast majority of electronic data on organic reactions is the property of a small number of entities. If these entities decline to supply the data, the field will have very few options. While it may be possible to create a public database of reactions from chemical patents, the extraction process is non-trivial and the scope of the extracted reactions may be limited. The other confounding factor is the absence of negative data. In order to be truly effective a synthesis planning system should have access to reactions that worked as well as those that didn’t. Unfortunately, unsuccessful reactions are rarely published in the literature or in patents, so it may be difficult to truly determine the scope of a reaction.
Predictive models - Computational chemists have been using machine learning to build QSAR models for more than 15 years. More recently, many computational chemists have augmented their more traditional machine learning approaches with deep learning methods. These methods have been employed in a few different ways. One of the first approaches was to use deep learning methods to build models based on widely used molecular descriptors. These efforts tended to generate models that, at best, provided a modest improvement over the current state of the art. In a 2015 publication, a group from Merck and the University of Toronto showed that deep learning produced an average R2 increase of 0.04 over the standard method used at Merck. Subsequent work from Google and Stanford showed that deep learning could also be used to create new molecular descriptors. To date, these methods, referred to as graph convolutions, have also shown only marginal improvements over the current state of the art. Graph convolutions have only been in existence for a few years, so it is difficult to assess their impact. Recent developments show promise, and it will be interesting to see if new representations will produce better predictive models. While some have argued that deep learning methods will be more effective on large datasets, this has yet to be demonstrated in a convincing fashion. Another potential downside to deep neural networks in QSAR is their lack of interpretability. The ultimate goal of most predictive models in medicinal chemistry is to help a discovery team interpret SAR and decide which molecules to synthesize next. To be most effective, a model should enable a team to determine the impact of specific parts of a molecule on the property, or properties, being predicted.
Quantum chemistry - A number of recent publications have reported the use of machine learning methods to reproduce quantum chemical potentials. The ANI-1 method can generate DFT level potentials, which would typically take more than two days to calculate, in less than a second. Recent reports have also shown that these methods can be used to calculate gradients and perform molecular dynamics. While these results appear promising, the extensibility and applicability of the methods is still an open question.
Jordan makes the point that AI will not have the capacity to impact new areas such as PROTACs or RNA therapies. It is true that any machine learning model is only as good as the data used to train it. Hopefully, as we improve our ability to computationally capture the fundamental nature of molecular interactions, machine learning will have more of an impact on new areas. One exciting area of machine learning research, transfer learning, may be able to provide such an impact. In transfer learning, a larger, more general dataset is used to enable a model to learn fundamental concepts. The model can then be fine-tuned on a more specific dataset. This technique is now widely used in image analysis, where training a model on a set of general images from the internet can enable the model to learn concepts like edges and objects. The model can then be fine-tuned for specific tasks like identifying tumors in radiology images. In a similar fashion, QSAR models can be initially trained on a large amount of public data (e.g. properties from quantum chemical calculations) and then fine-tuned to predict physical properties like aqueous solubility.
In his viewpoint, Jordan points out that adopting AI requires significant investment and buy-in. While it is true that machine learning may not be at a point where it is accessible to the medicinal chemist at the bench, it is certainly within the reach of any practicing computational chemist. The appearance of toolkits like scikit-learn and Keras has made it straightforward for any competent programmer to implement a machine learning model. The DeepChem project has made this even easier by integrating machine learning with widely used cheminformatics methods. Projects like Google’s AutoML, auto-sklearn, and auto-keras have automated best practices for model parameter tuning. We have also reached a point where it is no longer necessary to invest in expensive computer hardware in order to run machine learning models. Services such as Amazon Web Services, Google Cloud and Microsoft Azure have made it possible to lease the powerful GPU computer systems necessary to run machine learning programs for less than one US dollar per hour. As with any modeling exercise, ML practitioners should have a sufficient command of statistics to enable them to assess the quality of a model. It is far too easy to be fooled by a poorly validated model.
Another interesting aspect of Jordan’s viewpoint is the desire for AI systems to mimic the thought processes of a medicinal chemist and “make nonobvious and illogical connections across incomplete data sets.” I would argue that the decisions of a medicinal chemist are rarely illogical and usually influenced by prior exposure to discovery programs or from reading the literature. These decisions come from the same “chemists who made this also made this” concept that Jordan points out as a flaw in AI systems. While machine learning is not currently integrated with the medicinal chemistry literature, one can imagine a day in the near future where it will be possible to ask questions like “show me any precedented replacements for an oxazole that reduced hERG binding while maintaining metabolic stability.” Tools like this could dramatically expand the scope of medicinal chemists' knowledge beyond personal experience. Another “illogical” process employed by medicinal chemists is “pushing on the walls” of a protein binding site. Medicinal chemists will often use their intuition to intentionally introduce clashes between a ligand and protein binding site. These clashes can sometimes reveal areas where a protein is flexible and may even reveal new pockets that can be used for optimization. Given sufficient data, it should be possible to train an ML system to detect areas where an approach like this might be productive. Again, ML can be used as an adjunct to, rather than a replacement for, the chemist's intuition.
Any new development will have its share of zealots as well as its naysayers. It is definitely the case that AI in general, and particularly in drug discovery, has received more than its share of recent attention. Unfortunately, much of the current hype stems from a fundamental lack of understanding of the techniques being used as well as their intent. I can understand this confusion. Any machine learning method is simply trying to establish a relationship between a set of observables, X, and a response, Y. After watching a number of recent presentations at conferences on AI in Drug Discovery, I am often unclear on X, Y and the nature of the actual problem being solved. It is important for us to cut through the hype and focus on areas where we can make an impact. This is an exciting time for machine learning research. New developments coming from organizations like Google, Facebook, Microsoft, and Twitter are creating software with the potential to impact a wide array of fields including drug discovery. One important aspect of machine learning research is that, unlike computational chemistry and cheminformatics, almost all of the new developments are free and Open Source. This willingness to share code is pushing the field forward and enabling rapid advances. Like any field, AI is experiencing a hype phase that is creating unreasonable expectations. It is my sincere hope that, rather than falling into a period of despair, we embrace these new developments and use them as another tool to get medicines to patients.
As I've said before, my goal is not to use this blog as a soapbox. I prefer to talk about code, but I thought I should respond to Peter Kenny's comments on my post, AI in Drug Discovery - A Practical View From the Trenches. I wanted to just leave this as a comment on Peter's blog. Alas, what I wrote is too long for a comment, so here goes.
Thanks for the comments, Pete. I need to elaborate on a few areas where I may have been unclear.
In defining ML as “a relatively well-defined subfield of AI” I was simply attempting to establish the scope of the discussion. I wasn’t implying that every technique used to model relationships between chemical structure and physical or biological properties is ML or AI.
I should have expanded a bit on the statement that ML is “assigning labels based on data”, a description that I borrowed from Cassie Kozyrkov at Google. I never meant to imply that I was only talking about classification problems. The way I think about it, a numeric value can be considered a label that we are learning. Machine learning is certainly not limited to classification. All of the commonly used ML methods in Cheminformatics, random forest, gradient boosting, SVM, and neural nets support both classification and regression.
You make the assertion that ML may be better for classification than regression, but don't explain why.
"I also have a suspicion that some of the ML approaches touted for drug design may be better suited for dealing with responses that are categorical (e.g. pIC50 > 6 ) rather than continuous (e.g. pIC50 = 6.6)"
In my experience, the choice of regression vs classification is often dictated by the data rather than the method. If you have a dataset with 3-fold error and one log of dynamic range, you probably shouldn’t be doing regression. If you have a dataset that spans a reasonable dynamic range and isn’t, as you point out, bunched up at the ends of the distribution, you may be able to build a regression model.
Your argument about the number of parameters is interesting.
"One of my concerns with cheminformatic ML is that it is not always clear how many parameters have been used to build the models (I’m guessing that, sometimes, even the modelers don’t know) and one does need to account for numbers of parameters if claiming that one model has outperformed another. "
I think this one is a bit more tricky than it appears. In classical QSAR, many people use a calculated LogP. Is this one parameter? There were scores of fragment contributions and dozens of fudge factors that went into the LogP calculation, how do we account for these? Then again, the LogP parameters aren't adjustable in the QSAR model. I need to ponder the parameter question and how it applies to ML models which use things like regularization and early stopping to prevent overfitting.
I’m not sure I understand your arguments regarding chemical space. You conclude with the statement
“It is typically difficult to perceive structural relationships between compounds using models based on generic molecular descriptors”.
One of the most popular descriptors used with machine learning today is the ECFP fingerprint (or its open source cousin the Morgan fingerprint). These are the same descriptors typically used to calculate molecular similarity and establish notions of chemical space. It is relatively easy, and often good practice, to construct a representation such as a self-organizing map to understand the relationships between chemical structures and ML predictions to see if they are consistent with your current understanding of the SAR for a drug discovery project.
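For example, generating Morgan fingerprints and computing a molecular similarity takes only a few lines with the RDKit (the two molecules here, aspirin and salicylic acid, are arbitrary examples):

```python
from rdkit import Chem
from rdkit.Chem import AllChem, DataStructs

# two example molecules: aspirin and salicylic acid
m1 = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")
m2 = Chem.MolFromSmiles("OC(=O)c1ccccc1O")

# 1024-bit Morgan fingerprints with radius 2 (roughly equivalent to ECFP4)
fp1 = AllChem.GetMorganFingerprintAsBitVect(m1, 2, nBits=1024)
fp2 = AllChem.GetMorganFingerprintAsBitVect(m2, 2, nBits=1024)

sim = DataStructs.TanimotoSimilarity(fp1, fp2)
```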
I don’t see your assertion that global models are simply collections of local models as heretical. In many cases, we’re using ensemble methods like random forest where the model is a collection of predictors. I also agree that the selection of training and test sets is tricky business. Even splitting a compound set and putting the compounds synthesized earlier in a drug discovery program into the training set and those synthesized later into the test set (time-split cross-validation) does not completely remove bias. As I pointed out in my scaffold hopping post, I’m not a huge fan of dividing training and test sets based on the scaffold. In many cases, different scaffolds may make different interactions in the same binding site. If this is the case, should scaffold A enable you to make a prediction about scaffold B? Again, these factors are going to impact any predictive model, regardless of how it’s constructed.
Validation is a lot harder than it looks. Our datasets tend to contain a great deal of hidden bias. There is a great paper from the folks at Atomwise that goes into detail on this and provides some suggestions on how to measure this bias and to construct training and test sets that limit the bias.
You are correct that determining the applicability domain of ML models is still an open question. This is true of any QSAR model, regardless of how it was constructed. There is a lot more to this than simply asking if a molecule being predicted is similar to a molecule in the training set.
It’s true that early in a drug discovery project we may not have enough data to build an ML model. As I mentioned in my post, one technique which may enable us to apply ML in situations where we have limited data is transfer learning. In transfer learning, we are fine-tuning a model that was trained on a larger set of related data. This technique is widely used in image analysis but has yet to be fully validated in our field. My feeling is that transfer learning will become more relevant when we can generate molecular descriptors that capture the fundamental biophysics of a binding event.
I’m also not advocating ML models as a panacea. As computational chemists, we have a number of tools and techniques at our disposal. Like any good golfer, we should choose the right club at the right time.
I have to disagree with the statement that starts your penultimate paragraph.
“While I do not think that ML models are likely to have significant impact for prediction of activity against primary targets in drug discovery projects, they do have more potential for prediction of physicochemical properties and off-target activity (for which measured data are likely to be available for a wider range of chemotypes than is the case for the primary project targets).”
Lead optimization projects where we are optimizing potency against a primary target are often places where ML models can make a significant impact. Once we’re into a lead-opt effort, we typically have a large amount of high-quality data, and can often identify sets of molecules with a consistent binding mode. In many cases, we are interpolating rather than extrapolating. These are situations where an ML model can shine. In addition, we are never simply optimizing activity against a primary target. We are simultaneously optimizing multiple parameters. In a lead optimization program, an ML model can help you to predict whether the change you are making to optimize a PK liability will enable you to maintain the primary target activity. This said, your ML model will be limited by the dynamic range of the observed data. The ML model won't predict a single digit nM compound if it has only seen uM compounds.
In contrast, there are a couple of confounding factors that make it more difficult to use ML to predict things like off-target activity. In some (perhaps most) cases, the molecules known to bind to an off-target may look nothing like the molecules you’re working on. This can make it difficult to determine whether your molecules fall within the applicability domain of the model. In addition, the molecules that are active against the off-target may bind to a number of different sites in a number of different ways.
At the end of the day, ML is one of many techniques that can enable us to make better decisions on drug discovery projects. Like any other computational tool used in drug discovery, it shouldn’t be treated as an oracle. We need to use these tools to augment, rather than replace, our understanding of the SAR.
In this post, we will look at examples of how two different open source Python libraries can be used to generate self-organizing maps. The MiniSom library is great for building SOMs for smaller sets with fewer than 10K molecules. The Somoclu library can use either a GPU or multiple CPU cores to generate a SOM, so it's well suited to larger libraries. While Somoclu is a lot faster than MiniSom, installation on non-Linux platforms can require a bit of extra work.
I've provided example use cases for both libraries as Jupyter notebooks. Hopefully, this will make it easier for readers to experiment with these methods.
MiniSom
The MiniSom library is great for generating SOMs for smaller datasets consisting of thousands to tens of thousands of molecules. I found the MiniSom library easy to install on a Mac or a Linux platform. The MiniSom example notebook can be found here on GitHub.
Here's some benchmarking data using MiniSom. In the plot below, we compare the time required to generate a SOM for 1,000, 5,000, and 10,000 molecules using 166-bit MACCS keys and 1024-bit Morgan fingerprints. We set the number of cells in the SOM using the heuristic defined in the MiniSom docs.
number of cells = 5 * sqrt(number of molecules)
In addition, we set the maximum number of training cycles to 10*number of molecules. The benchmarks were all run on my 2017 MacBook Pro.
Somoclu
The Somoclu library can use a GPU or multiple CPU cores to accelerate the generation of the SOM. This library is a great choice for Linux systems and can be installed with a simple pip install. Unfortunately, since the default compilers on Macs don't support OpenMP, installation on non-Linux platforms can be more difficult. The Somoclu example notebook can be found here on GitHub.
Here's some additional benchmarking data showing the time required to generate a SOM with Somoclu. Since the Somoclu library has the ability to take advantage of multiple cores or a GPU, we benchmark on a larger dataset with 100,000 molecules. All of the benchmarks were run on a p2.xlarge instance on AWS (1 NVIDIA K80 GPU, 4 vCPUs). Note that the current cost for an instance like this is only $0.90/hr. It's interesting that the GPU implementation is consistently slower than the multicore CPU implementation.
A Couple of Useful Functions
The Jupyter notebooks linked above include a function which plots the SOM as a series of pie charts showing the distribution of active and inactive compounds in each cell (see below). I've found plots like this useful when examining the results of HTS and other screens. These plots provide a quick overview of how the hits are distributed over the chemical space of the compounds that were screened.
In the plot above, the hits are shown in blue, while the inactive compounds are shown in orange. We can see that the majority of the hits are in three cells, while a few other cells contain smaller fractions of hits. The SOM not only enables us to quickly locate clusters of hits but also lets us find cells which contain a mixture of active and inactive compounds. These cells can be the most interesting since they often point out small changes in chemical structure that are responsible for larger changes in biological activity.
I've also included a function that shows how to examine the contents of a particular SOM cell by generating a grid showing the structures of the molecules contained in that cell. Ideally, we'd like to have a function that will allow us to click on a cell and see a grid with the molecules in that cell. I typically do this with Vortex from Dotmatics. It shouldn't be that hard to put together a web app to do this, but that's something we'll leave for another day.
In this post, I will explain how self-organizing maps (SOMs) work. In the first part of this post, I'll explain the technological underpinnings of the technique. If you're impatient and just want to get to the implementation, skip to part 2.
A few years ago I was having a discussion with a computational chemistry colleague and the topic of self-organizing maps (SOMs) came up. My colleague remarked, "Weren't SOMs one of those 90s fads, kind of like Furbys?" While there were a lot of publications on SOMs in the early 1990s, I would argue that SOMs continue to be a useful and somewhat underappreciated technique.
What Problem Are We Trying to Solve?
In many situations in drug discovery, we want to be able to arrange a set of molecules in some sort of logical order. This can be useful in a number of cases.
Clustering. Sometimes we want to be able to put a set of molecules into groups and select representatives from each group. This may be the case when we only have resources to screen a subset of the compounds we have available. In this case, the SOM performs a function similar to what we get with k-means clustering. One advantage of the SOM is that it can provide a visual representation of the relationships between clusters.
Understanding the results of high-throughput (or other) screens. When we run a screen, one of our first objectives is to examine the results and identify potential SAR. One way to do this is to create a projection where similar compounds are in similar parts of space. As we will see in subsequent sections, SOMs can provide an overview of how hits are distributed in chemical space.
Comparing compound sets. Let's say that we have a database of compounds for fragment screening. If we have the opportunity to add new compounds to our fragment database, we would first like to see how similar the new compounds are to our current collection. The SOM provides a convenient method of visualizing which parts of chemical space are covered by each collection.
How Does the SOM work?
In order to explain SOMs, we need to take a step back and look at what we can do with a set of colors. Let's say that we have a random set of colors like the one shown on the left and we want to arrange these colors logically like the ones on the right.
As we all probably know, colors can be represented as a vector of three integers between 0 and 255, with the values representing the red, green, and blue components of the color.
Since we have colors represented as vectors, we can calculate the Euclidean distance between these vectors to identify which colors are most similar. In the figure below, we calculate the distance between a vector representing the color red and vectors representing other colors; red is closest to purple and farthest from green.
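The comparison above takes only a few lines of NumPy. The RGB values used for purple here, (128, 0, 128), are one common convention.

```python
import numpy as np

red = np.array([255, 0, 0])
green = np.array([0, 255, 0])
purple = np.array([128, 0, 128])

def euclidean(a, b):
    """Euclidean distance between two color vectors."""
    return np.linalg.norm(a - b)

print(euclidean(red, purple))  # ~180.3
print(euclidean(red, green))   # ~360.6: red is closer to purple than to green
```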
Let’s say that we want to map colors into a 10x10 grid. We can start by creating a 10x10 array of vectors. Each vector consists of 3 elements, each of which is assigned a random value between 0 and 255.
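In NumPy, that initialization is a one-liner: a 10x10x3 array of random values in the 0-255 range.

```python
import numpy as np

rng = np.random.default_rng(42)
# 10x10 grid of 3-element vectors, each component a random value in [0, 255]
grid = rng.integers(0, 256, size=(10, 10, 3)).astype(float)
print(grid.shape)  # (10, 10, 3)
```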
Now that we have a grid initialized with random vectors, we will select a random color. In this case, we'll select blue.
We then calculate the Euclidean distance between our selected vector (0,0,255) and each of the random vectors we created above. The cell closest to the vector representing blue is referred to as the best matching unit (BMU).
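Finding the BMU is a distance calculation followed by an argmin over the grid:

```python
import numpy as np

rng = np.random.default_rng(42)
grid = rng.integers(0, 256, size=(10, 10, 3)).astype(float)

blue = np.array([0.0, 0.0, 255.0])

# Euclidean distance from the selected color to every cell in the grid
dists = np.linalg.norm(grid - blue, axis=2)

# The best matching unit (BMU) is the cell with the smallest distance
bmu = np.unravel_index(np.argmin(dists), dists.shape)
print(bmu, dists[bmu])
```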
We then add some "blue" (the vector (0,0,255)) to the cells surrounding the BMU. An exponential function is used to adjust the amount of "blue" added to surrounding cells. We add large amounts of "blue" to adjacent cells. A decreasing amount of blue is added to cells as we get farther away from the BMU.
We then repeat this process of selecting a random color, identifying the closest cell to that color, and adding the selected color to the adjacent cells. Repeating this process over thousands of cycles allows us to arrange a set of vectors so that similar vectors are close together and different vectors are far apart. The video below shows how the SOM moves from a random set of colors to a set that is organized so that similar colors are close together.
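The whole loop can be sketched in NumPy. The Gaussian-style neighborhood below plays the role of the exponential falloff described above; the learning-rate and radius schedules are illustrative choices, not the ones used by any particular SOM library.

```python
import numpy as np

def train_som(data, rows=10, cols=10, n_steps=5000, lr0=0.5, seed=0):
    """Train a SOM on `data` (n_samples x n_features); return the grid."""
    rng = np.random.default_rng(seed)
    dim = data.shape[1]
    grid = rng.uniform(0, 255, size=(rows, cols, dim))
    # Each cell's (row, col) coordinates, for neighborhood distances
    coords = np.stack(np.meshgrid(np.arange(rows), np.arange(cols),
                                  indexing="ij"), axis=-1).astype(float)
    radius0 = max(rows, cols) / 2.0
    for step in range(n_steps):
        frac = step / n_steps
        lr = lr0 * np.exp(-3.0 * frac)          # decaying learning rate
        radius = radius0 * np.exp(-3.0 * frac)  # shrinking neighborhood
        x = data[rng.integers(len(data))]       # pick a random sample
        # Find the BMU for this sample
        bmu = np.unravel_index(
            np.argmin(np.linalg.norm(grid - x, axis=2)), (rows, cols))
        # Exponential falloff: cells near the BMU move most toward x
        d2 = np.sum((coords - np.array(bmu, dtype=float)) ** 2, axis=-1)
        influence = np.exp(-d2 / (2.0 * radius ** 2))
        grid += lr * influence[..., None] * (x - grid)
    return grid

# Train on random colors; similar colors end up in neighboring cells
colors = np.random.default_rng(1).integers(0, 256, size=(500, 3)).astype(float)
som = train_som(colors)
```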
Remember that we are not really dealing with colors; we are simply adjusting vectors to create a set of basis vectors. Any new vector can then be placed into the appropriate cluster by calculating the Euclidean distance from that vector to each basis vector. The cell with the smallest Euclidean distance is the BMU.
Applying the SOM to Molecules
At this point, you have probably figured out that we can use the same techniques with molecules that we just showed with colors. We can use any of a variety of methods to represent a molecule as a bit vector, where each bit represents the presence or absence of a particular substructure.
If we're building a SOM for molecules, we begin by initializing the SOM with random vectors. We then select a random molecule from our dataset, calculate its fingerprint, and treat this fingerprint as a vector. Once the fingerprint has been generated, we calculate the Euclidean distance from this vector to each of our random vectors. The random vector with the smallest Euclidean distance is selected as the BMU. The weights of the surrounding cells are then adjusted as shown above. This process is repeated several thousand times to train the SOM. After training, we have a set of basis vectors that can be used to cluster new molecules. One of the great aspects of this technique is that we can train the SOM on a subset of the data. If we have a set of 1 million molecules, we can typically train the SOM on a smaller subset (perhaps 100K) of molecules. We can then use the basis vectors generated during training to cluster the entire 1 million molecule database.
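Putting this together for molecules, the sketch below assigns fingerprint vectors to the cells of a trained SOM. The random bit vectors here stand in for real fingerprints, which in practice would come from a cheminformatics toolkit such as the RDKit; the `assign_to_cells` helper is illustrative.

```python
import numpy as np

def assign_to_cells(grid, fingerprints):
    """Return the BMU (row, col) for each fingerprint vector."""
    rows, cols, dim = grid.shape
    flat = grid.reshape(-1, dim)   # (rows*cols, dim) basis vectors
    # Pairwise Euclidean distances: each fingerprint vs. each basis vector
    d = np.linalg.norm(fingerprints[:, None, :] - flat[None, :, :], axis=2)
    best = np.argmin(d, axis=1)
    return [divmod(int(i), cols) for i in best]

rng = np.random.default_rng(0)
fps = rng.integers(0, 2, size=(1000, 64)).astype(float)  # stand-in fingerprints
grid = rng.random((10, 10, 64))                          # stands in for a trained SOM
cells = assign_to_cells(grid, fps)
```

Because assignment only requires a distance calculation against the basis vectors, clustering the full 1 million molecule database after training on a 100K subset is cheap.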
That's it for the preliminaries. If you'd like to see how this works in practice, please check out part 2 of this post.