Noel O'Blog by Noel O'Boyle - 3M ago
There are a number of graph algorithms with not-so-obvious names, and "transitive reduction" is one of those. When I first needed to implement this algorithm, I couldn't figure out whether it had a name, and I can tell you that it's hard to google for something when you don't know what it's called.

Anyway, what transitive reduction means is removing the maximum number of edges from a directed acyclic graph while keeping the 'reachability'; that is, you can remove any edge A->B in the graph if there's some other way to get from A to B. In practice, this is super-useful for tidying up a graph representing pairwise relationships, e.g. a partially-ordered set. For example, suppose you know that A>B, B>C and A>C. The graph showing this would have three edges, but you can reduce it to just two, A->B->C, which summarises the entire ordering.
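In pure Python the definition can be sketched directly: an edge u->v of a DAG is redundant exactly when v can still be reached from u without using that edge. This is a naive brute-force illustration (the dict-of-sets graph representation is mine, not from the post), nothing like the optimised approach discussed below:

```python
def transitive_reduction(adj):
    """Remove every edge u->v where v is reachable from u
    via some longer path (DAG assumed)."""
    def reachable(start, target, skip_edge):
        # DFS from start, ignoring the direct edge skip_edge
        stack, seen = [start], set()
        while stack:
            u = stack.pop()
            for v in adj[u]:
                if (u, v) == skip_edge:
                    continue
                if v == target:
                    return True
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        return False

    reduced = {u: set(vs) for u, vs in adj.items()}
    for u in adj:
        for v in list(adj[u]):
            if reachable(u, v, (u, v)):
                reduced[u].discard(v)
    return reduced

# A>B, B>C and A>C: the edge A->C is redundant
g = {"A": {"B", "C"}, "B": {"C"}, "C": set()}
print(transitive_reduction(g))  # {'A': {'B'}, 'B': {'C'}, 'C': set()}
```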

I used this algorithm in my fingerprint paper, but more recently I've been working on another problem that also requires it; this time, though, the graphs are much bigger. At first I used the tred program that comes with Graphviz, but the graphs became too large for tred to finish within a reasonable amount of time. Next I used NetworkX, first from CPython and then from PyPy. The latter could just about handle it but required 30GB of memory and ran overnight, so I moved on to C++. First I tried igraph. This was painfully slow - slower than the Python. It all comes down to the data structures that the library is using; I'm deleting hundreds of millions of edges and, though I batched up the deletions, it was just too slow.

Finally, I ended up at Boost Graph Library. This library is unusual in that you get to choose the data structures that the graph uses. If you need fast edge removal, then you can choose a linked list in which to store the edges. For the graphs of interest, the code ran in a matter of minutes using a single GB of RAM.
```cpp
#include <iostream>
#include <fstream>
#include <boost/graph/adjacency_list.hpp>
#include <boost/graph/graphviz.hpp>

using namespace boost;

typedef adjacency_list<listS, vecS, directedS> Graph;
typedef graph_traits<Graph>::vertex_descriptor Vertex;
typedef graph_traits<Graph>::edge_descriptor Edge;

class DFS
{
public:
  DFS(Graph &graph): m_graph(graph)
  {
    seen = (long*) calloc(num_vertices(m_graph), sizeof(long)); // sets to 0
    children = (long*) calloc(num_vertices(m_graph), sizeof(long)); // sets to 0
    child_edges = (Edge*) malloc(num_vertices(m_graph) * sizeof(Edge));
  }
  ~DFS()
  {
    free(seen);
    free(children);
    free(child_edges);
  }
  void visit(Vertex root)
  {
    SEEN++;
    dfs_visit(root);
  }
  void setParent(Vertex parent)
  {
    CHILD++;
    graph_traits<Graph>::out_edge_iterator ei, ei_end;
    for (tie(ei, ei_end) = out_edges(parent, m_graph); ei != ei_end; ++ei) {
      Edge edge = *ei;
      Vertex src = source(edge, m_graph);
      if (src == parent) {
        Vertex tget = target(edge, m_graph);
        children[tget] = CHILD;
        child_edges[tget] = edge;
      }
      else {
        children[src] = CHILD;
        child_edges[src] = edge;
      }
    }
  }
private:
  void dfs_visit(Vertex v)
  {
    seen[v] = SEEN;
    if (children[v] == CHILD)
      remove_edge(child_edges[v], m_graph);
    std::pair<graph_traits<Graph>::adjacency_iterator, graph_traits<Graph>::adjacency_iterator> adj_vertices_it;
    for (adj_vertices_it = adjacent_vertices(v, m_graph); adj_vertices_it.first != adj_vertices_it.second; ++adj_vertices_it.first) {
      Vertex n2 = *(adj_vertices_it.first);
      if (seen[n2] != SEEN)
        dfs_visit(n2);
    }
  }

  Graph &m_graph;
  long *seen;
  long SEEN = 0; // flag used to mark visited vertices
  long *children;
  long CHILD = 0; // flag used to mark children of parent
  Edge *child_edges;
};

void help()
{
  std::cout << "Usage: tred input.graph output.dot\n"
            << "\n"
            << "Where the input.graph is a DAG described with the following format:\n"
            << "   NUM_VERTICES\n"
            << "   SRC_1 DST_1\n"
            << "   SRC_2 DST_2\n"
            << "   ...\n"
            << "\nVertices are assumed to be numbered from 0 to NUM_VERTICES-1\n";
  exit(1);
}

int main(int argc, char* argv[])
{
  if (argc != 3)
    help();

  unsigned VERTEX_COUNT;
  std::cout << "Reading...\n";
  std::ifstream inputfile(argv[1]);
  if (!inputfile) {
    std::cout << "ERROR: Cannot open " << argv[1] << " for reading\n";
    exit(1);
  }
  inputfile >> VERTEX_COUNT;
  Graph graph(VERTEX_COUNT);
  int a, b;
  while (inputfile >> a >> b)
    add_edge(a, b, graph);

  DFS dfs(graph);
  graph_traits<Graph>::adjacency_iterator ai, ai_end, bi, bi_end;
  graph_traits<Graph>::vertex_iterator vi, vi_end;
  std::cout << "Reducing...\n";
  for (tie(vi, vi_end) = vertices(graph); vi != vi_end; ++vi) {
    Vertex n1 = *vi;
    if (n1 % 100 == 0)
      std::cout << n1 << "\n";
    dfs.setParent(n1);
    for (tie(ai, ai_end) = adjacent_vertices(n1, graph); ai != ai_end; ++ai) {
      Vertex n2 = *ai;
      for (tie(bi, bi_end) = adjacent_vertices(n2, graph); bi != bi_end; ++bi) {
        Vertex n3 = *bi;
        dfs.visit(n3);
      }
    }
  }

  std::cout << "Writing...\n";
  std::ofstream outputfile(argv[2]);
  if (!outputfile) {
    std::cout << "ERROR: Cannot open " << argv[2] << " for writing\n";
    exit(1);
  }
  write_graphviz(outputfile, graph);
  return 0;
}
```
Noel O'Blog by Noel O'Boyle - 5M ago
One of Roger's observations came to mind recently when I was looking at some SMARTS patterns in Open Babel that were causing some problems: "People cannot write SMARTS correctly." Instead, he argues, they should use tools to create SMARTS for them. The problem is not so much that patterns don't match molecules of interest, but that they match additional molecules (false positives) and may miss others (false negatives). False positives are a particular problem when transformations (e.g. SMIRKS) are applied to matching molecules, as they can result in exotic valence and charge states that were unintended.

The following is a description of a tool I put together to help craft SMARTS patterns. It is based upon the idea that a SMARTS pattern is completely defined by the molecules it matches. Thus, by looking at the matches in a database of sufficient size, we can see whether there are any mismatches.

So far, so obvious. What's novel is that the tool only returns distinct matches based upon the simple atom types of the matching atoms, and sorts them by the probability of those atom types (most unusual first). Thus, even for common substructures, the number of hits is small and information-rich. Furthermore, because I use my favourite trick of sorting the reference database by size (ChEMBL here), the first hit found for each distinct match is likely to be small.
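The collation idea can be sketched independently of any cheminformatics toolkit. The atom-type names and match tuples below are invented for illustration; in the real tool they would come from the toolkit's atom typing of the matched atoms:

```python
from collections import Counter

# Hypothetical match data: each SMARTS hit is a tuple of simple atom
# types for the matched atoms (type names invented for illustration)
hits = [
    ("C_sp3", "O_neutral"),
    ("C_sp3", "O_neutral"),
    ("C_sp2", "O_anion"),
    ("C_sp3", "O_neutral"),
]

# Frequency of each atom type across all hits
freq = Counter(t for hit in hits for t in hit)

# Keep one representative per distinct atom-type tuple,
# sorted so that the rarest (most unusual) atom types come first
distinct = sorted(set(hits), key=lambda hit: min(freq[t] for t in hit))
print(distinct[0])  # ('C_sp2', 'O_anion')
```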

An example will illustrate: let's suppose we want to create a SMARTS pattern to match alcohols. I know that "CO" in SMILES indicates this, and so let's start with that. Only 28 hits are returned (from a 10% subset of ChEMBL), of which the first three are:
Ok, obviously we don't want a charged oxygen, so let's try "C[O+0]" and look at the first three hits:
And the oxygen should have a hydrogen, "C[Oh1+0]":
Now, about that carbon. I only had sp3-hybridised carbons in mind, so how about "[CX4][Oh1+0]"? Here is the full list of hits at this point:
At this point the hits are what I expected to see. Does this mean I should stop here? No, but it's a reasonable start.

After using this tool for a while, you begin to appreciate the importance of nailing down exactly what you want to see: if you only want to match neutral carbons, you should say so; if you don't want the oxygen to have additional neighbours, you should say so. Once every degree of freedom is nailed down...then it's finished. You might think this is going too far, but the molecules that exist in databases don't necessarily conform to typical valence rules - they may be unusual (consider TEMPO) or downright incorrect. Your SMARTS pattern should match exactly what you had in mind and not leave any wiggle room to match anything more: "[Cv4X4+0]-[OD1h1+0]"

This tool is by no means a panacea; you still need to think for yourself, and consider for example what to do about aromaticity, or the presence of explicit hydrogens or hydrogen isotopes, or whether to reorder the pattern for speed. But one could imagine a tool that could ease the burden further and do some of this for the user, or for example allow the user to mark "match" or "don't match" for the set of hits returned to automatically refine the query.
Noel O'Blog by Noel O'Boyle - 8M ago
At the last ICCS, Andrew Dalke and I got talking. We had both been around the poster session, and had heard first-hand the problems of "invalid SMILES" that were affecting the generation of SMILES strings using deep neural networks. I remember Xuhan Liu in particular asking me whether I had a solution to this. Well, I had been turning over an idea in my mind for a little while, and by the time I got talking to Andrew, I had decided to do something about it. Now Andrew knows a thing or two about SMILES himself, so I shouldn't have been surprised that he had also come up with something (along with two references from 1964 to back it up). We put the two ideas together and called it DeepSMILES.

The problematic areas of SMILES syntax involve paired ring closure symbols for cycles, and parentheses for branches. These particular aspects of the syntax are difficult to reproduce correctly when generating SMILES strings using machine-learning methods, and so a certain percentage of generated SMILES tend to fail basic syntax checks. While there have been a variety of approaches aimed at improving the SMILES generation (with quite some success), it is reasonable to assume that the syntax also causes difficulties during the learning phase.

Our approach is not to use SMILES, but instead an alternative syntax that does not have these problems. Paired ring closures are replaced by a single digit, the ring size; paired parentheses are replaced by close parentheses indicating branch size. See the talk below, the preprint, or the GitHub site for more information. Feedback (positive or negative) is welcome, either here or on GitHub.
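For a flavour of the ring-closure rewrite, here is a toy converter (my own sketch, not the actual DeepSMILES library, which also handles branches, bracket atoms and multi-digit ring closures). It drops the opening closure digit and replaces the closing one with the ring size:

```python
def to_deepsmiles_rings(smiles):
    """Toy DeepSMILES ring-closure rewrite: handles only simple
    unbranched rings of single-letter atoms with single-digit
    closure labels (a sketch, not the real deepsmiles package)."""
    out = []
    open_pos = {}  # closure digit -> index in `out` of its opening atom
    for ch in smiles:
        if ch.isdigit():
            if ch in open_pos:
                # the closing digit becomes the ring size
                start = open_pos.pop(ch)
                size = sum(1 for c in out[start:] if c.isalpha())
                out.append(str(size))
            else:
                # the opening digit is simply dropped
                open_pos[ch] = len(out) - 1
        else:
            out.append(ch)
    return "".join(out)

print(to_deepsmiles_rings("c1ccccc1"))  # cccccc6
```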

Noel O'Blog by Noel O'Boyle - 10M ago
I've done a bit of work over the last year or two stripping back some accumulated cruft in Open Babel. One question that's been on my mind during this process is: how much can you simplify the toolkit's core while still leaving it capable of complex behaviour?

Take the case of aromaticity handling. Just this part of the toolkit on its own raises many questions. Should aromaticity be lazily perceived or require an explicit function call? What happens when the user modifies a molecule? What about copying a molecule? Or just copying a subset of the atoms to another molecule? What should happen when you add two molecules together? If you read aromaticity from a SMILES string, what if it's a different model than the one the toolkit uses internally? Should the SMILES writer reperceive aromaticity, or just use it as presented?

Often the easiest solution to these problems is to always do the maximum amount of work possible (i.e. throwing away perceived aromaticity information at every opportunity), which I wanted to avoid at all costs. So I went through removing bits and simplifying things, making sure that aromaticity information was copied where possible, and hoping that in the end all of the complex behaviour that I wanted to maintain would still be possible without adding back fixes or kludges. And fortunately it was. I even managed to add additional behaviour, an option to keep the aromaticity information as read from a SMILES string.

I'm a firm believer that there's no point adding features or improving things if you don't write documentation explaining why/how/when to use them. What's the point of doing this work if no-one knows how to take advantage of it? So, I've just written up documentation on how Open Babel handles aromaticity. The following (exclusive!) excerpt describes how aromaticity information is stored by Open Babel.

Handling of aromaticity

The purpose of this section is to give an overview of how Open Babel handles aromaticity. Given that atoms can be aromatic, bonds can be aromatic, and that molecules have a flag for aromaticity perceived, it’s important to understand how these all work together.

How is aromaticity information stored?

Like many other toolkits, Open Babel stores aromaticity information separately from bond order information. This means that there isn't a special bond order to indicate an aromatic bond. Instead, aromaticity is stored as a flag on an atom as well as a flag on a bond. You can access and set this information using the following API functions:

• OBAtom::IsAromatic(), OBAtom::SetAromatic(), OBAtom::UnsetAromatic()
• OBBond::IsAromatic(), OBBond::SetAromatic(), OBBond::UnsetAromatic()

There is a catch though, or rather a key point to note. OBMols have a flag to indicate whether aromaticity has been perceived. This is set via the following API functions:

• OBMol::SetAromaticPerceived(), OBMol::UnsetAromaticPerceived()

The value of this flag determines the behaviour of the OBAtom and OBBond IsAromatic() functions.

• If the flag is set, then IsAromatic() simply returns the corresponding value of the atom or bond flag.
• If unset, then IsAromatic() triggers aromaticity perception (from scratch), and then returns the value of the flag.
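In miniature, the flag-guarded laziness looks something like the toy class below. This is a sketch of the behaviour just described, not the real Open Babel classes, and 'perception' here is faked by treating lowercase labels as aromatic:

```python
class Mol:
    """Toy sketch of the flag-guarded laziness described above
    (not the real Open Babel API)."""
    def __init__(self, atoms):
        self.atoms = atoms       # list of atom labels
        self.arom = {}           # per-atom aromaticity flags
        self.perceived = False   # molecule-level flag

    def unset_aromatic_perceived(self):
        # e.g. called after the molecule is modified
        self.perceived = False

    def _perceive(self):
        # stand-in for real aromaticity perception from scratch
        self.arom = {a: a.islower() for a in self.atoms}
        self.perceived = True

    def is_aromatic(self, atom):
        if not self.perceived:   # lazy: perceive only on demand
            self._perceive()
        return self.arom[atom]

mol = Mol(["c1", "C2"])
assert mol.is_aromatic("c1")      # first call triggers perception
mol.unset_aromatic_perceived()    # e.g. after modifying the molecule
assert not mol.is_aromatic("C2")  # reperceived from scratch
```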

See https://open-babel.readthedocs.io/en/latest/Aromaticity/Aromaticity.html for the nail-biting conclusion to this thrilling exposition.
Noel O'Blog by Noel O'Boyle - 11M ago
What happens when four software developers get together and compare their implementations of the CIP rules? This is the background to a recent preprint deposited in ChemRxiv.

The CIP system is a series of rules that describe how to assign a stereodescriptor (e.g. R/S, E/Z) to a stereocentre. When Bob Hanson decided to add support for CIP to Jmol, rather than simply read the rules and implement them according to his own interpretation as others have done, he decided to work with three other implementations to challenge each other on disagreements and clarify the wording of the rules.

The result is described in:
Algorithmic Analysis of Cahn-Ingold-Prelog Rules of Stereochemistry: Proposals for Revised Rules and a Guide for Machine Implementation. Robert M. Hanson, John W. Mayfield, Mikko J. Vainio, Andrey Yerin, Dmitry Redkin, and Sophia Musacchio [https://doi.org/10.26434/chemrxiv.6342881.v1]

Essentially, the issue that the authors are addressing is the fact that existing implementations, even in "highly respected software packages", disagree with each other (see John Mayfield's presentation). By comparing the implementations in Jmol, Centres, Balloon and ChemSketch, they were able to identify cases where:
"(a) the disagreement was due to different interpretations of CIP rules among software developers, (b) there was a problem with an algorithm or its implementation in code, or (c) the CIP rules themselves were flawed."

In all cases, however, they were able to come to a consensus, which led to "the discovery of a small number of errors in the Blue Book, two minor flaws in the CIP rules, and a proposal for a new rule".

The paper walks through their discussions of each rule in turn, looking at any issues arising and clarifying any ambiguities. It also includes a validation suite that covers all aspects of the rules and will allow future CIP implementations to avoid the pitfalls that have beset the field in the past.
Noel O'Blog by Noel O'Boyle - 1y ago
I've already gotten the "valid SMILES" issue off my chest, but there's something else that's been bugging me about deep learning papers in chemistry, and that's the use of canonical SMILES for training. Why would you use canonical SMILES? And if you did, wouldn't that introduce some problems?

This is one of those cases where something just feels like a mistake to me, and I'm going to do my best to articulate my concerns. I'm not familiar enough with the mechanics of DNNs to be sure, but I hope those in the field will consider the points I raise.

The problem

So, canonical SMILES. Behind the scenes, the canonicalisation procedure gives a unique label to each atom (typically 1 to N). These labels are then used to generate a canonical SMILES, typically by starting at the lowest label (but not necessarily). The canonicalisation procedure is based upon the attributes of the graph, with the result that the first atom in a canonical SMILES tends to favor particular atom types and avoid others.

For example, if you look at the atom types for the first atom in the canonical SMILES generated by RDKit for ChEMBL molecules, you will find that the second most common atom type in ChEMBL (namely, *-C(-*)=*) never appears as the first atom in a canonical SMILES string. This is by design and you'll see similar behaviour with other toolkits - SMILES strings tend to start with degree 1 atoms.
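One way to see the bias is simply to tally the first atom across a batch of SMILES strings. A rough sketch (the regex tokenizer is deliberately crude and only grabs the leading atom; a real analysis would use a toolkit's atom typing, as in the RDKit experiment mentioned above):

```python
from collections import Counter
import re

# Crude first-atom extraction: a bracket atom, a two-letter halogen,
# or an organic-subset symbol at the start of the string
# (toy tokenizer; ignores ring-closure digits, stereo, etc.)
FIRST = re.compile(r"^(\[[^]]+\]|Cl|Br|[BCNOSPFIbcnops])")

def first_atom(smiles):
    m = FIRST.match(smiles)
    return m.group(1) if m else None

smiles_list = ["CCO", "OCC", "c1ccccc1C", "Clc1ccccc1", "[nH]1cccc1"]
print(Counter(first_atom(s) for s in smiles_list))
```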

So what if the distribution of atom types is different for the first atom?

Well, firstly, I predict that as a result these atom types will be over-represented in structures generated by DNNs (and others under-represented). If you train on canonical SMILES, then the probabilities for the first atom will be determined by the atom types favored as starting atoms by canonical SMILES. Consider the extreme example where fluorine always occurs as the first atom in any canonical SMILES that contains it; you should see an increased number of fluorines as the first atom in the generated molecules. Now you could argue that the probabilities for the remaining atoms will be adjusted accordingly, but I believe that there is a strong edge effect and that any correction will attenuate as the SMILES string becomes longer.

Secondly, this bias makes the DNN's job harder. Instead of a relatively even distribution of atom types at all points in the SMILES string, the distribution will depend on the distance from the starting atom. It's now trying to learn something about the properties of canonical SMILES instead of concentrating on the task at hand...

...which brings me nicely to the third point. Predictive models attempt to deduce a property value from the structure, and a SMILES string is used by DNNs as a proxy for the structure. Using a canonical SMILES string is another step removed. What about a molecule with a very similar structure but a very different canonical SMILES? Surely the goal of a robust model is to handle this well. Restricting good predictive power to only those structures that are both similar and have similar canonical SMILES is to develop a model with a reduced applicability. A fun task is to measure the degree to which this fitting to canonical SMILES occurs; this is left to the reader.

The solution

The solution is simple. Use random SMILES. A single one, or multiple. The use of multiple random SMILES has already been described by Thomas Bergwinkl and subsequently by Esben Jannik Bjerrum as a 'data augmentation technique', but I see it as just avoiding the inherent bias of using canonical SMILES. But either way, I like this quote from Thomas:
The output for alternative representations of a molecule should be the same, if you understand SMILES. Using alternative representations in the test data allows to verify if the neural network understands SMILES. Spoiler: After a while the output becomes very close for the alternatives!

So why do people use canonical SMILES in the first place? I have my theory on this.

I believe it's because the generative models more quickly converge to generation of syntactically valid SMILES strings when they train on canonical SMILES. And for some reason, the percentage of syntactically valid SMILES strings generated by the model has become a figure of merit.

But this makes no sense - who cares what this percentage is? Sure, we can all overfit to canonical SMILES and get high percentages quickly. But how is this a good thing? You know that feeling you get when you see a QSAR model with very high R2 on training data - that's how I feel when I see a high value for this percentage. If it's actually doing what it's supposed to be doing (i.e. learning the underlying structure rather than the training set of canonical SMILES), then the percentage should really be lower. What do I care if the percentage of syntactically valid SMILES is 1%? So long as that 1% solves my problem, it's irrelevant - these models spew out structures by the thousands per second (I presume, but even so).

Please let him stop talking now
Okay, okay - I'm done. What do you think?
Noel O'Blog by Noel O'Boyle - 1y ago
Just back from the recent ICCS in Noordwijkerhout, the Netherlands. I really enjoyed it although, as you can see from the picture, it was hard work at times.

Here are my notes on the scientific program, which I have just extracted from Twitter. Naturally, all errors are my own - I may have misunderstood or lost the thread. Also I didn't take notes on all of the talks.

If you are interested in a particular talk, and paste the provided link into a browser, you can see if anyone on Twitter added a comment on the tweet.

If you want to follow the entire Twitter conversation on the meeting, which used the hashtag #11thICCS, go to this link. Again, if a particular Tweet has replies, you need to click on it to see them.

As well as taking notes, I also presented a poster entitled "Can we agree on the structure represented by a SMILES string? A benchmark dataset". For more info, follow the link over to the NextMove blog.

Note to self:
Next time include the hashtag with every tweet. Otherwise it's hard to extract, and hard for attendees to follow automatically.
Image credit:
Image courtesy of Jason Cole.
Noel O'Blog by Noel O'Boyle - 1y ago
I'm a big fan of genetic algorithms. I used one back in the day for inverse design of molecular wires. They are deceptively simple, so much so that when you see one in action, churning out molecules that seem to magically improve generation on generation, it's a real eye-opener.

A key part of a genetic algorithm is the application of the crossover and mutation operators. The idea is to take a population of existing molecules and somehow combine pairs (crossover) while changing the structure slightly (mutation). When dealing with chemical structures, you can imagine operators that work directly on the chemical graph, or those that work on a proxy such as a SMILES string. Applying a genetic algorithm to molecules isn't a new idea (there was a patent back in 1993 from Dave Weininger on this topic) but that doesn't mean it's not still useful. And in contrast to the generative DNN models that have recently taken the limelight, using GAs is fast, doesn't require GPUs, and is easy to understand and improve. Are the results worse than using DNNs? I note that it is so claimed.

If the idea of using a GA sounds interesting, you may find the following code useful as a starting point. I've tried to make it as simple as possible while still doing something useful: given N random molecules from ChEMBL, chosen to be around the size of abilify but with ECFP4 of less than 0.2 to abilify, can we optimise to a structure with ECFP4 close to abilify? In my implementation, the mutation operator is very basic - it simply swaps adjacent characters in the SMILES string.
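The full script isn't reproduced here, but the operators can be sketched as below. The adjacent-character swap is the mutation described above; the one-point splice crossover is my own assumption, since the post doesn't specify its crossover:

```python
import random

def mutate(smiles, rng=random):
    """The mutation described in the post: swap two adjacent characters."""
    if len(smiles) < 2:
        return smiles
    i = rng.randrange(len(smiles) - 1)
    chars = list(smiles)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def crossover(a, b, rng=random):
    """An assumed one-point splice; parents need two or more characters."""
    i = rng.randrange(1, min(len(a), len(b)))
    return a[:i] + b[i:]

rng = random.Random(42)  # fixed seed for reproducibility
child = crossover("CCOCC", "CCNCC", rng)
print(mutate(child, rng))
```

Note that, as with the GA in the post, the offspring strings need not be valid SMILES; the fitness function is what drives them towards sensible structures.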

The challenge for the reader, if you choose to accept it, is to change/improve this genetic algorithm to get better performance. If you succeed, post the code publicly and leave a comment below. My recommended place to start would be to write a more sophisticated mutation operator (e.g. one that can introduce and break rings, and introduce new genetic material).

Notes:
1. The SMILES strings generated by the GA can look a bit iffy. However, remember that the goal of the GA is not to create nice SMILES strings, but to optimise a chemical structure - how it gets there is somewhat irrelevant. To tidy up the SMILES for consumption by other programs, just run them through Open Babel.
2. Requires the development version of Open Babel for ECFP support
3. The code illustrates again the use of a valence check. Note that the valence check differs depending on the use case. Here I only allow valences present in abilify, whereas in the previous blog post I was trying to quality control a chemical database.
4. The input database should consist of "anti-canonical" SMILES, e.g. generated as described in a previous blog post. Originally I did this on-the-fly but there's currently no way to control the random seed used in Open Babel (and I like reproducibility).
5. The code above is not slow, but you could make it go faster by using multiple processors to create the children.
Noel O'Blog by Noel O'Boyle - 1y ago
The previous post used a fairly strict definition of a ring system; a set of atoms, all of which are in a ring, and which are connected by bonds, all of which are in a ring. But this ignores any attached atoms, and a case can be made that the removal of these attached atoms completely alters the identity of a ring system.

As a first step towards considering these atoms, a value of 2 can be passed to CopySubstructure as the fourth parameter. This sprouts dummy atoms to replace missing bonds, and so all para-substituted phenyl rings (for example) become *c1ccc(*)cc1.

To go beyond this and actually recover the identity of the attached atoms, well, that's a little bit trickier. To begin with, we need to think about which atoms we want to copy. Depending on your application, you might want to focus on atoms that influence a particular aromaticity model, or have tautomeric potential. Here I'm going to include any atom that's doubly bonded, and any non-C atoms that are singly-bonded.

The most obvious way to do this is to add these atoms to the atomsToCopy variable before doing the copy. Unfortunately, that approach quickly runs into difficulties as the atom might be attached to two rings (e.g. C1CC1OC2CCCC2), or worse still, the atom might itself be in a ring (e.g. C1CC1=C2CCCC2). In fact, the only way to do this "straightforward approach" is to first count the ring systems and label the atoms belonging to each (e.g. with a flood-fill algorithm), and then one-at-a-time copy each ring system and the attached atoms. It works, but elegant it ain't.

So let's do it a different way, a way that turns out to be quite a bit simpler and handles all of the ring systems in a single copy. Remember above when I mentioned para-substituted phenyl rings becoming *c1ccc(*)cc1? No? Well, I did. Anyway, let's suppose that we know the original atom to which each dummy atom corresponds; we could then change the dummy atom to have the same atomic number and charge as the original. Well, what d'yaknow - the fifth parameter to CopySubstructure can be used to find exactly this correspondence.

Oh yeah, there's one catch. Any stereo involving the attached atoms will be missing (I didn't implement this in CopySubstructure). If this is important, then you'll need to use the first approach I described above. Anyway, here's the code, which should be used in combination with the code samples in the original post. For the two examples listed above, C1CC1OC2CCCC2 and C1CC1=C2CCCC2, it will produce C1CC1O.C1(CCCC1)O and C1CC1=C.C1(=C)CCCC1.
```python
atomorder = ob.vectorUnsignedInt()

def CopyRingSystem_3(orig, copy):
    atomsToCopy.Clear()
    for atom in ob.OBMolAtomIter(orig):
        if atom.IsInRing():
            atomsToCopy.SetBitOn(atom.GetIdx())
    bondsToExclude.Clear()
    for bond in ob.OBMolBondIter(orig):
        if not bond.IsInRing():
            bondsToExclude.SetBitOn(bond.GetIdx())
    atomorder.clear()
    ok = orig.CopySubstructure(copy, atomsToCopy, bondsToExclude, 2, atomorder)
    assert ok
    mark_for_deletion = []
    for atom in ob.OBMolAtomIter(copy):
        if atom.GetAtomicNum() != 0:
            continue
        bond = next(ob.OBAtomBondIter(atom))
        bondorder = bond.GetBondOrder()
        nbr = bond.GetNbrAtom(atom)
        origatom = orig.GetAtom(atomorder[atom.GetIdx() - 1])
        if bondorder == 2 or origatom.GetAtomicNum() != 6:
            # Turn the asterisk into the 'original atom'
            atom.SetAtomicNum(origatom.GetAtomicNum())
            atom.SetFormalCharge(origatom.GetFormalCharge())
            valence = origatom.BOSum() + origatom.GetImplicitHCount()
            atom.SetImplicitHCount(valence - bondorder)
        else:
            nbr.SetImplicitHCount(nbr.GetImplicitHCount() + bondorder)
            mark_for_deletion.append(atom)
    for atom in reversed(mark_for_deletion):
        copy.DeleteAtom(atom)
    return ok
```
Noel O'Blog by Noel O'Boyle - 1y ago
I've just added a method called CopySubstructure() to Open Babel. It's something I've missed for a while, the ability to copy part of a molecule into a new one. The basic usage is where the user gives a list of atoms to copy. When you copy the substructure, all bonds between atoms in the substructure are retained (and only those). By default, hydrogen counts are adjusted to account for missing bonds.

As an example of use, if you copy all atoms that are in rings, you will find all of the ring systems:
```python
import pybel
ob = pybel.ob

atomsToCopy = ob.OBBitVec()
bondsToExclude = ob.OBBitVec()

def CopyRingSystem(orig, copy):
    atomsToCopy.Clear()
    for atom in ob.OBMolAtomIter(orig):
        if atom.IsInRing():
            atomsToCopy.SetBitOn(atom.GetIdx())
    return orig.CopySubstructure(copy, atomsToCopy)
```
So Clc1ccccc1OC2CCC2 will become c1ccccc1.C2CCC2. If you do this for all of ChEMBL, split on the dots, and collate using the canonical SMILES, you can find the frequencies of all ring systems. Well, almost...
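The split-and-collate step is a one-liner once the per-molecule ring-system SMILES are in hand. In this sketch the (already canonical) SMILES are hard-coded; in practice they would come from running the code above over ChEMBL:

```python
from collections import Counter

# One copied-substructure SMILES per molecule; each '.'-separated
# fragment is a ring system (canonicalisation assumed already done)
ringsystem_smiles = ["c1ccccc1.C1CCC1", "c1ccccc1", "C1CCC1.C1CCC1"]
counts = Counter(rs for s in ringsystem_smiles for rs in s.split("."))
print(counts.most_common())  # [('C1CCC1', 3), ('c1ccccc1', 2)]
```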

Even for this 'simple' use case, there are some subtleties that may trap the unwary when copying substructures of a molecule. For example, aromaticity is copied as present in the original molecule, and this may or may not be what one wants - it's up to the user to decide, and to mark aromaticity as unperceived if they wish. In this context, it's worth bearing in mind that the Daylight aromaticity model includes contributions from some exocyclic double bonds (and these won't be present with the code above).

Another quirk is that while it may seem that the code above creates SMILES strings for isolated ring systems, that's not quite true. If two ring systems are connected by a bond, then that bond will be retained even where it itself is not in a ring (because all bonds between atoms in the substructure are retained - see the very first paragraph above). For example, c1ccccc1C2CCC2 will stay c1ccccc1C2CCC2.

Which brings us to the second argument to the method, a list of bonds to exclude. To get the behaviour we might expect, all we need to do is add all bonds that are not in a ring to this list of bonds to exclude:
```python
# ...continued from above
def CopyRingSystem_2(orig, copy):
    atomsToCopy.Clear()
    for atom in ob.OBMolAtomIter(orig):
        if atom.IsInRing():
            atomsToCopy.SetBitOn(atom.GetIdx())
    bondsToExclude.Clear()
    for bond in ob.OBMolBondIter(orig):
        if not bond.IsInRing():
            bondsToExclude.SetBitOn(bond.GetIdx())
    return orig.CopySubstructure(copy, atomsToCopy, bondsToExclude)
```

Notes:
Here is a main() that could be used with the code above:
```python
if __name__ == "__main__":
    copied = ob.OBMol()
    with open("outputC.smi", "w") as out:
        for mol in pybel.readfile("smi", r"C:\Tools\LargeData\sortedbylength.smi"):
            copied.Clear()
            ok = CopyRingSystem_2(mol.OBMol, copied)
            assert ok
            if copied.NumAtoms() > 0:
                smi = pybel.Molecule(copied).write("smi").rstrip()
                out.write("\n".join(smi.split(".")))
                out.write("\n")
```
