Phyo Phyo Kyaw Zin » Cheminformatics
I am Phyo Phyo. I will share some tips on Cheminformatics techniques in the context of drug discovery. I am passionate about Cheminformatics. My research areas include molecular informatics, computer-aided drug design, machine learning (Quantitative Structure-Activity Relationship; QSAR modeling), and molecular docking.
9M ago
Personal Update:
Before we embark on today’s tutorial, I wanted to share a personal update with you that sheds light on my recent hiatus from blogging. In the past few months, I’ve been immersed in the world of motherhood, cherishing precious moments with my newborn, Ellie.
As I transition back to my role in the biotech world, I find myself navigating the delicate balance of work and motherhood. It’s been a journey filled with new challenges, learning experiences, and a deep sense of fulfillment.
As I continue to share coding insights related to data science, visualization, and Cheminformatics ...
1y ago
Introduction
Today, we will dive into molecular substructure highlighting with RDKit – a powerful technique that illuminates the hidden intricacies within molecular compounds.
In this tutorial, I will be focusing on two things:
Highlighting a common substructure among molecules.
Highlighting differences among molecules, given a common substructure.
If you are interested in more Cheminformatics related tutorials, check my other blog posts here.
Section 1: Understanding the Power of Structure Highlighting
Before we jump into the technical aspects, let’s take a moment to appreciate why structur ...
2y ago
Today’s tutorial is on how to merge multiple datasets using the Pandas library in Python. We will add new columns based on a key column, and we will also aggregate information for columns that share the same name across datasets.
I have made five sample datasets (A1.csv, A2.csv, A3.csv, A4.csv, A5.csv) that we will be merging.
The code and the data can be found in this GitHub repository. I have organized the five datasets in this “to-merge” folder.
Each dataset contains
an ID column (key column on which we will be merging different datasets)
unique columns (other datasets don’t have that column)
c ...
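As a rough illustration of the idea described above, here is a minimal sketch of merging on a key column and aggregating a shared column. The three small DataFrames are hypothetical stand-ins for the CSV files, and the column names (weight, logp, charge, source) and the choice to join shared values with ";" are my own assumptions for illustration, not necessarily what the tutorial's repository does.

```python
from functools import reduce

import pandas as pd

# Hypothetical stand-ins for A1.csv, A2.csv, A3.csv (column names are assumptions).
a1 = pd.DataFrame({"ID": [1, 2, 3], "weight": [10.0, 20.0, 30.0], "source": ["lab1"] * 3})
a2 = pd.DataFrame({"ID": [2, 3, 4], "logp": [1.5, 2.5, 3.5], "source": ["lab2"] * 3})
a3 = pd.DataFrame({"ID": [1, 4], "charge": [0, 1], "source": ["lab3"] * 2})
frames = [a1, a2, a3]
shared = "source"  # a column name that appears in every dataset

# Step 1: outer-merge the unique columns on the key column, one dataset at a time.
unique_parts = [df.drop(columns=[shared]) for df in frames]
merged = reduce(lambda left, right: pd.merge(left, right, on="ID", how="outer"), unique_parts)

# Step 2: stack the shared column from every dataset and combine its values per ID.
stacked = pd.concat([df[["ID", shared]] for df in frames], ignore_index=True)
aggregated = (
    stacked.groupby("ID")[shared]
    .agg(lambda values: ";".join(sorted(set(values))))
    .reset_index()
)

# Step 3: attach the aggregated shared column to the merged table.
result = merged.merge(aggregated, on="ID", how="left").sort_values("ID").reset_index(drop=True)
print(result)
```

An outer merge keeps every ID that appears in any dataset and leaves missing values where a dataset has no row for that ID. Use how="inner" instead if you only want IDs present in all datasets.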
2y ago
In this two-part tutorial, I will show how to enumerate an in-silico amide-coupling library of reaction-based molecules, with Python code. In part 1, I will focus on how to write reaction-based transformations, and in part 2, I will show how to enumerate a library of full molecules from N building blocks, where you can specify how many building blocks you want per molecule.
I previously developed the PKS Enumerator and SIME software tools, which are used to design virtual libraries of macrocycles/macrolides. Both were based on a string- or template-based enumeration, and I w ...
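To make the "how many building blocks per molecule" idea concrete, here is a minimal sketch of the combinatorial part only. It enumerates ordered sequences of building-block labels with the standard library and does not perform any chemistry. The function name enumerate_sequences and the labels are hypothetical, and the actual amide-coupling transformation from the tutorial is not shown here.

```python
from itertools import product


def enumerate_sequences(building_blocks, n_per_molecule):
    """Yield every ordered sequence of n_per_molecule building blocks (repeats allowed)."""
    if n_per_molecule < 1:
        raise ValueError("n_per_molecule must be at least 1")
    yield from product(building_blocks, repeat=n_per_molecule)


# Hypothetical building-block labels; a real library would carry structures, not names.
blocks = ["Gly", "Ala", "Phe"]
library = ["-".join(seq) for seq in enumerate_sequences(blocks, 2)]

print(len(library))  # 3 building blocks, 2 per molecule: 3 ** 2 sequences
print(library[:4])
```

The library size grows as len(blocks) ** n_per_molecule, so it is worth estimating the count before enumerating a large set of building blocks.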
2y ago
In early 2021, I gave a talk at the MIDD+ Conference held by Simulations Plus Inc. on data curation using one of the projects that I worked on — the Madin-Darby Canine Kidney (MDCK) project. In this blog post, I will be focusing on the general data curation aspects of that project.
Let me emphasize why data curation is an essential step in Cheminformatics. A machine learning model can only be as good as the data it is built on. If the data is noisy, filled with activity cliffs, and full of mistakes, you won’t get any useful model. Often, cheminformatics datasets aren’t quite large, like thousa ...
2y ago
This blog post is for readers as well as myself. In this tutorial, I will show how to make different types of boxplots including horizontal, vertical, grouped boxplots, and interactive ones.
It’s not meant to be comprehensive. It’s just a collection of different styles and visualizations that I like.
For the code, you will need the following Python libraries: pandas, NumPy, Plotly, Matplotlib, and seaborn. They can all be installed with either pip or conda.
I will be using fake data to show different types of boxplots. Normally, I would create a conda environment and install these required lib ...
2y ago
Please check out the previous blog posts from this series if you haven’t done so already:
Part 1 algorithm for k-fold Cross-Validation
Part 2A of the Nested Cross-Validation & Cross-Validation Series, where I went through a Python tutorial on implementing k-fold CV regressors using random forest (RF) from scikit-learn with a simple cheminformatics dataset with descriptors and endpoints of interest.
Here in Part 2B, I will cover the Python tutorial for the dataset containing high-dimensional matrices where each matrix represents features of a chemical structure (this is taken from one o ...
2y ago
This is part 3 of the Nested Cross-Validation & Cross-Validation Series where I will explain the algorithm of nested cross-validation (NeCV), and compare Cross-Validation and NeCV.
Please read this blog first if you need to learn about cross-validation so that you can dive into NeCV after.
I would like to first clarify that there are variations in the implementations and algorithms of NeCV. The algorithm I will be describing is a common one used in several studies including an article I published in 2020.
Below is a diagram illustrating the algorithm of NeCV. (taken from https://pubs.acs.o ...
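Since implementations of NeCV vary, the following is only a sketch of one generic variant, not necessarily the exact algorithm in the diagram. An inner k-fold loop picks a hyperparameter using the outer-training data only, and the outer loop scores that choice on a held-out fold. The toy 1-D nearest-neighbours regressor, the function names, and the data are all assumptions chosen to keep the example dependency-free.

```python
import random
from statistics import mean


def k_fold(indices, k, seed=0):
    """Shuffle the indices and yield (train, test) index lists for each of k folds."""
    shuffled = list(indices)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def knn_predict(x_train, y_train, x_query, n_neighbors):
    """Toy 1-D nearest-neighbours regressor: average the targets of the closest points."""
    order = sorted(range(len(x_train)), key=lambda i: abs(x_train[i] - x_query))
    return mean(y_train[i] for i in order[:n_neighbors])


def mse(x, y, train, test, n_neighbors):
    """Mean squared error on the test indices for a model built from the train indices."""
    x_tr = [x[i] for i in train]
    y_tr = [y[i] for i in train]
    return mean((knn_predict(x_tr, y_tr, x[i], n_neighbors) - y[i]) ** 2 for i in test)


def nested_cv(x, y, grid, outer_k=5, inner_k=3):
    """Return one (chosen hyperparameter, outer-fold MSE) pair per outer fold."""
    results = []
    for outer_train, outer_test in k_fold(range(len(x)), outer_k, seed=1):
        # Inner loop: select the hyperparameter using the outer-training data only.
        def inner_score(n_neighbors):
            splits = k_fold(outer_train, inner_k, seed=2)
            return mean(mse(x, y, tr, va, n_neighbors) for tr, va in splits)

        best = min(grid, key=inner_score)
        # Outer loop: score the selected setting on data the selection never saw.
        results.append((best, mse(x, y, outer_train, outer_test, best)))
    return results


# Hypothetical toy data: a noiseless straight line.
x = [i / 10 for i in range(30)]
y = [2 * v + 1 for v in x]
scores = nested_cv(x, y, grid=[1, 3, 5])
print(scores)
```

The key point of the sketch is that the outer test fold never takes part in choosing the hyperparameter, so the outer scores are not biased by model selection.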
2y ago
A few people have asked me to explain and share the code for Nested Cross-Validation (NeCV). I think it makes sense to explain the what and why of NeCV before diving into the code, so I will be covering these topics in four separate blog posts.
For part 1, I will explain the algorithm for Cross-Validation (k-fold).
For part 2, I will explain how to implement the k-fold cross-validation algorithm in Python, with tutorials using two cheminformatics datasets: (A) a simple dataset with descriptors and endpoints of interest, and (B) high-dimensional matrices where each mat ...
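For readers who want a feel for k-fold splitting before reading the series, here is a minimal standard-library sketch. The function name k_fold_indices and its parameters are hypothetical, and scikit-learn's KFold is the usual tool in practice. The idea is simply that every sample lands in the test set exactly once.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train, test) index lists so that each sample is tested exactly once."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # shuffle once, reproducibly
    folds = [indices[i::k] for i in range(k)]  # k nearly equal-sized folds
    for i, test in enumerate(folds):
        # Training set: every fold except the one currently held out.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


for fold_number, (train, test) in enumerate(k_fold_indices(10, 5), start=1):
    print(fold_number, sorted(test), len(train))
```

In a real workflow you would fit the model on the train indices, score it on the test indices, and average the k scores.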
2y ago
This is part 2A of the Nested Cross-Validation & Cross-Validation Series. I will go through a Python tutorial on implementing k-fold CV regressors using random forest (RF) from scikit-learn with the first dataset: (A) a simple cheminformatics dataset with descriptors and endpoints of interest.
In Part 2B, I will cover the same Python tutorial for the second dataset: (B) high-dimensional matrices where each matrix represents features of a chemical structure (this is taken from one of my Ph.D. projects; MD-QSAR with Imatinib derivatives).
Please check out part 1 of this series to learn more ...