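The n-octanol-water partition coefficient (Kow) is the partition coefficient for the two-phase system consisting of n-octanol and water: the ratio of a substance's equilibrium concentration in n-octanol to its equilibrium concentration in water. It is used to compare how soluble a substance is in each of the two phases.
The original data used in this example was downloaded from the FDA website.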
The logarithm of the n-octanol-water partition coefficient, logKow, measures the relationship between the fat solubility (lipophilicity) and the water solubility (hydrophilicity) of a substance. If Kow exceeds one (logKow greater than zero), the substance is more soluble in fat-like solvents such as n-octanol. If Kow is less than one (logKow less than zero), it is more soluble in water.
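As a minimal sketch of this definition, the snippet below computes logKow as the base-10 logarithm of the ratio of the two equilibrium concentrations. The function name `log_kow` and the concentration values are hypothetical and chosen only for illustration.

```python
import math


def log_kow(conc_octanol: float, conc_water: float) -> float:
    """Return log10 of the octanol/water equilibrium concentration ratio."""
    if conc_octanol <= 0 or conc_water <= 0:
        raise ValueError("concentrations must be positive")
    return math.log10(conc_octanol / conc_water)


# Hypothetical concentrations, both in the same units (for example mol/L).
lipophilic = log_kow(2.0, 0.002)   # ratio well above one: positive logKow
hydrophilic = log_kow(0.01, 1.0)   # ratio below one: negative logKow
```

A positive result marks a compound that prefers the n-octanol phase, and a negative result marks one that prefers water.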
This value is used, among other things, to assess the environmental fate of persistent organic pollutants. Compounds with high coefficients (logKow values greater than 5) tend to accumulate in the fatty tissue of organisms (bioaccumulation).
It also plays an important role in drug research, where it provides a good estimate of how a substance is distributed within a cell between the membranes (lipophilic) and the cytosol (hydrophilic).
This value cannot be measured for every substance, so a model that predicts it well is a useful tool for developing drugs to treat diseases.
Materials and methods
This dataset contains physicochemical properties for 16,523 chemical compounds. These properties have been obtained from PubChem, a website that describes chemical compounds and stores their properties. Each property has been determined either experimentally in the laboratory or computationally via software.
The data comes from the FDA, where the raw files have been processed into tabular format. We have also used the PubChem API to retrieve the physicochemical properties of all the compounds for which PubChem has records.
The final merged data has 16,523 rows, each corresponding to a chemical compound. Each compound has data for 34 physicochemical properties, including xlogp, which is the value we are going to predict.
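The retrieval step could look roughly like the sketch below, which builds a PubChem PUG REST request for a list of compound IDs and a list of property names. The helper names `property_url` and `fetch_properties` are hypothetical, the article does not say how the API was actually called, and CID 2244 is used only as an example identifier.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"

# Property names as spelled by PubChem; this particular selection is an assumption.
PROPERTIES = [
    "MolecularWeight", "XLogP", "HeavyAtomCount", "Complexity",
    "BondStereoCount", "DefinedAtomStereoCount", "UndefinedAtomStereoCount",
    "HBondAcceptorCount", "HBondDonorCount",
]


def property_url(cids, properties=PROPERTIES):
    """Build the PUG REST URL for the given compound IDs and properties."""
    cid_part = ",".join(str(int(cid)) for cid in cids)
    prop_part = ",".join(quote(name) for name in properties)
    return f"{PUG_REST}/compound/cid/{cid_part}/property/{prop_part}/JSON"


def fetch_properties(cids, properties=PROPERTIES, timeout=30):
    """Download the property table (requires network access)."""
    with urlopen(property_url(cids, properties), timeout=timeout) as response:
        payload = json.load(response)
    return payload["PropertyTable"]["Properties"]


example_url = property_url([2244], ["XLogP", "MolecularWeight"])
```

For a dataset of this size, the compound IDs would need to be sent in batches rather than in a single request.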
This variable was calculated using different computational methods and confirmed via experimental procedures.
Because we are predicting a continuous variable (xlogp), we will build an approximation model.
For training, we will use the normalized squared error as the error term, together with an L2 regularization term. As the optimization algorithm, we will use the Quasi-Newton method.
Also, we will only use variables that we can calculate or infer from the chemical formula of the compound, that is, 9 of the 34 properties retrieved from PubChem:
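A minimal sketch of such a loss is shown below. It assumes the normalized squared error is the sum of squared errors divided by the sum of squared deviations of the targets from their mean, and that the L2 term is the squared norm of the weights scaled by a hypothetical `strength` factor. The exact definitions used by the training software may differ.

```python
def normalized_squared_error(targets, outputs):
    """Sum of squared errors divided by the targets' squared deviation from their mean."""
    mean_target = sum(targets) / len(targets)
    squared_error = sum((o - t) ** 2 for o, t in zip(outputs, targets))
    normalization = sum((t - mean_target) ** 2 for t in targets)
    if normalization == 0:
        raise ValueError("targets must not all be equal")
    return squared_error / normalization


def l2_penalty(weights, strength):
    """Assumed L2 term: strength times the sum of squared weights."""
    return strength * sum(w * w for w in weights)


def loss(targets, outputs, weights, strength=0.01):
    """Error term plus regularization term."""
    return normalized_squared_error(targets, outputs) + l2_penalty(weights, strength)
```

Under this definition, a model that always predicts the mean of the targets has an error of 1, and a perfect model has an error of 0.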
- MolecularWeight: mass of a molecule. It is calculated as the sum of the atomic mass of each constituent element multiplied by the number of atoms of that element in the molecular formula.
- HeavyAtomCount: number of non-hydrogen atoms in the chemical structure.
- Complexity: rough estimate of how complicated a structure is, seen from both the point of view of the elements contained and the displayed structural features, including symmetry. This complexity rating is computed using the Bertz/Hendrickson/Ihlenfeldt formula.
- BondStereoCount: total number of bonds with planar (sp2) stereo [e.g., (E)- or (Z)-configuration].
- DefinedAtomStereoCount: number of atoms with defined tetrahedral (sp3) stereo [e.g., (R)- or (S)-configuration].
- UndefinedAtomStereoCount: number of atoms with undefined tetrahedral (sp3) stereo.
- HBondAcceptorCount: the number of hydrogen bond acceptors in the structure.
- HBondDonorCount: the number of hydrogen bond donors in the structure.
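To illustrate how the first two descriptors follow from the molecular formula alone, here is a small sketch that parses a simple formula without parentheses or charges. The atomic masses are rounded standard values for a handful of elements, and the helper names are hypothetical. The structure-based descriptors (complexity, stereo counts, hydrogen bond counts) need the full structure and are not covered here.

```python
import re

# Rounded standard atomic masses for a few common elements (illustrative subset).
ATOMIC_MASS = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999}


def parse_formula(formula):
    """Count atoms per element in a simple formula such as 'C8H18O'."""
    counts = {}
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] = counts.get(symbol, 0) + (int(number) if number else 1)
    return counts


def molecular_weight(formula):
    """Sum of atomic mass times atom count over all elements."""
    return sum(ATOMIC_MASS[s] * n for s, n in parse_formula(formula).items())


def heavy_atom_count(formula):
    """Number of atoms that are not hydrogen."""
    return sum(n for s, n in parse_formula(formula).items() if s != "H")


octanol = "C8H18O"  # molecular formula of n-octanol
weight = molecular_weight(octanol)
heavy = heavy_atom_count(octanol)
```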
The network takes these properties as input variables. It starts with a scaling layer that has as many neurons as there are inputs.
Next come a perceptron layer of ten neurons with the hyperbolic tangent activation function and a second perceptron layer with a linear activation function. These are followed by unscaling and bounding layers with one neuron each. The number of neurons is kept small to obtain better model regularization.
Finally, our probabilistic layer, with one neuron, gives us the predicted value of the xlogp.
In the previous image, we can see the architecture of the model we will train in the next steps, with all the layers described previously.
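The forward pass through the main layers of this architecture can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the parameters are random placeholders rather than trained values, the scaling statistics and output bounds are hypothetical, and the final one-neuron output is represented by the bounded value.

```python
import math
import random


def init_params(n_inputs, n_hidden=10, seed=0):
    """Random placeholder parameters; a trained model would learn these."""
    rng = random.Random(seed)
    return {
        "x_mean": [0.0] * n_inputs, "x_std": [1.0] * n_inputs,  # scaling layer
        "w1": [[rng.uniform(-1, 1) for _ in range(n_inputs)] for _ in range(n_hidden)],
        "b1": [0.0] * n_hidden,
        "w2": [[rng.uniform(-1, 1) for _ in range(n_hidden)]],
        "b2": [0.0],
        "y_mean": 0.0, "y_std": 1.0,    # unscaling layer (hypothetical statistics)
        "y_min": -10.0, "y_max": 10.0,  # bounding layer (hypothetical limits)
    }


def dense(inputs, weights, biases, activation):
    """One perceptron layer: weighted sum plus bias, then activation."""
    return [activation(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


def forward(x, p):
    scaled = [(xi - m) / s for xi, m, s in zip(x, p["x_mean"], p["x_std"])]
    hidden = dense(scaled, p["w1"], p["b1"], math.tanh)       # ten tanh neurons
    linear = dense(hidden, p["w2"], p["b2"], lambda z: z)[0]  # linear output neuron
    unscaled = linear * p["y_std"] + p["y_mean"]              # back to xlogp units
    return min(max(unscaled, p["y_min"]), p["y_max"])         # clip to the bounds


params = init_params(n_inputs=9)
prediction = forward([0.5] * 9, params)
```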
We have built a model for predicting the xlogp of chemical compounds. This model gives us the estimated value of this coefficient for a compound.
For our trained model, the training error is 0.308775, and the selection error is 0.337858.
Next, we will use the goodness of fit of our model to describe how well it fits a set of observations. The value we will be looking at is the R2, which describes how well our model explains the variability of the data. The larger this value, the better our model explains that variability.
In the previous image, we can see how well we can predict our data. In the ideal case, all the points would lie on the black line, which means that the predicted value is equal to the real value. We have some variance, but most of our data points are clustered near the line. Our model has an R2 of 0.64, which means it explains 64% of the variability in the data.
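For reference, one common way to compute R2 is the coefficient of determination, sketched below. This is an assumption about the definition: some tools instead report the squared correlation between predicted and real values, which can give a slightly different number.

```python
def r_squared(targets, outputs):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    mean_target = sum(targets) / len(targets)
    ss_res = sum((t - o) ** 2 for t, o in zip(targets, outputs))
    ss_tot = sum((t - mean_target) ** 2 for t in targets)
    if ss_tot == 0:
        raise ValueError("targets must not all be equal")
    return 1.0 - ss_res / ss_tot


# Hypothetical real and predicted xlogp values, for illustration only.
real = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 1.5, 3.5, 3.5]
score = r_squared(real, predicted)
```

A score of 1 means every point lies exactly on the ideal line, and a score of 0 means the model does no better than always predicting the mean.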
With this, we can conclude that we have generated a model that gives a reasonable estimate of the xlogp of a compound.
- Image adapted from: ACS Omega 2017, 2, 9, 6244-6249 September 28, 2017