Drug discovery is a decade-long process which often fails before clinical application or approval. Understanding its mechanisms can therefore help avoid this failure and lean towards success. For this reason, we propose an analysis of the molecular, chemical and disease-related indicators of success in drug discovery.
In particular, this website will first walk you through different classes of disease, and the corresponding targeted proteins, and how they relate to various indicators of success in academia and industry. Thereafter, we dive deep into the biochemical and molecular features of ligands, and discover how these shape the world of drug discovery.
Before presenting our answers to all these questions, it is important we become familiar with the data we are using, and how it is distributed.
On top of indicating what analyses should be carried out, this already gives us a first idea of what the world of drug discovery looks like, and what usual binding kinetics are.
This figure shows the distribution of each metric. We first note that most measurements are made at physiological pH and body temperature. Measurements made outside these ranges are expected to come from extreme conditions or very different model organisms. In addition, we observe that all binding metrics (that is, every metric aside from pH and temperature) are adequately distributed on a log scale. Two particularly important metrics in pharmacology (as reflected by their high frequency in the data), Ki and IC50, show mean values of 10^2 nM and 10^2.5 nM respectively. IC50 in particular is the most widely used measure of a drug's efficacy, indicating the concentration of a drug needed to inhibit a process by half. These values, however, are suspected to be highly influenced by the targets, diseases, and molecular features of interest, all of which will be thoroughly analysed.
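To make the log-scale reading concrete, here is a minimal sketch of how a mean such as 10^2 nM can be obtained, assuming the mean is taken over log10-transformed concentrations. The IC50 values and the function name `log10_mean` are hypothetical and serve only as an illustration.

```python
import math
import statistics

def log10_mean(values_nm):
    """Mean of log10-transformed concentrations (inputs in nM, all > 0)."""
    return statistics.fmean([math.log10(v) for v in values_nm])

# Hypothetical IC50 measurements in nM (illustrative only, not from the dataset).
ic50_nm = [10.0, 100.0, 1000.0, 100.0]

mean_log = log10_mean(ic50_nm)             # mean on the log10 scale
geometric_mean_nm = 10 ** mean_log         # back-transformed to nM
arithmetic_mean_nm = statistics.fmean(ic50_nm)

print(mean_log, geometric_mean_nm, arithmetic_mean_nm)
```

On skewed data like this, the back-transformed log mean sits well below the arithmetic mean, which is why log-scale summaries are the more natural description here.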
This leads to very important questions, and perhaps the first in the drug discovery process: how does a targeted disease affect research? Is there a relationship between targeted diseases and success indicators? How may this change over time?
Is there a relationship between the targeted diseases and success indicators?
Some disease classes are intuitively more researched than others:
Unsurprisingly, cancer is the most studied disease area in drug discovery, and interest in it is stable over time. Interestingly, when focusing solely on AIDS, we also see that this disease was a major research focus in the 1980s, coinciding with its pandemic. Other examples include immunodeficiency diseases, for which research has been steadily rising since the beginning of the century.
Still, it remains unclear how this distribution is related to actual success in drug discovery. We therefore introduce the first metrics of success: publication and patent citations.
To gain a more comprehensive understanding of how research focuses on various disease classes, research article counts and citation trends are highly relevant as first indicators of interest, influence and importance.
In addition, to get a sense of the overall importance of a disease class, we introduce a new metric of success, the H-index. The H-index (the largest number H such that H publications each have at least H citations) is a measure of impact that takes into account both the number of publications in a category and how often these are cited.
A main advantage of this index is its insensitivity to outlying data. In particular, it is only slightly affected by individual papers which may have obtained great praise, but rather indicates a consistent success.
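The definition can be sketched in a few lines. The citation counts below are hypothetical and only illustrate why a single highly cited paper barely moves the index.

```python
def h_index(citations):
    """Return the largest h such that h items have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h

# Hypothetical citation counts for two categories (illustrative only).
steady = [12, 10, 9, 8, 7, 6]     # consistently cited publications
one_hit = [500, 3, 2, 1, 1, 0]    # one outlier, little else

print(h_index(steady))   # 6
print(h_index(one_hit))  # 2
```

Although the second category has far more citations in total, its H-index is lower, which is the insensitivity to outliers described above.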
Here, we observe with no great surprise that research on cancer has the most published articles as well as citations. Neurodegeneration comes second, with many published and cited articles, closely followed by dysplasia, which, interestingly, may become a cancer with time. Thus, the most studied diseases seem to be the main ones affecting our aging society: cancer and neurodegeneration. Others include cardiac issues (QT syndrome), immune problems and inflammation, obesity and diabetes, and other neurological problems such as epilepsy. Interestingly, when looking at the H-index only, cardiac issues seem less impactful, while neurological issues such as epilepsy gain in importance. But of these studied diseases, which actually reach the pharmaceutical industry?
By looking at patents instead of articles, we observe that cancer has both the highest number of patents and the most patent citations of all disease classes. Even though it is still the fourth most patented disease class, neurodegeneration does not seem to be as important in industry as in academia. Instead, we see a rise in the importance of drugs for immunodeficiency, inflammation regulation and blood diseases, especially notable when ranked by H-index.
For drug discovery, knowing how each of these disease classes performs in academia and industry is important. The main objective, however, remains success in clinical trials and in patients.
The clinical trials data once again stresses that cancer is the most prominent disease class.
Comparing completed trials with all clinical trials shows a rapid decrease in the number of completed trials from 2008-2009 onwards. A direct conclusion, irrespective of disease, is that clinical trials take a long time to complete: around 5 years, if not more. The clinical phase of drug discovery therefore represents a major hurdle in this process.
In drug discovery, the disease studied seems to heavily affect the reach and impact of the associated research. Influenced by our aging population, we observe an especially high interest in cancer, neurodegeneration and neurological conditions, inflammatory and autoimmune diseases, as well as cardiac and blood-related diseases.
With this in mind, feel free to go back and interact with the data, exploring particular diseases of interest!
We will now investigate the next question in drug discovery, one level lower: how diseases are related to protein classes, and how these classes shape drug discovery. Indeed, as seen in the following diagram, the largest protein classes are linked to many different diseases.
But are certain targeted proteins linked to better success rates? What other factors may influence this?
We now turn our attention to the targeted proteins. Proteins are the essential binding partners for drug molecules, and targeting the right protein class can significantly influence the success of both academic research and industrial applications.
Some protein classes, like receptors and kinases, are naturally more studied due to their roles in initiating biological pathways, but how does this translate to success? The metrics presented above for diseases can help answer these questions, when applied to target classes.
Here, we observe that the three most influential protein classes to target in drug discovery are neurotransmitter receptors, hormone receptors and growth factor receptors. When looking at overall impact measured by the H-index, growth factor receptors become slightly more influential than hormone receptors, but this general trend remains.
Linking back to the most influential diseases identified earlier, it is no surprise that these targets appear, as nearly all top 10 targets, whether measured by citation count, article count or H-index, are related to neurological diseases or cancer. Interestingly though, we observe that research on cancer is spread across many more classes of proteins (growth factor receptors, CDKs, protein kinases, RTKs, etc.), while research on neurological conditions focuses on either neurotransmitter receptors or transporters.
When looking at industry influence through patents, neurotransmitter receptors and transporters don’t seem to have as much influence as in academia, in line with our observations related to diseases. Instead, non-receptor tyrosine kinases are the most patented and influential targets in drug discovery. However, this might position neurotransmitter receptors as the most promising emerging protein class to study.
Indeed, their strong position in academia hints at a very solid foundation of knowledge, which simply has not yet reached patentable results, in contrast to the historically established targets in cancer research. By the same argument, we also identify hormone receptors, which have a high academic H-index but a low industry H-index, as promising targets in drug discovery.
As with diseases, distinct target classes have very different influence and reach in drug discovery. Unsurprisingly, the main targets are linked to biological pathways of the most influential diseases found earlier. By looking at differences between academia and industry, we are able to hypothesise what might be emerging targets in drug discovery. However, much more goes into determining what brings success in drug discovery. In particular, going yet a level deeper, the molecular features of targets and ligands themselves will truly differentiate success from failure.
What are the relationships between targeted protein classes and success in drug discovery?
What molecular features determine success in drug discovery?
Why would molecular features of ligands be decisive in drug discovery? Intuitively, the answer is relatively simple: the shape, weight, charge, and even the individual atoms of a ligand decide how it binds to its target. Below, we get an idea of what these features look like, and how binding kinetics are distributed among drugs and targets.
We see here the relations between binding kinetics and four molecular properties: H-bond donors and acceptors, important in chemical binding; molecular weight, important in overall structure; and CLogP, an indicator of hydrophobicity. In general, we observe that certain features seem to be correlated, while binding kinetics allow ligands to be separated into clusters. On their own, however, these features do not indicate much with respect to drug discovery. The kinetic clusters are also difficult to define specifically, as they could be due to varying experimental paradigms as well as to ligands’ mechanisms of binding.
The most meaningful insights will therefore lie in the deeper chemical features, which originate from the ligand’s structure. We thus look directly at the fingerprints of these structures, and group all similar molecules (which will in turn have similar features and kinetics), to investigate how the molecules themselves shape drug discovery.
A molecular fingerprint is a digital representation of a molecule that allows for easy comparison with other molecules. Fingerprints capture structural information about a molecule, such as the presence of particular atoms, bonds, and functional groups, and transform this information into a series of binary values (bits). The meaningful information in these high-dimensional fingerprints can then be extracted by principal component analysis (PCA) to group molecules by structure.
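To illustrate what "easy comparison" means for bit-based fingerprints, here is a minimal sketch using the Tanimoto (Jaccard) similarity, a common way to compare binary fingerprints. This is not the PCA step used in this analysis, only a simpler pairwise comparison. The fingerprints below are hand-made sets of "on" bit positions and are purely hypothetical; in practice they would be computed by a cheminformatics toolkit.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto similarity between two fingerprints given as sets of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / len(union)

# Hypothetical fingerprints (indices of bits set to 1), illustrative only.
parent = {1, 4, 7, 12, 20}
analogue = {1, 4, 7, 12, 33}   # one substructure swapped relative to parent
unrelated = {2, 5, 9}

print(tanimoto(parent, analogue))
print(tanimoto(parent, unrelated))
```

Molecules that differ by a small modification share most of their bits and score close to 1, while structurally unrelated molecules score close to 0.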
RET
D2R
Given the large influence of cancer and neurodegeneration in drug discovery analysed previously, and the highly influential target classes of tyrosine kinases and neurotransmitter receptors, we specifically analyse the molecular features of ligands for two targets, RET (a proto-oncogene tyrosine kinase) and the dopamine receptor D2 (D2R), and their implications for research success. To do so, we first look at how these structures develop over time.
This offers great insight into the research strategy itself. Indeed, from an initial central cluster, we see that subsequent drugs are developed by branching out in distinct directions. This stresses the iterative nature of drug discovery, in which, from an initial drug, researchers and institutions modify small components (adding side chains, aromatic rings, bonds, etc.) in the hopes of finding an improved version of the drug. This improvement is shown by the decreasing IC50 in branches over time (increasing efficacy).
When focusing on specific institutions, we observe that each focuses on its own specific modifications, all published as clusters, with differences within the clusters most probably being tiny structural changes. Thus, each branch from the initial central cluster is a biochemical direction taken by an institution. But which institution found success, and most importantly, how?
To answer this question, we look at the same clusters, for D2R and RET, overlaid with the success metrics previously presented.
Following the steps of drug discovery, we first look at academic publications and their citations. Doing so, we observe that the central cluster, which, as we recall, is the earliest in time, clearly has the most academic impact. Again, this stresses how iterations of drugs rely on the original publication. Thus, success in academic research clearly relies on novelty.
When looking at molecules that are patented, we see the same importance of the central cluster. However, the newer branches discussed above also enjoy higher patent citation counts, showing how new structural modifications and increased efficacy (lower IC50) bring success in industry.
Indeed, the most successful cluster in terms of patent impact is not necessarily the first. For example, RET’s most cited patented ligand originates from improved versions of the molecule in the original academic publication. In clinical applications, we observe that the most successful cluster, when measured by the number of phase 4 trials, is again the original molecule. This may simply be due to the duration of clinical trials, which often take years before reaching phase 4. All in all, these results show how, from a first successful academic publication, researchers improve on a drug’s efficacy, as measured by IC50. But what are these molecules?
Taking RET as an example, we find the representative molecule of each cluster (the one with the median IC50). These representatives indeed show differences in both efficacy and structural elements.
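One way such a representative could be selected is sketched below, assuming each cluster is a list of ligands with a measured IC50, and taking the ligand whose IC50 lies closest to the cluster median. The cluster names, ligand names and values are hypothetical.

```python
import statistics

def representative(ligands):
    """Return the (name, ic50_nm) pair whose IC50 is closest to the cluster median."""
    median_ic50 = statistics.median([ic50 for _, ic50 in ligands])
    return min(ligands, key=lambda lig: abs(lig[1] - median_ic50))

# Hypothetical clusters of (ligand name, IC50 in nM), illustrative only.
clusters = {
    "cluster_0": [("lig_a", 900.0), ("lig_b", 450.0), ("lig_c", 1200.0)],
    "cluster_1": [("lig_d", 40.0), ("lig_e", 15.0), ("lig_f", 22.0)],
}

representatives = {name: representative(ligs) for name, ligs in clusters.items()}
print(representatives)
```

With an even number of ligands the median falls between two values, so picking the closest ligand (rather than requiring an exact match) keeps the selection well defined.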
To tie everything together, and discover which particular structural elements predict this IC50, we use machine learning tools (Logistic Regression and Random Forests), and determine which have the largest predictive power (measured by coefficients in logistic regression and mean decrease in impurity in random forests).
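As a reminder of what "decrease in impurity" measures, here is a minimal sketch for a single split on a single feature, assuming Gini impurity as the criterion (the criterion actually used is not stated here). A random forest averages such decreases over every split that uses a given feature, across all trees. The molecular weights, class labels and thresholds below are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def impurity_decrease(values, labels, threshold):
    """Gini impurity decrease when splitting samples at values <= threshold."""
    left = [lab for v, lab in zip(values, labels) if v <= threshold]
    right = [lab for v, lab in zip(values, labels) if v > threshold]
    n = len(labels)
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(labels) - weighted

# Hypothetical molecular weights and IC50 classes (illustrative only).
mol_weight = [250, 300, 320, 480, 510, 530]
ic50_class = ["high", "high", "high", "low", "low", "low"]

print(impurity_decrease(mol_weight, ic50_class, threshold=400))  # clean split
print(impurity_decrease(mol_weight, ic50_class, threshold=310))  # partial split
```

A threshold that separates the classes cleanly removes all impurity, whereas a poorer threshold removes only part of it; features that frequently yield large decreases end up with high importance.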
Looking at the logistic regression model, we observe how ligands are classified into low, medium and high IC50. In general, features such as the molecular weight, the maximum absolute partial charge, the number of aliphatic carbocycles, and count of NO groups are highly significant and influential in deciding IC50. Other molecular descriptors, such as the number of aromatic rings, carbocycles and heterocycles, have no significant predictive power for IC50. More interestingly, we look at how these coefficients change depending on the level of IC50. For example, molecular weight seems to favour low IC50 values, maximum absolute partial charge favours low to medium IC50 values, while the Hall-Kier indicator (a measure of accessibility between bonds) has larger weights when predicting higher IC50s.
Further supporting these results, we see that the most important features in the random forest model include the maximum absolute partial charge of a molecule, the count of HNO groups, the electrotopological state, the Hall-Kier index, and the number of aliphatic carbocycles. These results are therefore largely reproducible across methods, which suggests that they reflect a genuine underlying relationship rather than an artefact of one model.
All in all, based on both classifications, high drug efficacy is notably associated with higher molecular weights, higher maximum absolute partial charges, and a lower number of aliphatic carbocycles.
Throughout this analysis we have travelled from disease to molecules. This finally allows us to map the molecular path to health. In particular, we have shown how distinct structural and chemical features, such as higher molecular weights, higher maximum absolute partial charges, and a lower number of aliphatic carbocycles, increase the efficacy of drugs. In practice, this is achieved by branching out and exploring the structural space through large screenings of slight modifications to the molecule from an initial academic publication. This then brings new patents, new features and potentially future clinical trials, although these are lengthy.
At a higher level, although historically established targets, such as CDKs, RTKs and growth factor receptors, are still highly influential, there exist emerging targets, like neurotransmitter and hormone receptors. Globally, this analysis maps the current state of drug discovery, focusing on health issues of the 21st century, such as cancer, immunodeficiency, inflammation and neurodegeneration.