Emilio Dorigattiᵃ, Jonathan Großᵇ, Jonas Kühlbornᵇ, Robert Möckelᵇ, Frank Maier*ᵃ and Julian Keupp*ᵇ
ᵃ Development NCE, Analytical Development, Boehringer Ingelheim Pharma GmbH & Co. KG, D-55218, Ingelheim (Rhein), Germany. E-mail: frank.maier@boehringer-ingelheim.com
ᵇ Development NCE, Chemical Development, Boehringer Ingelheim Pharma GmbH & Co. KG, D-55218, Ingelheim (Rhein), Germany. E-mail: julian.keupp@boehringer-ingelheim.com
First published on 24th July 2025
Liquid chromatography-tandem mass spectrometry (LC-MS/MS) is an essential analytical technique in the pharmaceutical industry, used particularly for elucidating the structure of unknown impurities in the synthesis of active pharmaceutical ingredients. However, the interpretation of mass spectra is challenging and time-consuming, requiring significant expertise. While computational tools aimed at automating this process have recently been developed, their limited accuracy in determining the chemical structure restricts their use in practice. In this paper, we introduce a new method called SEISMiQ for elucidating unknown impurities from their MS/MS spectra. We significantly improve elucidation accuracy by integrating domain experts' knowledge, specifically the impurity sum formula and known substructure, into the model's training and inference process. Further performance improvements can be achieved through transfer learning using simulated MS/MS spectra of impurities from an in-house database. Finally, the need for any experimental data collection for finetuning can be circumvented by simulating the entire drug substance synthesis process in silico via reaction templates.
Several computational approaches have been developed to increase the speed and reliability of the MS/MS spectra interpretation workflow, with a particular focus on metabolomics.4–7 Initial in silico solutions ranked molecules in a given list of candidates to surface molecules whose mass spectrum would be most similar to the given spectrum.8–12 Such procedures were generally based on predicting relevant structural information from the MS/MS spectrum and matching these with the corresponding structural information computed from the candidate molecules. While this ranking approach could help practitioners in daily work, it is limited by its inability to propose novel structures not already in the initial list. The recent evolution of deep generative models removed the necessity of a pre-specified list of candidates and enabled de novo structural elucidation where the molecular structure is predicted from scratch rather than by searching a known pool of molecules.13–18 The major challenge in the field is the inherent ambiguity of MS/MS spectra and the relative scarcity of open datasets and benchmarks, with the largest available covering only about 29000 different molecules.19 The difficulty of obtaining high quality expert annotations of MS/MS spectra is likely to prevent the growth of such available data to the amounts used to train the latest molecular generative models.20,21 Common workarounds for this issue include using pretrained models15,16,18 and augmenting the training set with large numbers of simulated MS/MS spectra,14,22 approaches which have seen notable innovations recently.7,23–28
While these developments have greatly raised de novo elucidation accuracy, the performance of these models is not yet at the level desired by practitioners to enhance their productivity in drug substance impurity elucidation. In order to correctly elucidate impurities from MS/MS spectra, analytical chemists leverage a wide range of domain knowledge regarding the synthetic route that generated the impurity, including starting materials and their impurities, the conditions under which reactions take place, possible unwanted side reactions, over-reaction and others (Fig. 1a). This information provides substantial insights about the potential impurity structure, including for example fragments shared with the main compound and sites of variation. By focusing on a purely de novo setting, current models for structural elucidation remain unable to leverage this knowledge and as a result do not achieve the desired accuracy level, while at the same time making easily avoidable elucidation mistakes. Motivated by this, we introduce a novel method, which we call SEISMiQ, for elucidating small molecules from MS/MS spectra and specifically apply it to the problem of elucidating unknown structures in the synthesis process of a drug substance. We demonstrate how to integrate the knowledge of domain experts into the training and inference process of the model to improve elucidation accuracy (Fig. 1b). By finetuning it on simulated MS/MS spectra of related impurities, we further enhance the model's performance, showing for the first time the potential of transfer learning from simulations. Lastly, we simulate the entire synthetic route in silico, including impurity formation events, removing the need for any experimental data collection for finetuning (Fig. 1c). To facilitate future research in this area, we open source our implementation including training code, data, and pretrained checkpoints at the following link: https://www.github.com/Boehringer-Ingelheim/seismiq.
We evaluated our model on the Critical Assessment for Small Molecule Identification (CASMI) challenges,35 as well as the newly released MassSpecGym benchmark19 (Fig. 2). Our model achieved top-128 accuracies of 76.4%, 43.8% and 33.3% respectively for CASMI 2016, 2017 and 2022, and top-5 accuracies of 72.8%, 35.8% and 27.9% when the predictions were ranked using CSI:FingerID36,37 scores. ESI S1† reports top-k performance for different values of k and ranking measures; for the remainder of this paper, we report top-128 performance of our model. All CASMI molecules and their spectra were removed from the model's training and validation sets to ensure an unbiased evaluation. As the MassSpecGym benchmark was published after our model was trained, we used for the evaluation only molecules that were not already in the model's training set. This resulted in 992 MS/MS spectra on which our model reached a top-128 accuracy of 36.5%. On this benchmark, we did not find a difference in performance between different instrument types, and a slight decrease for [M + Na]+ adducts (ESI S2†).
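The top-k accuracies quoted above can be illustrated with a minimal sketch. It assumes that each challenge comes with a ranked list of predicted structures and that predictions and ground truth are compared as canonical identifier strings; the function name `top_k_accuracy` and the toy identifiers are hypothetical and not part of the published implementation.

```python
from typing import Sequence


def top_k_accuracy(ranked_predictions: Sequence[Sequence[str]],
                   truths: Sequence[str], k: int) -> float:
    """Fraction of challenges whose true structure is among the first k predictions."""
    if len(ranked_predictions) != len(truths):
        raise ValueError("one ranked list per ground-truth structure is required")
    hits = sum(truth in preds[:k]
               for preds, truth in zip(ranked_predictions, truths))
    return hits / len(truths)


# Toy example with placeholder identifiers (assumption: not real data).
preds = [["A", "B", "C"], ["X", "Y", "Z"], ["P", "Q", "R"]]
truth = ["B", "Z", "M"]
acc_top1 = top_k_accuracy(preds, truth, k=1)  # no truth is ranked first
acc_top3 = top_k_accuracy(preds, truth, k=3)  # two of three truths are in the top 3
```

Ranking the same candidate list with a different score only changes the order within each list, which is why top-128 and top-5 figures can differ so strongly.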
MSNovelist15 reached a top-128 accuracy of 57% and top-1 of 26% on CASMI 2016 when using the sum formula predicted by SIRIUS, which was correct in 93.8% of the cases. MS2Mol22 does not require a sum formula as input and reached a top-25 accuracy of 9% on CASMI 2022. MassGenie14 reported an accuracy of 53% on a subset of 93 challenges of CASMI 2017 with small molecular weight that were also used to train their model, while Mass2SMILES13 correctly elucidated 2/236 challenges of CASMI 2022. Spec2mol18 could not be evaluated on the CASMI challenges, as their model requires four input spectra, combining positive and negative ionization with high and low collision energy, to elucidate a molecular structure. MADGEN38 is a scaffold completion model that obtained a top-10 accuracy of 1.6% on MassSpecGym when choosing a scaffold from a list of 256 options, and 38.6% when given the true scaffold of the molecule. The improvement in performance of our model can be attributed to the larger and more diverse training dataset, the data augmentation protocol employed during training, the larger model size, and the fact that the correct sum formula is given as input. As most of these models lack public and freely usable implementations, we limit ourselves to reporting their performance as originally stated in the respective publications.
We also assessed our model on an internal dataset composed of 174 experimentally detected impurities of several small molecule drug substances collected during routine operations in analytical development (ESI S3† reports data collection standards). On this dataset, our model correctly elucidated only nine (5%) of the impurities, while providing predictions with a Tanimoto similarity of at least 0.8 for 49 (30%) impurities (computed using the RDKit39 fingerprint algorithm with default settings of 2048 bits and paths of between one and seven bonds, excluding hydrogen atoms). This highlights the challenge posed by the lack of representative training data for reliably elucidating impurities. The problem is exacerbated in a pharmaceutical setting, where substrates change significantly from one drug substance project to the next, posing considerable challenges for creating a truly representative training set.
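For readers unfamiliar with the similarity measure, the following minimal sketch computes the Tanimoto coefficient between two fingerprints represented as sets of on-bit indices. It is a simplified stand-in: the fingerprints in the text come from RDKit, whereas the bit sets below are made up for illustration, and the value returned for two empty fingerprints is a convention chosen for this sketch.

```python
from typing import AbstractSet


def tanimoto(bits_a: AbstractSet[int], bits_b: AbstractSet[int]) -> float:
    """Tanimoto coefficient: shared on-bits divided by all on-bits."""
    union = len(bits_a | bits_b)
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(bits_a & bits_b) / union


# Hypothetical on-bit indices of a true and a predicted structure.
fp_true = {1, 5, 9, 42, 100}
fp_pred = {1, 5, 9, 42, 77}
similarity = tanimoto(fp_true, fp_pred)  # 4 shared bits out of 6 distinct bits
is_close_match = similarity >= 0.8       # threshold used in the text
```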
Based on these considerations, we specifically selected a model architecture that can take this known common substructure as expert-provided input and complete it into a fully formed molecule (Fig. 1b). To do this, we construct a SMILES string of the common fragment such that the last position in the string corresponds to the attachment point between the fragment and the impurity site of variation. We then let the model complete this SMILES string, thereby generating the remaining structure of the impurity conditioned on the known fragment and relative attachment point (Fig. 3a). We quantitatively validated the ability of our model to elucidate the molecular structure when prompted in this way by simulating different known fragments from the test datasets. Specifically, we generated fragments by breaking all single bonds of each molecule in the dataset, prompted the model with the SMILES of each of the two fragments in turn and evaluated how close the model's predictions were to the whole molecule.
On the public test datasets, this resulted in 48628 fragments with an average of 27 and a maximum of 70 missing atoms (Fig. 3b). On such fragments, the model obtained an accuracy of 96.3% when it was tasked to complete fragments missing up to 10 atoms, 71.5% for fragments missing up to 30 atoms, and 35.4% for fragments missing more than 30 atoms (Fig. 3c). Nonetheless, for the latter fragments the average Tanimoto similarity of the predicted molecules was 0.82 (Fig. 3d), and in 73.0% of the cases the Tanimoto similarity was at least 0.675, indicating a close agreement with the ground truth.22 ESI S5† reports confidence intervals and standard errors for all accuracies.
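The fragment generation used for this validation can be sketched on a plain molecular graph. The code below is an illustration under stated assumptions, not the published pipeline: the molecule is given as a hypothetical list of heavy-atom bonds rather than an RDKit object, each bond is listed once, and single bonds whose removal does not split the molecule (ring bonds) are skipped.

```python
from typing import Dict, List, Set, Tuple

Bond = Tuple[int, int, int]  # (atom index, atom index, bond order)


def component(start: int, adjacency: Dict[int, Set[int]]) -> Set[int]:
    """Atoms reachable from `start` by depth-first search."""
    seen, stack = {start}, [start]
    while stack:
        for nxt in adjacency[stack.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def fragments_by_single_bond(n_atoms: int, bonds: List[Bond]):
    """Return (kept atoms, attachment atom, number of missing atoms) per cut side."""
    results = []
    for i, j, order in bonds:
        if order != 1:
            continue  # only single bonds are broken
        adjacency: Dict[int, Set[int]] = {a: set() for a in range(n_atoms)}
        for a, b, _ in bonds:
            if (a, b) != (i, j):  # rebuild the graph without the cut bond
                adjacency[a].add(b)
                adjacency[b].add(a)
        side_i = component(i, adjacency)
        if j in side_i:
            continue  # assumption: ring bonds do not yield two fragments
        side_j = component(j, adjacency)
        results.append((side_i, i, len(side_j)))
        results.append((side_j, j, len(side_i)))
    return results


# Acetamide heavy atoms, CC(=O)N: 0=C, 1=C, 2=O, 3=N.
acetamide_bonds = [(0, 1, 1), (1, 2, 2), (1, 3, 1)]
frags = fragments_by_single_bond(4, acetamide_bonds)
```

Each returned fragment records the atom that carried the broken bond, which corresponds to the attachment point placed at the end of the SMILES prompt, and the number of atoms the model would have to generate.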
Despite these encouraging results, the molecular structures under consideration were entirely new for the model and never seen during training. Drug substance impurities, however, tend to be structurally similar to one another; during the development and optimization of the synthesis pathway a significant number of related impurities are characterized, and several distinct projects make use of similar reactions to synthesize the respective main compounds. These considerations motivated us to make use of this historical data and investigate ways to incorporate this implicit process knowledge into the model.
While there was no difference in performance when completing fragments missing up to five atoms, the positive effect of finetuning is apparent starting from ten missing atoms (Fig. 4b). Between ten and twenty missing atoms, the pretrained model obtained an accuracy of only 18.5%, while the model finetuned on historical impurities excluding the test ones obtained 68.4% correct predictions. Finetuning on simulated spectra of the test impurities further raised accuracy to 91.1%. The gap between pretrained and finetuned models further widens for de novo elucidation, where the accuracy of 5.2% of the pretrained model was improved to 58.9% by using historical data and 90% when using simulated spectra of test impurities. Our finetuning protocol caused, however, detrimental effects on the performance on molecules that were not related to the finetuning set. In the CASMI challenges, for example, accuracies decreased by 32, 31 and 22 percentage points for the years 2016, 2017 and 2022 respectively. In ESI S6† we analyze training and validation curves for pretrained and finetuned models, including a comparison between the pretrained model and a model finetuned on CASMI itself. In ESI S7† we perform a quantitative evaluation of the similarity between simulated and experimental spectra.
These results show that it is possible to considerably boost elucidation accuracy by fine-tuning the model on simulated MS/MS spectra of structurally related or even identical molecules, at a certain price on unrelated molecules. Obtaining such a dataset is, however, extremely time consuming, as it requires substantial efforts to manually collect and elucidate hundreds or thousands of impurities.
We developed an impurity predictor based on SMARTS reaction templates,40 describing how the products in a chemical reaction are formed by combining fragments of the starting materials (Fig. 5a and b). We integrated data from an internal electronic laboratory notebook reporting performed reactions with the corresponding starting materials and analytically detected impurities (Fig. 5c) allowing us to cover both the desired reactions forming the main compound and additional processes that generated the detected impurities. For this approach, only the starting materials and respective product(s) are needed, while reagents as well as reaction conditions can be neglected. After performing a sanitization check with RDKit, the template was extracted using the RDChiral package.41 We did not filter templates by score since the formation of impurities in production batches can sometimes be difficult to explain using traditional organic transformation rules and knowledge. Using these reaction templates, we could reproduce the synthesis route of an asset by iteratively applying all templates to the starting materials of each chemical step as well as all products resulting from the previous steps. Known downstream impurities that were not predicted can be covered by manually adding the structure of interest to the inputs of the respective step. This procedure resulted in a dataset of impurities for the complete manufacturing process of an asset that is entirely simulated from first principles only based on the knowledge of the synthesis route (Fig. 5d, sample SMARTS templates as well as synthetic impurity datasets based on public templates can be found in ESI S7 and S8†).
We focused on the synthesis route of an internal asset from Boehringer Ingelheim's development pipeline. This molecule consists of 41 non-hydrogen atoms, possesses a molar weight of ca. 600 g mol⁻¹ and is comprised of various functional groups, multiple annulated rings as well as a chiral spiro carbon. The synthesis route for this asset spans seven distinct steps involving four different starting materials and covering multiple reaction types like condensation, oxidation, or reductive amination reactions. While we cannot disclose the chemical structure of this API, we believe that it constitutes a challenging test case for our methodology that is representative of the complexity and variety of real world new chemical entities under active development in the pharmaceutical industry. We excluded from the template extraction procedure all reactions reported in the electronic laboratory notebook that were part of the test asset synthesis route, leaving us with 4446 templates in total (summary statistics can be found in ESI S10†). Their application resulted in 20813 simulated impurities with mass below 1200 Da, and 154756 corresponding simulated mass spectra. Our test dataset contained 61 experimentally-detected impurities related to this asset, and the impurity generation procedure correctly predicted 27 of these 61 impurities. In general, the chemical space covered by the simulated impurities included close matches for all experimental impurities (Fig. 6a) and revealed additional impurity clusters that were not detected, possibly because the reaction conditions did not allow for such impurities to form in sufficient quantities, or they were not stable or isolated under the given work-up conditions.
Fig. 6 Results of impurity simulation and model finetuning. (a) UMAP42 visualization of the simulated (blue) and experimentally detected (red) test asset impurities. (b) Accuracy (y axis) on the fragment completion task for the test asset impurities as a function of missing atoms to be completed (x axis) for the pretrained model (blue), a model finetuned on the simulated spectra of historical in-house impurities (orange), and a model finetuned on the simulated spectra of the simulated impurities (green). Error bars represent the standard error of the mean estimator.
We finetuned and evaluated our model following the same protocols as before, excluding from the finetuning set the 61 experimentally detected test asset impurities, and we compared this model with the model finetuned on historical in-house impurities described in the previous section. Both models correctly predicted the same 46 (78.0%) impurities in a de novo setting. For fragment completion, finetuning on simulated data appeared to result in better performance, although the small dataset size of only 61 impurities caused some fluctuations (Fig. 6b). Nonetheless, when averaging the model's performance across fragments of all sizes, the model finetuned on simulated data had 5.6 percentage points higher accuracy (88.7% vs. 83.1%).
The results in this section show that an entirely in silico simulation of process impurities and their MS/MS spectra can compensate, without loss of accuracy, for the absence of relevant experimental data for finetuning, and results in a model with significantly higher performance than a model pretrained only on public data.
We explored three ways of dealing with this challenge. First, we employed data augmentation at training time teaching the model to complete a user-provided molecule fragment. Analytical chemists are able to identify which parts of the impurity are identical to the main compound and providing this fragment to the model resulted in considerable gains in elucidation accuracy. Second, we finetuned the model using an internal dataset of historical, experimentally detected impurities. We showed that our model can successfully transfer knowledge from the corresponding simulated MS/MS spectra further boosting elucidation accuracy both in a de novo and in a fragment completion setting. Third, considering the significant time and monetary investments required to obtain such a dataset, we simulated the entire synthesis process of an asset and predicted the impurities that are likely to be generated in the real world. We found that finetuning exclusively on these simulated impurities and their simulated MS/MS spectra resulted in a model that is slightly more accurate than a model finetuned on experimental data, thus enabling accurate elucidation of impurities without the necessity of any prior experimental measurement.
Our work is not free from limitations. First, our finetuning protocol introduced some overfitting to the dataset used for finetuning, despite our use of common overfitting mitigation strategies43,44 including dropout, weight decay, and data augmentation. The danger of overfitting to small datasets is to some extent unavoidable45–47 and in the context of a transfer learning setup like ours still constitutes a fertile ground for current research48–54 with no accepted best practices.45,55 For our model, the risk of overfitting can be reduced by generating more relevant synthetic data for finetuning, for example by leveraging our impurity simulation approach and by increasing the variety of MS/MS spectra predictors. Furthermore, the danger of unreliable predictions could be identified at inference time by comparing the input MS/MS spectrum with the finetuning dataset for example via CSI:FingerID36,37 fingerprints, using the pretrained model as a fallback, or by employing uncertainty quantification techniques making the model more robust to out-of-distribution inputs.48–52,56,57 Second, our method in its current form assumes a single attachment point when completing fragments; it is not uncommon, however, for impurities to differ from the main compound in more than one location. Such cases cannot be encoded into a single prompt that allows our model, as presently trained, to predict all attached fragments in a single shot. This task in its general form is known as scaffold completion, and several recent works explore adapting chemical language models like ours for this kind of prediction.58–60 At present, this limitation can be circumvented simply by filtering out the model's predictions that do not conform to the known fragment or via a custom fragment-aware beam search sampling procedure. Third, the data used to train our model was composed of metabolites and small drug-like molecules, making it inappropriate to elucidate larger molecules such as peptides.
Furthermore, the accuracy of the MS/MS simulation approaches employed to generate the training spectra could also limit our model's performance on some compound classes and certain adduct types.5 Nonetheless, we did not find significant differences between the accuracy of the MS/MS predictions for our internal test dataset and the CASMI challenges (ESI S5†), suggesting that the models we employed are equally applicable to metabolites and drug substance impurities.
In conclusion, we achieved substantial advancements in the de novo elucidation of impurities from MS/MS spectra by considering the unique aspects of impurity generation and embedding them into the model's training and inference procedures. In the pharmaceutical industry impurity characterization is essential for optimizing manufacturing processes, understanding degradation pathways, ensuring drug substance stability, maintaining quality control, and achieving regulatory compliance. This work represents a significant step forward as we show how to turn a weak baseline into a model that is practically useful to assist analytical chemists in daily production workflows. By incorporating impurity elucidation earlier in the process, we strive to alleviate the workload of analytical chemists and facilitate semi-automated elucidation workflows, ultimately enhancing the efficiency and mitigating the cost of drug substance development, a process that typically spans five to ten years and costs upwards of 1 billion USD.61,62
To improve the generalizability of our model and reduce the risk of overfitting, we leveraged several techniques to further augment this dataset during training. First, we chose a random number of peaks between 5 and 50, randomly sampled from the entire spectrum with probability proportional to their intensity. Further, the m/z value of each peak was perturbed by uniform random noise with magnitude 0.02, making the model resilient to measurement noise. Each peak was paired with the corresponding neutral loss, and both were encoded with a sinusoidal position encoding to a dimensionality of 512. The frequencies for the sinusoidal encoding included the atomic masses of H, C, N, O, Cl, S, P, K, F, and Br, and 246 additional frequencies between 3⁻⁵ (0.0041) and 3⁷ (2187), evenly distributed in base-three log-space.
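The augmentation and encoding steps described above can be sketched as follows. This is a minimal illustration under several assumptions that the text does not settle: peaks are drawn without replacement, the noise is uniform on [-0.02, 0.02], monoisotopic masses are used for the ten elements, each of the 256 values acts as a wavelength, and the 512 dimensions arise from one sine and one cosine per value. The names `augment_spectrum`, `build_scales` and `encode` are hypothetical.

```python
import math
import random
from typing import List, Tuple

Peak = Tuple[float, float]  # (m/z, intensity), intensities assumed positive

# Monoisotopic masses of H, C, N, O, Cl, S, P, K, F, Br (assumed variant).
ATOMIC_MASSES = [1.007825, 12.0, 14.003074, 15.994915, 34.968853,
                 31.972071, 30.973762, 38.963707, 18.998403, 78.918338]


def augment_spectrum(peaks: List[Peak], rng: random.Random,
                     min_peaks: int = 5, max_peaks: int = 50,
                     noise: float = 0.02) -> List[Peak]:
    """Subsample peaks proportionally to intensity and jitter their m/z."""
    n_keep = min(rng.randint(min_peaks, max_peaks), len(peaks))
    pool = list(peaks)
    kept = []
    for _ in range(n_keep):
        # Weighted draw without replacement: pick one peak, then remove it.
        idx = rng.choices(range(len(pool)), weights=[p[1] for p in pool])[0]
        mz, intensity = pool.pop(idx)
        kept.append((mz + rng.uniform(-noise, noise), intensity))
    return sorted(kept)


def build_scales(n_extra: int = 246) -> List[float]:
    """Ten atomic masses plus values evenly spaced in base-three log-space."""
    exponents = [-5 + 12 * i / (n_extra - 1) for i in range(n_extra)]
    return ATOMIC_MASSES + [3.0 ** e for e in exponents]


def encode(value: float, scales: List[float]) -> List[float]:
    """Sinusoidal encoding: one sine and one cosine per scale."""
    angles = [2 * math.pi * value / s for s in scales]
    return [math.sin(a) for a in angles] + [math.cos(a) for a in angles]


rng = random.Random(0)
spectrum = [(50.0 + 10 * i, 1.0 + i) for i in range(60)]  # made-up peaks
precursor_mz = 700.0                                      # made-up precursor
scales = build_scales()
encoded = [(encode(mz, scales), encode(precursor_mz - mz, scales))
           for mz, _ in augment_spectrum(spectrum, rng)]
```

Each peak ends up as a pair of 512-dimensional vectors, one for the fragment m/z and one for the neutral loss, computed here as the difference between the precursor m/z and the peak m/z.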
The model takes as input molecules encoded as SMILES strings. Initial experiments on a small development dataset revealed that this encoding performs on par with or slightly better than SELFIES74 and DeepSMILES,75 while having favorable computational requirements due to the shorter length of SMILES strings. SMILES strings were tokenized with one token per atom, so that C and Cl were mapped to different tokens, resulting in 305 tokens in total. The model was presented with the SMILES string in canonical order 25% of the time and in a randomized atom order the remaining 75% of the time. Regardless of the order, 50% of the time we kekulized the SMILES, and removed all stereochemistry information. The SMILES tokens were encoded with a learnable embedding in addition to a sinusoidal encoding for the token position. In addition, the model received as input an encoding of the remaining heavy atoms to be generated to complete the molecule, computed based on the sum formula of the molecule and updated with every generated token as done in MSNovelist.15
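A minimal sketch of atom-level tokenization and of the remaining-heavy-atom bookkeeping is given below. The regular expression is a simplified, hypothetical tokenizer that covers common organic SMILES; it does not reproduce the 305-token vocabulary of the model, and the way the remaining counts are fed to the network is not shown.

```python
import re
from collections import Counter
from typing import Dict, List, Optional, Tuple

# Bracket atoms and two-letter halogens are matched before single letters,
# so that "Cl" becomes one token instead of "C" followed by "l".
TOKEN_RE = re.compile(
    r"\[[^\]]+\]|Br|Cl|[BCNOPSFI]|[bcnops]|%\d{2}|[=#\-+\\/().:~*$@]|\d")


def tokenize(smiles: str) -> List[str]:
    """Split a SMILES string into atom, bond, branch and ring-closure tokens."""
    tokens = TOKEN_RE.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError("unsupported characters in SMILES: " + smiles)
    return tokens


def element_of(token: str) -> Optional[str]:
    """Element symbol of a heavy-atom token, or None for any other token."""
    if token.startswith("["):
        match = re.match(r"\[\d*([A-Za-z][a-z]?)", token)
        symbol = match.group(1).capitalize() if match else None
        return None if symbol == "H" else symbol
    if token[0].isalpha():
        return token.capitalize()  # aromatic "c" counts as carbon
    return None


def remaining_atom_trace(smiles: str,
                         heavy_atoms: Dict[str, int]) -> List[Tuple[str, Dict[str, int]]]:
    """Heavy atoms still to be generated after each token."""
    remaining = Counter(heavy_atoms)
    trace = []
    for token in tokenize(smiles):
        element = element_of(token)
        if element is not None:
            remaining[element] -= 1
        trace.append((token, dict(remaining)))
    return trace


# 4'-Chloroacetanilide, sum formula C8H8ClNO (hydrogens are not tracked).
tokens = tokenize("CC(=O)Nc1ccc(Cl)cc1")
trace = remaining_atom_trace("CC(=O)Nc1ccc(Cl)cc1",
                             {"C": 8, "N": 1, "O": 1, "Cl": 1})
```

When generation is complete and consistent with the sum formula, every remaining count reaches zero, which gives the model a running signal of how much of the molecule is still missing.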
The model was trained using the cross-entropy loss with a label smoothing of 0.1, and samples were re-weighted to correct for over- and under-represented molecules in the training dataset. We used the AdamW optimizer78 with a learning rate of 3 × 10⁻⁵, linearly increased from 6 × 10⁻⁷ over the first 1000 training steps. A dropout of 0.2 and a weight decay of 1 × 10⁻² were applied. The model was trained concurrently on four NVIDIA A100 GPUs, each using a batch size of 64 and mixed 16-bit precision. Based on the validation metrics, the model did not exhibit signs of overfitting, and we stopped training shortly after 22 epochs, or 3.5 million training steps, taking a total of 23 days.
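The learning-rate ramp at the start of training can be written as a small function. This is a sketch using the values stated above; it assumes that the rate stays constant once the target value is reached, which the text does not specify, and the function name is hypothetical.

```python
def pretraining_lr(step: int, base_lr: float = 3e-5, start_lr: float = 6e-7,
                   warmup_steps: int = 1000) -> float:
    """Linear ramp from start_lr to base_lr, then constant (assumed)."""
    if step >= warmup_steps:
        return base_lr
    return start_lr + (base_lr - start_lr) * step / warmup_steps


lr_start = pretraining_lr(0)        # value at the first step
lr_halfway = pretraining_lr(500)    # midpoint of the ramp
lr_after = pretraining_lr(10_000)   # value after the ramp
```

In a PyTorch setup such a function would typically be wrapped in a scheduler that multiplies the optimizer's base rate, but the training framework used here is not stated in the text.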
Model fine-tuning was performed by freezing the transformer and tuning the MLPs, which had overall 23 M trainable parameters. We used the AdamW optimizer with weight decay 10⁻⁴ and initial learning rate of 10⁻⁴, exponentially decayed with a factor of 0.995 over 250 epochs. No early stopping was performed.
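As a worked example of the finetuning schedule, the sketch below evaluates the exponentially decayed learning rate. It assumes that the factor of 0.995 is applied once per epoch, which is our reading of the description; the function name is hypothetical.

```python
def finetuning_lr(epoch: int, initial_lr: float = 1e-4,
                  gamma: float = 0.995) -> float:
    """Learning rate after `epoch` per-epoch decay steps (assumed schedule)."""
    return initial_lr * gamma ** epoch


lr_first_epoch = finetuning_lr(0)
lr_last_epoch = finetuning_lr(250)
```

Under this assumption the rate at the end of the 250 epochs is a little under 3 × 10⁻⁵, that is, between a third and a quarter of its initial value.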
In practice, we followed a data mining approach by combining internal data of process development and analytically detected impurities, excluding every reaction related to the test asset. For each experiment recorded in our electronic lab notebook for the development of new APIs, we combined the reagents with all detected impurities. For these in silico constructed reactions, we first performed atom mapping with RXNMapper.82 Subsequently, we extracted a reaction template for each hypothetical reaction leading to an observed impurity using the RDChiral library.41 Overall, this approach resulted in 4446 templates.
With the internal template set in place, we subjected each possible bimolecular combination of reactants for a given experiment of the asset to RDKit to generate possible impurity structures39 over two rounds of generation.
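The enumeration over bimolecular combinations and two rounds can be sketched independently of the cheminformatics toolkit. In the code below a template is any callable that maps an ordered pair of molecules to zero or more products; the string-based `toy_amide_coupling` is a deliberately naive placeholder and not a real reaction template, whereas the actual procedure applies SMARTS templates with RDKit. All names are hypothetical, and the mass filter mentioned earlier is omitted.

```python
from itertools import combinations_with_replacement
from typing import Callable, Iterable, List, Set

Template = Callable[[str, str], Iterable[str]]


def enumerate_impurities(reactants: Iterable[str], templates: List[Template],
                         rounds: int = 2) -> Set[str]:
    """Apply every template to every reactant pair, feeding products back in."""
    pool = set(reactants)
    products: Set[str] = set()
    for _ in range(rounds):
        new: Set[str] = set()
        for a, b in combinations_with_replacement(sorted(pool), 2):
            for template in templates:
                new.update(template(a, b))
                new.update(template(b, a))  # templates are order dependent
        new -= pool          # keep only structures not seen so far
        products |= new
        pool |= new          # products of this round react in the next one
    return products


def toy_amide_coupling(acid: str, amine: str) -> List[str]:
    """Placeholder: joins strings ending in C(=O)O with strings starting with N."""
    if acid.endswith("C(=O)O") and amine.startswith("N"):
        return [acid[:-1] + amine]
    return []


# Glycine reacting with itself: dimer in round one, trimer and tetramer in round two.
simulated = enumerate_impurities(["NCC(=O)O"], [toy_amide_coupling], rounds=2)
```

Because products re-enter the pool, the number of candidates can grow quickly with each round, which is one practical reason to limit the number of rounds and filter by mass.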
Footnote
† Electronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d5dd00115c
This journal is © The Royal Society of Chemistry 2025