Jeffrey M. Ting,* Teresa Tamayo-Mendoza, Shannon R. Petersen, Jared Van Reet, Usman Ali Ahmed, Nathaniel J. Snell, John D. Fisher, Mitchell Stern and Felipe Oviedo
Nanite, Inc., Boston, Massachusetts 02109, USA. E-mail: jeff@nanitebio.com
First published on 6th November 2023
Materials informatics (MI) has immense potential to accelerate the pace of innovation and new product development in biotechnology. Close collaborations between skilled physical and life scientists with data scientists are being established in pursuit of leveraging MI tools in automation and artificial intelligence (AI) to predict material properties in vitro and in vivo. However, the scarcity of large, standardized, and labeled materials data for connecting structure–function relationships represents one of the largest hurdles to overcome. In this Highlight, focus is brought to emerging developments in polymer-based therapeutic delivery platforms, where teams generate large experimental datasets around specific therapeutics and successfully establish a design-to-deployment cycle of specialized nanocarriers. Three select collaborations demonstrate how custom-built polymers protect and deliver small molecules, nucleic acids, and proteins, representing ideal use-cases for machine learning to understand how molecular-level interactions impact drug stabilization and release. We conclude with our perspectives on how MI innovations in automation efficiencies and digitalization of data—coupled with fundamental insight and creativity from the polymer science community—can accelerate translation of more gene therapies into lifesaving medicines.
Unlike their small molecule counterparts, therapeutic biologics present distinct challenges. DNA, RNA, and genomic editing ribonucleoproteins4 are larger, hydrophilic, ionic, and prone to degradation. Prospective polymer delivery systems need to balance opposing attributes for these payloads by providing (i) colloidal stabilization across multiple biological barriers,5 and (ii) efficient payload release at the site of action.6 This dichotomy complicates the (mostly) well-understood molecular engineering approaches used for small molecule drugs that rely on conventional controlled drug delivery principles and computational foundations.
Because of the vast design space of chemistries and architectures, it remains difficult to intuitively devise an ideal polymer vector that can fulfill every desired function in macromolecular biologics delivery. Nevertheless, polymer chemistry has advanced to the point where a virtually unlimited variety of structures can be created, as described in recent perspectives on controlled reversible-deactivation radical polymerization,7 chemical functionalization,8 site-specific bioconjugation,9 and electrostatic self-assembly.10 High-throughput synthesis and screening campaigns have taken advantage of this versatility to tailor specialized polymers around a single drug of interest.11–13 Challenges remain, however, in efficiently deploying the vast toolbox of potential polymeric delivery systems across an enormously divergent set of therapeutic modalities.
One potential solution for navigating this immense design space is the marriage of experimental and synthetic data with materials informatics (MI) to develop a deeper understanding of structure–function relationships between polymer-mediated binding and delivery of various drugs. MI depends on collecting, cleaning, and organizing machine-actionable data into a framework that can leverage machine learning (ML) algorithms and artificial intelligence (AI) applications.14,15 Unfortunately, materials data curation is often a formidable challenge because information sources are dispersed, inhomogeneous, and inaccessible. This challenge is particularly acute in polymer science, where progress has lagged in laying the groundwork for reconciling large polymer datasets with digitalization.16–19
In this short Highlight article, we feature three examples that apply these principles and demonstrate polymer synthesis/screening campaigns for three distinct cargos: (1) small molecule drugs, (2) nucleic acids, and (3) proteins. These vignettes show how rapid data generation can facilitate ML models to produce multifunctional nanoparticle candidates for the therapeutic of interest (Fig. 1). High-throughput polymer chemistry, nonviral drug delivery, and MI are connected through close collaborations across multiple teams with distinct skillsets in each use case. A glossary of MI terms and methodologies is provided at the end of this Highlight for readers’ reference. Finally, we provide an outlook for expanding these themes to pharmaceutical applications in nonviral gene therapy. The breadth and diversity of genetic drugs span physicochemical attributes that must be accounted for in data-driven polymer design from a near-infinite chemical space. To this end, laboratory workflow automation and data management best practices are discussed that can prioritize therapeutic formulations with higher likelihood of successful delivery. In our view, assembling these physical and digital pieces together can usher in the next era of potent, affordable genetic drugs to market.
Bannigan et al.22 recently explored a ML approach to predict fractional drug release and proposed a universal framework for designing LAI systems. They curated a dataset of 43 drug–polymer combinations based on commercially available polymers, with 31 783 partial and 181 complete drug-release measurements drawn from previous publications. As a starting point, 17 descriptors from experimental conditions and physicochemical properties of the drugs and polymers were examined. After training different models and assessing the performance of release predictions, the authors selected a tree-based regression model, the light gradient-boosting machine (LGBM), for further refinement. Two models were trained: the first excluded all drug-release curve points from the training features, while the second included the first three release measurements as features. The team selected the second model for further analysis.
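As a rough illustration of this second-model setup, the sketch below trains a tree-based regressor on entirely synthetic data, with three early fractional-release measurements included as input features alongside drug/polymer descriptors. scikit-learn's `GradientBoostingRegressor` stands in for LGBM, and every descriptor name and value here is invented for illustration, not taken from the study.

```python
# Hedged sketch: predict later fractional drug release from descriptors plus
# early release measurements (the "second model" idea). Synthetic data only;
# GradientBoostingRegressor is a stand-in for the LGBM used in the paper.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 200

# Placeholder descriptors (e.g., drug MW, polymer MW, logP, TPSA)
descriptors = rng.normal(size=(n, 4))
# Early release measurements at T = 0.25, 0.5, 1.0 days (fractions, sorted)
early_release = np.sort(rng.uniform(0, 0.5, size=(n, 3)), axis=1)
X = np.hstack([descriptors, early_release])
# Synthetic target: later fractional release, loosely tied to the T = 1.0 value
y = np.clip(early_release[:, 2] * 2 + 0.05 * descriptors[:, 0], 0, 1)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = GradientBoostingRegressor(random_state=0).fit(X_tr, y_tr)
r2 = model.score(X_te, y_te)  # held-out R^2 of the release prediction
```

Because the synthetic target is dominated by the one-day measurement, the fitted model recovers it easily; the point is only the feature layout, not the accuracy.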
Agglomerative hierarchical clustering based on Spearman's rank correlation was performed to remove redundant variables from the final predictive model (Fig. 2(A)). This statistic quantifies the strength of a monotonic relationship between two variables. By arranging the variables into a hierarchy of clusters from this test, they found that removing two features from strongly correlated clusters (i.e., the fractional drug release at 0.5 day, T = 0.5, and the number of heteroatoms, NHA) resulted in a model with similar accuracy. Meanwhile, despite a strong correlation between drug molecular weight (MW) and topological polar surface area (TPSA), removing these features reduced model accuracy. This example shows how descriptors like T = 0.5 and NHA were removed while others like MW and TPSA were retained, resulting in 15 finalized features. To further determine which features are important in the model, a SHapley Additive exPlanations (SHAP) analysis was performed (Fig. 2(B)). SHAP is a method to explain the predictive outputs of ML models by computing the relative contribution of each input feature. The model's most influential feature was time, specifically T = 1.0, the fractional release measured at one day. Other significant drivers were the polymer and drug MWs. The SHAP results did not, however, capture potential synergies among other features suggested by the agglomerative hierarchical clustering analysis.
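The feature-pruning step can be sketched as: compute Spearman's rank correlations, convert them to distances, cluster hierarchically, and flag features sharing a cluster as candidates for removal. This is a minimal sketch on synthetic data; the feature names are illustrative stand-ins, not the paper's actual inputs.

```python
# Hedged sketch of redundant-feature detection via Spearman correlation and
# agglomerative hierarchical clustering, in the spirit of Fig. 2(A).
import numpy as np
from scipy.stats import spearmanr
from scipy.cluster import hierarchy

rng = np.random.default_rng(1)
n = 100
t_025 = rng.uniform(0, 0.3, n)
t_050 = t_025 + rng.normal(0, 0.01, n)   # nearly redundant with T = 0.25
drug_mw = rng.normal(300, 50, n)          # independent placeholder descriptor
tpsa = rng.normal(80, 10, n)              # independent placeholder descriptor
X = np.column_stack([t_025, t_050, drug_mw, tpsa])
names = ["T=0.25", "T=0.5", "drug_MW", "TPSA"]

corr, _ = spearmanr(X)
# Convert |correlation| to a distance; condensed upper triangle for linkage
dist = 1 - np.abs(corr)
condensed = dist[np.triu_indices_from(dist, k=1)]
linkage = hierarchy.linkage(condensed, method="average")
# Cut the dendrogram: features with the same label are redundancy candidates
labels = hierarchy.fcluster(linkage, t=0.2, criterion="distance")
clusters = dict(zip(names, labels))
```

Here the two near-duplicate time points fall into one cluster, so one of them could be dropped, mirroring the removal of T = 0.5 described above.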
Fig. 2 (A) Heatmap of the absolute Spearman's rank correlation of the initial 17 input features for LAI development. The dendrogram displays agglomerative hierarchical clustering analysis, e.g., T = 0.25, T = 1.0, and T = 0.5. Pink and blue represent 0.0 and 1.0 correlation values, respectively. (B) Swarm plot of SHAP values of the 15-feature LGBM model. The colors pink and blue represent relatively low and high values, respectively. (C) Table with the proposed design criteria to select “fast” and “slow” drug-release profiles based on the 15-feature LGBM model, SHAP analysis, and observed trends in PCA and tSNE plots. For instance, low molecular weights of drug cargo and polymer system are associated with a “fast” drug release profile. Adapted with permission from ref. 22. Copyright 2023 Nature Publishing Group. |
Finally, the authors proposed and experimentally tested two LAI formulations, focusing on microparticles that are easy to produce from commercially available PLGAs. As a proxy they used the experimental measurement T = 1.0, one of the most influential features in the SHAP analysis: systems with relatively high fractional release at one day correspond to “fast” release profiles, while low values indicate “slow” sustained release. Moreover, by analyzing low-dimensional projections from principal component analysis (PCA) and a nonlinear dimensionality-reduction method (T-distributed Stochastic Neighbor Embedding; t-SNE), they observed that some features were generally related to the fractional drug release at T = 1.0 and, therefore, to a “slow” or “fast” release. They then proposed LAI design criteria (Fig. 2(C)) and selected two drug–PLGA pairs to serve as “fast” and “slow” release systems. For the “fast” LAI, a 10 kDa PLGA and salicylic acid (SA) were chosen for their relatively low MWs and the relatively low log P and TPSA values of SA. By comparison, the “slow” release LAI consisted of a 50 kDa PLGA and olaparib (OLA), where both components have relatively high MWs and OLA has relatively high log P and TPSA values. They prepared and characterized samples using an oil-in-water emulsion method23 and observed excellent agreement between predicted and experimental release profiles. The authors speculate that further improvements could come from incorporating factors excluded from the model, such as polymer degradation in the PLGA formulation. Nevertheless, this work benchmarks a powerful method for establishing design rules for other LAI pairings, provided that sufficient training data are available.
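A minimal sketch of the low-dimensional analysis, assuming two synthetic populations standing in for “fast” and “slow” formulations: both PCA and t-SNE project the feature vectors to 2-D, where release-related groupings can be inspected visually.

```python
# Hedged sketch: 2-D projections of formulation feature vectors with PCA and
# t-SNE, as in the clustering analysis above. Populations are synthetic.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(2)
# Two invented populations: low-MW "fast" vs. high-MW "slow" formulations,
# each described by 5 placeholder features
fast = rng.normal(loc=-2.0, scale=0.5, size=(40, 5))
slow = rng.normal(loc=+2.0, scale=0.5, size=(40, 5))
X = np.vstack([fast, slow])

pca_2d = PCA(n_components=2).fit_transform(X)
tsne_2d = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(X)
# Scatter-plotting either projection, colored by release class, would show
# the separation that motivated the design criteria in Fig. 2(C).
```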
Recently, Kumar et al. addressed this issue by revisiting an established polymer library and determining whether the same design constraints apply to delivery of a different cargo.25 These 43 polymers spanned commonly investigated cationic and hydrophilic monomers and were originally investigated as vectors for ribonucleoprotein (RNP) delivery. That study demonstrated that successful delivery of RNP cargo was most dependent on the polyplex surface charge and the degree of cooperativity during polymer deprotonation (nHill). In the new study, the same library was re-examined with the following objectives: (1) identify polymers that efficiently facilitate intracellular delivery of plasmid DNA (pDNA), (2) determine whether the design constraints for RNP payloads are relevant to pDNA payloads, (3) co-deliver RNP and pDNA payloads for homology-directed repair, and (4) translate the results to specific targets. Eight candidates showed a substantial increase in transgene expression, with the polymer P38 (poly(DIPAEMA52-st-HEMA50), comprising 2-(diisopropylamino)ethyl methacrylate and 2-hydroxyethyl methacrylate monomers) emerging as the lead candidate from the library screen. Quantitative confocal microscopy showed that P38 effectively delivered pDNA to two distinct cell types, HEK293T and ARPE-19, with relatively high levels of nuclear import and the ability to escape endosomal compartments.
P38 was also the lead candidate for RNP delivery, so it was initially suspected that the polymeric design criteria might be identical for both RNP and pDNA delivery. To further elucidate structure–function relationships between polymer attributes and payload type, SHAP analysis was used. It revealed that the design parameters affecting cellular uptake, delivery efficiency, and toxicity are all cargo dependent (Fig. 3). Notably, RNP delivery depends on hydrophobic interactions in addition to electrostatic interactions, both of which are necessary for cytosolic release. By contrast, hydrophobic interactions are negligible for successful pDNA delivery, which instead relies on optimizing polycation protonation equilibria and pDNA binding affinity. Despite the payload-dependent divergence in vector design, it is important to note that polymer compositions such as P38 can simultaneously satisfy the requirements of both payloads. This was demonstrated by using P38 to co-deliver RNP and pDNA payloads for homology-directed repair at a higher rate than JetPEI, a commercial polymer routinely used as a gold standard in gene delivery. In addition to identifying a promising polymer for delivery of two distinct payloads, this work introduces a robust framework for deconvoluting payload-specific structure–function relationships.
Fig. 3 (A) SHAP illustrates the importance and contributions of polyplex features to delivery efficiency, cellular toxicity, and uptake for pDNA (pink) and RNP (blue) payloads. (B) Average treatment effect (ATE) analysis estimates causal structure–function trends of pDNA polyplexes from the top features from SHAP analysis. Positive and negative effects (error bars denote 95% CI) denote protagonistic and antagonistic relationships, respectively. Adapted with permission from ref. 26. Copyright 2022 American Chemical Society. |
Tamasi et al. recently showed a unique approach to screen such polymer–protein hybrids (PPHs) using a learn-design-build-test paradigm for three model enzymes.30 In this report, the authors prepared a series of heteropolymers that varied (1) the number of methacrylate monomer combinations (Fig. 4(A)), (2) the balance of ionic, hydrophilic, and hydrophobic moieties (composition limited to ≤70 mol% hydrophobic and ≤50 mol% ionic monomer), and (3) the targeted degree of polymerization (DP; from 50 to 200). PPHs were formed with horseradish peroxidase (HRP), glucose oxidase (GOx), and lipase (Lip) and subjected to thermal stress. The output objective was retained enzyme activity (REA), defined as the ratio of the activity level following thermal stress to the initial activity level. More than 500 unique heteropolymers were prepared for enzymatic activity screening.
Fig. 4 Active learning enables rational design of polymer–protein hybrids (PPHs), comprising random copolymers with compositions that compatibilize protein surfaces. (A) Rendered surface chemistries of horseradish peroxidase (HRP), glucose oxidase (GOx), and lipase (Lip) whose amino acid attributes (ionic = blue, hydrophilic = green, hydrophobic = magenta) correspond to selected methacrylate chemistries. (B) Schematic of a “Learn–Design–Build–Test” PPH discovery paradigm, which includes Gaussian process regression surrogate models, Bayesian optimization, automated synthesis by a robotic platform, and high-throughput characterization assays. (C) Representative analysis reveals distinct priorities in copolymer features for each protein by normalized mean absolute SHAP explanations. Adapted with permission from ref. 30. Copyright 2022 John Wiley & Sons.
Closed-loop optimization was carried out by first training Gaussian process regression (GPR) models on a dataset of 504 initial polymers, followed by Bayesian optimization of the GPR model to down-select and identify lead polymer candidates for further synthesis campaigns. Enzyme stability assays then expanded the polymer–enzyme activity database for further model training and materials design (Fig. 4(B)). This workflow allowed the authors to better understand how chemical features influenced PPH performance for each protein. While calculated SHAP values of REA showed expected trends, some unexpected relationships were revealed. For instance, smaller chain lengths and the hydrophobic monomer MMA were favorable for HRP, but the introduction of other hydrophobic monomers such as BMA was not beneficial (Fig. 4(C)). The authors proposed a possible mechanism of HRP stabilization as chaperone-like assistance from shorter copolymer sequences that prevented structural refolding. SHAP analyses of GOx and Lip showed distinct differences in heteropolymer design that were further improved by round-by-round Bayesian optimization coupled with experiments. This platform illustrated how ML workflows coupled with high-throughput materials experimentation can bring greater insight and speed to the construction of designer PPHs.
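One round of this closed loop can be sketched as follows, assuming a toy one-dimensional composition variable and an invented activity response: a GPR surrogate is fit to a few "measured" points, and an expected-improvement acquisition ranks candidate compositions for the next build-test round. The kernel, ranges, and response are all assumptions for illustration.

```python
# Hedged sketch of the Learn-Design steps: GPR surrogate + expected
# improvement (a common Bayesian-optimization acquisition). Synthetic data.
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(3)

def rea(x):
    # Hidden "true" retained enzyme activity over a toy composition variable
    return np.exp(-(x - 0.6) ** 2 / 0.05)

# "Build-Test": a handful of measured polymers
X_meas = rng.uniform(0, 1, size=(8, 1))
y_meas = rea(X_meas).ravel()

# "Learn": fit the GPR surrogate
gpr = GaussianProcessRegressor(kernel=RBF(0.1), alpha=1e-4).fit(X_meas, y_meas)

# "Design": expected improvement over a grid of candidate compositions
X_cand = np.linspace(0, 1, 101).reshape(-1, 1)
mu, sigma = gpr.predict(X_cand, return_std=True)
best = y_meas.max()
z = (mu - best) / np.maximum(sigma, 1e-9)
ei = (mu - best) * norm.cdf(z) + sigma * norm.pdf(z)
next_x = float(X_cand[np.argmax(ei), 0])  # composition to synthesize next
```

The selected `next_x` would feed the robotic synthesis step, and the new assay result would be appended to the training set, closing the loop.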
However, translation of promising polymer/drug leads from the bench requires significant capital investment and resources on the path towards commercialization. In the remainder of this Highlight, we focus on nonviral gene therapy in particular, where there are numerous opportunities to produce more affordable and safer genetic medicines.31 Some grand challenges stem from open-ended questions in molecular biology and nanomedicine, while others are more practical, concerning lab automation and establishing digital ecosystems for enabling material discovery. We discuss these topics and what may be needed for future pharmaceutical applications. More focused reviews on AI and nanomedicine,32 nucleic acid therapeutics with polymer complexes,33 and automation and data-driven design of polymer therapeutics34 are available elsewhere. Furthermore, other biotherapeutics, such as ribonucleoproteins35 or therapeutic peptides,5 are outside the scope of this article.
Cargo | Delivery destination | Notable attributes | Select therapeutic product example(s)44
---|---|---|---
Plasmid DNA (pDNA) | Nucleus | • Long (1000s bp^a), double-stranded, circular molecule • Versatile and robust, with low production cost from bacterial culture • Requires entry through the restrictive nuclear barrier | • N/A; some DNA vaccines have been approved for veterinary use, such as for treating canine melanoma (2010)
Antisense oligonucleotide (ASO) | Cytoplasm (RNA) | • Short (∼20 bases), single-stranded, linear molecule • Forms duplexes with RNA targets to promote RNase degradation or to sterically block translation • Often limited by inefficient internalization and endosomal escape • Chemical modifications are commonly used in ASO design | • Kynamro (FDA approved 2013) • Waylivra, volanesorsen (approved 2019)
Messenger RNA (mRNA) | Cytoplasm | • Long (100–1000s bases), single-stranded, linear molecule • High expression, versatility, and therapeutic efficacy • Susceptible to RNase degradation, endosomal entrapment, and immune stimulation/response • Delivery vehicles are multicomponent, e.g. lipid nanoparticles (PEGylated lipids, ionizable lipids, helper lipids, cholesterol), lipid/polymer hybrid nanoparticles (PEGylated lipids, cationic lipids, helper polymers), and polyplexes (cationic polymers) | • Comirnaty, tozinameran (lipid nanoparticle mRNA; FDA EUA 2020) • mRNA-1273 (lipid nanoparticle mRNA; FDA EUA 2020)
Small interfering RNA (siRNA) | Cytoplasm (RNA) | • Short (15–30 bp), double-stranded, linear molecule • Forms complexes that bind and cleave mRNA to block translation • Complementary to a single target RNA, down-regulating expression of the encoded protein • Many siRNA approaches are motivated by cancer therapy | • Onpattro, patisiran (lipid nanoparticle RNA; FDA approved 2018) • Givlaari, givosiran (GalNAc–siRNA conjugate; FDA approved 2019) • Oxlumo, lumasiran (GalNAc–siRNA conjugate; FDA approved 2020)

^a bp = base pairs, complementary repeat units in a nucleic acid molecule.
Oligonucleotides (typically defined as fewer than 100 bases) are short, single-stranded linear nucleic acids. In comparison to pDNA, they exhibit different complexation behavior that has been linked to differences in charge density, chain flexibility, hydrophilicity, and helicity.37 Antisense oligonucleotides (ASOs) are a payload class used for gene silencing. ASOs are short (∼20 bases) and designed to bind to an endogenously expressed messenger RNA (mRNA) molecule. The ASO–mRNA duplex is then recognized and degraded by RNase H, resulting in reduced expression of the gene encoded by that mRNA.38 Unlike pDNA, ASOs do not need to enter the nucleus to have an effect: they silence gene expression through mRNA binding in the cytoplasm. Delivery vehicles remain important for ASOs, however, as they can facilitate uptake and protect against degradation.38
Gene silencing can be achieved through the delivery of short RNA molecules such as siRNA. siRNAs are short, double-stranded RNA molecules that can suppress gene expression through the action of a cytosolic protein complex known as an RNA-induced silencing complex (RISC). The siRNA sequence is designed to be complementary to part of an endogenous mRNA. The antisense strand of the siRNA guides the RISC to cleave the endogenous mRNA, thereby inhibiting protein production of the encoded gene. Like mRNA, siRNA nucleotides can be modified to protect against degradation. Delivery vehicles for siRNA can improve the cargo's stability and cellular uptake.38 Hu et al. provide an extensive historic overview of therapeutic siRNA and a roadmap of their opportunities based on pre-clinical and clinical delivery platforms.45
Robotic systems can handle rote work and enable more complex operations to be completed in parallel. As an example, consider the use of liquid handling to prepare polyplexes. Input variables associated with mixing (i.e., polymer concentration, cargo concentration, and solution salinity or pH) lead to an enormous space of self-assembly outcomes that impact polyplex size, stability, and nanoparticle dynamics. Fig. 5 highlights the challenges that arise with just three input parameters: polymer molecular weight, N/P ratio (the amine-to-phosphate ratio of polymer to nucleic acid), and solution pH. A systematic screen of 10 polymer molecular weights combined with a single cargo at 10 N/P ratios and 10 pH levels translates to 1000 unique polyplexes, which could conceivably be plated in hours with automation (Fig. 5(A)). However, subdomains of the total parameter space can be probed more efficiently with prioritization assistance from MI techniques (Fig. 5(B)), identifying local optima (“hot spots”) of activity. Fig. 5(C) illustrates how these hot spots can be deconstructed further to examine additional dimensional factors. In this manner, nucleic acid stabilization and release can be better understood as a function of input parameters for subsequent polyplex characterization workflows.
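The combinatorial bookkeeping above is straightforward to automate. A sketch of the 10 × 10 × 10 worklist a liquid handler could consume follows; the values on each axis are placeholders, not a recommended screen.

```python
# Hedged sketch: enumerate the 10 x 10 x 10 polyplex design grid
# (polymer MW x N/P ratio x pH) as a liquid-handler worklist.
# All axis values are illustrative placeholders.
from itertools import product

polymer_mw_kda = [5, 10, 15, 20, 25, 30, 40, 50, 75, 100]
np_ratios = [1, 2, 3, 4, 5, 6, 8, 10, 15, 20]
ph_levels = [4.0, 4.5, 5.0, 5.5, 6.0, 6.5, 7.0, 7.4, 8.0, 8.5]

worklist = [
    {"polymer_mw_kda": mw, "n_to_p": np_, "ph": ph}
    for mw, np_, ph in product(polymer_mw_kda, np_ratios, ph_levels)
]
# len(worklist) == 1000 unique polyplex formulations, one per well/plate slot
```

An MI-guided campaign would replace this exhaustive grid with a prioritized subset of these rows, as described for Fig. 5(B).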
In the biological parameter space, automation and high-throughput screening have progressed significantly from their origins in small molecule drug discovery. Comprehensive reviews4,34,51 present high-throughput evaluation of bioperformance in cellular and animal models, and we direct readers to these works for more detailed perspectives. A common thread for these emerging techniques is establishing more autonomous workflows in the lab infrastructure (e.g., plate preparation from libraries, assay standardization, or incorporation of non-invasive analytical techniques) and accelerating decision making from the large quantity of collected data. We discuss these points further below.
Following a top-down approach to meet a defined system's requirements, a map of equipment tasks should first be developed hierarchically. Each instrument should be chosen to perform in an automation-friendly manner using (ideally) commercially available labware and consumables. Screening materials can be visualized at several levels, each of which has defined input and output (I/O) values. Building level 1 involves enumerating a list of lab-wide equipment; these I/O nodes should each have a clearly defined purpose. Level 2 of the automation hierarchy organizes the nodes in a logical sequential order. Process modularity, defined as the ability to insert either redundant or alternative equipment at a process node, is crucial for engineering the pipeline to be adaptable and efficient. Creating a map of a work cell that depicts each step in the process, with the I/O of each machine sub-system, can be helpful.
Level 3 of the automation hierarchy focuses on optimizing resource utilization and the time course of each process. Delays or sample storage backlogs in the protocol are evidence of inefficiencies in the pipeline that can be resolved with redundancy removal or alternatives. Gantt charts may be useful in identifying timeline inefficiencies. From this, modifications to workflows can be gauged for their impact on sample production. Finally, at level 4, technical issues in the function of each process node are considered. Individual performance of each step is evaluated at the system level, whereas each device is detailed separately as its own finite system.
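The level-1/level-2 exercise above can be sketched as a list of I/O nodes chained into a work cell, with a simple check that each node's inputs are satisfied by upstream outputs. The instrument names and I/O labels here are illustrative only, not a prescribed configuration.

```python
# Hedged sketch: enumerate equipment as I/O nodes (level 1) and verify the
# sequential chaining of a work cell (level 2). Names are illustrative.
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    inputs: tuple
    outputs: tuple

work_cell = [
    Node("liquid_handler", ("polymer_stock", "cargo_stock"), ("polyplex_plate",)),
    Node("plate_sealer", ("polyplex_plate",), ("sealed_plate",)),
    Node("plate_reader", ("sealed_plate",), ("absorbance_csv",)),
]

# Level-2 check: each node's inputs must come from starting stocks or an
# upstream node's outputs, in order
available = {"polymer_stock", "cargo_stock"}
sequence_ok = True
for node in work_cell:
    if not set(node.inputs) <= available:
        sequence_ok = False  # a node is missing an upstream input
    available |= set(node.outputs)
```

Swapping or duplicating a `Node` in this list is the process modularity described above; the same check then reveals whether the pipeline still chains correctly.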
High-throughput synthesis and screening from automated system workflows must rely on an equally organized data management plan. Predictive ML models can find patterns in high-dimensional data only if the data and its metadata can be connected. Metadata contextualizes data by providing machine-readable descriptions and explainability. We conclude below with our perspective on creating a digital infrastructure that can accommodate enormous chemical and biological datasets in polymer-based gene therapy.
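As a minimal illustration, a measurement record can carry its metadata in a machine-readable envelope. The schema and field names below are invented for this sketch, not a community standard.

```python
# Hedged sketch: a measurement bundled with the metadata that contextualizes
# it, serialized so downstream ML pipelines can join data with its context.
import json

record = {
    "data": {"retained_enzyme_activity": 0.82},
    "metadata": {
        "sample_id": "PPH-0042",              # illustrative identifier
        "instrument": "plate_reader_01",      # which node produced the data
        "assay": "HRP activity, post thermal stress",
        "operator": "automated",
        "timestamp": "2023-06-01T14:30:00Z",
        "protocol_version": "1.3",
    },
}

# Round-trip through JSON: the record stays machine-readable and lossless
serialized = json.dumps(record, sort_keys=True)
restored = json.loads(serialized)
```

Agreeing on such a schema within a lab or organization is the standardization step discussed later for nonviral delivery materials.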
A fully automated set-up is ideal when the run can be automated with available hardware and an application programming interface (API). From the digital infrastructure point of view, data acquisition can often be integrated with software packages such as ChemOS,59 HELAO,60 or, for physical simulations, ChemOS 2.0.61 These platforms offer high-throughput experimentation and high-throughput virtual screening capabilities with ML training.62 This is a common theme for closed-loop optimization, with ML examples in autonomous laboratories for inorganic,63 organic64 and polymer chemistry,19,65–67 and even in nonviral delivery with lipid nanoparticles.68
In practice, however, full automation often cannot be implemented economically. Experiments may have long turnaround times, reducing efficiencies in data acquisition. Hardware may lack a suitable API for I/O designation for development and optimization.69,70 Here, standardized electronic laboratory notebooks (ELNs) can play an essential role. ELNs offer a user-friendly interface to record data through an API, so that data entry is both rapid and captured in a digitally useful form for MI applications.71 Additional researcher input can be provided to improve experimental protocols, record quality-control notes, and constrain data ingestion as appropriate.15 Importantly, ELN adoption can provide access to negative results that are crucial for building balanced datasets; negative data are usually not accessible or reported in public sources, a recognized problem in the field.72,73 Commercially available software packages now offer APIs in combination with in-house development tools.74
The third source of data is published information. Public datasets for polymers are growing,75,76 but much remains to be done compared with other materials-genome or protein databases. Most published information is still not immediately machine-actionable and requires substantial data extraction. Although large datasets of small molecules are widely available, they are often inadequate for extrapolating to polymers because they do not consider polymer synthesizability or other chemical constraints. Nevertheless, several polymer datasets exist. PoLyInfo77 and Polymer Genome78 are proprietary databases of existing polymers focused on the physical, mechanical, electrical, and chemical properties of, mostly, homopolymers. Querying these databases at scale is restricted, and only a fraction of the polymers and measured parameters are relevant to delivery applications. More varied datasets have been proposed by constructing virtual polymers with generative deep learning models, including PI1M,79 polyBERT,80 and the Open Macromolecular Genome (OMG).81 In our opinion, the most relevant strategies to leverage these datasets are: (i) limited screening of restricted databases such as PoLyInfo and Polymer Genome based on known delivery vehicles, (ii) virtual screening of large virtual databases such as PI1M, OMG, and polyBERT (or fine-tuning their open-source generative models and representations), and (iii) for a polymer system with drug-delivery potential, constructing a dataset and/or ML model by screening small-molecule databases under polymer synthesizability constraints, with an approach similar to the OMG.
With new AI technologies, datasets can be automatically extracted from text and figures, even for complex structures such as metal–organic frameworks,82 catalysts,83 and chemical reaction schemes.84–86 The rapid rise of generative models can also be harnessed to aggregate molecular data from public resources.87 Although not yet widely used in the field, deep learning has significant potential to accelerate polymer design in drug delivery. Beyond the discussed applications of generative models to define an accessible space for chemical design and suggest promising vehicle candidates, deep learning also holds great potential for powerful polymer representation and generalization. Natural language architectures such as polyBERT learn useful representations for polymer property prediction80 and can generate linear random polymers. For more complex polymer architectures, graph representations88 and extensions of BigSMILES89 have been proposed that can serve as input to transformer or graph neural network architectures.
A vital piece of the data processing pipeline is metadata collection. Metadata arises from the different unit operations and parameters needed to understand results. Standardized formats for data and metadata have been agreed upon by some materials communities, such as crystallography schema.94 However, there is no consensus yet for nonviral drug delivery materials, including polymers. Conversations are still needed to define and standardize metadata within the same lab group or organization to ensure data quality, consistency, and completeness.95
However, it remains unclear how to best address aspects of these principles while protecting intellectual property (IP), especially within companies where data is strategically valuable. In a thoughtful perspective,101 Delannoy describes this problem between academic and industrial stakeholders: “the productivity of a project is measured by its capacity to transpose the research into products or services that will create business opportunities for the company. Patents are protected and even confidential for some time before being publicly published. For the academic partner, research is mostly evaluated by the scientific publications and communications that are published during a project. Papers are clearly public. It is therefore key to ensure a good balance between the protection of IP that the company needs and the objective to publish that the university seeks.” Biopharma groups have considered aspects of this as part of their digital transformation process,102 but it still remains unclear how to best handle this issue moving forward.
Many approved gene therapies to date can be traced to academic laboratories or small companies.103 While important advances have been made, scaling of these efforts has been hampered by both the complexity of the data and the difficulty of integrating and interpreting disparate data streams. MI methods offer enormous potential because of the possibility of scaling experimental and computational methods, including polymer-cargo formulation. In our own experience, we believe that MI can be leveraged for the efficient design of polymer nanoparticles for nonviral gene therapy. Our SAYER™ platform104 connects high-throughput polymer chemistry, polyplex nanoparticle characterization, screening/delivery in vitro and in vivo, and predictive polymer design guided by ML and AI. Fit-for-purpose delivery vehicles can be rapidly produced, down-selected for cargo and tissue specificity, and assayed in a time- and cost-effective manner.
Advances in the integration of experimental and MI methods will continue to evolve over the coming decades. The existing R&D landscape will likely develop in profound and unexpected ways. The rise of generative AI is one such high-impact example. We have attempted above to demonstrate how this integration of experimental and MI methods can be used to discover unexpected correlations and new insights into structure–property relationships. For the field of gene therapy in particular, involving highly complex interplay between physical, biological, and clinical factors, such modalities can fuel more rapid innovation and affordable solutions that benefit people worldwide.
• SHAP: SHapley Additive exPlanations, a method used to understand the role of input features in an ML model.
• PCA: Principal component analysis, a method to transform high-dimensional data to lower dimensions.
• tSNE: T-distributed Stochastic Neighbor Embedding, a nonlinear dimensionality-reduction technique used to visualize high-dimensional data in 2 or 3 dimensions.
• BO: Bayesian optimization, an algorithm typically used to optimize objectives that are expensive or difficult to evaluate.
• GPR: Gaussian process regression, a nonparametric regression method that provides predictions with uncertainty estimates.
• API: Application programming interface, a protocol that allows programs to read, write, and/or manage different instances programmatically, such as software applications or a piece of hardware.
• ELN: Electronic laboratory notebook, a software application that allows experimentalists to keep a digital logbook.
This journal is © The Royal Society of Chemistry 2023