Nathan J. Szymanski and
Christopher J. Bartel*
University of Minnesota, Department of Chemical Engineering and Materials Science, Minneapolis, MN, USA 55455. E-mail: cbartel@umn.edu
First published on 4th July 2025
Generative artificial intelligence offers a promising avenue for materials discovery, yet its advantages over traditional methods remain unclear. In this work, we introduce and benchmark two baseline approaches – random enumeration of charge-balanced prototypes and data-driven ion exchange of known compounds – against four generative techniques based on diffusion models, variational autoencoders, and large language models. Our results show that established methods such as ion exchange are better at generating novel materials that are stable, although many of these closely resemble known compounds. In contrast, generative models excel at proposing novel structural frameworks and, when sufficient training data is available, can more effectively target properties such as electronic band gap and bulk modulus. To enhance the performance of both the baseline and generative approaches, we implement a post-generation screening step in which all proposed structures are passed through stability and property filters from pre-trained machine learning models including universal interatomic potentials. This low-cost filtering step leads to substantial improvement in the success rates of all methods, remains computationally efficient, and ultimately provides a practical pathway toward more effective generative strategies for materials discovery. By establishing baselines for comparison, this work highlights opportunities for continued advancement of generative models, especially for the targeted generation of novel materials that are thermodynamically stable.
New concepts
This work establishes comprehensive baselines for the generative discovery of inorganic crystals, comparing traditional methods with generative approaches. By developing and benchmarking two baselines – random enumeration of charge-balanced prototypes and data-driven ion exchange – against four generative models, we demonstrate a new framework for assessing trade-offs in generative materials discovery. The baseline methods excel at proposing stable materials, while the generative models offer greater structural novelty and outperform baselines in identifying materials with targeted properties, such as band gaps near 3 eV. However, both approaches face challenges in finding materials with exceptional properties, such as maximizing bulk modulus, underscoring the need for more diverse training data. By introducing machine learning-based screening as a post-generation filtering step, we significantly enhance success rates across all approaches. This integration highlights a subtle yet crucial balance between stability, novelty, and property optimization, offering a clear framework to evaluate generative AI in materials science while supporting the advancement of future models.
The recent emergence of generative artificial intelligence (AI) offers a promising route for designing new materials, particularly inorganic crystals.5,6 Early efforts focused on generative adversarial networks (GANs)7–9 and variational autoencoders (VAEs),10–14 while more recent developments include large language models (LLMs),15–19 diffusion-based techniques,20–24 normalizing flows,25–27 and geodesic random walks.28 These models are often trained on computed materials from open databases such as the Materials Project29 to generate thermodynamically stable structures, with some also conditioned on specific properties for application-driven campaigns.
There exists a growing number of successes in generative AI for materials design, with validation provided by ab initio calculations and experimental synthesis. For example, text-based models such as Chemeleon30 have leveraged contrastive learning to align crystal structures with natural language descriptions, enabling composition- and structure-conditioned generation in a variety of chemical spaces – most notably achieving successful phase prediction in the Li–P–S–Cl system relevant to solid-state batteries. Other models such as FlowLLM26 combine language representations with Riemannian flow matching to refine generated structures, increasing their stability rate threefold. Among diffusion models, MatterGen22 has emerged as a particularly effective method capable of generating materials with targeted chemistry, symmetry, and functional properties – even leading to the synthesis of an AI-generated compound, TaCr2O6, whose experimentally measured bulk modulus was within 20% of the predicted value.
Despite these recent successes, it remains difficult to systematically assess the performance of different generative models in a consistent fashion. Tools like matbench-genmetrics provide important frameworks and metrics for evaluating the validity of structures proposed by generative models,31 while matbench-discovery addresses the challenge of benchmarking stability predictions made by machine learning (ML) models and interatomic potentials.32 Yet, the extent to which generative models outperform established methods, such as ion exchange or high-throughput screening, is not yet fully understood. Baselines are therefore essential to clarify where these models offer the greatest advantages – whether in producing stable materials, generating novel structures, or achieving targeted properties – and to identify their limitations. Such benchmarks are key to integrating generative models into existing workflows and driving tangible progress in materials discovery.
In this work, we establish two baseline methods for the generation of inorganic crystals: random enumeration of charge-balanced chemical formulae in structure prototypes sourced from the AFLOW database,33,34 and ion exchange performed on stable compounds with desired properties from the Materials Project.29 These methods are benchmarked against four generative models – CrystaLLM,15 FTCP,12 CDVAE,13 and MatterGen22 – for the generation of (1) materials that are stable and novel, (2) materials with a band gap near 3 eV, and (3) materials with high bulk modulus. We also integrate two graph neural networks, CHGNet35 and CGCNN,36 to filter and retain generated materials predicted to be stable or exhibit desired properties. This evaluation sheds light on the comparative strengths and weaknesses of traditional and generative approaches to materials discovery, while also providing a set of baselines against which future generative models can be benchmarked.
Fig. 1 Histograms showing DFT-computed decomposition energies (ΔEd) of novel materials (not already present in the Materials Project) generated by two baseline methods and four generative models. For each of these six approaches, 500 materials were considered. The left column (blue) contains results from the baseline methods: random enumeration and ion exchange. The right columns (red) contain results from the generative models: CrystaLLM,15 FTCP,12 CDVAE,13 and MatterGen.22 Triangular markers indicate median decomposition energies. |
Fig. 2 Cumulative distribution functions (CDFs) showing the percentage of materials that satisfy a decomposition energy (ΔEd) cutoff, with each line color-coded by the method used to generate these materials. The left panel displays CDFs for 500 novel materials generated directly by each method, including two baseline approaches (random enumeration and ion exchange) in blue tones and four generative models (CrystaLLM, CDVAE, FTCP, and MatterGen) in red/purple tones. The right panel displays CDFs for 500 novel materials filtered by CHGNet-predicted stability, including only those CHGNet predicts to have ΔEd ≤ 0. Filtered energy distributions are also displayed in ESI,† Fig. S7. |
FTCP12 encodes materials using real-space features (lattice vectors, one-hot encoded element vectors, site coordinates, and occupancies) and reciprocal-space features derived from a Fourier transform of elemental property vectors. Two FTCP-based autoencoders were trained on the MP-20 dataset: one conditioned on formation energy and electronic band gap and another conditioned on formation energy and bulk modulus. Due to the limited availability of elastic property data, the latter autoencoder was trained and validated on a subset of MP-20 containing 9361 materials. Materials were then generated by randomly sampling points in the latent space nearby known materials from the training set. Default hyperparameters supplied at https://github.com/PV-Lab/FTCP were used for the training and generation steps.
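This nearby-sampling strategy can be sketched in a few lines (hypothetical latent dimensions and noise scale, not the authors' code): each new latent vector is a randomly chosen known embedding plus isotropic Gaussian noise, with the noise scale controlling how far generation strays from the training set.

```python
import numpy as np

def sample_near_known(z_known, n_samples, sigma, seed=0):
    """Draw latent vectors by perturbing randomly chosen known embeddings
    with isotropic Gaussian noise; sigma sets how far sampling strays."""
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(z_known), size=n_samples)
    noise = rng.normal(0.0, sigma, size=(n_samples, z_known.shape[1]))
    return z_known[idx] + noise

# Toy latent space: 100 "known" materials embedded in 32 dimensions
z_known = np.random.default_rng(1).normal(size=(100, 32))
z_new = sample_near_known(z_known, n_samples=500, sigma=0.1)
```

Decoding these perturbed vectors yields structures close to the reference materials, which is consistent with the stability/novelty behavior of FTCP discussed later in this work.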
CDVAE13 combines a variational autoencoder (VAE) with a diffusion model to generate new materials. Sampling from the latent space predicts composition, lattice vectors, and the number of atoms in the unit cell, which are used to randomly initialize structures. The diffusion component of the model then “de-noises” these random structures by iteratively perturbing atoms toward equilibrium positions. We trained CDVAE on MP-20 without any conditioning of its latent space, allowing it to be used only for the generation of stable materials. During the generation step of each model, we did not place any constraints on the elements or symmetries that may be created. Default hyperparameters supplied at https://github.com/txie-93/cdvae were used for the training and generation steps.
MatterGen22 is a more recent diffusion model designed to create stable materials by jointly denoising atom types, coordinates, and lattice parameters through an equivariant score network. MatterGen can be fine-tuned for targeted properties but was used here in its base configuration (available at https://github.com/microsoft/mattergen) for unconstrained generation, re-trained on the MP-20 dataset for consistent comparison with the other models evaluated in this work. This model was used to generate structures in batches of 128 using the provided command-line tool. A similar procedure was used to generate materials from a MatterGen model that was pre-trained on a much larger dataset (Alex-MP-20), with results provided in the ESI.†
Thermodynamic stability with respect to all known competing phases in MP was evaluated using the decomposition energy (ΔEd).43 For unstable materials with ΔEd > 0, this measure is equivalent to the energy above the convex hull (Ehull). It quantifies the energy difference between the proposed material and the lowest energy combination of competing phases. For stable materials with ΔEd ≤ 0, the decomposition energy is the energy by which the proposed material lies below the existing convex hull (if the proposed material were not included in its construction). Total energies acquired from DFT calculations were transformed into formation energies (ΔEf) using the MaterialsProject2020 compatibility scheme, which accounts for GGA/GGA+U mixing and implements elemental reference energy corrections as described in previous work.44 Competing phases were identified by constructing a phase diagram for each chemical system using the PhaseDiagram module from pymatgen.37 These phase diagrams included all entries from MP (as of June 2025) as well as the entries generated from each AI model or baseline approach, allowing for an evaluation of stability against both known and hypothetical phases.
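pymatgen's PhaseDiagram module handles the general multicomponent construction; the underlying idea can be illustrated for a binary A–B system, where the hull reduces to the lower convex envelope of formation energy versus composition. A self-contained sketch with hypothetical phases (energies in eV per atom, not the authors' code):

```python
def lower_hull(points):
    """Lower convex envelope of (x, E) points via Andrew's monotone chain."""
    pts = sorted(points)
    hull = []
    for x, y in pts:
        while len(hull) >= 2:
            (x1, y1), (x2, y2) = hull[-2], hull[-1]
            # Drop the middle point if it lies on or above the chord to (x, y)
            if (x2 - x1) * (y - y1) - (y2 - y1) * (x - x1) <= 0:
                hull.pop()
            else:
                break
        hull.append((x, y))
    return hull

def hull_energy(x, hull):
    """Piecewise-linear interpolation of the hull at composition x."""
    for (x1, y1), (x2, y2) in zip(hull, hull[1:]):
        if x1 <= x <= x2:
            return y1 + (y2 - y1) * (x - x1) / (x2 - x1)
    raise ValueError("composition outside [0, 1]")

def decomposition_energy(x, e_f, competing):
    """ΔEd of a candidate at composition x (fraction of B) with formation
    energy e_f, relative to the hull of the competing phases alone."""
    return e_f - hull_energy(x, lower_hull(competing))

# Hypothetical A-B system: elements at the terminals plus one stable compound
phases = [(0.0, 0.0), (0.5, -1.0), (1.0, 0.0)]
```

A candidate at x = 0.25 with ΔEf = −0.4 eV/atom sits 0.1 eV/atom above the hull (ΔEd > 0), while one at x = 0.5 with ΔEf = −1.2 eV/atom lies 0.2 eV/atom below the existing hull (ΔEd < 0), matching the sign convention described above.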
Electronic band gaps were computed by analyzing the eigenvalue band properties obtained from VASP calculations using pymatgen,37 with the band gap defined as the energy difference between the valence band maximum and conduction band minimum. The bulk modulus of each structure was computed by fitting a Birch–Murnaghan equation of state45 to relaxed (but fixed volume) total energy calculations performed at seven volumes ranging from 97% to 103% of the equilibrium volume. These volumes were generated by isotropically scaling the lattice vectors of the relaxed equilibrium structures. The equilibrium bulk modulus and its pressure derivative were extracted from the fit, providing a measure of each material's resistance to volumetric deformation.
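Because the Birch–Murnaghan energy is exactly a cubic polynomial in V^(−2/3), the fit can be carried out with a single linear least-squares step. A minimal sketch (not the authors' code), assuming energies in eV and volumes in Å3:

```python
import numpy as np

EV_A3_TO_GPA = 160.2176634  # 1 eV/Å^3 in GPa

def birch_murnaghan_energy(V, E0, V0, B0, B0p):
    """Birch-Murnaghan equation of state E(V); B0 given in eV/Å^3."""
    x = (V0 / V) ** (2.0 / 3.0)
    return E0 + 9.0 * V0 * B0 / 16.0 * (
        (x - 1.0) ** 3 * B0p + (x - 1.0) ** 2 * (6.0 - 4.0 * x)
    )

def fit_bulk_modulus(volumes, energies):
    """Fit E as a cubic in eta = V^(-2/3) (exact for the BM form) and
    return the equilibrium bulk modulus in GPa."""
    eta = np.asarray(volumes, dtype=float) ** (-2.0 / 3.0)
    c3, c2, c1, _ = np.polyfit(eta, energies, 3)
    # Equilibrium: the stationary point of the cubic with positive curvature
    stationary = np.roots([3.0 * c3, 2.0 * c2, c1])
    eta0 = next(r.real for r in stationary if 6.0 * c3 * r.real + 2.0 * c2 > 0)
    # B0 = V d2E/dV2 at V0 = eta0^(-3/2), which reduces to (4/9) p''(eta0) eta0^(7/2)
    return (4.0 / 9.0) * (6.0 * c3 * eta0 + 2.0 * c2) * eta0 ** 3.5 * EV_A3_TO_GPA

# Synthetic check: seven volumes at 97-103% of a V0 = 20 Å^3, B0 = 100 GPa material
vols = 20.0 * np.linspace(0.97, 1.03, 7)
es = birch_murnaghan_energy(vols, -10.0, 20.0, 100.0 / EV_A3_TO_GPA, 4.0)
b0_fit = fit_bulk_modulus(vols, es)
```

The seven-point, ±3% volume grid mirrors the sampling described above; the fit recovers the input bulk modulus to numerical precision on such synthetic data.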
For property-specific screening, we leveraged two CGCNN36 models: one trained on 16458 DFT-calculated band gaps and the other on 2041 bulk moduli from MP. When generating materials with a high bulk modulus, we applied an acceptance criterion of CGCNN-predicted bulk moduli exceeding 200 GPa. Analogously, when generating materials with a band gap near 3 eV, we selected candidates with CGCNN-predicted band gaps in the range of 2.8 to 3.2 eV.
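In code, this screening step reduces to a simple predicate applied to each model prediction. A minimal sketch with hypothetical candidate labels and predicted values (the CGCNN inference call itself is omitted):

```python
def passes_filter(pred, target="band_gap"):
    """Acceptance criteria used for property-targeted screening."""
    if target == "band_gap":
        return 2.8 <= pred <= 3.2      # within 0.2 eV of the 3 eV target
    if target == "bulk_modulus":
        return pred > 200.0            # exceeds 200 GPa
    raise ValueError(f"unknown target: {target}")

# Hypothetical CGCNN band gap predictions (eV) for four candidates
predictions = {"A": 3.0, "B": 1.1, "C": 2.9, "D": 5.2}
kept = [name for name, gap in predictions.items() if passes_filter(gap)]
```

Only candidates passing the predicate proceed to DFT validation, which is what makes the filtering step inexpensive relative to the downstream calculations.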
The high stability rate of ion exchange is impressive but perhaps unsurprising, given its proven efficacy in discovering new materials using high-throughput calculations performed over the past decade.48–53 This is especially true when leveraging known structure prototypes that can host a wide variety of compositions. For example, prior work has identified several hundred stable compositions in the perovskite structure through ion exchange of known materials.54–56 Similar approaches have also been applied to successfully uncover stable compositions in the spinel and delafossite structures,57,58 reinforcing the effectiveness of ion exchange for materials discovery.
Among the generative models, CrystaLLM produced novel materials with the widest range of energies and the highest median decomposition energy (ΔEdmed = 442 meV atom−1), yet a stability rate (2.4%) second only to MatterGen (3.0%). While both remain well below the stability rate of ion exchange (9.2%), the generative models provide unique flexibility in terms of capacity and training data. For example, the CrystaLLM model used in Fig. 1 had 200 million parameters and outperformed a smaller model variant with only 25 million parameters and a lower stability rate of 1.6% (ESI,† Fig. S2). Expanding the training set also improves performance, with CrystaLLM achieving a stability rate of 2.8% when trained on ∼2.3 million structures (ESI,† Fig. S3). Similar improvements are possible with MatterGen, whose stability rate increases from 3.0% to 5.4% when trained on structures from the Alexandria database59 in addition to those from MP-20 (ESI,† Fig. S4).
As both CDVAE and MatterGen are diffusion models, they result in similar distributions of energies, with ΔEdmed of 207 and 188 meV atom−1, respectively. These two models also produce energy distributions comparable to FTCP, a variational autoencoder, which yields ΔEdmed = 205 meV atom−1 and a moderate stability rate of 2.0%. Given the recent success of diffusion models, the competitiveness of FTCP is somewhat unexpected. We attribute this to its latent space sampling strategy, which biases generation toward materials that are structurally similar to known stable compounds. This approach enhances the model's stability rate, similar to the observed benefits of template-based strategies (such as ion exchange) in the baseline methods. However, it also comes at the cost of reduced novelty – 1309 materials were generated from FTCP before obtaining 500 materials not already present in MP (a novelty rate of ∼38.2%).
In this work, novelty is defined as a material being absent from MP. While this does not necessarily indicate the material has never been synthesized or is absent from all computational databases, it signifies that the material was not used for training of the generative models or as a template for ion exchange. All materials shown in Fig. 1 meet this definition of novelty. However, the total number of generated materials required to obtain 500 novel ones varied across methods. The rate at which novel materials were produced by each method is listed in Table 1.
Method | ΔEdmed (meV atom−1) | Stability rate (%) | Novelty rate (%) | Novel prototype rate (%) | Novel prototype stability rate (%)
---|---|---|---|---|---
Random | 409 | 1.4 | 98.6 | 0 | 0 |
Ion exchange | 85 | 9.2 | 72.4 | 0 | 0 |
CrystaLLM | 442 | 2.4 | 98.2 | 1.0 | 0 |
CDVAE | 207 | 1.8 | 96.0 | 8.2 | 0 |
FTCP | 205 | 2.0 | 38.2 | 1.8 | 0 |
MatterGen | 188 | 3.0 | 91.8 | 7.2 | 0 |
Between the two baseline methods, random enumeration yields a much higher novelty rate (98.6%) than ion exchange (72.4%). This reflects the unconstrained nature of random enumeration, which leads to the sampling of many previously unexplored chemical compositions. In contrast, our approach to ion exchange closely reflects traditional screening efforts,60 and is therefore more likely to reproduce materials already present in computational databases such as MP. However, the use of ion exchange also comes with the benefit of generating more stable materials, resulting in a higher stability rate (9.2%) than random enumeration (1.4%).
Three of the generative models – CrystaLLM, CDVAE, and MatterGen – exhibit high novelty rates >90%. In contrast, only 38.2% of the materials generated by FTCP are novel. This result is consistent with FTCP's strategy of sampling around known materials in its latent space, a factor that likely also contributes to its reasonably high stability rate. To assess the impact of the sampling strategy used by FTCP, we generated several new sets of materials at iteratively greater distances from known materials in its latent space. The results, shown in ESI,† Table S2 and Fig. S5, demonstrate that sampling further away from known materials leads to higher novelty rates (reaching 95%) but also lower stability rates (≤1%). The clear inverse correlation between these two metrics underscores the tradeoff that exists between stability and novelty during materials discovery campaigns.
To more broadly assess novelty, we examined the fraction of generated materials absent from two additional sources: Alexandria,59 a computational database containing over 4.5 million structures, and the ICSD,61 an experimental database with approximately 300000 structures. As detailed in ESI,† Table S3, the novelty rate of each method decreased slightly upon comparison to these additional databases. However, the overall trends remain unchanged: random enumeration achieves the highest novelty rate (94.0%) while FTCP exhibits the lowest (35.0%). Notably, even this lower novelty rate constitutes a substantial fraction of the generated materials, suggesting ample opportunity for materials discovery remains even as these expansive databases continue to grow. We argue, however, that novelty rates matter less than stability rates, since novelty assessments are computationally inexpensive compared to stability assessments, which require DFT calculations.
It is worth noting that the stability rates of novel materials reported in this work are generally lower than the “SUN rates” (stability, uniqueness, and novelty) reported in prior work. For example, MatterGen and CDVAE have previously reported SUN rates of ∼38% and ∼14%, respectively.22 Both values are much higher than the 1.8–3.0% stability rates found in our current study. This discrepancy arises primarily from differences in stability criteria. Previous work considered all materials within 100 meV atom−1 of the convex hull to be “stable.” Directly comparable metrics based on this definition are provided in ESI,† Table S4. However, we enforce a stricter definition of stability in the main text of our work, requiring that materials lie on the convex hull (ΔEd ≤ 0) to be considered stable. While many previously synthesized materials are computed to be thermodynamically unstable with DFT, the likelihood that a material can be synthesized decreases as the magnitude of this instability grows.62,63 It is difficult to define a general “rule-of-thumb” for accessible ΔEd values, as this will depend on the nature of a material, its competing phases, and available synthetic routes.43,63,64 Materials computed to be on the hull are likely to be stable at ambient conditions, though synthesizing even hull-stable materials can be challenging.65–67 We argue that a stricter stability criterion is more meaningful, though a looser cutoff may still be appropriate if one is less concerned with the risk of false positives that would lead to unsuccessful synthesis attempts.
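The sensitivity of reported rates to the stability cutoff is easy to illustrate. A minimal sketch with hypothetical ΔEd values shows how the strict on-hull criterion and a 100 meV atom−1 cutoff yield very different rates from the same batch of candidates:

```python
def stability_rate(delta_ed, cutoff=0.0):
    """Fraction of candidates with decomposition energy <= cutoff (eV/atom)."""
    return sum(e <= cutoff for e in delta_ed) / len(delta_ed)

# Hypothetical ΔEd values (eV/atom) for a small batch of generated materials
energies = [-0.02, 0.01, 0.05, 0.08, 0.15, 0.30, 0.45]
strict = stability_rate(energies)         # on-hull criterion (ΔEd <= 0)
loose = stability_rate(energies, 0.100)   # 100 meV/atom "SUN-style" cutoff
```

Here the loose cutoff reports a rate four times higher than the strict one, mirroring the gap between the SUN rates of prior work and the stability rates in Table 1.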
The generative models tested here may not lead to the highest stability rates, but they are unique in their ability to generate new structural frameworks that cannot be mapped to any known prototypes. This sets them apart from baseline methods, which rely entirely on existing templates and therefore exhibit 0% prototype novelty rates (Table 1). The generative models evaluated in this work achieve prototype novelty rates ranging from 1.0% (CrystaLLM) to 8.2% (CDVAE). As shown in ESI,† Fig. S6, a majority of these materials lie far above the convex hull, exhibiting ΔEd > 100 meV atom−1. While those from MatterGen are closer to the convex hull on average, none achieve ΔEd ≤ 0. The lack of proposed structures that are both stable and adopt novel prototypes highlights the need for generative models that can effectively balance thermodynamic stability with structural novelty.
The unfiltered results in the left panel of Fig. 2 serve as a reference to compare each method's ability to generate stable materials. These are CDFs of the same histograms shown in Fig. 1. As observed in the prior section, ion exchange performs best in generating materials that are stable or close to the convex hull. This is evidenced by a steep rise in its CDF, positioned far to the left of all other methods. There is close competition among the next best three methods – MatterGen, CDVAE, and FTCP – whose CDFs overlap throughout a wide range of energies, reaching 80% near ΔEd ≈ 300 meV atom−1. In contrast, CrystaLLM and random enumeration yield CDFs that increase more gradually, reaching 80% only at energies above ΔEd ≈ 600 meV atom−1. This suggests that most materials produced by these two methods are unlikely to be accessed experimentally.63,64
The right panel of Fig. 2 highlights the beneficial effect of CHGNet filtering, as the consistent leftward shift in all CDFs indicates a greater proportion of materials that are stable or close to the hull. However, results still vary substantially across different generation approaches. Filtered materials from ion exchange show a high stability rate of 15.2%, with the corresponding CDF reaching 80% at ΔEd ≈ 100 meV atom−1. Random enumeration also improves after filtering, achieving a stability rate of 7.6%. Among the generative models, CrystaLLM and FTCP benefit the most from CHGNet filtering, with stability rates increasing to 17.0% and 22.4%, respectively. In contrast, MatterGen and CDVAE show only modest gains, with relatively small shifts in their CDFs and updated stability rates ranging from 3.8% to 8.8%.
We speculate that filtering is less effective for the diffusion models as they often generate materials that fall outside of CHGNet's training distribution – for example, in under-sampled chemistries or structures that are far out-of-equilibrium – potentially reducing the accuracy of stability predictions and limiting their performance gains. Indeed, ESI,† Fig. S8 shows large mean absolute errors (154 to 156 meV atom−1) on structures from CDVAE and MatterGen. Large prediction errors (139 meV atom−1) are also observed on structures generated through random enumeration. Despite adhering to known structure templates, random enumeration more often produces exotic compositions with less representation in CHGNet's training set.
To assess the compositional diversity of generated materials, we provide heatmaps of element frequencies and histograms showing the number of elements per novel compound in ESI,† Fig. S9 and S10. Random enumeration produces compositions spanning much of the periodic table with relatively even distributions of pnictides, chalcogenides, and halides. However, these compositions are generally limited to ternary prototypes, reflecting the dominance of known three-element structures. In contrast, ion exchange produces a more diverse set of compounds – including quaternaries and quinaries – but the overall composition space is narrower, skewed toward oxides, which are disproportionately prevalent in MP. This contributes to ion exchange providing the lowest CHGNet prediction error (47 meV atom−1) among all methods.
Similarly, most of the generative models produce oxides at disproportionately high rates – reflecting bias in the MP-20 dataset on which they were all trained. For example, CDVAE generates a relatively narrow range of compositions, with 23.4% containing oxygen. However, it also produces more complex chemical formulae than template-based methods, with up to nine elements per compound. These multicomponent oxides are more often novel than compositions with fewer elements; however, they also compete with many compounds in the high-dimensional phase diagram, which contributes to the lower stability rate of CDVAE. Other methods generate a balance of binaries, ternaries, and quaternaries with broad periodic table coverage but a slight preference for oxides and halides. This further reflects bias in the MP-20 dataset used for training, which can be mitigated by expanding the set to include more diverse compounds. For instance, the proportion of oxides generated by MatterGen drops from 21.2% to 10.9% after incorporating the Alexandria dataset into its training. FTCP demonstrates the most compositional diversity of the models tested here, sampling a wide range of elements with only 4.4% of its materials containing oxygen. Although FTCP is trained on the MP-20 dataset, we suspect its strategy of latent space sampling enables it to interpolate between known compounds and explore regions of composition space that are underrepresented in the training data.
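Oxide fractions like those quoted above come from simple composition tallies. A minimal sketch with hypothetical formulas (real formulas with parentheses or fractional subscripts would need a fuller parser, such as pymatgen's Composition):

```python
import re
from collections import Counter

def elements_in(formula):
    """Element symbols appearing in a simple chemical formula string."""
    return {m.group(1) for m in re.finditer(r"([A-Z][a-z]?)\d*", formula)}

def oxide_fraction(formulas):
    """Fraction of formulas that contain oxygen."""
    return sum("O" in elements_in(f) for f in formulas) / len(formulas)

def element_presence(formulas):
    """Number of formulas in which each element appears (heatmap-style tally)."""
    counts = Counter()
    for f in formulas:
        counts.update(elements_in(f))
    return counts

# Hypothetical set of generated formulas
formulas = ["Li2O", "NaCl", "Fe2O3", "CsPbI3"]
```

Tallies of this kind underlie the element-frequency heatmaps and the per-method oxide fractions reported in the ESI.†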
Random enumeration produced a wide variety of materials, with 30.2% of the novel ones being metallic. Only 11.2% of these materials exhibited a band gap within 0.5 eV of the desired value (3 eV), demonstrating the low success rate of computational screening when no guidance is provided. Applying CGCNN to filter these randomly enumerated materials improved the results considerably. By only retaining materials with CGCNN-predicted band gaps near 3 eV, the proportion of metals dropped to 18.8%, and 21.4% of the filtered materials exhibited band gaps within 0.5 eV of the target. As with CHGNet-filtering, this showcases the utility of ML-based screening for quickly refining large pools of candidate materials.
Data-driven ion exchange performed even better than CGCNN filtering of randomly enumerated materials, leveraging its ability to generate hypothetical compounds by substituting ions in known materials from MP that already have band gaps close to 3 eV. This method resulted in only 5.6% of the novel materials being metallic and a substantial 37.2% of them having a band gap within 0.5 eV of the target. This strong performance may not be entirely surprising as many of the compositional changes introduced by ion exchange are relatively minor, especially when the substituted element constitutes a small fraction of the overall chemical formula. This mirrors our findings from the previous section, highlighting the tradeoff between achieving success – whether in targeted properties or stability – and prioritizing novelty or diversity in the generated structures.
FTCP outperformed all other methods in targeting electronic band gap, with 61.4% of its novel materials exhibiting a band gap within 0.5 eV of the desired value (3 eV). This success likely stems from FTCP's latent space sampling informed by known compounds with band gaps close to the target, which enables the generation of materials with structural or compositional similarities to the reference points. Thermodynamic stability remains an important consideration, as only 3.0% of the novel materials generated by FTCP are stable, compared with 15.2% of those generated by ion exchange (ΔEd distributions provided in ESI,† Fig. S11).
Using the same four methods described above (for targeting a desired band gap), we next generated materials with the objective of maximizing bulk modulus. This task fundamentally differs from the previous band gap-related objective by focusing on materials with extreme properties (e.g., maximal bulk modulus) instead of those within an intermediate range (e.g., band gaps near 3 eV). A total of 500 materials were sampled from each method, and their bulk moduli were computed using Birch–Murnaghan equations of state fit to DFT-computed energies. The resulting distributions of bulk moduli are shown in Fig. 4. Materials generated through random enumeration follow a Poisson-like distribution of bulk moduli with a peak near 50–60 GPa, closely resembling the known distribution of elastic properties for materials in MP.68 If we define success as finding novel materials with a bulk modulus ≥300 GPa, then random enumeration achieves this at a rate of only 3.0%. When CGCNN is applied to filter these materials, it causes a noticeable shift in the distribution toward higher bulk moduli, and 15.4% of the filtered materials exhibit a bulk modulus ≥300 GPa.
When applied to known materials in MP with high bulk moduli, ion exchange performs more modestly, with 8.6% of the resulting materials exhibiting a bulk modulus ≥300 GPa. This smaller shift in the distribution likely reflects the tendency for ion exchange to introduce only minor compositional changes, which limits its ability to substantially alter the mechanical properties of the original materials – many of which (in MP) do not exhibit anomalously high bulk moduli. FTCP performed slightly better than ion exchange but worse than CGCNN-based filtering of randomly enumerated materials, with 9.2% of the compounds generated by FTCP exhibiting a bulk modulus ≥300 GPa.
Compared to its strong performance on electronic band gap, we suspect FTCP is less effective here given the scarcity of materials with extremely high bulk moduli in MP. This lack of training data may limit the conditioning of the autoencoder's latent space on extreme bulk modulus values. FTCP also yields a low stability rate of 2.0% when targeting novel materials with high bulk modulus. CGCNN filtering and ion exchange face similar limitations, with stability rates of 2.0% and 1.8%, respectively (ΔEd distributions provided in ESI,† Fig. S12). These uniformly low percentages across all evaluated methods highlight the challenge of identifying “exceptional” materials, as the inherent scarcity of analogs in the materials space and limited training data inhibit the development of effective models for both generation and filtering.69
The strong performance of the baseline methods establishes a high benchmark for generative models to meet or exceed. For this task, we tested a variational autoencoder (FTCP),12 a large language model (CrystaLLM),15 and two diffusion models (CDVAE and MatterGen).13,22 Our tests showed MatterGen to be most effective in generating materials on or close to the hull, though its stability rate of 3.0% still falls well below that of ion exchange (9.2%). Nevertheless, generative models excel in generating materials with a high degree of structural novelty; up to 8.2% cannot be mapped to any known structure prototype in the AFLOW database. The capability of generating entirely new structural arrangements is unique to the generative models, but their low stability rates leave much room for improvement. One promising direction is to expand the training data for these models. For example, we found that MatterGen achieves a higher stability rate of 5.4% when trained on materials from Alexandria in addition to MP-20. Similar improvements were observed for CrystaLLM.
It is important to note that many of the comparisons made in this work depend on the stability threshold used to define success. We adopted a strict criterion requiring materials to lie on the convex hull (ΔEd ≤ 0) to be considered stable, which reduces false positives but also penalizes the generation of near-stable candidates that may be synthesizable. A looser threshold would raise the stability rates across all methods and shift the relative performance of each approach. For example, a threshold of ΔEd ≤ 100 meV atom−1 increases the stability rate of the generative models to 11–20%, but these rates still fall well below that of ion exchange, which achieves a stability rate of 58% using the same threshold (ESI,† Table S4).
In addition to generating a large proportion of materials near the convex hull, generative models also perform well in targeting specific properties when sufficient training data is available. For instance, FTCP achieves a high success rate of 61.4% in generating materials with a desired band gap near 3 eV, far surpassing the 37.2% achieved by ion exchange. This performance diminishes when targeting extreme values of properties such as high bulk moduli (>300 GPa) that are less well represented in the training set. However, improved results can likely be obtained by running additional calculations on materials with extreme property values and feeding them back into the generative models as training data.
To enhance the performance of the methods discussed in this paper, machine learning models were used to filter proposed materials based on predicted thermodynamic stability or desired properties. Our results demonstrate that this is a highly effective approach. Filtering by predicted stability using a pre-trained uMLIP (CHGNet)35 substantially improves the stability rates of generated materials. A notably high 22.4% of novel materials generated by FTCP lie on the DFT convex hull after filtering. This performance boost achieved by filtering is diminished for some generative models like CDVAE and MatterGen, which produce more exotic materials that fall outside of CHGNet's training distribution. However, uMLIPs are likely to become more effective at filtering such materials as the breadth and diversity of their training data improves.71 This trend is evident in the correlation between prediction error and the filtered stability rate (ESI,† Fig. S13), suggesting that reducing prediction error in future uMLIPs should further improve stability rates.
Similar findings were observed when using a pre-trained graph neural network (CGCNN)36 to filter materials by predicted band gap and bulk modulus. Doing so leads to a near three-fold increase in the success rate of identifying materials with desired properties relative to unfiltered random enumeration, though the rate remains relatively low (15.4%) when targeting extreme property values (e.g., a bulk modulus >300 GPa). Filtering also decreases the stability rate of the proposed materials, though incorporating a uMLIP-based stability filter could mitigate this issue. As with uMLIPs, these findings underscore the need to broaden and diversify training data for property prediction models to enhance the efficiency of generative approaches in identifying novel materials with exceptional properties.
Our findings demonstrate that there is still room for improvement in the design of generative models for inorganic materials, particularly when they are used to find new materials that are thermodynamically stable. To streamline the development of future models, we provide all of the data and code from this work in a publicly accessible GitHub repository (see Data availability statement). We envision these resources being used for benchmarking generative models and integrating them with traditional screening methods to enhance the success rate in discovering new materials that are likely to be synthesized and display desired properties.
Footnote
† Electronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d5mh00010f
This journal is © The Royal Society of Chemistry 2025 |