Jiamu Ma, Jianling Yao, Xueyang Ren, Ying Dong, Ruolan Song, Xiangjian Zhong, Yuan Zheng, Dongjie Shan, Fang Lv, Xianxian Li, Qingyue Deng, Yingyu He, Ruijuan Yuan* and Gaimei She*
School of Chinese Materia Medica, Beijing University of Chinese Medicine, Fangshan District, 100029 Beijing, China. E-mail: rjyuance@126.com; shegaimei@126.com
First published on 8th March 2023
Currently, extraction process optimization is generally based on a few features, regardless of their different changing trends and the panoramic view of the extraction process. Comprehensive evaluation and understanding are hard to establish due to the small number of experiments. Here, machine learning-assisted optimization is demonstrated for better understanding the complex extraction process based on data from an orthogonal experimental design (OED). From the two perspectives of panoramic characteristics and specific characteristics, several observations are adopted to evaluate the performance of the extraction process, including quantitative 1H NMR, HPLC fingerprint, molecular weight, yield of dry extract and content of components. The close relationship between influencing factors and the extraction performance is described by grey relation analysis. With the help of a radial basis function neural network (RBFNN), a nonlinear fitting regression equation is developed for every observation and influencing factor. A genetic algorithm is then introduced for multi-objective optimization and Pareto fronts are obtained. To select the best combination of water extraction process and ethanol extraction process, a list of the combinations of Pareto front points from those extraction processes is formed and ranked using CRITIC-TOPSIS. Finally, the ideal extraction is characterized by molecular weight, monosaccharide composition and UHPLC-MS/MS. In the verification between OED experiments and machine learning, the changing rates of all observations range from 1.33% to 30.11%, which confirms that machine learning-assisted optimization gives better performance than conventional OED. The molecular weight ranges from 61.5 to 594.9 kDa, with some fractions exceeding the measuring range; furthermore, mannose and glucose are the most abundant monosaccharides of the polysaccharide from the ideal extraction. In addition, 160 components are identified via UHPLC-MS/MS.
In conclusion, ML is a powerful tool for predicting and understanding extraction processes, thus accelerating the development of eco-friendly extraction processes.
In the past few years, studies evaluating extraction performance have tended to use a few features to represent the overall performance. However, the selection of features is usually subjective and offers a limited panoramic view. Moreover, the selected features do not always follow the same changing trend. Failing to consider the dynamic changes that occur during the extraction process as a whole leads to an incomplete understanding of those changes. In other words, an inappropriate evaluation scale may lead to misinterpretation of the results. Even though some studies have recognized the importance of characterizing the extraction process, they still find it hard to explain the close relations between the influencing factors and the results. Understanding the influencing factors in the extraction process is important for maximizing the use of a given material, reducing waste and saving energy. A series of methods have been introduced to optimize the performance by observing the content, the changing trend, and even the structure of bioactive substances.8 Orthogonal experimental design (OED) and response surface methodology (RSM) are considered to be mature optimization methods for extraction processes. Although some of the ideas mentioned above have been put into use, experiments with high uncertainty are still inadequate for comprehensive evaluation and decision-making.9 That is, optimization results are highly correlated with the completed experiments, while the rest of the experimental parameter space remains a black box. Moreover, simultaneously optimizing influencing factors against various features in time-consuming experiments causes bottlenecks in a broad range of scientific and engineering disciplines.9,10 Given the time-consuming nature of extraction optimization, computational methods have become a popular way of modelling and optimizing the extraction process.
Machine learning (ML) and statistical inference have been applied across chemistry research and development activities, including process optimization. To improve the efficiency of green engineering process optimization, automated and computational methods have been greatly welcomed.11–14 ML is usually applied to predict and describe the relationships between factors and observations.15,16 Linear and nonlinear regression are popular methods for that purpose, while multi-objective optimization is able to consider various aspects of performance and has wide applications in the areas of engineering, environment, food, and drug discovery.17–21 The evaluation of multiple criteria in a complicated process is transformed into a multi-objective optimization problem, which is usually solved by a genetic algorithm.22 A genetic algorithm is recognized as a better way to perform multi-objective optimization, requiring fewer experiments and less time. Non-dominated sorting genetic algorithm II (NSGA-II) is an outstanding genetic algorithm for multi-objective optimization with strong compatibility with different types of data. Some helpful attempts have been made to optimize on the basis of few experiments, such as those from RSM.23 However, because the solutions on the Pareto front are similar, it remains difficult to decide which is the best solution.
Traditional Chinese medicine (TCM) is an important class of natural products with great popularity in China and Southeast Asia,24 which is believed to play a role in disease prevention and treatment, and is also leading the trend for investigating their function and relative bioactive substances. Though biomass from natural products is thought to be one of the most abundant potential sources of renewable energy,25 it is still vital to ensure maximum extraction of bioactive substances from raw material in the first place. Therefore, efforts have been made to extract higher yields of useful bioactive substances from those natural products with less residue generated.26 Due to the traditional use and the awareness of food and pharmaceutical safety, it is common to extract TCM or its products using water, sometimes accompanied by a suitable concentration of alcohol. The extraction method is vital to maximising the yield of bioactive substances, as well as for maintaining the bioactivity of the substances.2,27 Facing the complexity of fuzzy bioactive substances and their interaction with each other, establishing a stable and reliable extraction process is important to ensure efficiency in use. Polysaccharides, flavonoids, and lignans have been reported to have various bioactivities with great value for human health and they are also believed to be promising for drug discovery.5,28–32 All of them are among the major components found throughout natural products.33 However, the poor solubility and instability of glycosides have caused problems with the extraction of flavonoids and lignans in extensive applications. Polysaccharides have been shown to have a potential role in improving the bioavailability of active components.34,35 The combination of two or more ingredients and herbs is very common in TCM, while further combinations are an important approach in the development of new therapeutic agents.
In this study, a two-stage optimization process is presented based on OED and several ML algorithms, with BWG (a TCM combination named Baiji Wuweizi Granule) taken as an example. The substances of the extraction process and the ideal extraction fluid (shown in Fig. 1) were characterized by the content of key bioactive components, molecular weight (Mw), HPLC fingerprint, and quantitative nuclear magnetic resonance spectroscopy (qNMR). Furthermore, the secondary metabolite and monosaccharide contents of the ideal extracts were identified. OED experiments were performed first to obtain the raw data, from which the dataset was then generated. Then, in view of the slight differences between factors, grey relation analysis (GRA) was used to rank the features according to the characterization of the OED experiments. In order to bridge the gap between extraction factors and observations, back propagation neural networks (BPNN) and radial basis function neural networks (RBFNN) were applied as nonlinear fitting regression models36,37 for every single observation, just after the factor importance ranking by GRA. According to the interaction effects of different factors and observations determined by RBFNN, the NSGA-II algorithm was used to build a multi-objective program for dynamic planning. To evaluate the performance of the two-stage extraction process, criteria importance through inter-criteria correlation and technique for order of preference by similarity to ideal solution (CRITIC-TOPSIS) was adopted to find the ideal extraction process among the combinations of predicted solutions, that is, the best solution combination from several similar solution sets. Our work established a robust, promising model for optimizing a complex extraction process with little raw data. Above all, this study provides a practical method for describing the changes during extraction.
It is of great importance to study the extraction processes for TCM, and also to understand the relationships between factors and components for improving the extraction efficiency.
Fig. 1 Computational and data-driven procedure for optimizing an extraction process with multiple objectives.
Fig. 2 A consecutive extraction technological flow chart of ethanol and aqueous extract fluid from BWG.
A TCM combination (Baiji Wuweizi Granule, BWG) is presented to represent these bioactive substances. As for the secondary metabolites, citrus peel (Citri Reticulatae Pericarpium, CP) and Astragali radix (AR) are known for their various flavonoids, while Schisandra chinensis (SC) is known for its lignans. Bletilla striata (BS) is abundant in polysaccharides, which have been confirmed to be the most important component for biological functions and have also been used in drug delivery systems.41,42 Nevertheless, polysaccharides from CP, AR, and SC can also be developed into biomaterials with their secondary metabolites or others.43–46 Given the favourable mechanical properties of polysaccharides, the combination of secondary metabolites and polysaccharides is considered a promising candidate for production.47 Herein, the extraction process of BWG is taken as an example.
For the ethanol extraction process, four factors and three levels were set to conduct an L9(3⁴) orthogonal experimental design. As for the water extraction process, three factors and three levels were set, ignoring the interaction between factors. The blank column set in water extraction was a test for random error in the OED. All experiments were conducted after a single-factor test, and the settings for each process are shown in Table 1.
Extraction process | Factor | Parameter | Level 1 | Level 2 | Level 3
---|---|---|---|---|---
EE | A | Ethanol concentration (%) | 40 | 50 | 60
EE | B | Extraction time (h) | 1 | 2 | 3
EE | C | Number of extraction cycles (n) | 1 | 2 | 3
EE | D | Solid–liquid ratio (g mL−1) | 1:… | 1:… | 1:…
WE | A | Solid–liquid ratio (g mL−1) | 1:… | 1:… | 1:…
WE | B | Number of extraction cycles (n) | 1 | 2 | 3
WE | C | Extraction time (h) | 1 | 1.5 | 2
WE | D | Null column | 1 | 2 | 3
Outliers of each experiment were removed first. As shown in Table 1, four parameters and three factor levels for EE as well as the WE methodology generated 57 experiments, including repetitions, as tabulated in Tables S3 and S5.† The experimental data were summarized into two main categories: panoramic characterization (qualitative parameters) and specific characterization (quantitative parameters). Panoramic characterization parameters included the average peak area of the VIP peak from HPLC fingerprint analysis (Aave), the average mass concentration from qNMR, and the average molecular weight for polysaccharides. Specific characterization involved the content of schisandrin, hesperidin, and total sugar, as well as the yield of dry extract. Here we used seven main indices to represent the features of 36 experimental samples for the two stages of the extraction process. Detailed information on the calculations carried out is provided in the ESI (Tables S1–S7†). For missing data that were not tested, mean imputation was used to replace them in the input dataset. The code for the model and the dataset are available on GitHub (https://github.com/Lexie0926/ML-code-for-GC.git).
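As an illustration of the mean imputation step mentioned above, the following minimal sketch fills each missing entry with the mean of the observed values in the same column. The function name `impute_column_means` and the toy numbers are hypothetical and are not taken from the published code or dataset.

```python
from statistics import mean

def impute_column_means(rows):
    """Replace None entries with the mean of the observed values in the same column."""
    n_cols = len(rows[0])
    col_means = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if row[j] is not None]
        col_means.append(mean(observed))
    return [
        [col_means[j] if row[j] is None else row[j] for j in range(n_cols)]
        for row in rows
    ]

# Hypothetical toy data: three runs, two observations, one value not measured.
toy = [[1.0, 10.0], [3.0, None], [5.0, 14.0]]
filled = impute_column_means(toy)
```

Column-wise means are assumed here; a column with no observed values at all would need separate handling.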
![]() | (1) |
Then the normalization equation was employed for dismissing the differences in types of factors.
![]() | (2) |
For calculating the coefficient of the relation between factor level and response indexes, the equation can be expressed as:
![]() | (3) |
The grey relational grade can be calculated as:
![]() | (4) |
Considering the various observations for valuing, the average value of the grey relational grade of each factor was calculated as follows:
![]() | (5) |
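The grey relational steps described above (normalization, relation coefficient, grade) can be sketched as follows. This is a generic textbook form under stated assumptions, not necessarily the exact variant used in this work: min–max normalization and a distinguishing coefficient `rho = 0.5` are assumed, and the function name and toy sequences are hypothetical.

```python
def grey_relational_grades(reference, factors, rho=0.5):
    """Grey relational grade of each factor sequence against one reference sequence.

    Assumes min-max normalization and the common distinguishing coefficient rho = 0.5.
    Sequences must not be constant, and at least one factor must differ from the
    reference after normalization (otherwise a division by zero occurs).
    """
    def minmax(seq):
        lo, hi = min(seq), max(seq)
        return [(v - lo) / (hi - lo) for v in seq]

    ref = minmax(reference)
    norm = [minmax(f) for f in factors]
    # Absolute deviation of each normalized factor sequence from the reference.
    deltas = [[abs(r - v) for r, v in zip(ref, f)] for f in norm]
    d_min = min(min(row) for row in deltas)
    d_max = max(max(row) for row in deltas)
    grades = []
    for row in deltas:
        coeffs = [(d_min + rho * d_max) / (d + rho * d_max) for d in row]
        grades.append(sum(coeffs) / len(coeffs))  # arithmetic mean of the coefficients
    return grades

# Hypothetical toy sequences: one observation and two factor-level sequences.
observation = [1.0, 2.0, 3.0]
factor_a = [10.0, 20.0, 30.0]   # follows the observation exactly after normalization
factor_b = [3.0, 1.0, 2.0]      # follows it poorly
grades = grey_relational_grades(observation, [factor_a, factor_b])
```

A grade closer to 1 indicates that the factor sequence tracks the observation more closely.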
A BPNN model consists of three layers: input, hidden, and output. In the ethanol extraction process, four input nodes, five hidden nodes, and one output node were set to construct the regression equation of each index. For the water extraction, there were three, five, and one input, hidden, and output nodes, respectively.
![]() | (6) |
RBFNN is a feedforward neural network with the unique best approximation property.51 It works on a similar principle to BPNN, yet it performs better in describing local features of the input data owing to its strong approximation ability.50 The main equation of RBFNN is as follows:
![]() | (7) |
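For orientation, a common textbook form of the output of a Gaussian RBF network is sketched below. The symbols (centres $\mathbf{c}_i$, widths $\sigma_i$, output weights $w_i$, bias $b$, and $m$ hidden units) are assumptions chosen for illustration and may differ from the notation and kernel used in this work.

```latex
\hat{y}(\mathbf{x}) = \sum_{i=1}^{m} w_i \exp\!\left(-\frac{\lVert \mathbf{x} - \mathbf{c}_i \rVert^{2}}{2\sigma_i^{2}}\right) + b
```

Each hidden unit responds only to inputs near its centre, which is why such networks are often described as good at capturing local features.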
The BPNN model was built with an app in MATLAB, and the code for the RBFNN was uploaded to the GitHub link mentioned previously. A total of 18 sets of raw data from the OED of EE and WE were randomly selected, and leave-one-out cross-validation (LOOCV) was used to separate the training and test data. Testing data comprised the rest of the experimental dataset. Five-fold cross-validation was used to select the optimal hyperparameters with the training data, and the remaining testing data were used to evaluate the model performance. The number of epochs was set to 1000. The sigmoid function was used as the activation function, and the other hyperparameters were left at their default values. To check the validity of the machine learning model, SVM and MLR were tested for comparison. A higher coefficient of determination (R2) indicates a better model. Model accuracy was evaluated by the root mean square error (RMSE) between the real value and the predicted one.
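The evaluation quantities named above can be written compactly. The sketch below assumes the usual definitions of RMSE and R2 and a plain leave-one-out split; the function names and toy values are hypothetical and do not come from the published code.

```python
import math

def rmse(y_true, y_pred):
    """Root mean square error between measured and predicted values."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def loocv_splits(n_samples):
    """Yield (train_indices, test_index) pairs, holding out one sample at a time."""
    for i in range(n_samples):
        yield [j for j in range(n_samples) if j != i], i

# Hypothetical toy values, for illustration only.
y_true = [1.0, 2.0, 3.0]
y_pred = [1.0, 2.0, 4.0]
error = rmse(y_true, y_pred)
fit = r_squared(y_true, y_pred)
splits = list(loocv_splits(3))
```

With 18 samples, such a split would produce 18 train/test pairs, each model being fitted on 17 samples and checked on the remaining one.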
The significant observations of the GRA results were set as decision variables in this NSGA-II model. The objective functions in the optimization model guide the direction of the optimization process. Herein, the regression equation with the best fitting performance was adopted as the objective function in the genetic algorithm model. Another important parameter of this model is the set of constraints. The constraints of the multi-objective optimization model cover all factors; the input and output characteristics of the model are shown in Table 2.
Extraction process | Factor | Symbol, unit | Value
---|---|---|---
Ethanol extraction | Ethanol concentration | XE4, % | [40, 60]
Ethanol extraction | Extraction time | XE3, hour | [1, 3]
Ethanol extraction | Number of extraction cycles | XE2, n | [1, 3]
Ethanol extraction | Solid–liquid ratio | XE1, g mL−1 | [10, 14]
Water extraction | Solid–liquid ratio | XW3, g mL−1 | [10, 14]
Water extraction | Number of extraction cycles | XW1, n | [1, 3]
Water extraction | Extraction time | XW2, hour | [1, 2]
As the NSGA-II algorithm is developed based on the Pareto front, it is important to set a proper option to reduce the calculation time and obtain the ideal Pareto frontier solutions. The population size, maximum iteration number, crossover probability, mutation probability, distribution index of crossover operator, and distribution index of mutation operator were set to 100, 500, 0.9, 0.1, 20, and 20, respectively.
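NSGA-II ranks candidate solutions by Pareto dominance. The sketch below shows only that non-dominated filtering step, not the full algorithm (no crossover, mutation, or crowding distance). All objectives are assumed to be minimized, so an objective to be maximized, such as a component content, would be negated first; the function names and toy points are hypothetical.

```python
def dominates(a, b):
    """True if a is no worse than b in every objective and strictly better in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(points):
    """Keep the points that no other point dominates (all objectives minimized)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

# Hypothetical two-objective points, for illustration only.
candidates = [(1, 5), (2, 6), (3, 1), (0, 7)]
front = pareto_front(candidates)
```

Here `(2, 6)` is dropped because `(1, 5)` is better in both objectives, while the remaining three points trade one objective against the other.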
Factor importance analysis is a powerful tool for identifying the usefulness of input factors in predicting target observations and can provide a basis for factor selection. Starting from the observations is a faster way to complete the optimization. Data processing is about comprehensively analyzing the observations, regardless of the feature extraction and selection.58 To investigate the factors that influence the extraction performance, GRA was performed to determine the degree of relation between observations and factors. Given the complexity of deciding which observations are important, GRA was used to trace back the causes that lead to process changes. After constructing the matrix of factors and observations, the GRA grade was calculated, and the results are shown in Fig. 5.
Fig. 5 Grey relation grade of each factor and observation for WE and EE. (A) Ethanol extraction process, (B) water extraction process.
It is obvious from the grey relation grade that even the same factor can have a diverse influence on different observations. The number of extraction cycles has the same grey relation grade in both extraction processes, which means it is equally important to both the water and ethanol extraction processes. The grades of ethanol concentration, extraction time and solid–liquid ratio differed considerably across observations (Fig. 5).
To avoid the average being dominated by large terms, the geometric mean was adopted for calculating the average grey relation grade. The finite sets of average grey relation grades are as follows:
RAE = {0.5885, 0.6040, 0.7911, 0.7990}
RAW = {0.1935, 0.5730, 0.2732}
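The effect of choosing the geometric mean can be seen in a small sketch. The grades used here are hypothetical toy values, not the per-observation grades behind the sets above.

```python
import math

def geometric_mean(values):
    """n-th root of the product of n positive values."""
    return math.prod(values) ** (1.0 / len(values))

# Hypothetical grades of one factor across two observations.
toy_grades = [0.25, 1.0]
geo = geometric_mean(toy_grades)
arith = sum(toy_grades) / len(toy_grades)
```

For these values the geometric mean is lower than the arithmetic mean, showing how a single large grade contributes less to the average.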
The relationship between factor and observation is important to explain, as it expresses the degree to which the factors influence the extraction process.59 In the ethanol extraction process, the order of importance is: solid–liquid ratio > number of extraction cycles > extraction time > ethanol concentration. In the water extraction process, the average grey relation grade gives the order: number of extraction cycles > extraction time > solid–liquid ratio.
It is worth noting that the average grey relation grade showed the number of extraction cycles to be an important factor for both extraction processes. On the other hand, the results of the GRA gave a guide for deciding whether to accept the identified final extraction process or not.60 They provide strong evidence to guide further analysis in this study. Moreover, this result also confirms the critical process parameters (CPP) of the two-stage extraction process, which may support the study of quality control and evaluation using quality by design (QbD).61,62
Data from OED samples were selected for modelling because of their space-filling characteristic with desirable low-dimensional projection properties, which satisfies the demands of modelling.70 A total of 18 experimental samples from nine orthogonal arrays were randomly chosen. The RBFNN model displayed higher predictive accuracy than the BPNN model on all single observations. The detailed fitting model results are shown in Table 3. The results shown in Fig. 7 compare the real and predicted values of the validation set, which indicates that the fitting model was not overfitting. Overall, BPNN and RBFNN both performed well in fitting the extraction yield of schisandrin and the total sugar content (R2 > 0.99). That can be mainly attributed to their high content and good solubility in the corresponding solvent. As for the regression fittings of the other observations, RBFNN was slightly better than BPNN in R2. For the yield of dry extract of the water extraction process, the R2 values were around 0.97 for both models, which is greater than 0.90, indicating that both models are acceptable.
Taken together, RBFNN performed better than BPNN in bridging factors and observations. In contrast to the BPNN, the RBFNN benefits from its simplicity of structure, higher approximation properties, and faster calculation procedures while avoiding the problems of overfitting and local minimum.71 All these characteristics make RBFNN a better model in arbitrarily complex nonlinear relationships, which indicates that the extraction process may be a simple mapping relation.72
The Pareto fronts obtained by NSGA-II for ethanol extraction and water extraction are shown in Fig. 6. For the extraction process optimization, the goal is to select a high content of observed compounds and a low yield of dry extract. Due to the previously set constraints for each factor, the calculated solutions were likely to be sparse and discontinuous. Thirty solutions were simulated in the model of the EE process. The Pareto front is shown in Fig. 6(A), which includes two optimizing indexes. The WE process was more complicated. Two NSGA-II models were built on different observations for comparison: one used the panoramic parameters, such as the average area of the important peak from qNMR and HPLC fingerprint and the average molecular weight of polysaccharides, while the other was based on the specific parameters, including the content of some components. As a result, there were 63 solutions and 94 solutions in those two models, respectively.
Unlike single-objective optimization, multi-objective optimization generally has no single mathematically optimal solution, so the Pareto frontier is usually formed to represent the optimal solutions.74 To overcome the complexity of the different solvent extraction processes and to figure out the best combination of WE and EE processes, CRITIC-TOPSIS, as a decision-making method, was employed to obtain the optimal value. TOPSIS is a commonly used decision-making algorithm, but it may sometimes be influenced by subjective weighting. Therefore, CRITIC, which is often paired with TOPSIS, was adopted to determine the weights, making the decision model more realistic, more practical and more flexible.21,75 Accordingly, two lists of 30 × 63 combinations and 30 × 94 combinations were formed.
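The ranking idea can be sketched in a few lines. This is a generic formulation under stated assumptions (min–max scaling and population standard deviation for CRITIC, vector normalization for TOPSIS) rather than the authors' implementation; the function names, the `benefit` flags, and the toy matrix are hypothetical.

```python
import math
from statistics import mean, pstdev

def column(matrix, j):
    return [row[j] for row in matrix]

def pearson(x, y):
    """Pearson correlation; assumes neither sequence is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def critic_weights(matrix, benefit):
    """CRITIC weights: contrast (std. dev.) times conflict (1 - correlation)."""
    n_crit = len(matrix[0])
    norm_cols = []
    for j in range(n_crit):
        col = column(matrix, j)
        lo, hi = min(col), max(col)
        if benefit[j]:   # larger is better
            norm_cols.append([(v - lo) / (hi - lo) for v in col])
        else:            # smaller is better
            norm_cols.append([(hi - v) / (hi - lo) for v in col])
    info = []
    for j in range(n_crit):
        conflict = sum(1.0 - pearson(norm_cols[j], norm_cols[k]) for k in range(n_crit))
        info.append(pstdev(norm_cols[j]) * conflict)
    total = sum(info)
    return [c / total for c in info]

def topsis_scores(matrix, weights, benefit):
    """Relative closeness of each alternative to the ideal solution."""
    n_crit = len(matrix[0])
    norms = [math.sqrt(sum(v * v for v in column(matrix, j))) for j in range(n_crit)]
    weighted = [[weights[j] * row[j] / norms[j] for j in range(n_crit)] for row in matrix]
    best = [max(column(weighted, j)) if benefit[j] else min(column(weighted, j))
            for j in range(n_crit)]
    worst = [min(column(weighted, j)) if benefit[j] else max(column(weighted, j))
             for j in range(n_crit)]
    scores = []
    for row in weighted:
        d_best, d_worst = math.dist(row, best), math.dist(row, worst)
        scores.append(d_worst / (d_best + d_worst))
    return scores

# Hypothetical alternatives: column 0 is to be maximized, column 1 minimized.
toy_matrix = [[10.0, 1.0], [5.0, 5.0], [1.0, 10.0]]
benefit = [True, False]
weights = critic_weights(toy_matrix, benefit)
scores = topsis_scores(toy_matrix, weights, benefit)
ranking = sorted(range(len(scores)), key=lambda i: -scores[i])
```

In this toy case the first alternative is best on both criteria and therefore reaches a closeness of 1, while the last is worst on both and scores 0.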
According to the calculated degree of relative proximity, combinations were ranked, and the top 3 of each list are shown in Table 4. The top-ranked combination was taken to be the ideal extraction process. It was almost the same as the computed process used for the EE, but to be closer to reality and more operable, some factors were modified. The optimized EE process can then be summarized as: 2.0 g of SC is extracted by reflux with 24.0 mL of 60% ethanol for 1 h, and this process is carried out a total of three times.
Source of combination | Rank | Score | EE A | EE B | EE C | EE D | WE A | WE B | WE C
---|---|---|---|---|---|---|---|---|---
Specific parameters | 1 | 0.5655 | 59.98 | 1.04 | 2.87 | 1:… | 1:… | 1.65 | 1.54
Specific parameters | 2 | 0.5654 | 59.98 | 1.27 | 2.91 | 1:… | 1:… | 1.65 | 1.87
Specific parameters | 3 | 0.5653 | 59.99 | 1.21 | 2.74 | 1:… | 1:… | 1.65 | 1.87
Panoramic parameters | 1 | 0.6527 | 59.99 | 1.21 | 2.74 | 1:… | 1:… | 2.40 | 1.50
Panoramic parameters | 2 | 0.6517 | 59.97 | 1.27 | 2.73 | 1:… | 1:… | 2.99 | 1.00
Panoramic parameters | 3 | 0.6512 | 59.98 | 1.04 | 2.87 | 1:… | 1:… | 2.99 | 1.00
As for the WE process, the factors were diverse, while the top-ranked solution for each source of combinations remained similar. The optimized water extraction process can be described as follows: 6.0 g AR, 3.0 g BS, 2.0 g SC and 3.0 g CP were weighed, 168 mL of water was added, the mixture was boiled for 1.5 hours, and the process was carried out a total of two times.
Observations | OED | Computing optimization | Change rate (%)a
---|---|---|---
Yield of dry extract in ethanol process (%) | 29.04 ± 0.06 | 26.89 ± 0.12 | −7.44
Extraction yield of schisandrin (%) | 23.02 ± 0.83 | 23.70 ± 0.52 | 2.95
Yield of dry extract in water process (%) | 31.33 ± 0.76 | 27.23 ± 0.49 | −13.09
Total sugar content (mg g−1) | 10.40 ± 0.02 | 11.87 ± 0.13 | 14.13
Hesperidin content (mg g−1) | 5.22 ± 0.11 | 6.37 ± 0.02 | 22.03
Yield of dry extract of overall extraction process (%) | 33.24 ± 0.09 | 29.13 ± 0.07 | −12.36
Astragaloside IV content (mg g−1) | 0.6291 ± 0.00 | 0.6514 ± 0.02 | 4.74
Wave (μM mL−1) | 0.1634 ± 0.03 | 0.2126 ± 0.03 | 30.11
… | 6.0717 ± 0.15 | 5.9722 ± 0.07 | −1.64
… | 3.75 ± 0.11 | 3.80 ± 0.01 | 1.33
Experimental period (h) | 10.5 | 7.0 | −33.33
a The column of change rate is in comparison with the OED process.
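The change rate relative to the OED process can be illustrated with a short calculation. The formula below, (optimized − OED) / OED × 100, is an assumption inferred from the tabulated values rather than a definition stated in the text; it reproduces the hesperidin and experimental-period rows when applied to the rounded means. The function name is hypothetical.

```python
def change_rate(oed_value, optimized_value):
    """Percentage change of the optimized value relative to the OED value."""
    return (optimized_value - oed_value) / oed_value * 100.0

# Mean values taken from the hesperidin and experimental-period rows.
hesperidin_change = change_rate(5.22, 6.37)
period_change = change_rate(10.5, 7.0)
```

A negative value therefore means the optimized process gives a lower figure than the OED process, which is the desired direction for dry extract yield and experimental period.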
ML provides technical support for the management and operation of extraction processes, which is more efficient than relying solely on conventional optimizations. ML-based data analysis and evolutionary learning mechanisms have the potential to establish a universal analysis process and a predictive model platform.76 In conclusion, the computing-optimized extraction process was the ideal extraction process.
Fig. 8 Characterization of polysaccharide in ideal extract of BWG. (A) Mw analysis. (B) Monosaccharide compositional analysis by HPLC of ideal extract of BWG.
As the results show in Fig. 8(B), there were six monosaccharides in BWG. Their molar ratio was Man:GluA:GlaA:Glu:Gla:Ara = 0.958:0.224:0.373:1.729:0.245:0.714. It is obvious that BWG is abundant in Glu, which supports the selection of Glu as reference for the determination of total sugar content. All these features suggest that polysaccharides are a potential bioactive component of BWG.
As shown in the ion spectrum in Fig. S7,† the components of the ideal extracts were identified by comparison of their retention times and masses with standards, databases and publications.79,80 A total of 160 components were identified, including 47 flavonoids and their glycosides, 20 lignans and other kinds of components (i.e., amino acids, organic acids, phenanthrenes, bibenzyls, saponins) (Table S10†).
We did not emphasize the underlying mechanisms of how the changing factors influence the extraction performance, because it was hard to explain how the multiple factors change together using the reported data. Our model has a limited prediction range because its upper and lower limits are determined by the original dataset. That may encourage exploration of the limits of its application in other systems in the future. That also inspired us to study further the quantitative analysis of multi-components by a single marker (QAMS) in production processes in order to find a better analysis method, one which addresses the lack of reference standards and their high cost.82 We made a helpful attempt at determining the CPP of QbD, which would set a strong basis for quality control during processes. It remains an open challenge for future research to elucidate the extraction dynamics for the selective targeted extraction of indicative components. Overall, we hope that this ML-assisted strategy will succeed in replacing conventional extraction process modification and may be applied to other optimization issues.
Footnote
† Electronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d2gc04574e
This journal is © The Royal Society of Chemistry 2023