Corrosion seems like a mundane phenomenon, yet it consumes an estimated 6.2% of US GDP every year. More than a whopping $270 billion is spent annually in the USA alone to tackle corrosion, driven by the need to replace corroded equipment, machinery, and structures across the major transportation, food, and energy sectors. This in itself is a compelling reason to support and keep up with the research being carried out on corrosion-resistant materials and coatings.
In the research under discussion, machine learning is used to predict and select corrosion-resistant alloys, with a focus on multi-principal element alloys (MPEAs) and high-entropy alloys. These alloys are increasingly preferred over conventional steels and Ni- and Ti-based alloys: MPEAs containing Co, Cr, and Ni offer better corrosion resistance in NaCl solution than their austenitic stainless steel counterparts. Although the selection criterion in this research is narrowed to MPEAs alone, the candidate space is still considerable, with vast amounts of information to explore and parameters to consider. Materials informatics is a helpful tool for working through this space faster and at relatively low cost, and here it is combined with a descriptor optimization technique built on the already available data.
The basic idea of this model is to analyze data and establish relations, correlations, and trends within it. This is done by building a mathematical model whose robustness depends on the accuracy and size of the data, and which can then be used to logically predict the properties of new materials. Unfortunately, test data for MPEAs under atmospheric exposure is limited, as such testing is not standard industrial practice; hence the applications of MPEAs have so far been limited to harsh environments.
The degree of corrosion resistance of an MPEA depends on the combination and proportions of the elements used. It is therefore evident that a system might have to evaluate trillions of composition combinations for any given environment or application to predict a suitable MPEA. Beyond the challenge of searching such a space, the need to select a composition by weighing experimental data against intuition is another major barrier for any ML model.
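A quick back-of-the-envelope calculation makes the "trillions of combinations" claim concrete. The palette size (40 candidate elements), alloy size (6 principal elements), and 1 at.% composition grid below are illustrative assumptions, not numbers from the study:

```python
# Rough size of the MPEA composition search space (illustrative numbers,
# not taken from the study): choose 6 elements from a palette of 40, then
# enumerate compositions on a 1 at.% grid summing to 100% with every
# chosen element present (a stars-and-bars count).
from math import comb

element_choices = comb(40, 6)    # ways to pick the 6-element palette
grid_compositions = comb(99, 5)  # compositions in 1% steps, all > 0
total = element_choices * grid_compositions
```

Even with these modest assumptions the product runs to hundreds of trillions of candidates, which is why brute-force experimentation is off the table.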
Introduction to the ML model
The structure of this process is governed by a down-selection procedure. In this research, two environmental descriptors are used, covering pH and halide concentration; one chemical composition descriptor and two atomic descriptors are added to further downsize the data. Although descriptors are very helpful in reducing the total data, it is important to choose the right combination of descriptors for the task at hand while keeping an eye on how many are used, and this requires good domain knowledge of the field the model is applied to.
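The descriptors above can be pictured as one feature vector per alloy/environment pair. A minimal sketch, using the five descriptor names the study ultimately settles on (the function name, argument names, and the example values are all illustrative, not from the paper):

```python
# Hypothetical sketch: assembling one alloy/environment sample into a
# descriptor vector. Exact descriptor definitions come from the study;
# the values here are made up purely for illustration.

def make_feature_vector(ph, halide_molarity, comp_min_red_potential,
                        lattice_constant_diff, avg_reduction_potential):
    """Return [environmental, environmental, chemical, atomic, atomic]."""
    return [
        ph,                        # environmental: acidity of the medium
        halide_molarity,           # environmental: halide concentration
        comp_min_red_potential,    # chemical: fraction of the least-noble element
        lattice_constant_diff,     # atomic: lattice-constant mismatch
        avg_reduction_potential,   # atomic: composition-averaged potential
    ]

row = make_feature_vector(7.0, 0.6, 0.25, 0.04, -0.35)
```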
A gradient-boosting ML model with a two-stage feature down-selection process is used in the research under discussion. The selection process starts from a pool of 30 features.
Fig 1. Schematic diagram of the two-stage feature down-selection process.
It is important to keep in mind that no single model is best for every problem (the no-free-lunch theorem), so there is always room for improvement. For this particular problem, a gradient boosting regression model was judged the best fit (after a few other models were explored), together with the two-stage down-selection process.
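As a minimal sketch of the modelling step, scikit-learn's `GradientBoostingRegressor` can stand in for the study's gradient-boosting model. The data below is synthetic (random descriptors with a planted signal), purely to show the fit/score workflow; it is not the study's dataset:

```python
# Illustrative gradient-boosting regression on synthetic data:
# 200 samples x 30 candidate features, with the target depending on
# only two of them plus noise (an assumption for demonstration).
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.random((200, 30))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.standard_normal(200)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = GradientBoostingRegressor(random_state=0).fit(X_tr, y_tr)
mse = mean_squared_error(y_te, model.predict(X_te))
```

The held-out MSE is the same metric the study uses to score feature subsets in the down-selection stages.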
In the first stage of selection, the top 13 features are taken from the pool of 30. In the second stage, all possible combinations of those 13 features are analyzed by evaluating the mean squared error, and a final set of five features is chosen at the end of this stage.
Down-selection stage one
The number of descriptors used plays a key role in determining the complexity and accuracy of the model. The selection should be made so that accuracy is not compromised while the model stays fairly simple. There is also evidence that including too many control parameters not only increases model complexity but also degrades results. Hence, in this stage, each candidate feature is assessed for its importance by computing the mean and standard deviation of the error accumulated as features are added one after another. All features are ranked by importance based on the computed mean squared error, and the top 13 are carried forward.
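Stage one can be sketched as an importance ranking followed by a cutoff. The sketch below uses the built-in `feature_importances_` of a fitted gradient-boosting model as a stand-in for the study's importance measure, on synthetic data with two planted informative features:

```python
# Illustrative stage-one down-selection: rank 30 candidate features by
# importance from a fitted gradient-boosting model and keep the top 13.
# Synthetic data; features 3 and 7 carry the planted signal.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(1)
X = rng.random((150, 30))
y = 2.0 * X[:, 3] + X[:, 7] + 0.05 * rng.standard_normal(150)

model = GradientBoostingRegressor(random_state=0).fit(X, y)
ranking = np.argsort(model.feature_importances_)[::-1]  # most important first
top13 = ranking[:13]
```

On this toy data the two planted features land at the top of the ranking, mimicking how genuinely predictive descriptors survive the cutoff.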
Fig 2. Relative importance of the top 13 features selected after down-selection stage one.
Down-selection stage two
As mentioned earlier, it is important to find the sweet spot: a number of descriptors that gives accurate results while keeping the model simple to interpret. A function to track the performance of the model is therefore recommended.
After analyzing all possible combinations of the 13 descriptors, the five-descriptor combination with the lowest mean squared error (MSE) is selected. The chosen descriptors are the pH of the medium, the halide molarity of the medium, the composition of the element with the minimum reduction potential, the difference in lattice constants, and the average reduction potential.
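The exhaustive stage-two search can be sketched as scoring every feature subset by cross-validated MSE and keeping the minimizer. To keep this demo fast, it uses synthetic data, a small gradient-boosting model, and subsets of size at most 2 (the study evaluates all combinations of the 13 features and settles on 5):

```python
# Illustrative stage-two down-selection: exhaustively score feature subsets
# by 3-fold cross-validated MSE and keep the best one. Synthetic data with
# the signal planted in features 0 and 5; subset size capped at 2 for speed.
from itertools import combinations

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
X = rng.random((120, 13))
y = 4.0 * X[:, 0] - 3.0 * X[:, 5] + 0.05 * rng.standard_normal(120)

def cv_mse(cols):
    """Cross-validated MSE of a small gradient-boosting model on the subset."""
    scores = cross_val_score(
        GradientBoostingRegressor(n_estimators=30, random_state=0),
        X[:, cols], y, cv=3, scoring="neg_mean_squared_error")
    return -scores.mean()

subsets = (c for k in range(1, 3) for c in combinations(range(13), k))
best = min(subsets, key=cv_mse)
```

On this toy data the search recovers the planted pair of informative features; in the study, the same criterion selects the five descriptors listed above.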
Fig 3. Plot of MSE vs. number of features.
In this research, the importance of selecting the right ML approach is highlighted, along with the key role descriptors play, especially when dealing with large amounts of data. Additional information beyond this article, such as the comparison of experimental and ML-predicted corrosion rates, insight into the data collection method, and the machine-learning models used, can be found via the following reference: Senkov, O., Miller, J., Miracle, D. et al. Accelerated exploration of multi-principal element alloys with solid solution phases. Nat Commun 6, 6529 (2015), https://doi.org/10.1038/s41529-021-00208-y