New Pub: A method for selecting molecular descriptors when training property prediction models
Congratulations to the team on our recent publication in Fuel, which develops a systematic method for selecting molecular descriptors as features when training models to predict physicochemical properties of aviation fuels. A brief summary of the article and the link to the full article are below.
Machine learning has proven to be a powerful tool for accelerating biofuel development. Although numerous models are available to predict a range of properties using chemical descriptors, there is a trade-off between interpretability and performance. In this paper, we present a method for systematically selecting molecular descriptor features and developing interpretable machine learning models without sacrificing accuracy. Our method simplifies feature selection by reducing feature multicollinearity and enables the discovery of new relationships between global properties and molecular descriptors. To demonstrate our approach, we developed models for predicting melting point, boiling point, flash point, yield sooting index, and net heat of combustion with the help of the Tree-based Pipeline Optimization Tool (TPOT). For training, we used publicly available experimental data for up to 8351 molecules. Our models accurately predict various molecular properties for organic molecules (mean absolute percent error (MAPE) ranges from 3.3% to 10.5%) and provide a set of features that are well correlated with each property. To help accelerate early-stage biofuel research and development, we also integrated the data and models into an open-source, interactive web tool: feedstock-to-function.lbl.gov
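For readers who want a feel for the two ideas named in the summary above, here is a minimal sketch of one common way to reduce multicollinearity (a greedy pairwise-correlation filter) together with the MAPE metric. This is an illustration only and is not the procedure used in the paper: the function names, the 0.9 threshold, the descriptor names, and the synthetic data are all assumptions made for the example.

```python
import numpy as np


def drop_correlated_features(X, names, threshold=0.9):
    """Greedy filter: keep a feature only if its absolute Pearson correlation
    with every already-kept feature is at or below `threshold`.

    Assumes X has shape (n_samples, n_features) and no constant columns
    (a constant column would produce NaN correlations).
    """
    corr = np.abs(np.corrcoef(X, rowvar=False))
    kept = []
    for j in range(X.shape[1]):
        if all(corr[j, k] <= threshold for k in kept):
            kept.append(j)
    return [names[j] for j in kept]


def mape(y_true, y_pred):
    """Mean absolute percent error, in percent. Assumes no zeros in y_true."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return 100.0 * np.mean(np.abs((y_true - y_pred) / y_true))


# Synthetic stand-in for a descriptor table (hypothetical descriptor names).
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = 2.0 * a + 0.01 * rng.normal(size=200)  # nearly a rescaled copy of `a`
c = rng.normal(size=200)                   # independent of `a`
X = np.column_stack([a, b, c])
names = ["mol_weight", "heavy_atom_count", "polar_surface_area"]

selected = drop_correlated_features(X, names, threshold=0.9)
print(selected)

# Toy predictions, just to show how the metric is computed.
print(mape([100.0, 200.0], [110.0, 190.0]))
```

In this toy case the second descriptor is nearly a rescaled copy of the first, so the filter drops it and keeps the other two. A greedy filter like this depends on column order, which is one reason a more systematic selection method can be preferable.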
Click here for the full article.