Objectives: This study aims to rapidly optimize the input feature subset so that it agrees with the expert's point of view, and to enhance the performance of the early prediction model for fatty liver disease (FLD).
Methods: We explore a large-scale, high-dimensional dataset from a northern Taipei Health Screening Center in Taiwan. The dataset covers 12,707 male and 10,601 female patients and was processed from around 500,000 records collected from 2009 to 2016. We propose three eigenvector-based feature selection methods that use the Intersection over Union (IoU) and the Coverage to automatically determine the sub-optimal feature subset with the highest IoU and Coverage. We then use several long short-term memory (LSTM) related classifiers for FLD prediction and evaluate model performance by the test accuracy and the Area Under the Receiver Operating Characteristic Curve (AUROC).
Results: Our eigenvector-based feature selection method EFSTW has the highest IoU and Coverage and the shortest total computing time. For female patients, the highest IoU, the Coverage, and the computing time are 30.56%, 45.83%, and 260 seconds, respectively, whereas those of the benchmark, sequential forward selection (SFS), are 9.09%, 16.67%, and 380,350 seconds. The AUROCs with LSTM, biLSTM, Gated Recurrent Unit (GRU), Stack-LSTM, and Stack-biLSTM are 0.85, 0.86, 0.86, 0.86, and 0.87, respectively, for male patients, and 0.9 for all five classifiers for female patients.
Conclusion: Our approach explores a large-scale, high-dimensional FLD dataset, implements three efficient and automatic eigenvector-based feature selection methods, and efficiently develops a model for early prediction of FLD.
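
As an illustration of the two agreement measures named in the Methods, the following is a minimal sketch of how IoU and Coverage between an automatically selected feature subset and an expert-chosen subset might be computed. The abstract does not define either measure, so the definitions here are assumptions: IoU is taken as the size of the intersection divided by the size of the union, and Coverage as the fraction of expert-chosen features that also appear in the selected subset. The function names and the feature names are hypothetical and are not taken from the study.

```python
def iou(selected, expert):
    """Intersection over Union of two feature sets (assumed definition)."""
    selected, expert = set(selected), set(expert)
    union = selected | expert
    # Two empty sets have no features to compare; report 0.0 by convention.
    return len(selected & expert) / len(union) if union else 0.0


def coverage(selected, expert):
    """Fraction of expert-chosen features recovered by the selection (assumed definition)."""
    selected, expert = set(selected), set(expert)
    return len(selected & expert) / len(expert) if expert else 0.0


# Hypothetical feature names, for illustration only.
selected = ["BMI", "ALT", "age", "glucose"]
expert = ["BMI", "ALT", "triglycerides"]

print(f"IoU      = {iou(selected, expert):.2%}")       # 2 shared / 5 in union
print(f"Coverage = {coverage(selected, expert):.2%}")  # 2 shared / 3 expert features
```

Under these assumed definitions, a selected subset can only raise its Coverage by including more expert-chosen features, while IoU additionally penalizes selected features that the expert did not choose.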