Tree-based SIMCA for dealing with heterogeneous and sparse data.
BACKGROUND: One-class modelling (CM) is popular among chemometricians, but not well known among omics scientists in general. One issue is that typical CM approaches, including SIMCA, often give unsatisfactory results on typical omics data because of, for example, large variation, centring and scaling issues, sparsity, outliers, and non-linearities. These effects can inflate the decision boundary of the target class, which leads to many false positives (non-target cases accepted as target). Tree-based techniques are by nature resistant to these challenges. In this study we explore tree-based SIMCA variants in omics scenarios and compare them with existing strategies.

RESULTS: We present a non-linear form of SIMCA that uses sample proximities obtained through Unsupervised Random Forest and Isolation Forest (termed URF-SIMCA and IF-SIMCA, respectively). We compare the accuracy of these algorithms with that of traditional SIMCA, one-class support vector machines, and isolation forest. The comparison was based on five previously published clinical omics datasets and the wine dataset. URF-SIMCA showed superior behaviour. Using pseudo-sampling principles, the features important for separating the target and non-target classes could be interpreted. Using the wine dataset, we show empirically that these relate directly to information obtained through two-class algorithms. Moreover, feature trajectories in the score and orthogonal distance spaces further aid interpretation of the model.

SIGNIFICANCE: URF-SIMCA offers an easy-to-use extension of SIMCA that deflates the variance of the target class, allowing for better separation. The increased modelling performance comes at the cost of feature interpretability, but this can be tackled using the pseudo-sampling principle.
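
The abstract does not say how the sample proximities are computed. As an illustration only, the sketch below uses the common random-forest definition: the proximity of two samples is the fraction of trees in which they fall into the same leaf. The function name `leaf_proximity` and the toy leaf table are hypothetical and are not taken from the paper. The later SIMCA step on these proximities is not shown.

```python
from typing import List, Sequence


def leaf_proximity(leaves: Sequence[Sequence[int]]) -> List[List[float]]:
    """Proximity matrix from a (samples x trees) table of leaf indices.

    Entry [i][j] is the fraction of trees in which samples i and j
    end up in the same terminal node (leaf). This is the textbook
    definition, assumed here; the paper may use a variant.
    """
    n_samples = len(leaves)
    n_trees = len(leaves[0])
    prox = [[0.0] * n_samples for _ in range(n_samples)]
    for i in range(n_samples):
        for j in range(i, n_samples):
            # Count the trees where both samples share a leaf.
            shared = sum(
                1 for t in range(n_trees) if leaves[i][t] == leaves[j][t]
            )
            prox[i][j] = prox[j][i] = shared / n_trees
    return prox


# Hypothetical leaf assignments: 4 samples (rows), 4 trees (columns).
# Samples 0-2 often share leaves; sample 3 never does.
toy_leaves = [
    [0, 2, 1, 3],
    [0, 2, 1, 1],
    [0, 1, 1, 3],
    [5, 4, 6, 0],
]
prox = leaf_proximity(toy_leaves)
```

In practice, a leaf table of this shape can be taken from a fitted forest, for example via the `apply` method of scikit-learn's forest estimators. A sample that rarely shares a leaf with the target-class samples, like the last row above, has low proximity to all of them.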