Generating rapid insights with PwC's Data Science Machine

Jakob Olbrich Data & Analytics, PwC Switzerland 16 Nov 2020

In this article, we discuss a paper published in Nature Machine Intelligence by Yan et al.1 that tries to identify the drivers of COVID-19 outcomes. We show how a recently developed digital accelerator, PwC Switzerland’s Data Science Machine (DSM), can be used to solve similar, advanced problems, help unravel complex patterns and deliver actionable insights.

COVID-19 is a completely new disease, which makes it a challenge, even for experts, to identify the drivers of poor patient outcomes. Machine learning techniques can be used as an accelerator to sort through the data and help experts draw meaningful conclusions about this new disease.

We ran 10,434 models in three hours to identify the three main features that predict the course of COVID-19

High accuracy in predicting COVID-19 prognosis

Yan et al.1 used a database of 485 coronavirus patients from the Wuhan region. For each patient, they had access to 53 different blood markers that can be easily obtained at any hospital, as well as the outcome of the patient’s COVID-19 infection (death or discharge from hospital).

Using an interpretable machine learning approach based on XGBoost, they narrowed the 53 blood markers down to the three most significant ones for determining the outcome of the infection. Using only these three blood markers (‘Lactate dehydrogenase’, ‘Hypersensitive c-reactive protein’ and ‘Lymphocyte’), they were able to produce a very simple decision tree that predicted the outcome with about 90% accuracy on a test set of 110 patients.
As the decision tree is very simple (only three blood markers need to be looked at), this could be a powerful tool allowing hospitals to allocate their resources to the people most in need.

Single-tree model identified by Yan et al.

To identify decisive features among the 53 easily obtained blood markers, Yan et al.1 fitted a multi-tree XGBoost machine learning model to the data of 70% of the 375 patients (the data of the remaining 110 patients was kept aside for testing purposes). Using XGBoost's built-in feature importance method, they were able to identify the ten most important features. To pressure-test these findings, they fitted a single-tree XGBoost model to the data, starting with only the most important feature (as ranked by the previous method) and gradually adding the second-most important feature, then the third-most important and so on, until the performance of the model on a validation set no longer improved. This approach identified the following three features as the decisive features in the dataset:

  • Lactate dehydrogenase
  • Hypersensitive c-reactive protein
  • Lymphocyte

The single-tree XGBoost model still performed very well in the validation and test sets and, as it only contained one tree, it produced a decision tree that was very minimal and easy to evaluate.
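The stepwise procedure described above can be sketched in a few lines. This is a minimal illustration rather than the code of Yan et al.: the function name `select_features_incrementally` and the callback `score_fn` are hypothetical, and the toy scores are invented purely to exercise the stopping rule. In a real run, `score_fn` would fit a single-tree model on the given features and return its validation score.

```python
from typing import Callable, List, Sequence


def select_features_incrementally(
    ranked_features: Sequence[str],
    score_fn: Callable[[List[str]], float],
    min_gain: float = 0.0,
) -> List[str]:
    """Add features in importance order until the validation score stops improving.

    ranked_features: feature names, most important first (for example taken
        from the feature importances of a multi-tree model).
    score_fn: hypothetical callback that fits a model on the given features
        and returns its validation score (higher is better).
    """
    selected: List[str] = []
    best_score = float("-inf")
    for feature in ranked_features:
        candidate = selected + [feature]
        score = score_fn(candidate)
        if score <= best_score + min_gain:
            break  # no further improvement: keep the previous subset
        selected, best_score = candidate, score
    return selected


# Toy stand-in for "fit a single-tree model and score it on the validation set".
# The scores below are made up and depend only on the number of features used.
toy_scores = {1: 0.80, 2: 0.88, 3: 0.93, 4: 0.93, 5: 0.95}
ranking = ["feature_a", "feature_b", "feature_c", "feature_d", "feature_e"]
chosen = select_features_incrementally(ranking, lambda feats: toy_scores[len(feats)])
print(chosen)
```

Because the loop stops at the first feature that brings no gain, the result depends on the quality of the initial ranking. A later feature that would have helped is never tried.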

The DSM way: ~10,000 XGBoost models

As Yan et al.1 published the raw data and the scripts for pre-processing and evaluating the data on GitHub, we were able to analyse exactly the same data, this time using the DSM as our accelerator.

The DSM is able to generate and fit hundreds of models within minutes, while performing feature optimisation at the same time. We therefore let the DSM engine train roughly 10,000 XGBoost models, randomly picking both the XGBoost parameters and the features to use from the dataset. We used a 70%/30% split between training and validation sets. Although this approach differs considerably from that of Yan et al.1 and the DSM ran essentially without manual guidance, it provided further evidence that 'Lactate dehydrogenase' and 'Hypersensitive c-reactive protein' are important features for predicting COVID-19 outcomes.
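The DSM's internals are not described here, so the following is only a rough sketch of what a random search over feature subsets and hyperparameters could look like. The helper names `split_train_validation` and `sample_configurations` are hypothetical, the parameter names follow XGBoost's scikit-learn interface, and the sampling ranges are illustrative assumptions. Fitting and scoring each configuration is deliberately left out.

```python
import random
from typing import Dict, List, Tuple


def split_train_validation(
    n_rows: int, train_fraction: float = 0.7, seed: int = 0
) -> Tuple[List[int], List[int]]:
    """Shuffle row indices and split them into training and validation parts."""
    rng = random.Random(seed)
    indices = list(range(n_rows))
    rng.shuffle(indices)
    cut = int(n_rows * train_fraction)
    return indices[:cut], indices[cut:]


def sample_configurations(features: List[str], n_models: int, seed: int = 0) -> List[Dict]:
    """Draw a random feature subset and random hyperparameters for each model."""
    rng = random.Random(seed)
    configs = []
    for _ in range(n_models):
        n_selected = rng.randint(1, len(features))
        configs.append({
            "features": sorted(rng.sample(features, n_selected)),
            # Parameter names as in XGBoost's scikit-learn interface;
            # the ranges are assumptions chosen for illustration only.
            "params": {
                "max_depth": rng.randint(1, 8),
                "n_estimators": rng.randint(10, 300),
                "learning_rate": 10 ** rng.uniform(-2, 0),
                "subsample": rng.uniform(0.5, 1.0),
            },
        })
    return configs


# A handful of the blood markers named in this article, used as example columns.
blood_markers = [
    "Lactate dehydrogenase",
    "Hypersensitive c-reactive protein",
    "Lymphocyte count",
    "Procalcitonin",
    "Glucose",
]
train_rows, validation_rows = split_train_validation(n_rows=375)
configs = sample_configurations(blood_markers, n_models=1000)
print(len(train_rows), len(validation_rows), configs[0])
```

Each configuration would then be fitted on the training rows and scored on the validation rows, with the features, parameters and score stored together for later analysis.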

Two blood markers validated by DSM

In table 1 below, we looked at all models with an AUC above 85% and counted how often these models picked a certain feature. The features 'Hypersensitive c-reactive protein' and 'Lactate dehydrogenase' seem to correlate with good performance. Lymphocytes seem to be less important than the other two: only the count, not the percentage used by Yan et al.1, shows up in the top ten. Again, this is in line with the paper, where the proposed simple decision tree suggests that lymphocytes only act as a tiebreaker if the prognosis isn't clear from the other two features.
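The counting step itself is simple. As a minimal sketch, assume each trained model is stored as a record with its feature list and validation AUC. The record layout, the function name `count_features_in_good_models` and the three toy records below are assumptions for illustration, not actual DSM output.

```python
from collections import Counter
from typing import Dict, Iterable


def count_features_in_good_models(runs: Iterable[Dict], auc_threshold: float = 0.85) -> Counter:
    """Count how often each feature appears among models above the AUC threshold."""
    counts: Counter = Counter()
    for run in runs:
        if run["auc"] > auc_threshold:
            # set() guards against a feature being listed twice in one model
            counts.update(set(run["features"]))
    return counts


# Toy records with made-up AUC values, only to show the mechanics.
runs = [
    {"features": ["Lactate dehydrogenase", "Glucose"], "auc": 0.91},
    {"features": ["Hypersensitive c-reactive protein", "Lactate dehydrogenase"], "auc": 0.95},
    {"features": ["Glucose", "Procalcitonin"], "auc": 0.70},
]
counts = count_features_in_good_models(runs)
for feature, count in counts.most_common(10):
    print(f"{feature}: {count}")
```

Raw counts like these are only comparable across features if every feature had a similar chance of being drawn, which holds when feature subsets are sampled uniformly at random.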

It’s also worth noting that some of the other blood markers in our top ten were also found in other papers discussing blood markers to determine the severity of COVID-19 infections. Specifically, the features in the papers by Wu et al.3 and Lu et al.2 overlapped significantly with the features in our top ten.

Confirmed importance of critical features

In this section, our aim is to verify whether the features (i.e. blood markers) identified as critical in the paper by Yan et al.1 are similarly critical within the models built by the DSM. To do this, we studied the performance of all DSM-generated models as a function of the number of critical features used, for different numbers of total features (figure 1). We observed a clear trend indicating that adding the first critical feature increases model performance, especially if the total number of features is low. In general, model performance rises monotonically with the number of critical features used.
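The grouping behind this kind of analysis can be sketched as follows, again assuming one record per model with its feature list and validation AUC. The function name `mean_auc_by_feature_counts`, the record layout and the toy AUC values are assumptions for illustration. The critical feature names are written as they appear in this article and may differ from the column labels in the published dataset.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, Set, Tuple

# The three features reported by Yan et al., named as in this article.
CRITICAL_FEATURES = {
    "Lactate dehydrogenase",
    "Hypersensitive c-reactive protein",
    "Lymphocyte",
}


def mean_auc_by_feature_counts(
    runs: Iterable[Dict], critical: Set[str] = CRITICAL_FEATURES
) -> Dict[Tuple[int, int], float]:
    """Average AUC per (total features, critical features used) bucket."""
    buckets = defaultdict(list)
    for run in runs:
        features = set(run["features"])
        key = (len(features), len(features & critical))
        buckets[key].append(run["auc"])
    return {key: mean(values) for key, values in sorted(buckets.items())}


# Toy records with made-up AUC values, only to show the mechanics.
runs = [
    {"features": ["Glucose", "Procalcitonin"], "auc": 0.70},
    {"features": ["Glucose", "Lactate dehydrogenase"], "auc": 0.90},
    {"features": ["Procalcitonin", "Lactate dehydrogenase"], "auc": 0.80},
    {"features": ["Lactate dehydrogenase", "Hypersensitive c-reactive protein"], "auc": 0.95},
]
summary = mean_auc_by_feature_counts(runs)
for (n_total, n_critical), avg in summary.items():
    print(f"total={n_total} critical={n_critical} mean AUC={avg:.3f}")
```

Plotting the mean AUC against the number of critical features, with one line per total feature count, gives a view of the kind described for figure 1.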

In general, models with good performance seem to correlate with the two features 'Lactate dehydrogenase' and 'Hypersensitive c-reactive protein'. The 'Lymphocyte' feature appears to be less important for having a model with good performance. This seems to be in line with the interpretable tree-based model that Yan et al. proposed, where 'Lymphocyte' was used more as a tiebreaker in cases where 'Lactate dehydrogenase' and 'Hypersensitive c-reactive protein' weren't decisive by themselves.

Evidence

Feature                               Count
Hypersensitive c-reactive protein*    2,466
Lactate dehydrogenase*                2,397
D-D dimer                             2,150
Aspartate aminotransferase            2,150
Procalcitonin                         2,129
International standard ratio          2,125
Thrombocytocrit                       2,120
Prothrombin time                      2,099
Glucose                               2,087
Lymphocyte count                      2,083

Table 1: The top ten most used features among models with ‘good’ performance (AUC > 85%). The features marked with an asterisk are those that Yan et al.1 found to be most predictive of critical disease progression. Yan et al.’s1 model also found ‘(%)Lymphocyte’4 to be an important feature, while our analysis didn’t. Instead, the slightly different feature ‘Lymphocyte count’ showed up in our top ten. Even in Yan et al.’s1 analysis, however, ‘Lymphocyte’ was only used as a tiebreaker in cases where ‘Hypersensitive c-reactive protein’ and ‘Lactate dehydrogenase’ were not conclusive.

Figure 1: For different numbers of total features used in a model, we plotted the performance of our models as a function of how many ‘critical features’ were used by the model. We can see a clear indication that for each bucket of ‘total features’, the average performance of models increases monotonically with the number of ‘critical features’ used.


Technical conclusion

Using a completely different approach to Yan et al.1, powered and accelerated by the DSM, we reached similar results, giving extra evidence for their findings. In particular, we were able to demonstrate the importance of ‘Lactate dehydrogenase’ and ‘Hypersensitive c-reactive protein’ for predicting the outcome of COVID-19 infections in the dataset that Yan et al.1 provided.

Results found by Yan et al.1 | Confirmation by DSM
Yan et al.1 found the three blood markers ‘Lactate dehydrogenase’, ‘Hypersensitive c-reactive protein’ and ‘Lymphocyte’ to be the most important blood markers in their dataset | The DSM found ‘Lactate dehydrogenase’ and ‘Hypersensitive c-reactive protein’ to be the most frequently picked features by models above a certain performance threshold
Showed the existence of a simple, interpretable decision tree for predicting COVID-19 outcomes based on the three features | The DSM found very high performance for all models using the three features found by Yan et al.1

Summary

PwC Switzerland’s digital accelerator Data Science Machine proved to be a reliable solution for obtaining instant tangible results.

The DSM proved to be a valuable tool by providing:

  • an easy-to-use, automated machine learning pipeline that allows the problem at hand to be tackled quickly and easily
  • rapid insights into the success of different model configurations by calculating all the necessary metrics and keeping track of the model parameters and data features used
  • easy tunability, allowing the user to adjust the fitting process to the questions posed by the problem
  • a time-saving way to do data science, accelerating the work of a data scientist and speeding up the time to arrive at important findings in the data.

To summarise, the DSM lets experts rapidly obtain valuable scientific insights into their data that can bring their field or business to the next level when it comes to AI.

1 Yan, L., Zhang, H., Goncalves, J. et al., 2020. An interpretable mortality prediction model for COVID-19 patients. Nature Mach Intell, Vol. 2, pp. 283–288.

2 Lu, Y., Sun, K., Guo, S., Wang et al., 2020. Early Warning Indicators of Severe COVID-19: A Single-Center Study of Cases From Shanghai, China. Frontiers in Medicine.

3 Wu, Y., Hou, B., Liu, J., Chen, Y. and Zhong, P., 2020. Risk Factors Associated With Long-Term Hospitalization in Patients With COVID-19: A Single-Centered, Retrospective Study. Frontiers in Medicine.

4 (%)Lymphocyte reflects the number or percentage of lymphocytes, which are white blood cells that include B-cells, T-cells, and natural killer cells.

Contact us

Jakob Olbrich

Data & Analytics, PwC Switzerland

Tel: +41 58 792 21 37