In this example, we build a machine learning model to detect pancreatic cancer using urinary biomarkers.

Pancreatic ductal adenocarcinoma (PDAC) is a highly lethal type of pancreatic cancer. Once diagnosed, the five-year survival rate is less than 10%. However, if the disease is detected while tumors are still small and resectable, five-year survival can increase by up to 70%. Unfortunately, many cases of pancreatic cancer show no symptoms until the cancer has spread throughout the body. While blood has traditionally been the primary source of biomarkers, urine is a promising alternative biological fluid. It allows completely non-invasive sampling, high-volume collection, and easy repeated measurements.

Previous studies identified a panel of three protein biomarkers (LYVE1, REG1A, and TFF1) in urine that showed promise in detecting resectable PDAC. We have improved this panel by substituting REG1A with REG1B. Although its collection requires an invasive blood sample, the biomarker plasma_CA19_9 further improves cancer detection when combined with the urine biomarkers. In summary, the four key biomarkers we consider in urine are creatinine, LYVE1, REG1B, and TFF1. Creatinine is a metabolite often used as an indicator of kidney function. LYVE1 (lymphatic vessel endothelial hyaluronan receptor 1) is a protein that may play a role in tumor metastasis. REG1B is a protein that may be associated with pancreas regeneration. Finally, TFF1 (trefoil factor 1) may be related to regeneration and repair of the urinary tract.

This example is solved with Neural Designer. To follow it step by step, you can use the free trial.

Contents

  1. Application type.
  2. Data set.
  3. Neural network.
  4. Training strategy.
  5. Model selection.
  6. Testing analysis.
  7. Model deployment.

 

1. Application type

This is a classification project: the variable to be predicted is categorical (no pancreatic disease, benign hepatobiliary disease, or pancreatic cancer). The goal is to predict the presence of disease before it is diagnosed. More specifically, we want to differentiate pancreatic cancer from benign pancreatic conditions and healthy controls.

2. Data set

The data set was obtained from multiple centers: Barts Pancreas Tissue Bank, University College London, University of Liverpool, Spanish National Cancer Research Center, Cambridge University Hospital, and the University of Belgrade. These centers analyzed this panel of biomarkers in 590 urine samples: 183 control samples, 208 benign hepatobiliary disease samples (119 were chronic pancreatitis), and 199 PDAC samples.

It is composed of four concepts:

  • Data source.
  • Variables.
  • Instances.
  • Missing values.

Data source

The data file pancreatic-cancer.csv contains the information used to create the model. It consists of 590 rows and 14 columns. The columns represent the study variables, while the rows represent the samples.

Variables

This data set uses the following 14 variables:

  • sample_id: Unique string identifying each subject.
  • patient_cohort: Cohort 1 (previously used samples); Cohort 2 (newly added samples).
  • sample_origin: BPTB: Barts Pancreas Tissue Bank, London, UK; ESP: Spanish National Cancer Research Centre, Madrid, Spain; LIV: Liverpool University, UK; UCL: University College London, UK.
  • age: Age in years.
  • sex: M = male, F = female.
  • plasma_CA19_9: Blood plasma levels of CA 19-9 monoclonal antibody. It is often elevated in patients with pancreatic cancer. This study evaluated plasma_CA19_9 only in 350 patients. Indeed, one goal was to compare various CA 19-9 cutpoints from a blood sample to the model developed using urinary samples.
  • creatinine: Urinary biomarker of kidney function.
  • LYVE1: Urinary levels of Lymphatic vessel endothelial hyaluronan receptor 1, a protein that may play a role in tumor metastasis.
  • REG1B: Urinary protein levels that may be associated with pancreas regeneration.
  • REG1A: Urinary protein levels that may be associated with pancreas regeneration. Only assessed in 306 patients (one study goal was to assess REG1B vs. REG1A).
  • TFF1: Urinary levels of Trefoil Factor 1 may be related to regeneration and repair of the urinary tract.
  • diagnosis: 1 = control (no pancreatic disease); 2 = benign hepatobiliary disease (119 of which are chronic pancreatitis); 3 = pancreatic ductal adenocarcinoma, i.e., pancreatic cancer.
  • benign_sample_diagnosis: For those with a benign, non-cancerous diagnosis.
  • stage: Stage for those with pancreatic cancer: IA, IB, IIA, IIB, III, IV.

We set a few of these input variables as unused:

  • ‘sample_id’: Neural Designer sets this variable as unused automatically.
  • ‘sample_origin’: it only records the origin of the patient samples and should not affect the final diagnosis.
  • ‘stage’: it only exists for individuals already known to have cancer.
  • ‘patient_cohort’: it does not contribute to the final sample diagnosis.
  • ‘benign_sample_diagnosis’: it only specifies the complete diagnosis for patients with a benign diagnosis.

The biomarker REG1A is missing in a large fraction of the samples (it was only assessed in 306 patients). For that reason, we choose to set it as unused, too. This decision does not degrade the model, since the biomarker REG1B improves the results.

Variables distribution

Once the data set is configured, we can calculate the data distribution of the variables. The following figure depicts the number of patients who have cancer and those who do not.

As we can see, the three cases have similar sample numbers.

We must divide our dataset into four subsets to compare the accuracy and AUC (Area Under Curve) calculated in this study with those in the paper cited in the references section.

  • Control samples vs. PDAC stages I and II: From the raw dataset, we select only healthy individual samples and pancreatic cancer stages I and II samples (File: control-vs-PDAC-I_II.csv). The model has to predict the variable PDAC_I_II, which indicates whether the patient has PDAC stage I or II or is healthy (Control samples = 0, PDAC-I_II = 1).
  • Control samples vs. PDAC stages III and IV: From the raw dataset, we select only healthy individual samples and pancreatic cancer stages III and IV samples (File: control-vs-PDAC-III_IV.csv). The model has to predict the variable PDAC_III_IV, which indicates whether the patient has PDAC stage III or IV or is healthy (Control samples = 0, PDAC-III_IV = 1).
  • Benign hepatobiliary diseases vs. PDAC stages I and II: From the raw dataset, we select only benign hepatobiliary disease samples and pancreatic cancer stages I and II samples (File: benign-vs-PDAC-I_II.csv). The model has to predict the variable PDAC_I_II, which indicates whether the patient has PDAC stage I or II or a benign hepatobiliary disease (Benign hepatobiliary disease = 0, PDAC-I_II = 1).
  • Benign hepatobiliary diseases vs. PDAC stages III and IV: From the raw dataset, we select only benign hepatobiliary disease samples and pancreatic cancer stages III and IV samples (File: benign-vs-PDAC-III_IV.csv). The model has to predict the variable PDAC_III_IV, which indicates whether the patient has PDAC stage III or IV or a benign hepatobiliary disease (Benign hepatobiliary disease = 0, PDAC-III_IV = 1).

All these cases are divided into training and testing, containing 50% of the samples in each subset.
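The 50/50 split can be sketched as follows. This is a rough illustration (scikit-learn stands in for Neural Designer): the column names follow the variable list above, but the values below are synthetic stand-ins for the case file, not the study data.

```python
# Minimal sketch of the 50/50 training/testing split used in this example.
import pandas as pd
from sklearn.model_selection import train_test_split

# In practice: data = pd.read_csv("control-vs-PDAC-I_II.csv")
data = pd.DataFrame({
    "age":        [45, 62, 58, 70, 51, 66, 49, 73],
    "creatinine": [0.9, 1.1, 0.8, 1.3, 1.0, 0.7, 1.2, 0.9],
    "LYVE1":      [0.5, 3.8, 0.7, 4.1, 0.4, 3.5, 0.6, 4.4],
    "REG1B":      [20, 120, 25, 140, 18, 110, 22, 150],
    "TFF1":       [150, 750, 160, 800, 140, 700, 155, 820],
    "PDAC_I_II":  [0, 1, 0, 1, 0, 1, 0, 1],
})

X = data.drop(columns="PDAC_I_II")
y = data["PDAC_I_II"]          # 0 = control, 1 = PDAC stage I or II

# 50% of the samples for training, 50% for testing, keeping class balance
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=0)
```

The `stratify=y` option keeps the class proportions equal in both halves, which matters here because the subsets are fairly small.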

2.1. Control samples vs. PDAC stages I and II

The next figure depicts the correlations of all the inputs with the target and helps us see each input's influence on the diagnosis.

The most correlated variables are age, plasma_CA19_9, and LYVE1.
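The inputs-target correlations can be computed directly from the data frame. A sketch with synthetic values (a stand-in for the case file, not the study data):

```python
# Pearson correlation of every input with the binary target.
import pandas as pd

data = pd.DataFrame({
    "age":        [45, 62, 58, 70, 51, 66],
    "LYVE1":      [0.5, 3.8, 0.7, 4.1, 0.4, 3.5],
    "creatinine": [0.9, 1.1, 0.8, 1.3, 1.0, 0.7],
    "PDAC_I_II":  [0, 1, 0, 1, 0, 1],
})

correlations = (data.corr()["PDAC_I_II"]
                    .drop("PDAC_I_II")
                    .sort_values(ascending=False))
print(correlations)
```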

2.2. Control samples vs. PDAC stages III and IV

The next figure depicts the correlations of all the inputs with the target and helps us see each input's influence on the diagnosis.

The most correlated variables are age, plasma_CA19_9, and creatinine.

2.3. Benign hepatobiliary diseases vs. PDAC stages I and II

The next figure depicts the correlations of all the inputs with the target and helps us see each input's influence on the diagnosis.

The most correlated variables are age, plasma_CA19_9, and REG1B.

2.4. Benign hepatobiliary diseases vs. PDAC stages III and IV

The next figure depicts the correlations of all the inputs with the target. This helps us see each input's influence on the diagnosis.

The most correlated variables are age, plasma_CA19_9, and creatinine.

3. Neural network

The next step is to choose a neural network to represent the classification function. We use the same neural network configuration for all four cases.

We found that adding a perceptron (hidden) layer contributed to overfitting the neural network. For this reason, the network contains no perceptron layer.

The following figure is a diagram of the neural network used in each case of this example:

It contains a scaling layer with seven neurons (yellow) and a probabilistic layer with one neuron (red).
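With no hidden perceptron layer, this architecture (scaling layer feeding a single probabilistic neuron) behaves like standardized logistic regression. A sketch with scikit-learn on synthetic data (the seven columns mirror age, sex, plasma_CA19_9, creatinine, LYVE1, REG1B, TFF1; the values are illustrative only):

```python
# Scaling layer + one probabilistic neuron, approximated as a pipeline.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X = np.array([[45, 0,  10.0, 0.9, 0.5,  20, 150],
              [62, 1, 500.0, 1.1, 3.8, 120, 750],
              [58, 1,  12.0, 0.8, 0.7,  25, 160],
              [70, 0, 640.0, 1.3, 4.1, 140, 800],
              [51, 0,   9.0, 1.0, 0.4,  18, 140],
              [66, 1, 480.0, 0.7, 3.5, 110, 700]])
y = np.array([0, 1, 0, 1, 0, 1])

model = make_pipeline(
    StandardScaler(),        # scaling layer (7 inputs)
    LogisticRegression(),    # probabilistic layer (1 neuron, L2 by default)
)
model.fit(X, y)
print(model.predict_proba(X)[:, 1])
```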

4. Training strategy

The next step is to configure the training strategy, which is applied to the neural network to obtain the best possible loss. The type of training is determined by how the adjustment of the parameters in the neural network takes place. It is composed of two concepts:

  • A loss index.
  • An optimization algorithm.

The loss index chosen for this problem is the mean squared error with L2 regularization. It measures the average squared error between the outputs of the neural network and the targets in the data set, plus a penalty on the size of the parameters.
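As a small numeric sketch (the regularization weight and values below are made up for illustration):

```python
# Mean squared error with an L2 penalty on the parameters.
import numpy as np

def loss_index(outputs, targets, weights, reg=0.01):
    mse = np.mean((outputs - targets) ** 2)   # average squared error
    l2 = reg * np.sum(weights ** 2)           # L2 regularization term
    return mse + l2

outputs = np.array([0.9, 0.2, 0.7, 0.1])   # network outputs
targets = np.array([1.0, 0.0, 1.0, 0.0])   # 0/1 targets
weights = np.array([0.5, -0.3])            # network parameters

print(loss_index(outputs, targets, weights))  # 0.0375 MSE + 0.0034 penalty
```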

The optimization algorithm is applied to the neural network to find the parameters that minimize the loss. Here we use gradient descent for training: Neural Designer updates the neural network's parameters in the direction of the negative gradient of the loss function.
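The update rule can be illustrated on a toy loss (a quadratic stand-in, not the actual loss index of this example):

```python
# Gradient descent: step the parameters along the negative gradient.
import numpy as np

def loss(w):                       # toy loss with minimum at (1, -2)
    return (w[0] - 1.0) ** 2 + (w[1] + 2.0) ** 2

def grad(w):                       # gradient of the toy loss
    return np.array([2 * (w[0] - 1.0), 2 * (w[1] + 2.0)])

w = np.zeros(2)                    # initial parameters
lr = 0.1                           # learning rate
for _ in range(200):               # epochs
    w -= lr * grad(w)              # negative-gradient update

print(w)                           # converges towards [1, -2]
```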

The following chart shows how the training error decreases with the epochs during the training process. All the charts have a similar curvature. We will only show the case corresponding to control samples vs. PDAC stages I and II. The selection error is not plotted in the chart because we have not initially taken any selection samples.

Now, we calculate the training error of all the cases of this study:

  • Control samples vs. PDAC stages I and II: training error = 0.0968 (MSE).
  • Control samples vs. PDAC stages III and IV: training error = 0.105 (MSE).
  • Benign hepatobiliary diseases vs. PDAC stages I and II: training error = 0.162 (MSE).
  • Benign hepatobiliary diseases vs. PDAC stages III and IV: training error = 0.111 (MSE).

6. Testing analysis

The next step is to evaluate the performance of the trained neural network by an exhaustive testing analysis. The standard way to do this is to compare the neural network outputs against data never seen before in the testing instances.

A common method to measure the generalization performance is the ROC curve. It is a visual aid to study the classifier's discrimination capacity. One of the parameters obtained from this chart is the area under the curve (AUC): the closer the AUC is to 1, the better the classifier.
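The AUC computation can be sketched with scikit-learn on synthetic scores (illustrative only, not the study's predictions):

```python
# Area under the ROC curve from true labels and predicted scores.
import numpy as np
from sklearn.metrics import roc_auc_score

y_true  = np.array([0, 0, 1, 1, 0, 1, 0, 1])          # held-out labels
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.3, 0.7])  # outputs

auc = roc_auc_score(y_true, y_score)
print(auc)  # 0.9375 for these synthetic scores
```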

6.1. Control samples vs. PDAC stages I and II

In this case, the AUC takes a high value: AUC = 0.919.

Neural Designer computes the optimal threshold by finding the point of the ROC curve nearest to the upper left corner.
The threshold corresponding to that point is called the optimal threshold and, in this case, has a value of 0.788.
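The nearest-to-the-corner rule can be sketched as follows, on synthetic scores (so the threshold found below is not the 0.788 of this case):

```python
# Optimal threshold: the ROC point closest to the top-left corner (0, 1).
import numpy as np
from sklearn.metrics import roc_curve

y_true  = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.5, 0.8, 0.2, 0.9, 0.3, 0.7])

fpr, tpr, thresholds = roc_curve(y_true, y_score)
distances = np.hypot(fpr, 1.0 - tpr)     # distance of each point to (0, 1)
optimal = thresholds[np.argmin(distances)]
print(optimal)
```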

The confusion matrix gives us helpful information about our predictive model’s performance. For the optimal decision threshold, we display the confusion matrix:

                     Real positive   Real negative
Predicted positive   34 (39.5%)      6 (7.0%)
Predicted negative   9 (10.5%)       37 (43.0%)
  • Classification accuracy: 82.6% (Ratio of correctly classified samples).
  • Error rate: 17.4% (Ratio of misclassified samples).
  • Sensitivity: 79.1% (Portion of real positives the model predicts as positives).
  • Specificity: 86.0% (Portion of real negatives the model predicts as negatives).
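These figures follow directly from the confusion-matrix counts; taking the assignment consistent with the reported sensitivity and specificity (34 true positives, 6 false positives, 9 false negatives, 37 true negatives):

```python
# Reproducing the reported metrics from the confusion-matrix counts.
TP, FP, FN, TN = 34, 6, 9, 37
total = TP + FP + FN + TN

accuracy    = (TP + TN) / total      # correctly classified samples
error_rate  = (FP + FN) / total      # misclassified samples
sensitivity = TP / (TP + FN)         # real positives predicted positive
specificity = TN / (TN + FP)         # real negatives predicted negative

print(round(accuracy, 3), round(error_rate, 3),
      round(sensitivity, 3), round(specificity, 3))
```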

The high classification accuracy (82.6%) makes the prediction suitable for many cases.

6.2. Control samples vs. PDAC stages III and IV

In this case, the AUC takes a high value: AUC = 0.913.

The optimal threshold has a value of 0.587.

                     Real positive   Real negative
Predicted positive   85 (60.7%)      9 (6.4%)
Predicted negative   7 (5.0%)        39 (27.9%)
  • Classification accuracy: 88.6% (Ratio of correctly classified samples).
  • Error rate: 11.4% (Ratio of misclassified samples).
  • Sensitivity: 92.4% (Portion of real positives the model predicts as positives).
  • Specificity: 81.3% (Portion of real negatives the model predicts as negatives).

The classification accuracy is high (88.6%), making the prediction suitable for many cases.

6.3. Benign hepatobiliary diseases vs. PDAC stages I and II

In this case, the AUC takes a high value: AUC = 0.920.

The optimal threshold has a value of 0.653.

                     Real positive   Real negative
Predicted positive   44 (46.8%)      5 (5.3%)
Predicted negative   11 (11.7%)      34 (36.2%)
  • Classification accuracy: 83.0% (Ratio of correctly classified samples).
  • Error rate: 17.0% (Ratio of misclassified samples).
  • Sensitivity: 80.0% (Portion of real positive predicted positive).
  • Specificity: 87.2% (Portion of real negative predicted negative).

The classification accuracy is high (83.0%), making the prediction suitable for many cases.

6.4. Benign hepatobiliary diseases vs. PDAC stages III and IV

In this case, the AUC takes a high value: AUC = 0.848.

The optimal threshold has a value of: 0.412.

                     Real positive   Real negative
Predicted positive   47 (52.8%)      11 (12.4%)
Predicted negative   8 (9.0%)        23 (25.8%)
  • Classification accuracy: 78.7% (Ratio of correctly classified samples).
  • Error rate: 21.3% (Ratio of misclassified samples).
  • Sensitivity: 85.5% (Portion of real positives that the model predicts as positives).
  • Specificity: 67.6% (Portion of real negatives the model predicts as negatives).

The classification accuracy (78.7%) is lower than in the previous cases but still makes the prediction useful in many cases.

As in the paper, we will show a table with some sensitivity and specificity cutoffs. First, we treat the case of the control samples versus pancreatic cancer stages I and II, and III and IV:

Sensitivity cutoff   Specificity (Control vs I, II)   Specificity (Control vs III, IV)
0.80                 0.860                            0.875
0.85                 0.791                            0.854
0.90                 0.744                            0.833
0.95                 0.512                            0.771

Now we treat the case of the benign samples versus pancreatic cancer stages I and II, and III and IV:

Specificity cutoff   Sensitivity (Benign vs I, II)   Sensitivity (Benign vs III, IV)
0.80                 0.846                           0.676
0.85                 0.769                           0.647
0.90                 0.769                           0.618
0.95                 0.615                           0.559
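Tables like these can be read directly off a ROC curve: for each required sensitivity (true positive rate), take the best achievable specificity (1 − false positive rate). A sketch with synthetic scores, not the study's data:

```python
# Specificity achievable at fixed sensitivity cutoffs, from the ROC curve.
import numpy as np
from sklearn.metrics import roc_curve

y_true  = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.3, 0.7])

fpr, tpr, _ = roc_curve(y_true, y_score)

specs = {}
for cutoff in (0.8, 0.85, 0.9, 0.95):
    idx = np.argmax(tpr >= cutoff)     # first ROC point reaching the cutoff
    specs[cutoff] = 1.0 - fpr[idx]     # specificity at that point
print(specs)
```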

7. Model deployment

Once we have tested the generalization performance of the neural network, we can save it for future use in the so-called model deployment mode.

An important task in the model deployment tool is calculating the neural network's outputs for a given set of inputs. The outputs depend, in turn, on the values of the parameters.

Next, we show an example for the benign hepatobiliary diseases vs. PDAC stages III and IV case:

  • age: 45
  • sex: F (1)
  • plasma_CA19_9: 740.94
  • creatinine: 0.927814
  • LYVE1: 3.78856
  • REG1B: 121.787
  • TFF1: 752.305
  • PDAC_III_IV (model output): 0.6895

That person’s pancreatic cancer risk (stages III or IV) would be high.
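The deployment step amounts to feeding the patient's inputs to the trained model and reading the output probability. A hypothetical sketch with a stand-in classifier trained on synthetic data (so it does not reproduce the 0.6895 output of the actual network):

```python
# Deployment sketch: probability of PDAC stages III-IV for one patient.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_train = rng.normal(size=(40, 7))          # 7 inputs, synthetic
y_train = (X_train[:, 4] > 0).astype(int)   # toy rule on the LYVE1 column

model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)

# age, sex (F = 1), plasma_CA19_9, creatinine, LYVE1, REG1B, TFF1
patient = np.array([[45, 1, 740.94, 0.927814, 3.78856, 121.787, 752.305]])
probability = model.predict_proba(patient)[0, 1]
print(probability)
```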

References

  • Debernardi S, O’Brien H, Algahmdi AS, Malats N, Stewart GD, et al. (2020) A combination of urinary biomarker panel and PancRISK score for earlier detection of pancreatic cancer: A case-control study. PLOS Medicine 17(12): e1003489
  • Dataset from: Kaggle: Urinary biomarkers for pancreatic cancer.
