Introduction

Colorectal cancer is one of the leading causes of cancer mortality, with liver metastases being a major determinant of prognosis and survival.

Metastasis prediction using machine learning provides a powerful approach to identify which patients are most likely to develop metastasis, taking into account the complex interplay between genetic alterations and clinical factors.

We developed a neural network model that integrates mutational data from 492 genes and phenotypic variables to estimate the probability of liver metastasis.

Using the MSK-MET cohort from Memorial Sloan Kettering, which includes genomic and clinical data from over 25,000 patients, the model achieved an accuracy of 78% and an AUC of 0.85, showing strong potential as a clinical decision-support tool.

Healthcare professionals can explore this methodology through Neural Designer’s trial version, applying the colon cancer metastasis prediction model to new patient data.

Contents

The following index outlines the steps for performing the analysis.

1. Model type

2. Data set

3. Neural network

4. Training strategy

5. Model selection

6. Testing analysis

7. Model deployment

1. Model type

The predicted variable can have two values: “yes” if the patient has liver metastasis and “no” otherwise. Therefore, this is a binary classification project.

The goal is to model the probability of metastasis in the liver based on mutational and phenotypic data using artificial intelligence and machine learning.

2. Data set

Data source

The liver_metastasis_colon_cancer.csv file contains the data for this example. In a binary classification model, the target variable can only take two values: 0 (false, no) or 1 (true, yes). The data set has 3537 instances (rows) and 510 variables (columns).

The number of input variables, or attributes for each sample, is 509. There is 1 target variable, distant_metastasis_liver (Yes or No), which indicates whether the patient has liver metastasis.
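As a rough illustration of this inputs/target split, the following sketch reads a CSV and separates the target column from the inputs. The three-row excerpt and its column subset are hypothetical, and the sketch assumes every input column is numeric; a categorical column such as cancer_type_detailed would need encoding first.

```python
import csv
import io

# Hypothetical three-row excerpt; the real file has 3537 rows and 510 columns.
SAMPLE_CSV = """age_at_surgical_procedure,metastasis_count,KIT,distant_metastasis_liver
61,2,0,1
48,0,1,0
70,3,0,1
"""

TARGET = "distant_metastasis_liver"


def load_inputs_and_target(handle, target=TARGET):
    """Read a CSV and split each row into numeric inputs and a 0/1 target."""
    reader = csv.DictReader(handle)
    input_names = [name for name in reader.fieldnames if name != target]
    inputs, targets = [], []
    for row in reader:
        inputs.append([float(row[name]) for name in input_names])
        targets.append(int(row[target]))
    return input_names, inputs, targets


input_names, inputs, targets = load_inputs_and_target(io.StringIO(SAMPLE_CSV))
print(len(inputs), "instances,", len(input_names), "inputs, 1 target")
```

To read the actual file, the same function could be passed an open file handle instead of the in-memory excerpt.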

Variables

The following list summarizes the variables’ information:

Phenotypic features

  • age_at_first_metastasis_diagnostic – Age at which the first metastasis was diagnosed.
  • age_at_surgical_procedure – Age at the time of surgery.
  • cancer_type_detailed – Specific cancer type: colon, rectal, or colorectal.
  • mortality_3_years – Whether the patient is alive 3 years after sequencing (1 = yes, 0 = no).
  • fraction_genome_altered – Percentage of the genome affected by copy number variations.
  • metastasis_count – Total number of metastases observed.
  • metastasis_primary_site_count – Number of metastases in the primary tumor site.

Genetic features (gene panel)

  • KIT – Number of mutations in KIT gene.
  • CARD11 – Number of mutations in CARD11 gene.
  • RB1 – Number of mutations in RB1 gene.
  • WT1 – Number of mutations in WT1 gene.
  • PLCG2 – Number of mutations in PLCG2 gene.
  • DNMT1 – Number of mutations in DNMT1 gene.
  • BRD4 – Number of mutations in BRD4 gene.
  • PIK3R1 – Number of mutations in PIK3R1 gene.
  • IRS2 – Number of mutations in IRS2 gene.
  • SESN1 – Number of mutations in SESN1 gene.
  • NPM1 – Number of mutations in NPM1 gene.

Target variable

  • distant_metastasis_liver (0 or 1) – 0 if no liver metastasis, 1 if liver metastasis is present.

The image shows that metastatic liver tumors represent 57% of the samples, while 43% represent tumors without liver metastases.

The inputs-targets correlations can indicate which factors most influence whether a tumor produces liver metastases and are, therefore, most relevant to our analysis.

Here, the variables most correlated with liver metastasis are metastasis_primary_site_count, metastasis_count, cancer_subtype, and microsatellite_instability_score.
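One simple way to rank inputs against a binary target is the Pearson (point-biserial) correlation. The sketch below assumes that measure, which may differ from the correlation type used by the tool, and the toy columns are hypothetical values chosen only to show the ranking step.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equally long numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear information
    return cov / math.sqrt(var_x * var_y)


# Hypothetical toy data, not taken from the MSK-MET cohort.
columns = {
    "metastasis_count": [0, 1, 3, 4, 2, 5],
    "KIT": [0, 1, 0, 0, 1, 0],
}
target = [0, 0, 1, 1, 0, 1]

# Rank inputs by the absolute value of their correlation with the target.
ranking = sorted(
    ((name, pearson(values, target)) for name, values in columns.items()),
    key=lambda item: abs(item[1]),
    reverse=True,
)
for name, r in ranking:
    print(f"{name}: {r:+.3f}")
```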

3. Neural network

The next step is to set up a neural network representing the classification function. For this class of applications, the neural network is composed of:

The scaling layer contains the statistics on the inputs calculated from the data file and the method for scaling the input variables. Here, the minimum-maximum method has been set. Nevertheless, the mean-standard deviation method would produce very similar results. As we use 497 input variables, the scaling layer has 497 inputs.
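The minimum-maximum method can be sketched as follows. The target range [0, 1] is an assumption made here for illustration (other ranges, such as [-1, 1], are also common), and the age values are hypothetical.

```python
def fit_min_max(column):
    """Return the (minimum, maximum) statistics a scaling layer would store."""
    return min(column), max(column)


def scale_min_max(value, minimum, maximum, low=0.0, high=1.0):
    """Map value linearly from [minimum, maximum] to [low, high]."""
    if maximum == minimum:
        return low  # constant input: there is no spread to rescale
    return low + (value - minimum) * (high - low) / (maximum - minimum)


ages = [35.0, 50.0, 65.0, 80.0]  # hypothetical ages at surgical procedure
minimum, maximum = fit_min_max(ages)
scaled = [scale_min_max(age, minimum, maximum) for age in ages]
print(scaled)
```

The statistics are computed once from the training data and then reused unchanged when scaling new patients.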

To stabilize and simplify our model, we won’t use a perceptron layer.

The probabilistic layer only contains the method for interpreting the outputs as probabilities. Moreover, as the output layer’s activation function is logistic, that output can already be interpreted as a probability of class membership. The probabilistic layer has 497 inputs. It has one output, representing the probability that a patient has liver metastasis.

The following figure is a graphical representation of this neural network for liver metastasis diagnosis.

As mentioned above, the network has 497 inputs, from which we obtain a single output value. This value is the probability of liver metastasis for each patient.
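With no hidden layer, a network of this shape behaves like a logistic regression: a weighted sum of the scaled inputs passed through the logistic function. The sketch below assumes that reading; the weights, bias, and patient values are hypothetical, and only three inputs are used for brevity.

```python
import math


def logistic(z):
    """Logistic activation, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(scaled_inputs, weights, bias):
    """Single-output forward pass: logistic(bias + sum of weight * input)."""
    z = bias + sum(w * x for w, x in zip(weights, scaled_inputs))
    return logistic(z)


# Hypothetical parameters and one hypothetical, already scaled patient.
weights = [1.2, -0.4, 0.8]
bias = -0.5
patient = [0.9, 0.0, 0.5]

probability = predict_probability(patient, weights, bias)
print(f"Estimated probability of liver metastasis: {probability:.3f}")
```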

4. Training strategy

The fourth step is to set the training strategy, which is composed of two terms:

  • A loss index.
  • An optimization algorithm.

The loss index is the weighted squared error with L2 regularization, which is the default loss index for binary classification applications.

We can state the learning problem as finding a neural network that minimizes the loss index. That is, a neural network that fits the data set (error term) and does not oscillate (regularization term).
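A loss index of this kind can be sketched as an error term plus a regularization term. The normalization by the number of instances, the class weights, and the regularization strength below are assumptions for illustration and are not taken from the tool's implementation; all numeric values are hypothetical.

```python
def weighted_squared_error(outputs, targets, positive_weight, negative_weight=1.0):
    """Mean of squared errors, each weighted according to its target class."""
    total = 0.0
    for output, target in zip(outputs, targets):
        weight = positive_weight if target == 1 else negative_weight
        total += weight * (output - target) ** 2
    return total / len(targets)


def l2_penalty(parameters, strength):
    """L2 regularization: strength times the sum of squared parameters."""
    return strength * sum(p * p for p in parameters)


def loss_index(outputs, targets, parameters, positive_weight, strength):
    """Error term plus regularization term."""
    error = weighted_squared_error(outputs, targets, positive_weight)
    return error + l2_penalty(parameters, strength)


# Hypothetical outputs, targets, and network parameters.
targets = [1, 0, 0, 0]
outputs = [0.6, 0.3, 0.1, 0.2]
parameters = [1.2, -0.4, 0.8]

# Assumed weighting: up-weight positives by the negatives/positives ratio.
positive_weight = targets.count(0) / targets.count(1)
value = loss_index(outputs, targets, parameters, positive_weight, strength=0.01)
print(f"loss index = {value:.4f}")
```

Weighting the classes this way keeps the minority class from being ignored by the error term, while the penalty discourages large parameter values.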

The optimization algorithm that we use is the quasi-Newton method, which is also the standard optimization algorithm for this type of problem.

The following chart shows how the error decreases with the iterations during the training process. The final training error is 0.3969 WSE, and the final selection error is 0.8127 WSE.

As we can see in the previous image, the curves have converged, although the selection error is greater than the training error, so we could try to continue improving the model to reduce the errors further.

5. Model selection

The objective of model selection is to find the network architecture that minimizes the error on the selection instances of the data set, that is, the architecture with the best generalization properties.

Order selection algorithms train several network architectures with different numbers of neurons and select the one with the smallest selection error. Because we removed the perceptron layer to stabilize our model, there are no hidden neurons to vary, so we cannot use this feature.

However, we will use input selection to select features in the data set that provide the best generalization capabilities.

In the following image, we see that this method reduces the selection error.

Ultimately, we obtain a training error of 0.6333 WSE and a selection error of 0.6264 WSE. The training error is higher than before, but the selection error, which measures generalization, has dropped from 0.8127 WSE. Also, we have reduced the number of inputs to only 18 features. Our network is now like this:

Our final network comprises 7 inputs corresponding to phenotypic variables and 11 genes from the panel, totaling 18 input variables. The genes are: KIT, CARD11, RB1, WT1, PLCG2, DNMT1, BRD4, PIK3R1, IRS2, SESN1, and NPM1.

6. Testing analysis

The objective of the testing analysis is to validate the generalization performance of the trained neural network. To validate a classification technique, we need to compare the values provided by this technique to the observed values. We can use the ROC curve, as it is the standard testing method for binary classification projects.

A random classifier has an area under the curve of 0.5, while a perfect classifier has a value of 1. The closer this value is to 1, the better the classifier. In this example, the AUC parameter is 0.85, indicating excellent performance.
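The area under the ROC curve equals the probability that a randomly chosen positive instance receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes it through that pairwise comparison; the scores and labels are hypothetical.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise positive/negative comparison."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0  # positive correctly ranked above negative
            elif p == n:
                wins += 0.5  # ties count as half
    return wins / (len(positives) * len(negatives))


# Hypothetical predicted probabilities and observed labels.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.7, 0.6, 0.4, 0.3, 0.1]
area = auc(scores, labels)
print(f"AUC = {area:.3f}")
```

This pairwise form is quadratic in the number of instances, which is fine for a small test set; rank-based formulas give the same value more efficiently.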

The following table contains the elements of the confusion matrix. This matrix contains the true positives, false positives, false negatives, and true negatives for the variable distant_metastasis_liver.

                 Predicted negative    Predicted positive
Real negative    344 (48.7%)           69 (9.8%)
Real positive    86 (12.2%)            208 (29.4%)

The binary classification tests measure the performance of a classifier on a problem with two classes:

  • Classification accuracy (ratio of instances correctly classified): 78%
  • Error rate (ratio of instances misclassified): 21.9%
  • Sensitivity (ratio of real positives that are predicted positive): 70.7%
  • Specificity (ratio of real negatives that are predicted negative): 83.3%
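These tests follow directly from the four cells of the confusion matrix. A short sketch of the arithmetic, using the counts from the table and the standard definitions (sensitivity as the true positive rate, specificity as the true negative rate):

```python
# Counts taken from the confusion matrix above.
true_negatives, false_positives = 344, 69
false_negatives, true_positives = 86, 208

total = true_negatives + false_positives + false_negatives + true_positives

accuracy = (true_positives + true_negatives) / total
error_rate = (false_positives + false_negatives) / total
sensitivity = true_positives / (true_positives + false_negatives)
specificity = true_negatives / (true_negatives + false_positives)

print(f"accuracy    = {accuracy:.1%}")
print(f"error rate  = {error_rate:.1%}")
print(f"sensitivity = {sensitivity:.1%}")
print(f"specificity = {specificity:.1%}")
```

With 707 test instances, this gives an accuracy of about 78.1%, an error rate of about 21.9%, a sensitivity of about 70.7% (208 of 294 real positives), and a specificity of about 83.3% (344 of 413 real negatives).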

7. Model deployment

Once we have tested the neural network’s generalization performance, we can save the model for future use in the so-called deployment mode.

In this phase, the colon cancer liver metastasis prediction model can be applied to new patients by calculating the neural network outputs based on their phenotypic variables and gene mutation profiles.

This approach allows healthcare professionals to use the trained model as a support tool for estimating the risk of liver metastasis, providing an additional layer of information to assist clinical decision-making.

The output represents the probability of developing liver metastasis and must always be interpreted by a physician in the context of the patient’s overall clinical picture.

The model represented by the neural network is shown below.

Conclusions

The machine learning model for predicting colon cancer liver metastasis, developed from the MSK-MET dataset, demonstrated strong performance (AUC = 0.85, accuracy = 78%) in classifying patients with and without liver metastasis.

The most influential variables, including metastasis count, primary site involvement, and mutations in genes such as KIT, RB1, and PIK3R1, are consistent with established clinical and molecular markers, supporting the model’s reliability.

Due to its solid generalization capacity, this colon cancer metastasis prediction model can serve as a valuable tool to assist in risk assessment, complement clinical expertise, and improve treatment planning and follow-up strategies for colorectal cancer patients.
