Metastasis prediction using machine learning

Introduction

Colorectal cancer is a leading cause of cancer death, with liver metastases strongly affecting prognosis.

Predicting metastasis is challenging due to interactions between genetic and clinical factors.

We developed a neural network integrating 492 genes and clinical variables, trained on the MSK-MET cohort (>25,000 patients), achieving 78% accuracy and 0.85 AUC.

This approach shows strong potential as a decision-support tool, and healthcare professionals can explore it using Neural Designer’s trial version for new patient data.

The following index outlines the steps for performing the analysis.

1. Model type

Problem type: Binary classification (presence or absence of liver metastasis)
Goal: Model the probability of liver metastasis based on mutational and phenotypic data to support clinical decision-making using artificial intelligence and machine learning.

2. Data set

Data source

The dataset contains 3537 instances and 510 variables for a binary classification problem (target: distant_metastasis_liver, yes or no).

Variables

The following list summarizes the variables:

Patient information

age at first metastasis diagnostic – Age at which the first metastasis was diagnosed (years).
age_at_surgical_procedure – Age of the patient at the time of surgery (years).
sex – Patient sex (e.g., male, female).
race_category – Patient race or ethnicity.

Tumor characteristics

cancer_type_detailed – Specific cancer type: colon, rectal, or colorectal.
cancer_subtype – Subtype of the tumor.
primary_tumor_site – Location of the primary tumor.
tumour_mutational_burden – Number of mutations per megabase in the tumor.
tumor_purity – Fraction of tumor cells in the sample.
fraction_genome_altered – Proportion of the genome with copy number alterations.

Metastasis information

metastasis_count – Total number of metastases observed.
metastasis_primary_site_count – Number of metastases in the primary tumor site.
microsatellite instability score – Score indicating level of microsatellite instability.
microsatellite_instability_type – Type of microsatellite instability detected.

Genomic features

gene variables – Presence or absence of mutations in each of 492 genes, capturing the tumor’s genomic profile.

Target variable

distant_metastasis_liver (yes or no) – whether liver metastasis is present or not.

Variables distributions

We can calculate variable distributions; the figure shows a pie chart comparing metastatic versus non-metastatic tumors in the dataset.

As depicted in the image, liver metastatic tumors represent 57% of the samples, while non-metastatic tumors account for 43%.

Input-target correlations

The input-target correlations indicate which factors are most strongly associated with liver metastasis and are therefore most relevant to our analysis.

Here, the variables most correlated with liver metastasis are microsatellite_instability_type, race_category, and metastasis_primary_site_count.
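As a rough illustration of how such a ranking can be produced, the sketch below computes the Pearson (point-biserial) correlation between each input and a 0/1 target and sorts the inputs by absolute value. This is an assumption about the measure: the tool may use a different correlation for binary targets. The function names and the toy values are invented for the example.

```python
import math

def pearson_correlation(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(var_x * var_y)

def rank_inputs(columns, target):
    """Sort (name, correlation) pairs by absolute correlation, strongest first."""
    scores = {name: pearson_correlation(values, target)
              for name, values in columns.items()}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)

# Toy data, invented for illustration: 1 = liver metastasis, 0 = none.
target = [1, 1, 1, 0, 0, 0]
columns = {
    "metastasis_count": [4, 3, 5, 1, 2, 1],
    "tumor_purity": [0.4, 0.6, 0.5, 0.5, 0.4, 0.6],
}
ranking = rank_inputs(columns, target)
for name, r in ranking:
    print(f"{name}: {r:+.3f}")
```

Categorical inputs such as race_category would first need a numeric encoding, which this sketch does not cover.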

3. Neural network

A neural network is an artificial intelligence model inspired by how the human brain processes information.

It is organized in layers: the input layer receives the variables, and the output layer provides the probability of belonging to a given class.

The network uses historical data to learn patterns distinguishing non-metastatic from metastatic tumors.

The network uses 497 input variables to output the probability of liver metastasis for each patient, with connections showing each variable’s contribution to the prediction.
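To make the layered structure concrete, here is a minimal forward-pass sketch. The source does not state the architecture, so the single hidden layer, the tanh activation, and all weights below are assumptions chosen for illustration; only the logistic output, which yields a probability between 0 and 1, follows directly from the problem type.

```python
import math

def logistic(z):
    """Squash a real number into the (0, 1) range."""
    return 1.0 / (1.0 + math.exp(-z))

def dense(inputs, weights, biases, activation):
    """One fully connected layer: activation(W x + b), one row of W per neuron."""
    return [activation(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

def predict_probability(inputs, hidden_w, hidden_b, output_w, output_b):
    """Input layer -> hidden layer (tanh, assumed) -> single logistic output."""
    hidden = dense(inputs, hidden_w, hidden_b, math.tanh)
    return dense(hidden, [output_w], [output_b], logistic)[0]

# Toy network with 3 inputs and 2 hidden neurons (the real model has 497 inputs).
hidden_w = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]]
hidden_b = [0.0, 0.1]
output_w = [1.2, -0.7]
output_b = -0.3

p = predict_probability([1.0, 0.0, 1.0], hidden_w, hidden_b, output_w, output_b)
print(f"probability of liver metastasis: {p:.3f}")
```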

4. Training strategy

Training a neural network uses a loss function to measure errors and an optimization algorithm to adjust the model, ensuring it learns from data while avoiding overfitting for good performance on new cases.
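The interplay of loss function and optimization algorithm can be sketched with a single logistic neuron trained by plain gradient descent on the mean squared error. Both choices are assumptions made to keep the example short; the source does not name the loss or the optimizer used at this stage. The error is tracked on a training set and on a separate selection set, since a widening gap between the two is the usual sign of overfitting.

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def output(weights, bias, xs):
    return logistic(sum(w * x for w, x in zip(weights, xs)) + bias)

def mean_squared_error(weights, bias, samples):
    return sum((output(weights, bias, xs) - y) ** 2 for xs, y in samples) / len(samples)

def train(training, selection, n_inputs, learning_rate=0.5, epochs=200):
    """Gradient descent on the training error; records both errors per epoch."""
    weights, bias = [0.0] * n_inputs, 0.0
    history = []
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_inputs, 0.0
        for xs, y in training:
            p = output(weights, bias, xs)
            # Derivative of (p - y)^2 with respect to the pre-activation.
            delta = 2.0 * (p - y) * p * (1.0 - p) / len(training)
            for i, x in enumerate(xs):
                grad_w[i] += delta * x
            grad_b += delta
        weights = [w - learning_rate * g for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b
        history.append((mean_squared_error(weights, bias, training),
                        mean_squared_error(weights, bias, selection)))
    return weights, bias, history

# Toy one-input data, invented for illustration.
training = [([0.0], 0), ([1.0], 0), ([2.0], 1), ([3.0], 1)]
selection = [([0.5], 0), ([2.5], 1)]
weights, bias, history = train(training, selection, n_inputs=1)
print("first epoch (training, selection):", history[0])
print("last epoch  (training, selection):", history[-1])
```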

After training, the training error was 0.3969 and the validation error was 0.8127. The validation error is roughly twice the training error, which indicates overfitting.

Due to the gap between training and validation errors, input selection will be applied to remove irrelevant variables and improve generalization.

5. Model selection

Because of the high number of inputs and the relatively low evaluation metrics, an input selection process was performed.

The selection method trains several networks with different subsets of inputs and identifies the configuration that achieves the lowest selection error.

After performing input selection, the model was reduced to 34 inputs (by removing less relevant features), which lowered the difference between the training and selection errors, and simplified the network architecture.

As shown in the chart, both training error and selection error decrease as the number of inputs is optimized, resulting in a more efficient network with improved performance.
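One common way to carry out this kind of search is to grow the input set one ranked variable at a time and keep the subset with the lowest selection error. The sketch below shows that scheme; it is not necessarily the method used here. The `evaluate` callback is hypothetical: it stands in for training a network on the given inputs and returning its training and selection errors. The toy error values are invented.

```python
def select_inputs(ranked_inputs, evaluate):
    """Try the top-1, top-2, ... ranked inputs; keep the subset with the lowest selection error."""
    best_subset, best_error = None, float("inf")
    for k in range(1, len(ranked_inputs) + 1):
        subset = ranked_inputs[:k]
        _, selection_error = evaluate(subset)
        if selection_error < best_error:
            best_subset, best_error = subset, selection_error
    return best_subset, best_error

# Hypothetical stand-in for "train a network on these inputs and report errors".
# Training error keeps falling as inputs are added; selection error does not.
toy_errors = {1: (0.95, 0.90), 2: (0.80, 0.70), 3: (0.60, 0.80)}

def toy_evaluate(subset):
    return toy_errors[len(subset)]

ranked = ["metastasis_primary_site_count", "metastasis_count", "tumor_purity"]
best_subset, best_error = select_inputs(ranked, toy_evaluate)
print(best_subset, best_error)
```

In the toy run the third input lowers the training error but raises the selection error, so it is left out.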

The new model was then retrained. Its training and selection errors decreased steadily, reaching 1.036 and 0.997 weighted squared error (WSE) respectively; the small gap between them suggests that the model generalizes well to new patients.

6. Testing analysis

The testing analysis assesses how well the trained neural network generalizes to data it has not seen during training.

ROC curve

The ROC curve is a standard tool for evaluating classification models, showing how well the model distinguishes between two classes by comparing predicted outcomes with actual results, such as the presence or absence of liver metastasis.

The area under the ROC curve (AUC) summarizes this ability in a single number: a random classifier scores 0.5, while a perfect classifier scores 1.

The AUC obtained is 0.85, showing that the model discriminates well between patients with liver metastasis and those without it.
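The AUC has a simple interpretation: it is the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The sketch below computes it directly from that definition. The scores and labels are invented for illustration and are not the study's test data.

```python
def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))

# Toy predicted probabilities and true labels (1 = liver metastasis).
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.2]
labels = [1, 1, 1, 0, 0, 0]
toy_auc = auc(scores, labels)
print(f"AUC = {toy_auc:.3f}")
```

This pairwise form is quadratic in the number of samples, which is fine for a few hundred test cases; a sorting-based version would be preferable for large sets.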

Confusion matrix

The confusion matrix shows the model’s performance by comparing predicted and actual outcomes. It includes:

True positives – patients correctly predicted to have liver metastasis
False positives – patients incorrectly predicted to have liver metastasis
False negatives – patients incorrectly predicted to have no liver metastasis
True negatives – patients correctly predicted to have no liver metastasis

For a decision threshold of 0.5, the confusion matrix was:

	Predicted positive	Predicted negative
Real positive	329	73
Real negative	82	223
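A table like this is obtained by comparing each predicted probability with the decision threshold and counting the four possible outcomes. A minimal sketch follows, with invented probabilities and labels; the function name is illustrative.

```python
def confusion_matrix(probabilities, labels, threshold=0.5):
    """Count true/false positives/negatives for a given decision threshold."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for p, y in zip(probabilities, labels):
        predicted = 1 if p >= threshold else 0
        if predicted == 1 and y == 1:
            counts["tp"] += 1
        elif predicted == 1 and y == 0:
            counts["fp"] += 1
        elif predicted == 0 and y == 1:
            counts["fn"] += 1
        else:
            counts["tn"] += 1
    return counts

# Toy predicted probabilities and true labels (1 = liver metastasis).
probabilities = [0.9, 0.7, 0.4, 0.6, 0.2, 0.1]
labels = [1, 1, 1, 0, 0, 0]
at_half = confusion_matrix(probabilities, labels, threshold=0.5)
at_low = confusion_matrix(probabilities, labels, threshold=0.3)
print(at_half)
print(at_low)
```

Lowering the threshold turns borderline cases into positive predictions, which trades false negatives for false positives.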

Binary classification

Using the confusion matrix above (decision threshold of 0.5), the performance of this binary classification model is summarized with standard measures.

Accuracy: 78.1% of patient outcomes were correctly predicted.
Error rate: 21.9% of cases were misclassified.
Sensitivity: 81.8% of patients with liver metastasis were correctly identified.
Specificity: 73.1% of patients without liver metastasis were correctly identified.
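These four figures follow directly from the counts in the confusion matrix (329 true positives, 73 false negatives, 82 false positives, 223 true negatives). The short check below reproduces the arithmetic; the function name is illustrative.

```python
def binary_metrics(tp, fn, fp, tn):
    """Standard binary classification measures, returned as percentages."""
    total = tp + fn + fp + tn
    return {
        "accuracy": 100.0 * (tp + tn) / total,
        "error_rate": 100.0 * (fp + fn) / total,
        "sensitivity": 100.0 * tp / (tp + fn),  # share of real positives found
        "specificity": 100.0 * tn / (tn + fp),  # share of real negatives found
    }

# Counts taken from the confusion matrix reported in the text.
metrics = binary_metrics(tp=329, fn=73, fp=82, tn=223)
for name, value in metrics.items():
    print(f"{name}: {value:.1f}%")
# accuracy: 78.1%, error_rate: 21.9%, sensitivity: 81.8%, specificity: 73.1%
```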

These measures indicate that the model is effective at predicting liver metastasis.

7. Model deployment

Once validated, the neural network can be saved for deployment, allowing clinicians to use patients’ clinical and genomic data to predict the probability of liver metastasis.

Neural Designer automatically exports the trained model, enabling seamless integration as a diagnostic support tool.

Conclusions

The machine learning model predicts liver metastasis in colorectal cancer patients with 78% accuracy and an AUC of 0.85.

Key factors (metastasis count, primary site involvement, and mutations in KIT, RB1, and PIK3R1) align with known clinical markers.

Its strong generalization makes it a valuable tool to support risk assessment, clinical decision-making, and treatment planning.

References

The data for this problem were taken from the MSK-MET (Memorial Sloan Kettering – Metastatic Events and Tropisms) dataset in the cBioPortal repository.