Introduction
Colorectal cancer is a leading cause of cancer death, with liver metastases strongly affecting prognosis.
Predicting metastasis is challenging due to interactions between genetic and clinical factors.
We developed a neural network integrating 492 genes and clinical variables, trained on the MSK-MET cohort (>25,000 patients), achieving 78% accuracy and 0.85 AUC.
This approach shows strong potential as a decision-support tool, and healthcare professionals can explore it using Neural Designer’s trial version for new patient data.
Contents
The following index outlines the steps for performing the analysis.
1. Model type
- Problem type: Binary classification (presence or absence of liver metastasis)
- Goal: Model the probability of liver metastasis based on mutational and phenotypic data to support clinical decision-making using artificial intelligence and machine learning.
2. Data set
Data source
The dataset (3537 instances, 510 variables) for a binary classification problem (target: distant_metastasis_liver, [yes or no]).
Variables
The following list summarizes the variables information:
Patient information
age at first metastasis diagnostic – Age at which the first metastasis was diagnosed (years).
age_at_surgical_procedure – Age of the patient at the time of surgery (years).
sex – Patient sex (e.g., male, female).
race_category – Patient race or ethnicity.
Tumor characteristics
cancer_type_detailed – Specific cancer type: colon, rectal, or colorectal.
cancer_subtype – Subtype of the tumor.
primary_tumor_site – Location of the primary tumor.
tumour_mutational_burden – Number of mutations per megabase in the tumor.
tumor_purity – Fraction of tumor cells in the sample.
fraction_genome_altered – Proportion of the genome with copy number alterations.
Metastasis information
metastasis_count – Total number of metastases observed.
metastasis primary site count – Number of metastases in the primary tumor site.
microsatellite instability score – Score indicating level of microsatellite instability.
microsatellite instability type – Type of microsatellite instability detected.
Genomic features
- gene variables – Presence or absence of mutations in each of 492 genes, capturing the tumor’s genomic profile.
Target variable
distant_metastasis_liver (yes or no) – whether liver metastasis is present or not.
Variables distributions
We can calculate variable distributions; the figure shows a pie chart comparing metastatic versus non-metastatic tumors in the dataset.

As depicted in the image, liver metastatic tumors represent 57% of the samples, while non-metastatic tumors account for 43%.
Input-target correlations
The inputs-targets correlations indicate which factors most influence whether a tumor develops liver metastases and, therefore, are more relevant to our analysis.

Here, the most correlated variables with liver metastases are microsatellite_instability_type, race_category, and metastasis_primary_site_count.
3. Neural network
A neural network is an artificial intelligence model inspired by how the human brain processes information.
It is organized in layers: the input layer receives the variables, and the output layer provides the probability of belonging to a given class.
The network uses historical data to learn patterns distinguishing benign from metastasic tumors.

The network uses 497 input variables to output the probability of liver metastasis for each patient, with connections showing each variable’s contribution to the prediction.
4. Training strategy
Training a neural network uses a loss function to measure errors and an optimization algorithm to adjust the model, ensuring it learns from data while avoiding overfitting for good performance on new cases.

The network was trained to minimize errors while avoiding overfitting (training error 0.3969, validation error 0.8127).
Due to the gap between training and validation errors, input selection will be applied to remove irrelevant variables and improve generalization.
5. Model selection
Due to the high number of input neurons and relatively low evaluation metrics, a neuron selection process was performed.
The selection method trains several network architectures with varying numbers of neurons and identifies the configuration that achieves the lowest selection error.

After performing input selection, the model was reduced to 34 inputs (by removing less relevant features), which lowered the difference between the training and selection errors, and simplified the network architecture.
As shown in the chart, both training error and selection error decrease as the number of inputs is optimized, resulting in a more efficient network with improved performance.

The new model was trained for accuracy and stability, with steadily decreasing training and selection errors (1.036 and 0.997 WSE), demonstrating effective learning and strong generalization to new patients.
6. Testing analysis
The testing analysis aims to validate the performance of the generalization properties of the trained neural network.
ROC curve
The ROC curve is a standard tool for evaluating classification models, showing how well the model distinguishes between two classes by comparing predicted outcomes with actual results, such as the presence or absence of liver metastasis.
A random classifier scores 0.5, while a perfect classifier scores 1.

The AUC obtained is 0.85, showing that the model performs exceptionally well at distinguishing between patients with metastasis and those without it.
Confusion matrix
The confusion matrix shows the model’s performance by comparing predicted and actual outcomes. It includes:
True positives – patients correctly predicted as deceased
False positives – patients incorrectly predicted as deceased
False negatives – patients incorrectly predicted as surviving
True negatives – patients correctly predicted as surviving
For a decision threshold of 0.5, the confusion matrix was:
| Predicted positive | Predicted negative | |
|---|---|---|
| Real positive | 329 | 73 |
| Real negative | 82 | 223 |
Binary classification
Using a classification threshold of 0.3, the performance of this binary classification model is summarized with standard measures.
Accuracy: 78.1% of patient outcomes were correctly predicted.
Error rate: 21.9% of cases were misclassified.
Sensitivity: 81.8% of deceased patients were correctly identified.
Specificity: 73.1% of surviving patients were correctly identified.
These measures indicate that the model is highly effective at predicting patient survival outcomes.
7. Model deployment
Once validated, the neural network can be saved for deployment, allowing clinicians to use patients’ clinical data to predict breast cancer mortality.
Neural Designer automatically exports the trained model, enabling seamless integration as a diagnostic support tool.
Healthcare professionals can explore the model simulator by clicking the button below.
Conclusions
The machine learning model predicts liver metastasis in colon cancer patients with 78% accuracy and 0.85 AUC.
Key factors—metastasis count, primary site involvement, and mutations in KIT, RB1, and PIK3R1—align with known clinical markers.
Its strong generalization makes it a valuable tool to support risk assessment, clinical decision-making, and treatment planning.
References
- The data for this problem has been taken from the cBioportal Repository MSK-MET (Memorial Sloan Kettering – Metastatic Events and Tropisms) dataset.



