Introduction
Lung cancer is the leading cause of cancer-related deaths worldwide, and machine learning models for lung cancer are increasingly being developed to improve early detection and patient outcomes.
Unlike other cancers, lung cancer often remains asymptomatic until advanced stages, when treatment options become less effective. This makes the development of screening and decision-support tools a priority in healthcare.
In this study, we implement a lung cancer machine learning model that learns from patient lifestyle factors, symptoms, and medical conditions to estimate the probability of lung cancer.
Using the lung-cancer dataset (309 instances, 16 variables), the model achieved high predictive performance, with an accuracy of 96.7% and an area under the ROC curve (AUC) of 0.975. These results show the potential of artificial intelligence to support prevention, early detection, and risk management in clinical practice.
Healthcare professionals can test this methodology with Neural Designer’s trial version.
1. Model type
This is a classification project since the variable to be predicted is binary (cancer or not).
The goal here is to model the probability of having lung cancer as a function of patient lifestyle factors, symptoms, and medical conditions.
2. Data set
Data source
The dataset lung-cancer.csv contains 309 rows (one per patient) and 16 columns (variables) describing the study participants and their cancer risk factors.
Variables
The following list summarizes the variables’ information:
Patient characteristics
Gender (1 = male, 2 = female) – Lung cancer may affect males and females differently.
Age (years) – Lung cancer mainly occurs in older people.
Lifestyle factors
smoking (0 = no, 1 = yes) – Cigarette smoking is the number one risk factor for lung cancer.
alcohol_consuming (0 = no, 1 = yes) – Regular alcohol consumption may harm lung tissue.
peer_pressure (0 = no, 1 = yes) – Stress or social influence may be associated with lifestyle risk factors.
Symptoms and conditions
yellow_fingers (0 = no, 1 = yes) – Often linked to smoking; may also indicate other health issues.
anxiety (0 = no, 1 = yes) – Anxiety disorders can share symptoms with lung cancer (e.g., breathing difficulties).
chronic_disease (0 = no, 1 = yes) – Includes COPD, asthma, or other long-term lung conditions.
fatigue (0 = no, 1 = yes) – Early symptom often associated with respiratory problems.
allergy (0 = no, 1 = yes) – Some allergies share respiratory symptoms with lung cancer.
wheezing (0 = no, 1 = yes) – Whistling sound in breathing due to blocked or narrowed airways.
coughing (0 = no, 1 = yes) – A persistent cough is a common symptom of lung cancer.
shortness_of_breath (0 = no, 1 = yes) – Also called dyspnea; difficulty in breathing.
swallowing_difficulty (0 = no, 1 = yes) – Dysphagia, often caused by tumor growth.
chest_pain (0 = no, 1 = yes) – Pain in the chest when breathing deeply, coughing, or laughing.
Target variable
lung_cancer (0 = no, 1 = yes) – Presence or absence of lung cancer.
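The encoding listed above can be checked before any modeling. The following is a minimal sketch, assuming the CSV uses a comma delimiter and a header row with the column names exactly as written in this list; the two sample rows are made up for illustration and are not taken from lung-cancer.csv.

```python
import csv
import io

# Column names as listed in this section (assumed to match the file header).
COLUMNS = [
    "Gender", "Age", "smoking", "alcohol_consuming", "peer_pressure",
    "yellow_fingers", "anxiety", "chronic_disease", "fatigue", "allergy",
    "wheezing", "coughing", "shortness_of_breath", "swallowing_difficulty",
    "chest_pain", "lung_cancer",
]

def validate_row(row):
    """Return a list of encoding problems found in one patient row."""
    problems = []
    if int(row["Gender"]) not in (1, 2):
        problems.append("Gender must be 1 or 2")
    if int(row["Age"]) <= 0:
        problems.append("Age must be positive")
    # Every remaining column, including the target, is coded 0 = no, 1 = yes.
    for name in COLUMNS[2:]:
        if int(row[name]) not in (0, 1):
            problems.append(f"{name} must be 0 or 1")
    return problems

# Two made-up rows: the second one has an invalid smoking code.
sample_csv = (
    ",".join(COLUMNS) + "\n"
    "1,69,1,1,0,1,1,0,1,0,1,1,1,1,1,1\n"
    "2,58,2,0,0,0,0,1,0,0,0,0,0,0,0,0\n"
)

reader = csv.DictReader(io.StringIO(sample_csv))
results = [validate_row(row) for row in reader]
for number, problems in enumerate(results, start=1):
    print(number, problems or "ok")
```

To run the same check on the real file, the `io.StringIO(sample_csv)` argument would be replaced by an opened file handle.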
Instances
The dataset’s instances (one per patient, including input and target variables) are split by default into training (60%), validation (20%, called selection in the sections below), and testing (20%) subsets. These proportions can be adjusted as needed.
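A random split of this kind can be sketched as follows. The function name, the seed, and the rounding rule are assumptions for illustration; how Neural Designer rounds the shares is not stated here. Flooring the two 20% shares of 309 instances gives 61 testing instances, which agrees with the 61 instances in the confusion matrix reported in the testing analysis.

```python
import random

def split_indices(n_instances, selection_share=0.2, testing_share=0.2, seed=0):
    """Shuffle instance indices and split them into training, selection, and testing."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    # Flooring the shares is an assumption; the remainder goes to training.
    n_selection = int(n_instances * selection_share)
    n_testing = int(n_instances * testing_share)
    n_training = n_instances - n_selection - n_testing
    training = indices[:n_training]
    selection = indices[n_training:n_training + n_selection]
    testing = indices[n_training + n_selection:]
    return training, selection, testing

training, selection, testing = split_indices(309)
print(len(training), len(selection), len(testing))
```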
Variables distribution
Once we have configured the lung cancer dataset, we can calculate the distribution of the variables. The following figure depicts the number of cancer patients and those who do not have cancer.

As we can see in the previous chart, 87.38% of the patients have lung cancer, corresponding to 270 people.
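This class balance is useful context for the accuracy reported later. A short calculation from the two counts given above shows the share of positive cases and the accuracy that a trivial classifier, one that always predicts cancer, would reach on the full dataset.

```python
n_patients = 309
n_positive = 270
n_negative = n_patients - n_positive

positive_share = n_positive / n_patients
# Accuracy of always predicting the most frequent class.
majority_baseline_accuracy = max(n_positive, n_negative) / n_patients

print(f"positive share: {positive_share:.2%}")
print(f"majority-class baseline accuracy: {majority_baseline_accuracy:.2%}")
```

Because the dataset is this unbalanced, sensitivity, specificity, and AUC are more informative than accuracy alone.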
Inputs-targets correlations
The following figure shows the correlation of each input with the target. This helps us understand how strongly each input is associated with the presence of lung cancer.

In the lung cancer model, the variables most correlated with the target are alcohol consumption, allergy, swallowing difficulty, coughing, and wheezing. For example, alcohol consumption is positively correlated with the target: in this dataset, patients who report drinking alcohol have lung cancer more often, although a correlation alone does not establish a causal effect.
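One simple way to quantify such an association between a 0/1 input and the 0/1 target is the Pearson correlation coefficient. This is a sketch of that measure only; the correlation type actually used by the tool is not stated here, and the eight patients below are made up for illustration.

```python
import math

def pearson_correlation(x, y):
    """Pearson correlation between two equally long numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)

# Made-up values for eight hypothetical patients, not taken from lung-cancer.csv.
alcohol_consuming = [1, 1, 0, 1, 0, 1, 0, 0]
lung_cancer = [1, 1, 0, 1, 1, 1, 0, 0]

correlation = pearson_correlation(alcohol_consuming, lung_cancer)
print(f"correlation: {correlation:.3f}")
```

A value near +1 means the input and the target tend to be 1 together, a value near -1 means they tend to disagree, and a value near 0 means no linear association.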
3. Neural network
The model uses a neural network, a type of artificial intelligence that learns to recognize complex patterns in clinical data. Patient values, such as age, lifestyle factors, and reported symptoms, are normalized to ensure comparability.
The network includes scaling layers and probabilistic layers. Adding a perceptron layer increased overfitting, so it was removed to improve generalization.
The network analyzes the variables together, identifying combinations and relationships associated with a higher likelihood of lung cancer.
The result is presented as the probability that a patient has lung cancer, providing healthcare professionals with an objective measure to support decision-making. This helps prioritize additional tests or early treatment and complements traditional diagnostic methods.
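A network made of only a scaling layer and a probabilistic layer can be sketched as a scaled weighted sum passed through a logistic function. The sketch below assumes min-max scaling and a logistic output; the three inputs, their ranges, and the parameter values are hypothetical and are not the trained model.

```python
import math

def scale_min_max(value, minimum, maximum):
    """Scaling layer: map a raw input to the range [0, 1]."""
    return (value - minimum) / (maximum - minimum)

def predict_probability(inputs, minimums, maximums, weights, bias):
    """Probabilistic layer: logistic function of the weighted, scaled inputs."""
    scaled = [
        scale_min_max(value, low, high)
        for value, low, high in zip(inputs, minimums, maximums)
    ]
    combination = bias + sum(w * s for w, s in zip(weights, scaled))
    return 1.0 / (1.0 + math.exp(-combination))

# Hypothetical three-input example: Age, smoking, coughing.
minimums = [20.0, 0.0, 0.0]   # assumed ranges, not computed from the dataset
maximums = [90.0, 1.0, 1.0]
weights = [0.8, 1.2, 1.5]     # illustrative values, not the trained parameters
bias = -1.0

non_smoker = predict_probability([65.0, 0.0, 1.0], minimums, maximums, weights, bias)
smoker = predict_probability([65.0, 1.0, 1.0], minimums, maximums, weights, bias)
print(f"non-smoker: {non_smoker:.3f}, smoker: {smoker:.3f}")
```

With a positive weight on smoking, switching that input from 0 to 1 raises the predicted probability, which is the kind of relationship the trained network encodes for all 15 inputs at once.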

4. Training strategy
To train the neural network for lung cancer prediction, we defined a loss function and an optimization algorithm. The loss function is the mean squared error with L2 regularization, which allows the model to learn from mistakes while preventing overfitting.
For optimization, we used gradient descent, which updates the network parameters in the direction of the negative gradient of the loss function to achieve optimal performance.
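The combination of a regularized mean squared error and gradient descent can be sketched on a tiny logistic model. Everything below is illustrative: the data are made up, the learning rate and regularization weight are arbitrary, and leaving the bias unregularized is an assumption.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def output(weights, bias, row):
    return sigmoid(bias + sum(w * x for w, x in zip(weights, row)))

def loss(weights, bias, inputs, targets, l2_weight):
    """Mean squared error plus an L2 penalty on the weights."""
    n = len(inputs)
    mse = sum((output(weights, bias, row) - t) ** 2 for row, t in zip(inputs, targets)) / n
    return mse + l2_weight * sum(w * w for w in weights)

def gradient_descent_step(weights, bias, inputs, targets, l2_weight, learning_rate):
    """Move the parameters one step along the negative gradient of the loss."""
    n = len(inputs)
    grad_w = [2.0 * l2_weight * w for w in weights]  # gradient of the L2 term
    grad_b = 0.0
    for row, t in zip(inputs, targets):
        p = output(weights, bias, row)
        common = 2.0 * (p - t) * p * (1.0 - p) / n   # chain rule through the sigmoid
        for j, x in enumerate(row):
            grad_w[j] += common * x
        grad_b += common
    new_weights = [w - learning_rate * g for w, g in zip(weights, grad_w)]
    return new_weights, bias - learning_rate * grad_b

# Made-up data: two binary inputs, target equal to the first input.
inputs = [[1, 1], [1, 0], [0, 1], [0, 0], [1, 1], [0, 0]]
targets = [1, 1, 0, 0, 1, 0]

weights, bias = [0.0, 0.0], 0.0
initial_loss = loss(weights, bias, inputs, targets, l2_weight=0.001)
for epoch in range(200):
    weights, bias = gradient_descent_step(
        weights, bias, inputs, targets, l2_weight=0.001, learning_rate=0.5
    )
final_loss = loss(weights, bias, inputs, targets, l2_weight=0.001)
print(f"initial loss: {initial_loss:.4f}, final loss: {final_loss:.4f}")
```

The L2 term pulls the weights toward zero, which limits how closely the model can fit the training instances and is what the text above refers to as preventing overfitting.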
During training, the error decreased steadily, balancing between fitting the known data and generalizing to new cases.
The chart below illustrates how training and selection errors progressively decreased over the course of the training epochs, reflecting the model’s increasing accuracy in predicting lung cancer.

The final results are: training error = 0.0503 MSE and selection error = 0.0784 MSE.
5. Model selection
The objective of model selection is to find the network architecture with the best generalization properties, that is, the one that minimizes the error on the selection instances of the data set.
Since the selection error we have achieved so far is minimal, we don’t need to apply order selection or input selection here.
6. Testing analysis
The next step is to evaluate the performance of the trained neural network through an exhaustive testing analysis. The standard way to do this is to compare the outputs of the neural network against data that has never been seen before, the testing instances.
A standard method for measuring generalization performance is the ROC curve. This is a visual aid to study the classifier’s discrimination capacity. One of the parameters obtained from this chart is the area under the curve (AUC). The closer the area under the curve is to 1, the better the classifier.
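The AUC has a direct interpretation: it is the probability that a randomly chosen positive instance receives a higher score than a randomly chosen negative one. The sketch below computes it that way by comparing all positive-negative pairs; the six targets and scores are made up for illustration.

```python
def area_under_roc_curve(targets, scores):
    """AUC as the fraction of positive-negative pairs ranked correctly (ties count half)."""
    positives = [s for t, s in zip(targets, scores) if t == 1]
    negatives = [s for t, s in zip(targets, scores) if t == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Made-up test targets and model outputs.
targets = [1, 1, 1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

auc = area_under_roc_curve(targets, scores)
print(f"AUC = {auc}")
```

In this toy example, seven of the eight positive-negative pairs are ordered correctly, so the AUC is 0.875.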

In this case, the lung cancer prediction model achieved a high value: AUC = 0.975.
Neural Designer computes the optimal threshold by finding the point on the ROC curve that is nearest to the upper left corner. The decision threshold corresponding to that point is called the optimal threshold; in this study, it has a value of 0.975.
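The nearest-to-corner rule can be sketched as follows. The function names are hypothetical, the data are made up, and treating a score equal to the threshold as a positive prediction is an assumption; this is not Neural Designer's own code.

```python
import math

def roc_point(targets, scores, threshold):
    """Return (false positive rate, true positive rate) for one decision threshold."""
    tp = sum(1 for t, s in zip(targets, scores) if t == 1 and s >= threshold)
    fn = sum(1 for t, s in zip(targets, scores) if t == 1 and s < threshold)
    fp = sum(1 for t, s in zip(targets, scores) if t == 0 and s >= threshold)
    tn = sum(1 for t, s in zip(targets, scores) if t == 0 and s < threshold)
    return fp / (fp + tn), tp / (tp + fn)

def closest_to_corner_threshold(targets, scores):
    """Pick the threshold whose ROC point is nearest to (0, 1), the upper left corner."""
    best_threshold, best_distance = None, math.inf
    for threshold in sorted(set(scores)):
        fpr, tpr = roc_point(targets, scores, threshold)
        distance = math.hypot(fpr, 1.0 - tpr)
        if distance < best_distance:
            best_threshold, best_distance = threshold, distance
    return best_threshold

# Made-up test targets and model outputs.
targets = [1, 1, 1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

optimal_threshold = closest_to_corner_threshold(targets, scores)
print(f"optimal threshold: {optimal_threshold}")
```

Here the threshold 0.7 is selected: it produces no false positives while still detecting three of the four positives, which puts its ROC point closest to the corner.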
The binary classification tests and the confusion matrix provide valuable insights into the performance of our predictive model. Below, both are displayed for the same decision threshold of 0.5.
| | Predicted positive | Predicted negative |
|---|---|---|
| Real positive | 56 | 2 |
| Real negative | 0 | 3 |
The number of correctly classified instances is 59, and the number of misclassified instances is 2. From this table, we can calculate the binary classification tests.
Metrics
- Classification accuracy: 96.7% (proportion of correctly classified samples).
- Error rate: 3.3% (proportion of misclassified samples).
- Sensitivity: 100% (proportion of real positives that the model predicts as positive).
- Specificity: 60% (proportion of real negatives that the model predicts as negative).
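These tests follow directly from the four cells of the confusion matrix. The sketch below uses the counts from the table read as printed (rows are real classes, columns are predicted classes). Read that way, the accuracy and error rate match the listed values, while sensitivity comes out at about 96.6% and specificity at 100%; the listed 100% and 60% would correspond to the two misclassified instances being false positives instead, so the orientation of the table is worth checking against the tool's output.

```python
def binary_classification_tests(tp, fn, fp, tn):
    """Compute the standard tests from the cells of a confusion matrix."""
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,
        "error_rate": (fp + fn) / total,
        "sensitivity": tp / (tp + fn),  # real positives predicted as positive
        "specificity": tn / (tn + fp),  # real negatives predicted as negative
    }

# Counts taken from the table above, read row by row as printed.
tests = binary_classification_tests(tp=56, fn=2, fp=0, tn=3)
for name, value in tests.items():
    print(f"{name}: {value:.1%}")

# Same totals with the two errors counted as false positives instead.
swapped = binary_classification_tests(tp=56, fn=0, fp=2, tn=3)
```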
The classification accuracy is high (96.7%). For comparison, the paper cited in the references section reports an accuracy of 94.3%, which suggests that the model built with Neural Designer in this lung cancer machine learning study improves on that result.
7. Model deployment
Once the neural network’s generalization performance has been tested, the model can be saved and used in deployment mode.
In this phase, the lung cancer machine learning model can be applied to new patients by calculating the neural network outputs based on their symptoms, lifestyle factors, and medical conditions. This allows clinicians and researchers to estimate the probability of lung cancer for any individual, providing a quick and reliable risk assessment.
By using the trained model as a diagnostic support tool, healthcare professionals gain an additional layer of confidence in their decision-making, complementing traditional screening and clinical evaluation methods.
The neural network representing the lung cancer prediction model can be exported in multiple programming languages. For example, a Python export enables straightforward application of the model to new datasets.
Conclusions
The lung cancer machine learning model developed in this study demonstrated high performance (AUC = 0.975, accuracy = 96.7%) in predicting the probability of lung cancer.
The variables most correlated with the target, including alcohol consumption, allergy, swallowing difficulty, coughing, and wheezing, are largely consistent with known clinical risk factors and symptoms, which supports the model’s interpretability and reliability.
Given its strong predictive capacity, this lung cancer machine learning model can serve as a valuable tool to support early detection, guide prevention strategies, and complement the work of healthcare professionals in clinical decision-making.
References
- Oliver, A. S., Jayasankar, T., Sekar, K. R., Devi, T. K., Shalini, R. et al. (2021). Early Detection of Lung Carcinoma Using Machine Learning. Intelligent Automation & Soft Computing, 30(3), 755-770.
- Dataset from: Kaggle: Lung Cancer.
- Dataset from: data.world: Survey Lung Cancer.