In this example, we build a machine learning model to screen patients at risk for lung cancer.
Lung cancer is the leading cause of cancer deaths worldwide. The main goal of this study is to build a model that serves as a lung cancer risk assessment and decision support tool, facilitating prevention and screening discussions between people and their doctors. Completing a questionnaire about different cancer risk factors and pathologies takes approximately 2 minutes, and the result gives anyone an early, cost-efficient warning of possible cancer. If the final risk value is high, the person can then go to a hospital for a more exhaustive clinical analysis. Lung cancer often does not show up in tests until later stages, when treatments become ineffective or have lower success rates. That is why it should be discovered early.
1. Application type
This is a classification project since the variable to be predicted is binary (cancer or not).
The goal here is to model the probability of having lung cancer as a function of different patients’ genetic and non-genetic factors.
2. Data set
We obtained the data set from Kaggle and data.world; the links are in the references section below.
It is composed of four concepts:
- Data source.
- Variables.
- Instances.
- Missing values.
Data source
The data file lung-cancer.csv contains the information used to create the model. It consists of 309 rows and 16 columns. The columns represent different cancer risk factors, while the rows represent the study’s patients.
Variables
This data set uses the following 16 variables:
- gender: It could be a risk factor. Lung cancer could affect more males than females or vice versa. (1 = male; 2 = female)
- age: Age is a risk factor. Lung cancer mainly occurs in older people. (years)
- smoking: Whether the patient smokes or not. Cigarette smoking is the number one risk factor for lung cancer. (1 = yes; 0 = no)
- yellow_fingers: If lung cancer has spread to the pancreas or liver, it may turn your skin or the whites of your eyes yellow. However, yellow fingers are mainly linked to smoking. (1 = yes; 0 = no)
- anxiety: An anxiety disorder is a type of mental health condition. It can affect the way a person breathes, so that they do not get enough air. That is why it can share symptoms with lung cancer, mainly breathing problems. (1 = yes; 0 = no)
- peer_pressure: It is related to anxiety, stress, and its symptoms. (1 = yes; 0 = no)
- chronic_disease: Chronic obstructive pulmonary disease (COPD) causes narrowing of the airways in the lung, making it difficult to breathe. It includes several long-term lung conditions, such as emphysema, chronic bronchitis, and chronic asthma. The main cause of COPD is smoking. (1 = yes; 0 = no)
- fatigue: In some people, fatigue is an early symptom of the presence of lung cancer. It can be derived from having respiratory problems. (1 = yes; 0 = no)
- allergy: As with anxiety, some allergies could share symptoms with lung cancer. (1 = yes; 0 = no)
- wheezing: When your airways become constricted, blocked, or inflamed, your lungs may produce a wheezing or whistling sound when you breathe. This can have multiple causes, some benign and easily treatable. However, wheezing is also a symptom of lung cancer. (1 = yes; 0 = no)
- alcohol_consuming: Whether a person drinks alcohol regularly or not. Alcohol abuse can cause inflammation and harm cells in the upper and lower parts of the airway. It can start to harm the lungs in as little as six weeks. (1 = yes; 0 = no)
- coughing: Some forms of lung cancer more often have a cough as a symptom because the cancerous cells obstruct the airways in your lungs. (1 = yes; 0 = no)
- shortness_of_breath: Also known medically as dyspnea, it is a distressing symptom of lung cancer that causes difficulty catching your breath. (1 = yes; 0 = no)
- swallowing_difficulty: Many patients with lung cancer experience difficulty in swallowing. The medical term for this symptom is dysphagia, which refers to any swallowing dysfunction. Dysphagia can occur for several reasons, including the direct impact of tumors. (1 = yes; 0 = no)
- chest_pain: When a lung tumor causes tightness in the chest or presses on nerves, you may feel pain, especially when breathing deeply, coughing, or laughing. (1 = yes; 0 = no)
- lung_cancer: Having lung cancer or not. (1 = yes; 0 = no)
Instances
The instances are randomly divided into training, selection, and testing subsets, which contain approximately 60%, 20%, and 20% of the instances, respectively. More specifically, 187 samples are used here for training, 61 for selection, and 61 for testing.
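The split sizes above can be reproduced with a short sketch. It is only an illustration: the function name `split_indices`, the random seed, and the rule of truncating the 20% subsets are assumptions, not a description of how Neural Designer performs the split internally.

```python
import numpy as np

def split_indices(n_instances, selection_frac=0.2, testing_frac=0.2, seed=0):
    """Randomly split instance indices into training, selection, and testing."""
    rng = np.random.default_rng(seed)        # seed is an arbitrary assumption
    shuffled = rng.permutation(n_instances)  # random order of 0..n_instances-1

    # Truncate the two 20% subsets; training takes the remainder.
    n_selection = int(selection_frac * n_instances)
    n_testing = int(testing_frac * n_instances)
    n_training = n_instances - n_selection - n_testing

    training = shuffled[:n_training]
    selection = shuffled[n_training:n_training + n_selection]
    testing = shuffled[n_training + n_selection:]
    return training, selection, testing

training, selection, testing = split_indices(309)
print(len(training), len(selection), len(testing))  # 187 61 61
```

With 309 instances, truncating 20% gives 61 instances for selection and 61 for testing, which leaves 187 for training.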
Variables distribution
Once we have configured the data set, we can calculate the data distribution of the variables. The following figure depicts the number of patients with and without lung cancer.
As we can see in the previous chart, 87.38% of the patients have lung cancer, corresponding to 270 people.
Inputs-targets correlations
The following figure depicts the correlations of all the inputs with the target. This helps us see the influence of the different inputs on the target.
The most correlated variables are alcohol_consuming, allergy, swallowing_difficulty, coughing, and wheezing. For example, alcohol_consuming is positively correlated with the target, which is consistent with alcohol consumption increasing lung cancer risk.
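As a rough illustration of how input-target correlations can be computed, the sketch below uses the Pearson coefficient on a toy matrix. The Pearson coefficient is an assumption here, since the text does not say which correlation measure the figure uses, and both the toy values and the names `input_a` and `input_b` are made up rather than taken from the data set.

```python
import numpy as np

def input_target_correlations(inputs, target, names):
    """Pearson correlation of each input column with the target,
    sorted from the strongest to the weakest absolute correlation."""
    inputs = np.asarray(inputs, dtype=float)
    target = np.asarray(target, dtype=float)
    correlations = {
        name: float(np.corrcoef(column, target)[0, 1])
        for name, column in zip(names, inputs.T)
    }
    return dict(sorted(correlations.items(),
                       key=lambda item: abs(item[1]), reverse=True))

# Toy data (hypothetical): input_a matches the target, input_b is unrelated.
toy_inputs = [[1, 0], [1, 1], [0, 0], [0, 1]]
toy_target = [1, 1, 0, 0]
correlations = input_target_correlations(toy_inputs, toy_target,
                                         ["input_a", "input_b"])
for name, value in correlations.items():
    print(f"{name}: {value:+.2f}")
```

Sorting by absolute value puts the most influential inputs first, whether their correlation with the target is positive or negative.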
3. Neural network
The next step is to choose a neural network to represent the classification function. For classification problems, it is usually composed of:
- A scaling layer.
- Perceptron layers.
- A probabilistic layer.
We found that the perceptron layer contributed to overfitting the neural network, so we removed it.
The following figure shows the neural network used in this example.
It contains a scaling layer with 15 neurons (yellow) and a probabilistic layer with 1 neuron (red).
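A network made of a scaling layer followed by a single probabilistic neuron behaves like a logistic regression on scaled inputs. The sketch below illustrates that idea under stated assumptions: mean and standard deviation scaling, a logistic activation in the probabilistic layer, and placeholder parameter values. The class name `ScalingLogisticNetwork` is hypothetical and none of this is Neural Designer's API.

```python
import numpy as np

class ScalingLogisticNetwork:
    """Scaling layer (15 inputs) followed by one logistic output neuron."""

    def __init__(self, means, stds, weights, bias):
        self.means = np.asarray(means, dtype=float)
        self.stds = np.asarray(stds, dtype=float)
        self.weights = np.asarray(weights, dtype=float)
        self.bias = float(bias)

    def outputs(self, inputs):
        # Scaling layer: assumed here to be mean / standard deviation scaling.
        scaled = (np.asarray(inputs, dtype=float) - self.means) / self.stds
        # Probabilistic layer: logistic function of a linear combination.
        return 1.0 / (1.0 + np.exp(-(scaled @ self.weights + self.bias)))

n_inputs = 15
# Placeholder parameters (all zeros), not the trained values.
network = ScalingLogisticNetwork(means=np.zeros(n_inputs), stds=np.ones(n_inputs),
                                 weights=np.zeros(n_inputs), bias=0.0)
probabilities = network.outputs(np.ones((2, n_inputs)))
print(probabilities)  # with all-zero parameters every output is 0.5
```

The output is always between 0 and 1, so it can be read as the probability of having lung cancer.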
4. Training strategy
The next step is to configure the training strategy, which is then applied to the neural network to obtain the best possible loss. The type of training is determined by how the parameters of the neural network are adjusted. The training strategy is composed of two concepts:
- A loss index.
- An optimization algorithm.
The loss index chosen for this problem is the mean squared error with L2 regularization. It calculates the average squared error between the outputs from the neural network and the target in the data set.
The optimization algorithm is applied to the neural network for the best performance. In this case, gradient descent is utilized for training. This method updates the neural parameters in the direction of the negative gradient of the loss function.
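The loss index and the optimization algorithm described above can be sketched together as follows. This is a minimal illustration, not Neural Designer's implementation: the fixed learning rate, the regularization weight `reg`, the number of epochs, and the synthetic data are all assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w, b, X, y, reg):
    """Mean squared error plus an L2 penalty on the weights."""
    p = sigmoid(X @ w + b)
    return np.mean((p - y) ** 2) + reg * np.sum(w ** 2)

def gradient(w, b, X, y, reg):
    """Gradient of the loss with respect to the weights and the bias."""
    p = sigmoid(X @ w + b)
    delta = 2.0 * (p - y) * p * (1.0 - p) / len(y)  # chain rule through sigmoid
    return X.T @ delta + 2.0 * reg * w, np.sum(delta)

def train(X, y, reg=0.01, learning_rate=0.5, epochs=200):
    """Plain gradient descent: step against the gradient at every epoch."""
    w, b = np.zeros(X.shape[1]), 0.0
    history = []
    for _ in range(epochs):
        history.append(loss(w, b, X, y, reg))
        grad_w, grad_b = gradient(w, b, X, y, reg)
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    history.append(loss(w, b, X, y, reg))
    return w, b, history

# Synthetic data (hypothetical): the target depends on the first input only.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = (X[:, 0] > 0).astype(float)
w, b, history = train(X, y)
print(f"initial loss = {history[0]:.4f}, final loss = {history[-1]:.4f}")
```

The `history` list plays the role of the training error curve: it records the loss at each epoch so that its decrease can be plotted.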
The following chart shows how training and selection errors decrease with the epochs during training.
The final results are: training error = 0.0503 MSE and selection error = 0.0784 MSE.
6. Testing analysis
The next step is to evaluate the performance of the trained neural network by an exhaustive testing analysis. The standard way to do this is to compare the outputs of the neural network against data never seen before, the testing instances.
A standard method to measure the generalization performance is the ROC curve. This is a visual aid to study the classifier’s discrimination capacity. One of the parameters obtained from this chart is the area under the curve (AUC). The closer the area under the curve is to 1, the better the classifier.
In this case, the AUC takes a high value: AUC = 0.975.
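For reference, the AUC equals the probability that a randomly chosen positive instance receives a higher score than a randomly chosen negative one. The sketch below computes it that way on toy values. The function name `area_under_curve` and the scores are illustrative assumptions, not the study's testing data.

```python
import numpy as np

def area_under_curve(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly,
    counting ties as half a correct pair."""
    y_true = np.asarray(y_true).astype(bool)
    scores = np.asarray(scores, dtype=float)
    positives, negatives = scores[y_true], scores[~y_true]
    differences = positives[:, None] - negatives[None, :]
    correct = (differences > 0).sum() + 0.5 * (differences == 0).sum()
    return correct / (len(positives) * len(negatives))

# Toy example (hypothetical values): three of the four pairs are ranked correctly.
auc = area_under_curve([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc)  # 0.75
```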
Neural Designer takes as the optimal threshold the one corresponding to the point of the ROC curve nearest to the upper left corner. In this study, it has a value of 0.704.
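The nearest-to-corner rule can be sketched as follows: for each candidate threshold, compute the false positive rate and the true positive rate, then keep the threshold whose ROC point is closest to (0, 1). The function names and the toy scores are assumptions for illustration, and Neural Designer may enumerate thresholds differently.

```python
import numpy as np

def roc_points(y_true, scores):
    """False and true positive rates for every distinct score used as threshold."""
    y_true = np.asarray(y_true).astype(bool)
    scores = np.asarray(scores, dtype=float)
    thresholds = np.unique(scores)[::-1]  # from highest to lowest
    n_pos, n_neg = y_true.sum(), (~y_true).sum()
    tpr = np.array([((scores >= t) & y_true).sum() / n_pos for t in thresholds])
    fpr = np.array([((scores >= t) & ~y_true).sum() / n_neg for t in thresholds])
    return thresholds, fpr, tpr

def optimal_threshold(y_true, scores):
    """Threshold whose ROC point is nearest to the upper left corner (0, 1)."""
    thresholds, fpr, tpr = roc_points(y_true, scores)
    distances = np.sqrt(fpr ** 2 + (1.0 - tpr) ** 2)
    return float(thresholds[np.argmin(distances)])

# Toy example (hypothetical values): the classes separate perfectly at 0.6.
best = optimal_threshold([0, 0, 1, 1], [0.1, 0.3, 0.6, 0.9])
print(best)  # 0.6
```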
The binary classification tests and the confusion matrix provide helpful information about our predictive model’s performance. Below, both are displayed for the same decision threshold of 0.5.
| | Predicted positive | Predicted negative |
|---|---|---|
| Real positive | 56 (91.8%) | 0 (0.0%) |
| Real negative | 2 (3.3%) | 3 (4.9%) |
- Classification accuracy: 96.7% (Ratio of correctly classified samples).
- Error rate: 3.3% (Ratio of misclassified samples).
- Sensitivity: 100% (Proportion of real positives the model predicts as positives).
- Specificity: 60% (Proportion of real negatives the model predicts as negatives).
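The four tests above follow directly from the counts of a confusion matrix. In the sketch below, the counts (56 true positives, 0 false negatives, 2 false positives, 3 true negatives) are chosen so that they reproduce the four reported figures over the 61 testing instances. The function name `binary_classification_tests` is hypothetical.

```python
def binary_classification_tests(tp, fn, fp, tn):
    """Standard binary classification tests from confusion matrix counts."""
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,     # correctly classified samples
        "error_rate": (fp + fn) / total,   # misclassified samples
        "sensitivity": tp / (tp + fn),     # real positives predicted positive
        "specificity": tn / (tn + fp),     # real negatives predicted negative
    }

tests = binary_classification_tests(tp=56, fn=0, fp=2, tn=3)
for name, value in tests.items():
    print(f"{name}: {value:.1%}")
# accuracy: 96.7%
# error_rate: 3.3%
# sensitivity: 100.0%
# specificity: 60.0%
```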
The classification accuracy is high (96.7%), making the prediction suitable for many cases. For comparison, the paper cited in the references section reports an accuracy of 94.3%, so the model built in this study with Neural Designer improves on that figure.
7. Model deployment
Once the neural network’s generalization performance has been tested, it can be saved for future use in the so-called model deployment mode.
An interesting task in the model deployment tool is to calculate outputs, which produces a set of outputs for each set of inputs applied. Consequently, any new person who wants to know their lung cancer risk can quickly fill in the survey with their symptoms and obtain their cancer risk. Furthermore, the outputs depend, in turn, on the values of the parameters.
We can see an example with the 9 cancer risk factors selected in the growing inputs method of the model selection.
- gender: female (2)
- age: 43
- smoking: yes (1)
- yellow_fingers: yes (1)
- anxiety: yes (1)
- peer_pressure: no (0)
- chronic_disease: no (0)
- fatigue: no (0)
- allergy: yes (1)
- wheezing: no (0)
- alcohol_consuming: yes (1)
- coughing: yes (1)
- shortness_of_breath: no (0)
- swallowing_difficulty: no (0)
- chest_pain: no (0)
- lung_cancer: 0.897
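To turn the output probability into a decision, it can be compared against a decision threshold. The sketch below does this for the output above, using both the 0.5 threshold and the optimal threshold of 0.704 mentioned in the testing analysis. The function name `risk_label` and the labels "high risk" and "low risk" are hypothetical.

```python
def risk_label(probability, threshold):
    """Label an output probability by comparing it with a decision threshold."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    return "high risk" if probability >= threshold else "low risk"

output = 0.897  # lung_cancer output of the example above
labels = {threshold: risk_label(output, threshold) for threshold in (0.5, 0.704)}
for threshold, label in labels.items():
    print(f"threshold {threshold}: {label}")
```

The output of 0.897 is above both thresholds, so this person would be advised to undergo a more exhaustive clinical analysis in either case.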
References
- Oliver, A. S., Jayasankar, T., Sekar, K. R., Devi, T. K., Shalini, R. et al. (2021). Early Detection of Lung Carcinoma Using Machine Learning. Intelligent Automation & Soft Computing, 30(3), 755-770.
- Dataset from: Kaggle: Lung Cancer.
- Dataset from: data.world: Survey Lung Cancer.