Lung cancer is the leading cause of cancer deaths worldwide. The main goal of this study is to build a model that serves as a lung cancer risk assessment and decision support tool, facilitating prevention and screening discussions between people and their doctors.
It takes approximately 2 minutes to complete a questionnaire covering different cancer risk factors and pathologies. The result gives anyone an earlier and cost-efficient warning of possible cancer. Then, if the final risk value is high, the person can go to a hospital for a more exhaustive clinical analysis.
Lung cancer often goes undetected until later stages, when treatments become ineffective or have lower success rates. That is why it should be discovered early.
This is a classification project since the variable to be predicted is binary (cancer or not).
The goal here is to model the probability of having lung cancer as a function of different patients' genetic and non-genetic factors.
It is composed of four concepts:
The data file lung-cancer.csv contains the information used to create the model. It consists of 309 rows and 16 columns. The columns represent different cancer risk factors, while the rows represent the study's patients.
This data set uses the following 16 variables:
On the other hand, the instances are divided randomly into training, selection, and testing subsets, containing approximately 60%, 20%, and 20% of the instances, respectively. More specifically, 187 samples are used here for training, 61 for selection, and 61 for testing.
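The split above can be sketched in a few lines. This is a minimal illustration, not Neural Designer's implementation: the function name `split_indices`, the seed, and the choice to round the 20% subsets down and give the remainder to training are assumptions made so that the counts match those quoted above.

```python
import random

def split_indices(n_samples, selection_ratio=0.2, testing_ratio=0.2, seed=0):
    """Shuffle row indices and cut them into training, selection and testing subsets."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # random assignment of instances

    n_selection = int(n_samples * selection_ratio)      # rounded down
    n_testing = int(n_samples * testing_ratio)          # rounded down
    n_training = n_samples - n_selection - n_testing    # remainder goes to training

    training = indices[:n_training]
    selection = indices[n_training:n_training + n_selection]
    testing = indices[n_training + n_selection:]
    return training, selection, testing

training, selection, testing = split_indices(309)
print(len(training), len(selection), len(testing))  # 187 61 61
```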
Once the data set is configured, we can calculate the data distribution of the variables. The following figure depicts the number of patients who have cancer and those who do not.
As we can see in the previous chart, 87.38% of the patients have lung cancer, corresponding to 270 people.
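The percentage follows directly from the counts given in the text (309 patients, 270 of them with lung cancer). The variable names below are only illustrative.

```python
n_patients = 309   # rows in the data set
n_positive = 270   # patients with lung cancer
n_negative = n_patients - n_positive

positive_share = 100 * n_positive / n_patients
print(f"{positive_share:.2f}% positive, {n_negative} negative patients")
# 87.38% positive, 39 negative patients
```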
The next figure depicts the correlations of all the inputs with the target. This helps us see the influence of each input on the lung cancer diagnosis.
The most correlated variables are alcohol_consuming, allergy, swallowing_difficulty, coughing, and wheezing. For example, alcohol_consuming is positively correlated with the target, which means that patients who consume alcohol are more likely to have lung cancer in this data set.
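To make the idea of an input-target correlation concrete, here is a small sketch using the Pearson coefficient. The text does not say which correlation measure is used, so Pearson is an assumption, and the eight values below are hypothetical toy data, not rows from lung-cancer.csv.

```python
import math

def pearson_correlation(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    variance_x = sum((a - mean_x) ** 2 for a in x)
    variance_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / math.sqrt(variance_x * variance_y)

# Hypothetical toy values (1 = yes, 0 = no), for illustration only.
alcohol_consuming = [1, 1, 0, 1, 0, 1, 1, 0]
lung_cancer       = [1, 1, 0, 1, 1, 1, 1, 0]

r = pearson_correlation(alcohol_consuming, lung_cancer)
print(f"correlation = {r:.3f}")  # a positive value means the input rises with the target
```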
The second step is to choose a neural network to represent the classification function. For classification problems, it is composed of:
We found that including a perceptron layer contributes to overfitting the neural network. For this reason, we remove the perceptron layer.
The following figure is a diagram of the neural network used in this example.
It contains a scaling layer with 15 neurons (yellow) and a probabilistic layer with 1 neuron (red).
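As a rough sketch of what such a two-layer network computes, the example below scales 15 inputs and passes them through a single output neuron. The mean and standard deviation scaling, the logistic activation, and all parameter values are assumptions for illustration; the text does not specify them.

```python
import math

def scale(inputs, means, standard_deviations):
    """Scaling layer: one neuron per input, here using mean/std scaling (assumed)."""
    return [(x - m) / s for x, m, s in zip(inputs, means, standard_deviations)]

def probabilistic_neuron(scaled_inputs, weights, bias):
    """Probabilistic layer with 1 neuron: logistic function of a weighted sum (assumed)."""
    combination = bias + sum(w * x for w, x in zip(weights, scaled_inputs))
    return 1.0 / (1.0 + math.exp(-combination))

n_inputs = 15
# Hypothetical statistics and parameters, not the trained model's values.
means = [0.5] * n_inputs
standard_deviations = [0.5] * n_inputs
weights = [0.0] * n_inputs
bias = 0.0

answers = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1]  # hypothetical questionnaire
probability = probabilistic_neuron(scale(answers, means, standard_deviations), weights, bias)
print(probability)  # 0.5, because all weights and the bias are zero
```

With no perceptron layer in between, a network of this shape behaves like a logistic regression on the scaled inputs.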
The fourth step is to configure the training strategy, which is then applied to the neural network to obtain the best possible loss. The type of training is determined by how the parameters in the neural network are adjusted. The training strategy is composed of two concepts: a loss index and an optimization algorithm.
The loss index chosen for this problem is the mean squared error with L2 regularization. It calculates the average squared error between the outputs from the neural network and the target in the data set.
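The loss index can be written down in a few lines. This is a sketch under assumptions: the regularization term is taken as the sum of squared parameters, and the weight `0.01` and all sample values are hypothetical.

```python
def regularized_mse(outputs, targets, parameters, regularization_weight=0.01):
    """Mean squared error plus an L2 penalty on the network parameters."""
    mean_squared_error = sum((o - t) ** 2 for o, t in zip(outputs, targets)) / len(targets)
    l2_penalty = sum(p ** 2 for p in parameters)  # assumed form of the L2 term
    return mean_squared_error + regularization_weight * l2_penalty

# Hypothetical outputs, targets and parameters.
outputs = [0.9, 0.2, 0.7]
targets = [1, 0, 1]
parameters = [0.5, -0.5]

loss = regularized_mse(outputs, targets, parameters)
print(f"loss = {loss:.4f}")  # 0.0517
```

The penalty grows with the size of the parameters, so minimizing this loss favors smaller weights and a smoother model.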
The optimization algorithm is applied to the neural network to get the best performance. Gradient descent is used here for training. At each epoch, this method updates the neural network parameters in the direction of the negative gradient of the loss function.
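The update rule can be illustrated on a one-input logistic neuron trained with the mean squared error. This is a toy sketch: the data, the learning rate, and the number of epochs are hypothetical, and the regularization term is left out for brevity.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gradient_descent(xs, ts, learning_rate=0.5, epochs=200):
    """Train output = sigmoid(w * x + b) by plain gradient descent on the MSE."""
    w, b = 0.0, 0.0
    n = len(xs)
    history = []
    for _ in range(epochs):
        ps = [sigmoid(w * x + b) for x in xs]
        history.append(sum((p - t) ** 2 for p, t in zip(ps, ts)) / n)
        # Gradient of the MSE with respect to w and b (chain rule through the sigmoid).
        grad_w = sum(2 * (p - t) * p * (1 - p) * x for p, t, x in zip(ps, ts, xs)) / n
        grad_b = sum(2 * (p - t) * p * (1 - p) for p, t in zip(ps, ts)) / n
        # Step in the direction of the negative gradient.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b, history

# Hypothetical scaled inputs and binary targets.
xs = [-1.5, -1.0, -0.5, 0.5, 1.0, 1.5]
ts = [0, 0, 0, 1, 1, 1]

w, b, history = gradient_descent(xs, ts)
print(f"initial error = {history[0]:.4f}, final error = {history[-1]:.4f}")
```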
The following chart shows how the training and selection errors decrease with the epochs during the training process.
The final results are: training error = 0.0503 MSE and selection error = 0.0784 MSE.
The next step is to evaluate the performance of the trained neural network by an exhaustive testing analysis. The standard way to do this is to compare the outputs of the neural network against data never seen before, the testing instances.
A common method to measure the generalization performance is the ROC curve. This is a visual aid to study the discrimination capacity of the classifier. One of the parameters obtained from this chart is the area under the curve (AUC). The closer the area under the curve is to 1, the better the classifier.
In this case, the AUC takes a high value: AUC = 0.975.
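One way to read the AUC is as the probability that a randomly chosen positive patient receives a higher score than a randomly chosen negative one. The sketch below computes it that way by comparing all positive-negative pairs; the scores and labels are hypothetical toy values, not this study's testing instances.

```python
def roc_auc(scores, labels):
    """AUC as the fraction of positive-negative pairs ranked correctly (ties count half)."""
    positives = [s for s, l in zip(scores, labels) if l == 1]
    negatives = [s for s, l in zip(scores, labels) if l == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical model outputs and true labels.
scores = [0.9, 0.8, 0.35, 0.6, 0.2]
labels = [1, 1, 1, 0, 0]

auc = roc_auc(scores, labels)
print(f"AUC = {auc:.3f}")  # 0.833: five of the six pairs are ordered correctly
```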
Neural Designer computes the optimal threshold by finding the point of the ROC curve nearest to the upper left corner. The threshold which corresponds to that point is called the optimal threshold and, in this study, has a value of 0.704.
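The nearest-to-corner criterion can be sketched as follows. This is an illustration of the idea only, not Neural Designer's code: each candidate threshold gives one ROC point (false positive rate, true positive rate), and we keep the threshold whose point is closest to (0, 1). The scores and labels are hypothetical.

```python
import math

def optimal_threshold(scores, labels):
    """Return the threshold whose ROC point lies nearest to the upper left corner (0, 1)."""
    n_positive = sum(labels)
    n_negative = len(labels) - n_positive
    best_threshold, best_distance = None, float("inf")
    for threshold in sorted(set(scores)):
        # Instances with score >= threshold are predicted positive.
        true_positives = sum(1 for s, l in zip(scores, labels) if s >= threshold and l == 1)
        false_positives = sum(1 for s, l in zip(scores, labels) if s >= threshold and l == 0)
        true_positive_rate = true_positives / n_positive
        false_positive_rate = false_positives / n_negative
        distance = math.hypot(false_positive_rate, 1.0 - true_positive_rate)
        if distance < best_distance:
            best_threshold, best_distance = threshold, distance
    return best_threshold

# Hypothetical model outputs and true labels.
scores = [0.9, 0.8, 0.35, 0.6, 0.2]
labels = [1, 1, 1, 0, 0]

print(optimal_threshold(scores, labels))  # 0.8
```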
| | Predicted positive | Predicted negative |
|---|---|---|
| Real positive | 56 (91.8%) | 2 (3.3%) |
| Real negative | 0 (0.0%) | 3 (4.9%) |
The classification accuracy takes a high value (96.7%, or 59 of the 61 testing instances classified correctly), making the prediction suitable for many cases. The paper cited in the references section reports an accuracy of 94.3%, so the model built in this study with Neural Designer improves on that figure.
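The accuracy can be recomputed from the four cells of the confusion table. The sensitivity and specificity are not quoted in the text; they are derived here from the same counts.

```python
# Cells of the confusion table on the 61 testing instances.
true_positives = 56
false_negatives = 2
false_positives = 0
true_negatives = 3

total = true_positives + false_negatives + false_positives + true_negatives

accuracy = (true_positives + true_negatives) / total
sensitivity = true_positives / (true_positives + false_negatives)  # share of real positives found
specificity = true_negatives / (true_negatives + false_positives)  # share of real negatives found

print(f"accuracy = {accuracy:.1%}")        # 96.7%
print(f"sensitivity = {sensitivity:.1%}")  # 96.6%
print(f"specificity = {specificity:.1%}")  # 100.0%
```

Note that the testing subset contains only 3 real negatives, so the specificity rests on very few instances.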
Once the generalization performance of the neural network has been tested, the neural network can be saved for future use in the so-called model deployment mode.
An interesting task in the model deployment tool is to calculate outputs, which produces a set of outputs for each set of inputs applied. That means that any new person who wants to know their lung cancer risk can fill in the survey with their symptoms and obtain it. The outputs depend, in turn, on the values of the parameters.
We can see an example with the 9 cancer risk factors selected in the growing inputs method of the model selection.