Crystal ball logo

Health care applications

By Roberto Lopez, Artelnics.

Data mining is being widely used in the field of health. Studying genetic factors, environmental influences and physiological data allows practitioners to prevent, diagnose and treat diseases more effectively in order to improve people's welfare. The main challenge for predictive analytics is to use artificial intelligence to analyze clinical data in order to take up new models of care and new technologies promoting health and wellbeing.

Medical diagnosis picture

1. Medical diagnosis

A big amount of data is currently available to clinical specialists, ranging from details of clinical symptoms to various types of biochemical assays and outputs of imaging devices. To streamline the diagnostic process in daily routine and avoid misdiagnosis, predictive analytics is being widely employed. Deep learning algorithms can handle diverse types of medical data and successfully integrate them into categorized outputs.

An example is the early diagnosis of heart disease in a patient from personal characteristics and physical measurements.

  • Input variables:
    • Personal characteristics (age, sex...)
    • Patient symptoms (pain location, pain type...)
    • Physical measurements (blood pressure, cholesterol...)
    • Environmental factors (cigarretes...)
  • Target variable:
    • Heart disease (true or false)

The next figure shows the neural network for this example.

Neural network graph

Medical prognosis picture

2. Medical prognosis

The second application considered here is that of prognosis, the prediction of the long-term behavior of a disease. Some case studies refer to the forecasting of invasive cancer with no evidence of distant metastases at the time of diagnosis, or the probability of survival for patients who will undergone a certain surgery. The application of predictive analytics for prognosis in clinical medicine is becoming very popular.

  • Input variables:
    • Age of patient at time of operation
    • Patient's year of operation
    • Number of positive axillary nodes detected
  • Target variable:
    • Survival status (survived 5 years or longer or died within 5 years)

The results for this analysis show 83.2% accuracy on 2-year recurrence. The next table shows some values to test the accuracy of the model.

Binary classification test

Microarray picture

3. Microarray analysis

Functional genomics involves the analysis of large datasets of information derived from various biological experiments. One such type of large-scale experiment involves monitoring the expression levels of thousands of genes simultaneously under a particular condition, called gene expression analysis. Microarray technology makes this possible and the quantity of data generated from each experiment is enormous. Data mining can give a feeling for what the data actually represents, derive meaningful results from such experiments.

An example is to identify proteomic patterns in serum that distinguish ovarian cancer from non-cancer. This study is significant to women who have a high risk of ovarian cancer due to family or personal history of cancer. The proteomic spectra were generated by mass spectroscopy and the data set provided here includes 91 controls (Normal) and 162 ovarian cancers. The raw spectral data of each sample contains the relative amplitude of the intensity at each molecular mass/charge (M/Z) identity. There are total 15154 M/Z identities.

  • Input variables:
    • Gene 1 expression
    • ...
    • Gene 15154 expression
  • Target variable:
    • Diagnosis

Microarray picture

The analysis here shows that there is a reduced number of identities which can univocally distinguish ovarian cancer from non-cancer.

Drug design picture

4. Drug design

A technique widely used by pharmaceutical companies in the drug discovery process is called structure-activity relationship (SAR) analysis. A successful solution to this problem has the potential to provide significant economic benefit via increased process efficiency. Predictive analytics techniques such as deep learning have demonstrated to outperform its competitors in predicting the activity of a molecule from the values of its physicochemical descriptors.

An example is to predict the inhibition of dihydrofolate reductase by pyrimidines, using attributes such as size, polarity, and flexibility. More specifically, the input and target variables are:

  • Input variables:
    • Polarity
    • Size
    • Flexibility
    • Hydrogen bond donor
    • Hydrogen bond acceptor
    • Pi donor
    • Pi acceptor
    • Polarizability
    • Sigma effect
  • Target variable:
    • Biological activity

The next figure shows the testing analysis for this example. The correlation coefficient here is 0.87.

regression chart