Forecast air pollution in a city using machine learning


Ismael Mira
https://www.linkedin.com/in/ismael-mira-hern%C3%A1ndez-5a7052231/
June 29, 2023

Air pollution is one of the most significant problems the world faces.

A way to monitor and anticipate pollution levels helps public administrations make informed decisions.

We want to predict the levels of atmospheric pollutants for the following week in Madrid.

We solve this example with Neural Designer. To follow it step by step, you can use the free trial.

Application type.
Data set.
Neural network.
Training strategy.
Model selection.
Testing analysis.
Model deployment.

1. Application type

This is a forecasting project since the variables to be predicted are the values of five pollutants for the next week.

The goal here is to obtain an accurate prediction for each of them.

2. Data set

The file madrid-forecasting.csv contains 2,959 samples, each with 15 inputs.

Variables

We transform the dataset into a time series with lags and steps ahead. There are 35 output variables: one for each of the five pollutants on each of the 7 days of the following week. The following list summarizes the variables' information:

Regarding time attributes:

day
month
weekday

As for the pollution attributes, all values are expressed using the Air Quality Index:

PM2.5: value of particulate matter less than 2.5 microns in diameter.
PM10: value of particulate matter less than 10 microns in diameter.
O3: tropospheric ozone value (at ground level).
NO2: nitrogen dioxide value.
SO2: sulfur dioxide value.

Lastly, regarding meteorological attributes:

precipitations: amount of rainfall in mm.
tavg: average daily temperature.
tmax: maximum daily temperature.
tmin: minimum daily temperature.
pressure: atmospheric pressure in hPa.
windspeed: average wind speed in km/h.
humidity: relative humidity in percentage.

Each target variable represents the value of one pollutant on one of the next seven days, which gives 5 × 7 = 35 target variables in total.
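To make the lags-and-steps-ahead transformation more concrete, here is a minimal sketch in plain Python. The function name make_windows, the use of a single lag, and the toy values are assumptions for illustration; the article does not state how many lags were used or how Neural Designer builds the windows internally.

```python
POLLUTANTS = ["PM2.5", "PM10", "O3", "NO2", "SO2"]

def make_windows(rows, target_columns, lags=1, steps_ahead=7):
    """Turn a list of daily records (dicts) into (inputs, targets) pairs.

    inputs  : every variable from the last `lags` days, oldest day first.
    targets : each target column for each of the next `steps_ahead` days.
    """
    samples = []
    # Stop early enough that `steps_ahead` future days always exist.
    for t in range(lags - 1, len(rows) - steps_ahead):
        inputs = []
        for lag in range(lags - 1, -1, -1):
            inputs.extend(rows[t - lag].values())
        targets = []
        for step in range(1, steps_ahead + 1):
            targets.extend(rows[t + step][col] for col in target_columns)
        samples.append((inputs, targets))
    return samples

# Toy data: 10 days with invented values (only one weather variable shown).
days = []
for d in range(10):
    record = {name: float(10 * i + d) for i, name in enumerate(POLLUTANTS)}
    record["windspeed"] = 5.0
    days.append(record)

samples = make_windows(days, target_columns=POLLUTANTS)
```

With five pollutants and seven steps ahead, each sample carries 35 targets, which matches the number of output variables in this example.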

Instances

The instances are split at random into training (60%), selection (20%), and testing (20%) subsets.
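A random 60/20/20 split can be sketched as follows. The function name split_indices and the fixed seed are assumptions for illustration, and the exact subset sizes Neural Designer produces may differ by a sample or two depending on rounding.

```python
import random

def split_indices(n_samples, fractions=(0.6, 0.2, 0.2), seed=0):
    """Shuffle sample indices and cut them into three disjoint subsets."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_training = int(fractions[0] * n_samples)
    n_selection = int(fractions[1] * n_samples)
    training = indices[:n_training]
    selection = indices[n_training:n_training + n_selection]
    testing = indices[n_training + n_selection:]  # whatever remains
    return training, selection, testing

training, selection, testing = split_indices(2959)
```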

Statistics

Once the data set has been configured, we can run a few analyses on it. They let us verify the provided information and check that the data is of high quality.

We can calculate the data statistics and create a table that displays the minimums, maximums, means, and standard deviations of all variables in the dataset. The following table depicts the values.
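Outside Neural Designer, the same four statistics can be computed per column with the Python standard library. This is only a sketch: the describe helper and the NO2 values are invented for illustration, and we assume the sample standard deviation is the one reported.

```python
import statistics

def describe(column):
    """Minimum, maximum, mean and standard deviation of one variable."""
    return {
        "minimum": min(column),
        "maximum": max(column),
        "mean": statistics.mean(column),
        "deviation": statistics.stdev(column),  # sample standard deviation
    }

no2 = [20.0, 24.0, 31.0, 18.0, 27.0]  # invented values, not from the data set
stats = describe(no2)
```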

Input-target correlations

Additionally, we can identify the existing correlations between inputs and targets for each variable, thereby gaining insight into the importance of different factors on atmospheric pollutants.

The correlations show that the pollutants strongly influence one another: when the level of one increases, the others typically increase as well.

Additionally, the impact of meteorological conditions on air quality is evident. For example, PM2.5 levels decrease with higher wind speed, due to the dispersion of particles that this causes.
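As an illustration of what such a correlation measures, the sketch below computes a linear (Pearson) correlation coefficient. We assume a linear correlation here; Neural Designer may fit other correlation types. The wind speed and PM2.5 values are invented to mimic the inverse relationship described above.

```python
import math

def pearson(x, y):
    """Linear correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    variance_x = sum((a - mean_x) ** 2 for a in x)
    variance_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / math.sqrt(variance_x * variance_y)

windspeed = [5.0, 10.0, 15.0, 20.0]   # invented values
pm25 = [60.0, 50.0, 35.0, 30.0]       # invented values
r = pearson(windspeed, pm25)          # negative: more wind, less PM2.5
```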

3. Neural network

The next step is to set up a neural network to represent the forecasting function. For this class of applications, the neural network is composed of:

Scaling layer.
Perceptron layers.
Unscaling layer.

The scaling layer uses the minimum and maximum scaling method.

The number of perceptron layers is 2:

The first perceptron layer has 22 inputs and 10 neurons.
The second perceptron layer has 10 inputs and 35 neurons (the number of target variables).

The perceptron layer uses the hyperbolic tangent activation function.
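The layers described above can be sketched as a forward pass in plain Python. Several details are assumptions for illustration: the scaling range of [-1, 1], the linear activation in the second layer (the article only mentions the hyperbolic tangent), the random placeholder weights, and the invented minimums and maximums.

```python
import math
import random

def scale_minmax(values, minimums, maximums):
    # Map each value from [minimum, maximum] to [-1, 1].
    return [2.0 * (v - lo) / (hi - lo) - 1.0
            for v, lo, hi in zip(values, minimums, maximums)]

def unscale_minmax(values, minimums, maximums):
    # Inverse of scale_minmax: map [-1, 1] back to [minimum, maximum].
    return [0.5 * (v + 1.0) * (hi - lo) + lo
            for v, lo, hi in zip(values, minimums, maximums)]

def perceptron_layer(inputs, weights, biases, activation):
    # One neuron per row of `weights`: activation(w . x + b).
    return [activation(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

def linear(x):
    return x

rng = random.Random(0)
n_inputs, n_hidden, n_outputs = 22, 10, 35

# Random placeholder parameters; a trained model would supply real ones.
w1 = [[rng.uniform(-1, 1) for _ in range(n_inputs)] for _ in range(n_hidden)]
b1 = [0.0] * n_hidden
w2 = [[rng.uniform(-1, 1) for _ in range(n_hidden)] for _ in range(n_outputs)]
b2 = [0.0] * n_outputs

# Placeholder ranges for inputs and targets (invented values).
in_min, in_max = [0.0] * n_inputs, [100.0] * n_inputs
out_min, out_max = [0.0] * n_outputs, [200.0] * n_outputs

def forecast(raw_inputs):
    x = scale_minmax(raw_inputs, in_min, in_max)
    h = perceptron_layer(x, w1, b1, math.tanh)   # 22 inputs -> 10 neurons
    y = perceptron_layer(h, w2, b2, linear)      # 10 inputs -> 35 neurons
    return unscale_minmax(y, out_min, out_max)

outputs = forecast([50.0] * n_inputs)
```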

In this graphical representation, we can see the architecture of the neural network.

4. Training strategy

The procedure used to carry out the learning process is referred to as a training strategy. The training strategy is applied to the neural network to obtain the best possible performance.

The type of training is determined by how the parameters in the neural network are adjusted.

Loss index

We set the Minkowski error with L1 regularization as the loss index.
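To show what this loss index computes, here is a minimal sketch. The Minkowski error raises absolute errors to a power below 2, which makes it less sensitive to outliers than the squared error. The exponent of 1.5, the normalization by the number of samples, and the regularization weight of 0.01 are assumptions for illustration; the article does not state the values used.

```python
def minkowski_error(outputs, targets, exponent=1.5):
    """Sum of |output - target|^exponent, divided by the number of samples."""
    total = 0.0
    for output_row, target_row in zip(outputs, targets):
        total += sum(abs(o - t) ** exponent
                     for o, t in zip(output_row, target_row))
    return total / len(outputs)

def l1_norm(parameters):
    """Sum of absolute parameter values; penalizing it encourages sparsity."""
    return sum(abs(p) for p in parameters)

def loss_index(outputs, targets, parameters,
               exponent=1.5, regularization_weight=0.01):
    return (minkowski_error(outputs, targets, exponent)
            + regularization_weight * l1_norm(parameters))

# Tiny invented example: two samples, two outputs each, two parameters.
outputs = [[1.0, 2.0], [3.0, 5.0]]
targets = [[1.0, 3.0], [3.0, 1.0]]
parameters = [0.5, -1.5]
loss = loss_index(outputs, targets, parameters)
```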

Optimization algorithm

We use the quasi-Newton method as the optimization algorithm.
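Quasi-Newton methods avoid computing second derivatives by building an approximation of the inverse Hessian from successive gradients. The sketch below implements BFGS, one well-known quasi-Newton variant, with a simple backtracking line search. The article does not say which variant Neural Designer uses, and the two-parameter toy function stands in for the real loss index.

```python
def bfgs(f, grad, x, iterations=50, tolerance=1e-8):
    n = len(x)
    # Start from the identity as the inverse Hessian estimate.
    H = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    g = grad(x)
    for _ in range(iterations):
        if sum(v * v for v in g) ** 0.5 < tolerance:
            break
        # Search direction d = -H g.
        d = [-sum(H[i][j] * g[j] for j in range(n)) for i in range(n)]
        # Backtracking line search with the Armijo condition.
        fx = f(x)
        slope = sum(gi * di for gi, di in zip(g, d))
        step = 1.0
        while step > 1e-10 and (
            f([xi + step * di for xi, di in zip(x, d)])
            > fx + 1e-4 * step * slope
        ):
            step *= 0.5
        x_new = [xi + step * di for xi, di in zip(x, d)]
        g_new = grad(x_new)
        s = [a - b for a, b in zip(x_new, x)]
        y = [a - b for a, b in zip(g_new, g)]
        sy = sum(a * b for a, b in zip(s, y))
        if sy > 1e-12:  # curvature condition; skip the update otherwise
            rho = 1.0 / sy
            Hy = [sum(H[i][j] * y[j] for j in range(n)) for i in range(n)]
            yHy = sum(y[i] * Hy[i] for i in range(n))
            # Expanded form of (I - rho s y^T) H (I - rho y s^T) + rho s s^T.
            for i in range(n):
                for j in range(n):
                    H[i][j] += (-rho * (s[i] * Hy[j] + Hy[i] * s[j])
                                + (rho * rho * yHy + rho) * s[i] * s[j])
        x, g = x_new, g_new
    return x

# Toy quadratic with its minimum at (1, -2).
def toy_loss(p):
    return (p[0] - 1.0) ** 2 + 10.0 * (p[1] + 2.0) ** 2

def toy_gradient(p):
    return [2.0 * (p[0] - 1.0), 20.0 * (p[1] + 2.0)]

solution = bfgs(toy_loss, toy_gradient, [0.0, 0.0])
```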

Training

The following chart illustrates how the training and selection errors decrease with the epochs of the quasi-Newton method during the training process.

As we can see, the two curves behave similarly throughout the epochs, which suggests that no overfitting has occurred.

The final Minkowski errors (ME) are 0.598 for training and 0.657 for selection.

That indicates that the neural network has good generalization capabilities.

5. Model selection

The objective of model selection is to find the network architecture with the best generalization properties, that is, the one that minimizes the error on the selection instances of the dataset.

Order selection algorithms train several network architectures with different numbers of neurons and select the one with the smallest selection error.

The incremental order method starts with a small number of neurons and increases the complexity at each iteration.
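The idea behind incremental order selection can be sketched as a simple loop. The function names, the neuron range, and the stand-in training function with its invented error values are assumptions for illustration; in practice, each step would train a real network and measure its errors.

```python
def incremental_order(train_and_evaluate, minimum=1, maximum=10, step=1):
    """Try growing numbers of neurons; keep the lowest selection error."""
    best_neurons, best_error = None, float("inf")
    history = []
    for neurons in range(minimum, maximum + 1, step):
        training_error, selection_error = train_and_evaluate(neurons)
        history.append((neurons, training_error, selection_error))
        if selection_error < best_error:
            best_neurons, best_error = neurons, selection_error
    return best_neurons, history

# Stand-in for real training: the training error keeps falling, while the
# selection error is lowest at 4 neurons. All values are invented.
def fake_train_and_evaluate(neurons):
    training_error = 1.0 / neurons
    selection_error = 0.5 + 0.02 * (neurons - 4) ** 2
    return training_error, selection_error

best, history = incremental_order(fake_train_and_evaluate)
```

Choosing by selection error rather than training error matters because the training error usually keeps decreasing as neurons are added, even once the model starts to overfit.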

The following chart shows the training error (blue) and the selection error (orange) as a function of the number of neurons.

6. Testing analysis

Once the model is trained, we perform a testing analysis to validate its prediction capacity. We use a subset of previously unused data, specifically the testing instances.

To verify the results obtained in this example, the graphs below compare the predicted and actual values of contamination.

As the PM2.5 graph shows, the predictions align closely with the actual values. The same holds for the other pollutants, with only slight discrepancies for NO2. We can say that the results are satisfactory.
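A visual comparison can be complemented with a number per pollutant. The sketch below computes the mean absolute error between predicted and actual values; this metric is our choice for illustration, not one reported in the article, and the values are invented.

```python
def mean_absolute_error(predicted, actual):
    """Average absolute difference between predictions and observations."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)

# Invented values for two pollutants on three testing days.
predicted = {"PM2.5": [52.0, 48.0, 60.0], "NO2": [20.0, 25.0, 31.0]}
actual = {"PM2.5": [50.0, 49.0, 61.0], "NO2": [24.0, 22.0, 35.0]}

errors = {name: mean_absolute_error(predicted[name], actual[name])
          for name in actual}
```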

7. Model deployment

The neural network is now ready to predict pollutant levels from new data in the so-called model deployment phase.

The file madrid-air-forecasting.py implements the mathematical expression of the neural network in Python. This piece of software can be embedded in any tool to make predictions on new data.

We can integrate this model into a website and, using public data retrieved from APIs, obtain the weekly forecast.
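When embedding the model in a website, the 35 raw outputs need to be turned into a readable weekly forecast. The helper below is a hypothetical sketch: the name reshape_forecast is ours, and it assumes the outputs are ordered day by day, with the five pollutants in the order listed earlier. The actual ordering in the exported file should be checked before relying on it.

```python
POLLUTANTS = ["PM2.5", "PM10", "O3", "NO2", "SO2"]

def reshape_forecast(outputs, days=7):
    """Group a flat list of outputs into one dict of pollutants per day."""
    expected = days * len(POLLUTANTS)
    if len(outputs) != expected:
        raise ValueError("expected %d outputs, got %d" % (expected, len(outputs)))
    forecast = []
    for day in range(days):
        start = day * len(POLLUTANTS)
        values = outputs[start:start + len(POLLUTANTS)]
        forecast.append(dict(zip(POLLUTANTS, values)))
    return forecast

# Placeholder outputs standing in for the model's 35 predictions.
flat_outputs = [float(i) for i in range(35)]
week = reshape_forecast(flat_outputs)
```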

Weekly air pollution forecast for Madrid >

References

The data for this problem has been taken from the Air Quality Historical Data Platform.