Higgs Boson machine learning challenge
By Pablo Martín, Artelnics.
The data used for this challenge were simulated by the ATLAS experiment at CERN. Physicists use them to optimize the analysis
of the Higgs Boson. The database was downloaded from Kaggle (Higgs Boson challenge).
The Higgs Boson was theorized almost 50 years ago by Peter Higgs as the elementary particle that gives other particles their mass. It was discovered by the ATLAS
experiment and the CMS experiment, which run at the Large Hadron Collider (LHC) at CERN.
In these experiments, proton bunches are accelerated on a circular trajectory in both directions. When the bunches cross the ATLAS detector, some of the protons collide and produce hundreds of particles (an event), which are detected by sensors. From this information, the type, energy and 3D direction of every particle are estimated.
Some of these events are uninteresting (called background): they are exotic in everyday terms but were already discovered by previous generations of experiments. The goal of this analysis is to find a region of the feature space in which there is a significant excess of events (called signal) compared to what known background processes can explain.
The database consists of 250000 events simulated with the official ATLAS full detector simulator. First, proton-proton collisions are simulated based
on all the knowledge that has been accumulated on particle physics. Second, the resulting particles are tracked through a virtual model of the detector.
Each event is described by 30 different features, such as the ID of the event, the estimated mass of the Higgs Boson candidate or the missing transverse energy. Furthermore, every instance contains a variable called weight. The sum of the weights of the events falling in a given region is an unbiased estimate of the number of events expected to fall in that region during a fixed time interval. The weights will not be used as inputs for the analysis.
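To make the role of the weights concrete, here is a minimal sketch of how the expected number of events in a region could be estimated by summing weights. The event values, the "s"/"b" labels and the helper name expected_counts are illustrative assumptions, not taken from the dataset.

```python
# Hypothetical events: (estimated candidate mass, weight, label).
# The numbers are made up for illustration; "s" = signal, "b" = background.
events = [
    (124.8, 0.002, "s"),
    (91.3, 1.500, "b"),
    (126.1, 0.003, "s"),
    (118.4, 0.800, "b"),
    (127.9, 2.100, "b"),
]

def expected_counts(events, low, high):
    """Sum the weights of the events whose mass lies in [low, high), per label."""
    totals = {"s": 0.0, "b": 0.0}
    for mass, weight, label in events:
        if low <= mass < high:
            totals[label] += weight
    return totals

# Expected signal and background counts in a hypothetical mass window.
counts = expected_counts(events, 120.0, 130.0)
print(counts)
```

Here the region is a simple mass window; in the analysis itself the region would be defined by the classifier's decision rather than by a single variable.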
The next figure shows some basic statistics of the variables: minimum, maximum, mean and standard deviation.
Neural Designer provides us with the tools to find the deep architecture that best fits our problem.
In this case, we will use one hidden layer with seven neurons in it.
In addition, we will use the cross-entropy error as the performance measure, which is especially useful for classification problems.
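As a rough illustration of this error measure, the sketch below computes the mean binary cross-entropy between 0/1 targets and predicted probabilities. The function name, the averaging over instances and the clipping constant are assumptions; Neural Designer's exact definition may differ, for example in normalization.

```python
import math

def cross_entropy(targets, outputs, eps=1e-12):
    """Mean binary cross-entropy between 0/1 targets and predicted probabilities."""
    total = 0.0
    for t, y in zip(targets, outputs):
        y = min(max(y, eps), 1.0 - eps)  # clip to avoid log(0)
        total -= t * math.log(y) + (1 - t) * math.log(1.0 - y)
    return total / len(targets)

targets = [1, 0, 1, 0]
confident = [0.9, 0.1, 0.8, 0.2]  # probabilities close to the targets
unsure = [0.6, 0.4, 0.5, 0.5]     # probabilities close to 0.5

print(cross_entropy(targets, confident))
print(cross_entropy(targets, unsure))
```

Confident and correct predictions give a lower error than hesitant ones, which is what makes this measure a natural fit for a classifier that outputs probabilities.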
The quasi-Newton algorithm will be used as the main training method.
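For readers unfamiliar with quasi-Newton methods, the following is a minimal sketch of one common variant, BFGS, with a backtracking line search. It is an assumption that this variant resembles what Neural Designer uses; the function names and the toy objective are hypothetical, and a quadratic function stands in for the network's error.

```python
def bfgs(f, grad, x0, max_iter=100, tol=1e-8):
    """Minimise f with BFGS, keeping an approximation H of the inverse Hessian."""
    n = len(x0)
    x = list(x0)
    H = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    g = grad(x)
    for _ in range(max_iter):
        if sum(gi * gi for gi in g) ** 0.5 < tol:
            break
        # Search direction d = -H g.
        d = [-sum(H[i][j] * g[j] for j in range(n)) for i in range(n)]
        # Backtracking line search (Armijo sufficient-decrease condition).
        step, fx = 1.0, f(x)
        slope = sum(gi * di for gi, di in zip(g, d))
        while step > 1e-12 and f([xi + step * di for xi, di in zip(x, d)]) > fx + 1e-4 * step * slope:
            step *= 0.5
        x_new = [xi + step * di for xi, di in zip(x, d)]
        g_new = grad(x_new)
        s = [a - b for a, b in zip(x_new, x)]
        y = [a - b for a, b in zip(g_new, g)]
        sy = sum(si * yi for si, yi in zip(s, y))
        if sy > 1e-12:  # curvature condition; skip the update otherwise
            rho = 1.0 / sy
            Hy = [sum(H[i][j] * y[j] for j in range(n)) for i in range(n)]
            yHy = sum(yi * hi for yi, hi in zip(y, Hy))
            # H <- (I - rho s y^T) H (I - rho y s^T) + rho s s^T, expanded for symmetric H.
            H = [[H[i][j] - rho * (s[i] * Hy[j] + Hy[i] * s[j])
                  + (rho * rho * yHy + rho) * s[i] * s[j]
                  for j in range(n)] for i in range(n)]
        x, g = x_new, g_new
    return x

# Toy objective standing in for the training error (hypothetical).
def f(x):
    return (x[0] - 1.0) ** 2 + 10.0 * (x[1] + 2.0) ** 2

def grad_f(x):
    return [2.0 * (x[0] - 1.0), 20.0 * (x[1] + 2.0)]

minimum = bfgs(f, grad_f, [0.0, 0.0])
print(minimum)  # close to [1.0, -2.0]
```

The key idea is that second-order information is built up from successive gradients alone, so the Hessian never has to be computed or inverted explicitly.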
The next figure depicts how the cross-entropy error decreases with the iterations during the training process:
The last step is to test the performance of the model, for which we will use some well-known testing techniques for classification problems.
The binary classification parameters shown in the next picture provide us with some useful information about the performance of the model:
As we can see, the classification accuracy, which is the proportion of instances that the model can correctly classify, is 0.989 (98.9%). The error rate, which is the ratio of instances misclassified, is 0.011 (1.1%).
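These two figures follow directly from the confusion matrix. The sketch below shows the arithmetic; the counts are hypothetical and were chosen only so that the result matches the reported rates, since the article does not give the actual confusion matrix.

```python
def binary_classification_rates(tp, fp, tn, fn):
    """Return (accuracy, error rate) from the four confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    error_rate = (fp + fn) / total
    return accuracy, error_rate

# Hypothetical counts: 989 of 1000 test instances classified correctly.
accuracy, error_rate = binary_classification_rates(tp=330, fp=6, tn=659, fn=5)
print(accuracy, error_rate)
```

The two rates always sum to 1, so either one is enough to recover the other.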
Another useful test that illustrates the performance of this model is the ROC analysis.
In this case, the area under the ROC curve is 0.998, very close to the value of 1 that a perfect classifier would achieve.
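One way to read the area under the ROC curve is as the probability that a randomly chosen signal event receives a higher score than a randomly chosen background event. The sketch below computes it that way on a handful of made-up scores; the function name and the data are illustrative assumptions.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve: fraction of (positive, negative) pairs
    in which the positive has the higher score (ties count one half)."""
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical labels (1 = signal, 0 = background) and model scores.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.2, 0.1]
auc = roc_auc(labels, scores)
print(auc)
```

This pairwise form is quadratic in the number of instances, so it suits small illustrations; for a dataset of this size a rank-based computation would be preferable.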
Once the model has been tested, Neural Designer allows us to export the mathematical expression of the trained deep architecture, which can analyze more than four million events per second.
At the end of that expression, we can see the logistic and probability functions: Logistic(x) and Probability(x).
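The exported expression itself is not reproduced here, but its general shape can be sketched. The example below assumes a hyperbolic tangent activation in the seven hidden neurons and uses placeholder parameters and only three inputs for brevity; none of these values come from the trained model.

```python
import math

def logistic(x):
    """Numerically stable logistic function, mapping any real x into (0, 1)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def probability(inputs, hidden_weights, hidden_biases, output_weights, output_bias):
    """Forward pass: one hidden layer (tanh assumed), then a logistic output."""
    hidden = [math.tanh(b + sum(w * x for w, x in zip(ws, inputs)))
              for ws, b in zip(hidden_weights, hidden_biases)]
    return logistic(output_bias + sum(w * h for w, h in zip(output_weights, hidden)))

# Placeholder parameters for 3 inputs and 7 hidden neurons (not the trained values).
hidden_weights = [[0.1 * (i + 1)] * 3 for i in range(7)]
hidden_biases = [0.0] * 7
output_weights = [0.5, -0.5, 0.25, -0.25, 0.1, -0.1, 0.3]
output_bias = -0.2

p = probability([0.3, -1.2, 0.7], hidden_weights, hidden_biases, output_weights, output_bias)
print(p)
```

Because the whole model reduces to a few multiplications, additions and elementary functions per event, evaluating it is cheap, which is consistent with the high throughput mentioned above.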
As we already mentioned, this application has been solved with the professional predictive analytics solution Neural Designer.