Illustration 9

Powered by Evolved Analytics' DataModeler

Illustration: Model-based Outlier Detection

Outlier detection for nonlinear systems with many input variables is very hard to achieve using conventional methods. Yet an outlier is either the most important nugget in the data set or something that should be removed from the modeling process to avoid distorting the results; deciding which of the two it is requires human insight.
DataModeler provides tools for outlier detection both before and after model development. Here we look at model-based outlier detection.

Define noisy data which includes an outlier

Here we build upon the previous noisy data example but append an outlier at the end of the data set. The outlier's inputs sit at the center of the parameter space and its response sits in the middle of the response values. From the perspective of the BivariatePlot, there are no obvious outliers, and we definitely wouldn't flag a point like this one, which lies in the middle of both the data and response ranges.
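To make the construction concrete, here is a minimal Python sketch of the same idea. It is not the DataModeler code used in this illustration: the bowl-shaped response function, the noise level, and the helper names `make_noisy_data` and `append_central_outlier` are all assumptions chosen for illustration.

```python
import random

def make_noisy_data(n=100, noise=0.1, seed=0):
    """Hypothetical stand-in for the noisy data set (the response function is an assumption)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x1 = rng.uniform(-1.0, 1.0)
        x2 = rng.uniform(-1.0, 1.0)
        y = x1 * x1 + x2 * x2 + rng.gauss(0.0, noise)  # bowl-shaped response plus noise
        rows.append((x1, x2, y))
    return rows

def append_central_outlier(rows):
    """Append one record at the center of the input ranges with a mid-range response."""
    n_inputs = len(rows[0]) - 1
    center = []
    for j in range(n_inputs):
        column = [row[j] for row in rows]
        center.append((min(column) + max(column)) / 2.0)
    responses = [row[-1] for row in rows]
    mid_response = (min(responses) + max(responses)) / 2.0
    return rows + [tuple(center) + (mid_response,)]

data = append_central_outlier(make_noisy_data())
outlier = data[-1]  # unremarkable on every single axis, yet far from the response surface
```

Because every coordinate of the appended record is mid-range, a plot of any single variable against the response cannot single it out; only its position relative to the underlying surface is unusual.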





Evolve models from the noisy data containing the outlier

What we have is the data. Let us first model the data with a one-minute SymbolicRegression run made up of three 20-second IndependentEvolutions. This generates a reasonably high-quality set of models given the noise contained within the inputs and does a good job of isolating the driving variables.




Graphics: Pareto Front Plot; Variables in models with R² > 80%

Now let us form an ensemble from the models with an 9_outlierDetection_6.gif and a complexity less than 150. In the EnsemblePredictionPlot we can see, at the upper end, the ridge riding through the superimposed noise.





Identify outliers

The concept behind a DataOutlierAnalysis is to look at how difficult it is to model a data record's response with “good models” that handle the majority of the data well. If a data record has an unusually large StrangenessMetric, it is flagged and reported. As we can see below, we have correctly identified the artificially introduced outlier. The outlier distance is the key parameter below; its definition is (take a deep breath) the number of inter-quantile distances a data record's strangeness is away from the nearest quantile. By default, a record is flagged as an outlier if this value is larger than 1.5; values greater than 3 should be viewed as far outliers that are definitely unusually difficult to model. (These are the same criteria used by BoxWhiskerPlot.)
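The outlier distance can be sketched in a few lines of Python. This is an illustrative reading of the definition above, not DataModeler's implementation: it assumes the quantiles are the lower and upper quartiles (the box-and-whisker convention), takes the strangeness values as given without saying how they are computed, and uses the hypothetical names `outlier_distances` and `flag_outliers`.

```python
import statistics

def outlier_distances(strangeness):
    """Distance of each value beyond its nearest quartile, in units of the interquartile range."""
    q1, _, q3 = statistics.quantiles(strangeness, n=4, method="inclusive")
    iqr = q3 - q1
    if iqr == 0:
        raise ValueError("interquartile range is zero; the distance is undefined")
    distances = []
    for s in strangeness:
        if s > q3:
            distances.append((s - q3) / iqr)
        elif s < q1:
            distances.append((q1 - s) / iqr)
        else:
            distances.append(0.0)  # inside the box: not away from either quartile
    return distances

def flag_outliers(strangeness, threshold=1.5, far_threshold=3.0):
    """Return (record index, distance, label) for records beyond the thresholds."""
    flagged = []
    for index, distance in enumerate(outlier_distances(strangeness)):
        if distance > far_threshold:
            flagged.append((index, distance, "far outlier"))
        elif distance > threshold:
            flagged.append((index, distance, "outlier"))
    return flagged

# Made-up strangeness values: nine ordinary records and one that is hard to model.
strangeness = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 30.0]
flags = flag_outliers(strangeness)
```

For these made-up values the quartiles are 3.25 and 7.75, so the last record lies (30 - 7.75) / 4.5, or about 4.9, interquartile ranges above the upper quartile and is the only one flagged, as a far outlier.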





Based upon this analysis, we visually tag the outlier record(s) below (rotate the graphic to see the artificially introduced outlier). We also remove the outlier and look at the ensemble performance against the cleaned data.
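The remove-and-rescore step can be sketched in Python as follows. Everything here is a hypothetical stand-in: `toy_ensemble` replaces the evolved ensemble, the one-input data set is made up, and R² is used as the performance measure only for illustration.

```python
def r_squared(actual, predicted):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    mean = sum(actual) / len(actual)
    ss_tot = sum((a - mean) ** 2 for a in actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return 1.0 - ss_res / ss_tot

def drop_records(rows, flagged_indices):
    """Return the rows whose index was not flagged as an outlier."""
    flagged = set(flagged_indices)
    return [row for i, row in enumerate(rows) if i not in flagged]

def score(rows, predict):
    """Score a prediction function against rows of (inputs..., response)."""
    actual = [row[-1] for row in rows]
    predicted = [predict(row[:-1]) for row in rows]
    return r_squared(actual, predicted)

def toy_ensemble(inputs):
    """Hypothetical stand-in for the ensemble prediction."""
    return 2.0 * inputs[0]

# Made-up data: a noisy linear trend plus one record that is far off the trend.
rows = [(float(x), 2.0 * x + (0.1 if x % 2 else -0.1)) for x in range(10)]
rows.append((2.0, 14.0))  # artificial outlier at index 10

cleaned = drop_records(rows, [10])
before = score(rows, toy_ensemble)
after = score(cleaned, toy_ensemble)
```

Comparing `before` and `after` shows how much a single hard-to-model record can depress the apparent fit of models that handle the rest of the data well.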





Having a means to easily validate input multivariate data which are coupled to produce the observed response is HUGE.


Created with Wolfram Mathematica 8.0