Illustration 7: Dealing with noisy data

Modeling noisy data is difficult because we want to capture the fundamental behavior rather than the noise-induced perturbations. DataModeler and SymbolicRegression let us easily identify the models providing the best trade-off between model complexity, accuracy, and constituent variables.

Define some noisy data

Let us define a fairly simple function with two driving variables, synthesize some data using the truth function, and add random noise to the response. The BivariatePlot of the data and response is shown below. Can you tell that x1 and x2 are the real drivers of the response? What we have is the data.

Let us first model the data with a quick SymbolicRegression. For this particular problem we want to hammer on the ModelDimensionality; however, we will devote only two minutes to the model search. From the ParetoFrontPlot below, we can see that this generates a reasonably high-quality set of models given the noise contained in the data. We can also see a noise floor and a bend (knee) in the ParetoFront, with a relatively sharp decrease in the incremental benefit of adding model complexity. In general, we want to choose models from near the knee of the ParetoFront, since they provide the best balance between accuracy and complexity. Note that the data tells us the appropriate complexity; it is not an a priori assumption made before the modeling.
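As a concrete illustration, the data synthesis described above can be sketched in Python (DataModeler itself runs in Mathematica). The particular truth function, sample size, and noise level here are hypothetical stand-ins, since the original text does not specify them; the essential structure is eight candidate inputs of which only two drive the response:

```python
import numpy as np

rng = np.random.default_rng(0)

# Eight candidate input variables, of which only x1 and x2 drive the response.
n = 200
X = rng.uniform(-1, 1, size=(n, 8))

# Hypothetical truth function -- the original text does not specify its form.
def truth(x1, x2):
    return x1**2 - x2 + x1 * x2

y_clean = truth(X[:, 0], X[:, 1])          # fundamental behavior
y = y_clean + rng.normal(0, 0.1, size=n)   # random noise added to the response only
```

A model search then sees only `X` and `y`; recovering the dependence on the first two columns alone is the challenge the noise creates.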

As we can see from the VariablePresenceMap, two variables are clearly important. Other variables MAY be important; however, they tend to appear later, as we work our way up in complexity, so they may simply be chasing the noise perturbations.

Above we formed an ensemble from the UncorrelatedModels which feature two variables and have a complexity of less than 150. We can see at the upper end the prediction of the ridge riding through the superimposed noise, but we would not really know that if we did not have a priori knowledge of the response behavior. That said, if we were dealing with industrial data and saw a prediction-vs-actual plot that looked like this, we might be happy, especially if the underlying model featured only two of the eight available variables.
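The ensemble-forming step above (keeping two-variable models below a complexity cap and averaging their predictions) can be sketched in Python. The complexity cap of 150 and the two-variable criterion mirror the text; the model records and their functional forms are hypothetical stand-ins for DataModeler model objects:

```python
# Each candidate model is summarized by its complexity, the variables it
# uses, and a prediction function (stand-ins for DataModeler model objects).
models = [
    {"complexity": 40,  "vars": {"x1", "x2"},
     "predict": lambda x1, x2: x1**2 - x2},
    {"complexity": 120, "vars": {"x1", "x2"},
     "predict": lambda x1, x2: x1**2 - x2 + 0.1 * x1 * x2},
    {"complexity": 200, "vars": {"x1", "x2", "x5"},   # too complex; filtered out
     "predict": lambda x1, x2: x1**2 - x2 + 0.05 * x1},
]

def select(models, max_complexity=150, n_vars=2):
    """Keep models below the complexity cap that use exactly n_vars inputs."""
    return [m for m in models
            if m["complexity"] < max_complexity and len(m["vars"]) == n_vars]

def ensemble_predict(selected, x1, x2):
    """Ensemble prediction as the average of the constituent model predictions."""
    return sum(m["predict"](x1, x2) for m in selected) / len(selected)

chosen = select(models)
```

Averaging diverse near-Pareto models is what makes the ensemble prediction robust against any single model having chased the noise.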

If we look at the ResponseSurfacePlot of this ensemble, we see that we have quickly sorted out the multivariate input data, identified the key inputs AND built a reasonably good approximation to the true system dynamics, all while avoiding overfitting to the injected noise. And all of this in a total of two minutes of compute time. We should be impressed. The TriangularSurfacePlot of the two variables identified as drivers shows the input data used for the model development.
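A response surface amounts to evaluating the fitted model over a grid of the two driver variables. A minimal Python analogue of that evaluation, using a hypothetical fitted model form in place of the actual ensemble:

```python
import numpy as np

# Hypothetical stand-in for the fitted two-variable ensemble model.
def model(x1, x2):
    return x1**2 - x2

# Tabulate predictions over a 25 x 25 grid of the two drivers -- the
# numerical content behind a ResponseSurfacePlot.
x1g, x2g = np.meshgrid(np.linspace(-1, 1, 25), np.linspace(-1, 1, 25))
surface = model(x1g, x2g)
```

Plotting `surface` against `x1g` and `x2g` with any surface-plotting tool reproduces the qualitative picture described above: smooth system dynamics with the injected noise averaged away.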