Illustration 2

Powered by Evolved Analytics' DataModeler

Illustration: Rapid variable selection & modeling

SymbolicRegression can identify the driving variables and return quality models, sometimes in just a few seconds.

Generate the data

Below we generate ten data points using five input variables. The response behavior that we are modeling is a nonlinear function of two of those inputs. In practice, we would receive a data table like the one shown and be tasked with extracting insight from that set of numbers.

2_variableSelection_1.gif

2_variableSelection_2.gif
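For concreteness, a minimal sketch of generating such a data set appears below. The response function used here, a nonlinear function of x1 and x2 only, is purely illustrative; the actual function behind the table above is not reproduced.

SeedRandom[42]; (* for reproducibility *)
dataInputs = RandomReal[{-1, 1}, {10, 5}]; (* 10 rows, 5 inputs *)
dataResponse = Map[(#[[1]]^2 + Sin[3 #[[2]]]) &, dataInputs]; (* driven by x1 and x2 only *)
TableForm[MapThread[Append, {dataInputs, dataResponse}],
 TableHeadings -> {None, {"x1", "x2", "x3", "x4", "x5", "y"}}]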

Staring at a list of numbers is a good way to go blind. A BivariatePlot lets us examine all pairwise combinations of the data variables. Although useful, such visualization techniques break down when we are faced with many variables and nonlinear interactions between more than a few inputs.

2_variableSelection_3.gif

2_variableSelection_4.gif
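Assuming the dataInputs and dataResponse variables from the sketch above, the plot matrix is a one-liner; the argument pattern shown is an assumption, so consult the BivariatePlot documentation for the exact signature.

BivariatePlot[dataInputs, dataResponse]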

Perform a SymbolicRegression

Now let us devote 15 seconds to searching for models which capture this behavior. The modeling process explores the trade-off between ModelComplexity and model error (1-R² is the default metric). This is illustrated in the ParetoFrontLogPlot below, which displays each returned model's quality metrics: complexity AND accuracy. The models denoted by red dots lie on the ParetoFront and are ALL optimal in the sense that, for a given level of accuracy, there is no simpler model or, conversely, for a given level of complexity, there is no more accurate model. Mousing over the quality points will pop up the underlying model; as we can see, a wide variety of model structures and variable combinations have been proposed and explored during even this very short modeling exercise.
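A minimal sketch of this step, assuming the data variables from earlier; the TimeConstraint option used to bound the search at 15 seconds is an assumption about the SymbolicRegression signature.

models = SymbolicRegression[dataInputs, dataResponse, TimeConstraint -> 15]; (* time-boxed model search *)
ParetoFrontLogPlot[models] (* complexity vs. error for the returned models *)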

With only ten data points and five variables, it is possible to discover spurious relationships; however, the focus on model simplicity does a remarkable job of identifying and isolating the driving variables. This is illustrated by the VariablePresenceMap of the models along the ParetoFront. As we move from the least complex to the most complex models, we typically see variables being introduced as their information content provides a model accuracy benefit relative to the corresponding increase in model complexity.

2_variableSelection_6.gif

2_variableSelection_7.gif

[Graphics: Variable Presence Map]
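A sketch of the corresponding call, assuming the map accepts the model set returned above:

VariablePresenceMap[models]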

The VariableCombinationTable lets us look at which combinations of variables are popular in the selected model set. As we can see, there has been strong selectivity towards the true driving inputs.

2_variableSelection_9.gif

2_variableSelection_10.gif
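Again as a sketch, assuming the table accepts the returned model set directly:

VariableCombinationTable[models]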

The ModelSelectionTable lets us examine models along with their corresponding quality metrics. By default it shows the models which lie on the ParetoFront, in other words, the best models. Here we can confirm that wildly different model structures have been hypothesized and explored.

2_variableSelection_11.gif

2_variableSelection_12.gif
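And, under the same assumption about the argument pattern:

ModelSelectionTable[models]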

In the above table, note that we have NOT discovered the true underlying function. Of course, in the real world all data is corrupted by noise and perturbations, so we typically don't want a model that exactly fits the observed data behavior. Rather, we seek a “good enough” model that captures the response dynamics without being inappropriately complex. In the case of multivariate data, identifying the driving variables can also be key to human insight and understanding.

2_variableSelection_13.gif

Created with Wolfram Mathematica 8.0