Illustration 10

Powered by Evolved Analytics' DataModeler

Illustration: Active Design-of-Experiments

The conventional approach to experimental design is to:
    (a) make a bunch of simplifying assumptions,
    (b) assume a model form based upon ease of data analysis,
    (c) run a batch of experiments, and
    (d) check whether the data confirmed the a priori assumptions and, if not, start over.
The ability of SymbolicRegression to synthesize, assess, and refine models and, furthermore, to build a trust metric on WHERE those models are valid means that we can integrate modeling and data collection and shift from a passive collect-data-then-analyze mode into an active DOE framework. The benefits are HUGE. At the end of the data collection we have BOTH an awareness of the significant variables AND a quality response model. Furthermore, at each step of the process we have chosen the next data point to maximize the anticipated information content — thereby achieving a better result with fewer experiments than if a conventional approach were adopted. For some systems, an active DOE strategy could require orders-of-magnitude fewer experiments.
The implications for time-to-market, product quality, and customer satisfaction should be obvious.

The Objectives & the Strategy

Suppose we have a formulation design problem. Unfortunately, we don't know which of the many parameters are important and, furthermore, there is a fair amount of noise in our experimental test rig. Our goals are:

1) identify which parameters (processing setpoints, ingredients, additives, etc.) are most important,

2) understand the basic trade-offs between parameters,

3) develop a new formulation which will satisfy the customer needs, and

4) do this as fast and efficiently as possible — otherwise, we are leaving money on the table.

One approach would be to follow a conventional DOE path: run a full-factorial DOE (2^N experiments, where N is the number of parameters), try to figure out which variables are important based upon a linear model assumption, and then try to optimize the variables which were tagged as significant. The problem is that when we are done with the first round of experiments (1,024 if N=10) we know very little about the system response behavior AND we have assumed that our simplifying assumptions are not TOO simplifying.
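The exponential growth in experiment count is easy to see: a two-level full-factorial design enumerates every combination of low/high settings, giving 2^N runs. DataModeler itself is a Mathematica package; the short Python sketch below is only meant to illustrate the arithmetic, and the level values are arbitrary:

```python
from itertools import product

# Two-level full factorial: every combination of the two level settings
# across N parameters, i.e. 2**N experimental runs.
def full_factorial(n_params, levels=(-7, 7)):
    return list(product(levels, repeat=n_params))

print(len(full_factorial(10)))  # 1,024 runs for N = 10
```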

The strategy we are advocating here is to:

a) collect an initial (small) data set,

b) build ensembles of symbolic regression (which, you recall, feature a trustability measure),

c) collect more data to confirm/deny the estimated optimal response AND to drive uncertainty out of the model — effectively, we seek to maximize the information content of each collected data sample,

d) build more model ensembles and repeat (c)-(d) until “good enough” has been achieved.
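The loop above can be sketched in Python. Since DataModeler's SymbolicRegression is not available here, this sketch substitutes a bootstrap ensemble of linear models and a hypothetical noisy test function; every name and signature in it is an illustrative assumption, not the real API:

```python
import numpy as np

rng = np.random.default_rng(0)

def run_experiment(x):
    # Hypothetical noisy test rig: only x[0] and x[1] drive the response.
    return x[0]**2 - x[1] + rng.normal(0, 0.5)

def build_ensemble(X, y, n_models=10):
    # Stand-in for symbolic regression: linear fits on bootstrap resamples.
    models, n = [], len(X)
    for _ in range(n_models):
        idx = rng.integers(0, n, n)                     # bootstrap sample
        A = np.hstack([X[idx], np.ones((n, 1))])        # add intercept column
        coef, *_ = np.linalg.lstsq(A, y[idx], rcond=None)
        models.append(coef)
    return models

def predict(models, x):
    a = np.append(x, 1.0)
    return np.array([m @ a for m in models])

def next_sample(models, dim=10, n_candidates=500):
    # Step (c): sample where the ensemble disagrees the most.
    cands = rng.uniform(-7, 7, (n_candidates, dim))
    spreads = [predict(models, c).std() for c in cands]
    return cands[int(np.argmax(spreads))]

# Step (a): a small initial data set, then alternate steps (b)-(d).
dim = 10
X = rng.uniform(-7, 7, (12, dim))
y = np.array([run_experiment(x) for x in X])
for _ in range(20):
    models = build_ensemble(X, y)
    x_new = next_sample(models, dim)
    X = np.vstack([X, x_new])
    y = np.append(y, run_experiment(x_new))
```

Each pass adds exactly one experiment, chosen for maximum ensemble disagreement rather than on a fixed grid.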

As we shall see, integrating the modeling and data collection is MUCH more efficient and effective than a decoupled approach.

Define the underlying system & experimental rig

For this illustration, we will assume that we have ten parameters which all lie in the range [-7,7]. To simplify the visualization, only the first two actually count and the remaining eight are superfluous. The response behavior from these two variables is shown to the left below; however, our test system used to generate data is noisy, so what would be seen by the active DOE process would tend to look more like that shown to the right.
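A minimal stand-in for such a test rig might look like the following; the particular response form (x1^2 - x2) and the noise level are assumptions chosen for illustration only, not the actual hidden system used in this notebook:

```python
import numpy as np

rng = np.random.default_rng(1)

def true_response(x):
    # Hypothetical hidden system: only the first two of ten parameters matter.
    return x[0]**2 - x[1]

def noisy_measurement(x, sigma=0.5):
    # What the experimental rig actually reports: truth plus Gaussian noise.
    return true_response(x) + rng.normal(0.0, sigma)

x = rng.uniform(-7, 7, 10)        # a random point in the 10-D design space
clean, noisy = true_response(x), noisy_measurement(x)
```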



Since we will be starting with relatively few data samples, it will be easy to get spurious correlations in the early going. The data collection needs to be able to drive out the spurious correlations and focus on the true driving variables AND build a good response model.

Collect some initial data and build models

Below we generate random sample points within the 10-D parameter space and collect experimental results at those points as well as the center point. After devoting a minute to a model search, we take the models with a complexity less than 150 and use the best third of these from an accuracy perspective as the candidate set from which we build a model ensemble. The selected models and their variable presence are shown below. We also show the response and divergence surfaces of the ensemble — restricting the plots to only the variables which are in at least 30% of the ensemble models.
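The selection rule (complexity below 150, then the best third by accuracy) can be sketched generically; the model records, complexity values, and error values below are hypothetical stand-ins for DataModeler's richer model objects and selection functions:

```python
# Keep models under a complexity cap, then take the best fraction by accuracy.
def select_candidates(models, max_complexity=150, keep_fraction=1/3):
    simple = [m for m in models if m["complexity"] < max_complexity]
    simple.sort(key=lambda m: m["error"])          # lower error = better
    n_keep = max(1, int(len(simple) * keep_fraction))
    return simple[:n_keep]

# A hypothetical model pool: (complexity, error) pairs.
pool = [{"complexity": c, "error": e}
        for c, e in [(40, 0.9), (200, 0.1), (90, 0.3),
                     (120, 0.5), (60, 0.2), (145, 0.7)]]
chosen = select_candidates(pool)
```

Note that the most accurate model in the pool (error 0.1) is rejected outright because its complexity exceeds the cap; the filter trades raw accuracy for parsimony before the ensemble is built.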



Identify where we should collect more data

Given the paucity of input data, we are not likely to have very good models coming out of the initial round of model building; however, the models will likely have good performance metrics since we have lots of variables and few observations which must be fit. Using the ensemble of “good models”, we look for the point of maximum model uncertainty (as defined by the EnsembleDivergenceFunction) as well as the predicted maximum response point. (The algorithm implemented below is a little more sophisticated: if either point has been previously discovered, a local search is run from a variety of starting points and the best result not previously discovered is taken. If that doesn't uncover a unique new sampling point, the identified point is randomly perturbed. Variables not present in the ensemble are chosen randomly within their ranges.)
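The uncertainty-driven part of that point selection can be sketched as follows, using the standard deviation of ensemble predictions as a simple stand-in for the EnsembleDivergenceFunction, plus a uniqueness check with a random-perturbation fallback; all names and the toy two-model ensemble are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)

def divergence(preds):
    # Spread of ensemble predictions at a point -- a simple stand-in for
    # DataModeler's EnsembleDivergenceFunction.
    return np.std(preds)

def pick_new_point(ensemble_predict, seen, dim=10, n_starts=50, tol=1e-6):
    # Multi-start search: rank random candidates by ensemble disagreement
    # and return the best one not already sampled.
    cands = rng.uniform(-7, 7, (n_starts, dim))
    order = np.argsort([-divergence(ensemble_predict(c)) for c in cands])
    for i in order:
        if all(np.linalg.norm(cands[i] - s) > tol for s in seen):
            return cands[i]
    # Fallback: randomly perturb the best point so the sample is unique.
    return cands[order[0]] + rng.normal(0, 0.1, dim)

# Toy ensemble of two models that disagree most where |x[0]| is largest.
ensemble = [lambda x: x[0], lambda x: -x[0]]
pt = pick_new_point(lambda x: np.array([f(x) for f in ensemble]), seen=[])
```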

In the early stages of active DOE, the new data points will tend to shatter the model performance since the hypothesized model structures will not accommodate the new information. In the later stages, the inclusion of new information will be less destructive since the basic structure is correct and only refinement is needed. When the models are shattered, we know that the new data has added significant information.