Model Building Lifecycle

The data modeling process is essentially a path of converting data ⇒ information ⇒ knowledge ⇒ understanding. The onus for wisdom extraction lies with the user and, in an industrial setting, the final conversion is into units of dollars. Towards that end, there are fairly standard steps we should follow; these are shown in the graphic below.

It needs to be emphasized that the data modeling process is not a George Jetson process of pressing a button, kicking back and blindly accepting results. Rather, the human side needs to be intimately involved in the entire process, both in understanding the context of the data, the modeling objectives and the strengths and weaknesses of the techniques used, and in casting a jaundiced eye on the developed models and their validity.

The human side is especially the focus of the most important step in the data modeling process  — defining success. This is critical to determining how much effort should be expended and over what time frame, as well as the model accuracy requirements.

An understanding of how the model will be used provides guidance in terms of the technologies which should be used in the modeling process. We should also note that, surprisingly often, the process of methodically defining the problem can allow the desired goals to be achieved without the use of advanced modeling; a structured understanding of the targeted system, the issues and a back-of-the-envelope analysis may be sufficient.

With that caveat, note that data is at the center of the other steps in the modeling process and, at some steps along the way, we are shaping and redefining the data to be used in the model building. Also note that modeling is implicitly used at each step along the way  — whether as a hypothesized model, as a reference for identifying driving variables or balancing the data, in the form of expectations on variable ranges and distributions, or simply in guiding our strategy to condition the data or select the appropriate modeling technique.

Data exploration is used to get the zen of the data and to develop a first-level assessment of its quality, quantity and characteristics. We would want to look at summary statistics of the available variables, their ranges and dispersion, examine scatter matrix plots of the variables and the targeted response(s) to try to understand pairwise behaviors. We would also look for missing fields (variables) within records (data points) and try to understand their frequency and distribution. If possible, visualization techniques may be applied; however, this can be quite difficult in high-dimensional spaces with many input variables.
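As an illustration, this sort of first-pass exploration might look like the following pandas sketch; the file name, the DataFrame `df` and its column layout are hypothetical stand-ins rather than anything prescribed by the process itself.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical data set; the file name and column layout are assumptions.
df = pd.read_csv("process_data.csv")

# Summary statistics: ranges, dispersion and counts for each variable.
print(df.describe())

# Frequency and distribution of missing fields within the records.
print(df.isna().sum())                        # missing count per variable
print(df.isna().sum(axis=1).value_counts())   # missing fields per record

# Pairwise behavior of the variables and the targeted response.
pd.plotting.scatter_matrix(df, figsize=(10, 10), diagonal="hist")
plt.show()
```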

In conditioning the data, we are trying to assemble a functional data set upon which we can apply our modeling techniques. Towards that end, most modeling techniques (notably excluding symbolic regression) require re-scaling so that all variables and responses are in the same range. Re-scaling can be an issue if variables are coupled and that relationship is significant; however, symbolic regression is also the only technique which can naturally handle correlated variables.
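A minimal re-scaling sketch using scikit-learn is shown below; the input matrix is purely illustrative, and min-max scaling is just one of several reasonable choices.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Illustrative input matrix with variables on very different scales.
X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 500.0]])

# Map every variable onto a common [0, 1] range; keep the fitted scaler so
# the identical transform can be applied to any future data.
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)
print(X_scaled)
```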

We might also want to artificially introduce limit points to ensure that the developed models have the proper limit behaviors and don't have pathologies simply because the model development data didn't include those extrema. If there are missing records, we must decide either to remove them or to impute them based upon some strategy. If the input includes ordinal information (e.g., the artificial leather quality is {poor, fair, good, very good or excellent}) we need to map from this classification information into a numerical representation so that we can build models. The net effect of this step is the data and response which will form the foundation of the developed models; that data must be complete and strictly numerical.
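For example, the ordinal mapping and the remove-or-impute decision might be handled along the following lines; the variable names and the imputation strategy (column median) are assumptions made only for illustration.

```python
import pandas as pd

# Hypothetical records with an ordinal quality rating and a missing field.
df = pd.DataFrame({
    "temperature": [310.0, 325.0, None, 340.0],
    "quality": ["poor", "good", "excellent", "fair"],
})

# Map the ordinal classification onto a numerical representation.
quality_levels = {"poor": 1, "fair": 2, "good": 3, "very good": 4, "excellent": 5}
df["quality"] = df["quality"].map(quality_levels)

# Either remove the incomplete records ...
df_complete = df.dropna()

# ... or impute the missing fields with some strategy (here, the column median).
df_imputed = df.fillna(df.median())
```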

One of the traditional problems with empirical model building is the inclusion of spurious or correlated variables into the modeling process. Hence, variable selection is very important to develop a quality model. Symbolic regression can do this from a nonlinear perspective, as can iterating model building in stacked analytic nets. Symbolic regression is also attractive because it can identify variable combinations and transforms which can be used as metavariables to improve the performance of other techniques. A standard linear technique to create independent variables is principal components analysis; this has the problem of interpretability of the synthesized variables as well as being limited to linear combinations of the variables. To some extent, linear model building can detect the presence of correlated variables by looking at the variance inflation factor of the developed models and iteratively pruning inappropriate inputs. Note that the variable selection process is implicitly model dependent since a developed model is used to judge which variables or combinations of variables should be used in the ongoing model development.
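The iterative variance-inflation-factor pruning mentioned above could be sketched as follows; the threshold of 10 is a conventional rule of thumb rather than a recommendation from the text, and the example data is synthetic.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

def prune_by_vif(X: pd.DataFrame, threshold: float = 10.0) -> pd.DataFrame:
    """Iteratively drop the input with the largest variance inflation factor
    until every remaining input falls below the threshold."""
    X = X.copy()
    while X.shape[1] > 1:
        vifs = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
        worst = int(np.argmax(vifs))
        if vifs[worst] < threshold:
            break
        X = X.drop(columns=X.columns[worst])
    return X

# Illustrative data in which x3 is nearly a copy of x1.
rng = np.random.default_rng(0)
x1, x2 = rng.normal(size=200), rng.normal(size=200)
X = pd.DataFrame({"x1": x1, "x2": x2, "x3": x1 + 0.01 * rng.normal(size=200)})
print(prune_by_vif(X).columns.tolist())
```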

Success in data balancing also implicitly depends upon a model since there is a presumption that the proper variables have been selected. Intuitively, the data should be sampled more vigorously in areas of high curvature and less so in areas of linear behavior  — which also implies a model to connect the data points. This is done automatically by the support vector regression identification of support vectors. However, non-model-centric approaches can also be used which attempt to get a uniform coverage in either parameter space or the response behavior — for example, clustering techniques. Related to the data balancing is partitioning of the data into appropriate subsets for training, test and validation.
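One non-model-centric balancing approach based on clustering in parameter space is sketched below; the cluster count and per-cluster sample size are arbitrary assumptions and would be tuned to the data at hand.

```python
import numpy as np
from sklearn.cluster import KMeans

# Illustrative data matrix standing in for the conditioned inputs.
rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 4))

# Cluster in parameter space and draw the same number of records from each
# cluster so the retained data has roughly uniform coverage.
kmeans = KMeans(n_clusters=20, n_init=10, random_state=0).fit(X)
per_cluster = 10
balanced_idx = np.concatenate([
    rng.choice(np.flatnonzero(kmeans.labels_ == k),
               size=min(per_cluster, int(np.sum(kmeans.labels_ == k))),
               replace=False)
    for k in range(kmeans.n_clusters)
])
X_balanced = X[balanced_idx]
```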

Model building is, of course, the focus of our algorithms of interest. As previously discussed, each technique has its own strengths and weaknesses. Although our favorite is symbolic regression, other techniques may be preferred for specific applications.

Model validation is very important to develop a sense of trust in the models prior to their usage. Classically, this is achieved by using a data subset which has not been used in the model development as an independent assessment of the model quality. As a further quality control measure, models which perform well against the training and test sets can also be evaluated against a third (validation) set. If there is a paucity of data (a fat array), then other cross-validation strategies can be applied. Alternately, in this case, the symbolic regression modeling can be done using all of the available data and the Pareto front trading expression accuracy against complexity may be used to select models at an appropriate point in that trade-off  — which implicitly minimizes the risk of selecting an over-trained model. Symbolic regression models have an additional advantage in terms of transparency  — i.e., a user can look at the expression structure and constituent variables (and variable combinations) and agree that the expression is reasonable and intuitively correct. Model behavior at limit conditions should also be examined since, for example, a model which recommends that the optimal planting depth for corn is four inches above the surface of the ground (true story) is not likely to get much in terms of user acceptance. Thus, our definition of a good model typically includes robustness as well as absolute accuracy.
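When data is scarce, the cross-validation mentioned above might look like the sketch below; the ridge model, fold count and synthetic data are arbitrary stand-ins for whatever modeling technique is actually in play.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

# Illustrative "fat array": few records relative to the number of inputs.
rng = np.random.default_rng(2)
X = rng.normal(size=(40, 15))
y = X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=40)

# k-fold cross-validation lets every record serve in both fitting and
# independent assessment, which matters when data cannot be spared for a
# dedicated validation set.
scores = cross_val_score(Ridge(alpha=1.0), X, y,
                         cv=KFold(n_splits=5, shuffle=True, random_state=0),
                         scoring="r2")
print(scores.mean(), scores.std())
```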

Additionally, we may want to identify an ensemble of diverse models (of similar modeling capacity) whose agreement can be used as a trust metric. Assembling this ensemble is easiest for stacked analytic nets and relatively easy for symbolic regression models. The use of ensembles and the associated trust metric has several very significant advantages (a sketch of such a divergence-based trust check follows the list):

  • divergence in the prediction of the individual models within the ensemble provides an immediate warning that the model is either operating in uncharted territory or the underlying system has undergone fundamental changes. In either event, the model predictions should not be trusted;
  • all of the data may be used in the model building rather than requiring a partitioning into training, testing, validation, etc. subsets. The resulting models will be less myopic. Additionally, this may be a very significant advantage if data is sparse.
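A minimal sketch of such a divergence check, assuming an ensemble of already-fitted models exposing a `predict` method; here the ensemble members are stand-in linear models fitted on bootstrap samples, and the divergence limit is an arbitrary assumption.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def ensemble_predict(models, X, divergence_limit=0.1):
    """Ensemble mean prediction plus a trust flag derived from the spread
    (divergence) of the individual model predictions."""
    preds = np.column_stack([m.predict(X) for m in models])
    spread = preds.std(axis=1)
    return preds.mean(axis=1), spread, spread < divergence_limit

# Stand-in ensemble: the same model form fitted on different bootstrap samples.
rng = np.random.default_rng(3)
X = rng.uniform(0.0, 1.0, size=(200, 2))
y = 3.0 * X[:, 0] + X[:, 1] ** 2 + 0.05 * rng.normal(size=200)
models = [LinearRegression().fit(X[idx], y[idx])
          for idx in (rng.integers(0, len(X), size=len(X)) for _ in range(10))]

# The spread is small where the models interpolate and grows well outside
# the region covered by the development data.
print(ensemble_predict(models, np.array([[0.5, 0.5]])))
print(ensemble_predict(models, np.array([[5.0, 5.0]])))
```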

Empirical models are simply an academic exercise unless their value is extracted. Model deployment, therefore, is a key aspect of the modeling lifecycle. Associated with this are system-building aspects such as user interfaces and data entry. Trust metrics are also important to identify, if possible, when the model is being used inappropriately  — e.g., in parameter space outside of the domain used to train the model.
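One crude deployment-time guard is a per-variable range check against the development data; the optional margin below is an assumption, and in practice this would complement rather than replace an ensemble-based trust metric.

```python
import numpy as np

def in_training_domain(X_train, X_new, margin=0.0):
    """Flag which new records fall inside the per-variable range (optionally
    widened by a margin) spanned by the model development data."""
    X_train, X_new = np.asarray(X_train), np.asarray(X_new)
    lo = X_train.min(axis=0) - margin
    hi = X_train.max(axis=0) + margin
    return np.all((X_new >= lo) & (X_new <= hi), axis=1)
```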

Finally, in many applications (for example, financial and process industries) the underlying system changes over time. If these changes are not captured in terms of input parameters (or, sometimes, even if they are), the model accuracy will decay over time and the model will, therefore, need to be either rebuilt or recalibrated. Tracking the accuracy of the models will help to indicate when model maintenance is required. Different model types generally have different rates of decay, with symbolic regression models having the longest real-world legitimacy. Since a trusted incorrect model is potentially financially and physically dangerous, having a trust metric is very important and a key real-world advantage of symbolic regression and stacked analytic networks.
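A simple accuracy-tracking sketch is given below; the rolling window and the drift limit relative to the deployment-time error level are assumptions chosen only to illustrate the idea.

```python
import numpy as np
import pandas as pd

def rolling_model_error(y_actual, y_predicted, window=50, drift_limit=1.5):
    """Rolling RMSE of a deployed model plus a flag for the points where the
    error has drifted past `drift_limit` times the initial (deployment) level."""
    sq_err = pd.Series((np.asarray(y_actual) - np.asarray(y_predicted)) ** 2)
    rmse = sq_err.rolling(window).mean().pow(0.5)
    baseline = rmse.dropna().iloc[0]          # error level when tracking began
    return rmse, rmse > drift_limit * baseline
```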