Modeling Technologies

The reality is that, in the hands of a guru and given enough time, almost any modeling technology can be massaged into providing a quality model and yielding human insight. Our philosophy is different: combining a variety of modeling technologies has a synergistic effect and will produce the desired high-quality models and human insight faster, more easily and more robustly than pushing any single technology to its limits. With that caveat, the authors' favorite technologies are illustrated in the graphic below.

Symbolic regression has emerged as our dominant modeling technology in recent years due to algorithmic and computing advances that have expanded the scope of problems to which it can be applied. Using a survival-of-the-fittest metaphor, expressions are evolved which summarize the data response.

The implemented state-of-the-art algorithms are very powerful: in addition to developing compact (nonlinear) models suitable for human inspection, they automatically identify and select the driving variables.

Because of this automatic variable selection, symbolic regression can directly handle fat arrays (many more variables than records) which would be pathological for other techniques. One limitation of symbolic regression is that the search space is infinite, so developing quality models is compute-intensive.

The practical limit for the DataModeler implementation is on the order of 50,000 to 100,000 records, although this can be exceeded with extra effort or patience.
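As a concrete illustration of evolving expressions and the variable selection that falls out of it, here is a minimal sketch using the open-source gplearn package in Python as a stand-in for a symbolic regression engine; the synthetic data, settings and expectation that only the driving variables survive are assumptions for illustration, not DataModeler usage.

```python
# Illustrative symbolic regression sketch with automatic variable selection,
# using gplearn as a stand-in engine (not DataModeler).
import numpy as np
from gplearn.genetic import SymbolicRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 10))        # 10 candidate inputs
y = 3.0 * X[:, 2] * X[:, 5] + X[:, 2] ** 2    # only x2 and x5 actually drive y

sr = SymbolicRegressor(population_size=1000,
                       generations=20,
                       function_set=('add', 'sub', 'mul'),
                       parsimony_coefficient=0.001,  # pressure toward compact models
                       random_state=0)
sr.fit(X, y)

# The evolved expression typically references only the driving variables
# (X2 and X5), so variable selection falls out of the modeling process.
print(sr._program)
```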

Variable selection can also be accomplished via iterative down-selection of inputs using stacked analytic nets (SANs). As with neural networks, ensembles of SAN models can easily be developed. Such ensembles diverge when extrapolating, which provides a trust metric on the developed models: it detects when either the underlying system has changed or the models are being supplied with data from novel regions of parameter space and are, therefore, extrapolating. SANs are also attractive because their development is very efficient; as a result, they are suitable for model building and variable selection on very large data sets.
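The ensemble-divergence trust metric can be sketched as follows; generic polynomial regressors from scikit-learn stand in for stacked analytic nets, and the data and model settings are assumptions made purely for illustration.

```python
# Sketch of the ensemble "trust metric": members agree on training-like data
# but diverge on novel inputs, flagging extrapolation. Not the SAN algorithm.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Ridge

rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(500, 3))
y = np.sin(3 * X[:, 0]) + X[:, 1] ** 2 + 0.05 * rng.normal(size=500)

# Build an ensemble by fitting each member on a different bootstrap sample.
ensemble = []
for _ in range(10):
    idx = rng.integers(0, len(X), len(X))
    member = make_pipeline(PolynomialFeatures(degree=3), Ridge(alpha=1e-3))
    ensemble.append(member.fit(X[idx], y[idx]))

def predict_with_trust(x_new):
    """Return the ensemble mean prediction and its spread (the trust metric)."""
    preds = np.array([m.predict(x_new) for m in ensemble])
    return preds.mean(axis=0), preds.std(axis=0)

inside  = np.array([[0.5, 0.5, 0.5]])   # within the training region
outside = np.array([[3.0, 3.0, 3.0]])   # well outside it; members tend to diverge
print(predict_with_trust(inside)[1], predict_with_trust(outside)[1])
```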

Neural networks are universal approximators and are useful for identifying the model accuracy potential of a given data set. The downside is that model development is a nonlinear optimization problem and it is difficult to control the modeling capacity, which is required to legitimately assemble ensembles of models for a consensus metric. Since other technologies have additional advantages, we tend not to use neural networks for deployed models; however, they can be helpful during interim development.
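A minimal sketch of using a neural network to gauge the accuracy potential of a data set, assuming scikit-learn's MLPRegressor and synthetic data; the network size and scoring choices are illustrative, not a recommended recipe.

```python
# Gauge the achievable accuracy of a data set with a flexible neural network.
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
X = rng.uniform(-1, 1, size=(400, 5))
y = np.exp(X[:, 0]) * np.sin(2 * X[:, 1]) + 0.1 * rng.normal(size=400)

nn = MLPRegressor(hidden_layer_sizes=(50, 50), max_iter=5000, random_state=0)
scores = cross_val_score(nn, X, y, cv=5, scoring="r2")

# If even a flexible network cannot score well, simpler interpretable models
# are unlikely to either; if it scores well, that sets the accuracy target.
print("cross-validated R^2:", scores.mean().round(3))
```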

Support vector regression (SVR) falls into the kernel machine learning category. Statistical learning theory says that if we define a level of acceptable error, we can place kernels (e.g., polynomials, radial basis functions or a mixture) on selected data points and get a response model which is very accurate; furthermore, if appropriate kernels have been chosen, the response model will tend to have relatively good extrapolation capabilities (within limits, obviously). The really nice feature of SVR is that the selected support vectors (data points) naturally capture the dynamics of the response behavior, with more data points located in regions of rapid change and fewer elsewhere. Hence, it has a natural and implicit data-balancing capability, assuming appropriate kernels have been chosen. Additionally, advanced versions allow the detection and identification of outliers in a nonlinear and data-driven sense. The downside is that SVR suffers from the curse of dimensionality, so spurious input variables will cause problems. Additionally, the exact method of support vector selection is computationally intensive and, therefore, restricted to relatively small numbers of data records; however, iterative selection techniques can be used to extend the range of application to broader data sets. The final problems with SVR are implementation and human interpretation. In summary, we generally do not use SVR for deployed models; however, we occasionally use it in the modeling process for outlier detection as well as data balancing.
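A minimal SVR sketch, assuming scikit-learn's implementation and synthetic data, illustrating how the epsilon parameter encodes the acceptable error level and how the retained support vectors concentrate where the response changes rapidly.

```python
# SVR sketch: epsilon sets the acceptable error band, the kernel encodes
# assumptions about the response, and the retained support vectors tend to
# concentrate in regions of rapid change (implicit data balancing).
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(3)
x = np.sort(rng.uniform(0, 10, 300)).reshape(-1, 1)
y = np.tanh(x[:, 0] - 5) + 0.02 * rng.normal(size=300)   # rapid change near x = 5

svr = SVR(kernel="rbf", C=10.0, epsilon=0.05).fit(x, y)

# svr.support_ holds the indices of the chosen support vectors; most tend to
# sit near the transition around x = 5.
sv_x = x[svr.support_, 0]
print(len(sv_x), "support vectors; fraction within 4 < x < 6:",
      np.mean((sv_x > 4) & (sv_x < 6)).round(2))
```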

The methods discussed so far are mainly focused on model discovery with varying levels of implicit requirements on the model structure. If we have an a priori model with unknown parameters (as opposed to unknown structure), we can use a nonlinear optimization approach to determine those parameters. Since this is a very common problem, many techniques have been developed to address it; they vary in their balance of local vs. global optimization and in computational efficiency. Although we use genetic algorithms as well as differential evolution (the latter is one of the methods underlying Mathematica's NMinimize[ ] function), our favorite is particle swarm optimization (PSO) due to its algorithmic simplicity, efficiency and robustness.
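A bare-bones particle swarm optimization sketch for this parameter-fitting scenario, assuming a known model form y = a*exp(-b*x) + c and synthetic data; the swarm settings are generic textbook values, not the authors' implementation.

```python
# Minimal PSO: the model structure is known a priori and only the parameters
# (a, b, c) must be determined from data by minimizing the squared error.
import numpy as np

rng = np.random.default_rng(4)
x = np.linspace(0, 5, 100)
y_obs = 2.5 * np.exp(-1.3 * x) + 0.5 + 0.02 * rng.normal(size=x.size)

def sse(params):
    a, b, c = params
    return np.sum((a * np.exp(-b * x) + c - y_obs) ** 2)

# Standard PSO update: each particle is pulled toward its own best position
# and the swarm's best position, with an inertia term on its velocity.
n_particles, n_iter, dim = 30, 200, 3
w, c1, c2 = 0.7, 1.5, 1.5
pos = rng.uniform(0, 5, size=(n_particles, dim))
vel = np.zeros_like(pos)
pbest = pos.copy()
pbest_val = np.array([sse(p) for p in pos])
gbest = pbest[pbest_val.argmin()].copy()

for _ in range(n_iter):
    r1, r2 = rng.random((2, n_particles, dim))
    vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
    pos = pos + vel
    vals = np.array([sse(p) for p in pos])
    improved = vals < pbest_val
    pbest[improved], pbest_val[improved] = pos[improved], vals[improved]
    gbest = pbest[pbest_val.argmin()].copy()

print("estimated (a, b, c):", np.round(gbest, 2))   # close to (2.5, 1.3, 0.5)
```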

There are some other modeling technologies which we do not use since they do not seem to offer a distinctive advantage relative to the ensemble of current favorites. The most well known is fuzzy systems, which attempt to develop descriptive rules similar to human linguistics to describe a system response, e.g., IF (probeATemperature is HIGH) AND (probeBTemperature is LOW) THEN (reactorInterface is MEDIUM). Curiously, these fuzzy rules can produce a crisp, precise and actionable estimate of the response value through de-fuzzification. Unfortunately, as with their crisp-set counterparts (e.g., CART), the interpretability of the developed rules becomes an issue as the number of inputs grows. In this sense, symbolic regression has an advantage in interpretability. Another high-quality modeling technology is GMDH (Group Method of Data Handling), which can achieve automated variable selection analogous to that of stacked analytic nets (although, at least in principle, not with the same nonlinear selection capacity as symbolic regression). The form of the resulting models depends upon the selection of kernels; however, the typical polynomial form does lend itself to some degree of interpretation.
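A toy sketch of the fuzzy-rule idea and centroid de-fuzzification, based on the example rule above; the membership functions, output scale and the second rule are invented purely for illustration.

```python
# Fuzzy memberships for the two probe temperatures feed IF/THEN rules, and
# centroid de-fuzzification turns the fuzzy result into a crisp estimate.
import numpy as np

def tri(x, lo, peak, hi):
    """Triangular membership function."""
    return np.maximum(0.0, np.minimum((x - lo) / (peak - lo), (hi - x) / (hi - peak)))

# Degrees of truth for a particular operating point.
probeA, probeB = 78.0, 22.0
A_high = tri(probeA, 60, 90, 120)
B_low  = tri(probeB, 0, 20, 40)

# Output fuzzy sets for reactorInterface, defined over a 0-100 scale.
z = np.linspace(0, 100, 501)
medium = tri(z, 30, 50, 70)
high   = tri(z, 60, 85, 100)

# Rule 1: IF (probeATemperature is HIGH) AND (probeBTemperature is LOW)
#         THEN (reactorInterface is MEDIUM)
# Rule 2: IF (probeATemperature is HIGH) THEN (reactorInterface is HIGH)
agg = np.maximum(np.minimum(np.minimum(A_high, B_low), medium),
                 np.minimum(A_high, high))

# Centroid de-fuzzification yields a crisp, actionable estimate.
crisp = (agg * z).sum() / agg.sum()
print("crisp reactorInterface estimate:", round(crisp, 1))
```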

It should be re-emphasized that these modeling technologies are complementary and have a synergistic effect when used in concert.