Evolved Analytics' DataModeler

The DataModeler package was developed in the context of industrial data analysis, with the goal of rapid analysis, interpretation, and exploitation of multivariate data sets. Toward this goal, we have built a system whose core exploits advanced and powerful nonlinear data modeling techniques.

The technology has been developed to withstand the challenges of the real world: in addition to handling problems of too much data, too little data, correlated data, or noisy data, DataModeler respects the cost and timeliness constraints associated with model development.

DataModeler provides a broad set of functions aimed at simplifying the model development life cycle, including data exploration, model development, and model management. Since modeling rarely exists in a vacuum, we have also paid particular attention to model deployment, usage, and maintenance; DataModeler thus provides an infrastructure for the entire modeling life cycle, with an emphasis on efficiency, accuracy, robustness, and ease of use.

DataModeler features state-of-the-art algorithms for symbolic regression via genetic programming. Among the many benefits are rapid development of transparent, human-interpretable models and identification of driving variables and variable combinations. Correlated variables are handled naturally, and effective models can be built from large and ill-conditioned data sets. Models can also be equipped with trust metrics that allow their use in dynamic data environments.
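
DataModeler's own functions are not shown here. Purely as an illustration of the general technique of symbolic regression via genetic programming, the following Python sketch uses the open-source gplearn library on synthetic data; the data, parameter settings, and printed expression are illustrative assumptions, not DataModeler behavior.

```python
# Illustration of symbolic regression via genetic programming using the
# open-source gplearn library (not DataModeler's API).
import numpy as np
from gplearn.genetic import SymbolicRegressor

rng = np.random.default_rng(0)

# Synthetic data: the response depends on x0 and x1 only; x2 is a
# correlated distractor variable.
X = rng.uniform(-1, 1, size=(200, 3))
X[:, 2] = X[:, 0] + 0.05 * rng.normal(size=200)
y = X[:, 0] ** 2 - 0.5 * X[:, 1] + 0.05 * rng.normal(size=200)

# Evolve a population of expression trees; the parsimony coefficient
# penalizes complexity so that simple, interpretable expressions win.
est = SymbolicRegressor(
    population_size=500,
    generations=20,
    function_set=("add", "sub", "mul"),
    parsimony_coefficient=0.01,
    random_state=0,
)
est.fit(X, y)

# The fitted model is a readable expression, not an opaque coefficient blob.
print(est._program)   # e.g. add(mul(X0, X0), mul(-0.5, X1))
```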

From the user's perspective, the distinctive advantages of the technology behind DataModeler are:

  • Transparency — symbolic regression expressions are human-interpretable, which helps with buy-in and trust since the user can check that the expression is reasonable and intuitively satisfying. (One DataModeler user, a vice-dean at the University of Houston Law School, used the package to identify changes to liability insurance law that, he contends, will drive $44B worth of inefficiency out of the market with a very simple change. There is no way he could sell the concept if the model were embedded in a neural network; this way, he might have a chance.)
  • Variable selection — the processing automatically identifies dominant variables and variable combinations even when there are many variables and they are correlated (which is a limitation of traditional techniques). Just knowing which parameters matter is important, since it provides focus and facilitates action; a variable-presence sketch follows this list.
  • Robust models — because the model building explores the trade-off between model complexity and accuracy, the data implicitly tells us which models capture the targeted behavior without over-fitting and chasing noise perturbations. As a result, we can select models that we expect to generalize well and to remain accurate and robust under minor changes in the underlying system fundamentals; a complexity-versus-accuracy Pareto sketch follows this list.
  • Trust metric — ensembles of diverse models enable us to identify when the models are encountering novel data, either because operation has moved into new regions of parameter space or because of fundamental changes in the underlying system. Knowing when NOT to trust an empirical model avoids risk and is a breakthrough capability in data modeling; a trust-metric sketch follows this list.
  • Efficient model building — quality model building doesn't require a Ph.D. in statistics. The technology is CPU-intensive, but the human side of the process is greatly simplified. You can quickly assess the modeling potential of any data set; even data-poor, high-dimensional sets with more variables than records can easily be analyzed, with the key variables identified and compact, insightful models developed.
  • Speed — our algorithm advances over the past five years have yielded approximately three orders of magnitude of speed improvement over conventional symbolic regression. Coupled with advances in CPU performance, this has greatly expanded the range of tractable modeling problems and the ease with which they can be attacked.
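
As a rough sketch of the variable-selection idea (not DataModeler's own functions), one can count how often each input variable appears across an ensemble of good models; variables present in most models are the likely drivers, even when the inputs are correlated. The ensemble below is hypothetical.

```python
# Illustrative sketch (not DataModeler code): estimate driving variables by
# counting how often each input appears across an ensemble of good models.
from collections import Counter

# Hypothetical ensemble: each entry lists the variables used by one model
# taken from the accuracy/complexity trade-off front.
ensemble = [
    {"x1", "x4"},
    {"x1", "x4", "x7"},
    {"x1", "x2"},
    {"x1", "x4"},
    {"x4", "x9"},
]

counts = Counter(v for model_vars in ensemble for v in model_vars)
presence = {v: counts[v] / len(ensemble) for v in counts}
for var, frac in sorted(presence.items(), key=lambda kv: -kv[1]):
    print(f"{var}: present in {frac:.0%} of models")
```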
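
The complexity-versus-accuracy trade-off behind the robustness claim is commonly handled by keeping only the Pareto front of candidate models: those for which no other model is both simpler and more accurate. The sketch below applies that selection to hypothetical candidates; it is not DataModeler code.

```python
# Illustrative sketch (not DataModeler code): keep only the models on the
# complexity-vs-error Pareto front, i.e. those not dominated by a model
# that is both simpler and more accurate.

def pareto_front(models):
    """models: list of (complexity, error, label); lower is better for both."""
    front = []
    for c, e, label in models:
        dominated = any(
            (c2 <= c and e2 <= e) and (c2 < c or e2 < e)
            for c2, e2, _ in models
        )
        if not dominated:
            front.append((c, e, label))
    return sorted(front)

# Hypothetical candidate models produced by a symbolic regression run.
candidates = [
    (3, 0.40, "x1"),
    (5, 0.22, "x1*x4"),
    (9, 0.09, "x1^2*x4"),
    (12, 0.10, "x1^2*x4 + x7"),      # dominated: more complex and less accurate
    (15, 0.085, "bloated variant"),  # on the front, but past the knee
]
for c, e, label in pareto_front(candidates):
    print(f"complexity={c:2d}  error={e:.3f}  {label}")
```

In practice one would pick models near the "knee" of the front, where additional complexity stops buying meaningful accuracy.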
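
Finally, a minimal sketch of the ensemble-based trust idea, assuming hypothetical models: if diverse models that agree on the training data begin to disagree on a new input, that input is likely outside familiar territory and the prediction should not be trusted. The models and threshold below are arbitrary illustrations.

```python
# Illustrative sketch (not DataModeler code): use disagreement within an
# ensemble of diverse models as a trust metric. Wide spread suggests the
# input lies outside the region the models were trained on.
import numpy as np

# Hypothetical ensemble: three different models fitted to the same data
# over roughly -1 <= x <= 1.
ensemble = [
    lambda x: x**2,
    lambda x: x**2 + 0.1 * x,
    lambda x: x**2 - 0.05 * x**3,
]

def predict_with_trust(x, spread_threshold=0.2):
    preds = np.array([m(x) for m in ensemble])
    spread = preds.max() - preds.min()
    return preds.mean(), spread, spread < spread_threshold

for x in (0.5, 1.0, 3.0):   # 3.0 lies far outside the fitted range
    mean, spread, trusted = predict_with_trust(x)
    print(f"x={x}: prediction={mean:.2f}, spread={spread:.2f}, trusted={trusted}")
```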