DataModeler Release 8.12 (27 Nov 2012)

We are happy to release DataModeler 8.12. The key highlights are:

  • Archived models are now automatically compressed to reduce file sizes (factors of 25 are a good thing).
  • Implemented support for VariablesToPlot option in a variety of functions — this makes data and performance exploration much cleaner for high-dimensional data sets as we gain insight into the key inputs.
  • Implemented support for display of DataOutliers in a variety of functions. Associated with this support is changing some plotting defaults since the outliers will by default be denoted in red.
  • Greatly improved the behavior of BivariatePlot so that large multi-dimensional data sets can be safely handled without blowing out the memory footprint of the notebook.
  • Implemented a new function, ModelPredictionComparisonPlot, which is useful for looking at prediction performance trajectories relative to the observed behavior. The SortBy, DataVariableReference and DataOutliers options make this a pretty powerful function. Under the hood, it uses SmallPlot so it can efficiently handle large data sets.
  • Modified the MultiCore behavior of SymbolicRegression to allow finer-grain control of the number of cores operating in parallel. (The upcoming Mathematica 9 appears to offer more subkernel licenses so we can peg crunching capabilities of our machines.)
  • Note for Mathematica 9 testers: Mathematica 9 stomps on a couple of DataModeler functions (KernelID and AbsoluteCorrelation), so we have renamed those functions in this release (the old names are still supported in Mma8). Mma9 also changes the documentation system, so the current help is not discoverable; however, model development and existing notebooks should work.

Since it is pretty spiffy, let's quickly look at the ModelPredictionComparisonPlot. One basic use is to look at model performance against time series data. Here we look at data from a distillation column with DataOutliers highlighted.

[Figure: ModelPredictionComparisonPlot]

We can also use SortBy and DataVariableReference to reorder the data or define the x-axis in the plot. Note that in this case the frame labels are automatically adjusted to provide the audit trail information.

[Figure: ModelPredictionComparisonPlot with DataVariableReference]

The official release notes and changes for 8.12:

  • Implemented support in ResponsePlotExplorer for DataVariableLabels. These will now be used for the variable sliders. The default behavior will be to use ColorizeList to color code the ModelVariables used for the slider labels so that they match those used in the graphic labels.
  • Added a Compress option to StoreModelSet which determines whether the archived models should be processed using Compress to reduce file sizes. The default is to compress the files. The complementary RetrieveModelSet function will recognize the archival choice and Uncompress the file, if needed.
  • Fixed a bug in SymbolicRegression wherein MetaVariables were not being supported subsequent to the first of the IndependentEvolutions of each kernel or subkernel. This would manifest itself as a pathology if only a single variable was supplied to the modeling.
  • Modified the InversePatternMapping rules associated with SymbolicRegression. Although the results are functionally similar to those of the previous approach, orders-of-magnitude speed gains were realized when ActiveGenomeSimplification was enabled with a SimplificationFunction setting of Expand or ExpandAll.
  • Modified OptimizeModel and OptimizeModelExpression to accept options for SelectModels if a list of models is supplied.
  • Modified GridTable to support options appropriate for Framed. Now the bounding box can be suppressed by setting FrameStyle to None and the appearance tweaked via other options such as Background, RoundingRadius, FrameMargins, etc.
  • Modified UnivariatePlot, BivariatePlot, DataDistributionPlot, CorrelationChart and CorrelationMatrixChart to support a VariablesToPlot option. This makes data exploration easier as the data modeling progresses and high-priority inputs are identified.
  • Implemented support in a variety of functions for the display and annotation of DataOutliers. These include UnivariatePlot, BivariatePlot, ModelPredictionPlot, EnsemblePredictionPlot, ModelResidualPlot and EnsembleResidualPlot. As part of this change, the plot style for many functions has been changed so that, by default, red is reserved for denoting outliers.
  • Extensive modifications to improve the scaling and functionality of BivariatePlot. Provided support for data subsampling within BivariatePlot so that the n^2 expansion in the graphics does not produce an inordinately large memory footprint if large data sets are supplied. Support for displaying DataOutliers and controlling the VariablesToPlot was also incorporated, along with finer control over the various graphic styles.
  • Modified UnivariatePlot to support a DataVariableReference option which allows the x-axis for the plots to be specified rather than just looking at the data trajectory. This option is useful if the data records are not uniformly sampled.
  • Modified RangeLength to support a specified start value. Thus RangeLength[ list, 0 ] will produce zero-relative indexing rather than the default 1-relative behavior.
  • Modified SymbolicRegression to support TargetColumn option settings of one of the DataVariables or DataVariableLabels. Previously, this had to be specified as an index into one of the columns of the supplied data matrix. Last or First are now also valid settings with Last (i.e., the final data column) continuing to be the default.
  • Changed the default EnsembleDivergenceFunction to (3*StandardDeviation[#]&) from the previous setting of the model extremals. Since we target diverse models in assembling a ModelEnsemble and include “sloppy but good” models as a means to detect extrapolation and changes in the fundamentals of the targeted system, we want to flag the divergence of the models. Given the stochastic nature of the model selection, the envelope of predictions implies too much confidence in the extremal models. Conversely, we want to incorporate them into the assessment so we do not want to use a robust statistic such as MedianDeviation as a foundation. The 95% confidence limit chosen based upon the (nonrobust) StandardDeviation seems like a reasonable compromise given the operational purpose of the EnsembleDivergenceFunction.
  • Modified the default SignificanceLevel for MetaVariableDistributionChart and MetaVariableDistributionTable to be { 10, 0.4 }. This form requires that a MetaVariable be present in at least 10 models of at least one of the IndependentEvolutions and that it be in at least 40% of the models of one of the IndependentEvolutions (not necessarily the same one). Setting the minimum threshold for model count avoids trivial results when only one model from an independent evolution might have passed the selection (e.g., QualityBox) criteria.
  • Implemented a new function, ModelPredictionComparisonPlot, which shows the prediction overlaid on the observed behavior. The SortBy option can be used to sequence the data records of the supplied data sets and DataVariableReference may be used to specify the x-axis.
  • Modified ConfidenceEllipsoid, ConfidenceEllipsoidSelection and ConfidenceEllipsoidSelectionIndices to allow duplicate and constant data columns to be supplied. The supplied data still has to be strictly numeric.
  • Modified the MultiCore option for SymbolicRegression. Mathematica 9 will introduce support for more subkernels so we will be able to tap into the multiple physical and virtual cores (available via hyperthreading). MultiCore may now be specified as None, Automatic, All or an integer ranging up to the $ProcessorCount for the machine. Each additional subkernel will reduce the CPU effort allocated to the individual IndependentEvolutions; however, the search diversity is generally a benefit. Testing indicates that, for a given TimeConstraint, the All setting will approximately halve the number of modeling generations relative to running a single kernel and reduce it by about 25% relative to using half of the available kernels (i.e., the Automatic option setting); hence, it may be desirable to lengthen the TimeConstraint. We also implemented some recovery support for when subkernels spontaneously disconnect and get lost, but we still cannot recover the licenses associated with the lost kernels until Mathematica is restarted. The default setting is Automatic to allow for use of other applications; however, for serious (e.g., overnight) model search, a setting of All would probably be appropriate.
  • The soon-to-be-released Mathematica 9 introduces two new functions which stomp on DataModeler functions. AbsoluteCorrelation is about half the speed of the DataModeler implementation so we have renamed the current version to AbsCorrelation. Similarly, KernelID is an undocumented developer function so we have renamed the DataModeler implementation to KernelNumber. AbsoluteCorrelation and KernelID will continue to work underneath Mathematica 8 (and, possibly, under Mathematica 9).
  • Modified SubSample and SmallPlot to handle supplied lists of Tooltips. If the dataset size exceeds the DataSegments limit, the tooltips will be stripped. Otherwise, the tooltips will be restored after processing. Included in this is a workaround for a bug in ListPlot wherein it does not handle the display of doublets (i.e., two DataOutliers in ModelPredictionComparisonPlot).
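
To make the Compress note for StoreModelSet a bit more concrete, here is a minimal sketch of the round trip using only the built-in Compress and Uncompress functions. It does not call StoreModelSet or RetrieveModelSet, and the list of expressions is a made-up stand-in for a model set, so treat it as an illustration of the idea rather than of what DataModeler does internally.

```mathematica
(* Hypothetical stand-in for a model set: 200 symbolic expressions. *)
(* a, b, c, x1 and x2 are assumed to be unassigned symbols.         *)
models = Table[a[i] + b[i]*x1^i/(c[i] + x2), {i, 1, 200}];

(* Compress turns any expression into a compact string. *)
packed = Compress[models];

(* Uncompress restores the original expression. *)
restored = Uncompress[packed];

(* Check that nothing was lost in the round trip. *)
restored === models

(* Compare the length of the plain InputForm text with the packed string. *)
{StringLength[ToString[models, InputForm]], StringLength[packed]}
```

The actual size reduction depends heavily on how repetitive the stored expressions are, so the ratio seen here says nothing about the ratio for a real archive.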
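
The RangeLength change above can be pictured with a small stand-in. The function rangeLengthSketch below is hypothetical and only mimics the behavior described in the note (an index list with an optional start value); it is not the DataModeler source.

```mathematica
(* Hypothetical stand-in: indices for a list, with an optional start value. *)
(* The default start of 1 mirrors the "1-relative" default described above. *)
rangeLengthSketch[list_List, start_Integer : 1] :=
  Range[start, start + Length[list] - 1]

rangeLengthSketch[{p, q, r}]      (* {1, 2, 3} *)
rangeLengthSketch[{p, q, r}, 0]   (* {0, 1, 2} *)
```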
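
For the EnsembleDivergenceFunction change, it may help to see the new default applied to a handful of numbers. The predictions below are invented, and the extremalSpread function is only one possible reading of the previous "model extremals" setting (maximum minus minimum); that reading is an assumption.

```mathematica
(* Invented predictions from five ensemble models at one data record. *)
predictions = {10.1, 9.8, 10.4, 10.0, 12.5};

(* The new default named in the release notes. *)
divergence = (3*StandardDeviation[#] &);

(* Assumption: the old behavior read as the spread between the extremes. *)
extremalSpread = (Max[#] - Min[#] &);

{divergence[predictions], extremalSpread[predictions]}
```

With these numbers the standard-deviation-based measure comes out wider than the raw spread (roughly 3.3 versus 2.7), and every model contributes to it rather than only the two most extreme ones.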
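
The two-part SignificanceLevel test of { 10, 0.4 } is easier to follow as a predicate. The sketch below assumes a hypothetical input layout, one {modelsContainingVariable, totalModels} pair per independent evolution, and the name significantQ is made up for illustration; the real MetaVariableDistributionChart works from model sets, not from counts like these.

```mathematica
(* counts: one {modelsContainingVariable, totalModels} pair per evolution. *)
(* Both conditions must hold, but not necessarily in the same evolution.   *)
significantQ[counts_List, {minCount_, minFraction_}] :=
  (Or @@ Map[First[#] >= minCount &, counts]) &&
  (Or @@ Map[(Last[#] > 0 && First[#]/Last[#] >= minFraction) &, counts])

(* 12 models in the first evolution, 60% presence in the second. *)
significantQ[{{12, 100}, {3, 5}}, {10, 0.4}]   (* True *)

(* 100% presence, but only because a single model survived selection. *)
significantQ[{{1, 1}, {2, 40}}, {10, 0.4}]     (* False *)
```

The second call shows the trivial case the count threshold is meant to screen out.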
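
As a rough picture of the MultiCore settings, the sketch below maps each setting to a kernel count based only on the description in these notes: None as a single kernel, Automatic as half of the available cores, All as $ProcessorCount, and an integer capped at $ProcessorCount. The helper name coresFor is hypothetical, and the way SymbolicRegression actually launches subkernels may differ.

```mathematica
(* Hypothetical helper: translate a MultiCore-style setting to a kernel count. *)
coresFor[None] := 1
coresFor[All] := $ProcessorCount
coresFor[Automatic] := Max[1, Quotient[$ProcessorCount, 2]]
coresFor[n_Integer?Positive] := Min[n, $ProcessorCount]

(* Counts for each kind of setting on the current machine. *)
coresFor /@ {None, Automatic, All, 2}
```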