Release news and events

DataModeler Release 23.0 (7 March)

Monday, March 7, 2011

Tuning the user experience and supporting exploratory modeling are the current themes, along with the ever-popular documentation development.

  • Introduced a new function, ResponsePlotExplorer, which wraps a Manipulate around the ResponsePlot of a supplied model or ensemble. By default, sliders are shown for the ModelVariables used by the model.
  • Modified BivariatePlot to use a GridTable rather than a GraphicsGrid. This allows labels on the frame of the grid of plots, which is useful if long DataVariableLabels are being used or more than a handful of variables are being plotted. The DataVariableLabels are no longer shown on the diagonal histograms; however, each histogram now has a tooltip giving the data variable label of the column being plotted.
  • Modified CorrelationMatrixPlot to handle missing elements in the nominally numeric data columns. Previously, a non-numeric element would cause any correlations dependent upon that column to appear as blood red. Now a warning message is presented and any non-numeric doublets are deleted from the Correlation calculation.
  • Renamed the ModelingVariables option for VariablePresenceMap to be VariablesToPlot. This makes the option name clearer. Now ModelingVariables is strictly used by SymbolicRegression to define the (possible) subset of supplied DataVariables from which models should be developed.
  • Modified CreateModelEnsemble, AlignModel, AlignModelExpression and EvaluateModelQuality to handle input-response data with a mixture of numerics and Indeterminate values. Now an Indeterminate in the response or one of the ModelVariables used in the model being evaluated will cause that data record to be eliminated from the assessment. This seems reasonable in that a model should not be punished because it was provided with incomplete data.
  • Introduced a new function, ClipUnitStep, which clips the input to the range from zero to one. Although this could be used as a TemplateTopLevel during SymbolicRegression when the targeted response is naturally a fraction, it might be better applied as a post-modeling constraint. The direct approach inflicts a substantial efficiency penalty because a scale-invariant ModelingObjective isn't applicable and, as a result, the scaling and translation factors must be evolved during the modeling rather than being identified post facto. However, devoting some extra time can likely resolve that concern.
  • Modified EnsemblePredictionPlot and EnsembleResidualPlot to eliminate the error bars associated with missing data points since they cluttered the view and didn't add any insight. A warning message as to the extent of the missing data impact is generated.
  • Modified VariablePresenceTable to only show the "Meaning" column if DataVariableLabels are supplied that are different from the DataVariables.
  • Modified VariablePresenceTable and VariablePresenceMap to use an AllModels SelectionStrategy if presented with a ModelEnsemble. Also changed the directly associated option settings to SelectionStrategy -> ParetoFrontSelect and SelectionSize -> 0.5 to focus on the 50% of models closest to the ParetoFront. This differs from the behavior of the VariablePresence function (which uses AllModels); however, for model set exploration, this seems to be a more natural default.
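
The clipping described for ClipUnitStep above can be illustrated outside of DataModeler. The following is a minimal Python sketch, not DataModeler code: the name `clip_unit` is hypothetical, and the example simply applies the clip as a post-modeling constraint to a list of raw predictions.

```python
def clip_unit(x):
    """Clip a number to the closed interval [0, 1] (hypothetical helper)."""
    return max(0.0, min(1.0, float(x)))

# Raw model predictions for a response that is naturally a fraction.
raw_predictions = [-0.2, 0.35, 1.4]

# Apply the clip after modeling rather than inside the model search.
clipped = [clip_unit(p) for p in raw_predictions]
print(clipped)
```

Values already inside the interval pass through unchanged; only out-of-range predictions are altered.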

DataModeler Release 22.0

Friday, January 28, 2011

Release of 28 January 2011 is accompanied by the following minor bug fixes and enhancements:

  • Renamed ModelFitness to ModelQuality throughout DataModeler (e.g., UpdateModelFitness -> UpdateModelQuality, etc.). This is a fairly fundamental name change; however, it makes the meaning of the quality metric apparent to users who are not versed in evolutionary computing terminology.
  • Modified SymbolicRegression to support model development of a column within a dataMatrix. The new form, SymbolicRegression[dataMatrix, colNum, opts] will automatically adjust the ModelingVariables to exclude the targeted column and will also adjust the TargetLabel, TargetColumn, TargetSymbol (all new symbols within DataModeler), PlotLabel, Label and FileName options to denote the targeted column and allow models to be interpreted post-facto. A list of column numbers may also be supplied. This change is useful for exploratory modeling wherein we want to look for possible relationships between variables in a data set. If no column indices are specified, the target will be assumed to be in the last column.
  • Modified SymbolicRegression so that ModelingVariables may be specified as integers which will use the corresponding elements of the (supplied or automatically synthesized) DataVariables. This facilitates interactive modeling and what-if explorations of input-output relationships. A similar capability should be implemented for RequiredVariables, ExcludedVariables, VariablesToPlot, etc. which specify targeted inputs during model selection and exploration.
  • Modified SymbolicRegression so that DataVariables may be supplied as arbitrary strings which will be automatically converted to allowable symbols via CreateDataVariableNames. This makes user-definition of modeling symbols easier since the headers from, for example, ImportDataMatrix may be used directly.
  • Extended SymbolicRegression to allow it to handle non-numeric inputs and outputs. Now non-numeric columns will be automatically deleted, as will data records with non-numeric response values. In the remaining data, any missing inputs will be randomly replaced for each generation by legitimate numeric values. At the end of the modeling, the missing values are replaced by the MedianAverage of the numeric values for that variable for the final ModelQuality evaluation. The stochastic nature of the missing value substitution should drive the developed models towards those inputs which are most complete (since their quality assessment will be the most stable across generations) while allowing processing when data records are precious.
  • Modified CreateModelEnsemble to execute a SelectModels on the supplied model set even if no input-output data sets are supplied. This makes it consistent with the other behaviors in that we can, for example, specify a BoxRegion and build an ensemble from all models within it. The default behavior in this mode is for SelectionStrategy -> AllModels which does differ from the other function form behaviors which would pull the default SelectionStrategy from SelectModels.
  • Introduced a new function, MakeDataNumeric, which searches the supplied data vector or matrix (even if they don't pass a MatrixQ test) and replaces those elements which are not numeric according to the specified ReplacementFunction. The only argument supplied to the ReplacementFunction is the data within that column. This turns out to be surprisingly useful to parse and condition data sets.
  • Modified ParetoFrontPlot and ParetoFrontLogPlot so any options embedded within the ModelPersonality of the First of the supplied model set will be used.
  • Renamed the Sigmoid function to SigmoidDM to avoid a naming conflict with the add-on NeuralNetworks package. Also added a SigmoidUnitStep function and incorporated that within the "Bounds" function pattern definitions for use in BuildFunctionPatterns.
  • Modified Crossover so that if it is supplied with an empty list, it returns an empty list. This eliminates a cryptic error message that could occur if no models were judged as fit during a SymbolicRegression.
  • Fixed a bug in EvaluateModelFitness and UpdateModelFitness in that now the ModelingObjective embedded within the supplied models will be given precedence over the default SymbolicRegression options.
  • Modified SymbolicRegression so that the DataVariableRange and RangeExpansion used in the model development are embedded within the returned models. This means that the models can be more easily explored visually and the nominal operational ranges are preserved. Additionally, any options supplied to SymbolicRegression will be embedded in the ModelPersonality of the returned models, with the exception of an InitialPopulation -> modelList since that can result in huge model storage requirements. (Delayed rules, e.g., InitialPopulation :> modelList, will be embedded.)
  • Fixed a bug in UpdateModelPersonality where if multiple rules were specified in the update list which had the same left-hand side, they would all be embedded in the model. Now only the first rule is preserved.
  • Modified the ExcludedVariables option (used by SelectModels, NicheModels and a variety of other functions) to allow a new setting, Automatic (which is the new default behavior). Now if the RequiredVariables is set to anything other than None (i.e., an explicit list of inputs) then ONLY those inputs will be allowed in the returned models.
  • Modified ParetoFront so that if an empty list is provided, an empty list will be returned. Previously, an error message would be generated and Null returned.
  • Added a ToolTipFunction option to ResponsePlot, ResponseSurfacePlot and DivergenceSurfacePlot. The default Automatic setting will display the ModelExpression as a Tooltip on the PlotLabel if a GPModel is to be plotted and a text indicator that an ensemble is being plotted if a ModelEnsemble is supplied. Previously, the ModelExpression would have been plotted for both, which meant that an entire ensemble of expressions would be displayed, and that was not easily interpreted.
  • Fixed a sin-of-omission in ImportDataMatrix so now previously processed lists of imported data structures (which could arise from a multi-sheet Excel spreadsheet) can be reprocessed directly. Also discovered an undocumented change in the behavior of Import in Mathematica 8 in that blank Excel sheets are returned (they were previously not returned). Empty sheets are automatically deleted by ImportDataMatrix.
  • Migrated DataModeler over to Mathematica 8. Along the way, encountered a bug in the PlotRange -> All behavior. Alas, there is no workaround other than to explicitly set the ranges so we will await a patch from Wolfram.
  • Modified ResponsePlot, ResponseSurfacePlot, DivergenceSurfacePlot, ModelResidualPlot and EnsembleResidualPlot to allow the VariablesToPlot option to be specified in terms of integer indices into the DataVariables rather than requiring explicit entry of the variable symbols.
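
The column-wise replacement idea behind MakeDataNumeric can be sketched in a few lines. This is a Python illustration under stated assumptions, not DataModeler's implementation: the names `make_numeric`, `is_numeric` and `median_of_numeric` are hypothetical, the rows are assumed to have equal length, and the median of the numeric entries is only one possible choice of replacement function. As in the description above, the replacement function receives the data of the column being repaired.

```python
from numbers import Real
from statistics import median

def is_numeric(value):
    """True for real numbers, excluding booleans and NaN."""
    return (isinstance(value, Real)
            and not isinstance(value, bool)
            and value == value)  # NaN is not equal to itself

def median_of_numeric(column):
    """Example replacement function: median of the numeric entries."""
    return median(v for v in column if is_numeric(v))

def make_numeric(rows, replacement=median_of_numeric):
    """Replace non-numeric entries column by column (hypothetical helper)."""
    columns = [list(col) for col in zip(*rows)]
    for col in columns:
        if all(is_numeric(v) for v in col):
            continue
        fill = replacement(col)  # the function sees the whole column
        col[:] = [v if is_numeric(v) else fill for v in col]
    return [list(row) for row in zip(*columns)]

data = [[1.0, "n/a", 3.0],
        [2.0, 5.0, None],
        [4.0, 7.0, 9.0]]
cleaned = make_numeric(data)
print(cleaned)
```

Passing a different `replacement` (for example, one that returns a fixed sentinel) changes the repair policy without touching the traversal.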

DataModeler Release 21.0

Tuesday, September 28, 2010

Release of 28 September 2010 features the following changes to DataModeler:

  • Modified SymbolicRegression to handle missing or non-numeric elements in the supplied input-response data.
  • Introduced a new DataDistributionPlot function which facilitates examination of data sets. This builds upon BoxWhiskerPlot but is a much more intelligent implementation for real-world assessment of multivariate data.
  • Modified EvaluateModel and EvaluateEnsemble so that if non-numerics were in the evaluation data record, the model would still evaluate if those variables were not used in the model. Previously, any non-numeric entry would result in an Indeterminate result. A side-effect of this is that evaluation can be significantly faster for low-dimensional models which are derived from modeling with large numbers of possible input variables.
  • Modified NoisePower and ScaleInvariantNoisePower to allow the use of fractional norms. Fractional norms can reduce the influence of data outliers.
  • Added support for Max, Min, and Clip as model building blocks. These can be easily added by including the string "Bounds" in the BuildFunctionPatterns parameters and supplying that result to the FunctionPatterns option for SymbolicRegression.
  • Removed Sigmoid and RBF from the "PowerMath" definition for BuildFunctionPatterns and moved them into the new "Bounds" predefined set.
  • Modified SymbolicRegression so that supplied options are embedded in the ModelPersonality of returned models. This will be useful when, for example, custom FunctionPatterns are used during the modeling and, as a result, these definitions would be automatically transferred to future model evaluations. This capability could also be used to embed project info into the developed models (e.g., supplying "Project" -> "FormulationDesign" to the SymbolicRegression which would then be available for reference).
  • Modified the ObjectiveOrder option behavior (used by ParetoFrontPlot, ParetoFrontLogPlot, ModelSelectionTable and ModelSelectionReport) to allow integer values to specify the objectives to be displayed. This will be especially useful when looking at results from, for example, CascadeMonitor during a SymbolicRegression since the default behavior is to use a SecondaryModelingObjective during model development which is suppressed as an explicit objective prior to returning the final results.
  • Modified RandomModel and RandomGenome to use the ModelingVariables option with that taking precedence over the DataVariables if there is a conflict.
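
The remark above that fractional norms can reduce the influence of data outliers can be checked numerically. The following Python sketch uses a generic p-norm of the residuals; it is an assumption for illustration only and is not the definition of NoisePower or ScaleInvariantNoisePower. The helper names `norm_error` and `outlier_share` are hypothetical.

```python
def norm_error(residuals, p):
    """Generic p-norm of the residuals; p may be fractional (0 < p)."""
    return sum(abs(r) ** p for r in residuals) ** (1.0 / p)

def outlier_share(residuals, p):
    """Fraction of the summed |r|^p terms contributed by the largest residual."""
    terms = [abs(r) ** p for r in residuals]
    return max(terms) / sum(terms)

# Four well-behaved residuals and one outlier.
residuals = [1.0, 1.0, 1.0, 1.0, 10.0]

share_p2 = outlier_share(residuals, 2.0)    # conventional squared error
share_half = outlier_share(residuals, 0.5)  # fractional norm
print(share_p2, share_half)
```

With the squared error, the single outlier supplies most of the total; with p = 0.5 its share drops below half, so the remaining points carry more weight in the assessment.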

DataModeler Release 20.0

Tuesday, July 20, 2010

Completing the documentation is still the priority of this release of 20 July 2010. A number of changes and enhancements were also made:

  • Heavily modified ResponseSurfacePlot and ResponsePlot. The new ShowDataVariableReference option allows the reference point to be graphically denoted. This is useful if more than one or two variables are in the model, since it shows the value being used as a reference in the other plots. A ShowEnsembleDivergence option was also introduced which comes into play if a ModelEnsemble is being plotted to show the envelope of the EnsembleDivergenceFunction around the predicted response. Displaying the prediction confidence helps to highlight one unique advantage of ModelEnsemble. A number of additional options (SecondaryPlotStyle, Filling, and FillingStyle) were also associated with the functions to facilitate adjusting the graphics appearance.
  • Fixed a bug in BoundedModelResponseQ (which would ripple into RobustModels) wherein specifying a RangeExpansion of the form {minScaleFactor, maxScaleFactor} did not properly thread over all of the DataVariableRange.
  • Fixed a bug in VariablePresenceMap wherein the EnsemblePersonality was not being used in the graphics generation.
  • Fixed a bug in ModelPredictionPlot wherein the EnsemblePersonality wasn't being used in the generated graphics.
  • Modified ConsolidateRules so that rules within rules (e.g., Filling -> {2 -> {3}}) are not promoted in the returned rule set.
  • Hopefully, worked around a random bug in Mathematica where it would not be able to parse the exact same expression that it had done thousands of times before during SymbolicRegression.
  • Made ImportDataMatrix a little smarter in that if a complete filepath was supplied that was valid the target file would be returned — even if the Directory option setting was such that a relative path should be pursued. In any event, a meaningful message will be displayed if the retrieval fails.
  • Modified CreateLinearModel so that any options supplied will be included in the ModelPersonality of the developed GPModel.
  • Fixed a bug in ReplaceModelPersonality wherein if an empty list was provided, it did not recognize that the ModelPersonality should be reset to an empty list.
  • Modified SummaryStatistics so that it can handle non-numeric data. If the supplied data is not strictly numeric, then it will be automatically removed from the data columns and a warning message displayed.
  • Fixed a sin-of-omission bug in MedianAverage where it was not holding its contents unevaluated — which affected plotting in ResponseSurfacePlot. Now the function explicitly looks for numerics, Indeterminate or Missing values and handles the case of a vector of those being supplied.
  • Added an RBF (aka, Gaussian or Radial Basis Function) function and implemented support for it in SymbolicRegression. It is now natively supported by BuildFunctionPatterns, etc.
  • Changed the Background option setting for ParetoFrontPlot, ParetoFrontLogPlot, ModelPredictionPlot, EnsemblePredictionPlot, ModelResidualPlot, ResponsePlot, ResponseSurfacePlot and DivergenceSurfacePlot from White to None since it appears that the previous Mathematica bug which motivated this setting has been corrected.

DataModeler Release 19.0

Tuesday, April 27, 2010

Documentation completion continues to be the priority theme of this release of 27 April 2010; however, a number of changes and enhancements were made in the process:

  • More tweaking of SymbolicRegression options. The new option defaults will run for 50,000 generations unless interrupted by a TimeConstraint and feature continuous innovation over that span.
  • Modified NicheModels by renaming the Split option to NicheBy and introducing a new option, NicheSortBy, which defines a criterion by which to sort the models in the returned niches.
  • Tweaked the performance of SelectModels. Unless variable constraints are being imposed, this should result in a substantial speedup.
  • Added a new PropagationOperator, NichedCrossover, which may be used during SymbolicRegression. This operator partitions and organizes the supplied model set according to the NicheBy and NicheSortBy options for NicheModels and then applies Crossover to the niched model sets.
  • Modified the $ConventionalGP option set for SymbolicRegression to include ModelComplexity as a ModelingObjective. Because a single-objective SelectionStrategy is used, this does not affect the model search; however, it is convenient for comparison with multi-objective strategies.
  • Renamed the SignificanceLevel option for UncorrelatedModels, UncorrelatedVariables and CorrelationMatrixPlot to be the clearer CorrelationThreshold.
  • Renamed the DataSegments option for UncorrelatedVariables back to DataSubsetSize since it was erroneously changed in a previous option renaming exercise. (UncorrelatedModels was never modified.)
  • Added a CreateDataVariableNames function which will convert supplied strings into forms that may be safely used for the SymbolicRegression DataVariables option. If a list is provided, all of the returned symbol strings are guaranteed to be unique.
  • Added a new option, EnsemblePlotStyle, for EnsemblePredictionPlot and EnsembleResidualPlot to specify how the ensemble predictions should be displayed. Previously, the predictions could get visually lost if there were lots of evaluation points to be displayed.
  • Fixed a bug in UncorrelatedVariables wherein if a matrix of constant columns was supplied, errors would be spawned. The new behavior is to return an empty list.
  • Modified UncorrelatedModels so that if a perfect model (zero error residual) is supplied, it will automatically be included in the returned model set. This is a bit of an ad hoc behavior since the correlation of a constant is undefined. However, since UncorrelatedModels is used by the default EnsembleStrategy -> Automatic behavior of CreateModelEnsemble, the previous behavior of deleting perfection seemed inappropriate. Since perfection typically only appears for toy problems, this should not change the behavior for most real-world modeling.
  • Modified MedianAverage to accommodate Indeterminate points in the supplied vector. If the number of Indeterminate values is not too large (i.e., without a risk of intruding into the calculated result), a numeric value will be returned. Otherwise, Indeterminate will still be returned. This should be useful in ModelEnsemble visualization.
  • Modified BoundedModelResponseQ to accommodate the RangeExpansion option along with the DataVariableRange. This makes it consistent with RobustModels in its behavior.
  • Fixed a bug in VariablePresence wherein the "PresencePercent" option setting for PresenceMetric was not handled properly if a single stand-alone model or ensemble was provided.
  • Fixed a bug in VariablePresenceTable wherein only generic string DataVariableLabels would be recognized. Now formatted (e.g. the result from LabelForm) lists may be supplied.
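
The kind of conversion described for CreateDataVariableNames above (arbitrary header strings in, safe and unique names out) can be sketched as follows. This is a Python illustration only: the function name `create_variable_names` and the specific rules (dropping non-alphanumeric characters, prefixing names that start with a digit, appending a counter to duplicates) are assumptions and may differ from what DataModeler actually does.

```python
import re

def create_variable_names(labels):
    """Turn arbitrary label strings into unique alphanumeric names (hypothetical rules)."""
    names = []
    seen = set()
    for label in labels:
        # Keep letters and digits only.
        base = re.sub(r"[^A-Za-z0-9]", "", label)
        # A name must not be empty or start with a digit.
        if not base or base[0].isdigit():
            base = "x" + base
        # Append a counter until the name is unique.
        name = base
        suffix = 2
        while name in seen:
            name = f"{base}{suffix}"
            suffix += 1
        seen.add(name)
        names.append(name)
    return names

headers = ["Temp (C)", "Temp C", "2nd stage flow", "pH"]
names = create_variable_names(headers)
print(names)
```

The uniqueness loop checks every previously issued name, so a generated name such as `TempC2` cannot collide with a later header that happens to read `TempC2`.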