News archive for January 2011

DataModeler Release 22.0

Friday, January 28, 2011

The release of 28 January 2011 includes the following minor bug fixes and enhancements:

  • Renamed ModelFitness to ModelQuality throughout DataModeler (e.g., UpdateModelFitness -> UpdateModelQuality). This is a fairly fundamental name change; however, it makes the meaning of the quality metric apparent to users who are not versed in evolutionary computing terminology.
  • Modified SymbolicRegression to support modeling a chosen column within a dataMatrix. The new form, SymbolicRegression[dataMatrix, colNum, opts], automatically adjusts the ModelingVariables to exclude the targeted column and also adjusts the TargetLabel, TargetColumn, TargetSymbol (all new symbols within DataModeler), PlotLabel, Label and FileName options to denote the targeted column so that models can be interpreted after the fact. A list of column numbers may also be supplied. This change is useful for exploratory modeling, where we want to look for possible relationships between variables in a data set. If no column indices are specified, the target is assumed to be in the last column.
  • Modified SymbolicRegression so that ModelingVariables may be specified as integers, which select the corresponding elements of the (supplied or automatically synthesized) DataVariables. This facilitates interactive modeling and what-if explorations of input-output relationships. A similar capability should be implemented for RequiredVariables, ExcludedVariables, VariablesToPlot, etc., which specify targeted inputs during model selection and exploration.
  • Modified SymbolicRegression so that DataVariables may be supplied as arbitrary strings, which are automatically converted to allowable symbols via CreateDataVariableNames. This makes user definition of modeling symbols easier since the headers from, for example, ImportDataMatrix may be used directly.
  • Extended SymbolicRegression to handle non-numeric inputs and outputs. Non-numeric columns are now automatically deleted, as are data records with non-numeric response values. In the remaining data, any missing inputs are randomly replaced in each generation by legitimate numeric values. At the end of the modeling, the missing values are replaced by the MedianAverage of the numeric values for that variable for the final ModelQuality evaluation. The stochastic nature of the missing-value substitution should drive the developed models towards those inputs which are most complete (since their quality assessment will be the most stable across generations) while still allowing processing when data records are precious.
  • Modified CreateModelEnsemble to execute a SelectModels on the supplied model set even if no input-output data sets are supplied. This makes it consistent with the other behaviors in that we can, for example, specify a BoxRegion and build an ensemble from all models within it. In this mode the default is SelectionStrategy -> AllModels, which differs from the other function forms, where the default SelectionStrategy is taken from SelectModels.
  • Introduced a new function, MakeDataNumeric, which searches the supplied data vector or matrix (even if it does not pass a MatrixQ test) and replaces the non-numeric elements using the specified ReplacementFunction. The only argument supplied to the ReplacementFunction is the data within that column. This turns out to be surprisingly useful for parsing and conditioning data sets.
  • Modified ParetoFrontPlot and ParetoFrontLogPlot so any options embedded within the ModelPersonality of the First of the supplied model set will be used.
  • Renamed the Sigmoid function to SigmoidDM to avoid a naming conflict with the add-on NeuralNetworks package. Also added a SigmoidUnitStep function and incorporated it within the "Bounds" function pattern definitions for use in BuildFunctionPatterns.
  • Modified Crossover so that if it is supplied with an empty list, it returns an empty list. This eliminates a cryptic error message that could occur if no models were judged as fit during a SymbolicRegression.
  • Fixed a bug in EvaluateModelFitness and UpdateModelFitness: the ModelingObjective embedded within the supplied models now takes precedence over the default SymbolicRegression options.
  • Modified SymbolicRegression so that the DataVariableRange and RangeExpansion used in the model development are embedded within the returned models. This means that the models can be more easily explored visually and the nominal operational ranges are preserved. Additionally, any options supplied to SymbolicRegression will be embedded in the ModelPersonality of the returned models, with the exception of an InitialPopulation -> modelList, since that can result in huge model storage requirements. (Delayed rules, e.g., InitialPopulation :> modelList, will be embedded.)
  • Fixed a bug in UpdateModelPersonality where, if multiple rules with the same left-hand side were specified in the update list, all of them would be embedded in the model. Now only the first rule is preserved.
  • Modified the ExcludedVariables option (used by SelectModels, NicheModels and a variety of other functions) to allow a new setting, Automatic (which is the new default behavior). Now if the RequiredVariables is set to anything other than None (i.e., an explicit list of inputs) then ONLY those inputs will be allowed in the returned models.
  • Modified ParetoFront so that if an empty list is provided, an empty list will be returned. Previously, an error message would be generated and Null returned.
  • Added a ToolTipFunction option to ResponsePlot, ResponseSurfacePlot and DivergenceSurfacePlot. The default Automatic setting will display the ModelExpression as a Tooltip on the PlotLabel if a GPModel is to be plotted and a text indicator that an ensemble is being plotted if a ModelEnsemble is supplied. Previously, the ModelExpression would have been displayed in both cases, which meant that an entire ensemble of expressions would be shown and was not easily interpreted.
  • Fixed a sin of omission in ImportDataMatrix so that previously processed lists of imported data structures (which could arise from a multi-sheet Excel spreadsheet) can now be reprocessed directly. Also discovered an undocumented change in the behavior of Import in Mathematica 8: blank Excel sheets are now returned (previously they were not). Empty sheets are automatically deleted by ImportDataMatrix.
  • Migrated DataModeler over to Mathematica 8. Along the way, encountered a bug in the PlotRange -> All behavior. Alas, there is no workaround other than to explicitly set the ranges, so we will await a patch from Wolfram.
  • Modified ResponsePlot, ResponseSurfacePlot, DivergenceSurfacePlot, ModelResidualPlot and EnsembleResidualPlot to allow the VariablesToPlot option to be specified in terms of integer indices into the DataVariables rather than requiring explicit entry of the variable symbols.
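
The missing-value handling described above for SymbolicRegression can be illustrated with a short sketch. It is written in Python purely for illustration and is not DataModeler code: the function names (is_numeric, prepare, fill_randomly, fill_with_median) are hypothetical, a "non-numeric column" is assumed to mean a column with no numeric entries at all, the random replacements are assumed to be drawn from the values observed for that variable, and a plain median stands in for MedianAverage.

```python
import random
import statistics


def is_numeric(value):
    """True for ints and floats that are not NaN; bools are excluded."""
    if isinstance(value, bool):
        return False
    if isinstance(value, (int, float)):
        return value == value  # NaN is the only float not equal to itself
    return False


def prepare(inputs, response):
    """Drop records with a non-numeric response, then drop input columns
    that contain no numeric entries (assumed meaning of 'non-numeric')."""
    kept = [(row, y) for row, y in zip(inputs, response) if is_numeric(y)]
    rows = [row for row, _ in kept]
    ys = [y for _, y in kept]
    n_cols = len(rows[0]) if rows else 0
    keep_cols = [j for j in range(n_cols)
                 if any(is_numeric(row[j]) for row in rows)]
    rows = [[row[j] for j in keep_cols] for row in rows]
    return rows, ys, keep_cols


def _observed_by_column(rows):
    """Numeric values actually present in each column."""
    return [[v for v in col if is_numeric(v)] for col in zip(*rows)]


def fill_randomly(rows, rng):
    """Per-generation view: each missing input is replaced by a random
    draw from the numeric values observed in the same column."""
    observed = _observed_by_column(rows)
    return [[v if is_numeric(v) else rng.choice(observed[j])
             for j, v in enumerate(row)]
            for row in rows]


def fill_with_median(rows):
    """Final-evaluation view: each missing input is replaced by the
    median of the numeric values observed in the same column."""
    medians = [statistics.median(vals) for vals in _observed_by_column(rows)]
    return [[v if is_numeric(v) else medians[j]
             for j, v in enumerate(row)]
            for row in rows]
```

In this sketch, prepare would be called once, fill_randomly once per generation (so an input with many gaps yields a noisier quality assessment from generation to generation), and fill_with_median once for the final quality evaluation. Both fill functions assume that prepare has already removed columns without any numeric entries.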