Release news and events

DataModeler Release 18.0

Tuesday, March 16, 2010

Documentation completion continues to be the priority theme in this release of 16 March 2010. Additional changes and bug fixes are the following:

  • Fixed a bug in Crossover wherein if it was supplied with a list of a single GPModel it would return two models. Now it will return a list of a single model.
  • Modified Crossover so that the ModelAge is based upon the parent that donated the root node rather than the maximum of the two parents. This seemed more reasonable given that the root determines the fundamental structure of the resulting model.
  • Added three new functions: FibonacciSpread, FibonacciSequence and InverseFibonacci. These facilitate generating non-uniform indices which are more heavily represented for smaller numbers. Such indices can be useful for building lag matrices for time series data analysis as well as for generating the ModelAgeBracket boundaries.
  • Added a ModelAgeBracket function which classifies models according to the specified ModelAgeBracketBoundaries. The default boundaries use a FibonacciSpread result (e.g., {0,2,7,30,121,493,2000}) which is useful as a SecondaryModelingObjective for SymbolicRegression to promote continual innovation.
  • Modified the behavior of the SecondaryModelingObjective so that a symbol can be supplied as an option setting. Now, for example, ModelDimensionality will automatically be converted into a functional form, ModelDimensionality[##]&. Of course, None will continue to suppress the use of a secondary objective. ModelAge, ModelAgeBracket, ModelDimensionality and ModelNonlinearity were also modified to accept (and ignore) the spurious model and observed response vectors which would be supplied during SymbolicRegression.
  • Modified CreateModelFromExpression so that any supplied options will be automatically embedded in the ModelPersonality of the returned GPModel(s).
  • Modified UpdateModelPersonality and ReplaceModelPersonality so that multiple options may be supplied to a model or a model set rather than forcing the personality aspects to be enclosed in a list.
  • Changed the default SymbolicRegression options so that now a ClassicGP EvolutionStrategy is used with a SecondaryModelingObjective of ModelAgeBracket. The default ModelAgeBracketBoundaries are FibonacciSpread[2000,7].
  • Fixed a subtle bug in ModelExpression wherein it would take three orders-of-magnitude longer than it should have. Of course, the resulting timing impact rippled into all sorts of other functions.
  • Fixed a bug in OptimizeModel & OptimizeModelExpression wherein a small fraction of model forms would fail if OptimizeIntegers -> True.
  • Modified SubSample and SmallPlot to use DataSegments and DataSegmentFunction options rather than the previous DataSubsetSize and DataSubsetSelectionFunction, respectively, since the convention for their use was at odds with that used by SymbolicRegression (which still uses the old option names). The revised option names more clearly represent the option functionality.
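
As a rough illustration of how an age-bracket classification over Fibonacci-style boundaries could work, the following Python sketch bins a model age against the example boundaries quoted above. It is not DataModeler code: the function name model_age_bracket, the 1-based bracket numbering and the handling of ages that fall on or beyond a boundary are all assumptions made for illustration.

```python
from bisect import bisect_right

# Example boundaries quoted in the notes above for the default FibonacciSpread result.
BOUNDARIES = [0, 2, 7, 30, 121, 493, 2000]


def model_age_bracket(age, boundaries=BOUNDARIES):
    """Return how many boundaries are <= age (a hypothetical bracket convention)."""
    if age < boundaries[0]:
        raise ValueError("age must not be below the first boundary")
    return bisect_right(boundaries, age)


# Young ages are resolved finely; old ages share a few coarse brackets.
brackets = {age: model_age_bracket(age) for age in (0, 1, 2, 6, 7, 120, 2000)}
```

Under this convention the brackets widen as age grows, so the secondary objective distinguishes sharply between young models and only coarsely between old ones.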

DataModeler Release 17.0

Wednesday, March 3, 2010

Documentation completion continues to be the priority theme in this release of 3 March 2010. Additionally, a number of changes and enhancements were made in the process:

  • Implemented support for templating. Towards this end, a TemplateTopLevel option for SymbolicRegression was implemented which facilitates forcing a desired output form — e.g., a conditional, exponential, etc. — in the generated models. The Crossover, MutateSubtree and DepthPreservingSubtreeMutation PropagationOperators were modified to support the preservation of any embedded templates. However, only the top-level pattern is viewed as sacred.
  • Implemented a ResponsePlot function which is similar to ResponseSurfacePlot except that variables are plotted individually as 2D rather than as all 3D pairwise combinations. This is useful to get a quick overview of the response behavior when models or ensembles feature many input variables. As with ResponseSurfacePlot, the settings for the model variables which are not being plotted can greatly affect both the scale and response behavior. To address this, a CommonPlotRange option was introduced which will place all of the synthesized graphics on the same vertical scale.
  • Deleted the ResponseSurfaceParameters option for ResponseSurfacePlot, DivergenceSurfacePlot (and ResponsePlot) with each now using the DataVariableRange and (newly introduced) DataVariableReference option. Valid settings for the DataVariableReference (which specifies the setting for all DataVariables not being modified in a given graphic) are: a specified point, Automatic (which uses the midpoint of the DataVariableRange), Random (which generates a random point in parameter space), ModelMaximum or ModelMinimum. The latter two settings will search for the appropriate extremal response points and use those.
  • Added a CreateLinearModel function which creates a GPModel using the supplied or synthesized BasisSet. This is useful for creating reference conventional models for comparison to SymbolicRegression results.
  • Modified RandomGenomes and RandomModels to speed up model synthesis as well as increase the diversity of models synthesized. Five new options were implemented (TemplateTopLevel, BalancedTemplates, TemplateFunctionCount, TemplateDepth and SynthesisDepth) with AllowAtomicGenomes deleted. MinimumTreeDepth and MaximumTreeDepth now only apply to ExtractGenomeSubtrees.
  • Introduced BuildFunctionPatterns which uses FunctionPatternSynthesisRules (default associated with SymbolicRegression) to generate appropriate input for the FunctionPatterns option for SymbolicRegression. Several pattern sets ("BasicMath", "ExtendedMath", "PowerMath", etc.) have been pre-defined which can easily be mixed and extended to tailor the building blocks to the application characteristics. This is actually a really slick implementation since it allows the user to easily tweak the functional building blocks used in the model development.
  • Fixed a sin-of-omission so now RandomModels and RandomGenomes can handle all valid forms for the PopulationSize option. If a list of integers is supplied, the first number will be used as the targeted size.
  • Removed the Unique option for RandomModels since it was obsolete.
  • Modified the default FunctionPatterns so that summation and multiplication in RandomModels will have at least two arguments (and up to a MaximumArity of 5). Previously, it was easier to create models which had introns (non-functional genetics) due to only having a single argument with summation and multiplication.
  • Modified RemoveModelScaling so that any ModelingObjectiveNames in the ModelPersonality are removed along with the ModelFitness being reset.
  • Fixed a bug introduced in Release 16.0 in RandomModels wherein the supplied variables were not properly weighted for selection during model synthesis. This would have been an issue for modeling systems with large numbers of input variables.
  • Uncovered a bug in SymbolicRegression wherein the ModelingVariables were all treated as having equal weights for RandomModel synthesis independent of any individual or class weighting.
  • Fixed a bug in MutateSubtree and DepthPreservingSubtreeMutation wherein the ModelFitness in the modified models was not being reset to Indeterminate.
  • Fixed a bug in CreateFittedEnsemble wherein SelectModels option defaults associated with CreateFittedEnsemble were not being passed through properly.
  • Fixed a bug in AlignModelExpression wherein option settings embedded in the ModelPersonality were not being used. This sin-of-omission rippled into other functions; however, it did not affect SymbolicRegression (where the model alignment typically occurs).
  • Modified the ParetoGP EvolutionStrategy so that both the archive and the final population are presented to the ResultsSelectionStrategy. This is important if a SecondaryModelingObjective has been used since moving to only considering the ModelingObjective can mean that some of the long tail models (e.g., overly complex low-dimensional models if a ModelDimensionality was used as the secondary objective) would not be of user interest.
  • Changed the default ResultsSelectionStrategy to return the 50% of developed models closest to the ParetoFront from the final population (and archive). This should return the entire archive used by ParetoGP along with some other models.
  • Changed the default DataSubsetSelectionFunction to be RandomSample rather than RandomKSubset since the two are equivalent and RandomSample is about three times faster.
  • Renamed the NumberOfCascades option for SymbolicRegression to be CascadesPerEvolution. This makes its name explicit as well as consistent with the related GenerationsPerRun, RunsPerCascade and IndependentEvolutions options.
  • Fixed a bug in MergeInputResponseData wherein if an atomic structure was supplied which did not pass an AtomQ test (e.g., \[Pi]/2), the supplied components would not be properly merged.
  • Fixed a bug in AbsoluteCorrelation wherein symbolic input would return Indeterminate even though those symbols (e.g., \[Pi]) would evaluate to being a real value. The revision also makes the implementation even faster, relative to the standard Correlation function, than it was before.
  • Implemented support for TerminalSet -> None in SymbolicRegression, RandomModels and RandomGenomes. This facilitates modeling when only the variables are to be used in modeling.
  • Modified PolynomialBasisSet to allow PolynomialOrder, IncludeCrossTerms and IncludeConstantBasis to be supplied as options. Added the new symbols into the package documentation.
  • Modified ModelVariables (and VariablePresence when PresenceMetric -> Variables) to return the variables in the same order as produced by ModelInputVariables. This ripples into a number of other functions; however, the benefit is that model variables will be presented in the "natural order" defined by the input.
  • Implemented a Sigmoid function of the form x/(1+Abs@x). The definition of the Sigmoid is subject to change (e.g., to x/(1+x^2) or the classic (1-E^-x)/(1+E^-x)); however, this seems like a reasonable choice for a less discontinuous version of the UnitStep function.
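
To make the difference between the three candidate Sigmoid forms concrete, here is a small Python sketch that evaluates each of them. The function names are invented for this comparison and are not part of DataModeler; only the three formulas come from the notes above.

```python
import math


def sigmoid_abs(x):
    """The adopted form: x / (1 + |x|)."""
    return x / (1.0 + abs(x))


def sigmoid_square(x):
    """Alternative mentioned above: x / (1 + x^2)."""
    return x / (1.0 + x * x)


def sigmoid_classic(x):
    """Classic alternative: (1 - e^-x) / (1 + e^-x), which equals tanh(x / 2)."""
    return (1.0 - math.exp(-x)) / (1.0 + math.exp(-x))


# Tabulate the three forms over a few sample points.
samples = [(x, sigmoid_abs(x), sigmoid_square(x), sigmoid_classic(x))
           for x in (-10.0, -1.0, 0.0, 1.0, 10.0)]
```

The first and third forms are monotonic and saturate toward -1 and +1, whereas x/(1+x^2) peaks at 0.5 for x = 1 and then decays back toward zero, which may matter if a step-like response is the goal.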

DataModeler Release 16.0

Tuesday, December 22, 2009

Documentation completion continues to be the priority theme in this release of 22 December 2009; however, a number of changes and enhancements were made in the process:

  • Modified the default interpretation of SymbolicRegression building blocks (DataVariables, FunctionPattern or TerminalSet) if a list without an associated class weight is supplied. Previously, it was assumed that each of the list elements would have an element weight of one for the roulette wheel assembly of RandomModels. However, it is quite convenient and attractive to simply supply a list of labels for DataVariables which results in directly interpretable models without the need for DataVariableLabels. Unfortunately, for reasonably multivariate data sets this would result in simplistic models being synthesized since the class of DataVariables would be heavily overweighted. Now if a list of variables is supplied, it is assumed that the supplied components should be normalized so that the set has a class weight of one which is more likely the desired behavior.
  • Modified the default Options settings for SymbolicRegression so that a single thread is used for RunsPerCascade. The previous {3, 1, 1} default would have three runs (nominally, of ten generations) execute in parallel for the first cascade and then merge these results and continue with a single model search thread. This strategy helps to kick-start the model search. The problem is that for very short SymbolicRegression runs the search spends time laying a foundation which is not exploited. To compensate, the PopulationSize was changed from a flat 300 to {1000, 500, 300} so the first two generations feature larger population sizes to maximize the influx of high-quality genetics.
  • Extended GridTable to include a TableDirections option. This allows the supplied data to be transposed, even if the data is ragged. (Hence, in this case, GridTable has more functionality than TableForm.)
  • Modified SummaryStatistics to return an appropriately sized (according to the supplied SelectionFunction) vector or matrix if an empty list or matrix is supplied. This is a bug fix since, previously, an arcane error message would be returned.
  • Removed an implicit requirement of SelectModels (which would ripple into NicheModels) that the supplied models have the ModelFitness evaluated.
  • Fixed a bug in RescaleData wherein symbolic numerics (e.g., \[Pi]) would not be recognized as valid rescale ranges.
  • Fixed a bug in ModelExtrema, ModelMinimum and ModelMaximum where duplicate extrema would sometimes not be deleted if the option Unique -> True was set.
  • Since we generally want to use the results from RetrieveModelSets as a group rather than individual file results, a MergeModelSets option was enabled (default associated with StoreModelSet for consistency) with the new default behavior being True. This avoids the need to post-process the retrieved results with explicit application of the MergeModelSets function. Setting this option to False will restore the previous behavior.
  • Added an Input option to DataOutlierTable to allow suppression of the input data record display. This is useful in situations where many input variables are in the source data and simply knowing the index of the offending data record and its degree of strangeness is sufficient.
  • Fixed a bug in ModelInputOutputMatrix wherein if a model only had a single variable and the input was supplied as a vector the supplied form was not recognized as being valid. Now the function checks the alignment of the evalPts dimensionality with the DataVariables embedded in the model(s) to perform an appropriate interpretation.
  • Added a ModelVariables function which is equivalent to ModelSubspace. The new name is more appropriate for the typical user to describe the functionality.
  • Corrected a number of bugs in ModelNonlinearity and modified the implementation so that the returned value is normalized by the range of the sum of all variables to provide a constant reference for model comparison.
  • Modified ModelPredictionPlot and EnsemblePredictionPlot so that if a list of models is displayed, any supplied PlotLabel will only apply to the graphics grid rather than the individual models. Also, any AspectRatio settings will now only apply to the individual plots.
  • Modified RemoveModelScaling so that it can handle a ModelEnsemble being supplied. If an ensemble is supplied, it is simply returned unchanged.
  • Implemented a ConvertToFittedModel function which returns a FittedModel data structure. This can be used directly by the built-in Mathematica statistics functions introduced in Mathematica 7.
  • Deleted the ModelRegressionReport function since the foundation Statistics`LinearRegression package had been superseded as part of the changes in version 7 and the basic functionality can be achieved by using the result from ConvertToFittedModel.
  • Fixed a bug in ModelResidualPlot wherein options embedded in the ModelPersonality were not being used if only a single model was being plotted.
  • Fixed a bug in ModelSelectionReport and ModelSelectionTable wherein DataVariableLabels with an embedded FontColor would cause errors. Now both Hue and RGBColor formatting can be handled.
  • Extended ModelTreePlot to allow Automatic as well as a pure function to be supplied for PlotLabel. Also implemented support for a ToolTipFunction for the individual tree plots.
  • Fixed a bug in MutateSubtree and DepthPreservingSubtreeMutation wherein ModelInputVariables embedded in the supplied models would not be used in the new genetics creation. This did not affect SymbolicRegression; however, it would affect standalone use of the functions.
  • Modified OptimizeModel and OptimizeModelExpression to add another valid form for the OptimizeIntegers option. Now All will bring powers and square-roots into play, True (the default behavior) will handle integers which are converted to reals via N and False will leave integers alone and focus on the embedded reals. Mathematica has problems with FindFit so the optimizations are checked and, if the model is pathological, a warning message is generated and the original model returned rather than the pathological one produced by Mathematica's optimization.
  • Discovered that OrderedQ has a strange behavior if supplied numerics which are not reals, integers or rationals. This would cause problems when, for example, something like Sqrt[2] was supplied to ParetoFront since, rather than being sorted between 1.4 and 3/2, it would be placed after all of the reals, integers and rationals, which causes a bit of a problem if there is an implicit assumption that Sort orders the numbers by value. Although this problem has been fixed for ParetoFront and related functions and was not a problem for the data typically provided to a SymbolicRegression, it is a potential issue across all Mathematica algorithms (including, of course, those in DataModeler).
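
The class-weight change described in the first item of this release can be illustrated with a small Python sketch of roulette-wheel class probabilities. This is not DataModeler code: the function name, the class names and the assumption that the function and terminal classes each carry a class weight of one are hypothetical, chosen only to show why a long unweighted variable list used to dominate model synthesis.

```python
def selection_probabilities(classes, normalize_unweighted):
    """Roulette-wheel probability of drawing from each building-block class.

    classes maps a class name to either a number (an explicit class weight)
    or a list of elements that was supplied without a class weight.
    """
    totals = {}
    for name, spec in classes.items():
        if isinstance(spec, list):
            # Old behavior: every element weighs one, so the class weighs len(spec).
            # New behavior: the elements are normalized to a class weight of one.
            totals[name] = 1.0 if normalize_unweighted else float(len(spec))
        else:
            totals[name] = float(spec)
    grand_total = sum(totals.values())
    return {name: weight / grand_total for name, weight in totals.items()}


# Hypothetical setup: 50 labeled variables, unit-weight function and terminal classes.
classes = {
    "variables": [f"x{i}" for i in range(1, 51)],
    "functions": 1,
    "terminals": 1,
}
old = selection_probabilities(classes, normalize_unweighted=False)
new = selection_probabilities(classes, normalize_unweighted=True)
```

With these assumed weights the variable class is drawn 50 times out of 52 under the old interpretation but only one time in three under the new one, which is consistent with the simplistic models described above.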

DataModeler Release 15.0

Tuesday, October 27, 2009

The main thrust in this release of 27 October 2009 has been fleshing out the documentation. Thanks to WRI squashing some bugs in Wolfram Workbench, the ugly help formatting should be resolved! Additionally, some bug fixes and other changes have been made:

  • Fixed a bug in ParetoFrontLogPlot where if only one FitnessOrder was specified of the ModelingObjectiveNames, the resulting plot would not be properly ordered. Also modified ParetoFrontPlot as well as ParetoFrontLogPlot to sort models from smallest to largest ModelingObjectives value — under the presumption that smaller values are more interesting.
  • Renamed FitnessOrder to ObjectiveOrder to make its meaning clearer to the typical (non-evolutionary computing oriented) user.
  • Fixed a bug in ProjectFilenames wherein TimeStamp was not handled properly if set to anything other than a string pattern. This would ripple into ImportFromFile. Now any option setting which is legitimate for ExportToFile will work and supplied symbols, except for a setting of None, will map into a generic wildcard, "*".
  • Modified RetrieveModelSets and RetrieveModelSetFilenames to allow a string pattern to be explicitly supplied as an argument. This should make things nicer for interactive model retrieval when StoreModelSet has been used with the ProjectName and FileName changed from the default so the user doesn't have to remember that the name synthesis follows the convention ProjectName_FileName_TimeStamp.ModelSetSuffix and have to set the options precisely right. The previous implementation continues to be valid.
  • StoreModelSet now supports the TimeStamp option settings of All, Year, Month, Day, Hour, Minute, Second as well as subsecond resolution. Automatic continues to map into second resolution, None suppresses the timestamp and an explicit string may also be supplied. This makes its behavior consistent with ExportToFile. RetrieveModelSets and RetrieveModelSetFilenames support the new options with symbols other than None mapping into a generic wildcard.
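
The naming convention ProjectName_FileName_TimeStamp.ModelSetSuffix and the wildcard mapping described above can be sketched in Python as follows. This only illustrates the idea and does not reflect DataModeler's implementation: the function name, the AUTOMATIC sentinel, the "mset" suffix and the way a None timestamp drops the trailing underscore are all assumptions.

```python
from fnmatch import fnmatchcase

# Stand-in for a symbolic TimeStamp setting such as Automatic (hypothetical).
AUTOMATIC = object()


def model_set_pattern(project, file_name, time_stamp, suffix):
    """Build a ProjectName_FileName_TimeStamp.ModelSetSuffix glob pattern.

    None suppresses the timestamp, a string is used as given, and any other
    value (standing in for a symbol) maps into the generic wildcard "*".
    """
    if time_stamp is None:
        stem = f"{project}_{file_name}"
    elif isinstance(time_stamp, str):
        stem = f"{project}_{file_name}_{time_stamp}"
    else:
        stem = f"{project}_{file_name}_*"
    return f"{stem}.{suffix}"


pattern = model_set_pattern("Demo", "Run1", AUTOMATIC, "mset")
matches = [name for name in ("Demo_Run1_20091027.mset", "Other_Run1_20091027.mset")
           if fnmatchcase(name, pattern)]
```

Supplying an explicit string pattern, as the revised RetrieveModelSets allows, corresponds to bypassing this synthesis and matching against the stored filenames directly.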

DataModeler Release 14.0

Tuesday, August 11, 2009

Some bug fixes and other changes accompany this release of 11 August 2009:

  • Worked around a bizarre bug in Mathematica pattern matching which was causing the kernel to crash if a data set larger than ~15,000 records was supplied to AlignModel. Since the default behavior is to have models automatically aligned to the data as the final stage of a SymbolicRegression, this would cause the modeling to fail as a side-effect.
  • Fixed a bug wherein Attributes for plotting functions were being inadvertently and improperly set. The result was that SetOptions would not work properly for some graphics functions.
  • Defined a new constant, $DataModelerExampleDataDirectory, which returns the path to the top-level /ExampleData folder within the DataModeler package. This will allow examples to easily be cut-and-pasted out of the help browser and not require specific relative paths from the EvaluationNotebookDirectory.
  • Changed the default EvolutionStrategy for SymbolicRegression to ParetoGP (from ClassicGP). The default ArchiveSize was also lowered and fewer initial RunsPerCascade are executed. This should better support the situation where users use a very short TimeConstraint rather than using one of the pre-defined option sets like $ParetoGPQuick.