Release news and events

New office in Belgium almost ready

Monday, March 24, 2014

We haven't been posting many updates lately - busy, busy, busy designing cool new products. This year, an awesome 100% GUI-based interactive data analysis tool and at least one MATLAB toolbox on information content estimation will be out. The pipeline for 2015-2016 is full and we are happy.

In February we opened a second office in Belgium at Antoine Coppenslaan 27/11 in Turnhout!

Only whiteboards, some artwork, a few cables and a couple more chairs are separating us from a (grand) opening party. Our dreams to create an open, inspiring and high-energy environment for effective collaboration, boundary-destroying discussions, and hard-core creativity are coming true.

The location is very convenient: parking is never a problem, the train station is a 5-minute walk away, and Turnhout center with its nice pubs and restaurants is only a leisurely 10-minute walk. The building is part of the Anco Torens complex - a renovated pasta factory turned into a fancy residential and office quarter. A beautiful harbor is only 50 steps away - which makes us easy to reach by boat, bike, car, train, and bus.

DataModeler Release 8.20 (March 2014)

Wednesday, March 12, 2014

We have a very nice update for our users!

This release features performance and option tuning to better exploit the OptimizeLinearModel capability introduced in the previous release as well as a number of enhancements to improve the results display and ease-of-use. The complete release notes are available at the end of this post. We also think we might have worked around the infamous Mathematica tooltip bug wherein tooltip content stops being properly displayed. Since DataModeler makes extensive use of tooltips to layer information content and the only known recovery for this WRI bug is to restart Mathematica, this hopefully resolves a major, albeit randomly occurring, annoyance.

The DataSummaryTable is a nice addition for quickly assessing a data set. Of course, tooltips are used to layer information, and it offers an alternate view to that of the DataDistributionPlot, providing a visual on data type, distribution and consistency. The foundational thinking underlying this perspective will enable some impressive capabilities in our upcoming releases.

Another useful new function is the ParetoFrontContextPlot and its sister, the ParetoFrontContextLogPlot. These are useful for exploring a developed model set. To illustrate, suppose we were modeling the literacy fraction of countries around the world and wanted to look at the models which comprise exactly three variables, one of which has to be femaleLifeExpectancy. (ParetoFrontPlot and ParetoFrontLogPlot now also support SelectModels options.)
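The selection described above (exactly three variables, one of them femaleLifeExpectancy) can be sketched outside of Mathematica. The Python below is only an illustration of the filtering logic: the model records, the other variable names and the select_models helper are all hypothetical and are not DataModeler's SelectModels API.

```python
# Hypothetical model records: each one lists the input variables it uses.
models = [
    {"name": "m1", "variables": {"femaleLifeExpectancy", "gdpPerCapita", "birthRate"}},
    {"name": "m2", "variables": {"femaleLifeExpectancy", "gdpPerCapita"}},
    {"name": "m3", "variables": {"maleLifeExpectancy", "gdpPerCapita", "birthRate"}},
    {"name": "m4", "variables": {"femaleLifeExpectancy", "infantMortality", "birthRate"}},
]


def select_models(models, n_variables, required):
    """Keep models using exactly n_variables inputs, one of which is `required`."""
    return [
        m for m in models
        if len(m["variables"]) == n_variables and required in m["variables"]
    ]


selected = select_models(models, 3, "femaleLifeExpectancy")
print([m["name"] for m in selected])  # ['m1', 'm4']
```

The selected models would then be highlighted against the quality of all the other models, which is the context the new plots provide.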

The default location for archived SymbolicRegression models is a subfolder, DataModelerModelSets, co-located with the evaluating notebook. As a general rule, we want to run lots of IndependentEvolutions and exploit the evolutionary talent in generating and exploring hypothesized model structures which implies that we want to generate and archive (StoreModelSet set to True) many model sets. Storing them in a subfolder helps to avoid cluttering up the main directory which, typically, contains analysis notebooks as well as the foundation data sets.
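As a rough illustration of this storage layout, the Python sketch below creates the DataModelerModelSets subfolder on demand and, when searching, looks there first and then in the notebook folder itself, mirroring the backwards-compatible lookup mentioned in the release notes. The helper names and the ".modelset" file extension are assumptions made for the example; DataModeler's actual file naming is not described in this post.

```python
import tempfile
from pathlib import Path

ARCHIVE_SUBFOLDER = "DataModelerModelSets"  # folder name taken from the text


def archive_directory(notebook_dir):
    """Return the archive subfolder, creating it if it does not exist yet."""
    target = Path(notebook_dir) / ARCHIVE_SUBFOLDER
    target.mkdir(parents=True, exist_ok=True)
    return target


def find_archives(notebook_dir, pattern="*.modelset"):
    """List archives in the subfolder first, then legacy ones beside the notebook.

    The '*.modelset' pattern is a hypothetical extension used for illustration.
    """
    notebook_dir = Path(notebook_dir)
    found = []
    for location in (notebook_dir / ARCHIVE_SUBFOLDER, notebook_dir):
        if location.is_dir():
            found.extend(sorted(location.glob(pattern)))
    return found


with tempfile.TemporaryDirectory() as notebook_dir:
    new_dir = archive_directory(notebook_dir)
    (new_dir / "run2.modelset").write_text("new-style archive")
    (Path(notebook_dir) / "run1.modelset").write_text("legacy archive")
    found = [p.name for p in find_archives(notebook_dir)]
    print(found)  # subfolder hits first, then legacy hits
```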

We have also shifted the foundations of DataModeler’s GridTable which adds some new capabilities, if appropriate. Many DataModeler functions exploit GridTable so the changes ripple throughout the analysis system. One beneficiary is the ModelSelectionReport which now automatically wraps the ModelExpressions to fit the notebook document width.

As itemized below, there have been quite a few other enhancements since the last release a couple of months ago. We have some really slick stuff in the pipeline which should be released later in the spring.

The official release notes and changes for 8.20:

The big new function in this release is the DataSummaryTable which is quite nice for getting the zen of a data set. The main themes of this release are performance tuning (especially if OptimizeLinearModel is enabled during SymbolicRegression), ease-of-use (e.g., ParetoFrontPlot now accepts SelectModels options) and refining the display of results (for example, the modifications to GridTable ripple into the display of the ModelSelectionReport and the inclusion of a NumberFormatting option for ModelExpression as well as GridTable).

  • Added a DataSummaryTable function to explore the columns in a supplied data set. This is a nice complement to DataDistributionPlot to provide a very visual assessment of the columns in the supplied data set.
  • Added ParetoFrontContextPlot and ParetoFrontContextLogPlot which facilitate the comparison of selected models within the context of the ModelQuality of other models.
  • Modified the default ModelingObjective for SymbolicRegression to reward minimizing the number of variables used as well as model simplicity. Enabling OptimizeLinearModel for SymbolicRegression did not impose enough of a penalty on the inclusion of additional variables in models so this has been addressed.
  • Although DataModeler makes extensive use of tooltips to layer information, tooltips in Mathematica are pretty fragile. One manifestation is that Mathematica would spontaneously decide that only a limited region of a tooltip should be displayed — which greatly decreases the functionality of the tooltips. Previously, the only known recovery plan was to restart Mathematica — which lacks a certain elegance. However, we now suspect that introducing a TooltipDelay improves the robustness. Hence, delays of between a twentieth and quarter of a second have been introduced in data and model review functions of DataModeler.
  • Tuned the SymbolicRegression algorithm to improve the performance when OptimizeLinearModel is enabled. It now executes about half the number of generations per unit time as when it is disabled.
  • Modified ParetoTourneySelect so that if a fractional ParetoTournamentSize is provided, it will map into a minimum of two contestants. Previously, a single-competitor tournament would have been possible which would have been equivalent to a RandomSelect strategy.
  • Modified AgeModel, RearrangeModelQuality and UpdateModelPersonality to make them (much) more efficient. The functionality remains the same.
  • Modified ParetoFrontPlot and ParetoFrontLogPlot to avoid plotting overlays of the ParetoFront points. Also reduced the ToolTipLimit to 750 so that if more than this number of models are supplied tooltips will only be shown for those models on the ParetoFront.
  • Modified SymbolicRegression (as well as StoreModelSet, RetrieveModelSets and RetrieveModelSetFilenames) to archive models into the DataModelerModelSets directory within the EvaluationNotebookDirectory folder. If necessary, this directory will be created. For backwards compatibility, the EvaluationNotebookDirectory will also be searched for archived models for retrieval.
  • Modified ModelExpression to support a NumberFormatting option. This allows more compact model representations with more clarity of the developed model forms. This change ripples down to the myriad functions which use ModelExpression.
  • Migrated GridTable to be based upon Grid rather than the lower-level GridBox foundation. This changes some of the applicable options. The most important of these is ItemSize which allows a width for a column to be specified with the cell content automatically adjusted. A NumberFormatting option was also added to the mix for convenience of formatting top-level real values and ItemStyle is now the mechanism to control the element formatting.
  • Modified the default option settings for a variety of functions to support the new GridTable foundations.
  • Microsoft Windows allows users to run files directly from within a zipped archive. Unfortunately, Mathematica is not aware of the file structure within these archives and, as a result, the DataModeler installer is unable to install the package. The InstallDataModeler.nb installer has been modified to detect this situation and warn the user that it cannot install the package.
  • Fixed a bug in UnivariatePlot wherein the color was not being plotted properly if a vector rather than a matrix was being plotted.
  • Modified ModelPredictionComparisonPlot to format and automatically append a color key to the PlotLabel if a string is supplied. The color will be automatically matched to that specified for the predicted, observed and outlier data points.
  • Modified ConsolidateRules to only return Rule or DelayedRule elements at the top-level.
  • Modified ParetoFrontPlot and ParetoFrontLogPlot to accept SelectModels options. Although the default for SelectModels is to take the 50% of models closest to the ParetoFront whenever a QualityBox is specified, the default behavior for these functions is to show AllModels within the QualityBox.
  • Modified SymbolicRegression to automatically exclude constant data columns from the modeling since they do not provide any information content.
  • Fixed a bug in CorrelationChart wherein if two integer columns were being compared, the correlation would be expressed as a full-precision fraction rather than as a real-valued numeric.
  • Fixed a bug in LabelForm wherein tooltips were not being handled properly.
  • Modified UpdateModelQuality and EvaluateModelQuality to also accept input-output data as a list rather than separate entries. This matches the input form used by UpdateModelQualityVsMultipleDataSets and EvaluateModelQualityVsMultipleDataSets.
  • For at least 7 years (first reported by Evolved Analytics in 2007), Mathematica's ListPlot and ListLogPlot have been unable to plot two points which have tooltips. We have trapped this situation for ParetoFrontPlot and ParetoFrontLogPlot and implemented a workaround.
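To make the ParetoTourneySelect item above concrete, here is a minimal Python sketch of mapping a fractional tournament size onto a contestant count of at least two. Treating the fraction as a share of the population and rounding up are assumptions made for this example; the release notes only state that the result is a minimum of two contestants.

```python
import math
import random


def tournament_size(setting, population_size):
    """Map a tournament-size setting to a contestant count of at least two.

    Assumption for this sketch: a value strictly between 0 and 1 is read as
    a fraction of the population; anything else is taken as a plain count.
    """
    if 0 < setting < 1:
        size = math.ceil(setting * population_size)
    else:
        size = int(setting)
    # Never fewer than two contestants, never more than the population holds.
    return min(population_size, max(2, size))


def draw_contestants(population, setting, rng=random):
    """Pick the contestants for one tournament without replacement."""
    return rng.sample(population, tournament_size(setting, len(population)))


population = list(range(50))
print(tournament_size(0.01, len(population)))  # 2
print(tournament_size(0.5, len(population)))   # 25
print(len(draw_contestants(population, 0.01)))  # 2
```

With a single contestant the "tournament" has no competition at all, which is why a floor of two matters: it keeps selection pressure from degenerating into a purely random pick.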

DataModeler Release 8.16 (1 March 2013)

Friday, March 1, 2013

We have a new (207 MB) DataModeler release available for your retrieval.

The main focus is two new functions, closing an evolutionary loophole and addressing a potential Mathematica pathology. The new functions are DataCompletenessMap and DataCompletenessPlot which let us easily look at the prevalence and distribution of nonnumerics in data which we want to model. Of course, we layer the information presented with intelligent and easily specified tooltips and make it easy to generate quality and insightful graphics.

The evolutionary loophole which has been closed is in the handling of nonnumerics. Previously, we had a NumericColumnThreshold which would restrict model development to those inputs which contained at least a specified fraction of numeric entries (and only the numeric predictions would be considered in evaluating model quality). However, if we allowed more than the default fraction of missing elements, then the model search algorithm could combine multiple inputs and the net result would be that models would be evaluated considering an even smaller fraction of the data records. To address this, we introduced a NumericPredictionRequirement which required models to evaluate to a numeric for a minimum fraction of the data records — otherwise, the model would be rejected even though the inputs were individually satisfying the completeness requirement.
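The loophole is easy to reproduce with a toy example. In the Python sketch below, three hypothetical input columns are each 70% numeric, so each one passes a 60% column threshold on its own, yet a model combining all three evaluates to a number on only one record in ten. The thresholds, data and model are invented for illustration and are not DataModeler defaults.

```python
import math

NAN = float("nan")

# Three hypothetical input columns over ten records; each is 70% numeric.
x1 = [1.0, 2.0, 3.0, NAN, NAN, NAN, 7.0, 8.0, 9.0, 10.0]
x2 = [NAN, NAN, NAN, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
x3 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, NAN, NAN, NAN, 10.0]


def numeric_fraction(values):
    """Fraction of entries that are actual numbers (not NaN)."""
    return sum(not math.isnan(v) for v in values) / len(values)


def model(a, b, c):
    """A toy model; a NaN in any input propagates to the prediction."""
    return a + b * c


column_threshold = 0.6        # illustrative per-column completeness threshold
prediction_requirement = 0.6  # illustrative minimum numeric-prediction fraction

columns_pass = all(numeric_fraction(c) >= column_threshold for c in (x1, x2, x3))
predictions = [model(a, b, c) for a, b, c in zip(x1, x2, x3)]
prediction_fraction = numeric_fraction(predictions)
accepted = columns_pass and prediction_fraction >= prediction_requirement

print(columns_pass)         # True: every column is individually complete enough
print(prediction_fraction)  # 0.1: only one record has all three inputs
print(accepted)             # False: the model is rejected
```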

The pathology issue is that Mathematica can consume all available memory/virtual memory/disk space when doing very difficult model searches via very long modeling runs with transcendental functions. Although this behavior is stochastic and is an issue for rare data sets and modeling problems, this is effectively a memory leak which we now monitor and truncate if a MemoryLimit is reached and return/archive the results at that point as we would if a TimeConstraint were encountered.
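The general idea of truncating a run when memory grows too far can be sketched in a few lines of Python. This is not how DataModeler implements MemoryLimit; it is a toy loop, with a deliberately "leaky" list standing in for the kernel's runaway memory, that stops and returns what it has once the incremental memory use passes a limit.

```python
import tracemalloc


def run_search(max_generations, memory_limit_bytes):
    """Run a toy search until it finishes or incremental memory exceeds the limit.

    Returns the results gathered so far and a flag saying whether the run
    was cut short, much as a time limit would cut it short.
    """
    tracemalloc.start()
    baseline, _ = tracemalloc.get_traced_memory()
    leaked = []    # stands in for memory that is never released
    results = []
    stopped_early = False
    for generation in range(max_generations):
        leaked.append(bytearray(100_000))  # simulate a leak each generation
        results.append(generation)         # simulate a result from this generation
        current, _ = tracemalloc.get_traced_memory()
        if current - baseline > memory_limit_bytes:
            stopped_early = True
            break
    tracemalloc.stop()
    return results, stopped_early


results, stopped_early = run_search(max_generations=100, memory_limit_bytes=1_000_000)
print(stopped_early, len(results))
```

Measuring growth relative to a baseline, rather than total usage, matches the release-note wording about bounding the incremental memory of each evolution.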

The complete release notes are below:

The highlights of this release are the new DataCompletenessMap and DataCompletenessPlot. These are useful to get the zen of data sets which feature incomplete data. Related to this, we have introduced a new SymbolicRegression option, NumericPredictionRequirement, since the default of evaluating model quality only on complete records in the data meant that models could, in extreme circumstances, be assessed on far fewer data records than expected even though each of the constituent variables passed the NumericColumnThreshold.

Mathematica also appears to have some memory leak issues which can come into play for very long modeling runs. To address this, we implemented a MemoryLimit on SymbolicRegression to avoid runaway consumption of RAM and disk space.

  • Introduced two new functions related to data completeness, DataCompletenessMap and DataCompletenessPlot, which provide a visual assessment of the presence of non-numeric elements in a data set.
  • Addressed a problem wherein models could be assessed on a smaller than expected fraction of the data set even though the individual variables satisfied the completeness threshold specified by the NumericColumnThreshold. Hence, we introduced a new option NumericPredictionRequirement (default associated with SymbolicRegression) which sets the minimum fraction of data records which must be numerically evaluatable by a model for the non-numerics to be automatically excluded from the assessment of ModelQuality. Along with this change, the NumericColumnThreshold default has been set to Automatic which will use the NumericPredictionRequirement as its threshold.
  • Modified SmallPlot so that it stays unevaluated if the input format is not recognized. This allows it to be used as a pure function for options such as ToolTipFunction.
  • Implemented a default Mesh -> Automatic setting for CorrelationMatrixPlot which will suppress the mesh if a large number of data columns are being plotted. Also trapped the situation where nominally numeric columns didn't have any numeric overlap.
  • Modified SymbolicRegression so that StoreModelSet can be used within the GenerationMonitor, CascadeMonitor, RunMonitor or EvolutionMonitor pure functions.
  • Implemented a MemoryLimit option for SymbolicRegression to guard against the Mathematica kernel memory leaks during the model search. This places an upper bound on the incremental memory required during each of the IndependentEvolutions. Additionally, a MemoryMonitor option was created which can return the profile of memory consumption over the course of each kernel's model search.

DataModeler Release 8.13 (3 Dec 2012)

Monday, December 3, 2012

We are glad to announce that this release should be compatible with both Mathematica 8 and 9!

If you do encounter any issues, please send them our way. Thanks!

The official Release Notes for 8.13

  • The help has been rebuilt using Mathematica 9 so it is searchable using both versions.
  • Wolfram Research changed the ChartElementFunction setting names for BoxWhiskerChart with version 9 and about half of these are not functional yet. Hence, the default for VariablePresenceDistributionChart was changed to "BoxWhisker" which is a setting for both version 8 and 9 and works in both.
  • ParetoFrontPlot was misbehaving in Mma 9 (due to a change in how ListPlot handled the DataRange option) but is now displaying properly.
  • The PlotLegends package has been deprecated. Since CorrelationMatrixPlot was the only function which overtly used PlotLegends and Mathematica 9 has a nice implementation of PlotLegend as an option throughout most Mathematica graphics, we deleted the dependency upon the PlotLegends package.
  • DataOutliers for EnsembleResidualPlot were not being handled properly in some rare cases. This has been fixed.

DataModeler Release 8.12 (27 Nov 2012)

Tuesday, November 27, 2012

We are happy to release DataModeler 8.12. The key highlights are:

  • Archived models are now automatically compressed to reduce file sizes (factors of 25 are a good thing).
  • Implemented support for VariablesToPlot option in a variety of functions — this makes data and performance exploration much cleaner for high-dimensional data sets as we gain insight into the key inputs.
  • Implemented support for display of DataOutliers in a variety of functions. Associated with this support is changing some plotting defaults since the outliers will by default be denoted in red.
  • Greatly improved the behavior of BivariatePlot so that large multi-dimensional data sets can be safely handled without blowing out the memory footprint of the notebook.
  • Implemented a new function, ModelPredictionComparisonPlot, which is useful to look at prediction performance trajectories relative to the observed behavior. The SortBy, DataVariableReference and DataOutliers options make this a pretty powerful function. Under the hood, it uses SmallPlot so it can efficiently handle large data sets.
  • Modified the MultiCore behavior of SymbolicRegression to allow finer-grain control of the number of cores operating in parallel. (The upcoming Mathematica 9 appears to offer more subkernel licenses so we can peg crunching capabilities of our machines.)
  • Note for Mathematica 9 testers: Mathematica 9 stomps on a couple of DataModeler functions (KernelID and AbsoluteCorrelation), so we have renamed those functions in this release (they are still supported in Mma8). Mma9 also changes the documentation system so the current help is not discoverable; however, model development and existing notebooks should work.
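The first highlight, compressed archives that are recognized automatically on retrieval, follows a common pattern that can be sketched in Python. Here gzip and JSON stand in for Mathematica's Compress and the DataModeler archive format, and the function names and the sample model records are hypothetical.

```python
import gzip
import json

GZIP_MAGIC = b"\x1f\x8b"  # every gzip stream starts with these two bytes


def store_model_set(models, compress=True):
    """Serialize a model set, optionally compressing it (on by default)."""
    raw = json.dumps(models).encode("utf-8")
    return gzip.compress(raw) if compress else raw


def retrieve_model_set(blob):
    """Load a model set, decompressing only if the archive was compressed."""
    if blob.startswith(GZIP_MAGIC):
        blob = gzip.decompress(blob)
    return json.loads(blob.decode("utf-8"))


# A repetitive, made-up model set compresses well.
models = [{"expression": "x1 + 2.5*x2", "complexity": 11}] * 200
packed = store_model_set(models)
plain = store_model_set(models, compress=False)

print(len(plain), len(packed))  # compressed archive is much smaller
print(retrieve_model_set(packed) == retrieve_model_set(plain) == models)  # True
```

Detecting the format on read means old uncompressed archives keep working without the caller having to remember how each file was written.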

Since it is pretty spiffy, let's quickly look at the ModelPredictionComparisonPlot. One basic use is to look at model performance against time series data. Here we look at data from a distillation column with DataOutliers highlighted.

ModelPredictionComparisonPlot

We can also use SortBy and DataVariableReference to reorder the data or define the x-axis in the plot. Note that in this case the frame labels are automatically adjusted to provide the audit trail information.

ModelPredictionComparisonPlot with DataVariableReference

The official release notes and changes for 8.12:

  • Implemented support in ResponsePlotExplorer for DataVariableLabels. These will now be used for the variable sliders. The default behavior will be to use ColorizeList to color code the ModelVariables used for the slider labels so that they match those used in the graphic labels.
  • Added a Compress option to StoreModelSet which determines whether the archived models should be processed using Compress to reduce file sizes. The default is to compress the files. The complementary RetrieveModelSet function will recognize the archival choice and Uncompress the file, if needed.
  • Fixed a bug in SymbolicRegression wherein MetaVariables were not being supported subsequent to the first of the IndependentEvolutions of each kernel or subkernel. This would manifest itself as a pathology if only a single variable was supplied to the modeling.
  • Modified the InversePatternMapping rules associated with SymbolicRegression. Although functionally similar to the previous performance, orders-of-magnitude speed gains were realized relative to the previous approach when ActiveGenomeSimplification was enabled with a SimplificationFunction setting of Expand or ExpandAll.
  • Modified OptimizeModel and OptimizeModelExpression to accept options for SelectModels if a list of models is supplied.
  • Modified GridTable to support options appropriate for Framed. Now the bounding box can be suppressed by setting FrameStyle to None and the appearance tweaked via other options such as Background, RoundingRadius, FrameMargins, etc.
  • Modified UnivariatePlot, BivariatePlot, DataDistributionPlot, CorrelationChart and CorrelationMatrixChart to support a VariablesToPlot option. This makes data exploration easier as the data modeling progresses and high-priority inputs are identified.
  • Implemented support in a variety of functions for the display and annotation of DataOutliers. These include UnivariatePlot, BivariatePlot, ModelPredictionPlot, EnsemblePredictionPlot, ModelResidualPlot and EnsembleResidualPlot. As part of this change, the plot style for many functions has been changed so that, by default, red is reserved for denoting outliers.
  • Extensive modifications to improve the scaling and functionality of BivariatePlot. Provided support for data subsampling within BivariatePlot so that the n^2 expansion in the graphics does not produce an inordinately large memory footprint if large data sets are supplied. Support for displaying DataOutliers and controlling the VariablesToPlot was also incorporated as well as allowing finer control of setting the various graphic styles.
  • Modified UnivariatePlot to support a DataVariableReference option which allows the x-axis for the plots to be specified rather than just looking at the data trajectory. This option is useful if the data records are not uniformly sampled.
  • Modified RangeLength to support a specified start value. Thus RangeLength[ list, 0 ] will produce a zero-relative indexing rather than the default 1-relative behavior.
  • Modified SymbolicRegression to support TargetColumn option settings of one of the DataVariables or DataVariableLabels. Previously, this had to be specified as an index into one of the columns of the supplied data matrix. Last or First are now also valid settings with Last (i.e., the final data column) continuing to be the default.
  • Changed the default EnsembleDivergenceFunction to (3*StandardDeviation[#]&) from the previous settings of the model extremals. Since we target diverse models in assembling a ModelEnsemble and include “sloppy but good” models as a means to detect extrapolation and changes in the fundamentals of the targeted system, we want to flag the divergence of the models. Given the stochastic nature of the model selection, the envelope of predictions implies too much confidence in the extremal models. Conversely, we want to incorporate them into the assessment so we do not want to use a robust statistic such as MedianDeviation as a foundation. The 95% confidence limit chosen based upon the (nonrobust) StandardDeviation seems like a reasonable compromise given the operational purpose of the EnsembleDivergenceFunction.
  • Modified the default SignificanceLevel for MetaVariableDistributionChart and MetaVariableDistributionTable to be { 10, 0.4 }. This form requires that a MetaVariable be present in at least 10 models of at least one of the IndependentEvolutions and it be in at least 40% of the models of one of the IndependentEvolutions (not necessarily the same one). Setting the minimum threshold for model count avoids trivial results when only one model from an independent evolution might have passed the selection (e.g., QualityBox) criteria.
  • Implemented a new function, ModelPredictionComparisonPlot, which shows the prediction overlaid on the observed behavior. The SortBy option can be used to sequence the data records of the supplied data sets and DataVariableReference may be used to specify the x-axis.
  • Modified ConfidenceEllipsoid, ConfidenceEllipsoidSelection and ConfidenceEllipsoidSelectionIndices to allow duplicate and constant data columns to be supplied. The supplied data still has to be strictly numeric.
  • Modified the MultiCore option for SymbolicRegression. Mathematica 9 will introduce support for more subkernels so we will be able to tap into the multiple physical and virtual cores (available via hyperthreading). MultiCore may now be specified as None, Automatic, All or an integer ranging up to the $ProcessorCount for the machine. Each additional subkernel will reduce the CPU effort allocated to the individual IndependentEvolutions; however, the search diversity is generally a benefit. Testing indicates that the All setting will approximately halve the number of modeling generations for a given selected TimeConstraint relative to running a single kernel and by 25% relative to using half of the available kernels (i.e., the Automatic option setting) — hence, it may be desirable to lengthen the TimeConstraint. We also implemented some recovery support when subkernels spontaneously disconnect and get lost — but we still cannot recover the licenses associated with the lost kernels until Mathematica is restarted. The default setting is Automatic to allow for use of other applications; however, for serious (e.g., overnight) model search, a setting of All would probably be appropriate.
  • The soon-to-be-released Mathematica 9 introduces two new functions which stomp on DataModeler functions. AbsoluteCorrelation is about half the speed of the DataModeler implementation so we have renamed the current version to AbsCorrelation. Similarly, KernelID is an undocumented developer function so we have renamed the DataModeler implementation to KernelNumber. AbsoluteCorrelation and KernelID will continue to work underneath Mathematica 8 (and, possibly, under Mathematica 9).
  • Modified SubSample and SmallPlot to handle supplied lists of Tooltips. If the dataset size exceeds the DataSegments limit, the tooltips will be stripped. Otherwise, the tooltips will be restored after processing. Included in this is working around a bug in ListPlot wherein it does not handle the display of doublets (i.e., two DataOutliers in ModelPredictionComparisonPlot).
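The { 10, 0.4 } SignificanceLevel rule in the notes above has two parts that may be satisfied by different evolutions, which is easy to misread. The Python sketch below spells out one reading of that rule; the function name and the (count, total) input format are assumptions for illustration, not the DataModeler interface.

```python
def is_significant(per_evolution, min_count=10, min_fraction=0.4):
    """Apply a {min_count, min_fraction} style significance test.

    per_evolution: one (models_containing_metavariable, total_models) pair
    for each independent evolution. The two conditions may be met by
    different evolutions.
    """
    enough_models = any(count >= min_count for count, _ in per_evolution)
    enough_share = any(
        total > 0 and count / total >= min_fraction
        for count, total in per_evolution
    )
    return enough_models and enough_share


# 12 models in the first evolution, 50% of the models in the second one.
print(is_significant([(12, 100), (2, 4)]))    # True
# 100% of an evolution, but that evolution kept only a single model.
print(is_significant([(1, 1)]))               # False
# Plenty of models, yet never more than 12% of any one evolution.
print(is_significant([(12, 100), (3, 50)]))   # False
```

The second case is the trivial result the count threshold is meant to screen out: a lone surviving model makes any variable it contains look universally present.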