Illustration 5

Powered by Evolved Analytics' DataModeler

Illustration: Working with BIG data sets

DataModeler's SymbolicRegression algorithms are state-of-the-art and remarkably efficient. However, for really large data sets, there are strategies which can be useful beyond simply allocating more CPU effort.
  The key is to recognize that, in a large data set, the records generally do not all carry the same information content, and to exploit that fact.

Synthesize a large data set

Let us continue with the same underlying system used in the previous illustration. However, this time we will uniformly distribute the sample points over the parameter space. Even though the sample points are uniformly distributed, we can see from the BalanceData analysis that the records are not all judged to be equal from an information perspective.



As we can see from the sequence below, we can get a very good approximation of the response surface from relatively few data points. Obviously, if we were dealing with more than two variables it would be difficult to look at the response behavior and visually determine the minimum number of points required. Fortunately, the key takeaway is that we have a means to prioritize data records based upon their information value. When we are dealing with big real-world data sets, we should expect to be able to capture the information content in fewer than the nominal number of data records.
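The idea of prioritizing records can be sketched in Python with a simple stand-in. The ordering below is a greedy farthest-point ranking over the record coordinates (inputs plus response): each step picks the record farthest from everything already chosen. This is an assumption made purely for illustration; it is not the BalanceData algorithm, whose internals are not described here.

```python
import math


def farthest_point_order(points):
    """Return record indices ordered so that each next record is the one
    farthest from all records already selected (greedy stand-in for an
    information ranking; starts arbitrarily from record 0)."""
    n = len(points)
    if n == 0:
        return []
    order = [0]
    # min_dist[i]: distance from record i to its nearest selected record;
    # selected records are marked with -1.0 so they are never picked again.
    min_dist = [math.dist(points[0], p) for p in points]
    min_dist[0] = -1.0
    for _ in range(n - 1):
        nxt = max(range(n), key=min_dist.__getitem__)
        order.append(nxt)
        min_dist[nxt] = -1.0
        for i in range(n):
            if min_dist[i] >= 0.0:
                min_dist[i] = min(min_dist[i], math.dist(points[nxt], points[i]))
    return order


# Tiny example: the outlying record (index 2) is ranked right after the seed.
demo_points = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (0.5, 0.0)]
demo_order = farthest_point_order(demo_points)
```

Taking a prefix of such an ordering gives a small subset that spans the observed behavior. The cost of this naive version grows quadratically with the number of records, so it is only meant to convey the concept.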



Develop models against the large data set

There are essentially four strategies that we can use to attack the big data sets:

More Time & Effort: Simply throwing more CPU cycles at the modeling can be effective since the SymbolicRegression algorithms are quite good at continual innovation and can be distributed over multiple Mathematica kernels and CPUs.

Balanced Data Subset: Given that not all data records are equal, we could choose an information threshold which (we think) captures the response behavior and use that as a surrogate for the entire data set.

OrdinalGP: This approach starts with small, randomly chosen data subsets. Then, over time, the fraction of the data used in the random subsets is increased until, eventually, the entire data set is used for the final evaluations.

ESSENCE: This approach uses the data ordering and information ranking of BalanceData. Rather than using random data subsets, it starts with the identified key data records and, over time, introduces new information into the model building at a rate proportional to its incremental information contribution.
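The difference between the last two strategies can be sketched in Python as two subset schedules. This is a simplified illustration under stated assumptions, not DataModeler's implementation: the function names, the 10% starting fraction, and the linear ramp are all hypothetical. In particular, the ESSENCE-style schedule here grows a ranked prefix linearly, whereas the description above ties the growth rate to the incremental information contribution.

```python
import random


def subset_size(generation, total_generations, n_records, start_fraction=0.1):
    """Number of records used at a generation: a linear ramp from
    start_fraction of the data at generation 0 to all of it at the end."""
    if total_generations <= 1:
        return n_records
    t = generation / (total_generations - 1)
    fraction = start_fraction + (1.0 - start_fraction) * t
    return max(1, min(n_records, round(fraction * n_records)))


def ordinal_subset(generation, total_generations, n_records, rng):
    """OrdinalGP-style: a fresh random subset whose size grows over time."""
    k = subset_size(generation, total_generations, n_records)
    return sorted(rng.sample(range(n_records), k))


def ranked_subset(generation, total_generations, ranked_indices):
    """ESSENCE-style: a growing prefix of records ranked by information."""
    k = subset_size(generation, total_generations, len(ranked_indices))
    return ranked_indices[:k]


rng = random.Random(0)
first_random = ordinal_subset(0, 10, 100, rng)           # 10 random records
last_random = ordinal_subset(9, 10, 100, rng)            # all 100 records
first_ranked = ranked_subset(0, 10, list(range(100)))    # top 10 ranked records
```

Both schedules take the total number of generations as an input, because the ramp has to be planned so that the full data set is reached by the final generation.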

These four approaches are illustrated below, where we have imposed a TimeConstraint of 90 seconds on each SymbolicRegression. The results are stochastic; however, as a general rule, the ESSENCE approach delivers the best performance within a given time allotment. All of the data subsetting approaches will typically beat the approach of simply using the massive data set as a monolithic block.

The disadvantage of the OrdinalGP & ESSENCE approaches is that their profiles require knowing in advance how many generations will be run, rather than leaving the run open-ended. (Actually, modeling runs will naturally terminate after a half-day to several days, depending upon the data set size.) Hence, below we also define an option set to support the OrdinalGP & ESSENCE algorithms.