Illustration 4

Powered by Evolved Analytics' DataModeler

Illustration: Redundant Information

Real-world data sets often contain redundant data records, i.e., nearly the same information repeated many times. For any data modeling technique, redundant data slows model development and may also degrade model quality. DataModeler offers tools for ranking data records by their incremental information content. This insight can be used to accelerate model development and to produce higher-quality models.

An unbalanced data set and its incremental information content

Redundant information is common in many real-world data sets — especially systems with feedback such as industrial plants or financial systems. Redundant data is a problem for two reasons: (a) CPU time is spent evaluating models against essentially the same information and (b) overloaded regions of parameter space will be unduly weighted in the model evaluation — in other words, a focus on local rather than global accuracy.
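Reason (b) can be made concrete with a small sketch. The Python below is an illustration only, not DataModeler code: the quadratic target, the tangent-line model, and the sample counts are all assumptions chosen for the example. It scores one model, accurate only near x = 5, against an oversampled data set and against an evenly spread one.

```python
def mean_squared_error(model, records):
    """Average squared error of model(x) against y over (x, y) records."""
    return sum((model(x) - y) ** 2 for x, y in records) / len(records)


def target(x):
    """Hypothetical true response: a simple quadratic."""
    return x * x


def local_model(x):
    """Tangent line to the target at x = 5; accurate only near x = 5."""
    return 10.0 * x - 25.0


# 11 records spread evenly over the valid range [0, 10].
even = [(float(x), target(float(x))) for x in range(11)]

# 500 records crowded into a narrow band around x = 5 (the oversampled region).
dense = [(5.0 + 0.0001 * k, target(5.0 + 0.0001 * k)) for k in range(-250, 250)]

mse_even = mean_squared_error(local_model, even)
mse_all = mean_squared_error(local_model, even + dense)

print(f"MSE on evenly spread records: {mse_even:.2f}")
print(f"MSE on oversampled data set:  {mse_all:.2f}")
```

Under these assumptions the oversampled data set reports a far smaller error for the same model, because the crowded region dominates the average. That is the local-versus-global bias described above.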

Below we have data in which a small region of parameter space is greatly oversampled relative to the overall valid parameter ranges. The BalanceData[] function uses the SMITS algorithm to rank each data record by its incremental information content. As the plot on the right shows, most of the information is contained in a small fraction of the data records, which agrees with our intuition in this case.

4_redundantInformation_1.gif

4_redundantInformation_2.gif
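The idea of ranking records by what each one adds can be sketched in a few lines of Python. This is not the SMITS algorithm and not the BalanceData implementation, whose details are not given here. It substitutes a simple greedy farthest-point ordering, in which a record's "gain" is its distance to the nearest record already ranked. The function name `rank_by_novelty` and the synthetic data are hypothetical.

```python
import math
import random


def rank_by_novelty(points):
    """Greedy farthest-point ordering (a stand-in for an information ranking).

    Returns (order, gains): order[i] is the index of the i-th ranked record,
    gains[i] is its distance to the nearest previously ranked record
    (for the first record, its distance to the centroid).
    """
    n = len(points)
    if n == 0:
        return [], []
    dim = len(points[0])
    centroid = [sum(p[d] for p in points) / n for d in range(dim)]
    # Start from the record farthest from the centroid (a deterministic choice).
    first = max(range(n), key=lambda i: math.dist(points[i], centroid))
    order = [first]
    gains = [math.dist(points[first], centroid)]
    # nearest[i]: distance from record i to its closest ranked record; -1 once ranked.
    nearest = [math.dist(p, points[first]) for p in points]
    nearest[first] = -1.0
    for _ in range(n - 1):
        nxt = max(range(n), key=lambda i: nearest[i])
        order.append(nxt)
        gains.append(nearest[nxt])
        nearest[nxt] = -1.0
        for i in range(n):
            if nearest[i] >= 0.0:
                nearest[i] = min(nearest[i], math.dist(points[i], points[nxt]))
    return order, gains


# Synthetic unbalanced data: 11 records cover [0, 10], 500 crowd around 5.
random.seed(0)
sparse = [(float(x),) for x in range(11)]
dense = [(5.0 + random.uniform(-0.05, 0.05),) for _ in range(500)]
points = sparse + dense

order, gains = rank_by_novelty(points)

# Count how many top-ranked records account for 95% of the total gain.
total = sum(gains)
running = 0.0
needed = len(gains)
for count, gain in enumerate(gains, start=1):
    running += gain
    if running >= 0.95 * total:
        needed = count
        break

print(f"{needed} of {len(points)} records carry 95% of the total gain")
```

With this synthetic data, the 11 evenly spread records are ranked ahead of all 500 clustered ones, which mirrors the steep rise and long flat tail of an incremental information curve for unbalanced data. The double loop is quadratic in the number of records, so it is only suitable for small illustrations.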

Incremental information content behavior

Using this insight into the information content, we can identify balanced data subsets that capture the overall behavior and use them in subsequent modeling. As shown below, we can substantially reduce the data set size without a significant reduction in the information supplied to the modeling process.

4_redundantInformation_3.gif

4_redundantInformation_4.gif
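Given records already ranked by incremental information, choosing a balanced subset reduces to finding the shortest prefix that retains a chosen share of the total. The Python sketch below assumes a list of per-record gains in ranked order. The helper name `smallest_prefix`, the 95% target, and the gain values are hypothetical and do not come from DataModeler.

```python
def smallest_prefix(gains, target_fraction):
    """Return how many top-ranked records are needed to reach the target
    fraction of the total gain. `gains` must already be in ranked order."""
    total = sum(gains)
    if total <= 0:
        return len(gains)  # no usable ranking signal: keep everything
    running = 0.0
    for count, gain in enumerate(gains, start=1):
        running += gain
        if running >= target_fraction * total:
            return count
    return len(gains)


# Made-up gains: five informative records followed by 95 near-duplicates.
gains = [10.0, 5.0, 2.0, 2.0, 1.0] + [0.01] * 95

keep = smallest_prefix(gains, 0.95)  # keep == 5
reduction = 1.0 - keep / len(gains)
print(f"keep {keep} of {len(gains)} records ({reduction:.0%} smaller)")
```

The target fraction is a tuning choice: a higher value keeps more records and more of the fine detail, a lower one gives a smaller and faster subset.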

Comparing redundant vs. balanced data in model building

Below we devote 90 seconds to SymbolicRegression using the entire data set as input and another 90 seconds to a second SymbolicRegression using a much smaller subset taken from the BalanceData results. By not spending CPU time on redundant information, we get much closer to a quality model in the same amount of time.

4_redundantInformation_5.gif

4_redundantInformation_6.gif

4_redundantInformation_7.gif

Created with Wolfram Mathematica 8.0