Illustration 3

Powered by Evolved Analytics' DataModeler

Illustration: Variable selection & modeling of under-determined (fat) data arrays

Due to its ability to identify and focus on driving variables, SymbolicRegression can build models from data sets that have more variables than records.

Generate a FAT data array — 30 records of 100 variables

Here we take the toy problem used above, but now with 100 variables and only 30 records.
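The notebook's data-generation cell is not shown here; as an illustrative stand-in, the sketch below builds a comparable fat array in NumPy, assuming (hypothetically) that only two of the 100 variables actually drive the response and the rest are noise columns.

```python
import numpy as np

rng = np.random.default_rng(0)

n_records, n_vars = 30, 100                  # "fat": more variables than records
X = rng.uniform(-1, 1, size=(n_records, n_vars))

# Assumed toy response: only the first two columns matter;
# the remaining 98 variables are pure distractors
y = X[:, 0] ** 2 + 2 * X[:, 1] + rng.normal(0, 0.05, n_records)

print(X.shape, y.shape)
```

Any model search on such an array must discover that 98 of the 100 columns carry no information.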

Perform a SymbolicRegression

Extracting the driving variables and building models from a fat array like this is more difficult, so we will devote a minute to the SymbolicRegression model search. As seen below, even this moderate amount of allotted time is sufficient to isolate the driving variables and build reasonably good models. Here we have used ModelDimensionality as a SecondaryModelingObjective; as a result, the search simultaneously looks for the best model with one variable, two variables, etc., with higher-dimensional models forced to compete against all models using fewer variables. The search is therefore automatically focused on low-dimensional models, and the likelihood of returning a trivial model (e.g., a linear sum of 30 variables in this case) is mitigated.
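The dimensionality-as-objective idea can be sketched outside DataModeler. Below, a minimal NumPy stand-in enumerates simple candidate models (all one- and two-variable least-squares fits, playing the role of the evolved population) and keeps the best model at each dimensionality, so that a higher-dimensional model only earns its keep by beating every lower-dimensional one. The data-generating step is the same hypothetical two-driver toy problem assumed above.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, (30, 100))                      # fat array: 30 x 100
y = X[:, 0] ** 2 + 2 * X[:, 1] + rng.normal(0, 0.05, 30)

def rms_error(cols):
    """RMS residual of a least-squares fit on a subset of variables."""
    A = np.column_stack([X[:, cols], np.ones(30)])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return float(np.sqrt(np.mean((A @ coef - y) ** 2)))

# Candidate "models": every 1- and 2-variable linear subset,
# standing in for the evolved model population
candidates = [(cols, rms_error(list(cols)))
              for k in (1, 2) for cols in combinations(range(100), k)]

# Best model at each dimensionality: the analogue of letting model
# dimensionality act as a secondary objective alongside accuracy
best = {}
for cols, err in candidates:
    if len(cols) not in best or err < best[len(cols)][1]:
        best[len(cols)] = (cols, err)

print(best)
```

This is only a linear caricature of the genetic-programming search, but it shows the bookkeeping: accuracy is tracked per dimensionality, so a parsimonious model is never displaced by a merely larger one.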

Using the insight from the initial SymbolicRegression to focus additional modeling

Below we identify the variables present in at least 10% of the models having a model quality of at least 80%. Notice that the variables used by these “good models” are tightly concentrated on just a few candidates.
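The presence analysis can be sketched as follows, again using the hypothetical NumPy toy data and simple least-squares candidate models in place of the evolved population, with R² as an assumed stand-in for the 80% quality threshold: keep the models above the threshold, then count the fraction of those models in which each variable appears.

```python
import numpy as np
from itertools import combinations
from collections import Counter

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, (30, 100))
y = X[:, 0] ** 2 + 2 * X[:, 1] + rng.normal(0, 0.05, 30)
sst = float(((y - y.mean()) ** 2).sum())

def r2(cols):
    """R^2 of a least-squares fit on a subset of variables."""
    A = np.column_stack([X[:, list(cols)], np.ones(30)])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return 1 - float(((y - A @ coef) ** 2).sum()) / sst

models = [cols for k in (1, 2) for cols in combinations(range(100), k)]
good = [cols for cols in models if r2(cols) >= 0.80]   # "good" models

# Fraction of good models in which each variable appears;
# keep variables present in at least 10% of them
counts = Counter(v for cols in good for v in cols)
drivers = sorted(v for v, n in counts.items() if n / len(good) >= 0.10)
print(drivers)
```

Variables that matter appear in nearly every good model, while distractor variables appear only sporadically, so the 10% presence cutoff separates them cleanly.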

If we wanted to improve our models, we could devote more CPU time to the model search above. Alternatively, assuming we are reasonably confident that the correct driving variables have been identified, we could restrict the subsequent search to only those candidate drivers. The search is then more efficient, since it need not consider the variables that were not significant in the exploratory modeling. This result is shown below.
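The payoff of focusing can be sketched in the same hypothetical NumPy setting: once the search is restricted to the suspected drivers (indices 0 and 1 here, an assumption standing in for the exploratory result), the same budget affords a richer model form, such as quadratic terms.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, (30, 100))
y = X[:, 0] ** 2 + 2 * X[:, 1] + rng.normal(0, 0.05, 30)

# Hypothetical outcome of the exploratory search:
# columns 0 and 1 are the suspected driving variables
driving = [0, 1]
Xf = X[:, driving]

# With only two candidates, the follow-up fit can afford a richer
# basis (linear plus quadratic terms) at a fraction of the cost
A = np.column_stack([Xf, Xf ** 2, np.ones(30)])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
r2 = 1 - float(((y - A @ coef) ** 2).sum() / ((y - y.mean()) ** 2).sum())
print(round(r2, 3))
```

Because the 98 distractor columns are excluded up front, the focused fit recovers the nonlinear structure that the broad exploratory search could only approximate.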