Data Questions

Naturally, data is the foundation of empirical modeling; however, a certain level of distrust in the data is required to develop models which can be trusted. In other words, all data should be assumed guilty until proven innocent. As a result, the data should be explored and a basic level of understanding developed before the modeling tools are launched. Specific data characteristics which might apply are discussed below.

These various aspects should inspire questions. For time series, we need to determine how to handle delayed effects as well as how (and whether) to partition the data into training, validation, and test sets, given that the underlying system might change over time.
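As a concrete illustration of both points, the sketch below (numpy only, with hypothetical array names and an assumed lag depth) builds lagged copies of an input to capture delayed effects and then partitions the series chronologically, so the model is always validated on data from later in time than it was trained on.

```python
import numpy as np

def make_lagged(x, y, n_lags=3):
    """Stack x[t], x[t-1], ..., x[t-n_lags] as inputs for predicting y[t]."""
    X = np.column_stack([x[n_lags - k : len(x) - k] for k in range(n_lags + 1)])
    return X, y[n_lags:]

def chronological_split(X, y, frac_train=0.6, frac_val=0.2):
    """Preserve temporal order: earliest data trains, latest data tests."""
    n = len(y)
    i, j = int(n * frac_train), int(n * (frac_train + frac_val))
    return (X[:i], y[:i]), (X[i:j], y[i:j]), (X[j:], y[j:])

rng = np.random.default_rng(0)
x = rng.normal(size=500)                                # hypothetical input series
y = 0.5 * np.roll(x, 2) + 0.1 * rng.normal(size=500)   # response lags x by 2 steps
X_lagged, y_aligned = make_lagged(x, y, n_lags=3)
train, val, test = chronological_split(X_lagged, y_aligned)
```

If the underlying system drifts, this time-ordered split gives a more honest quality estimate than a random shuffle, since the validation and test sets reflect the system's later behavior.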

For skinny arrays (very many records and relatively few variables), we need to ask whether the modeling effort really needs all of the data, given the processing burden, and, if not, what sort of data balancing approach should be used. This situation is common in industrial data sets.
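One hedged sketch of a balancing approach: bin the response into quantiles and keep roughly equal numbers of records per bin, so that the common operating region of an industrial data set does not drown out the rare excursions. The record count, bin count, and per-bin sample size below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(size=200_000)          # hypothetical response from a skinny array

# Ten quantile bins over the response; digitize tags each record with its bin.
edges = np.quantile(y, np.linspace(0, 1, 11)[1:-1])
bins = np.digitize(y, edges)

# Draw up to 500 records per bin, without replacement.
per_bin = 500
keep = np.concatenate([
    rng.choice(np.flatnonzero(bins == b),
               size=min(per_bin, int((bins == b).sum())), replace=False)
    for b in np.unique(bins)
])
print(len(keep), "records retained of", len(y))
```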

Conversely, fat arrays (lots of variables and relatively few records) have too many variables, so we face the problem of selecting the important ones in order to develop a valid model. This situation is especially common in genomics analysis.
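A minimal sketch of one common screening step for fat arrays: rank each variable by its absolute correlation with the response and carry only the top candidates into modeling. The shapes (40 records, 5000 variables, as in a small genomics study) and the cutoff are illustrative assumptions, and correlation screening is only one of many selection techniques.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 5000))        # fat array: 40 records, 5000 variables
beta = np.zeros(5000)
beta[:5] = 2.0                         # only the first 5 variables matter
y = X @ beta + rng.normal(size=40)

# Pearson correlation of every column with the response, in one pass.
Xc = X - X.mean(axis=0)
yc = y - y.mean()
corr = (yc @ Xc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))

top = np.argsort(-np.abs(corr))[:20]   # keep 20 candidates for model building
print("selected variables:", np.sort(top))
```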

Designed data (developed as part of a designed experiment) pretty much by definition produces a balanced data set. However, with many variables, where a full factorial design is not practical, we have to ask whether the system dynamics were really captured.
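The run-count arithmetic behind that concern is easy to see; the snippet below (the two-level, 20-variable setting is an illustrative assumption) shows how quickly a full factorial design outgrows any practical budget.

```python
from itertools import product

levels, n_vars = 2, 20
print(levels ** n_vars, "runs for a full factorial")   # 1,048,576 runs

# For a handful of variables the design is easy to enumerate directly:
small_design = list(product([-1, +1], repeat=3))       # the 8 corners of a cube
```

Fractional designs cut this count dramatically, but that is exactly where the question of whether the system dynamics were really captured becomes pressing.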

Closed-loop data is generally difficult since the resulting data arrays are derived from time series with lots of repeated information, which makes the data balancing problem especially hard. In addition to industrial processes, this type of data is also generated by financial markets, since they operate in a feedback mode.
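One hedged way to see (and partly tame) that repetition is to thin the records: greedily keep a record only when it sits farther than a tolerance from everything already kept, which strips out the near-duplicates a controller generates while holding the process at setpoint. The tolerance, shapes, and setpoint below are illustrative assumptions.

```python
import numpy as np

def thin(X, tol=0.5):
    """Greedy thinning: keep a record only if no kept record is within tol."""
    kept = [0]
    for i in range(1, len(X)):
        if np.linalg.norm(X[kept] - X[i], axis=1).min() > tol:
            kept.append(i)
    return np.asarray(kept)

rng = np.random.default_rng(2)
setpoint = np.array([1.0, 2.0, 3.0])
X = setpoint + 0.05 * rng.normal(size=(10_000, 3))   # closed-loop data parked at setpoint
print(len(thin(X)), "informative records kept of", len(X))
```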

Noise in the data is always an issue; hence it is useful to try to estimate the noise level, since that determines the achievable model quality.

Classification data has a quantized response (two or more classes). The associated question for the modeling effort is whether a quantized output is desired or whether a ranking score is more useful. Success in such modeling is a matter of life and death for cancer patients, where the type and aggressiveness of the cancer will determine an appropriate treatment plan.
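On the noise question, a minimal sketch of estimating the noise level from replicated runs: the pooled standard deviation of repeats at identical settings bounds the accuracy any model can achieve. The replicate counts and the true noise level used here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
# Hypothetical experiment: 30 distinct settings, each measured 4 times.
true_response = rng.normal(size=30)
reps = true_response[:, None] + 0.2 * rng.normal(size=(30, 4))  # true sigma = 0.2

# Pooled within-setting variance estimates the measurement noise;
# no model can reliably fit residuals below this level.
pooled_var = reps.var(axis=1, ddof=1).mean()
print("estimated noise sigma:", np.sqrt(pooled_var))  # close to 0.2
```

On the classification side, note that a continuous ranking score can always be thresholded later into hard classes, so the choice between the two outputs need not be made up front.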

For all of these data types, we also have the question of whether the variables are correlated or whether there are confounding effects. Many modeling techniques implicitly assume that the input variables are independent. If the data contains correlated variables and appropriate corrective action is not taken, then poor models will be produced which will be perceived as having higher quality than they deserve.
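A hedged sketch of checking for this before modeling: inspect the correlation matrix and the variance inflation factors (VIFs), which for standardized inputs are the diagonal of the inverse correlation matrix. The data and the rule-of-thumb threshold are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
x1 = rng.normal(size=300)
x2 = x1 + 0.05 * rng.normal(size=300)      # nearly a copy of x1
x3 = rng.normal(size=300)                  # genuinely independent
X = np.column_stack([x1, x2, x3])

R = np.corrcoef(X, rowvar=False)
vif = np.diag(np.linalg.inv(R))            # VIF_i = [R^-1]_ii
print("max |off-diagonal correlation|:", float(np.abs(R - np.eye(3)).max()))
print("VIF per variable:", vif.round(1))   # values far above ~10 flag collinearity
```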

In a related fashion, confounding effects are associated with driving variables which are not captured in the data. Such situations are common in combinatorial chemistry, where different cells are, for all practical purposes, different test rigs and may therefore produce structural errors if the experiments are not properly distributed across the various reactors.
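A minimal sketch of guarding against such a rig effect: randomize which reactor runs each experiment so that no treatment is tied to a single cell, while keeping the assignment balanced. The counts and names here are illustrative assumptions, not a prescription from the source.

```python
import numpy as np

rng = np.random.default_rng(5)
n_experiments, n_reactors = 24, 4

# A random permutation taken mod the reactor count yields a balanced,
# randomized assignment: exactly 6 experiments per reactor.
assignment = rng.permutation(n_experiments) % n_reactors
for r in range(n_reactors):
    print("reactor", r, "-> experiments", np.flatnonzero(assignment == r))
```

If the reactor-to-experiment mapping is also recorded as a variable, any residual rig effect becomes detectable rather than silently folded into the model.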