DM modelling techniques come from a number of different fields of research, such as machine learning, signal processing, evolutionary computing, and statistics. This variety, together with the large number of different algorithms, can be confusing for potential users. A useful way to reduce the confusion is to stress that all DM modelling techniques have a similar structure. They can all be described by three main components:
- Model (or knowledge) representation
- This is the functional form of the model(s) used by the algorithm. Formally, a model can be represented as a function y=f(x,P), where x represents the input (attribute-value pairs) and P represents the specific parameters describing the particular model. For example, for a decision tree algorithm the model f takes the form of a graph of nodes and edges, while for a rule induction algorithm it is a particular set of rules in CNF or DNF. Important issues related to model representation are: the form of data that the model handles (continuous, discrete-integer, categorical, all), the explanatory power of the representation, the function approximation capabilities (linear, nonlinear), and the form of the output of the model.
- Estimation criteria
- When a particular representation f is given, the estimation criterion evaluates how well a particular set of parameters P fits the data. This criterion is internal to the specific DM modelling technique and should not be confused with the evaluation measures used to assess already built models, which are treated in a separate section of this tutorial. The estimation criterion therefore scores the different instances of the representation f that are constructed during the search through the space of all possible instances; that search is carried out by the search method, the third basic component of DM modelling techniques, explained next. Typical characteristics of estimation criteria include: the sensitivity and robustness of the criterion for a particular model as a function of sample size and the dimensionality of the problem; and the underlying assumptions of the criterion (probabilistic, logical, independent sampling). Estimation criteria differ significantly from technique to technique, as a consequence of the particular model representation and the applied search method.
- Search method
- Given a representational form and an estimation criterion, the search method is a specific algorithm that governs the search through the space of all describable representations, using the estimation criterion. This basically means that, given a model representation and an estimation criterion, DM modelling techniques work like optimization algorithms. Search algorithms have some typical characteristics: the basic search methodology (greedy, exhaustive, heuristic, hill-climbing); the complexity of the search (whether it is a parameter search or has an additional loop over model structures); and the control of the search (stopping criteria related to time and memory complexity).
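The three components can be illustrated with a small sketch. It is not taken from any particular DM technique: it assumes a deliberately tiny model (a one-split decision stump on a single numeric attribute), uses the number of misclassified training examples as the estimation criterion, and searches exhaustively over a finite set of candidate parameters. The names `f`, `criterion`, `search`, and the toy `data` are hypothetical and chosen only for this illustration.

```python
from itertools import product

# Model representation: y = f(x, P), here a one-split decision stump.
# P = (threshold, label_below, label_above) are the model parameters.
def f(x, P):
    threshold, label_below, label_above = P
    return label_below if x <= threshold else label_above

# Estimation criterion (assumed for this sketch): the number of training
# examples that the parameters P misclassify. Lower is better.
def criterion(P, data):
    return sum(1 for x, y in data if f(x, P) != y)

# Search method: exhaustive search over a finite grid of candidate
# parameters. Thresholds are midpoints between neighbouring x values,
# so the data must contain at least two distinct x values.
def search(data):
    xs = sorted({x for x, _ in data})
    thresholds = [(a + b) / 2 for a, b in zip(xs, xs[1:])]
    labels = sorted({y for _, y in data})
    candidates = product(thresholds, labels, labels)
    return min(candidates, key=lambda P: criterion(P, data))

# Toy training data: (attribute value, class label) pairs.
data = [(1.0, "no"), (2.0, "no"), (3.0, "yes"), (4.0, "yes")]
best = search(data)
print(best, criterion(best, data))
```

Swapping any one of the three functions changes the technique while leaving the other two in place: for instance, replacing the exhaustive `search` with a greedy or hill-climbing procedure would keep the same representation and criterion but explore the parameter space differently.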
Descriptions of the different DM modelling techniques, which can be found through the links given in the DM Modelling techniques section, reveal the properties of these three components that are typical for each technique.