Given the novelty and abundance of the available techniques and algorithms, the modelling phase is the most interesting part of the data mining process. We have therefore devoted a separate section of the tutorial to a description of data mining modelling techniques. Important stages in the modelling phase:
The choice of modelling technique was first addressed earlier in the project, when the problem and the data mining goals were specified. However, at this stage, when the data are finally prepared for modelling, we can still choose a more appropriate technique than the one specified at the start of the process.
When choosing among the numerous available DM modelling techniques, one has to keep in mind the main task of the project and how it relates to the main divisions of DM tools by problem type. The first division of DM modelling tools is by the type of knowledge discovery task one wants to accomplish, i.e. prediction or description. It should be emphasized that many DM modelling tools generate models that solve a prediction task and, at the same time, provide an informative description of the structure behind the data, which can serve as a solution to a descriptive task. Generally, the goals of prediction and description tasks are achieved by applying one of the primary data mining methods. The table below relates data mining problem types to appropriate modelling techniques.
|Problem type|Appropriate modelling techniques|
|---|---|
|Classification|Rule induction methods, Decision trees, Neural networks, K-nearest neighbors, Case based reasoning|
|Prediction|Regression analysis, Regression trees, Neural networks, K-nearest neighbors|
|Dependency analysis|Correlation analysis, Regression analysis, Association rules, Bayesian networks, Inductive logic programming|
|Data description and summarization|Statistical techniques, OLAP|
|Segmentation or clustering|Clustering techniques, Neural networks, Visualization methods|
Before building a model, we need to define a procedure or mechanism for testing the model’s quality and validity. For example, in supervised data mining tasks such as classification, it is common to use error rates as quality measures for data mining models. Therefore, we typically separate the data into a training set and a test set, build the model on the training set, and estimate its quality on the separate test set.
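The test design described above can be illustrated with a minimal sketch. It assumes a hypothetical toy dataset of two Gaussian classes and uses a 1-nearest-neighbour classifier purely as a stand-in model; the helper names `train_test_split`, `predict_1nn` and `error_rate` are illustrative and do not come from any particular tool.

```python
import random

def train_test_split(rows, test_fraction=0.3, seed=0):
    """Shuffle the rows and split them into a training set and a test set."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def predict_1nn(train, x):
    """Return the label of the training example closest to x
    (squared Euclidean distance)."""
    nearest = min(
        train,
        key=lambda row: sum((a - b) ** 2 for a, b in zip(row[0], x)),
    )
    return nearest[1]

def error_rate(train, test):
    """Fraction of test examples misclassified by the model built on train."""
    wrong = sum(1 for x, y in test if predict_1nn(train, x) != y)
    return wrong / len(test)

# Hypothetical toy data (an assumption for illustration):
# two well-separated classes in two dimensions.
rng = random.Random(42)
data = [((rng.gauss(0, 1), rng.gauss(0, 1)), "a") for _ in range(50)]
data += [((rng.gauss(5, 1), rng.gauss(5, 1)), "b") for _ in range(50)]

# Build on the training set, estimate quality on the separate test set.
train, test = train_test_split(data, test_fraction=0.3, seed=1)
test_error = error_rate(train, test)
print(f"train size={len(train)}, test size={len(test)}, test error={test_error:.2f}")
```

The key point is that no test example is ever used to build the model, so the reported error rate is an estimate of performance on unseen data rather than a measure of fit.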
Once the modelling tool or tools have been chosen, we run them on the prepared dataset and typically generate several different models. All modelling tools have a number of parameters that govern the model generation process. Choosing optimal parameters for the problem at hand is an iterative process, and the choice has to be properly explained and supported by results. The resulting models should be properly interpreted and their performance explained.
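As a rough illustration of this iterative parameter choice, the sketch below builds several k-nearest-neighbour models that differ only in the parameter `k` and records the error of each on held-out data. The dataset, the candidate values of `k`, and the function names are assumptions made for the example; a real project would substitute its own tool and parameters.

```python
import random
from collections import Counter

def knn_predict(train, x, k):
    """Majority label among the k training examples nearest to x."""
    neighbours = sorted(
        train,
        key=lambda row: sum((a - b) ** 2 for a, b in zip(row[0], x)),
    )[:k]
    return Counter(label for _, label in neighbours).most_common(1)[0][0]

def error_rate(train, held_out, k):
    """Fraction of held-out examples misclassified by the k-NN model."""
    wrong = sum(1 for x, y in held_out if knn_predict(train, x, k) != y)
    return wrong / len(held_out)

# Hypothetical toy data (an assumption): two overlapping classes.
rng = random.Random(0)
data = [((rng.gauss(0, 1.5), rng.gauss(0, 1.5)), "a") for _ in range(60)]
data += [((rng.gauss(2, 1.5), rng.gauss(2, 1.5)), "b") for _ in range(60)]
rng.shuffle(data)
train, validation = data[:80], data[80:]

# One model per candidate parameter value; keep the results so that the
# final choice can be explained and supported.
results = {k: error_rate(train, validation, k) for k in (1, 3, 5, 7, 9)}
for k, err in results.items():
    print(f"k={k}: validation error={err:.3f}")

best_k = min(results, key=results.get)
print("selected k:", best_k)
```

Here the parameter is selected on a held-out validation portion, so that a separate test set, if one is kept, remains untouched for the final quality estimate.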
Once the models are generated, they are interpreted according to the existing domain knowledge and the data mining success criteria. Domain experts judge the results (models) within the domain context, while data miners apply data mining criteria (accuracy on the test set, lift or gain tables, etc.).
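To make the lift and gain criteria concrete, the following sketch computes a cumulative gain and lift table from model scores on a test set. The scores and the 0/1 labels are made-up illustrative values, and `gain_table` is a hypothetical helper, not a function of any specific data mining tool.

```python
def gain_table(scores, labels, n_bins=5):
    """Rank cases by descending score, cut them into n_bins groups, and
    report cumulative gain (share of all positives captured so far) and
    cumulative lift (positive rate so far divided by the overall rate)."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    total = len(ranked)
    total_pos = sum(label for _, label in ranked)
    base_rate = total_pos / total
    rows = []
    cum_pos = 0
    for b in range(1, n_bins + 1):
        start = (b - 1) * total // n_bins
        end = b * total // n_bins
        cum_pos += sum(label for _, label in ranked[start:end])
        gain = cum_pos / total_pos
        lift = (cum_pos / end) / base_rate
        rows.append((b, end, cum_pos, gain, lift))
    return rows

# Hypothetical model scores and true labels (1 = positive case).
scores = [0.95, 0.90, 0.85, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20]
labels = [1, 1, 0, 1, 0, 1, 0, 0, 0, 0]

table = gain_table(scores, labels, n_bins=5)
print("bin  cases  positives  gain   lift")
for b, n_cases, n_pos, gain, lift in table:
    print(f"{b:>3}  {n_cases:>5}  {n_pos:>9}  {gain:.2f}  {lift:.2f}")
```

A lift well above 1 in the first bins indicates that the model concentrates positive cases among its highest-scored examples; by construction the cumulative lift in the last bin is 1.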