Togaware DATA MINING
Desktop Survival Guide
by Graham Williams
Google


Nomenclature

Data miners have a plethora of terminology for many of the same things, due primarily to the history of data mining with its roots in many disciplines. Throughout this book we will use a single, consistent nomenclature, and one that is generally accepted. This nomenclature is introduced here.

We refer to collections of data as datasets. This might be a matrix or a database table. A dataset consists of rows which we might refer to as entities, and those entities are described in terms of variables which form the columns. Synonyms for entity include row, record and object, while synonyms for variable include column, field, characteristic, attribute and feature.

Variables can serve one of two roles: as input variables or output variables (Hastie et al., 2001). Input variables are measured or preset data items while output variables are those that are perhaps ``influenced'' by the input variables. In data mining we often build models to predict the output variables in terms of the input variables. Input variables are also known as predictors, independent variables, observed variables and descriptive variables. Output variables are also known as response and dependent variables.

Variables can be categoric or numeric. A categoric variable is one like eye colour and type of motor vehicle. Such variables take on a value from a fixed set of values (e.g., a colour, or passenger vehicle, utility, etc, or the common categorisations like low, medium, and high). A numeric variable has values that are integers or real numbers, such as a persons age or weight, or their income or amount of money in the bank. Synonyms for categoric variable include nominal variable, qualitative variable and factor, while synonyms for numeric variable, include quantitative variable.

Categoric variables are always discrete (i.e., can only take on specific values). Numeric variables can be discrete (integers) or continuous (real).

We will talk of datasets consisting of entities described using variables, which might consist of a mixture of input variables and output variables, either of which may be categoric or numeric.

A dataset (or subsets of a dataset) might have different roles. For building classification models, for example, we often partition a dataset into a training dataset and a testing dataset. Typically, we build our model on the training dataset and evaluate its performance on the testing dataset.

Copyright © 2004-2008 Togaware Pty Ltd
Support further development through the purchase of the PDF version of the book.
PDF version is properly formatted and forms a comprehensive book (draft with over 600 pages).
Brought to you by Togaware.