DATA MINING
Desktop Survival Guide
by Graham Williams

Nomenclature

Data miners have a plethora of terminology, often using many different terms to describe the same concept. Much of this confusion stems from the history of data mining, with its roots in many different disciplines, including databases, machine learning, and statistics. Throughout this book we will use a consistent and generally accepted nomenclature, which we introduce here.

We refer to collections of data as datasets. A dataset might be a matrix or a table within a database, or within the context of R it might be a data frame. A dataset consists of rows, which we refer to as observations, and those observations are described in terms of variables, which form the columns. Synonyms for observation include entity, row, record and object, while synonyms for variable include column, field, characteristic, attribute and feature.

Variables can serve one of two roles: as input variables or output variables. Input variables are measured or preset data items, while output variables are those that are often "influenced" by the input variables. In data mining we usually build models to predict the output variables in terms of the input variables. Input variables are also known as predictors, independent variables, observed variables and descriptive variables. Output variables are also known as response and dependent variables.
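As a minimal sketch of these terms in Python, a small dataset can be held as a list of observations and then separated into input and output variables. The data, the variable names and the choice of `risk` as the output variable are all invented for illustration.

```python
# Hypothetical dataset: each dict is one observation (row),
# and each key is one variable (column).
dataset = [
    {"age": 34, "income": 52000.0, "vehicle": "passenger", "risk": "low"},
    {"age": 51, "income": 87000.0, "vehicle": "utility", "risk": "high"},
    {"age": 27, "income": 31000.0, "vehicle": "passenger", "risk": "medium"},
]

# Assumed roles: "risk" is the output variable, all others are inputs.
output_variable = "risk"
input_variables = [name for name in dataset[0] if name != output_variable]

# Separate each observation into its input values and its output value.
inputs = [{name: row[name] for name in input_variables} for row in dataset]
outputs = [row[output_variable] for row in dataset]
```

A model would then be built to predict each entry of `outputs` from the corresponding entry of `inputs`.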

Variables can be categoric or numeric. A categoric variable is one such as eye colour or type of motor vehicle. Such a variable takes on a single value for a particular observation from a fixed set of values (e.g., a colour; passenger vehicle, utility, etc.; or a common categorisation such as low, medium, and high). A numeric variable has values that are integers or real numbers, such as a person's age or weight, or their income or the amount of money they have in the bank. Synonyms for categoric variable include nominal variable, qualitative variable and factor, while synonyms for numeric variable include quantitative variable.

Categoric variables are always discrete (i.e., can only take on specific values). Numeric variables can be discrete (integers) or continuous (real).
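The distinctions above can be illustrated with a rough Python heuristic. The function names and sample values are hypothetical, and the rule used here (numbers are numeric, anything else is categoric, and floats are treated as continuous) is a simplifying assumption rather than a general definition.

```python
from numbers import Real


def variable_kind(values):
    """Return 'numeric' if every value is a number, otherwise 'categoric'."""
    # bool is a subclass of int in Python, so exclude it explicitly.
    if all(isinstance(v, Real) and not isinstance(v, bool) for v in values):
        return "numeric"
    return "categoric"


def is_discrete(values):
    """Treat categoric values and integers as discrete, floats as continuous."""
    return not any(isinstance(v, float) for v in values)


# Hypothetical columns taken from some dataset.
eye_colour = ["blue", "brown", "green"]
age = [34, 51, 27]
weight = [70.5, 82.1, 64.0]
```

Under these assumptions `eye_colour` is categoric and discrete, `age` is numeric and discrete, and `weight` is numeric and continuous.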

For data mining purposes we will use datasets consisting of observations recorded using variables. The variables may be a mixture of input variables and output variables, either of which may be categoric or numeric.

A dataset (or subsets of a dataset) might have different roles. For building classification models, for example, we often partition a dataset into a training dataset and a testing dataset. Typically, we build our model on the training dataset and evaluate its performance on the testing dataset.
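One simple way to partition a dataset is a random split, sketched below in Python. The function name `partition`, the 70/30 proportion and the fixed seed are assumptions made for this example; the text does not prescribe a particular proportion or method.

```python
import random


def partition(observations, train_fraction=0.7, seed=42):
    """Randomly split observations into (training, testing) lists."""
    rng = random.Random(seed)      # fixed seed so the split is repeatable
    shuffled = list(observations)  # copy, leaving the original order intact
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Ten placeholder observations, identified here only by index.
observations = list(range(10))
training, testing = partition(observations)
```

Each observation lands in exactly one of the two subsets, so performance measured on `testing` is not based on observations the model saw while it was being built.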

Brought to you by Togaware. This page generated: Sunday, 13 September 2009