Togaware DATA MINING
Desktop Survival Guide
by Graham Williams
Google


Data Nomenclature

Data miners have a plethora of terminology, often using many different terms to describe the same concept. A lot of this confusion of terminology is due to the history of data mining, with its roots in many different disciplines, including databases, machine learning, and statistics. Throughout this book we will use a consistent and generally accepted nomenclature, which we introduce here.

We refer to a collection of data as a dataset. This might be called, in mathematical terms, a matrix, or in database terms, a table. Figure [*] illustrates a dataset annotated with our chosen nomenclature.

We often view a dataset as consisting of rows which we refer to as observations, and those observations are recorded in terms of variables which form the columns of the dataset. Observations are also known as entities, rows, records and objects. Variables are also known as fields, columns, attributes, characteristics and features. The dimensions of a dataset refers to the number of observations (rows) and the number of variables (columns).

Variables can serve different roles: as input variables or output variables. Input variables are measured or preset data items. They might also be known as predictors, covariates, independent variables, observed variables and descriptive variables. Output variables are those that are often ``influenced'' by the input variables. They might also be known as response and dependent variables. In data mining we often build models to predict the output variables in terms of the input variables.

Some variables may only serve to uniquely identify the observations. Common examples include social security and other such government identity numbers. Even the date may be a unique identifier for particular observations. We refer to such variables as identifiers. Identifiers are not normally used in modelling, particularly identifiers which are essentially randomly generated.

Variables can store different types of data. The values might be the names or the qualities of objects, represented as character strings. Or the values may be quantitative and are thereby represented numerically. At a high level we often only need to distinguish these two broad types of data, as we do here.

We will employ for data mining purposes datasets consisting of observations recorded using variables, which might consist of a mixture of input variables and output variables, either of which may be categoric or numeric.

A dataset (or subsets of a dataset) might have different roles. For building classification models, for example, we often partition a dataset into a training dataset and a testing dataset. Typically, we build our model on the training dataset and evaluate its performance on the testing dataset.

Copyright © Togaware Pty Ltd
Support further development through the purchase of the PDF version of the book.
The PDF version is a formatted comprehensive draft book (with over 800 pages).
Brought to you by Togaware. This page generated: Sunday, 22 August 2010