Modern researchers in various fields are confronted by an unprecedented wealth and complexity of data. However, traditional data analysis techniques provide only limited solutions to such complex situations. The response to the huge demand for the analysis and interpretation of these complex data goes by the name of data mining, or knowledge discovery. Data mining is defined as the process of extracting useful information from large data sets through the use of any relevant data analysis techniques developed to help people make better decisions. These data mining techniques are defined and categorized according to their underlying statistical theories and computing algorithms. This entry discusses these various data mining methods and their applications.
In general, data mining methods can be separated into three categories: unsupervised learning, supervised learning, and semisupervised learning methods. Unsupervised methods rely solely on the input variables (predictors) and do not take into account output (response) information. In unsupervised learning, the goal is to facilitate the extraction of implicit patterns and elicit the natural groupings within the data set without using any information from the output variable. On the other hand, supervised learning methods use information from both the input and output variables to generate the models that classify or predict the output values of future observations. The semisupervised method mixes the unsupervised and supervised methods to generate an appropriate classification or prediction model.
Unsupervised learning methods attempt to extract important patterns from a data set without using any information from the output variable. Clustering analysis, which is one of the unsupervised learning methods, systematically partitions the data set by minimizing within-group variation and maximizing between-group variation. These variations can be measured on the basis of a variety of distance metrics between observations in the data set. Clustering analysis includes hierarchical and nonhierarchical methods.
Hierarchical clustering algorithms provide a dendrogram that represents the hierarchical structure of clusters. At the highest level of this hierarchy is a single cluster that contains all the observations, while at the lowest level are clusters containing a single observation. Examples of hierarchical clustering algorithms are single linkage, complete linkage, average linkage, and Ward's method.
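The bottom-up construction of such a hierarchy can be sketched in a few lines of Python. This is a minimal illustration of single linkage only, assuming Euclidean distance and a brute-force search over cluster pairs; the function name `single_linkage` and the sample points are hypothetical and chosen for illustration.

```python
import math

def single_linkage(points):
    """Agglomerative clustering with single linkage; returns the merge history."""
    clusters = [[i] for i in range(len(points))]  # start: one observation per cluster
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(math.dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((sorted(clusters[a]), sorted(clusters[b]), d))
        clusters[a] = clusters[a] + clusters[b]  # merge the two closest clusters
        del clusters[b]
    return merges

# Hypothetical data: two tight pairs and one distant observation.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.5), (20.0, 20.0)]
merges = single_linkage(points)
for left, right, dist in merges:
    print(left, right, round(dist, 3))
```

Each entry of the merge history corresponds to one junction of the dendrogram, read from the lowest level (single observations) up to the single cluster containing everything.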
Nonhierarchical clustering algorithms achieve the purpose of clustering analysis without building a hierarchical structure. The k-means clustering algorithm is one of the most popular nonhierarchical clustering methods. A brief summary of the k-means clustering algorithm is as follows: Given k seed (or starting) points, each observation is assigned to the seed point closest to it, which creates k clusters. The seed points are then replaced with the means of the currently assigned clusters. This procedure is repeated with the updated seed points until the assignments no longer change. The results of the k-means clustering algorithm depend on the distance metric, the number of clusters (k), and the location of the seed points. Other nonhierarchical clustering algorithms include k-medoids and self-organizing maps.
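The k-means procedure summarized above can be sketched as follows. This is a minimal version assuming Euclidean distance and user-supplied seed points; the function name `kmeans`, the `max_iter` safeguard, and the sample data are assumptions made for illustration.

```python
import math

def kmeans(points, seeds, max_iter=100):
    """Basic k-means on tuples of floats, using Euclidean distance."""
    centers = [tuple(s) for s in seeds]
    assignment = None
    for _ in range(max_iter):
        # Assignment step: each observation goes to its closest center.
        new_assignment = [
            min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # assignments stopped changing
        assignment = new_assignment
        # Update step: replace each center with the mean of its cluster.
        for j in range(len(centers)):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # keep the old center if a cluster is empty
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centers, assignment

# Hypothetical data with two well-separated groups and two seed points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centers, labels = kmeans(points, seeds=[(0.0, 0.0), (10.0, 10.0)])
print(labels)   # [0, 0, 0, 1, 1, 1]
print(centers)
```

Rerunning the sketch with different seeds or a different k shows the dependence on starting conditions noted above.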
Principal components analysis (PCA) is another unsupervised technique and is widely used, primarily for dimensionality reduction and visualization. PCA works with the covariance matrix of the original variables, from which eigenvalues and eigenvectors are obtained. The product of the original data matrix and the eigenvector corresponding to the largest eigenvalue gives the first principal component (PC), which expresses the maximum variance of the data set. The second PC is then obtained via the eigenvector corresponding to the second largest eigenvalue, and this process is repeated N times to obtain N PCs, where N is the number of variables in the data set. The PCs are uncorrelated with each other, and generally the first few PCs are sufficient to account for most of the variation. Thus, a plot of the observations on the axes defined by these first few PCs facilitates visualization of high-dimensional data sets.
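As a rough sketch of how the first PC arises from the covariance matrix, the following Python code uses power iteration to approximate the leading eigenpair. Power iteration is a stand-in chosen here for brevity, not a method named in this entry; a full eigendecomposition routine would normally be used. The function names and the small two-variable data set are assumptions for illustration.

```python
import math

def covariance_matrix(rows):
    """Sample covariance matrix of the columns, plus the mean-centered data."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(p)]
           for i in range(p)]
    return cov, centered

def leading_eigenpair(matrix, iters=500):
    """Power iteration: approximates the largest eigenvalue and its eigenvector."""
    p = len(matrix)
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(iters):
        w = [sum(matrix[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    mv = [sum(matrix[i][j] * v[j] for j in range(p)) for i in range(p)]
    eigenvalue = sum(v[i] * mv[i] for i in range(p))
    return eigenvalue, v

# Hypothetical data: ten observations on two positively correlated variables.
rows = [[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
        [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]]
cov, centered = covariance_matrix(rows)
eigenvalue, eigenvector = leading_eigenpair(cov)

# First PC scores: centered data multiplied by the leading eigenvector.
pc1 = [sum(c[j] * eigenvector[j] for j in range(len(eigenvector))) for c in centered]
share = eigenvalue / sum(cov[i][i] for i in range(len(cov)))
print(round(share, 3))  # fraction of total variance carried by the first PC
```

The variance of the `pc1` scores equals the leading eigenvalue, which is why the ratio of that eigenvalue to the trace of the covariance matrix measures how much of the variation the first PC accounts for.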
Supervised learning methods use both the input and output variables to provide the model or rule that characterizes the relationships between the input and output variables. Based on the characteristics of the output variable, supervised learning methods can be categorized as either regression or classification. In regression problems, the output variable is continuous, so the main goal is to predict the outcome values of an unknown future observation. In classification problems, the output variable is categorical, and the goal is to assign existing labels to an unknown future observation.
Linear regression models have been widely used in regression problems because of their simplicity. Linear regression is a parametric approach that provides a linear equation to examine relationships of the mean response to one or to multiple input variables. Linear regression models are simple to derive, and the final model is easy to interpret. However, the parametric assumption on the error term in linear regression analysis often restricts its applicability to complicated multivariate data. Further, ordinary least squares does not yield a unique solution when the number of variables exceeds the number of observations. Multivariate adaptive regression splines (MARS) is a nonparametric regression method that compensates for the limitations of ordinary regression models. MARS is one of the few tractable methods for high-dimensional problems with interactions, and it estimates a completely unknown relationship between a continuous output variable and a number of input variables. MARS is a data-driven statistical linear model in which a forward stepwise algorithm is first used to select the model terms and is then followed by a backward procedure to prune the model. The approximation bends at “knot” locations to model curvature, and one of the objectives of the forward stepwise algorithm is to select the appropriate knots. Smoothing at the knots is an option that may be used if derivatives are desired.
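To make the idea of bending at a knot concrete, the sketch below evaluates a piecewise linear function built from hinge basis functions of the form max(0, x - knot) and max(0, knot - x), which is the usual way MARS terms are written. It covers only the evaluation of an already chosen model, not the forward selection or backward pruning steps; the coefficients, the knot at 3, and the function names are hypothetical.

```python
def hinge(x, knot, direction=+1):
    """Hinge basis function: max(0, x - knot) if direction is +1, else max(0, knot - x)."""
    return max(0.0, direction * (x - knot))

def piecewise_linear(x, intercept, terms):
    """Evaluate intercept + sum of coefficient * hinge(x, knot, direction)."""
    return intercept + sum(coef * hinge(x, knot, d) for coef, knot, d in terms)

# Hypothetical model with a single knot at x = 3:
# slope -0.5 to the left of the knot and slope 2 to the right of it.
terms = [(2.0, 3.0, +1), (0.5, 3.0, -1)]
for x in (1.0, 3.0, 5.0):
    print(x, piecewise_linear(x, intercept=1.0, terms=terms))
```

The function is continuous at the knot but changes slope there, which is the curvature-modeling behavior described above.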
Classification methods provide models to classify unknown observations according to the existing labels of the output variable. Traditional classification methods include linear discriminant analysis (LDA) and quadratic discriminant analysis (QDA), based on Bayesian theory. Both LDA and QDA assume that the data in each class follow a normal distribution. LDA generates a linear decision boundary by assuming that the populations of the different classes have the same covariance. QDA, on the other hand, places no restriction on the equality of covariance across populations and provides a quadratic decision boundary that may be effective for data sets that are not linearly separable.
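For a single input variable, the LDA rule reduces to comparing linear discriminant scores of the form x·μ_k/σ² − μ_k²/(2σ²) + log π_k, where μ_k is the class mean, σ² the pooled variance, and π_k the class prior. The following sketch implements that one-dimensional case only; the function names and the two-class sample are assumptions for illustration, and the multivariate case would replace the pooled variance with a pooled covariance matrix.

```python
import math

def fit_lda_1d(samples):
    """samples: dict mapping class label -> list of 1-D observations."""
    n = sum(len(xs) for xs in samples.values())
    means = {k: sum(xs) / len(xs) for k, xs in samples.items()}
    priors = {k: len(xs) / n for k, xs in samples.items()}
    # Pooled variance: LDA assumes every class shares the same variance.
    pooled = sum((x - means[k]) ** 2 for k, xs in samples.items() for x in xs)
    pooled /= n - len(samples)
    return means, priors, pooled

def predict_lda_1d(x, means, priors, pooled):
    """Assign x to the class with the largest linear discriminant score."""
    def score(k):
        return x * means[k] / pooled - means[k] ** 2 / (2 * pooled) + math.log(priors[k])
    return max(means, key=score)

# Hypothetical training sample with two classes of equal size.
samples = {"a": [1.0, 2.0, 3.0], "b": [6.0, 7.0, 8.0]}
means, priors, pooled = fit_lda_1d(samples)
print(predict_lda_1d(4.0, means, priors, pooled))  # a
print(predict_lda_1d(5.0, means, priors, pooled))  # b
```

With equal priors and a shared variance, the decision boundary sits at the midpoint of the two class means (4.5 in this example), which illustrates why the LDA boundary is linear.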
Many supervised learning methods can handle both regression and classification problems, including decision trees, support vector machines, k-nearest neighbors, and artificial neural networks. Decision tree models have gained huge popularity in various areas because of their flexibility and interpretability. Decision tree models are flexible in that they can efficiently handle both continuous and categorical variables in the model construction. The output of a decision tree model is a hierarchical structure that consists of a series of if-then rules to predict the outcome of the response variable, thus facilitating the interpretation of the final model. From an algorithmic point of view, the decision tree model has a forward stepwise procedure that adds model terms and a backward procedure for pruning, and it conducts variable selection by including only useful variables in the model.
Support vector machine (SVM) is another supervised learning model popularly used for both regression and classification problems. SVMs use geometric properties to obtain a separating hyperplane by solving a convex optimization problem that simultaneously minimizes the generalization error and maximizes the geometric margin between the classes. Nonlinear SVM models can be constructed by means of kernel functions; commonly used kernels include linear, polynomial, and radial basis functions.
Another useful supervised learning method is k-nearest neighbors (kNN). As a lazy-learning (instance-based learning) technique, kNN does not require a trained model. Given a query point, the k closest points in the training data are determined. A variety of distance measures can be applied to calculate how close each point is to the query point. The k nearest points are then examined to find the category to which most of them belong. Last, this category is assigned to the query point being examined. This procedure is repeated for all the points that require classification.
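The kNN steps can be sketched directly, since no model is fitted in advance. The code below assumes Euclidean distance and a simple majority vote; the function name `knn_classify`, the default k = 3, and the labeled points are hypothetical.

```python
import math
from collections import Counter

def knn_classify(query, training, k=3):
    """training: list of (point, label) pairs; returns the majority label among the k nearest."""
    # Sort the training points by distance to the query and keep the k closest.
    nearest = sorted(training, key=lambda pair: math.dist(query, pair[0]))[:k]
    votes = Counter(label for _, label in nearest)
    # On a tie, the label seen first (the one with the nearer point) wins.
    return votes.most_common(1)[0][0]

# Hypothetical labeled observations.
training = [((1.0, 1.0), "red"), ((1.0, 2.0), "red"), ((2.0, 1.0), "red"),
            ((6.0, 6.0), "blue"), ((7.0, 6.0), "blue"), ((6.0, 7.0), "blue")]
print(knn_classify((2.0, 2.0), training))  # red
print(knn_classify((6.0, 5.0), training))  # blue
```

Because all the work happens at query time, the cost of classifying each new point grows with the size of the training data, which is the practical trade-off of a lazy learner.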
Finally, artificial neural networks (ANNs), inspired by the way biological nervous systems learn, are widely used for prediction modeling in many applications. ANN models are typically represented by a network diagram containing several layers (e.g., input, hidden, and output layers) that consist of nodes. These nodes are interconnected with weighted connection lines whose weights are adjusted when training data are presented to the ANN during the training process. The neural network training process is an iterative adjustment of the internal weights to bring the network's output closer to the desired values through minimizing the mean squared error.
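The iterative weight adjustment can be illustrated with the smallest possible case: a single linear node with one weight and one bias, trained by gradient descent on the mean squared error. This is a deliberately reduced sketch and not a multilayer network; the learning rate, the number of epochs, and the data (generated from y = 2x + 1) are assumptions for illustration.

```python
def train_neuron(data, lr=0.05, epochs=2000):
    """Fit output = w * x + b by gradient descent on the mean squared error."""
    w, b = 0.0, 0.0
    n = len(data)
    for _ in range(epochs):
        # Gradient of the mean squared error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        # Move the weights a small step against the gradient.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Hypothetical training data following y = 2x + 1 exactly.
data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
w, b = train_neuron(data)
print(round(w, 4), round(b, 4))
```

A network with hidden layers applies the same idea to every connection weight, with the gradients propagated backward through the layers.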
Semisupervised learning approaches have received increasing attention in recent years. Olivier Chapelle and his coauthors described semisupervised learning as “halfway between supervised and unsupervised learning” (p. 4). Semisupervised learning methods create a classification model by using partial information from the labeled data. One-class classification is an example of a semisupervised learning method that can distinguish between the class of interest (target) and all other classes (outlier). In the construction of the classifiers, one-class classification techniques require only the information from the target class. The applications of one-class classification include novelty detection, outlier detection, and imbalanced classification.
Support vector data description (SVDD) is a one-class classification method that combines a traditional SVM algorithm with a density approach. SVDD produces a classifier to separate the target from the outliers. The decision boundary of SVDD is constructed from an optimization problem that minimizes the volume of the hypersphere forming the boundary while maximizing the amount of target data captured within it. The main difference between the supervised and semisupervised classification methods is that the former generates a classifier to classify an unknown observation into the predefined classes, whereas the latter gives a closed boundary around the target data in order to separate them from all other types of data.
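The notion of a closed boundary learned from target data alone can be illustrated with a much simpler stand-in for SVDD: a hypersphere centered at the mean of the target observations, with a radius chosen to enclose a given fraction of them. This is not the SVDD optimization (it involves no support vectors, kernels, or slack penalties); the function names, the `coverage` parameter, and the data are hypothetical.

```python
import math

def fit_hypersphere(targets, coverage=1.0):
    """Center at the target mean; radius encloses the requested fraction of targets."""
    dim = len(targets[0])
    center = tuple(sum(p[j] for p in targets) / len(targets) for j in range(dim))
    distances = sorted(math.dist(p, center) for p in targets)
    # Index of the distance that covers the requested fraction of observations.
    index = max(0, math.ceil(coverage * len(distances)) - 1)
    return center, distances[index]

def is_target(point, center, radius):
    """Points inside or on the hypersphere are treated as targets, others as outliers."""
    return math.dist(point, center) <= radius

# Hypothetical target-class observations; no outlier examples are needed for fitting.
targets = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
center, radius = fit_hypersphere(targets)
print(is_target((0.5, 0.6), center, radius))  # True
print(is_target((3.0, 3.0), center, radius))  # False
```

Lowering `coverage` below 1.0 shrinks the sphere at the cost of rejecting some target observations, a crude analogue of the trade-off between boundary volume and captured target data described above.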
Interest in data mining has increased greatly because of the availability of new analytical techniques with the potential to retrieve useful information or knowledge from vast amounts of complex data that were heretofore unmanageable. Data mining has a range of applications, including manufacturing, marketing, telecommunication, health care, biomedicine, e-commerce, and sports. In manufacturing, data mining methods have been applied to predict the number of product defects in a process and identify their causes. In marketing, market basket analysis provides a way to understand the behavior of profitable customers by analyzing their purchasing patterns. Further, unsupervised clustering analyses can be used to segment customers by market potential. In the telecommunication industries, data mining methods help sales and marketing people establish loyalty programs, develop fraud detection modules, and segment markets to reduce revenue loss. Data mining has received tremendous attention in the field of bioinformatics, which deals with large amounts of high-dimensional biological data. Data mining methods combined with microarray technology allow monitoring of thousands of genes simultaneously, leading to a greater understanding of molecular patterns. Clustering algorithms use microarray gene expression data to group the genes based on their level of expression, and classification algorithms use the labels of experimental conditions (e.g., disease status) to build models to classify different experimental conditions.
A variety of data mining software is available. SAS Enterprise Miner (www.sas.com), SPSS (an IBM company, formerly called PASW Statistics) Clementine (www.spss.com), and S-PLUS Insightful Miner (www.insightful.com) are examples of widely used commercial data mining software. In addition, commercial software developed by Salford Systems (www.salford-systems.com) provides CART, MARS, TreeNet, and Random Forests for specialized uses of tree-based models. Free data mining software packages also are available. These include RapidMiner (rapid-i.com), Weka (www.cs.waikato.ac.nz/ml/weka), and R (www.r-project.org).
Exploratory Data Analysis, Exploratory Factor Analysis, Ex Post Facto Study