Text Mining is the extraction of knowledge from many texts. The extracted knowledge is new and created during the extraction process. It cannot be found in any of the single processed texts. Text mining combines several technologies and is applied in diverse application areas. Knowledge derived from text mining comes in the form of distribution patterns and the frequency of words in texts and their parts. Based on such knowledge, the user can explore and determine how topics are dealt with in many text documents, which attitudes on specific topics are expressed, and how these matters evolve with time. Text mining requires algorithms as well as information work of users: “Text Mining can be broadly defined as a knowledge-intensive process in which a user interacts with a document collection over time by using a suite of analysis tools” (Feldman & Sanger 2007, p. 1).
The basic algorithms for Text Mining deal with natural language as present in texts. Language technology dealing with the unstructured and vague nature of language needs to create numeric representations. Subsequently, the extracted structural knowledge is further processed by numerical algorithms as they are employed in data mining. Particularly, clustering and classification are applied. For instance, text classification can assign texts to different classes based on their topic.
Text Mining systems allow their users to interact with text collections and to extract some useful information in the form of overall patterns. In order to support the information work of the users, optimized user interfaces are necessary and need to be designed in a user centered way. The extraction and identification of patterns are often easier for the user if data is adequately visualized. Hence, visualization is an inherent part of Text Mining.
The term Web Mining is closely related to Text Mining and refers to the application of machine learning techniques to data from the Internet: “treat the information in the web as a large knowledge base from which we can extract new, never-before encountered information” (Hearst, 1999). Since most information on the Web is stored as text, the two areas overlap considerably. Machine learning is the computational core of Text Mining: algorithms try to adapt and improve their output over time. In supervised learning, a teaching input leads the program to a new and better solution for the same input. Data Mining is often understood as the process in which machine learning is applied and which also integrates data preparation and the presentation of results.
Text Mining typically begins with the processing of natural language. Initially, the creation of numerical representations for further processing is necessary. Natural language processing tasks are identical for many text mining applications.
Texts contain words in many different forms. The words need to be identified and separated, a difficult task for languages without blanks between words (e.g. some East Asian languages). For most European languages, punctuation marks and hyphens need to be taken into account.
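This word identification step, usually called tokenization, can be sketched as follows. The regular expression used here is a deliberate simplification for European languages; real tokenizers handle abbreviations, numbers and hyphenation more carefully.

```python
import re

def tokenize(text):
    """Split text into lowercase word tokens, treating punctuation
    and hyphens as boundaries (a simplification for European languages)."""
    return re.findall(r"[a-zäöüß]+", text.lower())

print(tokenize("Text-mining splits words; punctuation is dropped."))
# → ['text', 'mining', 'splits', 'words', 'punctuation', 'is', 'dropped']
```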
The following step is grouping words which share a common basic form, e.g. the different grammatical forms of a verb. Their meaning is essentially identical; only their morphology changes. In languages with many cases for nouns and many tense forms for verbs (e.g. Finnish), this task can be challenging. The same stemming operations are carried out in Information Retrieval.
An example would be the word forms “run,” “runs” and “running.” They should be all mapped to the same stem “run.”
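A toy stemmer for this example might strip a few common English suffixes. The suffix list here is an illustrative assumption; production systems use full algorithms such as the Porter stemmer.

```python
def stem(word):
    """Toy English stemmer: strip a few common suffixes.
    Real systems use e.g. the Porter algorithm; this is only a sketch."""
    for suffix in ("ning", "ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: len(word) - len(suffix)]
    return word

print([stem(w) for w in ["run", "runs", "running"]])
# → ['run', 'run', 'run']
```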
The remaining words are counted and their frequency in each text and in the entire collection is determined. Based on the frequencies, weights are calculated expressing the importance of a word or term for a text document. These weights show the topicality or “aboutness” of a document. This information can be stored in a document-term matrix where a vector contains the weights for all terms regarding one document. Each column shows the distribution of a term over all documents in a collection. (Manning et al., 2008)
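A minimal document-term matrix can be built from raw term frequencies, as a stand-in for the weighted (e.g. tf-idf) matrices described above. The two example documents are invented for illustration.

```python
from collections import Counter

docs = {
    "d1": "the cat sat on the mat".split(),
    "d2": "the dog chased the cat".split(),
}
vocab = sorted({t for toks in docs.values() for t in toks})

# Rows: documents; columns: terms. Each cell holds the raw term
# frequency, a simple stand-in for tf-idf style weights.
matrix = {d: [Counter(toks)[t] for t in vocab] for d, toks in docs.items()}

print(vocab)         # → ['cat', 'chased', 'dog', 'mat', 'on', 'sat', 'the']
print(matrix["d1"])  # → [1, 0, 0, 1, 1, 1, 2]
```

Reading a row gives the term distribution of one document; reading a column gives the distribution of one term over the collection.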
For Text Mining, the occurrence of a single word in one document is not as important as it is for Information Retrieval. Text Mining is not about finding one relevant document but rather about the presence of groups of words in documents and, often, the way the concept represented by these words is dealt with.
Such a concept could be, for instance, “Chinese Government” which would typically comprise phrases like “Chinese Leader” and “Chinese prime Minister” and also the names of ministers. Another concept could be a collection of words with positive meanings. A concept in Text Mining can be understood as a set of words. Concepts are usually created manually or extracted from ontologies or thesauri. There are also different methods for their semi-automatic creation. The success of applications will to some extent depend on the quality of the concept definitions.
The frequency of a concept can be determined by adding up the frequencies of all terms in this concept, a process which opens up opportunities for analysis. For example, the frequency of a concept “corruption” could be determined in a news corpus. In a further step, the frequency of occurrence of a concept related to a particular political party can be determined. With this information, the frequency of a party name in the context of the concept “corruption” can be found.
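Counting a concept then amounts to summing the occurrences of its member terms. The word set and the three-document corpus below are hypothetical examples.

```python
corpus = [
    "minister denies corruption charges",
    "new bribery scandal hits ministry",
    "sports team wins final",
]
# Hypothetical concept: a hand-picked word set standing for "corruption".
corruption = {"corruption", "bribery", "scandal", "graft"}

# Sum the occurrences of all concept terms across the corpus.
freq = sum(tok in corruption for doc in corpus for tok in doc.split())
print(freq)  # → 3
```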
By incorporating the dimension of time, a trend analysis can be carried out. For the example mentioned above, the association of the concept “corruption” and a party name can be recorded over a long period of time and presented in short time steps such as months or years. Thus, the temporal evolution of the association between corruption and a party can be observed. It is important to relate the frequency of a concept to its normal frequency. Concerning the example above, it could be the case that a certain party name appears very rarely; consequently, it would also appear rarely in the context of corruption. However, it could occur more often in the vicinity of corruption than expected, or more often than other party names.
The common appearance of terms can be an indicator of their semantic similarity. It can also be a hint that they belong to the same concept. When terms are similar in the vector space model mentioned above, they often appear together in documents; in other words, they exhibit a similar distribution pattern over documents. In a document-term matrix, they can be identified by searching for similar term vectors. Such associated terms or words should occur together more often than their individual frequencies suggest. The frequencies of words differ by several orders of magnitude. Therefore, it is necessary to calculate the divergence from randomness for joint occurrence. Such calculations can be interpreted as statistical tests for the significance of common occurrence. Similar models are well established in Information Retrieval under the language model (Song & Croft, 1999).
Frequently used association measures are mutual information, log-likelihood and chi-square (Manning et al., 2008). Due to the high number of words, not all pairs can be compared in a typical application; therefore, there is a high demand for efficient methods and optimization approaches.
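One of these measures, pointwise mutual information, can be computed directly from document counts. The counts in the usage example are invented for illustration.

```python
import math

def pmi(n_xy, n_x, n_y, n_total):
    """Pointwise mutual information from document counts: how much
    more often x and y co-occur than independence would predict."""
    p_xy = n_xy / n_total
    p_x = n_x / n_total
    p_y = n_y / n_total
    return math.log2(p_xy / (p_x * p_y))

# x appears in 100 of 10,000 documents, y in 50, both together in 20:
print(round(pmi(20, 100, 50, 10000), 2))  # → 5.32
```

A value above zero indicates the pair co-occurs more often than chance; independent words score near zero.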
Classification algorithms sort objects into predefined classes. In order to do that, these supervised learning algorithms require the presence of positive and negative example objects for each class as shown in Figure 1. The algorithm extracts knowledge from the objects for which the class assignment is known. Their features and their respective values relevant for class membership are determined. By doing that, rules for class membership are extracted from the learning input. A basic prerequisite is the presence of knowledge about the objects. Values for features of these objects are collected in feature vectors for the objects. The algorithms need to relate the values or the feature vector to the class membership. Symbolic approaches emphasize explicit rules and transparency of the learned knowledge. Important representatives of these symbolic algorithms are decision trees and classification rules (Zaki & Meira, 2010).
Decision trees search for binary divisions of the feature space which separate as many objects of one class as possible from all other objects. For example, it could be the case that most objects with a high value for a certain feature fall into one class. When the decision tree is used for assigning unknown objects to classes, the value of that feature could then be checked first. Further rules would be necessary in order to fine-tune the decision and finally reach a class assignment.
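A single node of such a tree, a decision stump, can be sketched as one threshold test. The feature names, threshold and class labels below are hypothetical.

```python
def stump_predict(x, feature, threshold, high_class, low_class):
    """One node of a decision tree: a binary split on a single feature.
    A full tree chains such tests until a leaf assigns the class."""
    return high_class if x[feature] > threshold else low_class

# Hypothetical feature vector for one document:
doc = {"goal_count": 5, "party_count": 0}
print(stump_predict(doc, "goal_count", 2, "sports", "other"))  # → sports
```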
Linear regression methods calculate a membership value for an object-class pair based on a linear combination of the feature values of the object. Parameters are derived which are multiplied with the feature values; the products are then summed to yield the membership value or membership probability for a class. These models are also transparent to some extent, as the user can check the effect of individual values.
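The linear combination itself is a weighted sum. The parameter values below are hypothetical stand-ins for what a learning algorithm would estimate from training data.

```python
def membership_score(features, weights, bias=0.0):
    """Linear model: multiply each feature value by its learned
    parameter and sum the products into a class membership score."""
    return bias + sum(features[f] * w for f, w in weights.items())

# Hypothetical learned parameters for the class "politics":
weights = {"minister": 1.5, "parliament": 2.0, "goal": -1.0}
doc = {"minister": 2, "parliament": 1, "goal": 0}
print(membership_score(doc, weights))  # → 5.0
```

The transparency mentioned above comes from the weights themselves: a user can see that “parliament” contributes positively and “goal” negatively to the politics score.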
Complex non-linear classification algorithms cannot deliver such transparency. On the other hand, they are more powerful and can solve learning problems with non-linear relations between feature values and class membership. Well-known and frequently used examples are support vector machines. These algorithms can be seen as templates for a complex function with many parameters which are adjusted in many steps to approximate the target output more and more closely (Runkler, 2012).
For the evaluation of classification algorithms, there are well-established measures such as recall, precision and F-measure per class. For evaluation purposes, it is necessary to measure success not only on the training set but also on a different set of objects, i.e. to separate test and training sets. This results in a realistic assessment of the quality of the classification algorithm. It could be the case that the training set exhibits peculiarities which are atypical for other objects as they might be encountered during the real application of the classification system.
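The per-class measures can be computed from the gold labels and predictions on such a held-out set. The spam/ham labels are an illustrative example.

```python
def precision_recall_f1(gold, predicted, cls):
    """Per-class evaluation on a held-out test set (never the training set)."""
    tp = sum(g == cls and p == cls for g, p in zip(gold, predicted))
    fp = sum(g != cls and p == cls for g, p in zip(gold, predicted))
    fn = sum(g == cls and p != cls for g, p in zip(gold, predicted))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

gold = ["spam", "spam", "ham", "ham"]
pred = ["spam", "ham", "ham", "ham"]
p, r, f = precision_recall_f1(gold, pred, "spam")
print(p, r)  # → 1.0 0.5
```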
Clustering means the ordering of objects into groups which are not known in advance and which can reveal a natural order in the data. Similar objects should be grouped together, and objects with low similarity to each other should end up in different clusters. The basic idea is illustrated in Figure 2.
Similar to classification, the objects are characterized by feature vectors whose values may be numeric or categorical. Cluster analysis typically requires a similarity matrix between the objects. Hierarchical algorithms first merge the objects with the highest similarity into one cluster; subsequently, the next most similar pair is joined into another cluster. The process continues until a desired number of clusters has been reached. Similarity measures and appropriate clustering algorithms need to be selected carefully (Backhaus et al., 2008). Hierarchical clustering can be illustrated by a dendrogram as shown in Figure 3.
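The merging loop can be sketched with single-linkage agglomeration on a small similarity matrix. The four-document matrix below is hypothetical.

```python
def agglomerate(sim, n_clusters):
    """Single-linkage agglomerative clustering on a similarity matrix:
    repeatedly merge the two clusters containing the most similar pair."""
    clusters = [{i} for i in range(len(sim))]
    while len(clusters) > n_clusters:
        a, b = max(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: max(sim[x][y] for x in clusters[ij[0]] for y in clusters[ij[1]]),
        )
        clusters[a] |= clusters[b]
        del clusters[b]
    return clusters

# Hypothetical pairwise similarities between four documents:
sim = [[1.0, 0.9, 0.1, 0.2],
       [0.9, 1.0, 0.2, 0.1],
       [0.1, 0.2, 1.0, 0.8],
       [0.2, 0.1, 0.8, 1.0]]
print(agglomerate(sim, 2))  # → [{0, 1}, {2, 3}]
```

Recording the order and similarity level of the merges yields exactly the information a dendrogram visualizes.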
Some methods require the number of clusters to be specified in advance, whereas others determine the optimal number of clusters themselves (Viswanth et al., 2009). After the calculation of the clusters, their quality is estimated by the homogeneity of the solution. The evaluation of cluster quality from a pragmatic, application-oriented point of view is difficult and greatly depends on the goal of the system. Clusters can then be used by users to explore large quantities of texts or Web pages (Carpineto et al., 2010).
The following examples of Text Mining applications are presented because they are highly relevant and represent the area well. There are many more applications of text mining, which deal with the identification of authors (Savoy, 2012), the identification of technical trends based on patents (Kim et al., 2009) and analyses in the bio-medical domain (Cohen & Hersh, 2005).
Figure 4 shows a typical text mining process with its steps.
The ordering of texts into pre-defined classes is a basic and frequent application of Text Mining and is referred to as text classification or text categorization. The assignment of documents to classes can serve many goals. A typical scenario is automatic indexing for libraries or other documentation centers. Putting a document into the right class, and consequently on the right library shelf, allows users convenient access to texts about the same topic. Other applications include the classification of e-mail messages into spam and non-spam or the assignment of news agency messages to categories for publication. Challenging research questions in text classification concern the processing of very large numbers of features; the selection of good evidence is important, and each class might have different predictors. The choice of classification algorithms is also a subject of research (Sebastiani, 2002).
The Internet offers access to numerous news sources. For some applications it is useful to group them automatically. For coarse structuring, classification could be applied. Messages could be classified by text classification methods into categories which correspond to the sections of a newspaper, such as sports, business or politics.
More interesting is the clustering of news articles of different sources into clusters of articles about the same event. Since the events are not known before, clustering needs to be used to group the similar news stories. Within each category, such as politics or sports, very different distributions of words may prevail. Consequently, clustering may be optimized in different ways within each category (Bache & Crestani, 2010).
Some approaches try to recognize the similarity of the texts across languages and can provide access to news over different languages (Steinberger et al., 2011). The temporal dimension of news provides an alternative approach for the exploration of the data for the user.
Opinion Mining is concerned with the identification and the classification of opinionated parts in texts. For concepts such as company names or product names opinion mining may recognize the associated opinions. Most approaches develop concepts for positive and negative opinions, which often consist of collections of words indicating an opinion. These often contain adjectives and adverbs, but sometimes also pronouns. Finding an optimal set is an important first step (Turney, 2002).
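A minimal lexicon-based sketch of this idea counts opinion words. The two word sets below are tiny hypothetical stand-ins for the curated lexicons used in practice.

```python
# Hypothetical opinion word sets; real systems use curated lexicons.
positive = {"great", "excellent", "wonderful"}
negative = {"terrible", "boring", "awful"}

def polarity(text):
    """Score a text by counting opinion words: > 0 positive, < 0 negative."""
    tokens = text.lower().split()
    return sum(t in positive for t in tokens) - sum(t in negative for t in tokens)

print(polarity("a wonderful film with excellent acting"))  # → 2
print(polarity("terrible plot and boring dialogue"))       # → -2
```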
The classification of documents into subjective and objective texts or text parts and, furthermore, the determination of their polarity is the core of opinion mining (Banea et al., 2008). Research often also determines the strength of the expressed opinion. Recent research frequently deals with user-generated content such as movie reviews (Pang et al., 2002) or posts (Abbasi et al., 2008) in which opinions are typically expressed.
Text and Data Mining offer tremendous potential for innovative applications. On the other hand, they often lead to new issues regarding privacy and other information ethical concerns. Some personal data which someone would reveal to others without worrying about privacy can be problematic when it is automatically connected to other data sets (van Wel & Royakkers, 2004). The individual can no longer judge the impact of the disclosure of individual data. In complex models, data may lead to the classification into high-risk groups. This may result in the suspension of loans or the denial of insurance. Even vital interests may be affected when certain embryonic stem cells are selected for fertilization.
Text Mining is related to several other academic disciplines or sub-disciplines. Information Retrieval also deals with the processing of large quantities of text data, and its representation mechanisms are very similar to those of Text Mining. However, the goals differ. Information Retrieval intends to identify a few relevant documents out of the entire collection, whereas Text Mining searches for patterns over all documents. Both areas focus on shallow text analysis and typically apply stemming and sometimes part-of-speech tagging. Language technology or computational linguistics on the other hand deals with techniques beyond the word level and intends to analyze syntax and semantics within language. Furthermore, Text Mining is related to Information Extraction which deals with the creation of structured data from unstructured data like text. A frequently used basic application is named entity recognition (NER). It identifies words which represent entities which may occur in several different forms and which sometimes are similar to common words (Curran & Clark 2003).
For many text mining applications, it remains unclear how powerful they are and what determines their success. Due to the large variety of approaches, it is often difficult to evaluate the quality of algorithms. When gold standard sets are created, their coverage remains unclear. The evaluation requires increased attention in the coming years.
Text Mining is a well-established discipline which relies on machine learning. Clustering of texts and the classification of texts or extracted entities are especially common applications. Accordingly, Text Mining will continue to grow in importance. The importance of Text Mining methods will be discussed not only by scientists but also by the public, as spying and surveillance reveal ethical issues of mining technologies. Future progress will be made through better algorithms and, even more so, through the integration of Text Mining applications into the work tasks of knowledge workers. That also means that more and more people will be concerned with Text Mining in the future.
Associations: Associations measure how often a word co-occurs with other words. The more often words occur close to each other compared to their general frequency, the higher their association will be.
Classification: Objects are assigned to pre-defined classes based on similarity. Similar objects are assigned to the same class. The function defining similarity is given by examples for the assignment, i.e. objects which have been assigned to a class before. The algorithm needs to learn a function which reflects the class definition as determined by the learning examples.
Clustering: Objects are grouped based on similarity. Each cluster contains objects which are more similar to each other than to objects in other clusters.
Concepts: Meaning is defined beyond a single word. A concept is a semantic entity which can be expressed by several words or by a group of words.
Information Retrieval: Information retrieval is concerned with the representation of knowledge and the subsequent search for relevant information within these knowledge sources. Information retrieval provides the technology behind search engines.
Opinion Mining: Opinion mining or sentiment analysis means finding and classifying opinionated parts of texts. These subjective parts need to be identified by Text Mining methods and separated from objective text parts. A typically applied technique is the search for words which express opinion.
Stemming: Stemming refers to the mapping of word forms to stems or basic word forms. Word forms may differ from stems due to morphological changes necessary for grammatical reasons. The plural of English nouns, for example, is mostly constructed by adding an s to the basic noun.