Technology Overview
Data mining requires a hardware and software infrastructure capable of supporting high-throughput data processing, along with a network capable of carrying data from the database to the visualization workstation. With a robust hardware and software infrastructure in place, processes such as machine learning can be used to automatically manage and refine the knowledge discovery and data mining processes. This work can be performed with minimal user interaction once a knowledgeable researcher has established the basic design of the system.
The core technologies that actually perform the work of data mining, whether under computer control or directed by users, provide a means of reducing the complexity and the effective size of the databases. This focus isn't limited to genome sequences and protein structures, but extends to the wealth of data hidden in the online literature. Advanced text mining methods are used to identify relevant textual data and place them in the proper context.
Infrastructure
The infrastructure in a data mining laboratory includes high-speed Internet and intranet connectivity, a data warehouse with a data dictionary that defines a standard vocabulary and data format, several databases, and high-performance computer hardware. Some form of database management system (DBMS) is required to support queries and ensure data integrity. The infrastructure can be based on a central high-performance computer. However, most systems support some form of parallel processing, so that intermediate results from one workstation can be fed to another workstation. For example, link analysis performed on one workstation may be fed the regression analysis results from another workstation. The trend toward distributed data mining using relatively inexpensive desktop hardware is largely a reflection of the economics of modern computing. In many cases, the price-performance ratio of desktop hardware is superior to that of mainframe computers.
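As a rough sketch of how intermediate results might flow from one process to another, the following Python fragment runs a simple regression in one worker process and feeds its output to a second, stand-in "link analysis" worker over a queue. The two-stage design, the data, and the function names are illustrative assumptions, not a description of any particular laboratory's system.

```python
# Minimal sketch of a two-stage, distributed-style pipeline in which the
# intermediate results of one worker (a simple regression) are fed to a
# second worker (a stand-in "link analysis" step). Data and function names
# are hypothetical illustrations only.
from multiprocessing import Process, Queue

def regression_worker(samples, out_queue):
    # Ordinary least-squares slope and intercept for y = a*x + b.
    n = len(samples)
    sx = sum(x for x, _ in samples)
    sy = sum(y for _, y in samples)
    sxx = sum(x * x for x, _ in samples)
    sxy = sum(x * y for x, y in samples)
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    out_queue.put((slope, intercept))

def link_analysis_worker(in_queue):
    # Consume the regression results produced on the other "workstation".
    slope, intercept = in_queue.get()
    print(f"received model: y = {slope:.2f}x + {intercept:.2f}")

if __name__ == "__main__":
    q = Queue()
    samples = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8)]
    p1 = Process(target=regression_worker, args=(samples, q))
    p2 = Process(target=link_analysis_worker, args=(q,))
    p1.start(); p2.start()
    p1.join(); p2.join()
```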
Pattern Recognition
Data mining involves identifying patterns and relationships that often are not obvious in large, complex data sets. This pattern recognition is most often concerned with the automatic classification of character sequences that represent nucleotide bases or molecular structures, and of 3-D protein structures. From an information-processing perspective, pattern recognition can be viewed as a data simplification process that filters extraneous data from consideration and labels the remaining data according to a classification scheme.
As illustrated in Figure 3, the major steps in the pattern recognition and discovery process are feature selection, measurement, processing, feature extraction, classification, and labeling. Given a pattern, the first step in pattern recognition is to select, from the universe of available features, the set of features or attributes that will be used to classify the pattern. Next, the original pattern must be transformed into a representation that can be easily manipulated programmatically. After the data are processed to remove noise, the data are searched for the features defined as relevant to pattern matching. In the classification stage, data are classified based on measurements of similarity with other patterns. The pattern recognition process ends when a label is assigned to the data, based on its membership in a class.
Figure 3 Stages in the pattern recognition process.
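To make the stages in Figure 3 concrete, the following is a minimal sketch that pushes a short nucleotide string through measurement, processing, feature extraction, classification, and labeling. The selected feature (GC content), the noise filter, and the two-class threshold are invented purely for illustration.

```python
# Illustrative walk-through of the Figure 3 stages on a nucleotide string.
# The feature (GC content), the noise filter, and the two-class threshold
# are hypothetical choices made only to show the flow of the pipeline.

def measure(raw):
    # Measurement: turn the raw pattern into a manipulable representation.
    return list(raw.upper())

def preprocess(symbols):
    # Processing: remove "noise" (anything that is not a valid base).
    return [s for s in symbols if s in "ACGT"]

def extract_features(bases):
    # Feature extraction: compute the selected feature (GC content).
    return sum(1 for b in bases if b in "GC") / len(bases)

def classify(gc_content, threshold=0.5):
    # Classification and labeling: assign the pattern to a class.
    return "GC-rich" if gc_content >= threshold else "AT-rich"

raw_pattern = "acg-tGGcN cgg"
bases = preprocess(measure(raw_pattern))
label = classify(extract_features(bases))
print(label)   # -> GC-rich
```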
Machine Learning
The pattern matching and pattern discovery components of data mining are often performed by using machine learning techniques. Machine learning encompasses a variety of methods that represent the convergence of statistics, biological modeling, adaptive control theory, psychology, and artificial intelligence (AI). The spectrum of machine learning technologies applicable to data mining in bioinformatics includes inductive logic programming, genetic algorithms, neural networks, statistical methods, Bayesian methods, decision trees, and Hidden Markov Models.
Inductive logic programming uses a set of rules or heuristics to categorize data. Genetic algorithms are based on evolutionary principles wherein a particular function or definition that best fits the constraints of an environment survives to the next generation, and the other functions are eliminated. Neural networks learn to associate input patterns with output patterns in a way that allows them to categorize new patterns and to extrapolate trends from data. The statistical methods used to support data mining are generally some form of feature extraction, classification, or clustering. Decision trees are hierarchically arranged questions and answers that lead to classification. A Hidden Markov Model (HMM) is a statistical model for an ordered sequence of symbols, acting as a stochastic state machine that generates a symbol each time a transition is made from one state to the next. Transitions between states are specified by transition probabilities.
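The description of an HMM as a stochastic state machine can be sketched in a few lines of Python. The two states, the nucleotide alphabet, and the transition and emission probabilities below are invented for illustration and are not taken from any published model.

```python
import random

# Toy two-state HMM over nucleotides. Each time a transition is made from
# one state to the next, the new state emits one symbol. The states and
# probabilities are invented for illustration.
transitions = {"AT-rich": {"AT-rich": 0.9, "GC-rich": 0.1},
               "GC-rich": {"AT-rich": 0.2, "GC-rich": 0.8}}
emissions = {"AT-rich": {"A": 0.4, "T": 0.4, "C": 0.1, "G": 0.1},
             "GC-rich": {"A": 0.1, "T": 0.1, "C": 0.4, "G": 0.4}}

def generate(length, state="AT-rich"):
    sequence = []
    for _ in range(length):
        # Choose the next state, then emit a symbol from that state.
        state = random.choices(list(transitions[state]),
                               weights=transitions[state].values())[0]
        symbol = random.choices(list(emissions[state]),
                                weights=emissions[state].values())[0]
        sequence.append(symbol)
    return "".join(sequence)

print(generate(30))
```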
Regardless of the underlying technology, most machine learning follows the general process outlined in Figure 4. Input data are fed to a comparison engine that compares the data with an underlying model. The results of the comparison engine then direct a software actor to initiate some type of change. This output, whether it takes the form of a change in the data or a modification of the underlying model, is evaluated by an evaluation engine, which uses the underlying goals of the system as a point of reference. Feedback from the actor and the evaluation engine directs changes in the model. In this scenario, the goals can be standard patterns that are known to be associated with the input data. Alternatively, the goals can be states, such as minimal change in output compared with the system's previous encounter with the same data.
Figure 4 The machine learning process.
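A minimal sketch of the loop in Figure 4 follows, with the comparison engine, actor, and evaluation engine reduced to small functions that adjust a single model parameter toward a goal of low error. The one-parameter model, the data, the learning rate, and the tolerance are all hypothetical.

```python
# Minimal sketch of the Figure 4 loop: a comparison engine measures the gap
# between the model's output and the input data, an "actor" changes the
# model, and an evaluation engine decides when the goal (small error) is met.
# The model (a single scaling factor), data, and learning rate are invented.

model = {"scale": 1.0}
data = [(1.0, 3.1), (2.0, 5.9), (3.0, 9.2)]   # (input, observed output)

def compare(model, data):
    # Comparison engine: mean gap between observed and modeled output.
    return sum(y - model["scale"] * x for x, y in data) / len(data)

def act(model, error, rate=0.1):
    # Actor: modify the underlying model in response to the comparison.
    model["scale"] += rate * error

def evaluate(error, tolerance=0.05):
    # Evaluation engine: judge the output against the system's goal.
    return abs(error) < tolerance

for step in range(100):
    error = compare(model, data)
    if evaluate(error):
        break
    act(model, error)

print(f"learned scale: {model['scale']:.2f} after {step} updates")
```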
Text Mining
The primary store of functional data that links clinical medicine, pharmacology, sequence data, and structure data is in the form of biomedical documents in online bibliographic databases such as PubMed. Mining these databases is expected to reveal the relationships between structure and function at the molecular level, as well as how those relationships bear on pharmacology and clinical medicine.
However, text mining, that is, automatically extracting these data from documents published in the form of unstructured free text, often in several languages, is a non-trivial task. Although computer languages such as LISP (LISt Processing) have been developed expressly for handling free text, working with free text remains one of the most challenging areas of computer science. This is primarily because, unlike the sequence of amino acids in a protein, natural language is ambiguous and often references data not contained in the document under study. For example, a research article in PubMed on the expression of a particular gene may contain numerous synonyms, acronyms, and abbreviations. Furthermore, despite editing to constrain the sentences to proper English (or another language), the syntax (the ordering of words and their relationships to other elements in phrases and sentences) is typically author-specific. The article may also reference an experimental method that isn't defined because it's assumed to be common knowledge among the intended readership. In addition, text mining is complicated by the variability of how data are represented in a typical text document. Data on a particular topic may appear in the main body of text, in a footnote, in a table, or embedded in a graphic illustration.
The most promising approaches to text mining online documents rely on natural language processing (NLP), a technology that encompasses a variety of computational methods ranging from simple keyword extraction to semantic analysis. The simplest NLP systems work by parsing documents and identifying the documents with recognized keywords such as "protein" or "amino acid". The contents of the tagged documents can then be copied to a local database and later reviewed.
More-elaborate NLP systems use statistical methods to recognize not only relevant keywords, but their distribution within a document. In this way, it's possible to infer context. For example, an NLP system can identify documents with the keywords "amino acid", "neurofibromatosis", and "clinical outcome" in the same paragraph. The result of this more-advanced analysis is document clusters, each of which represents data on a specific topic in a particular context.
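Both levels of analysis described above can be sketched briefly: simple keyword tagging of documents, followed by a check that a set of keywords co-occurs in the same paragraph. The documents below are invented, and the keyword list echoes the example in the text.

```python
# Sketch of two levels of keyword-based NLP. The documents and keyword list
# are invented; a real system would read abstracts from a literature
# database rather than in-memory strings.
KEYWORDS = {"amino acid", "neurofibromatosis", "clinical outcome"}

documents = {
    "doc1": "Sequence analysis of the tumor suppressor gene.\n\n"
            "The amino acid change seen in neurofibromatosis predicted "
            "the clinical outcome.",
    "doc2": "Amino acid composition of the protein was determined.\n\n"
            "Clinical outcome was not reported.",
}

def tag_documents(docs, keywords):
    # Simplest approach: flag any document that contains a recognized keyword.
    return [name for name, text in docs.items()
            if any(k in text.lower() for k in keywords)]

def same_paragraph(docs, keywords):
    # More elaborate approach: require all keywords in a single paragraph,
    # which lets the system infer context rather than mere relevance.
    hits = []
    for name, text in docs.items():
        if any(all(k in p.lower() for k in keywords)
               for p in text.split("\n\n")):
            hits.append(name)
    return hits

print(tag_documents(documents, KEYWORDS))   # ['doc1', 'doc2']
print(same_paragraph(documents, KEYWORDS))  # ['doc1']
```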
The most advanced NLP systems work at the semantic level, that is, the analysis of how meaning is created by the use and interrelationships of words, phrases, and sentences.
The processing phase of NLP involves one or more of the following techniques (a minimal sketch combining several of them appears after the list):
- Stemming: Identifying the stem of each word. For example, "hybridized", "hybridizing", and "hybridization" would be stemmed to "hybrid". As a result, the analysis phase of the NLP process has to deal with only the stem of each word, not every possible permutation.
- Tagging: Identifying the part of speech represented by each word, such as noun, verb, or adjective.
- Tokenizing: Segmenting sentences into words and phrases. This process determines which words should be retained as phrases and which ones should be segmented into individual words. For example, "Type II Diabetes" should be retained as a phrase, whereas "A patient with diabetes" would be segmented into four separate words.
- Core terms: Identifying significant terms, such as protein names and experimental method names, based on a dictionary of core terms. A related process is ignoring insignificant words such as "the", "and", and "a".
- Resolving abbreviations, acronyms, and synonyms: Replacing abbreviations with the words they represent, and resolving acronyms and synonyms to a controlled vocabulary. For example, "DM" and "Diabetes Mellitus" could be resolved to "Type II Diabetes", depending on the controlled vocabulary.
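Below is the small sketch referenced above, combining several of these processing steps: synonym and acronym resolution, phrase-aware tokenizing, stop-word removal, and a crude suffix stemmer. The phrase list, stop words, synonym table, and suffix rules are invented stand-ins for a real controlled vocabulary.

```python
# Hedged sketch combining several processing steps: tokenizing, stop-word
# removal, abbreviation/synonym resolution, and a crude suffix stemmer.
# All vocabularies below are invented stand-ins.
PHRASES = {"type ii diabetes"}
STOP_WORDS = {"the", "and", "a", "with", "of"}
SYNONYMS = {"dm": "type ii diabetes", "diabetes mellitus": "type ii diabetes"}
SUFFIXES = ("ization", "izing", "ized", "ation", "ing", "ed", "s")

def stem(word):
    # Stemming: strip a known suffix, e.g. "hybridization" -> "hybrid".
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def tokenize(text):
    # Tokenizing: keep known phrases intact, split everything else on spaces.
    text = text.lower()
    for phrase, replacement in SYNONYMS.items():
        # Naive substring substitution to resolve synonyms and acronyms.
        text = text.replace(phrase, replacement)
    tokens, rest = [], text
    for phrase in PHRASES:
        if phrase in rest:
            tokens.append(phrase)
            rest = rest.replace(phrase, " ")
    tokens += [w for w in rest.split() if w not in STOP_WORDS]
    return [stem(t) if t not in PHRASES else t for t in tokens]

print(tokenize("A patient with DM and hybridization of the probe"))
# -> ['type ii diabetes', 'patient', 'hybrid', 'probe']
```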
The analysis phase of NLP typically involves the use of heuristics, grammar, or statistical methods. Heuristic approaches rely on a knowledge base of rules that are applied to the processed text. Grammar-based methods use language models to extract information from the processed text. Statistical methods use mathematical models to derive context and meaning from words. Often, these methods are combined in the same system. For example, grammar-based and statistical methods are frequently used together in NLP systems to improve performance beyond what could be accomplished by using either approach alone.
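As a final sketch, the snippet below combines a heuristic rule with a simple statistical score when deciding whether a processed sentence describes a biological relationship. The rule, the term weights, and the threshold are invented; a production system would derive such parameters from an annotated corpus.

```python
# Hedged sketch of combining a heuristic rule with a statistical score in
# the analysis phase. The rule, the term weights, and the threshold are
# invented for illustration.
RULE_VERBS = {"causes", "regulates", "inhibits"}      # heuristic knowledge base
TERM_WEIGHTS = {"gene": 0.4, "mutation": 0.3, "disease": 0.3, "patient": 0.1}

def heuristic_match(tokens):
    # Heuristic: the sentence must contain a known relationship verb.
    return any(v in tokens for v in RULE_VERBS)

def statistical_score(tokens):
    # Statistical: sum the weights of known terms, normalized by length.
    return sum(TERM_WEIGHTS.get(t, 0.0) for t in tokens) / len(tokens)

def is_relationship(sentence, threshold=0.1):
    tokens = sentence.lower().split()
    # Combine the two methods: both must agree before the sentence is kept.
    return heuristic_match(tokens) and statistical_score(tokens) >= threshold

print(is_relationship("The mutation causes disease in the patient"))  # True
print(is_relationship("The gene was sequenced yesterday"))            # False
```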