Jiri Klema - Research



Research Topics


Bioinformatics - genomics:

Mining Plausible Patterns from Genomic Data - searching for patterns in expression data that are meaningful with respect to current genomic knowledge.



Automated Information Extraction from Genomic Texts - use of definite clause grammars (DCGs) to extract entity names and understand their interactions.



Mining Patterns in Medical Sequential Data - application of various sequential mining methods to longitudinal medical data.


Industrial applications:

Intelligent Diagnosis And Learning In Centrifugal Pumps - on-line and learning-based diagnosis of cavitation in centrifugal pumps.



Analysis of Critical Process Parameters in Pharmaceutical Manufacturing - analysis of historical process data through the quantification of the impact of individual process and raw material parameters on key product quality attributes.


Short abstracts


Mining Plausible Patterns from Genomic Data

The discovery of biologically interpretable knowledge from gene expression data is one of the largest contemporary genomic challenges. As large volumes of expression data are being generated, there is a great need for automated tools to analyze them. However, these tools can produce an overwhelming number of candidate hypotheses, which an expert can hardly examine manually. Additional knowledge that helps focus automatically on the most plausible candidates can increase the value of an experiment significantly. Background knowledge available in literature databases, biological ontologies and other sources can be used for this purpose. We propose and verify a methodology for effectively mining and representing meaningful over-expression patterns. Each pattern is a bi-set: a group of genes over-expressed in a set of biological situations. The originality of the framework lies in its constraint-based nature and in an effective combination of constraints based on expression data with constraints based on background knowledge. The result is a limited set of candidate patterns that are most likely to be interpretable by biologists. Supplemental automatic interpretations ease this process. Different constraints generate plausible pattern sets with different characteristics.
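To make the notion of a constrained bi-set concrete, the following is a minimal brute-force sketch, not the framework's actual algorithm. It assumes that expression values have already been discretized into a set of over-expressed genes per situation, and that background knowledge is available as a set of annotation terms per gene. The function name `mine_bisets`, the gene and situation labels, and the annotation terms are all hypothetical.

```python
from itertools import combinations


def mine_bisets(overexpressed, annotations, min_genes=2, min_situations=2):
    """Enumerate (genes, situations) bi-sets by brute force.

    overexpressed: dict mapping situation -> set of genes over-expressed in it.
    annotations:   dict mapping gene -> set of background-knowledge terms.
    A bi-set is kept when it is large enough (expression-based constraint)
    and all of its genes share at least one annotation term
    (knowledge-based constraint).
    """
    situations = sorted(overexpressed)
    found = {}
    for k in range(min_situations, len(situations) + 1):
        for combo in combinations(situations, k):
            # Genes over-expressed in every situation of this combination.
            genes = set.intersection(*(overexpressed[s] for s in combo))
            if not genes or len(genes) < min_genes:
                continue
            # Terms shared by all genes in the group.
            shared = set.intersection(*(annotations.get(g, set()) for g in genes))
            if not shared:
                continue
            # For each gene group, keep only its largest situation set.
            key = frozenset(genes)
            if len(combo) > len(found.get(key, ())):
                found[key] = combo
    return [(sorted(genes), list(combo)) for genes, combo in found.items()]


# Toy data, invented for illustration only.
overexpressed = {
    "s1": {"g1", "g2", "g3"},
    "s2": {"g1", "g2"},
    "s3": {"g1", "g2", "g4"},
    "s4": {"g3", "g4"},
}
annotations = {
    "g1": {"apoptosis"},
    "g2": {"apoptosis", "cell cycle"},
    "g3": {"transport"},
    "g4": {"cell cycle"},
}
patterns = mine_bisets(overexpressed, annotations)
print(patterns)
```

Exhaustive enumeration is exponential in the number of situations, so this form is only usable on toy inputs. The point of a constraint-based approach is that such constraints can prune the search space instead of being checked after the fact.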

Automated Information Extraction from Gene Summaries

Automated extraction of links among biological entities from free biological texts has proven to be a difficult task. We propose and solve a modified task in which we extract the links from short textual gene summaries collected automatically from the NCBI website. The main simplification lies in the fact that each summary is unambiguously attached to a single gene. The agent part of each binary biological interaction is thus known by default, and the goal is to identify meaningful target parts in the summary. The outcome is a structured representation of each summary that can be used as background knowledge in subsequent mining of gene expression data. As the gene summaries are closely linked to the other structured information resources provided by the NCBI website, these resources can be used as an annotation tool and/or as feedback for optimizing the performance of the system being developed. In particular, we use gene ontology terms to evaluate and improve the information extraction process.

Mining Patterns in Medical Sequential Data

Sequential data represent an important source of potentially new medical knowledge. However, this type of data is rarely provided in a format suitable for immediate application of conventional learning or mining algorithms. We study and compare three fundamentally different sequential mining approaches -- windowing, episode rules and inductive logic programming. Windowing is one of the essential methods of data preprocessing, episode rules represent general sequential mining, and inductive logic programming extracts first-order features whose structure is given a priori by background knowledge. The three approaches are demonstrated and evaluated on the STULONG case study, a longitudinal preventive study of atherosclerosis whose data consist of series of long-term observations recording the development of risk factors and associated conditions. The intention is to identify frequent sequential/temporal patterns. Possible relations between the patterns and the onset of any of the observed cardiovascular diseases are also studied.
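Of the three approaches, windowing is the simplest to illustrate. The sketch below assumes one common variant: a fixed-width window slides over a single numeric series and each window is summarized by a few aggregate features, yielding rows that a conventional learner can consume. The function name `window_features`, the choice of features and the readings are hypothetical and are not taken from the STULONG data.

```python
from statistics import mean


def window_features(series, width, step=1):
    """Slide a fixed-width window over a numeric series and summarize each window."""
    if width < 1 or step < 1:
        raise ValueError("width and step must be positive")
    rows = []
    for start in range(0, len(series) - width + 1, step):
        window = series[start:start + width]
        rows.append({
            "start": start,                    # index of the first observation
            "mean": mean(window),              # average level within the window
            "min": min(window),
            "max": max(window),
            "trend": window[-1] - window[0],   # change from first to last value
        })
    return rows


# Invented blood-pressure readings from successive check-ups of one patient.
readings = [130, 135, 140, 150, 145]
features = window_features(readings, width=3)
for row in features:
    print(row)
```

A series shorter than the window yields no rows, so very short patient histories simply drop out of the resulting table.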

Intelligent Diagnosis And Learning In Centrifugal Pumps

This research topic addresses the problem of on-line diagnosis of cavitation in centrifugal pumps. The paper introduces an application of the Open Prediction System (OPS) to cavitation diagnosis, which results in an algorithmic framework for diagnosing cavitation in centrifugal pumps. The diagnosis is based on repeated evaluation of a data scan that provides a full record of the input signals observed over a fixed short period of time. Experimental verification of the algorithmic framework and the proposed methodology showed that a condition monitoring system built upon them is capable of diagnosing a wide range of cavitation conditions that can occur in a centrifugal pump, including very early incipient cavitation.

Capitalizing on Aggregate Data for Gaining Process Understanding

Continuous improvement of pharmaceutical manufacturing operations has not evolved at the same rate as it has in other industries. Though time-series data are routinely collected as part of equipment control systems, the data are usually not thoroughly evaluated. This topic investigates batch data, in-process and release test data, and time-series data from various operations in an effort to determine which parameters are most critical to the target pharmaceutical variable (dissolution, yield, etc.). We provide evidence of the value of process analytical technology (PAT) initiatives focused on the analysis of historical process data through the quantification of the impact of individual process and raw material parameters on key product quality attributes.
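One simple way to quantify the impact of a parameter is to rank parameters by the strength of their linear association with the quality attribute across historical batches. The sketch below uses the Pearson correlation coefficient for this purpose. This is an assumption made for illustration, not necessarily the analysis used in this work, and the parameter names and batch values are invented.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def rank_parameters(batches, target):
    """Sort parameters by absolute correlation with the target, strongest first."""
    scores = {name: pearson(values, target) for name, values in batches.items()}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


# Invented per-batch values of three hypothetical parameters.
batches = {
    "granulation_time": [10, 12, 14, 16, 18],
    "binder_lot_moisture": [2.0, 2.0, 2.0, 2.0, 2.0],
    "compression_force": [5, 3, 6, 2, 4],
}
# Invented dissolution result for the same five batches.
dissolution = [80, 78, 76, 74, 72]

ranking = rank_parameters(batches, dissolution)
for name, r in ranking:
    print(f"{name}: r = {r:+.2f}")
```

Correlation captures only linear, pairwise association and says nothing about causation, so a ranking of this kind is at best a screening step before more careful modelling.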

