Development and applications of new semantic data mining methods in life sciences

Project coordinator: dr. Nada Lavrač, IJS

Coordinator for NIB: prof. dr. Metka Filipič

Code: J2-5478

Duration: 1.8.2013 - 31.7.2016


Knowledge discovery in databases is the area of computer science concerned with the automatic search and exploration of large volumes of data, with the goal of finding new hypotheses in the form of models and patterns induced automatically from the data. The discovered models and patterns are especially interesting if they are unexpected or if they help confirm as yet unproven hypotheses. A limitation of current publicly available data mining and knowledge discovery platforms is that they can handle only simple tabular data. Motivated by the increasing volume of semi-structured, heterogeneous and distributed data, the proposed SemDM project addresses this limitation by enhancing currently available data mining platforms with the ability to use the distributed, heterogeneous information and knowledge sources required for data analysis in knowledge-intensive domains.

The project has the following objectives:

- To develop new algorithms for Semantic Data Mining (SemDM) which will enable knowledge discovery from data stored in heterogeneous (structured, semi-structured and unstructured) and distributed data and knowledge sources, including semantically annotated data stored in publicly available ontologies (Gene Ontology and other knowledge sources available in the Linked Open Data cloud).
- To develop a novel, science-oriented data mining platform, ClowdFlows, which will upgrade our recently developed Orange4WS platform to enable browser-based construction of innovative data mining workflows from local and distributed data processing and mining services.
- To apply and validate the proposed service-oriented Semantic Data Mining approach in two case studies, one in breast cancer data analysis and another in the discovery of subgroups of glioma patients to validate novel molecular markers.

In the glioma case study, researchers from the Jožef Stefan Institute (JSI) and the National Institute of Biology (NIB) will jointly seek new findings concerning glioblastoma (GBM), the most common and most aggressive form of glioma. Recently, several biomarkers have been proposed as prognostic and predictive factors of a patient’s response to therapy, but so far none of them has been applied in therapeutics. There is a need to decipher the interactions among contributing genes in a clinical setting, in order to diagnose tumor grade quickly and accurately and to predict the prognosis of a particular patient. We argue that this can be achieved by a systems biology approach based on discovering subgroups of GBM patients, most likely based on their cell of origin (stem cells) and infiltrating stromal (stem) cells, resulting in distinct patterns of tumor progression. The project aims to take advantage of studying GBM cancer stem cells and stromal supporting cells to identify genes (biomarkers) that are relevant for GBM prognosis and targeting.

The project will contribute to the development of new Semantic Data Mining algorithms, to the improvement of their public accessibility through the web-based ClowdFlows platform, and to the generation of new knowledge in the medical and bioinformatics domains. The work will be performed in close collaboration between data mining experts from JSI and domain experts from NIB.

Significance for science

This project addresses the open problem of assisting scientists with the increasingly daunting task of heterogeneous and distributed information fusion and knowledge discovery. Solving this problem requires the development of a new computational paradigm that integrates ideas from different supporting domains. An adequate solution will result in new technologies that are relevant to a range of applications, some of which are also mentioned in the EU FP7 ICT work programme, such as Challenge 4 on Content and Challenge 5 on Healthcare. The project covers issues such as knowledge management and creation, but goes beyond them in assisting users (particularly scientists) in knowledge discovery across distributed information repositories. It will advance the state of the art by developing a theoretical framework for semantic data mining, new data mining algorithms, and a new approach to interactively formulating and refining powerful knowledge discovery workflows. The proposed project thus addresses an open problem and pursues a long-term objective with high technological potential. Successful results of the SemDM project can contribute to Europe’s knowledge industry, enabling it to become more effective, efficient and competitive. The challenges addressed by the SemDM project cannot be adequately met with existing ICT methodologies or their incremental improvements, since the methods developed within SemDM will be substantially different from existing information fusion and knowledge discovery technologies. They will require the collaboration of scientists with diverse backgrounds to tackle challenges in innovative information fusion, data mining, distributed information retrieval, and sophisticated user interfaces.
A successful outcome of the project may have, first, a significant impact on data mining technology and on science and, in the longer term, when adapted to knowledge discovery, a considerable impact on the ability of Europe’s private and public sectors to analyse public data. The SemDM project has the potential to implement and demonstrate a paradigm shift in information and knowledge management, discovery, fusion and understanding. The SemDM prototype will establish a strong scientific and technological basis for a broader, interdisciplinary research community, and will help cultivate the underlying methodologies to a level at which they can attract investment from industry, especially in the pharmaceutical and biotechnology sectors.