Ontology-mediated Queries for Graph Databases
Principle Investigator: Mantas Šimkus
Modern software applications need to manage and access information with increasingly complex structure, and Knowledge Representation and Reasoning (KR&R) has proved to provide promising means to address this challenge. This area of Artificial Intelligence (AI) is developing techniques to enable automated reasoning about knowledge in a given application domain, captured in a suitably tailored machine-readable format. When knowledge with complex structure needs to be represented, building ontologies expressed in Description Logics (DLs) is one of the most popular choices [Baader et al., 2003]. For instance, DLs provide the foundations for the W3C-standardized languages for describing resources in the Semantic Web using the so-called OWL ontologies [OWL Working Group, 2009, Horrocks, 2008]. Supported by efficient reasoners, DLs are being successfully used to capture complex knowledge in life sciences, e.g., to systematize thousands of medical terms in the SNOMED CT ontology (Systematized Nomenclature of Medicine—Clinical Terms), which is used in the health care systems of the US, the UK, and other countries [Spackman, 2000]. A major application of DLs is the data integration paradigm called Ontology-based Data Access (OBDA) [Poggi et al., 2008]. Here an ontology provides a unified view (a conceptual schema) of the application domain, reconciling the differences among the multiple data sources and providing machine-readable domain knowledge. A key reasoning task in OBDA is answering ontology-mediated queries (OMQs). Intuitively, in this setting a user query formulated using the vocabulary of the conceptual schema is answered by an OBDA engine by incorporating information from the various information sources, and addressing the possible information incompleteness by employing the represented domain knowledge to infer new facts when possible.
DLs allow to model the domain of interest using unary and binary relation symbols, called concept names and role names, respectively. The semantics to DL ontologies is given in terms of models, which are nothing else but relational databases with only unary and binary relations, satisfying the ontology seen as a collection of integrity constraints. These structures can be naturally represented as graphs, whose nodes are labeled with concept names and edges are labeled with role names. Such graphs are a basic object of investigations in graph databases (see, e.g., [Consens and Mendelzon, 1990, Barceló et al., 2012a, Barceló et al., 2015, Barceló et al., 2014, Bourhis et al., 2014, Reutter, 2013, Barceló et al., 2012b]), which are gaining increasing popularity as an alternative to traditional relational databases. Graph databases provide more flexibility as they do not require a database schema, which in classic relational databases imposes a very rigid structure on the data. E.g., RDF triple stores for Web data are concrete instances of graph databases. OBDA and graph databases are close in spirit in the sense that both deal with loosely structured data. However, they also have major differences, which this project aims to reconcile.
The first difference between OBDA and graph databases is the assumption made regarding completeness of information. Most DLs are fragments of classical first-order logic [Borgida, 1996], and thus have semantics that makes the open-world assumption. That is, the semantics of an ontology is given by the set of all its models. This view naturally supports modeling and reasoning in the presence of incomplete knowledge, and hence is adopted by the current OBDA approaches. In contrast, in the traditional (graph) database setting one makes the closed-world assumption; intuitively, it means that every fact that is not explicitly present in the database must be assumed to be false. This is natural when the information stored in a database is known to be complete. For example, a municipality-provided table of bus routes can be assumed complete, i.e. if the table does not include a route to a certain destination, then it can be assumed that there is no such route. It has been acknowledged by both DL researchers and database theorists that these assumptions are too strong, and that formalisms simultaneously supporting open-world and closed-world reasoning need to be developed [Lutz et al., 2012, Motik and Rosati, 2010, Casoli and Lenzerini, 1990, Knorr et al., 2011, Fan and Geerts, 2010, Benedikt et al., 2016]. E.g., in an OBDA system that integrates information about a city’s restaurants and bus routes, it may be useful to view the municipality-provided data about existing bus routes as complete, while a crowd-sourced collection of restaurants as incomplete. Ideally, query answering algorithms should exploit this information to exclude irrelevant models and infer more informative answers. A recent approach to this challenge are extensions of DLs with closed predicates [Franconi et al., 2011, Franconi et al., 2013, Lutz et al., 2013, Lutz et al., 2015, Ngo et al., 2016], which allow the user to specify which relations of the input database should be considered as complete.
The second difference between OBDA and graph databases is the query languages that are used for accessing data. The DL community has mainly focused on OMQs based on conjunctive queries (CQs), which are the most common query language for relational databases. The computational complexity and efficient algorithms for such OMQs have received significant attention in the recent years (see, e.g., [Ortiz and Šimkus, 2012, Bienvenu and Ortiz, 2015] for surveys). However, CQs lack navigational features, which are central in query languages for graph databases. E.g., the natural query to retrieve pairs (x,y) such that y is a police station that is reachable from a bus stop y using one or more bus connections is not supported by the majority of OMQ answering approaches. In contrast, navigational queries like regular path queries (RPQs) in graph databases effortlessly support such and more complex queries.
In this project we will study the database-theoretic foundations of combining closed predicates and navigational features in ontology-mediated query answering, paving the way for very powerful techniques for intelligent data access, and going significantly beyond the current state-of-the-art. In particular, the main goal of this project is to study the relative expressiveness and the complexity of static analysis problems for ontology-mediated queries with navigational features and closed predicates (ONCs).
Characterizing Relative Expressiveness This goal aims to compare the expressive power of ONCs with the expressive power of classic query languages. To this end, we will mainly develop translations from ONCs into (suitable variations) of Datalog. This study will clarify the relationship between the emerging and the classic approaches to data access, e.g., it will allow us to transfer existing results from the setting of Datalog to the setting of ONCs. This study will also open the way to reusing existing efficient Datalog engines for answering ONCs in real-life scenarios.
Complexity of Static Analysis Problems We will study the computational complexity of reasoning tasks aimed at supporting the design of ONCs. In particular, we will investigate the query containment and the query emptiness problems, the most primitive static analysis problems that lie at the core query design and optimization techniques.
By studying the combination of navigational features and closed predicates, this project aims to provide substantial impact in at least two research areas: in KR&R, and in Database Theory. From the KR&R perspective, ONCs are superior to standard OMQs because they allow for a more flexible navigation of information, and allow to assert that certain parts of the knowledge are complete. From the Database Theory perspective, ONCs are superior to standard navigational query languages for graph databases because they allow to declare some relations as incomplete (in contrast to the default assumption that they are complete), and then interrelate the complete and incomplete relations by means of an ontology.
[Baader et al., 2003] Baader, F., Calvanese, D., McGuinness, D. L., Nardi, D., and Patel-Schneider, P. F., editors (2003). The Description Logic Handbook: Theory, Implementation, and Applications.
Cambridge University Press.
[Barceló et al., 2012b] Barceló, P., Pérez, J., and Reutter, J. L. (2012b). Relative expressiveness of
nested regular expressions. In Proc. of the A. Mendelzon Int. WS on Found. of Data Management
(AMW 2012), pages 180–195. CEUR-WS.org.
[Bienvenu and Ortiz, 2015] Bienvenu, M. and Ortiz, M. (2015). Ontology-mediated query answering
with data-tractable description logics. In Proc. of Reasoning Web Summer School 2015, volume 9203 of
LNCS, pages 218–307. Springer.
[Bourhis et al., 2014] Bourhis, P., Krötzsch, M., and Rudolph, S. (2014). How to best nest regular
path queries. In Proc. of DL 2014, volume 1193 of CEUR Workshop Proceedings, pages 404–415.
[Franconi et al., 2013] Franconi, E., Kerhet, V., and Ngo, N. (2013). Exact query reformulation over
databases with first-order and description logics ontologies. J. Artif. Intell. Res. (JAIR), 48:885–922.
[Ortiz and Šimkus, 2012] Ortiz, M. and Šimkus, M. (2012). Reasoning and query answering in
description logics. In Proc. of the Reasoning Web Summer School 2012, volume 7487 of LNCS, pages
[OWL Working Group, 2009] OWL
Working Group, W. (27 October 2009). OWL 2 Web Ontology Language: Document Overview. W3C
Recommendation. Available at http://www.w3.org/TR/owl2-overview/.