Dagger: Digging for Interesting Aggregates in RDF Graphs


Research environment

We describe below the research environment of the internship: the Cedar team of Inria.

Inria

Inria, the French Institute for Research in Computer Science and Automation (Institut national de recherche en informatique et en automatique, in French), is a French national research institution focusing on computer science and applied mathematics. It aims at promoting “scientific excellence for technology transfer and society”. Research work is organized into “project-teams” that gather researchers from similar domains.
Inria has 8 research centers in total (in Bordeaux, Grenoble-Inovallée, Lille, Nancy, Paris-Rocquencourt, Rennes, Saclay, and Sophia Antipolis) and also collaborates with academic research labs outside these centers, such as the LRI lab of Université Paris Sud or the LIX lab of Ecole Polytechnique. Research teams are divided into 5 different fields:
• Applied Mathematics, Computation and Simulation
• Algorithmics, Programming, Software and Architecture
• Networks, Systems and Services, Distributed Computing
• Perception, Cognition and Interaction
• Digital Health, Biology and Earth
Cedar belongs to “Perception, Cognition and Interaction”, under the sub-category of “Data and Knowledge Representation and Processing”.
Inria employs 3800 people, including 1300 researchers, 1000 Ph.D. students and 500 postdoctoral researchers.

Inria: organization

The Chairman and CEO of Inria is Antoine Petit. He holds a PhD, was a professor of computer science, and formerly served as Deputy Managing Director of Inria.
The management team has 18 members in total, including the presidents of the different research centers. Chart 2.1 illustrates the general organization of Inria.
The organization of Inria also includes an administrative committee, two scientific committees, and regular visits by an external committee.

Inria: key figures

Below, we collect some recent key figures of Inria (updated in July 2017), which illustrate different aspects of the institute in 2016.
• Scientific activities
– 183 Inria project-teams
– 4,500 scientific publications per year
• Industrial relations
– 410 active patents
– 126 software programmes registered with France’s Software Protection Agency in 2016
– 44 Inria start-ups since 2010
• Structure
– 8 research centres located throughout France (Paris, Rennes, Sophia Antipolis, Grenoble, Nancy, Bordeaux, Lille and Saclay) and a head office in Rocquencourt, near Paris.
• Human resources
– 2,400 members from 102 different countries
– 1,200 PhD students
• Budgetary resources
– Total budget: €231M
– Proportion self-financed: 25%

Inria Saclay

Inria Saclay is the center with which I am affiliated. It was created on 1 January 2008. The center has 450 scientists and 100 research support staff. It has 31 research teams, 23 of which are joint teams with other establishments such as CEA Saclay and École Polytechnique. Many of these partner establishments are located on the “Plateau de Saclay”.
In addition to its natural collaborations within the European Union, Inria Saclay also participates in a large number of collaborative projects around the world. These joint projects are created through the personal contacts of individual researchers and teams, or through bilateral or multilateral agreements. Research partners include universities such as the University of California, Berkeley and Stanford University.

The Cedar team

I joined the Cedar team of Inria Saclay. Created in 2016, it is a joint team between Inria Saclay and LIX (CNRS – UMR 7161 and Ecole Polytechnique). The team focuses on “Rich data analytics at cloud scale”.
The team has 3 permanent members: Ioana Manolescu (Inria senior researcher, team leader), Yanlei Diao (École Polytechnique professor, on a joint position with Télécom ParisTech) and Michaël Thomazo (Inria junior researcher, associate team leader). The team also has 2 associated members: François Goasdoué (Professor at Univ. Rennes 1) and Xavier Tannier (LIMSI, CNRS/U. Paris Sud).
At present, the team has 7 Ph.D. students, 2 engineers, 2 interns (including me) and 1 post-doc.

Research themes

The research projects of the Cedar team are focused on two themes:
• Exploiting parallel data processing infrastructures. This direction aims at developing highly scalable, parallel Big Data storage and processing tools.
• Seeking new paradigms of user interaction with Big Data. This direction aims at devising new tools for extracting value from Big Data, based on exploratory querying, analytics for semantic graphs, etc.

Current projects

ContentCheck
The project aims at designing new models, algorithms and tools for fact checking.
Fact-checking is the task of assessing the factual accuracy of claims, typically prior to publication. Modern fact-checking is faced with a triple revolution in terms of scale, complexity, and visibility, as claims and background knowledge are increasingly digital.
This project brings together academic labs with expertise in data management, natural language processing, automated reasoning and data mining, and a fact-checking team of journalists from a major French web media outlet.
The team working on this project aims to establish fact-checking as a data management problem, endow it with sound foundations, design and deploy new algorithms for automating fact-checking, and validate them through close interaction with the journalists.
WebClaimExplain
It is a joint project with AIST (Advanced Industrial Science and Technology) in Japan.
The recent evolution of the Internet, such as the emergence of social networks, open data and sensor networks, has made it challenging for people and businesses to cope with the information deluge. Virtually any decision, from voting to calling the doctor or buying stocks, is based on facts we find around us, and increasingly on the Internet. Facts are published by individuals (e.g. journalists, lobbyists, social users), organizations (e.g. public relations agencies, media outlets, governments), or machines (e.g. news generators, disaster monitors). The goal of this research is to create tools to find explanations for facts and verify claims made online. Several challenges emerge in the pursuit of such a goal:
• Statements are generally made in natural languages, which are notoriously hard to process algorithmically;
• Even when statements are available in machine-processable form, determining the “truth” of claims is difficult because of the inherent lack of contextual information (time, space, political views, belief systems, etc.);
• Assuming “sufficient” context is available, one still needs to use external sources and inference mechanisms to draw conclusions; if the trustworthiness of such sources and rules is subject to caution, this may lead to weak or simply wrong conclusions. In this respect, it is clear that the process cannot be fully automated.
The main focus of our work will be explanation finding via trusted sources, based on the observation that one can only trust a statement if one can explain it through rules and proofs that can themselves be trusted.
RDFSummary
This project seeks to devise a query-oriented tool for summarizing RDF graphs. The RDF summarization problem is the following: given an RDF graph, find an RDF summary graph that summarizes the input dataset as accurately as possible, while being possibly orders of magnitude smaller than the original graph. Such a summary can be used in a variety of contexts: to help an RDF application designer get acquainted with a new dataset, as a first-level user interface, or as a support for query optimization, as is traditionally the case in semi-structured graph data management.
Different kinds of summaries are proposed, depending on how typed resources are treated and on the degree of similarity required between the property sets of resources: the weak, typed-weak, strong and typed-strong summaries.
A set of sample summaries of different RDF datasets can be accessed at: https://team.inria.fr/cedar/projects/rdfsummary/.


Semantic Web

The semantic web is an extension of the Web through standards set by the World Wide Web Consortium (W3C). The term was coined by Tim Berners-Lee; it aims at creating a new generation of the Web (Web 3.0) on which data can be processed and understood by machines.
Currently, the World Wide Web is based mainly on HTML (Hypertext Markup Language) documents. HTML is a markup convention used for coding web pages. An HTML page is composed of metadata tags, along with the data to be expressed, and can be rendered by a web browser. Nowadays, web technologies are among the most important information technologies, and large amounts of information are expressed using web pages. However, HTML has no capability to express semantics. For instance, by reading an HTML page, we may know that the profession of a person called Bob is “musician”. However, these pieces of information are placed in different tags of a plain HTML page, with no explicit relationships between them. Consequently, a machine cannot know that Bob is a musician and that a musician is a person. More generally, machines cannot obtain or infer semantics from a plain HTML page.
The semantic web is designed to make the web understandable by machines. It involves publishing in languages specifically designed for data: Resource Description Framework (RDF), Web Ontology Language (OWL) and Extensible Markup Language (XML). HTML describes documents and the links between them. RDF, OWL and XML, by contrast, can describe things in the real world such as People, Position or Country.
The set of semantic web technologies aims at replacing the current content of Web documents. By publishing with semantic web technologies, content is combined with meaning. A machine can thus process the content and infer knowledge from it, using processes similar to human deductive reasoning and inference.

RDF, RDF graph

RDF (the Resource Description Framework) is the W3C standard for describing interconnected resources. It is designed as a metadata data model, and takes the form of subject-predicate-object expressions, known as triples. The subject denotes the resource; the predicate denotes an aspect or characteristic of the resource and expresses a relationship between the subject and the object.
For example, one way to represent the notion « Bob has the profession musician » in RDF is the triple:
• a subject denoting Bob
• a predicate denoting « has the profession »
• an object denoting « Musician »
This standard for describing resources is a key component of the Semantic Web project. By using RDF across the World Wide Web, software can process and exchange machine-readable data, making it possible to build more intelligent applications such as Knowledge Graphs.
RDF can be expressed in different formats, which correspond to different ways of storing and transmitting data. The most important formats are: Notation3, Turtle and N-Triples.
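To make the triple about Bob and one of these formats more concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the `http://example.org/` URIs, the `Triple` class and the `to_ntriples` helper are hypothetical names introduced here, and the serializer only covers URIs and simple string literals, not the full N-Triples grammar.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One RDF statement: subject, predicate (property) and object."""
    subject: str
    predicate: str
    obj: str
    obj_is_literal: bool = False  # True when the object is a constant, not a URI


def to_ntriples(t: Triple) -> str:
    """Serialize a triple as one N-Triples line.

    Simplification: only backslashes and double quotes are escaped in
    literals; newlines, language tags and datatypes are not handled.
    """
    if t.obj_is_literal:
        escaped = t.obj.replace("\\", "\\\\").replace('"', '\\"')
        obj = f'"{escaped}"'
    else:
        obj = f"<{t.obj}>"
    return f"<{t.subject}> <{t.predicate}> {obj} ."


# Hypothetical URIs, chosen only for this illustration.
EX = "http://example.org/"
bob_is_musician = Triple(EX + "Bob", EX + "hasProfession", EX + "Musician")

print(to_ntriples(bob_is_musician))
```

In this sketch, each statement becomes one line in which URIs are written between angle brackets and the line ends with a period, which is the general shape of an N-Triples file.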
An RDF graph is a set of triples, typically denoted as (s, p, o), where s is the subject, p is the predicate (also called property) and o is the object, or value of the property p of s. RDF graphs are complex and heterogeneous; in particular, a resource can have several values for a property. For instance, a resource representing a journal article may have several values for the property creator.
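The definition above can be sketched directly in Python, since a graph is simply a set of triples. The following is an illustrative sketch, not the representation used by any particular RDF store; the `http://example.org/` namespace, the sample data and the helper names `values` and `properties_by_subject` are assumptions made for this example.

```python
from collections import defaultdict

EX = "http://example.org/"  # hypothetical namespace for this illustration

# An RDF graph as a set of (subject, property, object) tuples.
graph = {
    (EX + "article1", EX + "creator", "Alice"),
    (EX + "article1", EX + "creator", "Bob"),
    (EX + "article1", EX + "title", "A sample article"),
    (EX + "article2", EX + "title", "Another article"),
}


def values(graph, subject, prop):
    """Return every object o such that (subject, prop, o) is in the graph."""
    return {o for (s, p, o) in graph if s == subject and p == prop}


def properties_by_subject(graph):
    """Map each subject to the set of properties it has."""
    index = defaultdict(set)
    for s, p, _ in graph:
        index[s].add(p)
    return dict(index)


# article1 has two values for creator; article2 has none.
print(values(graph, EX + "article1", EX + "creator"))
print(values(graph, EX + "article2", EX + "creator"))
print(properties_by_subject(graph))
```

Using a set means that adding the same triple twice has no effect, and the two articles carrying different property sets gives a small picture of the heterogeneity mentioned above.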
Below is an example of an RDF graph:
As the example illustrates, subjects and objects are represented as nodes, whereas a property is represented as an edge between two nodes. From this example, we know that “Kevin Chen” is a Person, that his given name is “Kevin” and his family name is “Chen”, and we also know his homepage and his mailbox address. It should be noted that nodes and edges can be URIs (Uniform Resource Identifiers): for example, http://www.example.org/~kevin/contact.rdf#kevinchen is a URI (denoted s for short) representing a person called Kevin Chen.
In contrast, property objects can be URIs or literals (which can be viewed as constants; see below).
The special property http://www.w3.org/1999/02/22-rdf-syntax-ns#type (type, for short) allows a form of resource typing. In this example, the graph contains the triple (s, type, Person), stating that the resource represented by s is of type Person.
A resource may have one or several types, or it may lack types. Property values (objects) can be URIs, blank nodes (a special form of unknown/unspecified nodes), strings, integers, floating-point numbers, dates etc. Values may be typed explicitly, as in “12.5^^xsd:decimal”, but may also appear simply as “12.5”. In the example above, the property values for given name and family name are strings, although this type information may not appear in the original data.
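The two notions above (resource types and optionally typed values) can be illustrated with a short Python sketch. It is a simplified illustration under several assumptions: the `http://www.example.org/terms/` namespace, the sample triples, and the helpers `types_of` and `parse_literal` are hypothetical, literals are written as quoted strings, and only the `xsd:decimal` datatype is interpreted.

```python
import re
from decimal import Decimal

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
EX = "http://www.example.org/terms/"  # hypothetical namespace

s = "http://www.example.org/~kevin/contact.rdf#kevinchen"
graph = {
    (s, RDF_TYPE, EX + "Person"),
    (s, EX + "givenName", '"Kevin"'),
    (EX + "item", EX + "weight", '"12.5"^^xsd:decimal'),
}


def types_of(graph, resource):
    """Return the set of types of a resource (possibly empty)."""
    return {o for (subj, p, o) in graph if subj == resource and p == RDF_TYPE}


# A quoted lexical form, optionally followed by ^^ and a datatype name.
LITERAL = re.compile(r'^"(?P<lexical>.*)"(?:\^\^(?P<datatype>\S+))?$')


def parse_literal(text):
    """Convert a literal to a Python value.

    Assumption: only xsd:decimal is converted; any other literal,
    typed or not, is returned as a plain string.
    """
    match = LITERAL.match(text)
    if match is None:
        raise ValueError(f"not a literal: {text!r}")
    if match.group("datatype") == "xsd:decimal":
        return Decimal(match.group("lexical"))
    return match.group("lexical")


print(types_of(graph, s))                    # one type
print(types_of(graph, EX + "item"))          # no type at all
print(parse_literal('"12.5"^^xsd:decimal'))  # explicitly typed value
print(parse_literal('"12.5"'))               # same characters, untyped
```

The last two calls show the distinction made in the text: with an explicit datatype the value can be interpreted as a number, whereas without it the application only sees a string and has to guess.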
