Invited Speakers
-
"Semantic Web: lots of data, some knowledge, and a little reasoning"
François Goasdoué (Université de Rennes 1, ENSSAT-IRISA) and Marie-Christine Rousset (Université de Grenoble-Alpes, LIG, and Institut Universitaire de France).
Abstract: The Resource Description Framework (RDF), the W3C standard for the Semantic Web, is attracting growing interest from the database community. This data model is particularly well suited to representing Big Data (very large, heterogeneous, and incomplete data) and already has a flagship incarnation in Linked Data (http://linkeddata.org).
RDF is a flexible model that uniformly expresses, as triples, both metadata about entities referenced by URIs and knowledge about the schema of classes and properties; the latter is what is often called an ontology.
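To make the triple view concrete, here is a minimal sketch using the open-source rdflib Python library; the ex: vocabulary (ex:alice, ex:Professor, ex:teaches, ...) is invented for illustration, not taken from the talk.

    # A small RDF graph: data triples (metadata about entities identified
    # by URIs) and schema triples (an ontology) share the same triple form.
    from rdflib import Graph, Literal, Namespace, RDF, RDFS

    EX = Namespace("http://example.org/")
    g = Graph()

    # Data: facts about the entity ex:alice.
    g.add((EX.alice, RDF.type, EX.Professor))
    g.add((EX.alice, EX.teaches, EX.databases))
    g.add((EX.alice, EX.name, Literal("Alice")))

    # Schema (ontology): knowledge about classes and properties.
    g.add((EX.Professor, RDFS.subClassOf, EX.Researcher))
    g.add((EX.teaches, RDFS.domain, EX.Professor))

    print(g.serialize(format="turtle"))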
Realizing the Semantic Web means exploiting this knowledge with reasoning algorithms, both to complete the set of query answers by inference and to enrich and link data from multiple sources.
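As a sketch of such reasoning (again with rdflib and the illustrative ex: vocabulary from the previous fragment), applying a single RDFS entailment rule is enough for a query to gain an answer that is only implicit in the stored triples:

    # Saturation-based query answering in miniature: RDFS rule rdfs9,
    # (x type C) and (C subClassOf D) imply (x type D), completes the set
    # of query answers. A full reasoner iterates such rules to a fixpoint.
    from rdflib import Graph, Namespace, RDF, RDFS

    EX = Namespace("http://example.org/")
    g = Graph()
    g.add((EX.alice, RDF.type, EX.Professor))
    g.add((EX.Professor, RDFS.subClassOf, EX.Researcher))

    # Before reasoning, the implied answer is missed.
    print(list(g.subjects(RDF.type, EX.Researcher)))   # []

    inferred = {(x, RDF.type, d)
                for x, _, c in g.triples((None, RDF.type, None))
                for _, _, d in g.triples((c, RDFS.subClassOf, None))}
    for t in inferred:
        g.add(t)

    # After reasoning, ex:alice is returned as an inferred answer.
    print(list(g.subjects(RDF.type, EX.Researcher)))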
In this talk, we will highlight the similarities and the specific features of the RDF model with respect to the formal models of deductive databases and of incomplete databases. We will then identify the challenges these specific features raise for implementing efficient techniques for querying, and for linking data and knowledge. Finally, we will present the main approaches proposed in the recent literature to address some of these challenges.
François Goasdoué is a Professor of Computer Science at Université de Rennes 1. His research lies at the interface of Databases and Knowledge Representation & Reasoning; it focuses on efficient data management (consistency, querying, updates, etc.) for RDF graphs and OWL2 knowledge bases, in centralized, decentralized, and massively parallel architectures. His results are regularly published in the major Database and Artificial Intelligence journals and conferences.
Marie-Christine Rousset is a Professor of Computer Science at the University of Grenoble. Her areas of research are Knowledge Representation, Information Integration, Linked Data, and the Semantic Web. She has published around 100 refereed international journal articles and conference papers, and has participated in several cooperative industry-university projects. She received a best paper award from AAAI in 1996 and was elected an ECCAI Fellow in 2005. She has served on many program committees of international conferences and workshops, and on the editorial boards of several journals.
Chair: Ioana Manolescu
-
"Cooking with Data Crumbs: Ingredients, Recipes, and Tricks"
Amélie Marian (Rutgers U) and Arnaud Sahuguet (NYU Urban Science)
Abstract: Our existence is becoming more digital every day. Our social, professional, educational, financial, sporting, cultural, and other interactions are now conducted through digital intermediaries. Each of these interactions creates a digital trace.
These data crumbs constitute an enormous and still poorly exploited reservoir of knowledge for individuals, for the private sector, and for the public sector alike. They embody both an immense hope (personalized medicine, intelligent personal assistants, smart cities) and a terrible fear (state surveillance, the end of privacy, hyper-marketing).
In this talk, we will review the different types of data generated by our digital interactions; the data models and metadata associated with them; the storage and querying techniques involved; and examples of uses in consumer products, medical research applications, and civic applications.

Amélie Marian is an Associate Professor in the Computer Science Department at Rutgers University. Her research interests are in Personal Information Management, Ranked Query Processing, Semi-Structured Data, and Web Data Management. Amélie received her Ph.D. in Computer Science from Columbia University in 2005. From March 1999 to August 2000, she was a member of the VERSO project at INRIA-Rocquencourt. She received B.S. and M.S. degrees from Université Paris Dauphine, France, in 1998 and 1999, respectively. She is the recipient of a Microsoft Live Labs Award (2006), three Google Research Awards (2008, 2010, and 2012), and an NSF CAREER award (2009).
Dr. Arnaud Sahuguet is a technologist and entrepreneur with a passion for inventing, architecting, and building products that leverage technology to solve meaningful problems and have a large social impact. His goal is to empower people and organizations to be more productive and collaborative through innovation. Before joining GovLab as Chief Technology Officer, Arnaud spent 8 years at Google as a product manager for speech recognition and Google Maps; he founded and launched the OneToday mobile fundraising platform for Google.org; he also worked on child protection and civic innovation. Before Google, he spent 5 years at Bell Labs Research as a member of technical staff, working on standardization, identity management, and converged services. Arnaud holds a Ph.D. in Computer Science from the University of Pennsylvania, an M.Sc. from Ecole Nationale des Ponts et Chaussées, and a B.Sc. from Ecole Polytechnique in France. Full profile at https://www.linkedin.com/in/sahuguet
-
"Data integration challenges raised by self-service Business Intelligence"
Eric Simon, SAP France
Abstract: Enterprise Business Intelligence (BI) traditionally provides business users with solutions for managed reporting (ad-hoc query and reporting, or pixel-perfect reporting), dashboards, and data analysis. BI solutions rely heavily on the IT organization to create the data warehouse and data marts underpinning the BI system, as well as the semantic layers specifically designed over this trusted data foundation to model the information used by reports, dashboards, and analytic queries. A decade ago, BI evolved to empower business users to create personalized reports and analytical queries, and to let them manipulate and explore information directly, without resorting to IT. Business users and analysts now demand true "self-service" capabilities that go beyond data discovery and rich interactive visualization of IT-curated data sources, to include sophisticated data integration tools for preparing their own data for analysis, as well as data governance capabilities. This growing demand calls for new data-driven, iterative solutions better suited to business users than the traditional "design-test-deploy" paradigm typically adopted by IT organizations. In this paradigm shift, business users "model their data as they go," creating their own analyses, reports, and performance indicators. They need powerful new data-driven, interactive user interfaces, as well as new capabilities to search for data, easily assess the quality of data, semi-automate the curation, profiling, and enrichment of data, and suggest how to expand and combine semantically related datasets depending on the user's interaction context and profile. This talk will review the requirements of "self-service BI" and explain the technical challenges of providing more data-driven data integration solutions. Some of the recent directions taken by SAP in this field will be outlined and illustrated. Open issues will be presented at the end.
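As one hedged illustration of the data-profiling and quality-assessment capabilities mentioned above (a sketch of the general idea in Python, not SAP's implementation; the profile function and the sample records are invented), a self-service tool can compute simple column statistics the moment a user loads a dataset:

    # Lightweight column profiling: null rate, distinct count, and the mix
    # of value types, surfaced to the user without an IT-designed schema.
    from collections import Counter

    def profile(rows, column):
        """Return basic quality indicators for one column of dict records."""
        values = [r.get(column) for r in rows]
        non_null = [v for v in values if v not in (None, "")]
        types = Counter(type(v).__name__ for v in non_null)
        return {
            "rows": len(values),
            "null_rate": 1 - len(non_null) / len(values) if values else 0.0,
            "distinct": len(set(non_null)),
            "inferred_types": dict(types),
        }

    sales = [
        {"region": "EMEA", "revenue": 1200.0},
        {"region": "EMEA", "revenue": None},    # missing value flagged by the profile
        {"region": "APJ",  "revenue": "1,5k"},  # type mix-up flagged by the profile
    ]
    print(profile(sales, "revenue"))
    # {'rows': 3, 'null_rate': 0.33..., 'distinct': 2, 'inferred_types': {'float': 1, 'str': 1}}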
Chair: Christine Collet
Tutorials
-
"Data cleaning in the big data era"
Paolo Papotti and Jorge Quiané-Ruiz, Qatar Computing Research Institute (QCRI)
Abstract: In the "big data" era, data is often dirty by nature, for reasons such as typos, missing values, and duplicates. The intrinsic problem with dirty data is that it can lead to poor results in analytic tasks. For instance, Experian QAS Inc. reported that poor customer data cost British businesses £8 billion in lost revenue in 2011. Data cleaning is therefore an unavoidable task for obtaining reliable data for final applications such as querying and mining. Data cleaning (a.k.a. data preparation) is a popular activity in both industry and academia. Nevertheless, data cleaning is hard in practice, as it requires a great amount of manual work. Several systems have been proposed to achieve the level of automation and scalability required by the volume and variety of big data. They rely on a formal, declarative approach based on first-order logic: users provide high-level specifications of their tasks (the "what"); the systems compute optimal solutions without human intervention on the generated code (the "how"). However, despite the positive results in automating the data cleaning task, the volume (scalability) and variety of big data remain two open problems. In this tutorial, we first describe recent results in tackling data cleaning with a declarative approach. We then discuss how this experience has pushed several groups to explore new approaches to the problem that deal with the volume and variety of big data. In particular, we discuss how user-defined functions and declarative specifications can coexist in a unified system, ultimately taking the best from both worlds.
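To give the declarative flavour some concreteness (a minimal Python sketch of the general idea, not any specific system's code; the dataset, the zip -> city dependency, and the majority-vote repair strategy are all invented for illustration), the user states what must hold and generic code detects and repairs the violations:

    # Declarative cleaning in miniature: the "what" is a functional
    # dependency lhs -> rhs; the engine computes the "how" generically.
    from collections import Counter, defaultdict

    def fd_violations(rows, lhs, rhs):
        """Group rows by the FD's left-hand side; report conflicting rhs values."""
        groups = defaultdict(list)
        for r in rows:
            groups[r[lhs]].append(r[rhs])
        return {k: Counter(vs) for k, vs in groups.items() if len(set(vs)) > 1}

    def repair(rows, lhs, rhs):
        """Repair each violation by keeping the most frequent rhs value per group."""
        for key, counts in fd_violations(rows, lhs, rhs).items():
            winner = counts.most_common(1)[0][0]
            for r in rows:
                if r[lhs] == key:
                    r[rhs] = winner

    customers = [
        {"zip": "75005", "city": "Paris"},
        {"zip": "75005", "city": "Paris"},
        {"zip": "75005", "city": "Pariss"},  # typo: violates zip -> city
    ]
    print(fd_violations(customers, "zip", "city"))  # {'75005': Counter({'Paris': 2, 'Pariss': 1})}
    repair(customers, "zip", "city")
    print(fd_violations(customers, "zip", "city"))  # {} : consistent again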
Chair: Sihem Amer-Yahia