SIMPLIFY 2021

1st International Workshop on Data Analytics and Machine Learning Made Simple

Co-located with EDBT 2021, Nicosia, Cyprus (held online), March 23, 2021

Call for Papers


Scope

Many current applications, despite their widely different characteristics, generate and need to process massive amounts of static or streaming data. For example, Data Lakes gather large amounts of diverse data from a multitude of data sources with the aim of enabling data analysts to perform ad hoc, self-service analytics and to train machine learning models, reducing the time from data to insights. These operations are particularly challenging for applications that process streaming Big Data. Achieving this goal requires addressing various challenges relating to data volume, velocity, dynamicity, heterogeneity, and potentially (geo-)distributed data processing.

Although there exists a plethora of techniques, algorithms and tools to manage, query and analyze various types of data, they typically require a high degree of data management skill and expertise, as well as significant time and effort for data preparation, parameter tuning, and the design and implementation of data analytics and machine learning pipelines.

The aim of the SIMPLIFY workshop is to bring together computer scientists with interests in this field to present recent innovations, find topics of common interest, and stimulate further development of new approaches that greatly simplify the work of a data analyst when performing data analytics, or when employing machine learning algorithms, over Big Data.



Topics

Topics of interest include (but are not limited to):
      Novel architectures for data analytics and ML over data lakes
      Novel architectures for data analytics and online ML over streaming data
      Query processing over heterogeneous data
      Query processing over geo-distributed data
      Query optimization of data processing workflows
      Algorithms for mining and analytics over heterogeneous data
      Algorithms for online machine learning and data mining
      Similarity search and entity resolution
      Interactive data exploration
      Visual analytics over heterogeneous data
      Deep learning platforms
      Application papers demonstrating the impact of techniques relevant to SIMPLIFY



Submission guidelines

We invite submissions of novel research, completed or in-progress work, vision, and system papers. The page limit for regular research papers is 6 pages. Additionally, we welcome submission of short papers, up to 4 pages, of the following types: (a) papers that describe ongoing work that has not yet reached the maturity required for a full research paper; (b) vision papers that describe a vision for the future of the field; (c) system/application papers and demos.

Papers must present original work and not have been submitted or accepted for publication in any other workshop, conference or journal.

Submitted papers must follow the ACM Proceedings Format (adapted template for EDBT 2021 can be found here) and should be submitted electronically as PDF documents using the online EasyChair submission system:
https://easychair.org/conferences/?conf=simplify2021
All workshop papers will be indexed by DBLP and will be published online at CEUR.

Important Dates


      Submission deadline: December 29, 2020 (extended from December 22, 2020)
      Notification to authors: January 25, 2021 (moved from January 22, 2021)
      Camera-ready deadline: February 8, 2021 (moved from February 1, 2021)

Committees


Workshop Chairs

      Antonios Deligiannakis, Technical University of Crete
      Manolis Koubarakis, National and Kapodistrian University of Athens
      Dimitris Skoutas, Athena Research Center



Program Committee

      Alexander Artikis, NCSR "Demokritos"
      Konstantina Bereta, National and Kapodistrian University of Athens
      Daniele Bonetta, Oracle Labs
      Bikash Chandra, Ecole Polytechnique Fédérale de Lausanne
      Nikos Giatrakos, Athena Research Center
      Damien Graux, ADAPT Centre and Trinity College Dublin
      Asterios Katsifodimos, Delft University of Technology
      Georgia Koutrika, Athena Research Center
      Matteo Lissandrini, Aalborg University
      Davide Mottin, Aarhus University
      Ioannis Mytilinis, Ecole Polytechnique Fédérale de Lausanne
      Eirini Ntoutsi, L3S Research Center
      Odysseas Papapetrou, Eindhoven University of Technology
      Matthias Renz, Christian-Albrechts-Universität zu Kiel
      Dimitris Sacharidis, Vienna University of Technology
      Alkis Simitsis, Athena Research Center
      Giovanni Simonini, Università di Modena e Reggio Emilia
      Thanasis Vergoulis, Athena Research Center
      Nikolay Yakovets, Eindhoven University of Technology

Program

Time in GMT+1 (Central Europe)

Session 1
09:00-09:05 Welcome
09:05-09:45 Keynote by Ralf Klinkenberg: "Real-Time Data Streaming and Machine Learning Model Deployment from an Easy-to-Use Graphical User Interface"
Break
Session 2
10:00-10:15 Scale-independent Data Analysis with Database-backed Dataframes: a Case Study
Phanwadee Sinthong, Michael Carey and Yuhan Yao
10:15-10:30 What's Mine is Yours, What's Yours is Mine: Simplifying Significance Testing With Big Data
Karan Matnani, Valerie Liptak and George Forman
10:30-10:45 Simplifying p-value Calculation for the Unbiased microRNA Enrichment Analysis, Using ML-techniques
Konstantinos Zagganas, Maria Lioli, Thanasis Vergoulis and Theodore Dalamagas
10:45-11:00 Storage Management in Smart Data Lake
Haoqiong Bian, Bikash Chandra, Ioannis Mytilinis and Anastasia Ailamaki
11:00-11:15 Easy Spark
Ylaise van den Wildenberg, Wouter W.L. Nuijten and Odysseas Papapetrou
11:15-11:30 MRbox: Simplifying Working with Remote Heterogeneous Analytics and Storage Services via Localised Views
Athina Kyriakou and Iraklis Angelos Klampanos
Break
Session 3
11:45-12:00 Multi-Attribute Similarity Search for Interactive Data Exploration
Kostas Patroumpas, Alexandros Zeakis, Dimitrios Skoutas and Roberto Santoro
12:00-12:15 Speculative Execution of Similarity Queries: Real-Time Parameter Optimization through Visual Exploration
Thilo Spinner, Udo Schlegel, Martin Schall, Fabian Sperrle, Rita Sevastjanova, Beatrice Gobbo, Julius Rauscher, Mennatallah El-Assady and Daniel A. Keim
12:15-12:30 An Empirical Evaluation of Early Time-Series Classification Algorithms
Evgenios Kladis, Charilaos Akasiadis, Evangelos Michelioudakis, Elias Alevizos and Alexandros Artikis
12:30-12:45 Weighted Load Balancing Mechanisms over Streaming Big Data for Online Machine Learning
Petros Petrou, Sophia Karagiorgou and Dimitrios Alexandrou
12:45-13:00 Simplifying Impact Prediction for Scientific Articles
Thanasis Vergoulis, Ilias Kanellos, Giorgos Giannopoulos and Theodore Dalamagas
Closing

Keynote

Title: Real-Time Data Streaming and Machine Learning Model Deployment from an Easy-to-Use Graphical User Interface

Abstract:

Increasing amounts of available data and advanced data analysis and machine learning enable new insights, forecasts, automation, and other value-creating solutions for many use cases across many industries. The value of such solutions increases significantly if a wider variety of data sources can be integrated and if real-time predictions based on real-time data support decision processes and automation. Simplifying the design and deployment of data analysis and machine learning processes enables broader groups of users to leverage the power of machine learning for their use cases. The EU-funded R&D project INFORE (Interactive Extreme-Scale Analytics and Forecasting) addresses the challenges posed by huge datasets and data streams and paves the way for real-time, interactive extreme-scale analytics and forecasting. Today, at an increasing rate, industrial and scientific institutions need to deal with massive data flows streaming in from maritime surveillance applications, financial forecasting applications, or cancer cell growth simulations, as well as a multitude of other sources. The ability to forecast, as early as possible, a good approximation of the outcome of a time-consuming and resource-demanding computational task makes it possible to quickly identify undesired outcomes and save valuable time, effort, and computational resources. Since not everyone is a data scientist and not everyone knows how to configure and program the tools needed for real-time data streaming, making the design and deployment of data analysis processes simpler is crucial for a wider adoption of these technologies and for enabling users in many industries to leverage the value creation potential for their use cases.
Within the INFORE project, the focus of RapidMiner is to provide a unified software platform seamlessly integrating all data sources, data streaming technologies (Kafka, Flink, Spark Streaming, etc.), machine learning algorithms and libraries (Python, R, Google TensorFlow, DL4J, H2O, Keras, Weka, etc.), model validation schemes (cross-validation, sliding window validation, etc.), deployment options, visualizations, model monitoring and operations with easy-to-use interfaces. Users can visually design data analysis workflows in an easy-to-use Graphical User Interface (GUI) without having to code, and can train and deploy machine-learned models locally on their computer or server, on distributed data streams, on Hadoop or Spark clusters, in the cloud, or on the edge, all from a single unified graphical user interface. The goal is to ease and accelerate the process from data and idea to productive analysis processes and value creation with machine learning for as many users as possible, including not only data scientists but also domain experts from various industries, such as beer brewers, electrical engineers, and manufacturers, as well as business analysts and managers. Within INFORE, RapidMiner and its project partners have developed an easy-to-use framework for handling and integrating large data streams from various sources using standard technologies like Kafka, Flink, and Spark Streaming, for data preprocessing, for machine learning and parameter optimization (mostly offline), and for real-time model deployment on real-time data streams (online). The INFORE project demonstrates the applicability of this framework, its time series and forecasting capabilities, and its Complex Event Detection and Prediction (CEP) capabilities on use cases in various domains, including maritime surveillance and issue detection, financial time series forecasting, and cancer cell growth simulations and predictions.
This presentation focuses on the developed real-time data streaming and machine learning framework and its user interface in RapidMiner.

Presenter: Ralf Klinkenberg (RapidMiner)

Ralf Klinkenberg, founder and head of research at RapidMiner, is a data-driven entrepreneur with more than 30 years of experience in machine learning and advanced data analytics research, software development, consulting, and applications in the automotive, aviation, chemical, finance, healthcare, insurance, internet, manufacturing, pharmaceutical, retail, software, and telecom industries. He holds Master of Science degrees in computer science with a focus on artificial intelligence, machine learning, and predictive analytics from Technical University of Dortmund, Germany, and Missouri University of Science and Technology (MST), Rolla, MO, USA. In 2001 he initiated the open source data mining software project RapidMiner, and in 2007 he founded the predictive analytics software company RapidMiner with Dr. Ingo Mierswa. In 2008 he won the European Open Source Business Award, and in 2016 he was awarded the European Data Innovator Award. In 2017 the German government invited him to join the steering committee of the “Plattform Lernende Systeme”, an initiative of the German government to promote the use of machine learning and artificial intelligence in industry and society, on which he has served since then. In 2018 and 2020 he advised the German government on the formulation of its artificial intelligence strategy. Ralf Klinkenberg is co-organizer of the Industrial Data Science (IDS) conference series. He is passionate about learning in humans and machines, as well as about how to leverage data to make organizations more data-driven, more agile, more efficient and effective, and more successful using data mining and machine learning, from both a business and a technical perspective. Today RapidMiner has 770,000+ registered users in 150+ countries and is one of the most widely used predictive analytics platforms worldwide. The analysts of Forrester and Gartner view RapidMiner as one of the world-leading software platforms for machine learning and data science.