1st International Workshop on Data Analytics and Machine Learning Made Simple
Co-located with EDBT 2021, Online (originally Nicosia, Cyprus), March 23, 2021
Many current applications, despite their widely different characteristics, generate and need to process massive amounts of static or streaming data. For example, Data Lakes gather large amounts of diverse data from a multitude of data sources with the aim of enabling data analysts to perform ad hoc, self-service analytics and to train machine learning models, reducing the time from data to insights. These operations are particularly challenging for applications that process streaming Big Data. Achieving this goal requires addressing various challenges relating to data volume, velocity, dynamicity, heterogeneity, and potentially (geo-)distributed data processing.
Although there exists a plethora of techniques, algorithms and tools to manage, query and analyze various types of data, they typically require a high degree of data management skills and expertise, as well as significant time and effort for data preparation, parameter tuning and design and implementation of data analytics and machine learning pipelines.
The aim of the SIMPLIFY workshop is to bring together computer scientists with interests in this field to present recent innovations, find topics of common interest, and stimulate further development of new approaches that greatly simplify the work of a data analyst when performing data analytics, or when employing machine learning algorithms, over Big Data.
Topics of interest include (but are not limited to):
Novel architectures for data analytics and ML over data lakes
Novel architectures for data analytics and online ML over streaming data
Query processing over heterogeneous data
Query processing over geo-distributed data
Query optimization of data processing workflows
Algorithms for mining and analytics over heterogeneous data
Algorithms for online machine learning and data mining
Similarity search and entity resolution
Interactive data exploration
Visual analytics over heterogeneous data
Deep learning platforms
Application papers demonstrating the impact of techniques relevant to SIMPLIFY
We invite submissions of novel research, completed or in-progress work, vision, and system papers. The page limit for regular research papers is 6 pages. Additionally, we welcome submission of short papers, up to 4 pages, of the following types: (a) papers that describe ongoing work that has not yet reached the maturity required for a full research paper; (b) vision papers that describe a vision for the future of the field; (c) system/application papers and demos.
Papers must present original work that has not been submitted to, or accepted for publication in, any other workshop, conference or journal.
Submitted papers must follow the ACM Proceedings Format (adapted template for EDBT 2021 can be found here) and should be submitted electronically as PDF documents using the online EasyChair submission system:
https://easychair.org/conferences/?conf=simplify2021
All workshop papers will be indexed by DBLP and will be published online at CEUR.
Submission deadline: December 29, 2020 (extended from December 22, 2020)
Notification to authors: January 25, 2021 (moved from January 22, 2021)
Camera-ready deadline: February 8, 2021 (moved from February 1, 2021)
Antonios Deligiannakis, Technical University of Crete
Manolis Koubarakis, National and Kapodistrian University of Athens
Dimitris Skoutas, Athena Research Center
Alexander Artikis, NCSR "Demokritos"
Konstantina Bereta, National and Kapodistrian University of Athens
Daniele Bonetta, Oracle Labs
Bikash Chandra, Ecole Polytechnique Fédérale de Lausanne
Nikos Giatrakos, Athena Research Center
Damien Graux, ADAPT Centre and Trinity College Dublin
Asterios Katsifodimos, Delft University of Technology
Georgia Koutrika, Athena Research Center
Matteo Lissandrini, Aalborg University
Davide Mottin, Aarhus University
Ioannis Mytilinis, Ecole Polytechnique Fédérale de Lausanne
Eirini Ntoutsi, L3S Research Center
Odysseas Papapetrou, Eindhoven University of Technology
Matthias Renz, Christian-Albrechts-Universität zu Kiel
Dimitris Sacharidis, Vienna University of Technology
Alkis Simitsis, Athena Research Center
Giovanni Simonini, Università di Modena e Reggio Emilia
Thanasis Vergoulis, Athena Research Center
Nikolay Yakovets, Eindhoven University of Technology
Time in GMT+1 (Central Europe)
Session 1 | |
09:00-09:05 | Welcome |
09:05-09:45 | Keynote by Ralf Klinkenberg: "Real-Time Data Streaming and Machine Learning Model Deployment from an Easy-to-Use Graphical User Interface" |
Break | |
Session 2 | |
10:00-10:15 | Scale-independent Data Analysis with Database-backed Dataframes: a Case Study. Phanwadee Sinthong, Michael Carey and Yuhan Yao |
10:15-10:30 | What's Mine is Yours, What's Yours is Mine: Simplifying Significance Testing With Big Data. Karan Matnani, Valerie Liptak and George Forman |
10:30-10:45 | Simplifying p-value Calculation for the Unbiased microRNA Enrichment Analysis, Using ML-techniques. Konstantinos Zagganas, Maria Lioli, Thanasis Vergoulis and Theodore Dalamagas |
10:45-11:00 | Storage Management in Smart Data Lake. Haoqiong Bian, Bikash Chandra, Ioannis Mytilinis and Anastasia Ailamaki |
11:00-11:15 | Easy Spark. Ylaise van den Wildenberg, Wouter W.L. Nuijten and Odysseas Papapetrou |
11:15-11:30 | MRbox: Simplifying Working with Remote Heterogeneous Analytics and Storage Services via Localised Views. Athina Kyriakou and Iraklis Angelos Klampanos |
Break | |
Session 3 | |
11:45-12:00 | Multi-Attribute Similarity Search for Interactive Data Exploration. Kostas Patroumpas, Alexandros Zeakis, Dimitrios Skoutas and Roberto Santoro |
12:00-12:15 | Speculative Execution of Similarity Queries: Real-Time Parameter Optimization through Visual Exploration. Thilo Spinner, Udo Schlegel, Martin Schall, Fabian Sperrle, Rita Sevastjanova, Beatrice Gobbo, Julius Rauscher, Mennatallah El-Assady and Daniel A. Keim |
12:15-12:30 | An Empirical Evaluation of Early Time-Series Classification Algorithms. Evgenios Kladis, Charilaos Akasiadis, Evangelos Michelioudakis, Elias Alevizos and Alexandros Artikis |
12:30-12:45 | Weighted Load Balancing Mechanisms over Streaming Big Data for Online Machine Learning. Petros Petrou, Sophia Karagiorgou and Dimitrios Alexandrou |
12:45-13:00 | Simplifying Impact Prediction for Scientific Articles. Thanasis Vergoulis, Ilias Kanellos, Giorgos Giannopoulos and Theodore Dalamagas |
Closing |
Title: Real-Time Data Streaming and Machine Learning Model Deployment from an Easy-to-Use Graphical User Interface
Abstract:
Increasing amounts of available data and advanced data analysis and machine learning enable new insights, forecasts, automation, and other value-creating solutions for many use cases across many industries. The value of such solutions increases significantly if a wider variety of data sources can be integrated and if real-time predictions based on real-time data support decision processes and automations. Simplifying the design and deployment of data analysis and machine learning processes enables broader groups of users to leverage the power of machine learning for their use cases. The EU-funded R&D project INFORE (Interactive Extreme-Scale Analytics and Forecasting) addresses the challenges posed by huge datasets and data streams and paves the way for real-time, interactive extreme-scale analytics and forecasting. Today, at an increasing rate, industrial and scientific institutions need to deal with massive data flows streaming in from maritime surveillance applications, financial forecasting applications, or cancer cell growth simulations, as well as a multitude of other sources. The ability to forecast, as early as possible, a good approximation of the outcome of a time-consuming and resource-demanding computational task makes it possible to quickly identify undesired outcomes and to save valuable time, effort and computational resources. Since not everyone is a data scientist and not everyone knows how to configure and program the tools needed for real-time data streaming, making the design and deployment of data analysis processes simpler is crucial for a wider adoption of these technologies and for enabling users in many industries to leverage the value creation potential for their use cases.
Within the INFORE project, the focus of RapidMiner is to provide a unified software platform seamlessly integrating all data sources, data streaming technologies (Kafka, Flink, Spark Streaming, etc.), machine learning algorithms and libraries (Python, R, Google TensorFlow, DL4J, H2O, Keras, Weka, etc.), model validation schemes (cross-validation, sliding window validation, etc.), deployment options, visualizations, model monitoring and operations with easy-to-use interfaces. Users can visually design data analysis workflows in an easy-to-use Graphical User Interface (GUI) without having to code, and can train and deploy machine-learned models locally on their computer or server, on distributed data streams, on Hadoop or Spark clusters, in the cloud, or on the edge, all from a single unified graphical user interface. The goal is to ease and accelerate the process from data and idea to productive analysis processes and value creation with machine learning for as many users as possible, including not only data scientists but also domain experts from various industries, such as beer brewers, electrical engineers, and manufacturers, as well as business analysts and managers. Within INFORE, RapidMiner and its project partners have developed an easy-to-use framework for handling and integrating large data streams from various sources using various standard technologies like Kafka, Flink, and Spark Streaming, for data preprocessing, for machine learning and parameter optimization (mostly offline), and for real-time model deployment on real-time data streams (online). The INFORE project demonstrates the applicability of this framework, its time series and forecasting capabilities, and its Complex Event Detection and Prediction (CEP) capabilities on use cases in various domains, including maritime surveillance and issue detection, financial time series forecasting, and cancer cell growth simulations and predictions.
This presentation focuses on the developed real-time data streaming and machine learning framework and its user interface in RapidMiner.
Presenter: Ralf Klinkenberg (RapidMiner)
Ralf Klinkenberg, founder and head of research at RapidMiner, is a data-driven entrepreneur with more than 30 years of experience in machine learning and advanced data analytics research, software development, consulting, and applications in the automotive, aviation, chemical, finance, healthcare, insurance, internet, manufacturing, pharmaceutical, retail, software, and telecom industries. He holds Master of Science degrees in computer science with a focus on artificial intelligence, machine learning, and predictive analytics from Technical University of Dortmund, Germany, and Missouri University of Science and Technology (MST), Rolla, MO, USA. In 2001 he initiated the open source data mining software project RapidMiner and in 2007 he founded the predictive analytics software company RapidMiner with Dr. Ingo Mierswa. In 2008 he won the European Open Source Business Award and in 2016 he was awarded the European Data Innovator Award. In 2017 the German government invited him to the steering committee of the “Plattform Lernende Systeme”, an initiative of the German government to promote the use of machine learning and artificial intelligence in industry and society, on which he has served since then. In 2018 and 2020 he advised the German government on the formulation of its artificial intelligence strategy. Ralf Klinkenberg is co-organizer of the Industrial Data Science (IDS) conference series. He is passionate about learning in humans and machines as well as about how to leverage data to make organizations more data-driven, more agile, more efficient and effective, and more successful using data mining and machine learning, both from a business and a technical perspective. Today RapidMiner has 770,000+ registered users in 150+ countries and is one of the most widely used predictive analytics platforms world-wide. The analysts of Forrester and Gartner view RapidMiner as one of the world-leading software platforms for machine learning and data science.