Dr Ahmed Awad
Dr. Ahmed Awad is an associate professor. Formerly, he was a professor of Big Data at the Institute of Computer Science, University of Tartu, Estonia. Before that, he was an associate professor of Information Systems at the Faculty of Computers and Artificial Intelligence, Cairo University, Egypt.
Recent Submissions
Item: C-3PA: Streaming Conformance, Confidence and Completeness in Prefix-Alignments (Springer, Cham, 2023)
Authors: Raun, Kristo; Nielsen, Max; Burattin, Andrea; Awad, Ahmed
The aim of streaming conformance checking is to find discrepancies between process executions on streaming data and the reference process model. The state-of-the-art output from streaming conformance checking is a prefix-alignment. However, current techniques that output a prefix-alignment are unable to handle warm-starting scenarios. Further, no indication is given of how close the trace is to termination, a highly relevant measure in a streaming setting. This paper introduces a novel approximate streaming conformance checking algorithm that enriches prefix-alignments with confidence and completeness measures. Empirical tests on synthetic and real-life datasets demonstrate that the new method outputs prefix-alignments whose cost is highly correlated with that of state-of-the-art optimal prefix-alignments. Furthermore, the method is able to handle warm-starting scenarios and indicate the confidence level of the prefix-alignment. A stress test shows that the method is well-suited for fast-paced event streams.

Item: On The Shift to Decentralised Identity Management in Distributed Data Exchange Systems (ACM Digital Library, 2023)
Authors: Bakhtina, Mariia; Matulevičius, Raimundas; Awad, Ahmed; Kivimäki, Petteri
The commonly used centralised trust and centralised identity management make information systems and organisations prone to a single point of failure. Decentralised identity management has therefore appeared as an alternative solution that mitigates the weaknesses of centralised identity. In this paper, we propose a system analysis approach that guides organisations considering the transition to decentralised identity management. The approach aims to support decision-making about the usefulness of the transition based on the created assessment model. The approach is validated through a case study of the X-Road ecosystem.

Item: I Will Survive: An Event-driven Conformance Checking Approach Over Process Streams (ACM Digital Library, 2023)
Authors: Raun, Kristo; Tommasini, Riccardo; Awad, Ahmed
Online conformance checking deals with finding discrepancies between real-life and modeled behavior on data streams. The current state-of-the-art output of online conformance checking is a prefix-alignment, which pinpoints the exact deviations in terms of the trace and the model while accommodating a trace's unknown termination in an online setting. Current methods for producing prefix-alignments are computationally expensive, which hinders their applicability in real-life settings. This paper introduces a new approximate algorithm, I Will Survive (IWS). The algorithm utilizes the trie data structure to improve calculation speed while remaining memory-efficient. Comparative analysis on real-life and synthetic datasets shows that the IWS algorithm can achieve an order of magnitude faster execution time while having a smaller error cost, compared to the current state of the art. In extreme cases, IWS finds prefix-alignments roughly three orders of magnitude faster than previous approximate methods. The IWS algorithm includes a discounted decay time setting for more efficient memory usage and a look-ahead limit for improving computation time. Finally, the algorithm is stress tested for performance using a simulation of high-traffic event streams.
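Both C-3PA and IWS build on prefix-alignments, which align the events observed so far against a still-running execution of the model. As a rough illustration of the cost such techniques approximate, here is a minimal Python sketch, assuming the model behavior is given as a plain set of candidate activity sequences (the papers use trie encodings and approximations on top of this):

def prefix_alignment_cost(observed, model_trace):
    """Cheapest alignment of the observed prefix against any prefix of
    one model execution sequence (edit-distance style; a substitution
    is charged as one log move plus one model move)."""
    n, m = len(observed), len(model_trace)
    # dist[i][j] = cost of aligning observed[:i] with model_trace[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i                       # i log moves
    for j in range(1, m + 1):
        dist[0][j] = j                       # j model moves
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if observed[i - 1] == model_trace[j - 1] else 2
            dist[i][j] = min(dist[i - 1][j] + 1,        # log move
                             dist[i][j - 1] + 1,        # model move
                             dist[i - 1][j - 1] + sub)  # match / mismatch
    # The model run may continue after the observed prefix, so any cell
    # in the last row is a valid stopping point: take the cheapest one.
    return min(dist[n])

model_runs = [("a", "b", "c", "d"), ("a", "c", "b", "d")]
print(min(prefix_alignment_cost(("a", "c", "x"), run) for run in model_runs))

Taking the minimum over the final row is what makes this a prefix-alignment: the model run is allowed to terminate anywhere after the observed events, so an ongoing trace is not penalized for activities it has not reached yet.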
Item: A Decentralised Public Key Infrastructure for X-Road (ACM Digital Library, 2023)
Authors: Bakhtina, Mariia; Long Leung, Kin; Matulevičius, Raimundas; Awad, Ahmed; Švenda, Petr
X-Road is an open-source solution that acts as a data exchange layer and enables secure data exchange between organisations. X-Road serves as the backbone of digital infrastructure in the public sector (e.g., enabling Estonia's digital public services) and the private sector (e.g., enabling clients' data exchange in the Japanese energy sector). An approach and architecture were recently proposed for X-Road data exchange systems to move from a public key infrastructure (PKI) with centralised certification authorities to a decentralised PKI (DPKI). In this paper, we develop a proof of concept for the designed DPKI-based architecture that leverages distributed ledger-based identifiers and verifiable credentials to establish trust between information systems, using Hyperledger Indy and Hyperledger Aries. We evaluate the proof-of-concept implementation against the design and functional requirements. The results show that the proposed system architecture is technically feasible and satisfies the identified design goals and functional requirements. To the best of our knowledge, this paper presents the first open-access system prototype for an organisation's identity management following self-sovereign identity principles. The presented proof of concept shows that DPKI helps to address some of the scalability issues of PKI, improve control over identity, and mitigate replay attacks and a single point of failure in the X-Road system.

Item: Big Data Analytics from the Rich Cloud to the Frugal Edge (IEEE, 2023)
Authors: M. Awaysheh, Feras; Tommasini, Riccardo; Awad, Ahmed
Modern systems and applications generate and consume an enormous amount of data from different sources, including mobile edge computing and IoT systems. Our ability to locate and analyze these massive amounts of data will shape the future, building next-generation Big Data Analytics (BDA) and artificial intelligence systems in critical domains. Traditionally, big data materialize in a centralized repository (e.g., the cloud) for running sophisticated analytics using ample computation. Nevertheless, many modern applications and critical domains require low-latency data analysis, where making the right decision at the right time is the standard for building trust. With the advent of edge computing, the traditional deployment model has shifted closer to the data sources at the network's edge. This shift is motivated by minimized latency, increased uptime, and enhanced efficiency. This paper studies the BDA building blocks, analyzes the deployment requirements for edge-based BDA QoS, and drafts future trends. It also discusses critical open issues and further research directions for the next step of edge-based BDA.

Item: Keyed Watermarks: A Fine-grained Tracking of Event-time in Apache Flink (IEEE, 2023)
Authors: Yasser, Tawfik; Arafa, Tamer; El-Helw, Mohamed; Awad, Ahmed
Big data stream processing engines such as Apache Flink use windowing techniques to handle unbounded streams of events. Gathering all pertinent input within a window is crucial for event-time windowing, since it affects the accuracy of results. A significant part in this process is played by watermarks: unique timestamps that mark the progress of events in time. However, the current watermark generation method in Apache Flink, which works at the level of the whole input stream, tends to favor faster sub-streams, resulting in dropped events from slower sub-streams. In our analysis, we found that Apache Flink's vanilla watermark generation approach caused around 33% loss of data when 50% of the keys around the median were delayed, and a loss surpassing 37% when 50% of random keys were delayed. In this paper, we present a novel strategy called keyed watermarks to overcome data loss and increase the accuracy of data processing to at least 99% in most cases. We enable separate progress tracking by creating a unique watermark for each logical sub-stream (key). We outline the architectural and API changes necessary to implement keyed watermarks and discuss our experience in extending Apache Flink's sizeable code base. Additionally, we compare the effectiveness of our strategy against the conventional watermark generation method in terms of event-time tracking accuracy.
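The gist of the fix is small enough to sketch outside Flink. Below is a hedged, engine-agnostic Python sketch of per-key watermark bookkeeping; the class and method names are illustrative and not Flink's actual API, which tracks a single watermark per stream or partition:

from collections import defaultdict

class KeyedWatermarks:
    """One watermark per logical sub-stream (key). A slow key's windows
    stay open until that key's own watermark passes, so its events are
    no longer dropped just because faster keys advanced a global one."""

    def __init__(self, max_out_of_orderness):
        self.delay = max_out_of_orderness
        self.max_ts = defaultdict(lambda: float("-inf"))

    def observe(self, key, event_ts):
        """Record an event and report whether it is late for its key."""
        late = event_ts <= self.watermark(key)
        if not late:
            self.max_ts[key] = max(self.max_ts[key], event_ts)
        return late

    def watermark(self, key):
        # the bounded-out-of-orderness heuristic, applied per key
        return self.max_ts[key] - self.delay

wm = KeyedWatermarks(max_out_of_orderness=5)
wm.observe("fast", 100)
print(wm.observe("slow", 60))   # False: "slow" is judged by its own progress

Under a single stream-level watermark, the event at time 60 would already be late (the watermark would stand at 95); per-key tracking keeps it.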
Item: Towards Scalable Process Mining Pipelines (IEEE, 2023)
Authors: Mohamed, Belal; ElHelw, Mohamed; Awad, Ahmed
Over the past two decades, process mining has proven to be a valuable approach to gain insights into organizations' performance. The major sub-fields of discovery, conformance, and improvement have witnessed substantial development. Contributions have covered the spectrum of better algorithms, richer comparison metrics, and movement towards online analysis of process data. Mostly, these contributions address the guidelines of the process mining manifesto. In this paper, we address the sixth guideline of the manifesto: process mining should be a continuous process. For this, we propose a pipelining approach that is configurable, scalable, modular, and automated. We realize our proposal using Dask and evaluate it with different architectures, process discovery algorithms, and evaluation metrics.

Item: D2IA: User-defined interval analytics on distributed streams (ProQuest Central, 2022)
Authors: Awad, Ahmed; Tommasini, Riccardo; Langhi, Samuele; Kamel, Mahmoud; Della Valle, Emanuele; Sakr, Sherif
Nowadays, modern big stream processing solutions (e.g., Spark, Flink) are working towards being the ultimate framework for streaming analytics. In order to achieve this goal, they have started to offer SQL extensions that incorporate stream-oriented primitives such as windowing and complex event processing (CEP). The former enables stateful computation on infinite sequences of data items, while the latter focuses on detecting event patterns. In most cases, data items and events are considered instantaneous, i.e., single time points in a discrete temporal domain. Nevertheless, point-based time semantics do not satisfy the requirements of a number of use cases. For instance, it is not possible to detect the interval during which the temperature increases until it begins to decrease, nor all the relations this interval subsumes. To tackle this challenge, we present D2IA, a set of novel abstract operators to define analytics on user-defined event intervals based on raw events and to efficiently reason about temporal relationships between intervals and/or point events. We realize the implementation of the concepts of D2IA on top of Flink, a distributed stream processing engine for big data.
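The temperature example from the abstract can be made concrete. Here is a small sketch of deriving user-defined intervals from raw point events and checking one of Allen's interval relations, assuming events are plain (timestamp, value) pairs rather than D2IA's operator algebra:

def rising_intervals(readings):
    """Yield (start_ts, end_ts) for each maximal run of strictly
    increasing temperature, closed when a non-increase arrives."""
    start = prev = None
    for ts, temp in readings:
        if prev is not None and temp > prev[1]:
            prev = (ts, temp)                 # interval keeps growing
            continue
        if prev is not None and start[0] != prev[0]:
            yield (start[0], prev[0])         # emit the finished interval
        start = prev = (ts, temp)
    if prev is not None and start[0] != prev[0]:
        yield (start[0], prev[0])

def during(a, b):
    # Allen's "during": interval a lies strictly inside interval b
    return b[0] < a[0] and a[1] < b[1]

events = [(1, 10), (2, 12), (3, 15), (4, 14), (5, 16), (6, 18)]
print(list(rising_intervals(events)))         # [(1, 3), (4, 6)]

Once intervals are first-class values like these pairs, relations such as during can be queried directly, which is the kind of reasoning the D2IA operators make available inside the stream engine.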
Item: Online correlation for unlabeled process events: A flexible CEP-based approach (ProQuest Central, 2022)
Authors: M.A. Helal, Iman; Awad, Ahmed
Process mining is a sub-field of data mining that focuses on analyzing timestamped and partially ordered data, commonly called event logs. Each event is required to have at least three attributes, case ID, task ID/name, and timestamp, for process mining techniques to apply. Thus, any missing information needs to be supplied first. Traditionally, events collected from different sources are manually correlated. While this might be acceptable in an offline setting, it is infeasible in an online setting. Recently, several use cases have emerged that call for applying process mining in an online setting. In such scenarios, a stream of high-speed and high-volume events flows continuously, e.g., in IoT applications, with stringent latency requirements on gaining insights about the ongoing process. Thus, event correlation must be automated and occur as the data is being received. We introduce an approach that correlates unlabeled events received on a stream. Given a set of start activities, our approach correlates unlabeled events to a case identifier. Our approach is probabilistic: a single uncorrelated event can be assigned to zero or more case identifiers with different probabilities. Moreover, our approach is flexible: the user can supply domain knowledge in the form of constraints that reduce the correlation space, and this knowledge can be supplied while the application is running. We realize our approach using complex event processing (CEP) technologies and implemented a prototype on top of Esper, a state-of-the-art industrial CEP engine. The experimental evaluation shows that our approach outperforms the throughput and latency of baseline approaches, and that on real-life logs its accuracy can compete with them.
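To make the idea tangible, here is a heavily simplified Python sketch of probabilistic correlation under pluggable constraints. It illustrates the principle only, not the Esper/CEP realization from the paper; the names, the uniform probabilities, and the commit-to-one-candidate policy are all assumptions:

import itertools

class StreamingCorrelator:
    def __init__(self, start_activities, follows, constraints=()):
        self.start = set(start_activities)
        self.follows = set(follows)           # allowed (prev_task, task) pairs
        self.constraints = list(constraints)  # callables: (case_events, event) -> bool
        self.cases = {}                       # case id -> list of events
        self._ids = itertools.count(1)

    def correlate(self, event):
        """Return {case_id: probability} for one unlabeled event."""
        task = event["task"]
        candidates = [cid for cid, evs in self.cases.items()
                      if (evs[-1]["task"], task) in self.follows
                      and all(c(evs, event) for c in self.constraints)]
        if not candidates and task in self.start:
            candidates = [next(self._ids)]    # the event opens a fresh case
            self.cases[candidates[0]] = []
        if not candidates:
            return {}                         # not correlatable (yet)
        self.cases[candidates[0]].append(event)   # commit to one candidate
        return {cid: 1.0 / len(candidates) for cid in candidates}

corr = StreamingCorrelator(
    start_activities={"register"},
    follows={("register", "check"), ("check", "decide")})
print(corr.correlate({"task": "register", "ts": 1}))   # {1: 1.0}
print(corr.correlate({"task": "check", "ts": 2}))      # {1: 1.0}

Because constraints are plain callables, new domain knowledge can be appended to corr.constraints while the stream is running, mirroring the runtime-supplied constraints described in the abstract.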
Item: Benchmarking Concept Drift Detectors for Online Machine Learning (Springer, Cham, 2022)
Authors: Mahgoub, Mahmoud; Moharram, Hassan; Elkafrawy, Passent; Awad, Ahmed
Concept drift detection is an essential step in maintaining the accuracy of online machine learning. The main task is to detect changes in data distribution that might cause changes in the decision boundaries of a classification algorithm. Upon drift detection, the classification algorithm may reset its model or concurrently grow a new learning model. Over the past fifteen years, several drift detection methods have been proposed, most of which have been implemented within the Massive Online Analysis (MOA) framework. A couple of studies have compared drift detectors, but they merely focus on detection accuracy, rely mostly on synthetic data sets, and do not consider detectors that are not integrated into MOA. Furthermore, none of them considers metrics like resource consumption and runtime characteristics, which are of utmost importance from an operational point of view. In this paper, we fill this gap. Namely, we evaluate the performance of sixteen drift detection methods using three metrics: accuracy, runtime, and memory usage. To guarantee a fair comparison, MOA is used. Fourteen of the algorithms are already implemented in MOA; we integrate two new ones (ADWIN++ and SDDM) into it.

Item: Efficient Checking of Timed Ordered Anti-patterns over Graph-Encoded Event Log (Springer, Cham, 2022)
Authors: M. Zaki, Nesma; M. A. Helal, Iman; E. Hassanein, Ehab; Awad, Ahmed
Event logs are used for a plethora of process analytics and mining techniques. A class of these mining activities is conformance (compliance) checking, where the goal is to identify violations of required execution patterns, i.e., anti-patterns. Several approaches have been proposed to tackle this analysis task, based on different data models and storage technologies for the event log, including relational databases, graph databases, and proprietary formats. Graph-based encoding of event logs is a promising direction that turns several process analytics tasks into queries on the underlying graph; compliance checking is one class of such analysis tasks. In this paper, we argue that encoding log data as graphs alone is not enough to guarantee efficient processing of queries on this data. Efficiency is important due to the interactive nature of compliance checking; thus, anti-pattern detection would benefit from sub-linear scanning of the data. Moreover, as more data is added, e.g., new batches of logs arrive, the data size should grow sub-linearly to optimize both storage space and querying time. We propose two encoding methods using graph representations, realized in Neo4j and SQL Graph Database, and show the benefits of these encodings on a special class of queries, namely timed ordered anti-patterns. Compared to several baseline encodings, our experiments show up to a 5x speedup in querying time as well as a 3x reduction in graph size.

Item: Optimizing ADWIN for Steady Streams (ACM, 2022)
Authors: Moharram, Hassan; Awad, Ahmed; M. El-Kafrawy, Passent
With ever-growing data generation rates and stringent constraints on the latency of analyzing such data, stream analytics is taking over. Learning from data streams, aka online machine learning, is no exception. However, online machine learning comes with many challenges for the different aspects of the learning process, from algorithm design to the evaluation method. One of these challenges is the ability of a learning system to adapt to changes in data distribution, known as concept drift, so as to maintain the accuracy of its predictions. Over time, several drift detection approaches have been proposed. A prominent approach is adaptive windowing (ADWIN), which can detect changes in the distribution of feature data without explicit feedback on the correctness of the prediction. Several ADWIN variants have been proposed to enhance its runtime performance w.r.t. throughput and latency. However, the drift detection accuracy of these variants was not compared with the original algorithm, there is no study concerning the memory consumption of either the variants or the original algorithm, and evaluations were done on synthetic datasets with a considerable number of drifts, not covering all types of drifts or steady streams, i.e., streams with no drifts at all or almost negligible ones. The contribution of this paper is two-fold. First, we compare the original adaptive window (ADWIN) and its variants, Serial, HalfCut, and Optimistic, in terms of drift detection accuracy, detection speed, and memory consumption, represented by the internal window size. We compare them using synthetic data sets covering different types of concept drift, namely incremental, gradual, abrupt, and steady, and we also use two real-life datasets whose drifts are unknown. Second, we present ADWIN++, which uses an adaptive bucket-dropping technique to control window size. We evaluate our technique on the same data sets as well as new datasets with fewer drifts. Experiments show that our approach saves about 80% of memory consumption, takes less time to detect concept drift, and maintains drift detection accuracy.
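For a feel of what ADWIN-style detectors do, and where the memory goes, here is a simplified Python sketch: it cuts the window when two sub-windows have significantly different means (a Hoeffding-style bound) and, in the spirit of ADWIN++'s bucket dropping, caps the window at a fixed budget. The real algorithms use exponential histograms of buckets and differ in detail:

import math
import random
from collections import deque

class SimpleAdwin:
    def __init__(self, delta=0.002, max_size=2000):
        self.delta = delta          # confidence parameter
        self.max_size = max_size    # memory budget (ADWIN++-style dropping)
        self.window = deque()

    def add(self, value):
        """Insert a value; return True if a drift was detected."""
        self.window.append(value)
        while len(self.window) > self.max_size:
            self.window.popleft()   # drop the oldest data to bound memory
        return self._cut()

    def _cut(self):
        data = list(self.window)
        n = len(data)
        if n < 32:
            return False
        prefix = [0.0]
        for v in data:
            prefix.append(prefix[-1] + v)
        for split in range(16, n - 15):
            n_l, n_r = split, n - split
            mean_l = prefix[split] / n_l
            mean_r = (prefix[n] - prefix[split]) / n_r
            m = 1.0 / (1.0 / n_l + 1.0 / n_r)
            eps = math.sqrt(math.log(4.0 / self.delta) / (2.0 * m))
            if abs(mean_l - mean_r) > eps:
                for _ in range(split):
                    self.window.popleft()   # keep only the new distribution
                return True
        return False

random.seed(0)
det = SimpleAdwin()
stream = [random.random() for _ in range(500)] + \
         [random.random() + 1.0 for _ in range(500)]
print([i for i, v in enumerate(stream) if det.add(v)][:1])  # fires soon after the shift at 500

The max_size cap is the memory/accuracy trade-off the paper studies: dropping old buckets bounds the window, at the cost of forgetting evidence about the older distribution.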
Item: A Novel Hadoop Security Model for Addressing Malicious Collusive Workers (ProQuest Central, 2021)
Authors: M. Sauber, Amr; Awad, Ahmed; F. Shawish, Amr; M. El-Kafrawy, Passent
With the daily increase in data production and collection, Hadoop is a platform for processing big data on a distributed system. A master node globally manages running jobs, whereas worker nodes process partitions of the data locally. Hadoop uses MapReduce as an effective computing model. However, Hadoop experiences a high level of security vulnerability over hybrid and public clouds. In particular, several workers can fake results without actually processing their portions of the data. Several redundancy-based approaches have been proposed to counteract this risk, using a replication mechanism to duplicate all or some of the tasks over multiple workers (nodes). A drawback of such approaches is the high overhead they generate on the cluster. Additionally, malicious workers can behave well for a long period of time and attack later. This paper presents a novel model to enhance the security of the cloud environment against untrusted workers. A new component called the malicious workers' trap (MWT) is developed to run on the master node to detect malicious (non-collusive and collusive) workers as they attack the system. An implementation to test the proposed model and analyze the system's performance shows that the proposed model can accurately detect malicious workers with minor processing overhead compared to vanilla MapReduce and the Verifiable MapReduce (V-MR) model [1]. In addition, MWT maintains a balance between the security and usability of the Hadoop cluster.

Item: Efficient Approximate Conformance Checking Using Trie Data Structures (IEEE, 2021)
Authors: Awad, Ahmed; Raun, Kristo; Weidlich, Matthias
Conformance checking compares a process model and recorded executions of a process, i.e., a log of traces. To this end, state-of-the-art approaches compute an alignment between a trace and an execution sequence of the model. Since the construction of alignments is computationally expensive, approximation schemes have been developed to strike a balance between the efficiency and accuracy of conformance checking. Specifically, conformance checking may rely only on so-called proxy behavior, a subset of the behavior of the model. However, the question of how such proxy behavior should be represented for efficient alignment computation has been largely neglected. In this paper, we contribute a new formulation of the proxy behavior derived from a model for approximate conformance checking. By encoding the proxy behavior using a trie data structure, we obtain a logarithmically reduced search space for alignment computation compared to a set-based representation. We show how our algorithm supports the definition of a budget for alignment computation and also augment it with strategies for meta-heuristic optimization and pruning of the search space. Evaluation experiments with five real-world event logs show that our approach reduces the runtime of alignment construction by two orders of magnitude with a modest estimation error.
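The paper's central representational move, encoding the proxy behavior as a trie so that shared prefixes are stored and searched once, can be sketched briefly. This is a sketch of the data structure only, not the alignment search, budgets, or pruning:

class TrieNode:
    __slots__ = ("children", "is_end")
    def __init__(self):
        self.children = {}      # activity label -> TrieNode
        self.is_end = False

def build_trie(proxy_traces):
    """Encode a sample of the model's execution sequences as a trie."""
    root = TrieNode()
    for trace in proxy_traces:
        node = root
        for activity in trace:
            node = node.children.setdefault(activity, TrieNode())
        node.is_end = True
    return root

def count_nodes(node):
    return 1 + sum(count_nodes(c) for c in node.children.values())

# Shared prefixes collapse into one path:
trie = build_trie([("a", "b", "c"), ("a", "b", "d"), ("a", "e")])
print(count_nodes(trie) - 1)    # 5 trie edges vs 8 activities stored as a set

During alignment, the search walks this trie instead of enumerating whole sequences, which is where the reduced search space reported in the paper comes from.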
Item: SDDM: an interpretable statistical concept drift detection method for data streams (ProQuest Central, 2021)
Authors: Micevska, Simona; Awad, Ahmed; Sakr, Sherif
Machine learning models assume that data is drawn from a stationary distribution. In practice, however, models need to make sense of fast-evolving data streams, where the content of the data changes and evolves over time. The change between the distribution of the training data seen so far and the distribution of newly arriving data is called concept drift. Detecting concept drifts is of utmost importance to maintain the accuracy and reliability of online classifiers. Reactive drift detectors monitor the performance of the underlying machine learning model: to detect a drift, feedback on the classifier output has to be given to the drift detector, known as prequential evaluation. In many real-life scenarios, immediate feedback on classifier output is not possible; drift detection is thus delayed and gets out of context. Moreover, the drift detector output is a binary answer on whether there is a drift or not, while it is equally important to explain the source of the drift. In this paper, we present the Statistical Drift Detection Method (SDDM), which detects drifts by monitoring the change of data distribution without the need for feedback on classifier output. Moreover, the detection is quantified and the source of the drift is identified. We empirically evaluate our method against the state of the art on both synthetic and real-life data sets. SDDM outperforms related approaches by producing a smaller number of false positives and false negatives.
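A toy version of the feedback-free idea compares the per-feature distribution of a reference window against the current window and reports which features drifted and by how much. SDDM's actual test statistic differs; the divergence and threshold below are stand-ins:

import math
from collections import Counter

def normalize(counter):
    total = sum(counter.values())
    return {k: v / total for k, v in counter.items()}

def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    keys = set(p) | set(q)
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}
    def kl(a):
        return sum(a[k] * math.log(a[k] / m[k]) for k in a if a[k] > 0)
    return 0.5 * kl(p) + 0.5 * kl(q)

def drift_report(reference, current, threshold=0.05):
    """Per-feature drift scores: the report both quantifies a drift and
    points at its source, with no feedback on classifier accuracy."""
    report = {}
    for feature in reference:
        d = js_divergence(normalize(Counter(reference[feature])),
                          normalize(Counter(current[feature])))
        if d > threshold:
            report[feature] = round(d, 3)
    return report

ref = {"color": ["r"] * 50 + ["b"] * 50, "size": ["s"] * 100}
cur = {"color": ["r"] * 90 + ["b"] * 10, "size": ["s"] * 100}
print(drift_report(ref, cur))    # only "color" is flagged

Reporting a score per feature is what makes such a detector interpretable: the answer is not just "drift happened" but "the distribution of color moved this much".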
Item: Predicting Remaining Cycle Time from Ongoing Cases: A Survival Analysis-Based Approach (Springer, Cham, 2020)
Authors: Baskharon, Fadi; Awad, Ahmed; Di Francescomarino, Chiara
Predicting the remaining cycle time of running cases is one important use case of predictive process monitoring. Different approaches that learn from event logs, e.g., relying on an existing representation of the process or leveraging machine learning, have been proposed in the literature to tackle this problem. Machine learning-based techniques have shown superior prediction accuracy and do not require knowledge of the underlying process models generating the logs. However, all proposed approaches learn from complete traces. This may delay the start of new training cycles, as process instances can span hours, days, weeks, or even months. In this paper, we propose a machine learning approach that can learn from incomplete ongoing traces. Using a time-aware survival analysis technique, we train a neural network to predict the remaining cycle time of a running case. Our approach accepts both complete and incomplete traces as input. We have evaluated our approach on different real-life datasets and compared it with a state-of-the-art baseline. Results show that our approach, in many cases, outperforms the baseline in both accuracy and training time.

Item: DISGD: A Distributed Shared-nothing Matrix Factorization for Large Scale Online Recommender Systems (OpenProceedings.org, 2020)
Authors: Hazem, Heidy; Awad, Ahmed; Hassan, Ahmed; Sakr, Sherif
With web-scale data volumes and high generation rates, it has become crucial that the training process for recommender systems be continuous and performed on live data, i.e., on data streams. In practice, such systems have to address three main requirements: the ability to adapt the trained model with each incoming data element, the ability to handle concept drifts, and the ability to scale with the volume of the data. In principle, matrix factorization is one of the popular approaches to train a recommender model, and stochastic gradient descent (SGD) has been a successful optimization approach for matrix factorization. Several approaches have been proposed that handle the first and second requirements. For the third requirement, in the realm of data streams, distributed approaches depend on a shared-memory architecture, which requires obtaining locks before performing updates. In general, the success of mainstream big data processing systems is supported by their shared-nothing architecture. In this paper, we propose DISGD, a distributed shared-nothing variant of incremental SGD. The proposal is motivated by the observation that with large volumes of data, the overwriting of updates, i.e., lock-free updates, does not affect the result with sparse user-item matrices. Compared to the baseline incremental approach, our evaluation on several datasets shows not only improved processing time but also a 55% improvement in recall.

Item: Process Mining over Unordered Event Streams (IEEE, 2020)
Authors: Awad, Ahmed; Weidlich, Matthias; Sakr, Sherif
Process mining is no longer limited to the one-off analysis of static event logs extracted from a single enterprise system. Rather, process mining may strive for immediate insights based on streams of events that are continuously generated by diverse information systems. This requires online algorithms that, instead of keeping the whole history of event data, work incrementally and update analysis results upon the arrival of new events. While such online algorithms have been proposed for several process mining tasks, from discovery through conformance checking to time prediction, they all assume that an event stream is ordered, meaning that the order of event generation coincides with their arrival at the analysis engine. Yet, once events are emitted by independent, distributed systems, this assumption may not hold, which compromises analysis accuracy. In this paper, we provide a first contribution towards handling unordered event streams in process mining. Specifically, we formalize the notion of out-of-order arrival of events, where an online analysis algorithm needs to process events in an order different from their generation. Using directly-follows graphs as a basic model for many process mining tasks, we provide two approaches to handle such unorderedness: buffering and speculative processing. Our experiments with synthetic and real-life event data show that these techniques help mitigate the accuracy loss induced by unordered streams.
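The buffering strategy for unordered streams is simple to sketch: hold arriving events in a small time-ordered buffer and only release them into the directly-follows graph once they fall behind the newest timestamp by a chosen delay bound. The names and the fixed bound below are illustrative; the paper also studies a speculative alternative that updates immediately and repairs later:

import heapq
from collections import defaultdict

class BufferedDFG:
    def __init__(self, max_delay):
        self.max_delay = max_delay
        self.pending = []                    # min-heap ordered by event time
        self.last_activity = {}              # case id -> last released activity
        self.dfg = defaultdict(int)          # (a, b) -> directly-follows count
        self.newest = float("-inf")

    def observe(self, ts, case_id, activity):
        heapq.heappush(self.pending, (ts, case_id, activity))
        self.newest = max(self.newest, ts)
        # release everything old enough to be safe from reordering
        while self.pending and self.pending[0][0] <= self.newest - self.max_delay:
            self._release(*heapq.heappop(self.pending))

    def _release(self, ts, case_id, activity):
        prev = self.last_activity.get(case_id)
        if prev is not None:
            self.dfg[(prev, activity)] += 1
        self.last_activity[case_id] = activity

dfg = BufferedDFG(max_delay=2)
for ev in [(1, "c1", "a"), (3, "c1", "c"), (2, "c1", "b"), (6, "c1", "d")]:
    dfg.observe(*ev)
print(dict(dfg.dfg))   # the late "b" is still counted between "a" and "c"

Without the buffer, the arrival order a, c, b would record the spurious edges (a, c) and (c, b); the buffer trades a bounded amount of latency for that accuracy.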
Item: MINARET: A Recommendation Framework for Scientific Reviewers (OpenProceedings.org, 2019)
Authors: R. Moawad, Mohamed; Maher, Mohamed; Awad, Ahmed; Sakr, Sherif
We are witnessing continuous growth in the size of scientific communities and the number of scientific publications. This phenomenon requires a continuous effort for ensuring the quality of publications and a healthy scientific evaluation process. Peer reviewing is the de facto mechanism to assess the quality of scientific work. For journal editors, managing an efficient and effective manuscript peer review process is not a straightforward task. In particular, a main component of the journal editor's role is, for each submitted manuscript, to select adequate reviewers, who need to be: 1) matching in their research interests with the topic of the submission; 2) fair in their evaluation of the submission, i.e., having no conflict of interest with the authors; and 3) qualified in terms of various aspects, including scientific impact, previous review/authorship experience for the journal, and quality of their reviews. Manually selecting and assessing adequate reviewers is thus becoming a tedious and time-consuming task. We demonstrate MINARET, a recommendation framework for selecting scientific reviewers. The framework facilitates the job of journal editors in conducting an efficient and effective scientific review process. It exploits the valuable information available on modern scholarly websites (e.g., Google Scholar, ACM DL, DBLP, Publons) to identify candidate reviewers relevant to the topic of the manuscript, filter them (e.g., excluding those with potential conflicts of interest), and rank them based on several metrics configured by the editor (user). The framework extracts the information required for the recommendation process from the online resources on-the-fly, which ensures that the output recommendations are dynamic and based on up-to-date information.
Item: Adaptive Watermarks: A Concept Drift-based Approach for Predicting Event-Time Progress in Data Streams (OpenProceedings.org, 2019)
Authors: Awad, Ahmed; Traub, Jonas; Sakr, Sherif
Event-time based stream processing is concerned with analyzing data with respect to its generation time. In most cases, data gets delayed during its journey from the source(s) to the stream processing engine; this is known as late data arrival. Among the different approaches to out-of-order stream processing, low watermarks are proposed to inject special records, i.e., watermarks, within data streams. A watermark is a timestamp indicating that no data with a timestamp older than the watermark should be observed later on; any such element is considered a late arrival. Watermark generation is usually periodic and heuristic-based. The limitation of such a generation strategy is its rigidness regarding both the frequency of data arrival and the delay that data may encounter. In this paper, we propose an adaptive watermark generation strategy that decides adaptively when to generate watermarks, and with what timestamp, without a priori adjustment. We treat changes in data arrival frequency and changes in delays as concept drifts in stream data mining, using an adaptive window (ADWIN) as our drift sensor for changes in the distribution of arrival rates and delays. We have implemented our approach on top of Apache Flink and compare it with periodic watermark generation using two real-life data sets. Our results show that adaptive watermarks achieve lower average latency by triggering windows earlier, and a lower rate of dropped elements by delaying watermarks when out-of-order data is expected.
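As a rough illustration of the strategy (not the ADWIN-based Flink implementation from the paper), a generator can watch the observed delays and emit a watermark only when their distribution appears to change, instead of on a fixed timer. The change test below is a toy stand-in for ADWIN:

class AdaptiveWatermarkGenerator:
    def __init__(self, sensitivity=3.0, history=200):
        self.delays = []                 # observed (arrival - event) delays
        self.history = history
        self.sensitivity = sensitivity
        self.max_event_ts = float("-inf")

    def on_event(self, event_ts, arrival_ts):
        """Return a watermark timestamp when a change in the delay
        distribution is detected, otherwise None (nothing emitted)."""
        self.max_event_ts = max(self.max_event_ts, event_ts)
        self.delays.append(arrival_ts - event_ts)
        self.delays = self.delays[-self.history:]
        if self._delays_changed():
            recent = self.delays[-10:]
            expected_delay = sum(recent) / len(recent)
            self.delays = recent         # restart the sensor after emitting
            return self.max_event_ts - expected_delay
        return None

    def _delays_changed(self):
        if len(self.delays) < 40:
            return False
        old, recent = self.delays[:-10], self.delays[-10:]
        mean_old = sum(old) / len(old)
        std_old = (sum((d - mean_old) ** 2 for d in old) / len(old)) ** 0.5
        mean_recent = sum(recent) / len(recent)
        return abs(mean_recent - mean_old) > self.sensitivity * max(std_old, 1e-9)

gen = AdaptiveWatermarkGenerator()
wm = None
for t in range(100):
    delay = 5 if t < 60 else 20          # delays jump at t = 60
    wm = gen.on_event(event_ts=t, arrival_ts=t + delay) or wm
print(wm)   # a watermark emitted soon after the delay pattern changed

Emitting on detected change rather than on a timer is the paper's core trade: windows fire earlier while delays are stable, and watermarks hold back when the stream signals that more out-of-order data is coming.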