Browsing by Author "Awad, Ahmed"

Now showing 1 - 20 of 23
  • A Decentralised Public Key Infrastructure for X-Road
    (ACM DIGITAL LIBRARY, 2023) Bakhtina, Mariia; Long Leung, Kin; Matulevičius, Raimundas; Awad, Ahmed; Švenda, Petr
    X-Road is an open-source solution that acts as a data exchange layer and enables secure data exchange between organisations. X-Road serves as the backbone of digital infrastructure in the public sector (e.g., enabling Estonia’s digital public services) and private sector (e.g., enabling clients’ data exchange in the Japanese energy sector). An approach and architecture were recently proposed for the X-Road data exchange systems to move from public key infrastructure (PKI) with centralised certification authorities to decentralised PKI (DPKI). In this paper, we develop a proof of concept for the designed DPKI-based architecture that leverages distributed ledger-based identifiers and verifiable credentials to establish trust between information systems using Hyperledger Indy and Hyperledger Aries. We evaluate the proof-of-concept implementation against the design and functional requirements. The results show that the proposed system architecture is technically feasible and satisfies the identified design goals and functional requirements. To the best of our knowledge, this paper presents the first open-access system prototype for an organisation’s identity management following self-sovereign identity principles. The presented proof of concept proves that DPKI helps to address some of the scalability issues of PKI, improve control over identity, and mitigate replay attacks and a single point of failure in the X-Road system.
  • A Novel Hadoop Security Model for Addressing Malicious Collusive Workers
    (ProQuest Central, 2021) M. Sauber, Amr; Awad, Ahmed; F. Shawish, Amr; M. El-Kafrawy, Passent
    With the daily increase of data production and collection, Hadoop is a platform for processing big data on a distributed system. A master node globally manages running jobs, whereas worker nodes process partitions of the data locally. Hadoop uses MapReduce as an effective computing model. However, Hadoop experiences a high level of security vulnerability over hybrid and public clouds. Specifically, several workers can fake results without actually processing their portions of the data. Several redundancy-based approaches have been proposed to counteract this risk. A replication mechanism is used to duplicate all or some of the tasks over multiple workers (nodes). A drawback of such approaches is that they generate a high overhead over the cluster. Additionally, malicious workers can behave well for a long period of time and attack later. This paper presents a novel model to enhance the security of the cloud environment against untrusted workers. A new component called malicious workers’ trap (MWT) is developed to run on the master node to detect malicious (non-collusive and collusive) workers as they convert and attack the system. An implementation to test the proposed model and to analyze the performance of the system shows that the proposed model can accurately detect malicious workers with minor processing overhead compared to vanilla MapReduce and the Verifiable MapReduce (V-MR) model [1]. In addition, MWT maintains a balance between the security and usability of the Hadoop cluster.
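    The paper's MWT component runs on the master node; as a rough, hypothetical illustration of the trap idea (the function and variable names below are invented, not the paper's implementation), a master can seed a few tasks whose correct results it already knows and flag any worker whose answer disagrees:

```python
# Hypothetical sketch of a "malicious workers' trap": the master seeds tasks
# whose correct results it already knows and flags workers whose answers
# disagree. Names and structure are illustrative assumptions only.
import random

def reference_compute(task):
    # stand-in for the master recomputing a small task itself
    return sum(task)

def run_with_trap(workers, tasks, trap_ratio=0.2):
    """workers: dict name -> function(task) -> result."""
    trap_tasks = random.sample(tasks, max(1, int(len(tasks) * trap_ratio)))
    expected = {t: reference_compute(t) for t in trap_tasks}  # known answers
    suspicious, results = set(), {}
    for task in tasks:
        worker = random.choice(list(workers))
        results[task] = workers[worker](task)
        if task in expected and results[task] != expected[task]:
            suspicious.add(worker)  # caught faking a trap task
    return results, suspicious

honest = lambda task: sum(task)
cheat = lambda task: 0  # fakes results without processing the data

tasks = [tuple(range(i, i + 5)) for i in range(10)]
_, flagged = run_with_trap({"w1": honest, "w2": cheat}, tasks)
print("suspicious workers:", flagged)
```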
  • Adaptive Watermarks: A Concept Drift-based Approach for Predicting Event-Time Progress in Data Streams
    (OpenProceedings.org, 2019) Awad, Ahmed; Traub, Jonas; Sakr, Sherif
    Event-time based stream processing is concerned with analyzing data with respect to its generation time. In most of the cases, data gets delayed during its journey from the source(s) to the stream processing engine. This is known as late data arrival. Among the different approaches for out-of-order stream processing, low watermarks are proposed to inject special records within data streams, i.e., watermarks. A watermark is a timestamp which indicates that no data with a timestamp older than the watermark should be observed later on. Any element as such is considered a late arrival. Watermark generation is usually periodic and heuristic-based. The limitation of such a watermark generation strategy is its rigidness regarding the frequency of data arrival as well as the delay that data may encounter. In this paper, we propose an adaptive watermark generation strategy. Our strategy decides adaptively when to generate watermarks and with what timestamp without a priori adjustment. We treat changes in data arrival frequency and changes in delays as concept drifts in stream data mining. We use an Adaptive Window (ADWIN) as our concept drift sensor for the change in the distribution of arrival rate and delay. We have implemented our approach on top of Apache Flink. We compare our approach with periodic watermark generation using two real-life data sets. Our results show that adaptive watermarks achieve a lower average latency by triggering windows earlier and a lower rate of dropped elements by delaying watermarks when out-of-order data is expected.
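    The paper drives watermark emission with ADWIN over arrival-rate and delay distributions; the sketch below is a much cruder stand-in (a two-half mean comparison replaces ADWIN, and all names are invented) that only conveys the adapt-when-the-delays-shift idea:

```python
# Minimal sketch (not the paper's ADWIN-based implementation): emit a
# watermark adaptively when the observed event-time delay distribution
# shifts, instead of on a fixed timer. A crude mean comparison over two
# halves of a sliding window stands in for ADWIN.
from collections import deque

class AdaptiveWatermarker:
    def __init__(self, window=50, tolerance=2.0):
        self.delays = deque(maxlen=window)
        self.tolerance = tolerance

    def observe(self, event_time, arrival_time):
        """Returns a watermark timestamp when a delay shift is detected, else None."""
        self.delays.append(arrival_time - event_time)
        n = len(self.delays)
        if n < 10:
            return None
        half = n // 2
        old = sum(list(self.delays)[:half]) / half
        new = sum(list(self.delays)[half:]) / (n - half)
        if abs(new - old) > self.tolerance:  # distribution change => adapt
            self.delays.clear()
            # watermark: no event older than this is expected any more
            return event_time - max(new, 0)
        return None

wm = AdaptiveWatermarker()
for t in range(100):
    delay = 1 if t < 60 else 8  # delays grow mid-stream
    mark = wm.observe(event_time=t, arrival_time=t + delay)
    if mark is not None:
        print(f"watermark emitted at event time <= {mark:.1f}")
```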
  • Benchmarking Concept Drift Detectors for Online Machine Learning
    (Springer, Cham, 2022) Mahgoub, Mahmoud; Moharram, Hassan; Elkafrawy, Passent; Awad, Ahmed
    Concept drift detection is an essential step to maintain the accuracy of online machine learning. The main task is to detect changes in data distribution that might cause changes in the decision boundaries for a classification algorithm. Upon drift detection, the classification algorithm may reset its model or concurrently grow a new learning model. Over the past fifteen years, several drift detection methods have been proposed. Most of these methods have been implemented within Massive Online Analysis (MOA). Moreover, a couple of studies have compared the drift detectors. However, such studies have merely focused on comparing the detection accuracy. Moreover, most of these studies are focused on synthetic data sets only. Additionally, these studies do not consider drift detectors not integrated into MOA. Furthermore, none of the studies have considered other metrics like resource consumption and runtime characteristics. These metrics are of utmost importance from an operational point of view. In this paper, we fill this gap. Namely, this paper evaluates the performance of sixteen different drift detection methods using three different metrics: accuracy, runtime, and memory usage. To guarantee a fair comparison, MOA is used. Fourteen algorithms are implemented in MOA. We integrate two new algorithms (ADWIN++ and SDDM) into MOA.
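    For a flavor of the runtime and memory metrics the study adds beyond accuracy, here is a hedged, self-contained sketch of in-process measurement; the detector below is a trivial threshold stub, not one of the sixteen benchmarked methods:

```python
# Sketch of measuring a drift detector's runtime and peak memory in-process.
# ThresholdDetector is a deliberately naive placeholder, not a real method.
import time, tracemalloc

class ThresholdDetector:
    def __init__(self, limit=3.0):
        self.limit, self.mean, self.n = limit, 0.0, 0
    def update(self, x):
        self.n += 1
        self.mean += (x - self.mean) / self.n  # running mean
        return abs(x - self.mean) > self.limit  # "drift" signal

stream = [0.0] * 5000 + [10.0] * 50  # abrupt drift at the end
det = ThresholdDetector()
tracemalloc.start()
t0 = time.perf_counter()
drifts = sum(det.update(x) for x in stream)
runtime = time.perf_counter() - t0
_, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
print(f"drifts={drifts} runtime={runtime * 1e3:.1f} ms peak={peak} B")
```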
  • Big Data Analytics from the Rich Cloud to the Frugal Edge
    (IEEE, 2023) M. Awaysheh, Feras; Tommasini, Riccardo; Awad, Ahmed
    Modern systems and applications generate and consume an enormous amount of data from different sources, including mobile edge computing and IoT systems. Our ability to locate and analyze these massive amounts of data will shape the future, building next-generation Big Data Analytics (BDA) and artificial intelligence systems in critical domains. Traditionally, big data materialize in a centralized repository (e.g., the cloud) for running sophisticated analytics using decent computation. Nevertheless, many modern applications and critical domains require low-latency data analysis, with the right decision at the right time as a standard for building trust. With the advent of edge computing, that traditional deployment model shifted closer to the data sources at the network’s edge. Such a shift was motivated by minimized latency, increased uptime, and enhanced efficiencies. This paper studies the BDA building blocks, analyzes the deployment requirements for edge-based BDA QoS, and drafts future trends. It also discusses critical open issues and further research directions for the next step of edge-based BDA.
  • Big Stream Processing Systems: An Experimental Evaluation
    (IEEE computer society, 2019) Shahverdi, Elkhan; Awad, Ahmed; Sakr, Sherif
    As the world gets more instrumented and connected, we are witnessing a flood of digital data generated from various hardware (e.g., sensors) or software in the format of flowing streams of data. Real-time processing for such massive amounts of streaming data is a crucial requirement in several application domains including financial markets, surveillance systems, manufacturing, smart cities, and scalable monitoring infrastructure. In the last few years, several big stream processing engines have been introduced to tackle this challenge. In this article, we present an extensive experimental study of five popular systems in this domain, namely, Apache Storm, Apache Flink, Apache Spark, Kafka Streams and Hazelcast Jet. We report and analyze the performance characteristics of these systems. In addition, we report a set of insights and important lessons that we have learned from conducting our experiments.
  • C-3PA: Streaming Conformance, Confidence and Completeness in Prefix-Alignments
    (Springer, Cham, 2023) Raun, Kristo; Nielsen, Max; Burattin, Andrea; Awad, Ahmed
    The aim of streaming conformance checking is to find discrepancies between process executions on streaming data and the reference process model. The state-of-the-art output from streaming conformance checking is a prefix-alignment. However, current techniques that output a prefix-alignment are unable to handle warm-starting scenarios. Further, no indication is given of how close the trace is to termination—a highly relevant measure in a streaming setting. This paper introduces a novel approximate streaming conformance checking algorithm that enriches prefix-alignments with confidence and completeness measures. Empirical tests on synthetic and real-life datasets demonstrate that the new method outputs prefix-alignments that have a cost that is highly correlated with the output from the state-of-the-art optimal prefix-alignments. Furthermore, the method is able to handle warm-starting scenarios and indicate the confidence level of the prefix-alignment. A stress test shows that the method is well-suited for fast-paced event streams.
  • Calculation of Average Road Speed Based on Car-to-Car Messaging
    (IEEE, 2019) Ramzy, Ahmed; Awad, Ahmed; A. Kamel, Amr; Hegazy, Osman; Sakr, Sherif
    Arrival time prediction provided by most navigation systems is affected by several factors, such as road condition, travel time, weather condition, car speed, etc. These predictions are mainly based on historical data. Systems that provide near real-time road condition updates, e.g. Google Maps, depend on crowdsourcing GPS data from cars or mobile devices on the road. GPS data thus has a long journey to travel from their sources to the analytics engine on the cloud before a status update is sent back to the client. Between the time taken for GPS data to be broadcast, received and processed, significant changes in road conditions can take place and would still be unreported, leading to wrong decisions on the route to choose. Monitoring road conditions, especially the average speed of cars, is of a local and continuous nature. It needs to be accomplished near the GPS stream data sources to reduce latency and increase the accuracy of reporting. Solutions based on geo-distributed road monitoring, using the Fog-computing paradigm, provide lower latency and higher accuracy than centralized (cloud-based) approaches. Yet, they require a heavy investment and a large infrastructure, which might limit their utility in some countries, e.g. Egypt. In this paper, we propose a more dynamic approach to continuously update the average speed on the road. The computation is done locally on the client device, e.g. the traveling car or the mobile device of the traveler. We compare, through simulation, our proposed approach to fog-computing-based traffic monitoring. Simulation results give empirical evidence of the correctness of our results compared to fog-based speed calculation. Index Terms: Traffic Monitoring; D2D Communication; Crowdsourcing.
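    As an illustrative sketch of the locally computed estimate (the message format, freshness window, and class name are assumptions, not the paper's protocol), each car can average the speed reports it recently heard from nearby cars on the same segment:

```python
# Illustrative sketch of client-side averaging of car-to-car speed reports,
# so the estimate is produced on the device instead of a cloud backend.
import time

class RoadSpeedEstimator:
    def __init__(self, max_age_s=30.0):
        self.max_age_s = max_age_s
        self.reports = {}  # car_id -> (speed_kmh, received_at)

    def on_message(self, car_id, speed_kmh, now=None):
        self.reports[car_id] = (speed_kmh, now if now is not None else time.time())

    def average_speed(self, now=None):
        now = now if now is not None else time.time()
        # only reports fresh enough to still reflect the road's state
        fresh = [s for s, ts in self.reports.values() if now - ts <= self.max_age_s]
        return sum(fresh) / len(fresh) if fresh else None

est = RoadSpeedEstimator()
est.on_message("car-1", 42.0, now=0.0)
est.on_message("car-2", 38.0, now=5.0)
print(est.average_speed(now=10.0))  # 40.0: average over fresh reports
```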
  • D2IA: User-defined interval analytics on distributed streams
    (ProQuest Central, 2022) Awad, Ahmed; Tommasini, Riccardo; Langhi, Samuele; Kamel, Mahmoud; Della Valle, Emanuele; Sakr, Sherif
    Nowadays, modern Big Stream Processing Solutions (e.g. Spark, Flink) are working towards being the ultimate framework for streaming analytics. In order to achieve this goal, they started to offer extensions of SQL that incorporate stream-oriented primitives such as windowing and Complex Event Processing (CEP). The former enables stateful computation on infinite sequences of data items while the latter focuses on the detection of event patterns. In most of the cases, data items and events are considered instantaneous, i.e., they are single time points in a discrete temporal domain. Nevertheless, a point-based time semantics does not satisfy the requirements of a number of use-cases. For instance, it is not possible to detect the interval during which the temperature increases until the temperature begins to decrease, nor all the relations this interval subsumes. To tackle this challenge, we present D2IA, a set of novel abstract operators to define analytics on user-defined event intervals based on raw events and to efficiently reason about temporal relationships between intervals and/or point events. We realize the implementation of the concepts of D2IA on top of Flink, a distributed stream processing engine for big data.
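    A toy rendition of the point-events-to-intervals step the abstract motivates (D2IA's actual operators are richer; the function below is invented for illustration): derive the maximal intervals during which a temperature reading keeps rising:

```python
# Toy version of the interval idea: turn a stream of point readings into
# user-defined intervals, here the maximal spans of rising temperature.
def rising_intervals(readings):
    """readings: list of (timestamp, temperature). Yields (start, end) intervals."""
    start = None
    for (t0, v0), (t1, v1) in zip(readings, readings[1:]):
        if v1 > v0:
            start = t0 if start is None else start  # interval opens or continues
        elif start is not None:
            yield (start, t0)                       # temperature stopped rising
            start = None
    if start is not None:
        yield (start, readings[-1][0])              # close a still-open interval

temps = [(0, 20), (1, 21), (2, 23), (3, 22), (4, 24), (5, 25)]
print(list(rising_intervals(temps)))  # [(0, 2), (3, 5)]
```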
  • D2IA: Stream Analytics on User-Defined Event Intervals
    (Springer Nature Switzerland AG, 2019) Awad, Ahmed; Tommasini, Riccardo; Kamel, Mahmoud; Della Valle, Emanuele; Sakr, Sherif
    Nowadays, modern Big Stream Processing Solutions (e.g. Spark, Flink) are working towards ultimate frameworks for streaming analytics. In order to achieve this goal, they started to offer extensions of SQL that incorporate stream-oriented primitives such as windowing and Complex Event Processing (CEP). The former enables stateful computation on infinite sequences of data items while the latter focuses on the detection of event patterns. In most of the cases, data items and events are considered instantaneous, i.e., they are single time points in a discrete temporal domain. Nevertheless, a point-based time semantics does not satisfy the requirements of a number of use-cases. For instance, it is not possible to detect the interval during which the temperature increases until the temperature begins to decrease, nor all the relations this interval subsumes. To tackle this challenge, we present D2IA, a set of novel abstract operators to define analytics on user-defined event intervals based on raw events and to efficiently reason about temporal relationships between intervals and/or point events. We realize the implementation of the concepts of D2IA on top of Esper, a centralized stream processing system, and Flink, a distributed stream processing engine for big data.
  • DISGD: A Distributed Shared-nothing Matrix Factorization for Large Scale Online Recommender Systems
    (OpenProceedings.org, 2020) Hazem, Heidy; Awad, Ahmed; Hassan, Ahmed; Sakr, Sherif
    With web-scale data volumes and high velocity of generation rates, it has become crucial that the training process for recommender systems be a continuous process which is performed on live data, i.e., on data streams. In practice, such systems have to address three main requirements: the ability to adapt their trained model with each incoming data element, the ability to handle concept drifts, and the ability to scale with the volume of the data. In principle, matrix factorization is one of the popular approaches to train a recommender model. Stochastic Gradient Descent (SGD) has been a successful optimization approach for matrix factorization. Several approaches have been proposed that handle the first and second requirements. For the third requirement, in the realm of data streams, distributed approaches depend on a shared-memory architecture. This requires obtaining locks before performing updates. In general, the success of mainstream big data processing systems is supported by their shared-nothing architecture. In this paper, we propose DISGD, a distributed shared-nothing variant of an incremental SGD. The proposal is motivated by the observation that, with large volumes of data, the overwrite of updates (lock-free updates) does not affect the result with sparse user-item matrices. Compared to the baseline incremental approach, our evaluation on several datasets shows not only improvement in processing time but also a 55% improvement in recall.
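    To ground the lock-free intuition, here is a minimal incremental-SGD matrix factorization sketch (hyperparameters and structure are assumptions; DISGD's actual distributed sharding is not reproduced): each interaction touches only two factor rows, which is why overwrite-tolerant, lock-free updates rarely collide on sparse data:

```python
# Minimal incremental SGD for matrix factorization: one (user, item) rating
# updates only the two touched factor rows. Not DISGD itself, just the
# single-node baseline it builds on.
import numpy as np

rng = np.random.default_rng(0)
k, lr, reg = 8, 0.05, 0.01
P, Q = {}, {}  # user factors, item factors, grown on the fly

def update(user, item, rating):
    p = P.setdefault(user, rng.normal(0, 0.1, k))
    q = Q.setdefault(item, rng.normal(0, 0.1, k))
    err = rating - p @ q
    # in a shared-nothing deployment these two writes may race across
    # workers; on sparse user-item streams overwrites rarely collide
    P[user] = p + lr * (err * q - reg * p)
    Q[item] = q + lr * (err * p - reg * q)
    return err

stream = [("u1", "i1", 5.0), ("u1", "i2", 3.0), ("u2", "i1", 4.0)] * 50
for u, i, r in stream:
    update(u, i, r)
print(round(P["u1"] @ Q["i1"], 2))  # prediction drifts toward the observed rating
```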
  • Efficient Approximate Conformance Checking Using Trie Data Structures
    (IEEE, 2021) Awad, Ahmed; Raun, Kristo; Weidlich, Matthias
    Conformance checking compares a process model and recorded executions of a process, i.e., a log of traces. To this end, state-of-the-art approaches compute an alignment between a trace and an execution sequence of the model. Since the construction of alignments is computationally expensive, approximation schemes have been developed to strike a balance between the efficiency and the accuracy of conformance checking. Specifically, conformance checking may rely only on so-called proxy behavior, a subset of the behavior of the model. However, the question of how such proxy behavior shall be represented for efficient alignment computation has been largely neglected. In this paper, we contribute a new formulation of the proxy behavior derived from a model for approximate conformance checking. By encoding the proxy behavior using a trie data structure, we obtain a logarithmically reduced search space for alignment computation compared to a set-based representation. We show how our algorithm supports the definition of a budget for alignment computation and also augment it with strategies for meta-heuristic optimization and pruning of the search space. Evaluation experiments with five real-world event logs show that our approach reduces the runtime of alignment construction by two orders of magnitude with a modest estimation error.
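    A small sketch of the data-structure choice (helper names are invented; the alignment search itself is omitted): shared prefixes of proxy runs are stored once in a trie, which is what shrinks the search space relative to a set of sequences:

```python
# Sketch of encoding "proxy behavior" (a chosen subset of the model's
# execution sequences) as a trie so alignment search can share prefixes.
class TrieNode:
    __slots__ = ("children", "end")
    def __init__(self):
        self.children = {}  # activity label -> TrieNode
        self.end = False    # True if a full proxy run stops here

def build_trie(proxy_runs):
    root = TrieNode()
    for run in proxy_runs:
        node = root
        for activity in run:
            node = node.children.setdefault(activity, TrieNode())
        node.end = True
    return root

def max_shared_prefix(root, trace):
    """Length of the longest trace prefix that the proxy behavior can replay."""
    node, depth = root, 0
    for activity in trace:
        if activity not in node.children:
            break
        node = node.children[activity]
        depth += 1
    return depth

trie = build_trie([["a", "b", "c"], ["a", "b", "d"], ["a", "e"]])
print(max_shared_prefix(trie, ["a", "b", "x"]))  # 2: deviation after 'a','b'
```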
  • Efficient Checking of Timed Ordered Anti-patterns over Graph-Encoded Event Log
    (Springer, Cham, 2022) M. Zaki, Nesma; M. A. Helal, Iman; E. Hassanein, Ehab; Awad, Ahmed
    Event logs are used for a plethora of process analytics and mining techniques. A class of these mining activities is conformance (compliance) checking. The goal is to identify violations of expected behavioral patterns, i.e., anti-patterns. Several approaches have been proposed to tackle this analysis task. These approaches have been based on different data models and storage technologies of the event log, including relational databases, graph databases, and proprietary formats. Graph-based encoding of event logs is a promising direction that turns several process analytics tasks into queries on the underlying graph. Compliance checking is one class of such analysis tasks. In this paper, we argue that encoding log data as graphs alone is not enough to guarantee efficient processing of queries on this data. Efficiency is important due to the interactive nature of compliance checking. Thus, anti-pattern detection would benefit from sub-linear scanning of the data. Moreover, as more data are added, e.g., new batches of logs arrive, the data size should grow sub-linearly to optimize both the space of storage and the time for querying. We propose two encoding methods using graph representations, realized in Neo4J & SQL Graph Database, and show the benefits of these encodings on a special class of queries, namely timed ordered anti-patterns. Compared to several baseline encodings, our experiments show up to a 5x speed-up in querying time as well as a 3x reduction in graph size.
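    To make "timed ordered anti-pattern" concrete, a hedged toy check (plain Python dicts stand in for the paper's Neo4j / SQL Graph encodings; only the query logic is illustrated): flag cases where activity B follows activity A within a time bound:

```python
# Toy check of a timed ordered anti-pattern: B must not follow A within
# `max_gap` time units in the same case. Dicts replace the graph store.
events = {  # case -> ordered (activity, timestamp)
    "c1": [("A", 0), ("B", 3)],
    "c2": [("A", 0), ("C", 2), ("B", 40)],
}

def violations(events, a, b, max_gap):
    hits = []
    for case, log in events.items():
        for i, (act_i, t_i) in enumerate(log):
            if act_i != a:
                continue
            for act_j, t_j in log[i + 1:]:      # only later events: "ordered"
                if act_j == b and t_j - t_i <= max_gap:  # "timed"
                    hits.append((case, t_i, t_j))
    return hits

print(violations(events, "A", "B", max_gap=10))  # [('c1', 0, 3)]
```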
  • I Will Survive: An Event-driven Conformance Checking Approach Over Process Streams
    (ACM DIGITAL LIBRARY, 2023) Raun, Kristo; Tommasini, Riccardo; Awad, Ahmed
    Online conformance checking deals with finding discrepancies between real-life and modeled behavior on data streams. The current state-of-the-art output of online conformance checking is a prefix-alignment, which is used for pinpointing the exact deviations in terms of the trace and the model while accommodating a trace’s unknown termination in an online setting. Current methods for producing prefix-alignments are computationally expensive and hinder the applicability in real-life settings. This paper introduces a new approximate algorithm – I Will Survive (IWS). The algorithm utilizes the trie data structure to improve the calculation speed, while remaining memory-efficient. Comparative analysis on real-life and synthetic datasets shows that the IWS algorithm can achieve an order of magnitude faster execution time while having a smaller error cost, compared to the current state of the art. In extreme cases, IWS finds prefix-alignments roughly three orders of magnitude faster than previous approximate methods. The IWS algorithm includes a discounted decay time setting for more efficient memory usage and a look-ahead limit for improving computation time. Finally, the algorithm is stress tested for performance using a simulation of high-traffic event streams.
  • Keyed Watermarks: A Fine-grained Tracking of Event-time in Apache Flink
    (IEEE, 2023) Yasser, Tawfik; Arafa, Tamer; El-Helw, Mohamed; Awad, Ahmed
    Big Data Stream processing engines such as Apache Flink use windowing techniques to handle unbounded streams of events. Gathering all pertinent input within a window is crucial for event-time windowing since it affects how accurate the results are. A significant part of this process is played by watermarks, which are unique timestamps that show the passage of events in time. However, the current watermark generation method in Apache Flink, which works at the level of the input stream, tends to favor faster sub-streams, resulting in dropped events from slower sub-streams. In our analysis, we found that Apache Flink’s vanilla watermark generation approach caused around 33% loss of data if 50% of the keys around the median are delayed. Furthermore, the loss surpassed 37% when 50% of random keys are delayed. In this paper, we present a novel strategy called keyed watermarks to overcome data loss and increase the accuracy of data processing to at least 99% in most cases. We enable separate progress tracking by creating a unique watermark for each logical sub-stream (key). In our study, we outline the architectural and API changes necessary to implement keyed watermarks and discuss our experience in extending Apache Flink’s enormous code base. Additionally, we compare the effectiveness of our strategy against the conventional watermark generation method in terms of the accuracy of event-time tracking. Index Terms: Keyed Watermarks, Big Data Stream Processing, Event-Time Tracking, Apache Flink.
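    A toy model of the per-key idea (this is not Flink's API; the class and parameter names are invented): each key advances its own watermark, so one slow key cannot force drops on the others:

```python
# Toy model of keyed watermarks: one watermark per key instead of one per
# stream. Flink's real operator and API changes are far more involved.
class KeyedWatermarks:
    def __init__(self, max_out_of_orderness):
        self.lag = max_out_of_orderness
        self.marks = {}  # key -> current watermark

    def on_event(self, key, event_time):
        """Advance this key's watermark; return True if the event is late."""
        mark = self.marks.get(key, float("-inf"))
        late = event_time <= mark
        self.marks[key] = max(mark, event_time - self.lag)
        return late

kw = KeyedWatermarks(max_out_of_orderness=5)
print(kw.on_event("fast", 100))   # False
print(kw.on_event("slow", 10))    # False: 'slow' keeps its own watermark
print(kw.on_event("slow", 8))     # False: within the allowed lag for 'slow'
```

    With a single stream-level watermark, the fast key would already have advanced the watermark to 95 and the events at times 10 and 8 would be dropped as late; per-key tracking is what avoids that.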
  • MINARET: A Recommendation Framework for Scientific Reviewers
    (OpenProceedings.org, 2019) R. Moawad, Mohamed; Maher, Mohamed; Awad, Ahmed; Sakr, Sherif
    We are witnessing a continuous growth in the size of scientific communities and the number of scientific publications. This phenomenon requires a continuous effort for ensuring the quality of publications and a healthy scientific evaluation process. Peer reviewing is the de facto mechanism to assess the quality of scientific work. For journal editors, managing an efficient and effective manuscript peer review process is not a straightforward task. In particular, a main component of the journal editor’s role is, for each submitted manuscript, to ensure selecting adequate reviewers who need to be: 1) matching in their research interests with the topic of the submission, 2) fair in their evaluation of the submission, i.e., no conflict of interest with the authors, and 3) qualified in terms of various aspects including scientific impact, previous review/authorship experience for the journal, quality of the reviews, etc. Thus, manually selecting and assessing adequate reviewers is becoming a tedious and time-consuming task. We demonstrate MINARET, a recommendation framework for selecting scientific reviewers. The framework facilitates the job of journal editors for conducting an efficient and effective scientific review process. The framework exploits the valuable information available on modern scholarly websites (e.g., Google Scholar, ACM DL, DBLP, Publons) for identifying candidate reviewers relevant to the topic of the manuscript, filtering them (e.g. excluding those with potential conflict of interest), and ranking them based on several metrics configured by the editor (user). The framework extracts the required information for the recommendation process from the online resources on-the-fly, which ensures the output recommendations are dynamic and based on up-to-date information.
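    The filter-then-rank step can be pictured with a tiny weighted-scoring sketch (metric names, weights, and data are invented; MINARET extracts its actual features from scholarly websites on-the-fly):

```python
# Hypothetical scoring step in the spirit of MINARET: drop conflicted
# candidates, then rank the rest by editor-configured weighted metrics.
candidates = [
    {"name": "R1", "topic_match": 0.9, "impact": 0.7, "coi": False},
    {"name": "R2", "topic_match": 0.8, "impact": 0.9, "coi": True},
    {"name": "R3", "topic_match": 0.6, "impact": 0.8, "coi": False},
]
weights = {"topic_match": 0.7, "impact": 0.3}  # editor-configurable

eligible = [c for c in candidates if not c["coi"]]  # fairness filter
ranked = sorted(eligible,
                key=lambda c: sum(w * c[m] for m, w in weights.items()),
                reverse=True)
print([c["name"] for c in ranked])  # ['R1', 'R3']
```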
  • On The Shift to Decentralised Identity Management in Distributed Data Exchange Systems
    (ACM DIGITAL LIBRARY, 2023) Bakhtina, Mariia; Matulevičius, Raimundas; Awad, Ahmed; Kivimäki, Petteri
    The commonly used centralised trust and centralised identity management make information systems and organisations prone to a single point of failure. Therefore, decentralised identity management has appeared as an alternative solution to mitigate the weaknesses of centralised identity. In this paper, we propose an approach of system analysis that should guide organisations that consider the transition to decentralised identity management. The approach aims to support decision-making about the usefulness of the transition based on the created assessment model. The approach is validated through a case study of the X-Road ecosystem.
  • Online correlation for unlabeled process events: A flexible CEP-based approach
    (ProQuest Central, 2022) M.A. Helal, Iman; Awad, Ahmed
    Process mining is a sub-field of data mining that focuses on analyzing timestamped and partially ordered data. This type of data is commonly called event logs. Each event is required to have at least three attributes: case ID, task ID/name, and timestamp to apply process mining techniques. Thus, any missing information needs to be supplied first. Traditionally, events collected from different sources are manually correlated. While this might be acceptable in an offline setting, it is infeasible in an online setting. Recently, several use cases have emerged that call for applying process mining in an online setting. In such scenarios, a stream of high-speed and high-volume events continuously flows, e.g. in IoT applications, with stringent latency requirements to have insights about the ongoing process. Thus, event correlation must be automated and occur as the data is being received. We introduce an approach that correlates unlabeled events received on a stream. Given a set of start activities, our approach correlates unlabeled events to a case identifier. Our approach is probabilistic. That implies a single uncorrelated event can be assigned to zero or more case identifiers with different probabilities. Moreover, our approach is flexible. That is, the user can supply domain knowledge in the form of constraints that reduce the correlation space. This knowledge can be supplied while the application is running. We realize our approach using complex event processing (CEP) technologies. We implemented a prototype on top of Esper, a state-of-the-art industrial CEP engine. We compare our approach to baseline approaches. The experimental evaluation shows that our approach outperforms the baseline approaches in throughput and latency. It also shows that, using real-life logs, the accuracy of our approach can compete with the baseline approaches.
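    A rough sketch of the probabilistic assignment (the follows-relation below is a stand-in for the paper's CEP constraints, and all names are invented): an unlabeled event is distributed over every open case it could plausibly continue:

```python
# Sketch of probabilistic correlation of unlabeled events: each incoming
# activity is assigned to every open case whose last activity may precede
# it, with uniform probability over the candidates.
FOLLOWS = {"register": {"check"}, "check": {"decide"}}  # domain knowledge
START = {"register"}

open_cases = {}   # case_id -> last correlated activity
next_case = 1

def correlate(activity):
    global next_case
    if activity in START:
        cid = f"case-{next_case}"; next_case += 1
        open_cases[cid] = activity
        return {cid: 1.0}
    candidates = [c for c, last in open_cases.items()
                  if activity in FOLLOWS.get(last, set())]
    if not candidates:
        return {}
    p = 1.0 / len(candidates)  # uniform over plausible cases
    for c in candidates:
        open_cases[c] = activity  # optimistic update, kept simple here
    return {c: p for c in candidates}

for act in ["register", "register", "check", "decide"]:
    print(act, correlate(act))  # ambiguous events get probability 0.5 each
```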
  • Optimizing ADWIN for Steady Streams
    (ACM, 2022) Moharram, Hassan; Awad, Ahmed; M. El-Kafrawy, Passent
    With the ever-growing data generation rates and stringent constraints on the latency of analyzing such data, stream analytics is overtaking. Learning from data streams, aka online machine learning, is no exception. However, online machine learning comes with many challenges for the different aspects of the learning process, starting from the algorithm design to the evaluation method. One of these challenges is the ability of a learning system to adapt to the change in data distribution, known as concept drift, to maintain the accuracy of the predictions. Over time, several drift detection approaches have been proposed. A prominent approach is adaptive windowing (ADWIN), which can detect changes in feature data distribution without explicit feedback on the correctness of the prediction. Several variants of ADWIN have been proposed to enhance its runtime performance w.r.t. throughput and latency. However, the drift detection accuracy of these variants was not compared with the original algorithm. Moreover, there is no study concerning the memory consumption of the variants and the original algorithm. Additionally, the evaluation was done on synthetic datasets with a considerable number of drifts, not covering all types of drifts or steady streams, those that do not have drifts at all or almost negligible drifts. The contribution of this paper is two-fold. First, we compare the original Adaptive Window (ADWIN) and its variants: Serial, HalfCut, and Optimistic in terms of drift detection accuracy, detection speed, and memory consumption, represented in the internal window size. We compare them using synthetic data sets covering different types of concept drifts, namely: incremental, gradual, abrupt, and steady. We also use two real-life datasets whose drifts are unknown. Second, we present ADWIN++. We use an adaptive bucket-dropping technique to control window size. We evaluate our technique on the same data sets above and new datasets with fewer drifts. Experiments show that our approach saves about 80% of memory consumption. Moreover, it takes less time to detect concept drift and maintains the drift detection accuracy.
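    A didactic reduction of the memory idea (this is not the published ADWIN++ algorithm; the merging policy below is deliberately simplified): keep a bounded number of buckets by merging the oldest ones, so a steady, drift-free stream cannot grow the window indefinitely:

```python
# Simplified picture of bounding ADWIN-style bucket memory: cap the bucket
# list and coarsen the oldest history by merging, preserving sums/counts.
class BucketWindow:
    def __init__(self, max_buckets=16):
        self.buckets = []          # oldest first: (sum, count)
        self.max_buckets = max_buckets

    def add(self, value):
        self.buckets.append((value, 1))
        while len(self.buckets) > self.max_buckets:
            (s0, c0), (s1, c1) = self.buckets[0], self.buckets[1]
            self.buckets[1:2] = []                 # drop slot 1 ...
            self.buckets[0] = (s0 + s1, c0 + c1)   # ... after merging into slot 0

    def mean(self):
        total = sum(s for s, _ in self.buckets)
        count = sum(c for _, c in self.buckets)
        return total / count if count else 0.0

w = BucketWindow(max_buckets=8)
for _ in range(1000):
    w.add(1.0)           # steady stream, no drift
print(len(w.buckets), round(w.mean(), 2))  # memory stays at 8 buckets, mean 1.0
```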
  • Predicting Remaining Cycle Time from Ongoing Cases: A Survival Analysis-Based Approach
    (Springer, Cham, 2020) Baskharon, Fadi; Awad, Ahmed; Di Francescomarino, Chiara
    Predicting the remaining cycle time of running cases is one important use case of predictive process monitoring. Different approaches that learn from event logs, e.g., relying on an existing representation of the process or leveraging machine learning approaches, have been proposed in the literature to tackle this problem. Machine learning-based techniques have shown superiority over other techniques with respect to the accuracy of the prediction as well as freedom from knowledge about the underlying process models generating the logs. However, all proposed approaches learn from complete traces. This might cause delays in starting new training cycles, as process instances usually last over long time periods of hours, days, weeks or even months. In this paper, we propose a machine learning approach that can learn from incomplete ongoing traces. Using a time-aware survival analysis technique, we can train a neural network to predict the remaining cycle time of a running case. Our approach accepts as input both complete and incomplete traces. We have evaluated our approach on different real-life datasets and compared it with a state-of-the-art baseline. Results show that our approach, in many cases, is able to outperform the baseline approach both in accuracy and training time.
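    To illustrate why incomplete traces are usable at all, a hedged sketch of the training-sample construction (field names are invented; the neural survival model itself is out of scope): ongoing traces become right-censored observations:

```python
# Sketch of the training-data idea: complete traces yield exact remaining
# times, while ongoing traces yield right-censored samples (a lower bound
# plus a censoring flag) that a survival model can still learn from.
def make_samples(trace, now, completed):
    """trace: list of (activity, timestamp). One sample per prefix."""
    end = trace[-1][1] if completed else now
    samples = []
    for i, (act, ts) in enumerate(trace):
        samples.append({
            "prefix": [a for a, _ in trace[:i + 1]],
            "elapsed": ts - trace[0][1],
            "remaining": end - ts,      # exact if completed, a lower bound if not
            "censored": not completed,  # survival models consume this flag
        })
    return samples

done = [("A", 0), ("B", 4), ("C", 9)]
running = [("A", 0), ("B", 6)]
samples = (make_samples(done, now=9, completed=True)
           + make_samples(running, now=10, completed=False))
for s in samples:
    print(s)
```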