Browsing by Author "Sakr, Sherif"

Now showing 1–9 of 9
  • Adaptive Watermarks: A Concept Drift-based Approach for Predicting Event-Time Progress in Data Streams
    (OpenProceedings.org, 2019) Awad, Ahmed; Traub, Jonas; Sakr, Sherif
    Event-time based stream processing is concerned with analyzing data with respect to its generation time. In most cases, data gets delayed on its journey from the source(s) to the stream processing engine; this is known as late data arrival. Among the different approaches to out-of-order stream processing, the low-watermark approach injects special records, called watermarks, within data streams. A watermark is a timestamp which indicates that no data with a timestamp older than the watermark should be observed later on; any such element is considered a late arrival. Watermark generation is usually periodic and heuristic-based. The limitation of such a watermark generation strategy is its rigidness regarding the frequency of data arrival as well as the delay that data may encounter. In this paper, we propose an adaptive watermark generation strategy. Our strategy decides adaptively when to generate watermarks and with what timestamp, without a priori adjustment. We treat changes in data arrival frequency and changes in delays as concept drifts in stream data mining, and use an Adaptive Window (ADWIN) as our concept drift sensor for changes in the distribution of arrival rate and delay. We have implemented our approach on top of Apache Flink and compare it with periodic watermark generation using two real-life data sets. Our results show that adaptive watermarks achieve lower average latency by triggering windows earlier, and a lower rate of dropped elements by delaying watermarks when out-of-order data is expected.
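    A minimal sketch of the core idea, assuming a simplified two-half mean comparison in place of ADWIN (the class and parameter names here are illustrative, not the paper's API):

    from collections import deque

    class AdaptiveWatermarkGenerator:
        def __init__(self, window=200, threshold=0.5):
            self.delays = deque(maxlen=window)   # observed delays: arrival time - event time
            self.threshold = threshold           # relative change that counts as drift
            self.max_event_time = 0.0

        def observe(self, event_time, arrival_time):
            """Record one event; return a watermark timestamp if delay drift is detected."""
            self.max_event_time = max(self.max_event_time, event_time)
            self.delays.append(arrival_time - event_time)
            if len(self.delays) < self.delays.maxlen:
                return None
            half = len(self.delays) // 2
            old = sum(list(self.delays)[:half]) / half
            new = sum(list(self.delays)[half:]) / half
            if abs(new - old) > self.threshold * max(old, 1e-9):
                self.delays.clear()               # drift: the delay distribution changed
                return self.max_event_time - new  # watermark allowing for the new delay
            return None

    Instead of emitting watermarks on a fixed period, the generator emits one only when the observed delay distribution shifts, which mirrors the paper's adaptivity argument.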
  • Big Stream Processing Systems: An Experimental Evaluation
    (IEEE Computer Society, 2019) Shahverdi, Elkhan; Awad, Ahmed; Sakr, Sherif
    As the world gets more instrumented and connected, we are witnessing a flood of digital data generated from various hardware (e.g., sensors) or software in the format of flowing streams of data. Real-time processing for such massive amounts of streaming data is a crucial requirement in several application domains including financial markets, surveillance systems, manufacturing, smart cities, and scalable monitoring infrastructure. In the last few years, several big stream processing engines have been introduced to tackle this challenge. In this article, we present an extensive experimental study of five popular systems in this domain, namely, Apache Storm, Apache Flink, Apache Spark, Kafka Streams and Hazelcast Jet. We report and analyze the performance characteristics of these systems. In addition, we report a set of insights and important lessons that we have learned from conducting our experiments.
  • Calculation of Average Road Speed Based on Car-to-Car Messaging
    (IEEE, 2019) Ramzy, Ahmed; Awad, Ahmed; A. Kamel, Amr; Hegazy, Osman; Sakr, Sherif
    Arrival time prediction provided by most navigation systems is affected by several factors, such as road condition, travel time, weather condition, car speed, etc. These predictions are mainly based on historical data. Systems that provide near real-time road condition updates, e.g. Google Maps, depend on crowdsourcing GPS data from cars or mobile devices on the road. GPS data thus has a long journey to travel from its sources to the analytics engine on the cloud before a status update is sent back to the client. In the time it takes for GPS data to be broadcast, received and processed, significant changes in road conditions can take place and still go unreported, leading to wrong routing decisions. Monitoring road conditions, especially the average speed of cars, is local and continuous in nature; it needs to be accomplished near the GPS stream data sources to reduce latency and increase reporting accuracy. Solutions based on geo-distributed road monitoring, using the fog-computing paradigm, provide lower latency and higher accuracy than centralized (cloud-based) approaches. Yet they require heavy investment and a large infrastructure, which might limit their utility in some countries, e.g. Egypt. In this paper, we propose a more dynamic approach to continuously update the average speed on the road. The computation is done locally on the client device, e.g. the traveling car or the mobile device of the traveler. We compare, through simulation, our proposed approach to fog-computing-based traffic monitoring. Simulation results give empirical evidence of the correctness of our results compared to fog-based speed calculation.
    Index Terms: Traffic Monitoring; D2D Communication; Crowdsourcing
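    An illustrative sketch of the on-device aggregation idea, not the paper's actual protocol: each client keeps a time-windowed average of speed reports received from nearby cars over car-to-car messages (the message fields and window length are assumptions):

    import time
    from collections import defaultdict, deque

    WINDOW_SECONDS = 60           # assumed horizon after which reports go stale
    reports = defaultdict(deque)  # segment_id -> deque of (timestamp, speed_kmh)

    def on_message(segment_id, speed_kmh, ts=None):
        """Handle one speed report received from a nearby car."""
        ts = ts if ts is not None else time.time()
        q = reports[segment_id]
        q.append((ts, speed_kmh))
        while q and q[0][0] < ts - WINDOW_SECONDS:  # drop stale reports
            q.popleft()

    def average_speed(segment_id):
        """Locally computed average speed for a road segment, or None if no data."""
        q = reports[segment_id]
        return sum(s for _, s in q) / len(q) if q else None

    Because the average is maintained on the client itself, no round trip to a cloud analytics engine is needed before a fresh estimate is available.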
  • D2IA: User-defined interval analytics on distributed streams
    (ProQuest Central, 2022) Awad, Ahmed; Tommasini, Riccardo; Langhi, Samuele; Kamel, Mahmoud; Della Valle, Emanuele; Sakr, Sherif
    Nowadays, modern Big Stream Processing Solutions (e.g. Spark, Flink) are working towards being the ultimate framework for streaming analytics. In order to achieve this goal, they started to offer extensions of SQL that incorporate stream-oriented primitives such as windowing and Complex Event Processing (CEP). The former enables stateful computation on infinite sequences of data items, while the latter focuses on the detection of event patterns. In most cases, data items and events are considered instantaneous, i.e., they are single time points in a discrete temporal domain. Nevertheless, a point-based time semantics does not satisfy the requirements of a number of use-cases. For instance, it is not possible to detect the interval during which the temperature increases until it begins to decrease, nor all the relations this interval subsumes. To tackle this challenge, we present D2IA, a set of novel abstract operators to define analytics on user-defined event intervals based on raw events and to efficiently reason about temporal relationships between intervals and/or point events. We realize the implementation of the concepts of D2IA on top of Flink, a distributed stream processing engine for big data.
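    Temporal relationships between intervals are conventionally formalized as Allen's interval algebra; a minimal sketch of a few such checks, plus a point-versus-interval test (the relation subset and names are illustrative, not D2IA's operator set):

    from dataclasses import dataclass

    @dataclass
    class Interval:
        start: float
        end: float

    def before(a, b):   return a.end < b.start
    def meets(a, b):    return a.end == b.start
    def overlaps(a, b): return a.start < b.start < a.end < b.end
    def during(a, b):   return b.start < a.start and a.end < b.end
    def contains_point(a, t): return a.start <= t <= a.end  # point event vs. interval

    rising = Interval(10.0, 25.0)  # e.g., an interval of increasing temperature
    probe  = Interval(12.0, 20.0)
    assert during(probe, rising) and contains_point(rising, 15.0)

    Allen's algebra defines thirteen relations in total; an interval analytics layer evaluates such predicates over intervals derived from raw point events.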
  • D2IA: Stream Analytics on User-Defined Event Intervals
    (Springer Nature Switzerland AG, 2019) Awad, Ahmed; Tommasini, Riccardo; Kamel, Mahmoud; Della Valle, Emanuele; Sakr, Sherif
    Nowadays, modern Big Stream Processing Solutions (e.g. Spark, Flink) are working towards being the ultimate framework for streaming analytics. In order to achieve this goal, they started to offer extensions of SQL that incorporate stream-oriented primitives such as windowing and Complex Event Processing (CEP). The former enables stateful computation on infinite sequences of data items, while the latter focuses on the detection of event patterns. In most cases, data items and events are considered instantaneous, i.e., they are single time points in a discrete temporal domain. Nevertheless, a point-based time semantics does not satisfy the requirements of a number of use-cases. For instance, it is not possible to detect the interval during which the temperature increases until it begins to decrease, nor all the relations this interval subsumes. To tackle this challenge, we present D2IA, a set of novel abstract operators to define analytics on user-defined event intervals based on raw events and to efficiently reason about temporal relationships between intervals and/or point events. We realize the implementation of the concepts of D2IA on top of Esper, a centralized stream processing system, and Flink, a distributed stream processing engine for big data.
  • DISGD: A Distributed Shared-nothing Matrix Factorization for Large Scale Online Recommender Systems
    (OpenProceedings.org, 2020) Hazem, Heidy; Awad, Ahmed; Hassan, Ahmed; Sakr, Sherif
    With web-scale data volumes and high generation rates, it has become crucial that the training process for recommender systems be a continuous process performed on live data, i.e., on data streams. In practice, such systems have to address three main requirements: the ability to adapt the trained model with each incoming data element, the ability to handle concept drifts, and the ability to scale with the volume of the data. In principle, matrix factorization is one of the popular approaches to train a recommender model, and Stochastic Gradient Descent (SGD) has been a successful optimization approach for matrix factorization. Several approaches have been proposed that handle the first and second requirements. For the third requirement, in the realm of data streams, distributed approaches depend on a shared-memory architecture, which requires obtaining locks before performing updates. In general, the success of mainstream big data processing systems is supported by their shared-nothing architecture. In this paper, we propose DISGD, a distributed shared-nothing variant of incremental SGD. The proposal is motivated by the observation that, with large volumes of data and sparse user-item matrices, lock-free updates that overwrite one another do not affect the result. Compared to the baseline incremental approach, our evaluation on several datasets shows not only improvement in processing time but also a 55% improvement in recall.
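    A minimal single-process sketch of the incremental SGD step that underlies this approach; DISGD distributes such updates across shared-nothing workers and tolerates occasional overwrites instead of locking (hyperparameters and the implicit rating of 1.0 are assumptions):

    import numpy as np

    K, LR, REG = 16, 0.05, 0.01   # latent dimensions, learning rate, regularization
    rng = np.random.default_rng(0)
    U, V = {}, {}                  # user/item factor vectors, grown on the fly

    def update(user, item, rating=1.0):
        """Apply one SGD step for a single incoming (user, item) stream element."""
        u = U.setdefault(user, rng.normal(0, 0.1, K))
        v = V.setdefault(item, rng.normal(0, 0.1, K))
        err = rating - u @ v                      # prediction error for this event
        U[user] = u + LR * (err * v - REG * u)
        V[item] = v + LR * (err * u - REG * v)

    for user, item in [("u1", "i1"), ("u1", "i2"), ("u2", "i1")]:
        update(user, item)

    With sparse user-item matrices, two workers rarely touch the same factor vector at the same time, which is why overwrites cost little accuracy in practice.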
  • MINARET: A Recommendation Framework for Scientific Reviewers
    (OpenProceedings.org, 2019) R. Moawad, Mohamed; Maher, Mohamed; Awad, Ahmed; Sakr, Sherif
    We are witnessing continuous growth in the size of scientific communities and the number of scientific publications. This phenomenon requires a continuous effort to ensure the quality of publications and a healthy scientific evaluation process. Peer reviewing is the de facto mechanism to assess the quality of scientific work. For journal editors, managing an efficient and effective manuscript peer review process is not a straightforward task. In particular, a main component of the journal editor's role is, for each submitted manuscript, to select adequate reviewers, who need to be: 1) matching in their research interests with the topic of the submission; 2) fair in their evaluation of the submission, i.e., having no conflict of interest with the authors; and 3) qualified in terms of various aspects including scientific impact, previous review/authorship experience for the journal, quality of reviews, etc. Thus, manually selecting and assessing adequate reviewers is becoming a tedious and time-consuming task. We demonstrate MINARET, a recommendation framework for selecting scientific reviewers. The framework facilitates the job of journal editors in conducting an efficient and effective scientific review process. It exploits the valuable information available on modern scholarly websites (e.g., Google Scholar, ACM DL, DBLP, Publons) to identify candidate reviewers relevant to the topic of the manuscript, filter them (e.g. excluding those with a potential conflict of interest), and rank them based on several metrics configured by the editor (user). The framework extracts the information required for the recommendation process from online resources on-the-fly, which ensures that the output recommendations are dynamic and based on up-to-date information.
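    An illustrative filter-then-rank sketch of the workflow described above; the field names, conflict-of-interest test, and weights are hypothetical, not MINARET's actual scoring model:

    def rank_reviewers(candidates, manuscript_authors, weights):
        """Drop conflicted candidates, then sort by an editor-weighted score."""
        eligible = [c for c in candidates
                    if not set(c["coauthors"]) & set(manuscript_authors)]  # COI filter
        def score(c):
            return sum(weights[m] * c["metrics"][m] for m in weights)
        return sorted(eligible, key=score, reverse=True)

    # Example of editor-configured metric weights (hypothetical names):
    weights = {"topic_match": 0.5, "citation_impact": 0.3, "review_history": 0.2}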
  • Process Mining over Unordered Event Streams
    (IEEE, 2020) Awad, Ahmed; Weidlich, Matthias; Sakr, Sherif
    Process mining is no longer limited to the one-off analysis of static event logs extracted from a single enterprise system. Rather, process mining may strive for immediate insights based on streams of events that are continuously generated by diverse information systems. This requires online algorithms that, instead of keeping the whole history of event data, work incrementally and update analysis results upon the arrival of new events. While such online algorithms have been proposed for several process mining tasks, from discovery through conformance checking to time prediction, they all assume that an event stream is ordered, meaning that the order of event generation coincides with their arrival at the analysis engine. Yet, once events are emitted by independent, distributed systems, this assumption may not hold true, which compromises analysis accuracy. In this paper, we provide the first contribution towards handling unordered event streams in process mining. Specifically, we formalize the notion of out-of-order arrival of events, where an online analysis algorithm needs to process events in an order different from their generation. Using directly-follows graphs as a basic model for many process mining tasks, we provide two approaches to handle such unorderedness, either through buffering or speculative processing. Our experiments with synthetic and real-life event data show that these techniques help mitigate the accuracy loss induced by unordered streams.
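    A minimal sketch of the buffering approach for directly-follows graphs: hold each case's events briefly, restore event-time order, then count directly-follows pairs (the fixed buffer size and flush policy are simplifications relative to the paper):

    from collections import defaultdict

    BUFFER_SIZE = 8              # assumed per-case buffer before flushing
    buffers = defaultdict(list)  # case_id -> [(event_timestamp, activity)]
    dfg = defaultdict(int)       # (activity_a, activity_b) -> directly-follows count

    def on_event(case_id, timestamp, activity):
        buf = buffers[case_id]
        buf.append((timestamp, activity))
        if len(buf) >= BUFFER_SIZE:
            flush(case_id)

    def flush(case_id):
        buf = sorted(buffers.pop(case_id, []))  # restore event-time order
        for (_, a), (_, b) in zip(buf, buf[1:]):
            dfg[(a, b)] += 1

    Buffering trades latency for accuracy: the longer events are held, the more out-of-order arrivals are repaired before the graph is updated.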
  • SDDM: an interpretable statistical concept drift detection method for data streams
    (ProQuest Central, 2021) Micevska, Simona; Awad, Ahmed; Sakr, Sherif
    Machine learning models assume that data is drawn from a stationary distribution. However, in practice, challenges are imposed on models that need to make sense of fast-evolving data streams, where the content of the data is changing and evolving over time. This change between the distribution of the training data seen so far and the distribution of newly arriving data is called concept drift. It is of utmost importance to detect concept drifts to maintain the accuracy and reliability of online classifiers. Reactive drift detectors monitor the performance of the underlying machine learning model. That is, to detect a drift, feedback on the classifier output has to be given to the drift detector, known as prequential evaluation. In many real-life scenarios, immediate feedback on classifier output is not possible; thus, drift detection is delayed and gets out of context. Moreover, the drift detector's output is a binary answer as to whether there is a drift or not. However, it is equally important to explain the source of the drift. In this paper, we present the Statistical Drift Detection Method (SDDM), which can detect drifts by monitoring the change of data distribution without the need for feedback on classifier output. Moreover, the detection is quantified and the source of the drift is identified. We empirically evaluate our method against the state-of-the-art on both synthetic and real-life data sets. SDDM outperforms other related approaches by producing a smaller number of false positives and false negatives.
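    An illustrative sketch of feedback-free, per-feature drift detection in this spirit; SDDM's actual statistics and thresholds differ, and the histogram distance used here merely shows how the source of a drift can be localized:

    import numpy as np

    def histogram(x, bins, rng):
        h, _ = np.histogram(x, bins=bins, range=rng)
        return h / max(h.sum(), 1)

    def drifted_features(reference, current, bins=10, threshold=0.25):
        """Return indices of features whose distribution changed, not just a yes/no."""
        out = []
        for j in range(reference.shape[1]):
            lo = min(reference[:, j].min(), current[:, j].min())
            hi = max(reference[:, j].max(), current[:, j].max())
            p = histogram(reference[:, j], bins, (lo, hi))
            q = histogram(current[:, j], bins, (lo, hi))
            if 0.5 * np.abs(p - q).sum() > threshold:  # total variation distance
                out.append(j)
        return out

    Comparing the data distribution of a reference window against the current window needs no classifier feedback, and reporting which feature drifted gives the interpretability the abstract emphasizes.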