Keyed Watermarks: A Fine-grained Tracking of Event-time in Apache Flink
Date
2023
Publisher
IEEE
Abstract
Big data stream processing engines such as Apache Flink use windowing
techniques to handle unbounded streams of events. For event-time
windowing, gathering all pertinent input within a window is crucial,
since it determines how accurate the results are. Watermarks, which are
timestamps that indicate the progress of event time, play a significant
part in this process. However, the current watermark generation method
in Apache Flink, which works at the level of the input stream, tends to
favor faster sub-streams, resulting in dropped events from slower
sub-streams. In our analysis, we found that Apache Flink's vanilla
watermark generation approach caused around 33% data loss when 50% of
the keys around the median were delayed. Furthermore, the loss surpassed
37% when 50% of randomly chosen keys were delayed.
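The effect described above can be illustrated with a small simulation. This is a minimal sketch under simplifying assumptions, not Apache Flink's implementation: it models a single stream-level watermark as the largest timestamp seen so far minus a fixed bound, uses tumbling windows, and treats an event as dropped once the watermark has reached the end of its window. The names `run_global_watermark`, `window_end`, `size`, and `bound` are hypothetical and chosen only for illustration.

```python
from typing import Iterable, List, Tuple

Event = Tuple[str, int]  # (key, event-time timestamp in ms)


def window_end(ts: int, size: int) -> int:
    """Exclusive end of the tumbling window that contains ts."""
    return (ts // size + 1) * size


def run_global_watermark(
    events: Iterable[Event], size: int, bound: int
) -> Tuple[List[Event], List[Event]]:
    """Split events into (accepted, dropped) using ONE watermark for the whole stream.

    Simplified model (an assumption of this sketch):
    watermark = max timestamp seen so far - bound.
    """
    watermark = float("-inf")
    accepted, dropped = [], []
    for key, ts in events:
        if window_end(ts, size) <= watermark:
            # The window for this event has already closed: the event is late.
            dropped.append((key, ts))
        else:
            accepted.append((key, ts))
        # Any key can advance the shared watermark.
        watermark = max(watermark, ts - bound)
    return accepted, dropped


# The "fast" key races ahead in event time; the "slow" key arrives afterwards.
events = [("fast", 100), ("fast", 1200), ("fast", 2300),
          ("slow", 150), ("slow", 1250)]
accepted, dropped = run_global_watermark(events, size=1000, bound=0)
print(dropped)
```

In this toy run the fast key pushes the shared watermark to 2300 before any slow event arrives, so both slow events fall into windows that are already closed and are dropped, even though they are in order relative to their own key.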
In this paper, we present a novel strategy called keyed watermarks to
overcome this data loss and increase the accuracy of data processing to
at least 99% in most cases. We enable separate progress tracking by
creating a dedicated watermark for each logical sub-stream (key). We
outline the architectural and API changes necessary to implement keyed
watermarks and discuss our experience in extending Apache Flink's
enormous code base. Additionally, we compare the effectiveness of our
strategy against the conventional watermark generation method in terms
of the accuracy of event-time tracking.
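The idea of tracking progress per key can be sketched as follows. This is an illustrative model only, not the authors' Flink extension or any Flink API: it assumes one watermark per key, computed as that key's largest timestamp seen so far minus a fixed bound, with tumbling windows and an event counted as dropped once its own key's watermark has reached the end of its window. The names `run_keyed_watermarks`, `window_end`, `size`, and `bound` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Event = Tuple[str, int]  # (key, event-time timestamp in ms)


def window_end(ts: int, size: int) -> int:
    """Exclusive end of the tumbling window that contains ts."""
    return (ts // size + 1) * size


def run_keyed_watermarks(
    events: Iterable[Event], size: int, bound: int
) -> Tuple[List[Event], List[Event], Dict[str, float]]:
    """Split events into (accepted, dropped) using a SEPARATE watermark per key.

    Simplified model (an assumption of this sketch):
    watermark[key] = max timestamp seen for that key - bound.
    """
    watermarks: Dict[str, float] = defaultdict(lambda: float("-inf"))
    accepted, dropped = [], []
    for key, ts in events:
        if window_end(ts, size) <= watermarks[key]:
            # Late only with respect to this key's own progress.
            dropped.append((key, ts))
        else:
            accepted.append((key, ts))
        # Only events of the same key advance this key's watermark.
        watermarks[key] = max(watermarks[key], ts - bound)
    return accepted, dropped, dict(watermarks)


# The "fast" key is far ahead in event time; the "slow" key lags behind.
events = [("fast", 100), ("fast", 1200), ("fast", 2300),
          ("slow", 150), ("slow", 1250), ("slow", 200)]
accepted, dropped, watermarks = run_keyed_watermarks(events, size=1000, bound=0)
print(dropped)
print(watermarks)
```

Here the slow key's events at 150 and 1250 are accepted because the fast key no longer advances their watermark. The final event at 200 is still dropped, since it is late relative to the slow key's own progress, which shows that per-key tracking keeps lateness detection within each sub-stream.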
Index Terms—Keyed Watermarks, Big Data Stream
Processing, Event-Time Tracking, Apache Flink.
Keywords
Keyed Watermarks, Big Data Stream Processing, Event-Time Tracking, Apache Flink
Citation
Yasser, T. et al. (2023) “Keyed Watermarks: A Fine-grained Tracking of Event-time in Apache Flink,” in 2023 5th Novel Intelligent and Leading Emerging Sciences Conference (NILES), pp. 23–28.