TY - GEN
T1 - Optimizing queue-based semi-stream joins with indexed master data
AU - Naeem, M. Asif
AU - Weber, Gerald
AU - Lutteroth, Christof
AU - Dobbie, Gillian
PY - 2014
Y1 - 2014
AB - In Data Stream Management Systems (DSMS), semi-stream processing has become a popular area of research due to the high demand from applications for up-to-date information (e.g. in real-time data warehousing). A common operation in stream processing is joining an incoming stream with disk-based master data, also known as a semi-stream join. This join typically works under the constraint of limited main memory, which is generally not large enough to hold the whole disk-based master data. Many semi-stream joins use a queue of stream tuples to amortize the disk access to the master data, and use an index to allow directed access to master data, avoiding the loading of unnecessary master data. In such a situation the question arises as to which master data partitions should be accessed, as any stream tuple from the queue could serve as a lookup element for accessing the master data index. Existing algorithms use simple, safe and correct strategies, but these are not optimal in the sense of maximizing the join service rate. In this paper we analyze strategies for selecting an appropriate lookup element, particularly for skewed stream data. We show that a good selection strategy can improve the performance of a semi-stream join significantly, both for synthetic and real data sets with known skewed distributions.
UR - http://dx.doi.org/10.1007/978-3-319-10160-6_16
UR - https://www.scopus.com/pages/publications/84906861198
U2 - 10.1007/978-3-319-10160-6_16
DO - 10.1007/978-3-319-10160-6_16
M3 - Chapter in a published conference proceeding
SN - 9783319101590
T3 - Lecture Notes in Computer Science
SP - 171
EP - 182
BT - Data Warehousing and Knowledge Discovery
A2 - Bellatreche, L.
A2 - Mohania, M. K.
PB - Springer India
ER -