Optimizing queue-based semi-stream joins with indexed master data

M. Asif Naeem, Gerald Weber, Christof Lutteroth, Gillian Dobbie

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

2 Citations (Scopus)

Abstract

In Data Stream Management Systems (DSMS), semi-stream processing has become a popular area of research due to the high demand of applications for up-to-date information (e.g., in real-time data warehousing). A common operation in stream processing is joining an incoming stream with disk-based master data, also known as a semi-stream join. This join typically works under the constraint of limited main memory, which is generally not large enough to hold the whole disk-based master data. Many semi-stream joins use a queue of stream tuples to amortize the disk access to the master data, and use an index to allow directed access to master data, avoiding the loading of unnecessary master data. In such a situation, the question arises of which master data partitions should be accessed, as any stream tuple from the queue could serve as a lookup element for accessing the master data index. Existing algorithms use simple, safe, and correct strategies, but these are not optimal in the sense that they do not maximize the join service rate. In this paper, we analyze strategies for selecting an appropriate lookup element, particularly for skewed stream data. We show that a good selection strategy can improve the performance of a semi-stream join significantly, for both synthetic and real data sets with known skewed distributions.
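The queue-and-index mechanism described in the abstract can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the algorithm from the paper: the partition layout (`PARTITION_SIZE`, `partition_of`), the simulated disk read (`load_partition`), and both selection strategies (`pick_oldest`, `pick_most_frequent`) are hypothetical names and choices introduced here for illustration. The abstract does not specify which strategies existing algorithms use or which ones the paper proposes.

```python
from collections import Counter, deque
from itertools import islice

PARTITION_SIZE = 4  # assumption: number of master keys stored per disk partition


def partition_of(key):
    """Map a join key to its master data partition (assumed range partitioning)."""
    return key // PARTITION_SIZE


def load_partition(master, pid):
    """Simulate one indexed disk read: return the master rows of partition pid."""
    lo = pid * PARTITION_SIZE
    return {k: master[k] for k in range(lo, lo + PARTITION_SIZE) if k in master}


def pick_oldest(queue):
    """Hypothetical simple strategy: use the oldest queued tuple as lookup element."""
    return queue[0]


def pick_most_frequent(queue):
    """Hypothetical skew-aware strategy: use a tuple from the partition that
    currently has the most queued tuples."""
    counts = Counter(partition_of(k) for k in queue)
    best_pid = max(counts, key=counts.get)
    return next(k for k in queue if partition_of(k) == best_pid)


def semi_stream_join(stream, master, pick, queue_capacity=4):
    """Join stream keys with master data; return (join output, partition loads)."""
    pending = iter(stream)
    queue = deque()
    output, loads = [], 0
    while True:
        # Refill the queue from the stream up to its capacity.
        queue.extend(islice(pending, queue_capacity - len(queue)))
        if not queue:
            break
        # One index lookup loads exactly one master data partition.
        pid = partition_of(pick(queue))
        rows = load_partition(master, pid)
        loads += 1
        # Every queued tuple in that partition is served by this single load;
        # tuples without a master match are dropped, the rest stay queued.
        remaining = deque()
        for key in queue:
            if partition_of(key) == pid:
                if key in rows:
                    output.append((key, rows[key]))
            else:
                remaining.append(key)
        queue = remaining
    return output, loads


master = {k: f"m{k}" for k in range(16)}
stream = [0, 5, 1, 9, 2, 0, 13, 1, 3, 0, 6, 2]  # skewed toward partition 0
for strategy in (pick_oldest, pick_most_frequent):
    joined, loads = semi_stream_join(stream, master, strategy)
    print(strategy.__name__, "tuples joined:", len(joined), "partition loads:", loads)
```

Both strategies produce the same join result; they differ only in how many partition loads are needed and in how long individual tuples wait in the queue. Which one needs fewer loads depends on the stream and the queue size, and a frequency-based choice can delay tuples from rarely accessed partitions, which is one reason the selection strategy deserves analysis.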
Original language: English
Title of host publication: Data Warehousing and Knowledge Discovery
Subtitle of host publication: Proceedings of the 16th International Conference, DaWaK 2014, Munich, Germany, September 2-4, 2014
Editors: L. Bellatreche, M. K. Mohania
Publisher: Springer India
Pages: 171-182
Number of pages: 12
ISBN (Print): 9783319101590
DOI: https://doi.org/10.1007/978-3-319-10160-6_16
Publication status: Published - 2014

Publication series

Name: Lecture Notes in Computer Science
Volume: 8646

Fingerprint

Data warehouses
Processing
Joining
Data storage equipment

Cite this

Naeem, M. A., Weber, G., Lutteroth, C., & Dobbie, G. (2014). Optimizing queue-based semi-stream joins with indexed master data. In L. Bellatreche, & M. K. Mohania (Eds.), Data Warehousing and Knowledge Discovery: Proceedings of the 16th International Conference, DaWaK 2014, Munich, Germany, September 2-4, 2014 (pp. 171-182). (Lecture Notes in Computer Science; Vol. 8646). Springer India. https://doi.org/10.1007/978-3-319-10160-6_16

@inproceedings{b817f320f7e44689ac4332fe46cfeded,
title = "Optimizing queue-based semi-stream joins with indexed master data",
abstract = "In Data Stream Management Systems (DSMS) semi-stream processing has become a popular area of research due to the high demand of applications for up-to-date information (e.g. in real-time data warehousing). A common operation in stream processing is joining an incoming stream with disk-based master data, also known as semi-stream join. This join typically works under the constraint of limited main memory, which is generally not large enough to hold the whole disk-based master data. Many semi-stream joins use a queue of stream tuples to amortize the disk access to the master data, and use an index to allow directed access to master data, avoiding the loading of unnecessary master data. In such a situation the question arises which master data partitions should be accessed, as any stream tuple from the queue could serve as a lookup element for accessing the master data index. Existing algorithms use simple safe and correct strategies, but are not optimal in the sense that they maximize the join service rate. In this paper we analyze strategies for selecting an appropriate lookup element, particularly for skewed stream data. We show that a good selection strategy can improve the performance of a semi-stream join significantly, both for synthetic and real data sets with known skewed distributions.",
author = "Naeem, {M. Asif} and Gerald Weber and Christof Lutteroth and Gillian Dobbie",
year = "2014",
doi = "10.1007/978-3-319-10160-6_16",
language = "English",
isbn = "9783319101590",
series = "Lecture Notes in Computer Science",
publisher = "Springer India",
pages = "171--182",
editor = "L. Bellatreche and Mohania, {M. K.}",
booktitle = "Data Warehousing and Knowledge Discovery",
address = "India",
}

TY - GEN
T1 - Optimizing queue-based semi-stream joins with indexed master data
AU - Naeem, M. Asif
AU - Weber, Gerald
AU - Lutteroth, Christof
AU - Dobbie, Gillian
PY - 2014
Y1 - 2014
AB - In Data Stream Management Systems (DSMS) semi-stream processing has become a popular area of research due to the high demand of applications for up-to-date information (e.g. in real-time data warehousing). A common operation in stream processing is joining an incoming stream with disk-based master data, also known as semi-stream join. This join typically works under the constraint of limited main memory, which is generally not large enough to hold the whole disk-based master data. Many semi-stream joins use a queue of stream tuples to amortize the disk access to the master data, and use an index to allow directed access to master data, avoiding the loading of unnecessary master data. In such a situation the question arises which master data partitions should be accessed, as any stream tuple from the queue could serve as a lookup element for accessing the master data index. Existing algorithms use simple safe and correct strategies, but are not optimal in the sense that they maximize the join service rate. In this paper we analyze strategies for selecting an appropriate lookup element, particularly for skewed stream data. We show that a good selection strategy can improve the performance of a semi-stream join significantly, both for synthetic and real data sets with known skewed distributions.

UR - http://dx.doi.org/10.1007/978-3-319-10160-6_16
U2 - 10.1007/978-3-319-10160-6_16
DO - 10.1007/978-3-319-10160-6_16
M3 - Conference contribution
SN - 9783319101590
T3 - Lecture Notes in Computer Science
SP - 171
EP - 182
BT - Data Warehousing and Knowledge Discovery
A2 - Bellatreche, L.
A2 - Mohania, M. K.
PB - Springer India
ER -