Optimising queue-based semi-stream joins by introducing a queue of frequent pages

M. Asif Naeem, Gerald Weber, Christof Lutteroth

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Semi-stream joins perform a join between a stream and a disk-based table. These joins can easily deal with typical workloads in online real-time data warehousing in many scenarios and with relatively modest system requirements. The disk access is page-based. In the past, several proposals have been made to exploit skew in the distribution of the join attribute. Such skew is a common result of natural short- or long-tailed distributions in master data. Several semi-stream joins use caching strategies in order to improve performance. This works up to a point, but these algorithms still require relatively slow processing of stream data that matches with the infrequent tuples in the master data. In this work we explore the possibility of an additional strategy to exploit data skew: disk pages that are frequently accessed as a whole are accessed with priority. We show that considerable gain in service rate can be achieved with this strategy, while keeping memory consumption low. In essence we gain a three-stage approach to deal with skewed, unsorted data: caching plus our new strategy plus processing of the long tail of the distribution. We also present a cost model for our approach and validate our approach empirically.
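
The three-stage idea described in the abstract can be illustrated with a toy sketch. This is not the authors' algorithm or cost model: it is a simplified, batch-style illustration, and every name in it (`PAGE_SIZE`, `page_of`, `three_stage_join`, `cache`, `frequent_pages`) is a hypothetical choice made for this example. It assumes integer join keys stored in key order on fixed-size pages, and it counts simulated page reads instead of touching a disk.

```python
from collections import defaultdict

PAGE_SIZE = 4  # hypothetical: number of master-data tuples per disk page


def page_of(key):
    """Hypothetical page layout: integer keys stored in key order."""
    return key // PAGE_SIZE


def three_stage_join(stream, master, cache, frequent_pages):
    """Join stream keys against `master` (dict: key -> value) in three stages.

    `cache` is a small dict holding the most frequent master tuples and
    `frequent_pages` lists page ids that are read with priority.
    Returns (results, page_reads); page_reads counts simulated disk reads.
    """
    results = []
    pending = defaultdict(list)  # page id -> stream keys waiting for that page
    page_reads = 0

    # Stage 1: stream tuples that hit the cache are joined without disk access.
    for key in stream:
        if key in cache:
            results.append((key, cache[key]))
        else:
            pending[page_of(key)].append(key)

    # Stage 2: frequently accessed pages are read first.
    hot = [p for p in frequent_pages if p in pending]
    # Stage 3: the remaining long-tail pages are read afterwards.
    cold = [p for p in pending if p not in frequent_pages]

    for page in hot + cold:
        page_reads += 1  # one simulated read serves every waiting stream tuple
        for key in pending[page]:
            if key in master:
                results.append((key, master[key]))
    return results, page_reads
```

In this simplification a single page read serves all stream tuples waiting on that page, which is why prioritising frequently accessed pages can raise the service rate for skewed input. A real semi-stream join processes an unbounded stream under a memory budget, which this sketch does not model.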

Language: English
Title of host publication: Databases Theory and Applications - 27th Australasian Database Conference, ADC 2016, Proceedings
Editors: M. A. Cheema, W. Zhang, L. Chang
Publisher: Springer Verlag
Pages: 407-418
Number of pages: 12
ISBN (Print): 9783319469218
DOIs: https://doi.org/10.1007/978-3-319-46922-5_32
Status: Published - 1 Jan 2016
Event: 27th Australasian Database Conference on Databases Theory and Applications, ADC 2016 - Sydney, Australia
Duration: 28 Sep 2016 to 29 Sep 2016

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 9877

Conference

Conference: 27th Australasian Database Conference on Databases Theory and Applications, ADC 2016
Country: Australia
City: Sydney
Period: 28/09/16 to 29/09/16

Fingerprint

Join
Queue
Data warehouses
Skew
Processing
Caching
Data storage equipment
Data Warehousing
Cost Model
Costs
Data Streams
Workload
Tail
Table
Attribute
Real-time
Scenarios
Strategy
Requirements

Keywords

  • Indexing
  • Performance optimisation
  • Semi-stream join

ASJC Scopus subject areas

  • Theoretical Computer Science
  • Computer Science (all)

Cite this

Asif Naeem, M., Weber, G., & Lutteroth, C. (2016). Optimising queue-based semi-stream joins by introducing a queue of frequent pages. In M. A. Cheema, W. Zhang, & L. Chang (Eds.), Databases Theory and Applications - 27th Australasian Database Conference, ADC 2016, Proceedings (pp. 407-418). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 9877). Springer Verlag. https://doi.org/10.1007/978-3-319-46922-5_32

@inproceedings{58b95f0629ea4318a31a27fbdaffe344,
title = "Optimising queue-based semi-stream joins by introducing a queue of frequent pages",
abstract = "Semi-stream joins perform a join between a stream and a disk-based table. These joins can easily deal with typical workloads in online real-time data warehousing in many scenarios and with relatively modest system requirements. The disk access is page-based. In the past, several proposals have been made to exploit skew in the distribution of the join attribute. Such skew is a common result of natural short- or long-tailed distributions in master data. Several semi-stream joins use caching strategies in order to improve performance. This works up to a point, but these algorithms still require relatively slow processing of stream data that matches with the infrequent tuples in the master data. In this work we explore the possibility of an additional strategy to exploit data skew: disk pages that are frequently accessed as a whole are accessed with priority. We show that considerable gain in service rate can be achieved with this strategy, while keeping memory consumption low. In essence we gain a three-stage approach to deal with skewed, unsorted data: caching plus our new strategy plus processing of the long tail of the distribution. We also present a cost model for our approach and validate our approach empirically.",
keywords = "Indexing, Performance optimisation, Semi-stream join",
author = "{Asif Naeem}, M. and Gerald Weber and Christof Lutteroth",
year = "2016",
month = "1",
day = "1",
doi = "10.1007/978-3-319-46922-5_32",
language = "English",
isbn = "9783319469218",
series = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",
publisher = "Springer Verlag",
pages = "407--418",
editor = "Cheema, {M. A.} and Zhang, {W.} and Chang, {L.}",
booktitle = "Databases Theory and Applications - 27th Australasian Database Conference, ADC 2016, Proceedings",
address = "Germany",
}

TY - GEN

T1 - Optimising queue-based semi-stream joins by introducing a queue of frequent pages

AU - Asif Naeem, M.

AU - Weber, Gerald

AU - Lutteroth, Christof

PY - 2016/1/1

Y1 - 2016/1/1

AB - Semi-stream joins perform a join between a stream and a disk-based table. These joins can easily deal with typical workloads in online real-time data warehousing in many scenarios and with relatively modest system requirements. The disk access is page-based. In the past, several proposals have been made to exploit skew in the distribution of the join attribute. Such skew is a common result of natural short- or long-tailed distributions in master data. Several semi-stream joins use caching strategies in order to improve performance. This works up to a point, but these algorithms still require relatively slow processing of stream data that matches with the infrequent tuples in the master data. In this work we explore the possibility of an additional strategy to exploit data skew: disk pages that are frequently accessed as a whole are accessed with priority. We show that considerable gain in service rate can be achieved with this strategy, while keeping memory consumption low. In essence we gain a three-stage approach to deal with skewed, unsorted data: caching plus our new strategy plus processing of the long tail of the distribution. We also present a cost model for our approach and validate our approach empirically.

KW - Indexing

KW - Performance optimisation

KW - Semi-stream join

UR - http://www.scopus.com/inward/record.url?scp=84990061924&partnerID=8YFLogxK

U2 - 10.1007/978-3-319-46922-5_32

DO - 10.1007/978-3-319-46922-5_32

M3 - Conference contribution

SN - 9783319469218

T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

SP - 407

EP - 418

BT - Databases Theory and Applications - 27th Australasian Database Conference, ADC 2016, Proceedings

A2 - Cheema, M. A.

A2 - Zhang, W.

A2 - Chang, L.

PB - Springer Verlag

ER -