Skewed distributions in semi-stream joins: how much can caching help?

M. Asif Naeem, Gillian Dobbie, Christof Lutteroth, Gerald Weber

Research output: Contribution to journal, Article

1 Citation (Scopus)

Abstract

Semi-stream join algorithms join a fast data stream with a disk-based relation. This is important, for example, in real-time data warehousing where a stream of transactions is joined with master data before loading it into a data warehouse. In many important scenarios, the stream input has a skewed distribution, which makes certain performance optimizations possible. We propose two such optimization techniques: (1) a caching technique for frequently used master data and (2) a technique for selective load shedding of stream tuples. The caching technique is fine-grained, operating on a tuple-level. Furthermore, it is generic in the sense that it can be applied to different semi-stream join algorithms to deal with data skew. We analyze it by combining it with various well-known semi-stream joins, and show that it improves the service rate by more than 40% for typical data with skewed distributions. The load shedding technique sheds the fraction of the stream that is most expensive to join. In contrast to existing approaches, the service rate improves under load shedding. We present experimental data showing significant improvements as compared to related approaches and perform a sensitivity analysis for various internal parameters.
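The caching idea in the abstract can be illustrated with a toy example. The sketch below is not the authors' algorithm: it assumes a plain in-memory dictionary stands in for the disk-based master data, and the function name `semi_stream_join`, the parameters `cache_size` and `threshold`, and the simple frequency-count admission rule are all hypothetical choices made for illustration. It only shows why a skewed stream benefits from a small tuple-level cache placed in front of the join: frequent keys are answered from the cache and no longer cause a lookup in the disk-based relation.

```python
from collections import Counter


def semi_stream_join(stream, master, cache_size=2, threshold=2):
    """Join (key, payload) stream tuples with `master`, probing a small cache first.

    `master` is a dict standing in for the disk-based relation (an assumption
    made for this sketch). Returns the joined tuples and the number of
    simulated disk lookups.
    """
    cache = {}        # front-stage cache: join key -> master tuple
    freq = Counter()  # successful disk lookups per key
    disk_reads = 0
    joined = []

    for key, payload in stream:
        if key in cache:
            # Cache hit: the stream tuple is joined without touching the disk.
            joined.append((key, payload, cache[key]))
            continue

        # Cache miss: fall through to the (simulated) disk-based relation.
        disk_reads += 1
        row = master.get(key)
        if row is None:
            continue  # no matching master tuple, nothing to emit
        joined.append((key, payload, row))

        # Hypothetical admission rule: cache a key once it has been looked up
        # `threshold` times, as long as the cache still has room.
        freq[key] += 1
        if freq[key] >= threshold and len(cache) < cache_size:
            cache[key] = row

    return joined, disk_reads


# A skewed stream: key 1 dominates.
master = {1: "A", 2: "B", 3: "C"}
stream = [(1, "t0"), (1, "t1"), (1, "t2"), (2, "t3"),
          (1, "t4"), (3, "t5"), (1, "t6")]
joined, disk_reads = semi_stream_join(stream, master)
print(len(joined), disk_reads)
```

In this toy run the dominant key is cached after its second lookup, so the remaining occurrences of that key are served from memory. A real front-stage cache would also need an eviction policy and would sit in front of an actual semi-stream join operator rather than a dictionary lookup.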

Original language: English
Pages (from-to): 63-74
Number of pages: 12
Journal: Information Systems Journal
Volume: 64
Early online date: 28 Sep 2016
DOIs: 10.1016/j.is.2016.09.007
Publication status: Published - Mar 2017

Fingerprint

Data warehouses
Sensitivity analysis

Keywords

  • Front-stage cache
  • Join
  • Performance optimization
  • Semi-stream processing

Cite this

Skewed distributions in semi-stream joins: how much can caching help? / Naeem, M. Asif; Dobbie, Gillian; Lutteroth, Christof; Weber, Gerald.

In: Information Systems Journal, Vol. 64, 03.2017, p. 63-74.

@article{90a1dfd91b78454e945be93fe95a0a37,
title = "Skewed distributions in semi-stream joins: how much can caching help?",
abstract = "Semi-stream join algorithms join a fast data stream with a disk-based relation. This is important, for example, in real-time data warehousing where a stream of transactions is joined with master data before loading it into a data warehouse. In many important scenarios, the stream input has a skewed distribution, which makes certain performance optimizations possible. We propose two such optimization techniques: (1) a caching technique for frequently used master data and (2) a technique for selective load shedding of stream tuples. The caching technique is fine-grained, operating on a tuple-level. Furthermore, it is generic in the sense that it can be applied to different semi-stream join algorithms to deal with data skew. We analyze it by combining it with various well-known semi-stream joins, and show that it improves the service rate by more than 40{\%} for typical data with skewed distributions. The load shedding technique sheds the fraction of the stream that is most expensive to join. In contrast to existing approaches, the service rate improves under load shedding. We present experimental data showing significant improvements as compared to related approaches and perform a sensitivity analysis for various internal parameters.",
keywords = "Front-stage cache, Join, Performance optimization, Semi-stream processing",
author = "Naeem, {M. Asif} and Gillian Dobbie and Christof Lutteroth and Gerald Weber",
year = "2017",
month = "3",
doi = "10.1016/j.is.2016.09.007",
language = "English",
volume = "64",
pages = "63--74",
journal = "Information Systems Journal",
issn = "1350-1917",
publisher = "Wiley-Blackwell",

}

TY - JOUR

T1 - Skewed distributions in semi-stream joins

T2 - how much can caching help?

AU - Naeem, M. Asif

AU - Dobbie, Gillian

AU - Lutteroth, Christof

AU - Weber, Gerald

PY - 2017/3

Y1 - 2017/3

N2 - Semi-stream join algorithms join a fast data stream with a disk-based relation. This is important, for example, in real-time data warehousing where a stream of transactions is joined with master data before loading it into a data warehouse. In many important scenarios, the stream input has a skewed distribution, which makes certain performance optimizations possible. We propose two such optimization techniques: (1) a caching technique for frequently used master data and (2) a technique for selective load shedding of stream tuples. The caching technique is fine-grained, operating on a tuple-level. Furthermore, it is generic in the sense that it can be applied to different semi-stream join algorithms to deal with data skew. We analyze it by combining it with various well-known semi-stream joins, and show that it improves the service rate by more than 40% for typical data with skewed distributions. The load shedding technique sheds the fraction of the stream that is most expensive to join. In contrast to existing approaches, the service rate improves under load shedding. We present experimental data showing significant improvements as compared to related approaches and perform a sensitivity analysis for various internal parameters.

KW - Front-stage cache

KW - Join

KW - Performance optimization

KW - Semi-stream processing

UR - http://www.scopus.com/inward/record.url?scp=85006827123&partnerID=8YFLogxK

UR - http://dx.doi.org/10.1016/j.is.2016.09.007

U2 - 10.1016/j.is.2016.09.007

DO - 10.1016/j.is.2016.09.007

M3 - Article

VL - 64

SP - 63

EP - 74

JO - Information Systems Journal

JF - Information Systems Journal

SN - 1350-1917

ER -