Skewed distributions in semi-stream joins: how much can caching help?

M. Asif Naeem, Gillian Dobbie, Christof Lutteroth, Gerald Weber

Research output: Contribution to journal › Article › peer-review

5 Citations (SciVal)

Abstract

Semi-stream join algorithms join a fast data stream with a disk-based relation. This is important, for example, in real-time data warehousing, where a stream of transactions is joined with master data before it is loaded into a data warehouse. In many important scenarios, the stream input has a skewed distribution, which makes certain performance optimizations possible. We propose two such optimization techniques: (1) a caching technique for frequently used master data and (2) a technique for selective load shedding of stream tuples. The caching technique is fine-grained, operating at the tuple level. Furthermore, it is generic in the sense that it can be applied to different semi-stream join algorithms to deal with data skew. We analyze it by combining it with various well-known semi-stream joins, and show that it improves the service rate by more than 40% for typical data with skewed distributions. The load shedding technique sheds the fraction of the stream that is most expensive to join. In contrast to existing approaches, the service rate improves under load shedding. We present experimental data showing significant improvements compared with related approaches and perform a sensitivity analysis for various internal parameters.
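
The idea of a tuple-level cache placed in front of a semi-stream join can be illustrated with a minimal sketch. This is not the authors' implementation: the class name `FrontStageCache`, the `threshold` admission rule, the `fallback_join` callback (standing in for whichever underlying semi-stream join is used), and the sample data are all assumptions made for illustration. The sketch only shows why skew helps: once a frequently occurring key is cached, later stream tuples with that key no longer reach the slower disk-based path.

```python
from collections import Counter


class FrontStageCache:
    """Hypothetical tuple-level cache in front of a semi-stream join (illustrative only)."""

    def __init__(self, capacity, threshold):
        self.capacity = capacity      # maximum number of cached master tuples
        self.threshold = threshold    # misses required before a key is admitted
        self.cache = {}               # join key -> master tuple
        self.miss_counts = Counter()  # how often each uncached key has been seen

    def join(self, stream_tuple, key, fallback_join):
        """Return ((stream_tuple, master_tuple), served_from_cache)."""
        master = self.cache.get(key)
        if master is not None:
            return (stream_tuple, master), True
        # Cache miss: defer to the underlying join (assumed to access the disk relation).
        master = fallback_join(key)
        self.miss_counts[key] += 1
        if (master is not None
                and self.miss_counts[key] >= self.threshold
                and len(self.cache) < self.capacity):
            self.cache[key] = master
        return (stream_tuple, master), False


def run_demo():
    # Made-up master data and a small skewed stream in which key 0 dominates.
    master_data = {k: f"master-{k}" for k in range(5)}
    stream_keys = [0, 0, 1, 0, 2, 0, 0, 1, 0, 3]
    disk_lookups = 0

    def fallback_join(key):
        # Stand-in for the disk-based join path; counts simulated disk accesses.
        nonlocal disk_lookups
        disk_lookups += 1
        return master_data.get(key)

    cache = FrontStageCache(capacity=2, threshold=2)
    hits = 0
    for i, key in enumerate(stream_keys):
        _, hit = cache.join(("txn", i), key, fallback_join)
        hits += hit
    return hits, disk_lookups


if __name__ == "__main__":
    hits, disk_lookups = run_demo()
    print(f"cache hits: {hits}, disk lookups: {disk_lookups}")
```

In this toy run, every stream tuple that is not served from the cache costs one simulated disk lookup, so the more skewed the key distribution, the larger the share of tuples that bypass the disk-based join. The admission threshold and fixed capacity are one simple policy among many and are not taken from the article.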

Original language: English
Pages (from-to): 63-74
Number of pages: 12
Journal: Information Systems
Volume: 64
Early online date: 28 Sept 2016
Publication status: Published - Mar 2017

Keywords

  • Front-stage cache
  • Join
  • Performance optimization
  • Semi-stream processing
