A Memory-Optimal Many-To-Many Semi-Stream Join

Asif Naeem, Gerald Weber, Christof Lutteroth

Research output: Contribution to journalArticle

8 Downloads (Pure)

Abstract

Semi-stream join algorithms join a fast stream input with a disk-based master data relation. A common class of these algorithms is derived from hash joins: they use the stream as build input for a main hash table, and also include a cache for frequent master data. The composition of the cache is very important for performance; however, the decision of which master data to cache has so far been solely based on heuristics. We present the first formal criterion, a cache inequality that leads to a provably optimal composition of the cache in a semi-stream many-to-many equijoin algorithm. We propose a novel algorithm, Semi-Stream Balanced Join (SSBJ), which exploits this cache inequality to achieve a given service rate with a provably minimal amount of memory for all stream distributions. We present a cost model for SSBJ and compare its service rate empirically and analytically with other related approaches.

Original languageEnglish
Pages (from-to)1-27
Number of pages27
JournalDistributed and Parallel Databases
Early online date31 Aug 2018
DOIs
Publication statusE-pub ahead of print - 31 Aug 2018

Fingerprint

Data storage equipment
Chemical analysis
Costs

Keywords

  • Cache optimization
  • Many-to-many semi-stream join
  • Performance evaluation

ASJC Scopus subject areas

  • Software
  • Information Systems
  • Hardware and Architecture
  • Information Systems and Management

Cite this

A Memory-Optimal Many-To-Many Semi-Stream Join. / Naeem, Asif; Weber, Gerald; Lutteroth, Christof.

In: Distributed and Parallel Databases, 31.08.2018, p. 1-27.

Research output: Contribution to journalArticle

@article{4a5cea3f364b4bab8083cc637422cd39,
title = "A Memory-Optimal Many-To-Many Semi-Stream Join",
abstract = "Semi-stream join algorithms join a fast stream input with a disk-based master data relation. A common class of these algorithms is derived from hash joins: they use the stream as build input for a main hash table, and also include a cache for frequent master data. The composition of the cache is very important for performance; however, the decision of which master data to cache has so far been solely based on heuristics. We present the first formal criterion, a cache inequality that leads to a provably optimal composition of the cache in a semi-stream many-to-many equijoin algorithm. We propose a novel algorithm, Semi-Stream Balanced Join (SSBJ), which exploits this cache inequality to achieve a given service rate with a provably minimal amount of memory for all stream distributions. We present a cost model for SSBJ and compare its service rate empirically and analytically with other related approaches.",
keywords = "Cache optimization, Many-to-many semi-stream join, Performance evaluation",
author = "Asif Naeem and Gerald Weber and Christof Lutteroth",
year = "2018",
month = "8",
day = "31",
doi = "10.1007/s10619-018-7247-z",
language = "English",
pages = "1--27",
journal = "Distributed and Parallel Databases",
issn = "0926-8782",
publisher = "Springer Verlag",

}

TY - JOUR

T1 - A Memory-Optimal Many-To-Many Semi-Stream Join

AU - Naeem, Asif

AU - Weber, Gerald

AU - Lutteroth, Christof

PY - 2018/8/31

Y1 - 2018/8/31

N2 - Semi-stream join algorithms join a fast stream input with a disk-based master data relation. A common class of these algorithms is derived from hash joins: they use the stream as build input for a main hash table, and also include a cache for frequent master data. The composition of the cache is very important for performance; however, the decision of which master data to cache has so far been solely based on heuristics. We present the first formal criterion, a cache inequality that leads to a provably optimal composition of the cache in a semi-stream many-to-many equijoin algorithm. We propose a novel algorithm, Semi-Stream Balanced Join (SSBJ), which exploits this cache inequality to achieve a given service rate with a provably minimal amount of memory for all stream distributions. We present a cost model for SSBJ and compare its service rate empirically and analytically with other related approaches.

AB - Semi-stream join algorithms join a fast stream input with a disk-based master data relation. A common class of these algorithms is derived from hash joins: they use the stream as build input for a main hash table, and also include a cache for frequent master data. The composition of the cache is very important for performance; however, the decision of which master data to cache has so far been solely based on heuristics. We present the first formal criterion, a cache inequality that leads to a provably optimal composition of the cache in a semi-stream many-to-many equijoin algorithm. We propose a novel algorithm, Semi-Stream Balanced Join (SSBJ), which exploits this cache inequality to achieve a given service rate with a provably minimal amount of memory for all stream distributions. We present a cost model for SSBJ and compare its service rate empirically and analytically with other related approaches.

KW - Cache optimization

KW - Many-to-many semi-stream join

KW - Performance evaluation

UR - http://www.scopus.com/inward/record.url?scp=85053382445&partnerID=8YFLogxK

U2 - 10.1007/s10619-018-7247-z

DO - 10.1007/s10619-018-7247-z

M3 - Article

SP - 1

EP - 27

JO - Distributed and Parallel Databases

JF - Distributed and Parallel Databases

SN - 0926-8782

ER -