A Memory-Optimal Many-To-Many Semi-Stream Join

Asif Naeem, Gerald Weber, Christof Lutteroth

Research output: Contribution to journalArticle

Abstract

Semi-stream join algorithms join a fast stream input with a disk-based master data relation. A common class of these algorithms is derived from hash joins: they use the stream as build input for a main hash table, and also include a cache for frequent master data. The composition of the cache is very important for performance; however, the decision of which master data to cache has so far been solely based on heuristics. We present the first formal criterion, a cache inequality that leads to a provably optimal composition of the cache in a semi-stream many-to-many equijoin algorithm. We propose a novel algorithm, Semi-Stream Balanced Join (SSBJ), which exploits this cache inequality to achieve a given service rate with a provably minimal amount of memory for all stream distributions. We present a cost model for SSBJ and compare its service rate empirically and analytically with other related approaches.

LanguageEnglish
Number of pages27
JournalDistributed and Parallel Databases
Early online date31 Aug 2018
StatusE-pub ahead of print - 31 Aug 2018

Fingerprint

Data storage equipment
Chemical analysis
Costs

Cite this

A Memory-Optimal Many-To-Many Semi-Stream Join. / Naeem, Asif; Weber, Gerald; Lutteroth, Christof.

In: Distributed and Parallel Databases, 31.08.2018.

Research output: Contribution to journalArticle

@article{4a5cea3f364b4bab8083cc637422cd39,
title = "A Memory-Optimal Many-To-Many Semi-Stream Join",
abstract = "Semi-stream join algorithms join a fast stream input with a disk-based master data relation. A common class of these algorithms is derived from hash joins: they use the stream as build input for a main hash table, and also include a cache for frequent master data. The composition of the cache is very important for performance; however, the decision of which master data to cache has so far been solely based on heuristics. We present the first formal criterion, a cache inequality that leads to a provably optimal composition of the cache in a semi-stream many-to-many equijoin algorithm. We propose a novel algorithm, Semi-Stream Balanced Join (SSBJ), which exploits this cache inequality to achieve a given service rate with a provably minimal amount of memory for all stream distributions. We present a cost model for SSBJ and compare its service rate empirically and analytically with other related approaches.",
author = "Asif Naeem and Gerald Weber and Christof Lutteroth",
year = "2018",
month = "8",
day = "31",
language = "English",
journal = "Distributed and Parallel Databases",
issn = "0926-8782",
publisher = "Springer Verlag",

}

TY - JOUR

T1 - A Memory-Optimal Many-To-Many Semi-Stream Join

AU - Naeem, Asif

AU - Weber, Gerald

AU - Lutteroth, Christof

PY - 2018/8/31

Y1 - 2018/8/31

N2 - Semi-stream join algorithms join a fast stream input with a disk-based master data relation. A common class of these algorithms is derived from hash joins: they use the stream as build input for a main hash table, and also include a cache for frequent master data. The composition of the cache is very important for performance; however, the decision of which master data to cache has so far been solely based on heuristics. We present the first formal criterion, a cache inequality that leads to a provably optimal composition of the cache in a semi-stream many-to-many equijoin algorithm. We propose a novel algorithm, Semi-Stream Balanced Join (SSBJ), which exploits this cache inequality to achieve a given service rate with a provably minimal amount of memory for all stream distributions. We present a cost model for SSBJ and compare its service rate empirically and analytically with other related approaches.

AB - Semi-stream join algorithms join a fast stream input with a disk-based master data relation. A common class of these algorithms is derived from hash joins: they use the stream as build input for a main hash table, and also include a cache for frequent master data. The composition of the cache is very important for performance; however, the decision of which master data to cache has so far been solely based on heuristics. We present the first formal criterion, a cache inequality that leads to a provably optimal composition of the cache in a semi-stream many-to-many equijoin algorithm. We propose a novel algorithm, Semi-Stream Balanced Join (SSBJ), which exploits this cache inequality to achieve a given service rate with a provably minimal amount of memory for all stream distributions. We present a cost model for SSBJ and compare its service rate empirically and analytically with other related approaches.

M3 - Article

JO - Distributed and Parallel Databases

T2 - Distributed and Parallel Databases

JF - Distributed and Parallel Databases

SN - 0926-8782

ER -