A Memory-Optimal Many-To-Many Semi-Stream Join

Asif Naeem, Gerald Weber, Christof Lutteroth

Research output: Contribution to journalArticle

1 Downloads (Pure)

Abstract

Semi-stream join algorithms join a fast stream input with a disk-based master data relation. A common class of these algorithms is derived from hash joins: they use the stream as build input for a main hash table, and also include a cache for frequent master data. The composition of the cache is very important for performance; however, the decision of which master data to cache has so far been solely based on heuristics. We present the first formal criterion, a cache inequality that leads to a provably optimal composition of the cache in a semi-stream many-to-many equijoin algorithm. We propose a novel algorithm, Semi-Stream Balanced Join (SSBJ), which exploits this cache inequality to achieve a given service rate with a provably minimal amount of memory for all stream distributions. We present a cost model for SSBJ and compare its service rate empirically and analytically with other related approaches.

Original languageEnglish
Number of pages27
JournalDistributed and Parallel Databases
Early online date31 Aug 2018
Publication statusE-pub ahead of print - 31 Aug 2018

Fingerprint

Data storage equipment
Chemical analysis
Costs

Cite this

A Memory-Optimal Many-To-Many Semi-Stream Join. / Naeem, Asif; Weber, Gerald; Lutteroth, Christof.

In: Distributed and Parallel Databases, 31.08.2018.

Research output: Contribution to journalArticle

@article{4a5cea3f364b4bab8083cc637422cd39,
title = "A Memory-Optimal Many-To-Many Semi-Stream Join",
abstract = "Semi-stream join algorithms join a fast stream input with a disk-based master data relation. A common class of these algorithms is derived from hash joins: they use the stream as build input for a main hash table, and also include a cache for frequent master data. The composition of the cache is very important for performance; however, the decision of which master data to cache has so far been solely based on heuristics. We present the first formal criterion, a cache inequality that leads to a provably optimal composition of the cache in a semi-stream many-to-many equijoin algorithm. We propose a novel algorithm, Semi-Stream Balanced Join (SSBJ), which exploits this cache inequality to achieve a given service rate with a provably minimal amount of memory for all stream distributions. We present a cost model for SSBJ and compare its service rate empirically and analytically with other related approaches.",
author = "Asif Naeem and Gerald Weber and Christof Lutteroth",
year = "2018",
month = "8",
day = "31",
language = "English",
journal = "Distributed and Parallel Databases",
issn = "0926-8782",
publisher = "Springer Verlag",

}

TY - JOUR

T1 - A Memory-Optimal Many-To-Many Semi-Stream Join

AU - Naeem, Asif

AU - Weber, Gerald

AU - Lutteroth, Christof

PY - 2018/8/31

Y1 - 2018/8/31

N2 - Semi-stream join algorithms join a fast stream input with a disk-based master data relation. A common class of these algorithms is derived from hash joins: they use the stream as build input for a main hash table, and also include a cache for frequent master data. The composition of the cache is very important for performance; however, the decision of which master data to cache has so far been solely based on heuristics. We present the first formal criterion, a cache inequality that leads to a provably optimal composition of the cache in a semi-stream many-to-many equijoin algorithm. We propose a novel algorithm, Semi-Stream Balanced Join (SSBJ), which exploits this cache inequality to achieve a given service rate with a provably minimal amount of memory for all stream distributions. We present a cost model for SSBJ and compare its service rate empirically and analytically with other related approaches.

AB - Semi-stream join algorithms join a fast stream input with a disk-based master data relation. A common class of these algorithms is derived from hash joins: they use the stream as build input for a main hash table, and also include a cache for frequent master data. The composition of the cache is very important for performance; however, the decision of which master data to cache has so far been solely based on heuristics. We present the first formal criterion, a cache inequality that leads to a provably optimal composition of the cache in a semi-stream many-to-many equijoin algorithm. We propose a novel algorithm, Semi-Stream Balanced Join (SSBJ), which exploits this cache inequality to achieve a given service rate with a provably minimal amount of memory for all stream distributions. We present a cost model for SSBJ and compare its service rate empirically and analytically with other related approaches.

M3 - Article

JO - Distributed and Parallel Databases

JF - Distributed and Parallel Databases

SN - 0926-8782

ER -