TY - GEN
T1 - Re-ranking approach to classification in large-scale power-law distributed category systems
AU - Babbar, Rohit
AU - Partalas, Ioannis
AU - Gaussier, Eric
AU - Amini, Massih Reza
PY - 2014/7/3
Y1 - 2014/7/3
N2 - For large-scale category systems, such as Directory Mozilla, which consist of tens of thousand categories, it has been empirically verified in earlier studies that the distribution of documents among categories can be modeled as a power- law distribution. It implies that a significant fraction of categories, referred to as rare categories, have very few documents assigned to them. This characteristic of the data makes it harder for learning algorithms to learn effective decision boundaries which can correctly detect such categories in the test set. In this work, we exploit the distribution of documents among categories to (i) derive an upper bound on the accuracy of any classifier, and (ii) propose a ranking-based algorithm which aims to maximize this upper bound. The empirical evaluation on publicly available large-scale datasets demonstrate that the proposed method not only achieves higher accuracy but also much higher coverage of rare categories as compared to state-of-the-art methods.
AB - For large-scale category systems, such as Directory Mozilla, which consist of tens of thousand categories, it has been empirically verified in earlier studies that the distribution of documents among categories can be modeled as a power- law distribution. It implies that a significant fraction of categories, referred to as rare categories, have very few documents assigned to them. This characteristic of the data makes it harder for learning algorithms to learn effective decision boundaries which can correctly detect such categories in the test set. In this work, we exploit the distribution of documents among categories to (i) derive an upper bound on the accuracy of any classifier, and (ii) propose a ranking-based algorithm which aims to maximize this upper bound. The empirical evaluation on publicly available large-scale datasets demonstrate that the proposed method not only achieves higher accuracy but also much higher coverage of rare categories as compared to state-of-the-art methods.
KW - Large-scale classification
KW - Power-law distribution
UR - http://www.scopus.com/inward/record.url?scp=84904566007&partnerID=8YFLogxK
U2 - 10.1145/2600428.2609509
DO - 10.1145/2600428.2609509
M3 - Chapter in a published conference proceeding
AN - SCOPUS:84904566007
SN - 9781450322591
T3 - SIGIR 2014 - Proceedings of the 37th International ACM SIGIR Conference on Research and Development in Information Retrieval
SP - 1059
EP - 1062
BT - SIGIR 2014 - Proceedings of the 37th International ACM SIGIR Conference on Research and Development in Information Retrieval
PB - Association for Computing Machinery
T2 - 37th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2014
Y2 - 6 July 2014 through 11 July 2014
ER -