TY - GEN
T1 - On empirical tradeoffs in large scale hierarchical classification
AU - Babbar, Rohit
AU - Partalas, Ioannis
AU - Gaussier, Eric
AU - Amblard, Cecile
N1 - Funding Information:
This work was supported by the French National Research Agency(ANR).
PY - 2012
Y1 - 2012
N2 - While multi-class categorization of documents has been of research interest for over a decade, relatively fewer approaches have been proposed for large scale taxonomies in which the number of classes range from hundreds of thousand as in Directory Mozilla to over a million in Wikipedia. As a result of ever increasing number of text documents and images from various sources, there is an immense need for automatic classification of documents in such large hierarchies. In this paper, we analyze the tradeoffs between the important characteristics of different classifiers employed in the top down fashion. The properties for relative comparison of these classifiers include, (i) accuracy on test instance, (ii) training time (iii) size of the model and (iv) test time required for prediction. Our analysis is motivated by the well known error bounds from learning theory, which is also further reinforced by the empirical observations on the publicly available data from the Large Scale Hierarchical Text Classification Challenge. We show that by exploiting the data heterogenity across the large scale hierarchies, one can build an overall classification system which is approximately 4 times faster for prediction, 3 times faster to train, while sacrificing only 1% point in accuracy.
AB - While multi-class categorization of documents has been of research interest for over a decade, relatively fewer approaches have been proposed for large scale taxonomies in which the number of classes range from hundreds of thousand as in Directory Mozilla to over a million in Wikipedia. As a result of ever increasing number of text documents and images from various sources, there is an immense need for automatic classification of documents in such large hierarchies. In this paper, we analyze the tradeoffs between the important characteristics of different classifiers employed in the top down fashion. The properties for relative comparison of these classifiers include, (i) accuracy on test instance, (ii) training time (iii) size of the model and (iv) test time required for prediction. Our analysis is motivated by the well known error bounds from learning theory, which is also further reinforced by the empirical observations on the publicly available data from the Large Scale Hierarchical Text Classification Challenge. We show that by exploiting the data heterogenity across the large scale hierarchies, one can build an overall classification system which is approximately 4 times faster for prediction, 3 times faster to train, while sacrificing only 1% point in accuracy.
KW - empirical tradeoffs
KW - hierarchical classification
UR - http://www.scopus.com/inward/record.url?scp=84871099561&partnerID=8YFLogxK
U2 - 10.1145/2396761.2398625
DO - 10.1145/2396761.2398625
M3 - Chapter in a published conference proceeding
AN - SCOPUS:84871099561
SN - 9781450311564
T3 - ACM International Conference Proceeding Series
SP - 2299
EP - 2302
BT - CIKM 2012 - Proceedings of the 21st ACM International Conference on Information and Knowledge Management
T2 - 21st ACM International Conference on Information and Knowledge Management, CIKM 2012
Y2 - 29 October 2012 through 2 November 2012
ER -