Clustering based approach to learning regular expressions over large alphabet for noisy unstructured text

Rohit Babbar, Nidhi Singh

Research output: Chapter or section in a book/report/conference proceeding; Chapter in a published conference proceeding

24 Citations (SciVal)

Abstract

Regular Expressions have been used for Information Extraction tasks in a variety of domains. The alphabet of the regular expression can either be the relevant tokens corresponding to the entity of interest or individual characters in which case the alphabet size becomes very large. The presence of noise in unstructured text documents along with increased alphabet size of the regular expressions poses a significant challenge for entity extraction tasks, and also for algorithmically learning complex regular expressions. In this paper, we present a novel algorithm for regular expression learning which clusters similar matches to obtain the corresponding regular expressions, identifies and eliminates noisy clusters, and finally uses weighted disjunction of the most promising candidate regular expressions to obtain the final expression. The experimental results demonstrate high value of both precision and recall of this final expression, which reinforces the applicability of our approach in entity extraction tasks of practical importance.
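
The three stages named in the abstract (cluster similar matches, drop noisy clusters, combine the survivors by weighted disjunction) can be illustrated with a toy sketch. This record does not give the paper's actual clustering method, noise criterion, or weighting scheme, so everything below is an assumption made for illustration: matches are grouped by a coarse character-class signature, clusters below a hypothetical `min_support` threshold are treated as noise, and each surviving expression is weighted by its share of the retained examples. The function names `char_class`, `signature`, and `learn_regex` are likewise invented here.

```python
import re
import string
from collections import Counter
from itertools import groupby


def char_class(ch):
    # Assumed generalization: ASCII digits and letters become classes,
    # every other character is kept as an escaped literal.
    if ch in string.digits:
        return r"\d"
    if ch in string.ascii_letters:
        return "[A-Za-z]"
    return re.escape(ch)


def signature(match):
    # Collapse each run of identical classes into "class+".
    classes = (char_class(c) for c in match)
    return "".join(cls + "+" for cls, _ in groupby(classes))


def learn_regex(matches, min_support=2):
    # Stage 1: cluster matches that share a signature (stand-in clustering).
    clusters = Counter(signature(m) for m in matches)
    # Stage 2: treat low-support clusters as noise (stand-in criterion).
    kept = {sig: n for sig, n in clusters.items() if n >= min_support}
    if not kept:
        raise ValueError("no cluster reached min_support")
    # Stage 3: weight each candidate by its share of retained examples
    # and join the candidates into one disjunction, heaviest first.
    total = sum(kept.values())
    weighted = sorted(((n / total, sig) for sig, n in kept.items()), reverse=True)
    pattern = "|".join(f"(?:{sig})" for _, sig in weighted)
    return pattern, weighted


# Hypothetical example matches; the last one stands in for a noisy extraction.
examples = ["555-1234", "555-9876", "(555) 123-4567", "(212) 555-0000", "5x5-12e4"]
pattern, weighted = learn_regex(examples)
```

With these examples, the two phone-number shapes each form a cluster of two and are kept, while the singleton cluster from the noisy string is discarded, so the final expression accepts unseen strings of either retained shape and rejects the noisy one. In this sketch the weights only order the alternatives; a scheme based on measured precision would be a natural alternative, but that is not something this record confirms.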

Original language: English
Title of host publication: AND '10: Proceedings of the fourth workshop on Analytics for noisy unstructured text data
Publisher: Association for Computing Machinery
Pages: 43-50
Number of pages: 8
ISBN (Print): 9781450303767
DOIs
Publication status: Published - 26 Oct 2010
Event: 4th Workshop on Analytics for Noisy Unstructured Text Data, AND'10 Co-located with 19th International Conference on Information and Knowledge Management, CIKM'10 - Toronto, ON, Canada
Duration: 26 Oct 2010 - 30 Oct 2010

Publication series

Name: CIKM '10: International Conference on Information and Knowledge Management
Publisher: Association for Computing Machinery

Conference

Conference: 4th Workshop on Analytics for Noisy Unstructured Text Data, AND'10 Co-located with 19th International Conference on Information and Knowledge Management, CIKM'10
Country/Territory: Canada
City: Toronto, ON
Period: 26/10/10 - 30/10/10

Keywords

  • Clustering in noisy text
  • Regular expression learning
  • Rule-based Information Extraction

ASJC Scopus subject areas

  • Decision Sciences (all)
  • Business, Management and Accounting (all)
