TY - GEN
T1 - Clustering based approach to learning regular expressions over large alphabet for noisy unstructured text
AU - Babbar, Rohit
AU - Singh, Nidhi
PY - 2010/10/26
Y1 - 2010/10/26
N2 - Regular Expressions have been used for Information Extraction tasks in a variety of domains. The alphabet of the regular expression can either be the relevant tokens corresponding to the entity of interest or individual characters in which case the alphabet size becomes very large. The presence of noise in unstructured text documents along with increased alphabet size of the regular expressions poses a significant challenge for entity extraction tasks, and also for algorithmically learning complex regular expressions. In this paper, we present a novel algorithm for regular expression learning which clusters similar matches to obtain the corresponding regular expressions, identifies and eliminates noisy clusters, and finally uses weighted disjunction of the most promising candidate regular expressions to obtain the final expression. The experimental results demonstrate high value of both precision and recall of this final expression, which reinforces the applicability of our approach in entity extraction tasks of practical importance.
AB - Regular Expressions have been used for Information Extraction tasks in a variety of domains. The alphabet of the regular expression can either be the relevant tokens corresponding to the entity of interest or individual characters in which case the alphabet size becomes very large. The presence of noise in unstructured text documents along with increased alphabet size of the regular expressions poses a significant challenge for entity extraction tasks, and also for algorithmically learning complex regular expressions. In this paper, we present a novel algorithm for regular expression learning which clusters similar matches to obtain the corresponding regular expressions, identifies and eliminates noisy clusters, and finally uses weighted disjunction of the most promising candidate regular expressions to obtain the final expression. The experimental results demonstrate high value of both precision and recall of this final expression, which reinforces the applicability of our approach in entity extraction tasks of practical importance.
KW - Clustering in noisy text
KW - Regular expression learning
KW - Rule-based Information Extraction
UR - http://www.scopus.com/inward/record.url?scp=78651287844&partnerID=8YFLogxK
U2 - 10.1145/1871840.1871848
DO - 10.1145/1871840.1871848
M3 - Chapter in a published conference proceeding
AN - SCOPUS:78651287844
SN - 9781450303767
T3 - CIKM '10: International Conference on Information and Knowledge Management
SP - 43
EP - 50
BT - AND '10: Proceedings of the fourth workshop on Analytics for noisy unstructured text data
PB - Association for Computing Machinery
T2 - 4th Workshop on Analytics for Noisy Unstructured Text Data, AND'10 Co-located with 19th International Conference on Information and Knowledge Management, CIKM'10
Y2 - 26 October 2010 through 30 October 2010
ER -