Identifying Product Entities in text with Conditional Random Fields
Summary
With a growing portion of the web dedicated to the discussion, review and retail of consumer products, it is increasingly relevant to develop methods for the automated extraction of Product Entities from user-generated text. In addition, it is important that the extraction models provide feedback about the quality of their output, in the form of a confidence score associated with each entity.
Conditional Random Fields, which are designed as a discriminative solution for structured output prediction, have been shown to be successful for the related problem of Named Entity Recognition. Furthermore, their probabilistic nature provides a natural way to obtain a confidence score.
In this thesis, the optimal application of Conditional Random Fields to the specific problem of identifying Product Entities is investigated. A set of experiments is designed and executed to compare different choices of feature sets. The results show that Conditional Random Fields perform better than heuristic models for this task.
In addition, several existing methods for confidence scoring are experimentally compared, and an optimized algorithm to calculate the exact confidence estimate (known as the Constrained Forward Backward estimate) is introduced. The experiments show that the more heuristic Gamma Product method performs comparably to the Constrained Forward Backward method, and thus provides an alternative confidence estimate to use in practice.
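To make the idea of a constrained confidence estimate concrete, the following is a minimal sketch of the textbook formulation for a linear-chain model: the confidence of an entity is the total weight of all label sequences that agree with the entity's labels, divided by the total weight of all label sequences. It is not the optimized algorithm introduced in this thesis. The function names (`log_partition`, `segment_confidence`), the score layout (`emit[t][y]`, `trans[y_prev][y]`) and the representation of constraints as a position-to-label dictionary are assumptions made for illustration only.

```python
import math

NEG_INF = float("-inf")


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a list of log-scores."""
    m = max(values)
    if m == NEG_INF:
        return NEG_INF
    return m + math.log(sum(math.exp(v - m) for v in values))


def log_partition(emit, trans, constraints=None):
    """Forward pass in log space over a linear chain.

    emit[t][y]       : (hypothetical) score of label y at position t
    trans[y_prev][y] : (hypothetical) score of moving from y_prev to y
    constraints      : optional dict {position: required label}; paths that
                       violate a constraint are given zero weight.
    """
    constraints = constraints or {}
    n_labels = len(trans)

    def allowed(t, y):
        return t not in constraints or constraints[t] == y

    # Initialise with the first position.
    alpha = [emit[0][y] if allowed(0, y) else NEG_INF for y in range(n_labels)]

    # Propagate forward, masking out labels that violate a constraint.
    for t in range(1, len(emit)):
        alpha = [
            logsumexp([alpha[p] + trans[p][y] for p in range(n_labels)]) + emit[t][y]
            if allowed(t, y)
            else NEG_INF
            for y in range(n_labels)
        ]
    return logsumexp(alpha)


def segment_confidence(emit, trans, constraints):
    """Ratio of the constrained to the unconstrained partition function."""
    constrained = log_partition(emit, trans, constraints)
    unconstrained = log_partition(emit, trans)
    return math.exp(constrained - unconstrained)
```

With labels indexed, for example, as 0 = outside, 1 = begin and 2 = inside, a call such as `segment_confidence(emit, trans, {1: 1, 2: 2})` would give the probability mass of all labelings that mark positions 1 and 2 as a two-token entity. Each call runs two forward passes over the sentence.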
Finally, the F1 score for Product Entity Recognition is improved further by combining models, with the confidence scores used for voting.