Property Value
rdfs:label
  • Information extraction
rdfs:comment
  • Information extraction (IE) is a type of information retrieval whose goal is to automatically extract structured information from unstructured machine-readable documents, generally human-language texts, by means of natural language processing (NLP).
  • Modern information extraction is generally credited to the Message Understanding Conference (MUC), which was established by DARPA. The field has evolved considerably, and present-day evaluations are conducted within the ACE (Automatic Content Extraction) evaluation program run by NIST.
owl:sameAs
dcterms:subject
dbkwik:itlaw/property/wikiPageUsesTemplate
abstract
  • Information extraction (IE) is a type of information retrieval whose goal is to automatically extract structured information from unstructured machine-readable documents, generally human-language texts, by means of natural language processing (NLP).
  • Modern information extraction is generally credited to the Message Understanding Conference (MUC), which was established by DARPA. The field has evolved considerably, and present-day evaluations are conducted within the ACE (Automatic Content Extraction) evaluation program run by NIST.
    * Information Extraction, Andrew McCallum, University of Massachusetts, ACM Queue, November 2005 [1]. A non-technical overview of information extraction that presents a five-step, high-level pipeline:
        1. Segmentation: essentially tokenization of the text.
        2. Classification: assign each segmented piece to one of several classes (Person, Organization, etc.).
        3. Association: essentially relationship detection.
        4. Normalization: different surface forms are mapped to the same representation (e.g., "3-3:30" and "1500-1530" possibly to the same ISO standard form).
        5. Deduplication: essentially coreference resolution.
      The article also discusses the higher-level approaches to information extraction and where each applies:
        1. Simple regular expressions: for simple extraction tasks.
        2. Rules: for more complex extraction tasks whose semantics are still clearly definable.
        3. Machine learning algorithms: subtle rules for very complex tasks.
      Uncertainty is an integral part of information extraction and needs to be managed appropriately. Training also needs to become easier, since producing large numbers of labeled examples is not easy; hence the interest in semi-supervised methods and interactive extraction.
      Note: A very lightweight tutorial and good light reading.
    * Automatic Information Extraction, Hamish Cunningham, University of Sheffield [2]. An extensive overview of different IE tasks along with nice examples. Starts with the claim that IE tasks face a specificity-complexity tradeoff; i.e., the more complex the IE task, the more specific the domain from which the information is extracted should be. Several applications are listed, such as marketing, PR, and media analysts. IE tasks are broadly divided into five categories:
        1. Named Entity
        2. Coreference Resolution
        3. Template Element Construction: constructs templates by adding descriptions to extracted information (primarily using Coreference Resolution).
        4. Template Relation Construction: essentially relationship identification.
        5. Scenario Extraction: ties together the elements and relations into a single complex event, such as "Person A was replaced by Person B on Date C at Organization Y".
      Summary: Modern information extraction pretty much started with MUC (Message Understanding Conference), and the tasks above were the basis for that conference. The newer information extraction evaluation is known as ACE (Automatic Content Extraction) [3] and is significantly harder; it treats tasks 1 and 2 as a single task, tasks 3 and 4 as a single task, and task 5 as a separate task.
      Note: A fairly vanilla tutorial that basically follows the MUC and ACE tasks in most of its description.
    * Introduction to Information Extraction, Doug Appelt and David Israel, SRI, Tutorial at IJCAI 1999 [4]. A very detailed introduction to information extraction from a rule-based linguist's perspective. After some introduction, the tutorial covers the two main approaches to building extraction systems: (a) the knowledge engineering approach and (b) automatically trainable systems. Several examples of the pros and cons of the two approaches are discussed. The components of an information extraction system are described as:
        1. Tokenization: straightforward.
        2. Morphological Processing: (a) identification of inflectional variants; (b) lexical lookup of tokens; (c) part-of-speech tagging; (d) names and structured items: identification of structured items such as dates, times, telephone numbers, and proper names. There is a fairly detailed discussion of both knowledge-based and machine learning approaches to named-entity extraction. A generic approach to building rule-based named-entity recognizers is given, along with a discussion of trainable named-entity taggers using HMMs and similar models. There are also pointers to several tools for building named-entity taggers.
        3. Syntactic Analysis: shallow and full parsing. Both knowledge-based and trainable parsers are discussed.
        4. Domain Analysis: (a) coreference analysis, with a detailed description of a coreference algorithm; (b) merging of partial results.
      Note: Incomplete. This is a very comprehensive tutorial but not very well organized. Towards the end (Domain Analysis) it becomes fairly opaque and several issues are mixed up.
    * Information Extraction and Integration: an Overview, William Cohen [5]. An excellent overview of statistical methods for information extraction. Gives detailed explanations of various algorithms, starting with simple ones such as a sliding-window approach used to capture sequence information. This is followed by more powerful Markovian algorithms in increasing order of power: Hidden Markov Models, Maximum Entropy Markov Models, and Conditional Random Fields.
    * Empirical Methods in Information Extraction, Claire Cardie, AI Magazine [6]. Another overview of several of the MUC tasks.
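The simple-regular-expression approach and the normalization and deduplication steps from the McCallum overview above can be illustrated with a small Python sketch. It is not taken from any of the listed tutorials. The pattern `TIME_RANGE`, the function names, and the rule that bare hours below 8 are read as afternoon times are all assumptions made for illustration. Under those assumptions, "3-3:30" and "1500-1530" normalize to the same ISO 8601-style interval and are then grouped as one item.

```python
import re

# Hypothetical pattern for time ranges such as "3-3:30" or "1500-1530":
# an hour (1-2 digits), optional minutes (with or without a colon),
# a hyphen, then a second time in the same shape.
TIME_RANGE = re.compile(
    r"\b(\d{1,2})(?::?(\d{2}))?\s*-\s*(\d{1,2})(?::?(\d{2}))?\b"
)


def to_24h(hour, minute, assume_pm_below=8):
    """Format one clock time as HH:MM.

    Assumption (not from the source): hours below `assume_pm_below`
    are afternoon times, so "3" is read as 15:00.
    """
    h = int(hour)
    m = int(minute) if minute else 0
    if h < assume_pm_below:
        h += 12
    return f"{h:02d}:{m:02d}"


def normalize_time_range(text):
    """Extract the first time range in `text` and return it as
    'HH:MM/HH:MM', or None if no range is found."""
    match = TIME_RANGE.search(text)
    if match is None:
        return None
    h1, m1, h2, m2 = match.groups()
    return f"{to_24h(h1, m1)}/{to_24h(h2, m2)}"


def deduplicate(mentions):
    """Group raw mentions by their normalized form."""
    groups = {}
    for mention in mentions:
        key = normalize_time_range(mention)
        if key is not None:
            groups.setdefault(key, []).append(mention)
    return groups


if __name__ == "__main__":
    print(deduplicate(["3-3:30", "1500-1530", "no time here"]))
```

A fixed heuristic like the afternoon rule is exactly the kind of uncertainty the overview says must be managed; a real system would more likely keep both readings with a confidence score than commit to one.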