Automatic acquisition of sense-tagged corpora

The knowledge acquisition bottleneck is perhaps the major impediment to solving the word sense disambiguation (WSD) problem. Unsupervised learning methods rely on knowledge about word senses, which is barely formulated in dictionaries and lexical databases. Supervised learning methods depend heavily on the existence of manually annotated examples for every word sense, a requisite that can so far be met only for a handful of words for testing purposes, as it is done in the Senseval exercises.

Existing methods

Therefore, one of the most promising trends in WSD research is using the largest corpus ever accessible, the World Wide Web, to acquire lexical information automatically.[1] WSD has been traditionally understood as an intermediate language engineering technology which could improve applications such as information retrieval (IR). In this case, however, the reverse is also true: Web search engines implement simple and robust IR techniques that can be successfully used when mining the Web for information to be employed in WSD.

The most direct way of using the Web (and other corpora) to enhance WSD performance is the automatic acquisition of sense-tagged corpora, the fundamental resource to feed supervised WSD algorithms. Although this is far from being commonplace in the WSD literature, a number of different and effective strategies to achieve this goal have already been proposed. Some of these strategies are:

Summary

Optimistic results

The automatic extraction of examples to train supervised learning algorithms reviewed has been, by far, the best explored approach to mine the web for word sense disambiguation. Some results are certainly encouraging:

Difficulties

There are, however, several open research issues related to the use of Web examples in WSD:

Future

Besides automatic acquisition of examples from the Web, there are some other WSD experiments that have profited from the Web:

It is clear, however, that most research opportunities remain largely unexplored. For instance, little is known about how to use lexical information extracted from the Web in knowledge-based WSD systems; and it is also hard to find systems that use Web-mined parallel corpora for WSD, even though there are already efficient algorithms that use parallel corpora in WSD.

References

  1. Kilgarriff, A.; G. Grefenstette. 2003. Introduction to the special issue on the Web as corpus. Computational Linguistics 29(3)
  2. Mihalcea, Rada. 2002. Bootstrapping large sense tagged corpora. Proceedings of the Language Resources and Evaluation Conference (LREC), Las Palmas, Spain.
  3. Santamaría, Celina, Julio Gonzalo & Felisa Verdejo. 2003. Automatic association of Web directories to word senses. Computational Linguistics, 29(3): 485–502.
  4. Agirre, Eneko & David Martínez. 2004. Unsupervised WSD based on automatically retrieved examples: The importance of bias. Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Barcelona, Spain, 25–33.
  5. Mihalcea, Rada. 2002a. Word sense disambiguation with pattern learning and automatic feature selection. Natural Language Engineering, 8(4): 348–358.
  6. Agirre, Eneko & David Martínez. 2000. Exploring automatic word sense disambiguation with decision lists and the Web. Proceedings of the COLING Workshop on Semantic Annotation and Intelligent Annotation, Luxembourg, 11–19.
  7. Agirre, Eneko & David Martínez. 2004. Unsupervised WSD based on automatically retrieved examples: The importance of bias. Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Barcelona, Spain, 25–33.
  8. Mihalcea, Rada. 2002b. Bootstrapping large sense tagged corpora. Proceedings of the Language Resources and Evaluation Conference (LREC), Las Palmas, Spain.
  9. Agirre, Eneko & David Martínez. 2004. Unsupervised WSD based on automatically retrieved examples: The importance of bias. Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Barcelona, Spain, 25–33.
  10. Agirre, Eneko & David Martínez. 2000. Exploring automatic word sense disambiguation with decision lists and the Web. Proceedings of the COLING Workshop on Semantic Annotation and Intelligent Annotation, Luxembourg, 11–19.
  11. Mihalcea, Rada & Dan Moldovan. 1999. An automatic method for generating sense tagged corpora. Proceedings of the American Association for Artificial Intelligence (AAAI), Orlando, U.S.A., 461–466.
  12. Agirre, Eneko & David Martínez. 2004. Unsupervised WSD based on automati- cally retrieved examples: The importance of bias. Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Barcelona, Spain, 25–33.
  13. Mihalcea, Rada. 2002b. Bootstrapping large sense tagged corpora. Proceedings of the Language Resources and Evaluation Conference (LREC), Las Palmas, Spain.
  14. Santamaría, Celina, Julio Gonzalo & Felisa Verdejo. 2003. Automatic association of Web directories to word senses. Computational Linguistics, 29(3): 485–502.
  15. Chklovski, Tim & Rada Mihalcea. 2002. Building a sense tagged corpus with Open Mind Word Expert. Proceedings of the ACL SIGLEX Workshop on Word Sense Disambiguation: Recent Successes and Future Directions, Philadelphia, U.S.A., 116–122.
  16. Agirre, Eneko, Olatz Ansa, Eduard H. Hovy & David Martínez. 2000. Enriching very large ontologies using the WWW. Proceedings of the Ontology Learning Workshop, European Conference on Artificial Intelligence (ECAI), Berlin, Germany.
  17. Santamaría, Celina, Julio Gonzalo & Felisa Verdejo. 2003. Automatic association of Web directories to word senses. Computational Linguistics, 29(3): 485–502.
  18. Denis Turdakov, Pavel Velikhov. Semantic Relatedness Metric for Wikipedia Concepts Based on Link Analysis and its Application to Word Sense Disambiguation // SYRCoDIS.— 2008.
  19. Турдаков Денис. Устранение лексической многозначности терминов Википедии на основе скрытой модели Маркова // XI Всероссийская научная конференция «Электронные библиотеки: перспективные методы и технологии, электронные коллекции».— 2009. pdf (russian)
This article is issued from Wikipedia - version of the 11/13/2016. The text is available under the Creative Commons Attribution/Share Alike but additional terms may apply for the media files.