Hamshahri Corpus

The Hamshahri Corpus is a sizable Persian corpus based on the Iranian newspaper Hamshahri, one of the first online Persian newspapers in Iran. It was initially collected and compiled by Ehsan Darrudi at the Database Research Group (DBRG)[1] of the University of Tehran. Later, a team headed by Ale Ahmad[2] built on this corpus and created the first Persian text collection suitable for information retrieval evaluation tasks.

This corpus was created by crawling online news articles from Hamshahri's website and processing the HTML pages to create a standard text corpus for modern information retrieval experiments.
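The HTML-processing step described above can be illustrated with a minimal sketch. It is not the pipeline actually used to build the corpus, which the source does not describe; it only assumes that each crawled page is available as an HTML string and that the goal is to keep visible text while discarding markup, scripts, and styles. The names `ArticleTextExtractor` and `html_to_text` are hypothetical and chosen for illustration.

```python
from html.parser import HTMLParser


class ArticleTextExtractor(HTMLParser):
    """Collects visible text and skips <script> and <style> content."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped elements.
        if self._skip_depth == 0:
            text = data.strip()
            if text:
                self._chunks.append(text)

    def text(self):
        return "\n".join(self._chunks)


def html_to_text(html):
    """Return the visible text of an HTML page, one text run per line."""
    parser = ArticleTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

A real news crawl would also need site-specific rules to drop navigation menus, advertisements, and other page furniture, which this sketch does not attempt.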

Version 1.0

The collection contains more than 160,000 articles covering subject categories such as politics, city news, economics, reports, editorials, literature, sciences, society, foreign news, and sports. Document size varies from short news items (under 1 KB) to rather long articles (e.g. 140 KB), with an average of 1.8 KB.

The corpus is available for download in several formats.[2]

Version 2.0

The second release of the Hamshahri Corpus was released on 20 October 2008. It offers several new features and improvements.

The corpus is available for download in XML format.

References

  1. DBRG News Database Research Group
  2. Hamshahri Database Research Group

This article is issued from Wikipedia - version of the 4/10/2016. The text is available under the Creative Commons Attribution/Share Alike but additional terms may apply for the media files.