ANTCorpus project
Arabic News Texts Corpus

About the project


ANTCorpus stands for "Arabic News Texts Corpus". It is a research project that collects texts from a range of web sources and grows the dataset progressively.

The acronym ANT is a nod to the way ants work:
"Every ant should contribute to building the nest progressively".



How it works


From RSS feeds of news websites

Filter and extract categorized data

Generate corpus documents
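
The three steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not the project's actual implementation: the sample feed, the field names, the function names (parse_items, filter_categorized, write_corpus) and the one-file-per-article layout are all assumptions made for the example. A real run would download the RSS XML from news websites instead of using an inline string.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

# Hypothetical sample feed standing in for a downloaded RSS 2.0 document.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example news</title>
    <item>
      <title>Match report</title>
      <category>sport</category>
      <description>Text of the sport article.</description>
    </item>
    <item>
      <title>Untitled teaser</title>
      <description>No category here.</description>
    </item>
    <item>
      <title>Market update</title>
      <category>economy</category>
      <description>Text of the economy article.</description>
    </item>
  </channel>
</rss>
"""


def parse_items(rss_xml):
    """Step 1: read the <item> entries of an RSS 2.0 feed."""
    root = ET.fromstring(rss_xml)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": (item.findtext("title") or "").strip(),
            "category": (item.findtext("category") or "").strip(),
            "text": (item.findtext("description") or "").strip(),
        })
    return items


def filter_categorized(items):
    """Step 2: keep only items that carry both a category and some text."""
    return [it for it in items if it["category"] and it["text"]]


def write_corpus(items, out_dir):
    """Step 3: write one plain-text document per item, grouped by category."""
    out_dir = Path(out_dir)
    written = []
    for index, it in enumerate(items, start=1):
        category_dir = out_dir / it["category"]
        category_dir.mkdir(parents=True, exist_ok=True)
        path = category_dir / f"doc_{index:05d}.txt"
        path.write_text(it["title"] + "\n" + it["text"] + "\n", encoding="utf-8")
        written.append(path)
    return written
```

Calling write_corpus(filter_categorized(parse_items(SAMPLE_RSS)), "corpus") would create one sub-folder per category under the (hypothetical) corpus directory, with the uncategorized item left out.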

The team behind the scenes


Technical staff


Dr. Oussama Ben Khiroun

oussama [DOT] ben [DOT] khiroun [@] gmail [DOT] com

Dr. Raja Ayed

ayed [DOT] raja [@] gmail [DOT] com

Amina Chouigui

aminachouigui [@] gmail [DOT] com

Head staff


Dr. Bilel Elayeb

Download


Download v2.1 Multi-source (Number of documents = 31,798 | Release date = 05 December 2018)
Download v1.1 One source (Number of documents = 10,161 | Release date = 11 June 2018)
Download v1.0 One source (Number of documents = 6,005 | Release date = 12 August 2017)

Citation Licence

The files of ANT Corpus are subject to the following citation licence:

By downloading ANT Corpus, you agree to cite at least one of our papers describing ANT Corpus (refer to the section below) and/or refer to the project's main page in any kind of material you produce where ANT Corpus was used to conduct research or experiments, whether it be a research paper, dissertation, article, poster, presentation, or documentation.
✅ By using this data, you have agreed to the citation licence.

Publications

📄 A. Chouigui, O. Ben Khiroun and B. Elayeb. An Arabic Multi-source News Corpus: Experimenting on Single-document Extractive Summarization. In Arabian Journal for Science and Engineering (AJSE 2021), 46(08), 1-14, DOI: 10.1007/s13369-020-05258-z, February 2021.

📄 A. Chouigui, O. Ben Khiroun and B. Elayeb. ANT Corpus: An Arabic News Text Collection for Textual Classification. In Proceedings of the 14th ACS/IEEE International Conference on Computer Systems and Applications (AICCSA 2017), pp. 135-142, Hammamet, Tunisia, October 30 - November 3, 2017.

📄 A. Chouigui, O. Ben Khiroun and B. Elayeb. A TF-IDF and Co-occurrence Based Approach for Events Extraction from Arabic News Corpus. In Proceedings of the 23rd International Conference on Natural Language & Information Systems (NLDB 2018), pp. 272-280, Paris, France, 13-15 June 2018.

📄 A. Chouigui, O. Ben Khiroun and B. Elayeb. Related Terms Extraction from Arabic News Corpus using Word Embedding. In: OTM Conferences & Workshops: Proceedings of the 7th International Workshop on Methods, Evaluation, Tools and Applications for the Creation and Consumption of Structured Data for the e-Society (Meta4eS'18), Springer, LNCS, pp. 1-11, Valletta, Malta, 22-26 October 2018.

Workshop poster participation

"ANT Corpus: un Corpus Multi-Sources pour la Classification et la Fouille des Textes Arabes"
(Poster des Journées Scientifiques Pluridisciplinaires (JSP'2018)) (in French)
Poster ANT Corpus JSP-2018