ANTCorpus project
Arabic News Texts Corpus

About the project

ANTCorpus stands for "Arabic News Texts Corpus". It is a research project that aims to collect texts from different sources of the web by incrementing the amount of data progressively.

The acronym ANT can remind the ants' work:
"Every ant should contribute to build the nest progressively".

How it works ?

From RSS feeds of news websites

Filter and Extract categorized data

Generate corpus documents

The team behind the scene

Technical staff

Dr. Oussama Ben Khiroun

oussama [DOT] ben [DOT] khiroun [@] gmail [DOT] com

Dr. Raja Ayed

ayed [DOT] raja [@] gmail [DOT] com

Amina Chouigui

aminachouigui [@] gmail [DOT] com

Head staff

Dr. Bilel Elayeb


Download v2.1 Multi-source (Number of documents = 31.798 | Released version date = 05 December 2018)
Download v1.1 One source (Number of documents = 10.161 | Released version date = 11 June 2018)
Download v1.0 One source (Number of documents = 6.005 | Released version date = 12 August 2017)

Citation Licence

The files of ANT Corpus are subject to the following citation license:

By downloading ANT Corpus, you agree to cite at least one of our papers describing ANT Corpus (refer to the section below) and/or refer the project's main page in any kind of material you produce where ANT Corpus was used to conduct search or experimentation, whether be it a research paper, dissertation, article, poster, presentation, or documentation.
✅ By using this data, you have agreed to the citation licence.


📄 A. Chouigui, O. Ben Khiroun and B. Elayeb. An Arabic Multi-source News Corpus: Experimenting on Single-document Extractive Summarization. In Arabian Journal for Science and Engineering (AJSE 2021), 46(08), 1-14, DOI : 10.1007/s13369-020-05258-z , February 2021.

📄 A. Chouigui, O. Ben Khiroun and B. Elayeb. ANT Corpus : An Arabic News Text Collection for Textual Classification. In proceedings of the 14th ACS/IEEE International Conference on Computer Systems and Applications (AICCSA 2017), pp. 135-142, Hammamet, Tunisia, October 30 - November 3, 2017.

📄 A. Chouigui, O. Ben Khiroun and B. Elayeb. A TF-IDF and Co-occurrence Based Approach for Events Extraction from Arabic News Corpus. In proceedings of the 23rd International Conference on Natural Language & Information Systems (NLDB 2018), pp. 272-280, Paris, France, 13-15 June 2018.

📄 A. Chouigui, O. Ben Khiroun and B. Elayeb. Related Terms Extraction from Arabic News Corpus using Word Embedding. In: OTM Conferences & Workshops: Proceedings of the 7th International Workshop on Methods, Evaluation, Tools and Applications for the Creation and Consumption of Structured Data for the e-Society (Meta4eS'18), Springer, LNCS, pp. 1-11, Valletta, Malta, 22-26 October 2018.

Workshop poster participation

"ANT Corpus: un Corpus Multi-Sources pour la Classification et la Fouille des Textes Arabes"
(Poster des Journées Scientifiques Pluridisciplinaires (JSP'2018)) (in french)
Poster ANT Corpus JSP-2018