Cambridge-Belgrade Persian Learner Corpus

The international CamBel project (Cambridge-Belgrade Persian Learner Corpus) aims to provide a collection of written and spoken materials produced by learners of Persian around the world.

Overview

The CamBel Persian Learner Corpus is an error-tagged (coded) learner corpus of written texts produced by learners of Persian from all over the world, who come from diverse linguistic backgrounds and range from CEFR level A1 to C2 (beginner to advanced). CamBel has been jointly compiled, recorded, and administered as part of an academic research collaboration between Persian Studies in the Faculty of Asian and Middle Eastern Studies at the University of Cambridge and the Department of Oriental Studies and Centre of Persian Studies in the Faculty of Philology at the University of Belgrade. Work on CamBel is ongoing, and new texts are continuously added and tagged.

Background

Linguistic corpora are reliable, empirical sources for analysing linguistic data. They are also widely used in Second/Foreign Language Acquisition and Foreign Language Teaching research, where the most commonly used type is the learner corpus.

A learner corpus consists of authentic materials (written, spoken, or mixed) produced by learners in the course of their learning endeavours. The systematic collection, processing, and analysis of data are crucial for educators, researchers, and material developers when it comes to identifying learners’ challenges and difficulties, improving curricula, creating effective learning materials, and conducting thorough error analysis. As there is a lack of such resources in the field of teaching Persian to non-Persian learners, there is an urgent need for specialised, streamlined corpora tailored to Persian as a second/foreign language. The development of CamBel for the Persian language aims to address this need and contribute to the advancement of research in this field.

The Cambridge-Belgrade Persian Learner Corpus (CamBel) was formed by merging the SFLC Error-Tagged Learner Corpus developed at the University of Belgrade with the Persian Learners Written Data (PLWD) at the University of Cambridge. Setting up CamBel involved three major stages: constructing the corpus, proposing a system of error annotation, and developing tools and software. The practical phases included the systematic collection of data and metadata, defining the corpus design criteria, creating the error tagsets, and developing the corpus interface, software, and specific tools.

The CamBel software is equipped with four main tools that allow it to function as an error-tagged learner corpus and to provide statistical reports. The data gathered in this corpus are predominantly written works on a variety of topics, produced by learners of Persian at different levels (A1-C2) from all over the world. The corpus therefore presents the natural written production of Persian learners with a range of first languages. The learners include both genders and vary in age, educational level, and educational or academic setting.

Research Team

The CamBel Persian Learner Corpus has been jointly developed by teams of linguists, teachers, lecturers, and professors of Persian around the world under the supervision of Prof. Dr. Mahbod Ghaffari, University of Cambridge, and Prof. Dr. Saeed Safari, University of Belgrade. The project has been formally reviewed and approved by the relevant Faculty Academic Councils of the participating institutions.

For academic inquiries, access requests, or collaborative proposals, please contact the CamBel research team at farsi@fil.bg.ac.rs.