Data Sets

The competition will be carried out on the LAMIS-MSHD and CERUG data sets. The first data set contains Arabic and French handwritten texts, and the second contains Chinese and English texts.

The CERUG data set contains handwritten documents collected from 105 Chinese subjects, predominantly students from China. Some of them live in China and the rest study in the Netherlands. Every subject was required to write four different A4 pages. On page 1, the participants were asked to copy a text of two paragraphs in Chinese. On page 2, the subjects described topics of their own choosing, in their own words, in Chinese. Page 3 contains English text copied from two paragraphs. This page is split into two sub-pages, each containing one paragraph. In total, there are four handwritten samples from each writer, two in Chinese and two in English. All the documents were scanned at 300 dpi, 8 bits/pixel, gray-scale.

The LAMIS-MSHD data set comprises 1200 handwritten text images written by 100 different writers. All 100 writers are adults, to ensure that each has an established, characteristic handwriting style. Each writer produced 12 handwritten documents: the first 6 pages contain Arabic text, while the remaining 6 pages contain French text. Each of the 12 pages carries a different text, and all writers copied the same set of texts.

Details on the distribution of the data sets can be found in the subsequent sections. For more information on the LAMIS-MSHD and CERUG data sets, participants are encouraged to consult [1] and [2], respectively.

a)   Dataset Distribution for Tasks 1 and 2

Tasks 1 and 2 will be carried out on all writing samples of the 105 writers of the CERUG data set. For each task, 50 writing samples will be provided as a validation data set, while 80 test samples will be used to evaluate system performance.

The training data comprises 80 samples in Chinese and 80 in English from a total of 40 different writers, while the validation data contains Chinese and English handwriting samples of 25 different writers. The naming convention of the images is AAA_B, where AAA represents the writer ID and B represents the sample number. The training and validation data will be grouped by task.
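The naming convention can be handled with a few lines of Python. The following is only a minimal sketch: the function name `parse_sample_name`, the example file name `012_3.png`, and the presence of a file extension are assumptions made for illustration and are not specified by the organizers. The writer ID and sample number are kept as strings because their exact format is not stated here.

```python
from pathlib import Path
from typing import NamedTuple


class SampleName(NamedTuple):
    writer_id: str  # the "AAA" part of the name
    sample_no: str  # the "B" part of the name


def parse_sample_name(filename: str) -> SampleName:
    """Split an image name of the form AAA_B into writer ID and sample number.

    Hypothetical helper: a file extension, if present, is dropped first.
    """
    stem = Path(filename).stem
    # Split on the first underscore (assumes the writer ID has no underscore).
    writer_id, sep, sample_no = stem.partition("_")
    if not sep or not writer_id or not sample_no:
        raise ValueError(f"not in AAA_B form: {filename!r}")
    return SampleName(writer_id, sample_no)


if __name__ == "__main__":
    # Hypothetical file name, used only to illustrate the convention.
    print(parse_sample_name("012_3.png"))
```

Grouping the parsed names by `writer_id` then yields all samples of one writer, which is convenient when building per-writer training and validation lists.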

The test set will comprise 160 unlabeled handwritten images, 80 in Chinese and 80 in English. The test data will also be grouped by task and will be provided to the participants so that they can run their systems and submit the results.

b)   Dataset Distribution for Tasks 3 and 4

Tasks 3 and 4 will be carried out on all writing samples of the 100 writers of the LAMIS-MSHD data set. For each task, 120 writing samples will be provided as a validation data set, while 240 test samples will be used to evaluate system performance.

The training data comprises 240 samples in Arabic and 240 in French from a total of 40 different writers, while the validation data contains Arabic and French handwriting samples of 20 different writers. The naming convention of the images is CCC_D, where CCC represents the writer ID and D represents the sample number. The training and validation data will be grouped by task.

The test set will comprise 480 unlabeled handwritten images, 240 in Arabic and 240 in French. The test data will also be grouped by task and will be provided to the participants so that they can run their systems and submit the results.
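The sample counts quoted for both data sets follow from the number of writers in each split multiplied by the number of samples each writer contributes per language (2 for CERUG, 6 for LAMIS-MSHD). The sketch below checks that arithmetic. The training and validation writer counts come from the text above; the test writer counts (40 in both cases) are not stated explicitly and are inferred here from the test sample counts, so treat them as an assumption. The names `SPLITS` and `samples_per_language` are illustrative only.

```python
# Writers per split and samples per writer per language.
# "test" writer counts are inferred (80 / 2 and 240 / 6), not stated in the text.
SPLITS = {
    "CERUG": {"writers": 105, "per_lang": 2, "train": 40, "val": 25, "test": 40},
    "LAMIS-MSHD": {"writers": 100, "per_lang": 6, "train": 40, "val": 20, "test": 40},
}


def samples_per_language(name: str) -> dict:
    """Return the number of samples per language in each split of a data set."""
    s = SPLITS[name]
    return {part: s[part] * s["per_lang"] for part in ("train", "val", "test")}


if __name__ == "__main__":
    for name, s in SPLITS.items():
        # The three splits should account for every writer exactly once.
        assert s["train"] + s["val"] + s["test"] == s["writers"]
        print(name, samples_per_language(name))
    # CERUG {'train': 80, 'val': 50, 'test': 80}
    # LAMIS-MSHD {'train': 240, 'val': 120, 'test': 240}
```

Under this reading, the splits partition the writers (40 + 25 + 40 = 105 and 40 + 20 + 40 = 100), and the per-language counts match the figures given for training, validation, and test data.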
