Development of Extensive Polish Handwritten Characters Database for Text Recognition Research

In the modern world, fast and efficient processing of non-digital (handwritten or typed) texts is a task of great importance. Like many other fields, optical character recognition (OCR) benefits from machine learning (ML), which allows effective and accurate methods to be developed. To achieve good performance, a machine learning algorithm requires a large amount of data. A large database of handwritten characters prepared by the National Institute of Standards and Technology (NIST), USA, can be used for training an ML model. However, handwriting styles in the US and Poland differ significantly. This fact, along with the absence of Polish diacritical marks, makes the NIST database less useful for developing an OCR model for the Polish language. To the best of the authors’ knowledge, no database with samples of Polish handwriting exists. The present research is focused on filling this gap, i.e. gathering and preparing an extensive database of Polish handwritten characters. The paper presents the very first database of Polish handwriting samples. The database is far larger than all the datasets used in previous attempts at implementing OCR for Polish handwriting. It is also the first fully publicly accessible database of Polish handwriting of this scale. The same method and tools can be used to build handwritten character databases for other languages.


INTRODUCTION
OCR is highly beneficial in many fields involving data input, e.g. processing handwritten documents in government institutions, faster reading of information from official documents such as passports and ID cards at airports, scanning car number plates in car parks and surveillance systems, and digitizing scientific literature published before the digital era to make it searchable [7,21]. Various models can be used to develop an OCR system. One of the most dynamically developing paradigms is machine learning, an approach based on statistics and optimization. Since it relies heavily on statistical methods, such models require a large amount of data to achieve sufficient performance [4,12,16]. Most often, this kind of data is gathered in the form of samples of handwritten text, where the person providing the sample is asked to fill in a form by writing some specific text into fields [12,20,21]. The forms usually contain letters, digits and other characters. The letters can be presented separately or be included in words and sentences. The filled forms later undergo character extraction, and the extracted characters are pre-processed (e.g. centering, binarization, denoising) [4,5,19].
One of the best-known databases of handwritten characters is the NIST dataset prepared by the National Institute of Standards and Technology [7]. This database contains a large number of handwritten characters gathered in the USA. Thus, it can potentially be used for training a machine learning model for OCR tasks. However, the literature provides ample evidence that handwriting styles differ between countries, which makes the American database less useful for application in Poland [3,17].
Besides the Latin alphabet, examples of databases containing samples of other writing systems can be found in the literature [1,9,11]. Such datasets were used for developing handwriting recognition algorithms [2,22], which shows that a large publicly available dataset is an essential factor in creating effective handwriting recognition algorithms.
Another reason for creating a Polish database of handwritten characters is that the American database contains only the characters of the basic Latin alphabet, without the special Polish characters with diacritics such as ą, ę, ć, ł, ń, ó, ź, ż. For these reasons, the task of creating an extensive database with samples of Polish handwritten characters appears to be important.
The literature review reveals several attempts at developing an OCR pipeline dedicated to Polish handwriting [8,10]. However, they did not involve a handwriting database as extensive as the one used in the present research. Grzelak et al. extended the EMNIST dataset with two Polish letters, namely "Ą" and "Ć" [8]. Kurzyński and Sas, on the other hand, analyzed a dataset containing only Polish handwriting, which included the names and surnames of a broad set of medical patients [10]. However, complete words were used in training the classifier.
Turnbull et al. analyzed datasets containing Polish handwriting in order to find features that distinguish Polish from English handwriting; the dataset, however, contained only 53 Polish and 52 English handwriting samples [18].
Górska and Janicki developed a model for recognizing a person's extraversion level on the basis of handwriting [6]. They examined the handwriting of a large group of 883 persons: 404 men and 479 women. Although the number of samples was quite high, separate letters were not extracted; instead, the authors extracted features such as word spacing, stability of pressure, handwriting regularity, etc.
On the basis of the literature review, the authors of the present article conclude that the database presented in the paper is unique in several respects:
• it is the first large-scale database of Polish handwriting samples, containing over 530 thousand characters in the form of separate digits, letters (common Latin and Polish diacritics) and syntax characters;
• it is the largest dataset of handwriting samples collected in Poland;
• it is the first fully labeled and preprocessed dataset of Polish language handwriting samples ready for use in OCR;
• it is the first fully publicly accessible database of Polish handwriting of this scale.

Polish Handwritten Sample Form
The data were gathered with the use of special one-page forms, called the Polish Handwritten Sample Form (PHSF). The PHSF was prepared by the authors of the present paper. The forms are anonymous, i.e. the participants did not have to provide personal data such as name, surname, etc. Only sex and year of birth were collected for statistical purposes. Additionally, information regarding the participant groups was collected in coded form (e.g. 2-B, 2-C). Each form contains the date of filling as well. The PHSF contains 16 fields in which every Polish language character appears at least 3 times. The samples of handwriting were collected either in the form of separate letters or sentences. In order to ensure that the complete alphabet appears in the sentences, pangrams were chosen. A pangram is a sentence containing all the letters of the alphabet of a particular language. For the English language, the following sentence was used: "The quick brown fox jumps over the lazy dog", while for Polish "Mężny bądź chroń pułk twój i sześć flag" was chosen.
Participants were asked to write the characters with spaces between them and, where possible, to avoid crossing the borders of the fields.
On the other side of the form, a well-known poem by Adam Mickiewicz (called "Invocation") was provided. The participants had to write out the poem in the normal way, without spaces between the letters. Figure 1 and Figure 2 present the first and the second side of the PHSF.

Overview
Students of various Polish universities, along with people of working age, formed the main group of participants. This group was chosen because it includes people who are either employed or entering the labor market. Moreover, the approach to teaching handwriting is changing nowadays: formal calligraphy is disappearing from schools and handwriting itself is changing its character.
Participants from both technical and humanities specialties took part in the research. They were asked to fill in the forms. The average time needed to fill in a form was 10 minutes. Over 2000 completed forms were collected.
The initial selection, including the rejection of insufficiently filled forms, was carried out by hand. A form was rejected when the characters were written without spaces and/or when many characters crossed the borders of the fields. This rejection was necessary due to the high complexity of processing such forms.
The next step was scanning the forms at 600 dpi resolution. The obtained files were converted from RGB to grayscale. Afterwards, they were thresholded and the colors were inverted, i.e. black parts became white and white parts became black. Subsequently, the procedure of character extraction was performed.
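The preprocessing chain described above (grayscale conversion, thresholding, inversion) can be sketched as follows. This is a minimal NumPy-only illustration and not the authors' implementation: the function name `preprocess_scan`, the luma weights and the fixed threshold of 128 are assumptions, since the text does not state which thresholding method or value was used.

```python
import numpy as np


def preprocess_scan(rgb, threshold=128):
    """Turn an RGB scan into an inverted binary image (ink = 255, paper = 0).

    `threshold` is a hypothetical fixed value chosen for illustration.
    """
    rgb = np.asarray(rgb, dtype=np.float64)
    # Grayscale conversion with the common ITU-R BT.601 luma weights.
    gray = 0.299 * rgb[..., 0] + 0.587 * rgb[..., 1] + 0.114 * rgb[..., 2]
    # Threshold and invert in one step: dark pixels (ink) become white.
    return np.where(gray < threshold, 255, 0).astype(np.uint8)
```

After this step the handwriting and the field borders are the non-zero pixels, which is the convention the later mask and histogram steps rely on.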
A dedicated application for character extraction was developed in Python 3.6. The following libraries were used: numpy, opencv, PyQt5 and Pillow. The application had the following main functionalities:
• for both PHSF sides and separate fields: editing, dilation, erosion, rejection;
• for processing separate characters: deleting, rejection, joining neighboring characters.
The application has a user-friendly graphical interface (Figure 3).

Character extraction
A special procedure was developed to extract the characters. In order to describe it, let us assume that the scanned form is referred to as $F$, being an integer grid, i.e. $F \in \mathbb{N}^{n \times m}$, where $n$ and $m$ are, correspondingly, its height and width.
The procedure of character extraction was performed in the following steps:
• removal of the form header, which holds the age/sex information and contains no characters for extraction,
• field mask extraction,
• form alignment,
• extraction of separate lines,
• extraction of separate fields,
• extraction of characters.
The description of the listed steps is presented below.
Removal of the form header is carried out by setting the values of the pixels in the upper part of the form to 0. "The upper part" is defined as 18.5% of the form height. The value of 18.5% was set empirically. In rare cases, when some elements of the header are not eliminated, the user of the dedicated application has the option of correcting the form manually.
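The header removal step amounts to zeroing the top rows of the image. A minimal sketch is given below; the 18.5% fraction comes from the text, while the function name `remove_header` and the rounding of the row count are assumptions made for illustration.

```python
import numpy as np

HEADER_FRACTION = 0.185  # empirical value quoted in the text


def remove_header(form, fraction=HEADER_FRACTION):
    """Return a copy of `form` with the top `fraction` of its rows set to 0."""
    form = np.asarray(form)
    out = form.copy()
    # Number of header rows; rounding to the nearest row is an assumption.
    header_rows = int(round(form.shape[0] * fraction))
    out[:header_rows, :] = 0
    return out
```

For a form that is 200 rows high, this clears the first 37 rows and leaves the rest untouched.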
In order to perform the next steps, i.e. aligning the forms as well as extracting separate lines and fields, the field mask was extracted. Hereinafter, the field mask is referred to as $M_f$.
The field mask is obtained by the successive application of two filters: horizontal ($f_h$) and vertical ($f_v$). Each filter is used in two morphological operations: erosion and dilation [13,15]. The mentioned filters have the following structure:
$$f_h = \mathbf{1}_{1 \times p}, \qquad f_v = \mathbf{1}_{q \times 1} \tag{1}$$
The horizontal filter $f_h$ is a row vector of ones of length $p$ and the vertical filter $f_v$ is a column vector of ones of length $q$. The values of $p$ and $q$ were set empirically so as to ensure the best performance of the algorithm: $p$ was set to 220 and $q$ to 200.
The above-mentioned operations of erosion and dilation were applied to obtain the field mask $M_f$, which was the union of two masks: the horizontal mask $M_h$ and the vertical mask $M_v$. Figure 4 presents the original scanned form and the obtained field mask. As can be seen, the procedure of field mask extraction eliminates noise and makes the further steps of the form processing possible. $M_f$, $M_h$ and $M_v$ are used in the further processing steps.
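The mask construction can be illustrated with a small NumPy sketch. It assumes that each mask is a morphological opening (erosion followed by dilation) of the binarized form with the corresponding line filter; the text names the two operations but not their order, so this is an interpretation. The helper names `open_with_line` and `field_mask` are hypothetical. In practice OpenCV's `cv2.morphologyEx` with `cv2.MORPH_OPEN` would be much faster; plain NumPy is used here only to keep the example self-contained.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def open_with_line(binary, length, axis):
    """Opening with a 1-pixel-thick line of `length` ones along `axis`.

    Keeps exactly those pixels that belong to a run of at least
    `length` consecutive set pixels along the chosen axis.
    """
    b = np.asarray(binary, dtype=bool)
    out = np.zeros_like(b)
    if b.shape[axis] < length:
        return out
    # Erosion: window positions at which all `length` pixels are set.
    full = sliding_window_view(b, length, axis=axis).all(axis=-1)
    n = full.shape[axis]
    # Dilation: restore every pixel covered by such a window.
    for offset in range(length):
        idx = [slice(None)] * b.ndim
        idx[axis] = slice(offset, offset + n)
        out[tuple(idx)] |= full
    return out


def field_mask(form, p=220, q=200):
    """Return (M_f, M_h, M_v) as boolean arrays; p and q as quoted in the text."""
    binary = np.asarray(form) > 0
    m_h = open_with_line(binary, p, axis=1)  # only long horizontal lines survive
    m_v = open_with_line(binary, q, axis=0)  # only long vertical lines survive
    return m_h | m_v, m_h, m_v
```

Because handwriting strokes rarely form straight runs of 200 or more pixels at 600 dpi, only the printed field borders are expected to survive in the mask.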
The procedure of form alignment is an essential step because the forms are usually placed in the scanner with a slight misalignment, which can lead to poorer extraction of the characters. In order to align a form, it should be rotated around its center by the angle $\alpha$. The angle $\alpha$ is obtained in the following way:
$$\alpha = \arctan\frac{\gamma_r - \gamma_l}{\chi_r - \chi_l}$$
where $(\chi_l, \gamma_l)$ are the coordinates of the white pixel of $F$ closest to its top-left corner, while $(\chi_r, \gamma_r)$ are the coordinates of the white pixel of $F$ closest to its top-right corner. The term "closest" is used here in the sense of the L1 metric. Because all the coordinates are positive numbers and any $\chi$ is less than the image width, the absolute value (modulus) operation can be omitted. This can be expressed in the following form:
$$(\chi_l, \gamma_l) = \arg\min_{(\chi, \gamma) \in C} (\chi + \gamma), \qquad (\chi_r, \gamma_r) = \arg\min_{(\chi, \gamma) \in C} (m - \chi + \gamma)$$
where $(\chi, \gamma) \in C$, $C$ being the set containing all the coordinates of the white pixels of $M_f$.
Figure 5 presents a schematic explanation of formula (4).
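The corner search and the resulting angle can be sketched as below. The code assumes image coordinates with the origin in the top-left corner and the vertical axis pointing down, and it takes the angle as the slope of the line joining the two corner pixels; both the sign convention and the function name `alignment_angle` are assumptions rather than details confirmed by the text.

```python
import numpy as np


def alignment_angle(mask):
    """Estimate the skew angle (in degrees) from the top corners of a field mask."""
    ys, xs = np.nonzero(mask)
    if xs.size == 0:
        raise ValueError("mask contains no white pixels")
    width = np.asarray(mask).shape[1]
    # L1 distance to the top-left corner (0, 0) is x + y.
    i_left = np.argmin(xs + ys)
    # L1 distance to the top-right corner is (width - x) + y; no absolute
    # value is needed because x is always smaller than the width.
    i_right = np.argmin((width - xs) + ys)
    dx = xs[i_right] - xs[i_left]
    dy = ys[i_right] - ys[i_left]
    return float(np.degrees(np.arctan2(dy, dx)))
```

With these conventions a positive result means the right corner lies lower than the left one, so the form would need a counterclockwise rotation to become level.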
The rotation is counterclockwise if $\alpha$ is positive and clockwise if $\alpha$ is negative [14]. After the form is aligned, the next step, extraction of separate lines, is performed. In order to obtain separate lines, the horizontal mask $M_h$ is analyzed. A set $H_h$ containing values equal to either 1 or 0 is obtained. The number of values in the set is equal to the number of rows in $M_h$. The value 1 corresponds to a row containing at least one white pixel, while 0 corresponds to a row in which all the pixels are black. As a horizontal filter was applied, $M_h$ contains only horizontal lines. Figure 6 presents an example of $H_h$.
After obtaining $H_h$, the differences $D_h$ between every pair of neighboring elements are computed. An example is presented in Figure 7.
On the basis of $D_h$, the coordinates of the vertical borders of the fields can be obtained in the following way: first, two separate sets of order numbers starting from 0 are assigned to the positive and negative peaks of $D_h$. The pair of borders for the i-th line is then formed from the positions of these peaks, one border being the value corresponding to the k-th negative peak.
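The row histogram $H_h$, its differences $D_h$ and the positions of the positive and negative peaks can be computed as in the sketch below. The function name `row_peaks` and the zero padding at both ends are assumptions; the way the peaks are finally paired into line borders is not reproduced here, because the exact rule is not recoverable from the text.

```python
import numpy as np


def row_peaks(mask_h):
    """Compute H_h, D_h and the peak positions for a horizontal mask.

    Returns (h, d, positive, negative), where `positive[j]` is the first
    row of the j-th horizontal border line and `negative[j]` is the first
    row after it.
    """
    # H_h: 1 for rows with at least one white pixel, 0 otherwise.
    h = (np.asarray(mask_h) > 0).any(axis=1).astype(np.int64)
    # D_h: differences of neighboring elements; padding with zeros makes
    # lines that touch the image edge produce peaks as well.
    d = np.diff(np.concatenate(([0], h, [0])))
    positive = np.flatnonzero(d == 1)
    negative = np.flatnonzero(d == -1)
    return h, d, positive, negative
```

Numbering both peak lists from 0, as the text describes, falls out naturally from the array indices.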
Having obtained the vertical borders, separate fields of the i-th line can be extracted by calculating vertical histograms of the rows of $M_v$ bounded by the i-th pair of borders. The obtained histograms are processed in a way similar to the extraction of the line borders, after which a pair of points $\{(\chi_l, \gamma_l), (\chi_r, \gamma_r)\}_i$ is obtained for the i-th field.
$(\chi_l, \gamma_l)$ are the coordinates of the top-left corner and $(\chi_r, \gamma_r)$ are the coordinates of the bottom-right corner of the i-th field.
In order to extract the characters, $M_f$ was subtracted element-wise from $F$, which yields the sample form without the border boxes.
The characters are extracted from the separate fields. A histogram approach is used for this task as well. Figure 8 presents the extracted characters marked with colors. Sometimes, the characters cannot be extracted separately due to various factors, e.g. joined handwriting, tilted letters or bad scan quality. In those cases, the user has the possibility of correcting a separate field manually.
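The last two steps, removing the border boxes and splitting a field into characters, can be sketched as follows. The code assumes 8-bit images with white (255) ink on a black background and interprets the "histogram approach" as a column projection in which empty columns separate characters; the names `remove_boxes`, `split_characters` and the `min_width` parameter are hypothetical.

```python
import numpy as np


def remove_boxes(form, mask_f):
    """Subtract the field mask from the form element-wise, saturating at 0."""
    diff = np.asarray(form, dtype=np.int16) - np.asarray(mask_f, dtype=np.int16)
    return np.clip(diff, 0, 255).astype(np.uint8)


def split_characters(field, min_width=1):
    """Cut a field image into character crops at columns that contain no ink."""
    field = np.asarray(field)
    # Column histogram reduced to 0/1: does the column contain any ink?
    ink = (field > 0).any(axis=0).astype(np.int64)
    d = np.diff(np.concatenate(([0], ink, [0])))
    starts = np.flatnonzero(d == 1)
    ends = np.flatnonzero(d == -1)
    return [field[:, s:e] for s, e in zip(starts, ends) if e - s >= min_width]
```

Touching or overlapping letters produce a single wide crop under this scheme, which is consistent with the need for the manual correction described above.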

DESCRIPTION OF DATABASE

Character images representation
Using the gathered PHSFs and the developed application, the Polish Handwritten Characters Database (PHCD) was created. One part of the PHCD contains the images of the characters extracted from the forms.
Decimal integer numbers (codes) were assigned to the characters in the form in the following way:
• digits (0-9): 0-9,
The database is structured in the following way: the main directory is named phsf (an acronym of Polish Handwriting Sample Form). It includes two subdirectories: characters and invocation. Each of these directories contains a png subdirectory. The png directory located in the characters directory contains directories named with the numbers assigned to the characters. Every file contains the picture of a single character. The png directory located in the invocation directory [...] is centered on a 32 px × 32 px square. Figure 10 presents examples of single character images stored in the PHCD.
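The directory layout described above can be traversed with a few lines of standard-library Python. The sketch below builds an index from character code to image files; the function name `index_characters` and the `.png` file extension are assumptions, while the `characters/png/<code>` nesting follows the description in the text.

```python
from pathlib import Path


def index_characters(root):
    """Map each integer character code to the sorted list of its image files.

    `root` is the path of the main `phsf` directory.
    """
    png_dir = Path(root) / "characters" / "png"
    index = {}
    for class_dir in sorted(png_dir.iterdir()):
        # Only directories named with a decimal code are character classes.
        if class_dir.is_dir() and class_dir.name.isdigit():
            index[int(class_dir.name)] = sorted(class_dir.glob("*.png"))
    return index
```

A call such as `index_characters("phsf")` then gives, for example, all samples of the digit 0 under key `0`.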

Character numerical representation
Additionally, the PHCD contains the ocr_files directory, which collects the data preprocessed into a form suitable for machine learning. The motivation for this approach was that loading separate png files into RAM is extremely time-consuming and requires specific libraries for processing images.
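The idea of storing all characters jointly can be illustrated with NumPy's `.npy` format. This is a sketch of the general technique only: the function names, the file paths passed in and the array layout (one `uint8` image array plus one label array) are assumptions and do not describe the actual files shipped in ocr_files.

```python
import numpy as np


def pack_dataset(images, labels, x_path, y_path):
    """Stack equally sized character images and save them with their codes."""
    x = np.stack([np.asarray(img, dtype=np.uint8) for img in images])
    y = np.asarray(labels, dtype=np.int64)
    np.save(x_path, x)  # all images in a single binary file
    np.save(y_path, y)  # matching integer character codes


def load_dataset(x_path, y_path):
    """Load the whole dataset with two reads instead of one read per image."""
    return np.load(x_path), np.load(y_path)
```

Loading two arrays replaces hundreds of thousands of individual file reads and image decodes, which is where the time saving comes from.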
The ocr_files directory of the PHCD includes the following four files:

CONCLUSION
The database of Polish handwritten characters has been developed in accordance with a procedure similar to the one applied for collecting the data contained in the NIST characters dataset [7]. The developed database contains the most important characters, such as lower case and upper case letters, including the letters with Polish diacritics, digits and syntax characters. The data were gathered from an extensive group of participants, including students of various specialties such as computer science, electrical engineering, mechanical engineering, civil engineering, economics, logistics, management, mechatronics and mathematics. The wide range of academic specialties ensured diversity of handwriting.
The Python 3.6 programming language was used for developing the character extraction application. The following libraries were applied: OpenCV, Pillow, numpy, PyQt5. The handwritten samples were prepared for use in machine learning in the following way: the characters were extracted, de-noised and scaled to a specific size. Each character is stored in two ways: as a separate image and jointly with the other characters in .npy files. The second approach ensures a significantly shorter loading time for the dataset.
The presented PHCD contains labeled and fully prepared data, which can be used by researchers for developing models of optical character recognition for the Polish language and for other text recognition research. The presented method and developed tools can be used to build handwritten character databases for other languages. The PHCD is publicly available free of charge at http://cs.pollub.pl/phcd.