Arabic and American Sign Languages Alphabet Recognition by Convolutional Neural Network

Hearing loss is a common disability that affects many people worldwide, ranging from mild impairment to complete deafness. Sign language is used to communicate with the deaf community and comprises hand gestures and facial expressions. However, communication remains challenging because not everyone knows sign language. Like spoken languages, every country has developed its own sign language, and there is no standard syntax or grammatical structure. The main objective of this research is to facilitate communication between deaf people and the community around them. Since sign language contains gestures for words, sentences, and letters, this research implements a system that automatically recognizes gestures and signs captured by imaging devices such as cameras. Two sign languages are considered, namely American sign language and Arabic sign language. We use a convolutional neural network (CNN) to classify the images into signs. Different CNN settings are tried for the Arabic and American sign datasets. CNN-2, consisting of two hidden layers, produced the best results (accuracy of 96.4%) for the Arabic sign language dataset. CNN-3, composed of three hidden layers, achieved an accuracy of 99.6% for the American sign dataset.


INTRODUCTION
Communication, in a broad sense, is the perception by one person of an expression from another. It is essential for understanding the physical and psychological needs of people. Speaking a language is the fastest way for humans to communicate, but signs are also used to express some words, such as yes and no, either with head movements or hand gestures. Non-linguistic and sign-based communication developed as methods for silent or long-distance communication. In 1620, Juan Pablo Bonet published a dictionary of sign language and letters for communicating silently [1]. One of the first developments of sign language to help deaf and hearing-impaired people was on Martha's Vineyard, where residents created their own sign language because recessive genes had made deafness common in the town [2]. Although there were many attempts to define sign language protocols, the most crucial work was by Charles Michel De L'Epee, a French priest who established the first free school for the deaf in Paris [3]. He made great effort and progress in transforming the French language into a sign language dictionary built on basic ideas and concepts instead of just letters. Then, in 1800, Thomas Hopkins Gallaudet developed American Sign Language (ASL), inspired by French Sign Language [4]. Other countries have also developed their own sign languages, and research on sign languages has expanded [5]. Arabic Sign Language was developed based on American and British Sign Language and on shapes from nature. The Arabic language contains many dialects, including Saudi, Egyptian, Libyan, Lebanese, Iraqi, and Moroccan; these countries developed sign languages according to their dialects and customs [6].

Sign language is the best way of communication for deaf and hard-of-hearing people. Still, few people know it, especially ordinary people who believe they do not need to learn it unless they have a relative with impaired hearing. Sign language varies with the original spoken language, and there is no international sign language comparable to English, which makes communication across languages even more difficult. According to the World Health Organization, the number of deaf people is rising, having reached 430 million, and is estimated to reach 700 million by 2050 [7]. This number is enormous, and those people need to communicate with others. Sign language is expressed using different hand gestures, body movements, and facial expressions. These gestures refer to a sentence, word, letter, number, idea, or concept. It is therefore difficult to expand sign language to cover all languages, because similar gestures can express different meanings, making communication between languages hard. Many techniques have appeared for communicating with deaf people, using sensors and gloves or through images of gestures. Machine learning and deep learning have been effective in extracting features from images, videos, and sounds, and these methods can overcome the communication barriers between hearing and deaf people. The American sign language alphabet gestures are illustrated in Figure 1, and the Arabic sign language alphabet gestures are shown in Figure 2.
This paper focuses on identifying sign language gestures that correspond to letters in American and Arabic sign languages using the convolutional neural network (CNN), a deep learning algorithm.

RELATED WORK
The block diagram in Figure 3 shows the flow of a typical sign language identification system. Multiple techniques are used to capture the data for sign language recognition, such as cameras, data gloves [8,9], and sensors like motion sensors, EMG sensors, or EEG sensors [10,11,12]. The data may be an image, a signal, or a video stream. A preprocessing step is required to remove unwanted noise or signal components with appropriate filters, to crop or resize images, etc. In the case of images, segmentation is sometimes also applied to remove unwanted background. In the next step, relevant features are extracted from one-dimensional signals (a time series from a sensor) or two-dimensional images. The fast Fourier transform [10,11,13], statistical features [8,14], and the wavelet transform [15,16] are common methods for extracting useful features from sensor signals or images. In the case of images, the scale-invariant feature transform (SIFT) [17], histograms of oriented gradients (HOG) [18,19], and speeded-up robust features (SURF) [20,21] are common feature extraction methods. Feature reduction techniques such as principal component analysis (PCA) [22,23], linear discriminant analysis (LDA) [12,24], and independent component analysis (ICA) [alqattan2017towards; tharwat2020independent] can be applied to reduce the computation cost and to remove irrelevant features. Various classifiers are used in the literature to classify the sign alphabets in different languages, including artificial neural networks (ANN) [25,26,8], support vector machines (SVM) [26,27,28], hidden Markov models (HMM) [29,30], tree-based classifiers [31,32], and k-nearest neighbors (KNN) [33].
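To make the classical pipeline of Figure 3 concrete, the sketch below chains HOG feature extraction, PCA feature reduction, and an SVM classifier using scikit-image and scikit-learn. It is a minimal illustration, not the setup of any cited work; the arrays `X` (N grayscale 28x28 images) and `y` (labels) and all HOG/PCA/SVM parameters are assumptions.

```python
# Minimal sketch of the classical pipeline: HOG features -> PCA -> SVM.
# Assumes X is an array of shape (N, 28, 28) of grayscale images and y
# is an array of N integer class labels; parameters are illustrative.
import numpy as np
from skimage.feature import hog
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split

def extract_hog(images):
    # Histogram of Oriented Gradients descriptor for each image
    return np.array([hog(img, orientations=9, pixels_per_cell=(4, 4),
                         cells_per_block=(2, 2)) for img in images])

X_feat = extract_hog(X)                                   # feature extraction
X_tr, X_te, y_tr, y_te = train_test_split(X_feat, y, test_size=0.2,
                                          random_state=0)

# PCA for feature reduction, followed by a kernel SVM classifier
clf = make_pipeline(PCA(n_components=50), SVC(kernel="rbf"))
clf.fit(X_tr, y_tr)
print("test accuracy:", clf.score(X_te, y_te))
```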
Chuan et al. [34] used a palm-sized Leap Motion sensor to capture sign gestures. They applied support vector machine (SVM) and k-nearest neighbor (KNN) algorithms to classify 26 alphabets of American sign language. Two faculty members, including a deaf person, collected the data using the Leap Motion sensor. Classification accuracy was 61.95% with KNN and 83.39% with the kernel SVM classifier. Abbas et al. [35] collected hand gesture images with a smartphone camera; classification accuracy over all alphabets was 92.5% using the SVM classifier.
Deep learning architectures [36,37] are not new in machine learning research and are becoming popular in various real-life applications. Convolutional neural networks [38] and recurrent neural networks [39] are widely used.
A flow diagram of automatic sign language classification using deep learning methods is shown in Figure 7. In deep learning architectures, feature extraction and classification are combined; for example, the convolutional layers in a CNN extract the essential features from the images. Noor Tubaiz et al. [40] proposed an Arabic sign language detection system for 40 sentences using data collected from two DG5-VHand data gloves, in addition to a camera used to record the sentences. For classification, a modified k-nearest neighbor (MKNN) was used and achieved an accuracy of 98.9%. Depth sensors were used to capture the upper part of the body to recognize Arabic sign language hand gestures [41]. The dataset consists of 143 signs for five words performed by ten people. The authors used the support vector machine (SVM) classifier with two kernels (linear and radial) in four SVM models with different parameter settings; the SVM with a linear kernel had the highest accuracy of 97.059%. Reema et al. [42] applied a support vector machine (SVM) to a dataset of 30 Arabic sign language (ArSL) alphabet gestures collected from each of 30 persons. They noted that the accuracy for each letter varies according to the type of hand gesture. The accuracy of the SVM over all alphabets was 63.5%.
Deep learning with convolutional neural networks is widely used for sign language recognition. Bheda et al. [43] presented a CNN classifier to detect the American sign language (ASL) alphabet and the digits 0 to 9. They used a self-generated dataset of pictures taken from five people with different skin colors under different lighting. Their CNN architecture consists of three groups of convolutional, max-pooling, and dropout layers, followed by two groups of fully connected and dropout layers. The classification accuracy was 82.5%.
Qing Geo et al. [44] proposed a two-stream CNN (2S-CNN) based on the Inception-ResNetv2 model, which combines features of Inception and ResNet, extracts features better, and helps avoid overfitting. They pretrained the model on the ImageNet dataset and applied it to the American sign language (ASL) dataset. The 2S-CNN classifier gave the best classification accuracy of 92%.
Md Asif et al. [45] proposed a posture-learning framework consisting of convolutional layers and pooling with capsule-network routing for sign language recognition. They used the AlexNet pretrained CNN model on the Kaggle American sign language dataset. The proposed framework achieved a high classification accuracy.

MATERIALS AND METHODS
We have used two datasets, American sign language (ASL) [50] and Arabic sign language (ArSL) [51]. Examples from these datasets are shown in Figure 1 and Figure 2. The following sections describe both datasets.

American sign language (ASL) dataset
This dataset comprises 34,627 grayscale images of size 28x28 pixels with intensity values between 0 and 255. It contains gestures for the English letters A-Z, excluding J and Z because the gestures for these two letters involve motion and are not static. The distribution of classes is shown in Figure 5; the number of images is not equal across alphabets. The data is divided randomly into a 60% training set (20283 images), a 20% validation set (7172 images), and a 20% testing set (7172 images).
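A minimal sketch of the train/validation/test split described above is given below. It assumes the ASL images have already been loaded into an array `X` of shape (N, 28, 28) with integer labels `y`; the file format, loading code, and use of stratification are assumptions, since the paper does not report them.

```python
# Sketch of the 60/20/20 split using scikit-learn.
from sklearn.model_selection import train_test_split

# First hold out 20% for testing, then 25% of the remainder for validation
# (0.25 * 0.80 = 0.20 of the full set), leaving 60% for training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y)
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=42, stratify=y_train)

print(len(X_train), len(X_val), len(X_test))
```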

Arabic sign language (ArSL) dataset
For Arabic sign language, a dataset named ArSL2018 [52] is used. This dataset consists of 54,049 gesture images for the 32 Arabic alphabets, collected from a group of participants of different ages. The images are grayscale with intensity values between 0 and 255 and are resized to 32x32 pixels because they were not all of the same size. After removing noisy images, the remaining 41280 images are used. The distribution of classes, shown in Figure 6, is equal across alphabets. The dataset is divided randomly into a 70% training set (28896 images) and a 30% testing set (12480 images). The training set is further split into training (20227 images) and validation (8669 images) sets.
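The resizing and grayscale conversion mentioned above can be sketched as follows with OpenCV; the file paths, the exact interpolation, and the noise-filtering step are assumptions not specified in the paper.

```python
# Sketch of ArSL2018 preprocessing: grayscale read and resize to 32x32.
import cv2
import numpy as np

def load_arsl_image(path):
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)   # grayscale, values 0-255
    img = cv2.resize(img, (32, 32))                # unify the varying sizes
    return img

# e.g. X = np.stack([load_arsl_image(p) for p in image_paths])
```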

Preprocessing of the Datasets
The data augmentation technique is used to increase the size of the training set. It generates variability to improve the generalization power of the model and to minimize overfitting. Data augmentation is done by shifting, flipping, and zooming the images. Since the images are captured from one direction, a horizontal flip, which reverses the order of the pixels in each row of an image, helps the model become insensitive to the direction of image capture [50]. After data augmentation, the images are normalized by dividing the pixel values by 255.
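A minimal sketch of this augmentation and normalization step using Keras' ImageDataGenerator is shown below. The shift and zoom ranges are assumptions, since the paper does not report the exact values; only the operations themselves (shift, flip, zoom, division by 255) come from the text.

```python
# Sketch of augmentation (shift, zoom, horizontal flip) and normalization.
from tensorflow.keras.preprocessing.image import ImageDataGenerator

train_datagen = ImageDataGenerator(
    rescale=1.0 / 255,        # normalize pixel values to [0, 1]
    width_shift_range=0.1,    # random horizontal shifts (assumed range)
    height_shift_range=0.1,   # random vertical shifts (assumed range)
    zoom_range=0.1,           # random zoom (assumed range)
    horizontal_flip=True)     # flip to reduce sensitivity to capture direction

# Validation/test images are only normalized, never augmented.
test_datagen = ImageDataGenerator(rescale=1.0 / 255)

# X_train has shape (N, H, W, 1); y_train is one-hot encoded.
train_flow = train_datagen.flow(X_train, y_train, batch_size=200)
```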

Similarity of sign gestures in Arabic sign language and American sign language
There are some similarities between the gestures of the alphabets in the American and Arabic sign languages. In American sign language, the alphabets A, M, and S have similar signs; (C and O) and (N and E) also have similar signs (Figure 1). In Arabic sign language, the "Dhal" and "Zay" signs are identical, and the "Fa" and "Qaf" signs are also similar (Figure 2 and Figure 7). The alphabets "Dal" and "Dhal" have similar signs pointing to the right. Some alphabets in the two sign languages are similar in both sound and sign, such as "Lam" and "L" (Figure 7a), "Sad" and "S" (Figure 7b), and "Ya" and "Y" (Figure 7c).

Convolutional Neural Network
A convolutional neural network (CNN) detects objects, shapes, and edges through a sequence of filters (kernels) with trainable parameters that convolve over the input images to extract their features. For the American sign language (ASL) dataset, two CNN architectures, CNN-2 and CNN-3, are used. Settings of CNN-2 are described in Table 3: it has two convolutional layers with 128 and 64 filters and two max-pooling layers, with a pool size of 3x3 after the first convolutional layer and 2x2 after the second. Settings of CNN-3 are described in Table 2: it has three convolutional layers with 128, 64, and 32 filters and three max-pooling layers, with a pool size of 3x3 after the first convolutional layer and 2x2 after the other two. Arabic sign language (ArSL) images are of different sizes; all are resized to 32x32 pixels and converted to grayscale. For this dataset, the same two CNN architectures, CNN-2 and CNN-3, are used as for the ASL dataset; their settings are given in Table 1 and Table 4.
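The sketch below illustrates a Keras implementation of the CNN-3 architecture as described above (three convolutional layers with 128, 64, and 32 filters, a 3x3 max-pooling after the first layer, and 2x2 max-pooling after the others). The kernel size, activations, and dense head are assumptions, as the paper's tables are not reproduced here; this is an illustrative sketch, not the authors' exact model.

```python
# Sketch of a CNN-3-style architecture in Keras (assumed details noted above).
from tensorflow.keras import layers, models

def build_cnn3(input_shape=(28, 28, 1), num_classes=24):
    model = models.Sequential([
        layers.Conv2D(128, (3, 3), activation="relu", padding="same",
                      input_shape=input_shape),
        layers.MaxPooling2D(pool_size=(3, 3)),   # 3x3 pool after first conv
        layers.Conv2D(64, (3, 3), activation="relu", padding="same"),
        layers.MaxPooling2D(pool_size=(2, 2)),
        layers.Conv2D(32, (3, 3), activation="relu", padding="same"),
        layers.MaxPooling2D(pool_size=(2, 2)),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),    # assumed dense head
        layers.Dense(num_classes, activation="softmax"),
    ])
    return model

# ASL: build_cnn3((28, 28, 1), 24); ArSL: build_cnn3((32, 32, 1), 32)
```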

RESULTS AND DISCUSSIONS
Two CNN architectures are applied to the two datasets. Different hyperparameter settings, such as the learning rate, number of epochs, batch size, and optimizer (Adam and SGD), are tried to find the best configuration. Figures 9 and 10 show how classification accuracy changes with the learning rate for the ASL and ArSL datasets, respectively. For the ASL dataset, classification accuracy varies slightly and then decreases at higher learning rates (Figure 9), whereas for the ArSL dataset the change in classification accuracy across learning rates is not significant (Figure 10).

Results for the ArSL dataset are summarized in Table 6. For CNN-2, a learning rate of 0.001, a batch size of 200, and the Adam optimizer produced the best results, with a classification accuracy of 96.4%, a precision of 96.3%, and a recall of 96.7%. The CNN-3 architecture with similar settings produced the same performance (96.4%). The confusion matrix for the CNN-2 architecture with the Adam optimizer is shown in Figure 12. Almost all alphabets are identified correctly, with few misclassifications. The alphabet Jeem is confused with the alphabet Laa (14 instances), although the shapes of the two signs are different. Similarly, the alphabet Ra is confused in a few instances with the alphabet Saad.

A comparison between the classification accuracy of our CNN architectures and published results is given in Table 7. The table shows that the CNN-2 and CNN-3 architectures with the settings mentioned above outperform the published results on these datasets.
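The best-performing setting reported above (Adam, learning rate 0.001, batch size 200) can be sketched as follows, reusing the `build_cnn3` and `train_flow` sketches from the previous sections. The number of epochs and the one-hot label format are assumptions; the confusion matrix and per-class report correspond to the evaluation discussed above.

```python
# Sketch of training with the reported best setting and evaluating with a
# confusion matrix. Assumes X_*, y_* are shaped (N, 32, 32, 1) with one-hot
# labels, and that build_cnn3 and train_flow are defined as sketched earlier.
import numpy as np
from tensorflow.keras.optimizers import Adam
from sklearn.metrics import confusion_matrix, classification_report

model = build_cnn3(input_shape=(32, 32, 1), num_classes=32)  # ArSL setting
model.compile(optimizer=Adam(learning_rate=0.001),
              loss="categorical_crossentropy", metrics=["accuracy"])
model.fit(train_flow,
          validation_data=(X_val / 255.0, y_val),
          epochs=30,                                # assumed epoch count
          steps_per_epoch=len(X_train) // 200)      # batch size 200

y_pred = np.argmax(model.predict(X_test / 255.0), axis=1)
y_true = np.argmax(y_test, axis=1)
print(confusion_matrix(y_true, y_pred))
print(classification_report(y_true, y_pred, digits=3))
```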

CONCLUSIONS
Classification models based on CNN architectures are proposed in this paper for sign language recognition, to make communication between hearing and deaf people easier. These models are applied to two datasets: American sign language and Arabic sign language. After trying different settings, the CNN architectures with a learning rate of 0.001, a batch size of 200, and the Adam optimizer produced the best results: a classification accuracy of 96.4% on the Arabic sign language dataset and 99.6% on the American sign language dataset. In the future, we will test these architectures on larger datasets collected from more people. We also plan to optimize time and space complexity so that these architectures can be used on mobile phones.