Using AI for Caribbean Dialect Classification and Predictive Text
And Why Standard NLP Practices Don’t Work for Caribbean Slang
As a Trinidadian, I find it annoying when my phone attempts to autocorrect my favourite Caribbean slang. While some keyboards eventually adapt and let me use new words after stubbornly correcting what I type, the process is frustrating. What if a keyboard could immediately detect your dialect and adjust to suit, perhaps even suggesting useful slang along the way?
The use of Caribbean dialect in online communication is a means by which we consciously or subconsciously form identity [2][4]. It is only within recent years that the digital age has allowed us to concretize the use of dialectal text in day-to-day communication. It is therefore unfortunate that much of the typing technology Caribbean people use today is still geared towards Standard English.
I decided to take a deep dive into the world of machine learning to search for a solution to this problem. My aim was not to put out a production-level model, but just to carry out some exploratory work and show a proof of concept. For this project, I focused on two countries with two of the most distinctive Caribbean dialects there are (in my opinion at least): Trinidad and Tobago and Jamaica.
This project was split into three main tasks: 1. Data Collection, 2. Dialect Classification and 3. Predictive Text Generation. In this article, we will also see why some of the conventional practices in natural language processing (NLP) should be avoided, or at the very least why they aren’t as straightforward when dealing with Caribbean dialect.
Contents
- Data Collection
- Dialect Classification
- Predictive Text Generation for Caribbean Dialect
Data Collection
This was one of my biggest hurdles in getting started on this project. To train models that can classify Caribbean dialect or predict Caribbean phrases, I needed a large enough source of data. This posed several issues. I could not access any large pre-existing datasets containing only informal Caribbean-written text, so I had to make my own. Additionally, finding a way to reliably distinguish among dialects across a large volume of data was not a trivial task. After scouring the web for a while in search of a good candidate for data, I finally settled on the YouTube comments of locally based entertainers in both Trinidad and Tobago and Jamaica. For this, I selected the following YouTube channels:
- Machel Montano and Buju Banton: two giants in Trinidadian and Jamaican music respectively
- What Yuh Know and Jnel Comedy: two icons in Trinidad and Jamaica known for their great work in street-interview comedy
This dataset has its flaws since it assumes that all comments on a channel are of the dialect spoken by the content creator themselves. However, the themes of music and comedy seem to bring out the colourful dialects of Trinidadian and Jamaican viewers much more than other potential data sources like news articles, which typically use only Standard English. From manual inspection, many of the YouTube comments did indeed use the dialect of the YouTuber, as opposed to there being a heterogeneous spread of dialects within each channel’s comments. Additionally, the popularity of these YouTube entertainers meant that I was not short on data to work with.
In order to retrieve these comments, I utilized the YouTube Data API and some Python code. You can find a detailed explanation of how to use the YouTube Data API in this article. With this code, I created two datasets: one which contained YouTube comments from the two music video channels, and the other for comments from the comedy channels. In total, I managed to retrieve just over 80 000 comments. It was then time for the machine learning fun to begin!
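A minimal sketch of how such a retrieval loop might look is given below. It is illustrative rather than the exact code used for this project: the function name `collect_comments` and the `max_comments` cap are assumptions, and it expects a client object created with the `google-api-python-client` library (shown only in a comment, since it needs an API key). The loop follows `nextPageToken` until the API stops returning one.

```python
def collect_comments(youtube, video_id, max_comments=1000):
    """Collect top-level comment text for one video, following pagination.

    `youtube` is assumed to be a YouTube Data API v3 client, e.g.:
        from googleapiclient.discovery import build
        youtube = build("youtube", "v3", developerKey=API_KEY)
    """
    comments = []
    page_token = None
    while len(comments) < max_comments:
        params = {
            "part": "snippet",
            "videoId": video_id,
            "maxResults": 100,          # largest page size the endpoint allows
            "textFormat": "plainText",  # plain text instead of HTML
        }
        if page_token:
            params["pageToken"] = page_token
        response = youtube.commentThreads().list(**params).execute()
        for item in response.get("items", []):
            snippet = item["snippet"]["topLevelComment"]["snippet"]
            comments.append(snippet["textDisplay"])
        # No token in the response means the last page has been reached.
        page_token = response.get("nextPageToken")
        if not page_token:
            break
    return comments[:max_comments]
```

Calling this once per video and tagging each comment with the channel's country would give labelled data of the kind described above.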
Dialect Classification
Deciding on the Dialect Classification Model
Creating a natural language processing model to classify Caribbean dialect comes with some unique challenges. Many basic approaches to text classification, such as the Naïve Bayes method, rely heavily on word frequency. This means that the models “guess” based on how common certain terms are in text. For applications like sentiment analysis, e.g. deducing whether movie reviews are positive or negative, the frequency of words like “good” or “bad” can often be enough to make reasonably accurate classifications.
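To make the word-frequency idea concrete, here is a minimal sketch of a Naïve Bayes classifier in plain Python. The four training "reviews" are made up for illustration, and the function names are hypothetical. The classifier only counts how often each word appears per class, so word order and sentence structure play no role.

```python
import math
from collections import Counter

def train_naive_bayes(labelled_texts):
    """Count words per label; returns (word_counts, doc_counts, vocab)."""
    word_counts = {}
    doc_counts = Counter()
    for text, label in labelled_texts:
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.lower().split())
    vocab = {word for counts in word_counts.values() for word in counts}
    return word_counts, doc_counts, vocab

def predict(text, model):
    """Return the label with the highest log-probability for the text."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label, counts in word_counts.items():
        total_words = sum(counts.values())
        score = math.log(doc_counts[label] / total_docs)  # class prior
        for word in text.lower().split():
            # Add-one (Laplace) smoothing so unseen words do not zero out a class.
            score += math.log((counts[word] + 1) / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

# Made-up toy data, for illustration only.
reviews = [
    ("a good film", "positive"),
    ("really good acting", "positive"),
    ("a bad film", "negative"),
    ("bad plot and bad acting", "negative"),
]
model = train_naive_bayes(reviews)
print(predict("good plot", model))
print(predict("bad film", model))
```

Because each word is scored independently, shuffling the words of a comment would not change the prediction.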
However, in this case, word frequency does not have as much importance because we are seeking to classify the very structure of the text. While there is some benefit to be had by simply taking word frequency into account, it is far more important that the model “understands” how things are said, rather than what the comments are actually saying. To compound this challenge, Trinidadian and Jamaican creole are often considered lexically similar languages [1]. While a model may perhaps be able to easily distinguish between Standard English and Trinidadian creole based on the frequency of terms unique to Trinidad’s dialect, the task becomes more difficult when it is asked to distinguish between two non-standard dialects which share structural similarities.
I therefore decided to use the popular open-source library spaCy, which offers easily accessible state-of-the-art architectures for NLP. What I like about spaCy is that its models gain insight not only from word frequency but also from the context of words within sentences. This boost in “understanding” comes from the use of neural networks. For the TextCategorizer that I chose, spaCy makes use of a convolutional neural network (CNN) which allows a certain level of context to be encoded along with words. As such, words are no longer seen as isolated parts of a sentence, but as components of a wider picture. I should note that I used a blank English spaCy pipeline, meaning that the model had minimal “pre-conceived notions” when attempting classification.
Just for fun, I trained this model on the raw YouTube comment data with absolutely no preprocessing. This returned an accuracy of around 90%. However, I was aware that this high level of accuracy right off the bat was misleading. This leads me on to how I preprocessed the comments for optimal (but realistic) accuracy.
Determining an Optimal Preprocessing Method
When doing NLP, there are often some recommended procedures which are carried out on the text before any training is done. Some of these standard practices include:
- Tokenization:
This is the process of splitting text up into “tokens” so that some kind of numerical value can be assigned to each one. For example, “Lewwe go an lime!” becomes: “Lewwe”, “go”, “an”, “lime”, “!”.
- Conversion of all text to Lowercase:
Uppercase letters rarely carry much extra information, so it is often more beneficial or efficient to make all letters lowercase.
- Lemmatization:
This is the process of replacing forms of a word with their root word. For example, “finished”, “finishes” and “finishing” would all be replaced by the root word “finish”. The intention behind this is to remove unnecessary forms of words which share the same core meaning.
- Stopword and/or Punctuation Removal:
Stopwords are words used very commonly in sentences that generally don’t hold the main substance of what a sentence is saying. Examples include “they”, “it”, “she”, “a”, “in” and “are”. Stopwords and punctuation are often removed to filter text down to the “important” key words.
I felt uneasy about implementing the latter two standard preprocessing techniques for dialectal data. I suspected that useful information for distinguishing between the dialects would be lost. Let me explain why.
Caribbean dialect cannot be easily lemmatized. Commonly, datasets containing large volumes of Standard English words are used to write programs which recognize and lemmatize text. However, consider, for instance, the Trinidadian term “limin” (roughly meaning “hanging out with friends”). This word is completely absent from Standard English. If it could be lemmatized properly, its root word would be “lime”, a verb (not a noun) which has little to do with fruit! Programmatically lemmatizing Caribbean dialect in a reliable manner is not something that can be done with existing software (to my knowledge), and would likely require a unique dataset of Caribbean vocabulary detailing how dialectal words change with tense and/or context. Relying on lemmatizers which are based on Standard English only would not suffice. While it is unclear whether properly lemmatizing my data would boost my model’s accuracy, I omitted lemmatization from preprocessing as such a task deserves a project of its own.
“She does help you plenty?” (Translation: “Does she help you a lot?”)
Here we see an example of Trinidadian dialect exhibiting what linguists would term the use of “immovable pre-verbal particles”[3]. If the question mark were removed from this sentence, it could look like a declarative statement (“She does help you plenty.”). Sentence structures like these, where a particle such as the word “does” in this example is fixed before the verb (“help”), are quite common in Trinidadian dialect. Imagine if the stopwords “she”, “does” and “you” were removed from this sentence, along with the question mark. We would be left with the phrase “help plenty”. Almost all semblance of structure belonging to Trinidadian dialect is lost at this point.
“Ah so it ah go inna this life” (Translation: “So it goes in this life”)
This is an example of Jamaican dialect. In Jamaican creole, the word “ah” (or “a”) is often used for what is called “clefting” in formal studies of linguistics[3]. Basically, this is the unusual positioning of regular words in a sentence for emphasis. A Standard English example would be the sentence “It was the dog who ate my homework.” rather than “The dog ate my homework.” Here, “it” is the cleft of the sentence. Why is this important? Let’s consider how the Jamaican sentence above could be preprocessed. Removing “so”, “it”, “go” and “this” as stopwords leaves us with “Ah ah inna life”. With this, there is not much information left for dialect classification. Perhaps one familiar with Caribbean dialect could guess that it’s Jamaican because of the term “inna”, but the words “ah” and “life” are just as likely to be found within Trinidadian creole.
We see therefore that stopword removal (and possibly punctuation removal) may in essence strip the text of its structure, leaving only keywords. This is undesirable as the unique structure of Trinidadian and Jamaican dialect cannot easily be discerned after such preprocessing (at least by humans). It would be a waste of the spaCy model’s potential to understand text contextually.
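The effect on the two example sentences can be reproduced with a short sketch. The tokenizer here is a simple regular expression, not spaCy's, and `STOPWORDS` is a small hand-picked subset chosen for illustration (real stopword lists are far longer); both are assumptions made for the example.

```python
import re

# Illustrative subset only; real stopword lists contain hundreds of words.
STOPWORDS = {"she", "does", "you", "so", "it", "go", "this"}

def tokenize(text):
    """Split text into word tokens and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

def strip_tokens(text, remove_stopwords=True, remove_punctuation=True):
    """Return the tokens that survive the chosen removal steps."""
    kept = []
    for token in tokenize(text):
        is_word = re.fullmatch(r"\w+", token) is not None
        if remove_punctuation and not is_word:
            continue
        if remove_stopwords and is_word and token.lower() in STOPWORDS:
            continue
        kept.append(token)
    return kept

print(tokenize("Lewwe go an lime!"))
print(strip_tokens("She does help you plenty?"))
print(strip_tokens("Ah so it ah go inna this life"))
# Removing punctuation only keeps the sentence structure intact.
print(strip_tokens("She does help you plenty?", remove_stopwords=False))
```

With both removals switched on, the Trinidadian sentence shrinks to two keywords, while removing punctuation alone keeps the pre-verbal “does” in place.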
I decided to create an objective test for the utility of removing stopwords and punctuation. I trained different spaCy models using the same hyperparameters, on the same data, but with different preprocessing methods applied. By default, I made all letters lowercase, got rid of all emojis and removed certain terms, which I call “cheat terms”. “Cheat terms” are those which can easily reveal the dialect by directly identifying information about the YouTube channel: words like “Machel”, “Buju”, “Trini” and “Jamaican”. Not removing these terms initially was the cause of the model’s high accuracy in the absence of any preprocessing.
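A rough sketch of this default cleaning step is shown below. The function name `clean_comment` is hypothetical, the cheat-term list contains only the four examples named above (the full list is not reproduced here), and the emoji pattern is an approximation that covers the most common emoji code-point ranges rather than every emoji.

```python
import re

# Only the four example terms mentioned in the text; the real list was longer.
CHEAT_TERMS = ["machel", "buju", "trini", "jamaican"]

# Approximate emoji ranges: pictographs, symbols, dingbats, variation selector.
EMOJI_PATTERN = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF\uFE0F]")
CHEAT_PATTERN = re.compile(
    r"\b(?:" + "|".join(map(re.escape, CHEAT_TERMS)) + r")\b"
)

def clean_comment(text):
    """Lowercase a comment, then strip emojis and whole-word cheat terms."""
    text = text.lower()
    text = EMOJI_PATTERN.sub("", text)
    text = CHEAT_PATTERN.sub("", text)
    # Collapse the gaps left behind by the removals.
    return " ".join(text.split())

print(clean_comment("Machel is the best Trini artiste \U0001F525\U0001F525"))
```

The word boundaries (`\b`) mean that only whole words are removed, so a longer word such as “trinidad” would survive unless it were added to the list explicitly.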
In the graphs below, the coloured bars indicate which terms were removed from the dataset. Along with accuracy (number correct/total number), you will notice that I used three other statistical metrics. Sparing you the mathematical details, these metrics take into account imbalanced classes in the testing data (which was an issue with the music dataset especially, where there were more Trinidadian comments than Jamaican). The F1 score gives the best overall indication of the model’s success in this scenario.
As we can see, it appears that removing only punctuation gives the best results out of the four different preprocessing methods. Given my above discussion, it did not surprise me to see that models trained on data from which stopwords were removed consistently underperformed those trained on data that kept them. Personally, I was a bit on the fence as to whether the optimal preprocessing method would have involved removal of neither stopwords nor punctuation, or punctuation only, but the metrics clearly favoured the latter. The F1 score for the best model trained on the music videos dataset was about 0.75 while the F1 score for the best model trained on the comedy videos dataset was about 0.70 (1 being highest, 0 being lowest). While there is certainly much room to improve these scores, I believe they are sufficiently high to indicate that the model has at least “learned” a bit about the dialectal structures it classified.
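For readers who do want a little of the detail, the sketch below shows why accuracy alone can mislead on imbalanced data. The F1 score is the harmonic mean of precision and recall; using those two as the intermediate quantities is an assumption for illustration, and the labels and predictions are made-up toy values, not results from this project.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def f1_score(y_true, y_pred, positive):
    """F1 for one class: harmonic mean of its precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Toy imbalanced test set: 8 Trinidadian ("TT") and 2 Jamaican ("JM") comments.
y_true = ["TT"] * 8 + ["JM"] * 2
# A lazy model that always answers "TT".
y_pred = ["TT"] * 10

print(accuracy(y_true, y_pred))        # looks respectable
print(f1_score(y_true, y_pred, "JM"))  # exposes that "JM" is never found
```

The lazy model scores 80% accuracy here while never identifying a single Jamaican comment, which is exactly the failure that an F1 score brings to light.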
To wrap up this section of my project, we have seen that a spaCy model was used so that words would be given some contextual meaning during training. Using examples of Caribbean speech, and using objective metrics, we have also seen why some of the common practices in NLP should not be blindly followed. Lemmatization for Trinidadian and Jamaican dialect is not easy to implement and stopword removal was shown to deteriorate training results of the spaCy model.
Predictive Text Generation for Caribbean Dialect
My main aim for this part of the project was to create a function that can predict words or characters that should follow an incomplete phrase of Caribbean dialect. My hope for this model is that it would be able to predict some common slang used in Trinidad and Jamaica — not just text completions that are in Standard English.
I chose to use TensorFlow to create a model that can generate predictive text on a character-by-character level. Unlike the spaCy library I used for dialect classification, TensorFlow allowed me easier access to lower levels of abstraction for neural networks. I used a model architecture involving a Long Short-Term Memory (LSTM) layer. This is a type of Recurrent Neural Network (RNN). The diagram below gives an idea of how a simple RNN should ideally work.
Each letter is assigned a unique numerical representation. Input gets fed into a hidden state, H. The hidden state is just a mathematical function, involving matrix multiplication, which is used to produce an output. This output can then be converted from a numerical form back to a text character. In addition to the output, this function feeds forward a value to the next hidden state. It is this process of feeding forward certain values that distinguishes RNNs from basic neural networks. Sequences can be “understood” by RNNs since each output is dependent not only on the current input value, but also on previous input values. There is a recurrence of data. In the above example with the word “Jamaica”, consider the letter “a”. This letter occurs twice as input, but the output value does not repeat. This is because a well-trained RNN considers the entire sequence of data to create an output value, rather than just a single input value.
One common issue that occurs with regular RNNs is something called the vanishing gradient problem. This is the tendency of an RNN to “forget” information it encoded when that information is very far away in a sequence. One way to avoid this issue is to use an LSTM, which has a longer memory than a plain RNN (hence its name).
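The recurrence can be illustrated with a tiny hand-rolled RNN step in plain Python. This is a sketch only: the weights are arbitrary fixed numbers rather than trained values, each character is encoded as a single number, the hidden state has just two units, and none of it reflects the actual TensorFlow model. The update used is the standard one for a plain RNN, h_t = tanh(W_xh * x_t + W_hh * h_(t-1)).

```python
import math

def rnn_step(x, h_prev, w_xh, w_hh):
    """One plain-RNN update: h_t = tanh(W_xh * x_t + W_hh * h_{t-1})."""
    size = len(h_prev)
    return [
        math.tanh(w_xh[i] * x + sum(w_hh[i][j] * h_prev[j] for j in range(size)))
        for i in range(size)
    ]

def run_rnn(text, w_xh, w_hh):
    """Feed text through the RNN one character at a time; return all states."""
    vocab = sorted(set(text))
    # Give each distinct character its own number in (0, 1].
    char_to_x = {ch: (i + 1) / len(vocab) for i, ch in enumerate(vocab)}
    h = [0.0] * len(w_xh)
    states = []
    for ch in text:
        h = rnn_step(char_to_x[ch], h, w_xh, w_hh)
        states.append(h)
    return states

# Arbitrary, untrained weights chosen only for the demonstration.
W_XH = [0.5, -0.3]
W_HH = [[0.1, 0.4], [-0.2, 0.3]]

states = run_rnn("Jamaic", W_XH, W_HH)
print(states[1])  # hidden state after the first "a"
print(states[3])  # hidden state after the second "a"
```

The two printed states differ even though the input character is the same, because each one also depends on the state carried forward from the earlier characters. Setting every entry of `W_HH` to zero removes the recurrence and makes the two states identical.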
The training data I used was again my YouTube comment datasets (with emojis and non-ASCII characters removed). Each comment was split into subsequences with single characters used as the targets to be predicted. Below is an example of how training data would have been formatted for the phrase “Ah gone” (Trinidadian creole for “I am leaving”).
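One plausible way to build such subsequences is sketched here, under the assumption that every prefix of a comment is paired with the character that follows it. The exact windowing used for the project may have differed, and the helper name `make_training_pairs` is hypothetical.

```python
def make_training_pairs(comment, min_context=1):
    """Pair each prefix of the comment with the next character as its target."""
    return [
        (comment[:i], comment[i])
        for i in range(min_context, len(comment))
    ]

pairs = make_training_pairs("Ah gone")
for context, target in pairs:
    print(repr(context), "->", repr(target))

# Characters are mapped to integers before being fed to a network.
vocab = sorted(set("Ah gone"))
char_to_id = {ch: i for i, ch in enumerate(vocab)}
encoded = [([char_to_id[c] for c in ctx], char_to_id[tgt]) for ctx, tgt in pairs]
```

A seven-character phrase yields six pairs, from `'A' -> 'h'` up to `'Ah gon' -> 'e'`. In practice the contexts would also be padded or truncated to a fixed length so that they can be batched.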
Finally, I trained my model on this data then wrote a function that can predict the most probable characters that should come after a given input text. After playing around with this function for a while, here are some examples I found interesting:
- Trinidad and Tobago Prediction Examples:
- Jamaica Prediction Examples:
Those familiar with the YouTube channels I used for data may see their influence on the model’s prediction choices. I was pleasantly surprised by some of the predicted slang (and many were mildly amusing). However, not all predictions were understandable. Here are a couple of examples where the predictions made little to no sense:
Conclusion
In closing, we see that I have been able to extract examples of Caribbean dialect from YouTube, create a spaCy model which can classify Trinidadian and Jamaican dialect, and create a TensorFlow model which can appropriately predict Caribbean slang. There is definitely room for improvement but I believe that I have achieved what I set out to do: show that it is indeed possible to create AI-based software which recognizes your dialect from your style of writing and which uses a predictive text algorithm tailored to Caribbean dialect.
Possible Improvements/Expansions
- Add more Caribbean dialects: finding a sufficiently large source of training data is a significant challenge for many Caribbean countries, but once such data is available it should not be difficult to extend this idea to any Caribbean creole.
- Implement explainable AI to improve our understanding of what machine learning models consider important when making classifications and predictions.
- Develop this project into a fully interactive keyboard application.
- Improve quality of data sources — perhaps by working with linguists to create a dedicated dataset of Caribbean dialectal data. This would be much more reliable than YouTube comments as we can be certain as to the proper classification for all dialects involved.
- Increase computational resources: getting access to better GPU and memory resources would allow me to train models with more complex architectures on larger volumes of data.
References
- Cresswell, A. (2018). 21st century language policy in Jamaica and Trinidad and Tobago. CORE. https://core.ac.uk/reader/287600893
- James, S. Trinidad English Creole Orthography: Language Enregisterment and Communicative Practices in a New Media Society. https://scholar.colorado.edu/concern/graduate_thesis_or_dissertations/p2676v823
- Kortmann, B. & Schneider, E. (2004). A Handbook of Varieties of English: A Multimedia Reference Tool. Volume 1: Phonology. Volume 2: Morphology and Syntax. Berlin, Boston: De Gruyter Mouton. https://doi.org/10.1515/9783110197181
- Moll, A. (2017). “Diasporic Cyber-Jamaican: Stylized Dialect of an Imagined Community”. In Contested Communities. Leiden, The Netherlands: Brill. https://doi.org/10.1163/9789004335288_007