The core component of the pipeline is the machine learning model. Its aim is to classify the sentiment of a text with respect to any given aspect. This is challenging because sentiment is often expressed subtly or only implicitly. Before we get into the model details, we will look more closely at how this task was approached in the past, what a language model is, and why it makes a difference.
A model is a function that maps an input to a desired output. Because this function is unknown, we try to approximate it using data. It is hard to build a dataset that is clean and big enough (for NLP tasks in particular) to train a model directly in a supervised manner. Therefore, the most basic way to overcome the lack of data is to construct hand-crafted features. Engineers extract key tokens, phrases, and n-grams, and train a classifier to assign weights that reflect how likely each feature is to be positive, negative, or neutral. Based on these human-defined features, a model can then make predictions that are fully interpretable. This is a valid approach, and it was quite popular in the past. However, such simple models cannot precisely capture the complexity of natural language, so they quickly reach the limit of their accuracy, and that limit is hard to overcome.
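To make the hand-crafted approach concrete, here is a minimal sketch of an n-gram classifier. It is not taken from the original pipeline: the perceptron update rule, the toy examples, and all function names are assumptions chosen for illustration.

```python
from collections import Counter

LABELS = ["positive", "negative", "neutral"]


def extract_ngrams(text, max_n=2):
    """Count all word n-grams up to length max_n (whitespace tokenization)."""
    tokens = text.lower().split()
    ngrams = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            ngrams[" ".join(tokens[i:i + n])] += 1
    return ngrams


def predict(weights, text):
    """Score each label as a weighted sum of n-gram counts; return the best."""
    features = extract_ngrams(text)
    scores = {
        label: sum(weights[label].get(f, 0.0) * c for f, c in features.items())
        for label in LABELS
    }
    return max(scores, key=scores.get)


def train(examples, epochs=10):
    """Multi-class perceptron: on a mistake, reward the gold label's
    features and penalize the wrongly predicted label's features."""
    weights = {label: {} for label in LABELS}
    for _ in range(epochs):
        for text, gold in examples:
            guess = predict(weights, text)
            if guess == gold:
                continue
            for f, c in extract_ngrams(text).items():
                weights[gold][f] = weights[gold].get(f, 0.0) + c
                weights[guess][f] = weights[guess].get(f, 0.0) - c
    return weights


# Hypothetical toy data, only for illustration.
toy_examples = [
    ("the food was great", "positive"),
    ("the service was terrible", "negative"),
    ("we ordered the soup", "neutral"),
]
weights = train(toy_examples)
```

Each prediction can be traced back to individual entries such as `weights["positive"]["great"]`, which is what makes this kind of model interpretable. The same sketch also hints at its limits: an unseen phrase contributes nothing, and word order beyond the n-gram window is ignored.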
The breakthrough came when we started to transfer knowledge from language models to more specific, downstream NLP tasks. Nowadays, the key component of a modern NLP application is typically a language model. Briefly, a language model gives a rough understanding of natural language. During computationally heavy training, it processes enormous datasets, such as the entire Wikipedia or more, to figure out relationships between words. As a result, it is able to encode words, which are otherwise meaningless strings, into information-rich vectors. Because these encoded words (context-aware embeddings) live in the same continuous space, we can manipulate them easily. For example, to summarize a text you might sum its vectors, and to compare two words you might take their dot product. Rather than relying on feature engineering and the linguistic expertise of engineers (implicit knowledge transfer), we can use the language model as a ready-to-use, portable, and powerful feature provider.
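The two vector operations mentioned above can be sketched in a few lines. The three-dimensional vectors below are made up for illustration and do not come from any real language model, where embeddings have hundreds of dimensions and depend on context.

```python
import math

# Hypothetical toy embeddings; real ones are produced by the language model.
toy_embeddings = {
    "good": [0.9, 0.1, 0.0],
    "great": [0.8, 0.2, 0.1],
    "bad": [-0.9, 0.1, 0.0],
    "food": [0.0, 0.9, 0.3],
}


def dot(u, v):
    """Dot product of two vectors of equal length."""
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    """Dot product normalized by vector lengths, in the range [-1, 1]."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


def sum_vectors(vectors):
    """Element-wise sum, a crude summary of a sequence of word vectors."""
    return [sum(components) for components in zip(*vectors)]


similar = cosine(toy_embeddings["good"], toy_embeddings["great"])
opposite = cosine(toy_embeddings["good"], toy_embeddings["bad"])
summary = sum_vectors([toy_embeddings["good"], toy_embeddings["food"]])
```

With these toy values, `similar` is close to 1 and `opposite` is negative, which mirrors the intuition that related words point in similar directions.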
With this context, we are ready to define the SOTA model, which is both powerful and simple. The model consists of the BERT language model, which provides features, and a linear classifier. Among the variety of language models, we use BERT because we can benefit directly from BERT’s next-sentence prediction to formulate the task as a sequence-pair classification. As a result, an example is described as one sequence in the form: “[CLS] text subtokens [SEP] aspect subtokens [SEP]”. The relationship between the text and the aspect is encoded into the [CLS] token. The classifier simply applies a linear transformation to the final representation of the special [CLS] token.
```python
import transformers
import aspect_based_sentiment_analysis as absa

name = 'absa/bert_abs_classifier-rest-0.1'
model = absa.BertABSClassifier.from_pretrained(name)
# The model has two essential components:
#   model.bert: transformers.TFBertModel
#   model.classifier: tf.keras.layers.Dense

# We've implemented the BertABSClassifier to make room for further research.
# Nonetheless, even without installing the `absa` package, you can load our
# bert-based model directly from the `transformers` package.
model = transformers.TFBertForSequenceClassification.from_pretrained(name)
```
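To illustrate the input format and the role of the classifier without downloading any weights, here is a minimal, dependency-free sketch. Whitespace splitting stands in for BERT's subword tokenizer, and the [CLS] vector, the weights, and the class order are all hypothetical values, not those of the released model.

```python
import math


def build_sequence_pair(text_tokens, aspect_tokens):
    """Build '[CLS] text [SEP] aspect [SEP]' plus segment ids (0 = text, 1 = aspect)."""
    tokens = ["[CLS]"] + text_tokens + ["[SEP]"] + aspect_tokens + ["[SEP]"]
    segment_ids = [0] * (len(text_tokens) + 2) + [1] * (len(aspect_tokens) + 1)
    return tokens, segment_ids


def linear_classifier(cls_vector, weights, bias):
    """One linear transformation of the [CLS] representation into class logits."""
    return [
        sum(w * x for w, x in zip(row, cls_vector)) + b
        for row, b in zip(weights, bias)
    ]


def softmax(logits):
    """Turn logits into probabilities (shifted by the max for numerical stability)."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


tokens, segment_ids = build_sequence_pair("the food was great".split(), ["food"])

# Hypothetical 4-dimensional [CLS] vector and classifier parameters.
cls_vector = [0.5, -0.2, 0.1, 0.9]
weights = [
    [0.1, 0.0, 0.2, -0.3],   # class 0 (assumed order)
    [-0.4, 0.3, 0.0, -0.5],  # class 1
    [0.6, -0.1, 0.1, 0.8],   # class 2
]
bias = [0.0, 0.0, 0.0]
probabilities = softmax(linear_classifier(cls_vector, weights, bias))
```

In the real model the [CLS] vector comes from BERT's final layer, and the language model and the classifier are trained together, so the vector already encodes how the text relates to the aspect.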
Although it is somewhat outside the scope of this article, it is worth noting how to train the model from scratch. We start with the original BERT version as a basis and divide the training into two stages. First, because BERT is pretrained on dry Wikipedia texts, we bias the language model towards more informal language (or a specific domain). To do this, we select raw texts close to the target domain and run a self-supervised language model post-training. The routine is the same as in pre-training, but the optimization parameters need to be set up carefully. Second, we do regular supervised training. We train the whole model jointly, the language model and the classifier, using a fine-grained labeled dataset. There are more details about the model here and training here.
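As a rough illustration of the self-supervised stage, the sketch below prepares one masked language modeling example in the way BERT's pre-training does: about 15% of tokens are selected, and of those 80% become [MASK], 10% become a random token, and 10% stay unchanged. This covers only the masking step, not next-sentence prediction or the optimizer setup, and the function name and toy vocabulary are assumptions rather than the authors' post-training code.

```python
import random

SPECIAL_TOKENS = {"[CLS]", "[SEP]"}


def mask_tokens(tokens, vocab, rng, mask_prob=0.15):
    """Return (inputs, labels); labels hold the original token at selected
    positions and None elsewhere, so the loss covers only selected tokens."""
    inputs, labels = [], []
    for token in tokens:
        # Special tokens are never selected for prediction.
        if token in SPECIAL_TOKENS or rng.random() >= mask_prob:
            inputs.append(token)
            labels.append(None)
            continue
        labels.append(token)
        roll = rng.random()
        if roll < 0.8:
            inputs.append("[MASK]")           # 80%: replace with the mask token
        elif roll < 0.9:
            inputs.append(rng.choice(vocab))  # 10%: replace with a random token
        else:
            inputs.append(token)              # 10%: keep the original token
    return inputs, labels


# Hypothetical in-domain sentence and toy vocabulary.
rng = random.Random(0)
sentence = ["[CLS]", "the", "pizza", "was", "soooo", "good", "[SEP]"]
vocab = ["the", "pizza", "was", "soooo", "good", "bad", "waiter"]
inputs, labels = mask_tokens(sentence, vocab, rng)
```

During post-training the model is asked to recover the original token wherever `labels` is not `None`, which nudges its representations towards the vocabulary and style of the target domain.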