Out of Context: A New Clue for Context Modeling of Aspect-based Sentiment Analysis

Aspect-based sentiment analysis (ABSA) aims to predict the sentiment expressed in a review with respect to a given aspect. The core of ABSA is to model the interaction between the context and the given aspect to extract the aspect-related information. In prior work, attention mechanisms and dependency graph networks are commonly adopted to capture the relations between the context and the given aspect, and the weighted sum of context hidden states is used as the final representation fed to the classifier. However, the information related to the given aspect may already be discarded, and adverse information may be retained, in the context modeling processes of existing models. This problem cannot be solved by subsequent modules, for two reasons: first, their operations are conducted on the encoder-generated context hidden states, whose values cannot change after the encoder; second, existing encoders only consider the context, not the given aspect. To address this problem, we argue that the given aspect should be considered as a new clue out of context in the context modeling process. As solutions, we design several aspect-aware context encoders based on different backbones: an aspect-aware LSTM and three aspect-aware BERTs. They are dedicated to generating aspect-aware hidden states tailored for the ABSA task. In these aspect-aware context encoders, the semantics of the given aspect is used to regulate the information flow. Consequently, aspect-related information can be retained and aspect-irrelevant information can be excluded from the generated hidden states. We conduct extensive experiments on several benchmark datasets with empirical analysis, demonstrating the efficacy and advantages of our proposed aspect-aware context encoders.


Introduction
With the increasing number of comments on the Internet, sentiment analysis has attracted growing interest from both research and industry. Aspect-based sentiment analysis is a fundamental and challenging task in sentiment analysis, which aims to infer the sentiment expressed in a sentence with respect to a given aspect. For example, consider the review "The salad is so delicious but the soup tastes bad.", in which the opinion on the 'salad' is positive while the opinion on the 'soup' is negative. In this case, the aspects are explicitly included in the comment, and predicting aspect-based sentiment polarities of such comments is termed aspect term sentiment analysis (ATSA) or target sentiment analysis (TSA). In another case, the aspect is mentioned but not explicitly included in the comment, for example: "Although the dinner is expensive, waiters are so warm-hearted!". We can observe that two aspects are mentioned in this comment, price and service, with completely opposite sentiment polarities. Predicting aspect-based sentiment polarities of such comments is termed aspect category sentiment analysis (ACSA), where the aspect categories belong to a predefined set. In this paper, we collectively refer to aspect terms and aspect categories as aspects, and our goal is aspect-based sentiment analysis (ABSA), including ATSA and ACSA, which are both classification tasks.
Long Short-Term Memory (LSTM) (Hochreiter & Schmidhuber, 1997) is the most widely used context encoder in the ABSA task. Previous models using LSTM as the context encoder can be roughly divided into four categories: (1) Models in the first category conduct joint modeling of the concatenated context and aspect. Attention-based LSTM with aspect embedding (ATAE-LSTM) (Wang, Huang, & Zhao, 2016) and modeling inter-aspect dependencies by LSTM (IAD-LSTM) (Hazarika, Poria, Vij, Krishnamurthy, Cambria, & Zimmermann, 2018) model the context and aspect simultaneously by concatenating the aspect vector to each context word embedding in the embedding layer before the LSTM. (2) In the second category, the models only use context words as input when modeling the context. Interactive attention networks (IAN) (Ma, Li, Zhang, & Wang, 2017) and aspect fusion LSTM (AF-LSTM) (Tay, Tuan, & Hui, 2018) model the context alone while utilizing the aspect to study the interaction between context and aspect in the attention mechanism.
(3) The methods in the third category additionally multiply a relative position weight to highlight the potentially aspect-related context words. The recurrent attention network on memory (RAM) (Chen, Sun, Bing, & Yang, 2017) assigns relative position weights to context hidden states before the attention mechanism. (4) Models in the fourth category differ more substantially. After the LSTM layer, they employ graph convolutional networks (GCN) or graph attention networks (GAT) to leverage syntactic information by encoding the context's syntax graph. ASGCN (Zhang, Li, & Song, 2019) utilizes GCN to enhance the hidden states with syntactical connections. Although recent LSTM-based models may adopt subsequent modules different from the above models, their context modeling processes fall into the above four categories.
LSTM, GCN, and BERT can be seen as a sequence-based context encoder, a graph-based context encoder, and a pre-trained context encoder, respectively. All of them can generate or modify the inner values of the hidden states, which cannot be achieved by subsequent modules such as aspect-specific attention mechanisms (Ma et al., 2017). These context encoders are widely adopted in existing ABSA models and their generated hidden states are taken as the input of subsequent modules. However, there is a question that has never been considered: are these hidden states good enough?
We argue that the aspect-related information may be discarded and the aspect-irrelevant information may be retained in the hidden states generated by LSTM, GCN and BERT. The reason is that no aspect information is introduced into the context modeling process of these context encoders, so they cannot shape the latent semantic space according to the aspect of the current sample. In this paper, we term this problem the aspect-agnostic problem in the context modeling process. In Section 2 we depict this problem in detail.
To solve the aspect-agnostic problem, we argue that the semantics of the given aspect should be explicitly introduced into the context modeling process as a new clue out of the context. With the consideration of the given aspect, an aspect-aware context encoder can specifically retain useful information and eliminate aspect-irrelevant information in the generated aspect-aware hidden states, which can improve ABSA. Specifically, we propose three streams of aspect-aware context encoders, based on LSTM, GCN and BERT, respectively. Based on LSTM, we design an aspect-aware LSTM (AALSTM) which augments vanilla LSTM with a novel aspect-aware (AA) mechanism. The AA mechanism includes three aspect gates, corresponding to the input gate, forget gate, and output gate of the LSTM cell, respectively. Each aspect gate takes the aspect vector and the previous hidden state as input, producing a gate vector. The gate vector controls the fraction of disturbance from the given aspect added to the internal value of the corresponding LSTM gate. In this way, when modeling the context, AALSTM can dynamically identify the aspect-related information as well as the aspect-irrelevant information. With the consideration of the given aspect, AALSTM can retain useful information in the generated hidden states and prevent harmful information from fusing into them. We propose the aspect-aware GCN (AAGCN) by augmenting vanilla GCN with an aspect-aware convolution gate, which controls what and how much information from the neighbor nodes should be passed to the current node, regarding the specific aspect. As for BERT, we do not change its internal network architecture, so as to preserve its strong language modeling capability. Skillfully utilizing the settings of the segment embedding and the [SEP] token, we modify the input format of BERT to make it capture aspect-aware intra-sentence dependencies when modeling the context.
We propose three aspect-aware (AA) BERT variants whose differences lie in their input formats. Note that as the AABERTs share the same parameters as standard BERT, they incur no extra computation cost.
A preliminary version of this work has been presented in the conference paper (Xing, Liao, Song, Wang, Zhang, Wang, & Huang, 2019). The contributions of the previous version can be summarized as follows:
• We discover the aspect-agnostic problem in the ABSA task. To our knowledge, this is the first time that this problem is identified. To solve this problem, we propose a novel LSTM variant termed aspect-aware LSTM (AALSTM) to introduce the aspect into the process of context modeling.
• Considering that the aspect is the core information in this task, we fully exploit its potential by introducing it into the LSTM cells. We design three aspect gates to introduce the aspect into the input gate, forget gate and output gate in the LSTM cell. AALSTM can utilize aspect to improve the information flow and then generate more effective aspect-specific context hidden states tailored for ABSA task.
• We apply our proposed AALSTM to several representative LSTM-based models, and the experimental results on the benchmark datasets demonstrate the efficacy and generalization of our proposed AALSTM.
In this paper, we significantly extend our work from the previous version in the following aspects:
• We discover that although GCN and BERT are widely adopted and have achieved promising performance in the ABSA task, they also suffer from the aspect-agnostic problem.
• To solve the aspect-agnostic problem in GCN, we propose the aspect-aware GCN (AAGCN) by augmenting vanilla GCN with a novel aspect-aware convolution gate to introduce aspect semantics into the graph convolution process.
• To solve the problem in BERT, we propose three aspect-aware BERT (AABERT) variants by skillfully modifying the input format of BERT. In this way, our AABERTs can model the intra-sentence dependency between the aspect and context words in a more appropriate way.
• We conduct extensive experiments to evaluate the proposed (Bi-)AALSTM, AAGCN and AABERTs on the ABSA task. The results demonstrate that AABERTs can outperform vanilla BERT not only as single models but also as context encoders, and that AAGCN works well with both (Bi-)AALSTM and AABERTs. Equipped with our aspect-aware context encoders, baselines proposed several years ago can beat up-to-date models, achieving new state-of-the-art performance.
The remainder of this paper is organized as follows. Section 2 depicts the details of the aspect-agnostic problem; Section 3 summarizes recent studies on the ABSA task; Section 4 elaborates the details of our proposed AALSTM and AABERTs; Section 5 introduces the details of the experiments; Section 6 gives the evaluation results and analysis; Section 7 discusses the proposed aspect-aware encoders and further investigates their properties and advantages; Section 8 concludes this work.

Aspect-Agnostic Problem
When modeling the context, LSTM cells are aspect-agnostic because no aspect information is introduced into the cells to guide the information flow. Consequently, the generated hidden states contain the semantic information that is important to the whole review rather than to the given aspect. This is because LSTM inherently tends to retain, in the generated hidden states, the information that is important for the overall semantics of the whole review. However, considering the characteristics of the ABSA task, a context word is valuable only if its semantics is helpful for predicting the sentiment of the given aspect. Conversely, if the information of a context word is aspect-irrelevant, it may be noise that is harmful to the prediction of the aspect sentiment, and its semantics should be eliminated in the context modeling process of the context encoder. LSTM cannot distinguish these two kinds of information when modeling the context, because no aspect information is considered in its cells. As a result, the aspect-related information may already be discarded and adverse information may be retained in the hidden states generated by LSTM. Specifically, the lack of aspect information in LSTM cells may cause the following two issues. For a specific aspect, on one hand, some of the semantic information of the whole review context is useless. This aspect-irrelevant information would harm the final context representation, especially when multiple aspects exist in one comment.

Figure 1: An example aiming to predict the sentiment of soup.

Figure 2: An example aiming to predict the sentiment of beef.
This is because when LSTM encounters a token important for the overall sentence semantics, this token's information is retained in every follow-up hidden state. Consequently, even if perfect attention weights are produced by the attention mechanism, these hidden states still contain useless information with respect to the aspect. The contained useless information is even magnified to some extent, because the important tokens are assigned greater attention weights and these tokens may contain some useless information. On the other hand, the information important to the aspect may not be sufficiently kept in the hidden states because of its small contribution to the overall semantic information of the sentence. We define the above issues as the aspect-agnostic problem of LSTM in the ABSA task. This is the first time this problem is identified. Concretely, we take two typical examples to illustrate the aspect-agnostic problem.
The first example, "The salad is so delicious but the soup tastes bad.", is shown in Fig.1. There are two aspects (salad and soup) with opposite sentiment polarities. When inferring the sentiment polarity of soup, the phrase 'so delicious', which modifies salad, is also important to the sentence-level semantics of the whole review, and therefore LSTM will retain its information in subsequent context words' hidden states, including the hidden states of tastes and bad. Even if tastes and bad are assigned large attention weights by the attention mechanism, the semantics of 'so delicious' will still be integrated into the final aspect-based context representation and enlarged by the large attention weights. As a result, the adverse information from 'so delicious' will harm the prediction of the aspect sentiment of soup.

Figure 3: Illustration of the sentence pair modeling process of BERT1, showing aspect-aware inter-sentence dependencies and aspect-agnostic intra-sentence dependencies. For simplicity, some intra-sentence dependencies between context words are omitted.
The second example, "Pizza is wonderful compared to the last time we enjoyed at another place, and the beef is not bad, by the way.", is shown in Fig.2. We can find that this review is mainly about pizza, so LSTM cells will retain much of the semantics that modifies pizza and less about beef when modeling the context. When inferring the aspect sentiment of beef, as LSTM is aspect-agnostic, some key information related to beef is probably lost in the generated hidden states because of the relatively small contribution of beef-related semantics to the overall semantics of the review. GCN is conceptually similar to LSTM in that both pass messages from other words/nodes to the current word/node. In LSTM, the message passing is from previous words to the current word, while in GCN messages are passed from neighbor nodes to the current node. LSTM uses gate mechanisms to control the information flow, while in GCN this is achieved by graph convolutions. In the graph convolution process, GCN does not know which nodes are aspect-word nodes. Then the important aspect-related information may be discarded and aspect-irrelevant information may be retained. Hence, GCN suffers from the same aspect-agnostic problem as LSTM.
As for BERT, BERT0 takes only the context as input and models the intra-sentence dependencies without consideration of the given aspect, so it suffers from the same aspect-agnostic problem as LSTM. In BERT1, the context is in position s1 and the aspect is in position s2. As BERT1 models the concatenated context-aspect pair in the sentence-pair manner, it can extract aspect-related semantics from the context. Consequently, BERT1 is a strong baseline, as shown in Table 4. We attribute these improvements to BERT1's capability of capturing the inter-sentence dependencies between the context (s1) and the aspect (s2). The sentence pair modeling process of BERT1 is illustrated in Fig 3. We can observe that BERT1 regards the context (s1) and the concatenated aspect (s2) as two individual sentences. This is because the separator token [SEP] and the segment embeddings of BERT1 thoroughly separate the context (s1) and the aspect (s2) in the latent space. The inter-sentence dependencies captured by BERT1 are aspect-aware on account of the pre-training on the next sentence prediction task. Although the sentence-pair modeling works like an attention mechanism considering the aspect in position s2, there is no sufficient clue from the given aspect in the intra-sentence language modeling process of the context (s1). However, one of the key characteristics of the ABSA task is that the aspect is contained in the context. As a result, the captured intra-sentence dependencies are general and aspect-agnostic, similar to those obtained by LSTM and BERT0. Consequently, some useful intra-sentence dependencies between the aspect and its related context words (especially the sentiment trigger words) may be lost in the context modeling process of BERT1. Therefore, both BERT0 and BERT1 suffer from the aspect-agnostic problem in the ABSA task.

Related Work
Some traditional ABSA methods have achieved promising results, but they are labor-intensive because they focus on feature engineering or rely on massive extra linguistic resources (Kiritchenko, Zhu, Cherry, & Mohammad, 2014; Wagner, Arora, Cortes, Barman, Bogdanova, Foster, & Tounsi, 2014). As deep learning achieved breakthrough success in representation learning (LeCun, Bengio, & Hinton, 2015), many recent ABSA approaches adopt deep neural networks to automatically extract features and generate the final aspect-based sentiment representation, a dense vector fed into the classifier. Since the attention mechanism was first introduced to the neural machine translation field (Bahdanau, Cho, & Bengio, 2015), many sequence-based models utilize it to focus on the aspect sentiment trigger words for predicting the aspect's sentiment. The attention mechanism in ABSA takes the aspect vector and the hidden states of context words as input. It then produces an attention vector that assigns each context hidden state a weight according to its relevance to the given aspect.
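The attention step described above can be sketched as follows; the bilinear scoring matrix W is one common parameterization among several used in the literature, not a specific model's:

```python
import numpy as np

def aspect_attention(hidden_states, aspect_vec, W):
    """Score each context hidden state against the aspect vector.

    hidden_states: (n, d) matrix of encoder hidden states.
    aspect_vec:    (d,) aspect representation.
    W:             (d, d) bilinear weight (a hypothetical parameterization).
    Returns the attention weights and the weighted-sum context representation.
    """
    scores = hidden_states @ W @ aspect_vec           # (n,) relevance scores
    scores = scores - scores.max()                    # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()   # softmax over positions
    rep = weights @ hidden_states                     # (d,) final representation
    return weights, rep
```

The weighted sum `rep` is what subsequent modules feed into the classifier; note that it can only re-weight the hidden states, not change their inner values.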
The core of the ABSA task is to model the interaction between the context and the given aspect, and then extract the aspect-related semantics. (Ma et al., 2017) adopted two individual LSTMs to model the context and the aspect term; the proposed interactive attention mechanism can learn the interaction between aspect and context, extracting the aspect-related information. With the ability to extract n-gram features, the convolutional neural network (CNN) (Kim, 2014) is applied to model the interaction between the context and the aspect in some previous works (Xue & Li, 2018; Huang & Carley, 2018). (Huang & Carley, 2018) utilized parameterized filters and parameterized gates to incorporate aspect information into CNN; as they declare, it was the first attempt to use CNN to solve the ABSA task. There are also models based on memory networks (MNs) (Sukhbaatar, szlam, Weston, & Fergus, 2015) (Tang, Qin, & Liu, 2016b; Tay, Anh, & Cheung, 2017; Wang, Mazumder, Liu, Zhou, & Chang, 2018). (Tay et al., 2017) modeled dyadic interactions between aspect and context using neural tensor layers and associative layers with rich compositional operators. One of these works argues that, for cases where several sentences are identical except for different targets, relying only on the attention mechanism is insufficient; it designs several memory networks, each with its own characteristics, to solve this problem. The capsule network (Sabour, Frosst, & Hinton, 2017) has also been exploited to tackle both sentiment analysis and aspect-based sentiment analysis tasks (Wang, Sun, Han, Liu, & Zhu, 2018; Wang, Sun, Huang, & Zhu, 2019; Chen & Qian, 2019). To address the lack of labeled data for the ABSA task, (Chen & Qian, 2019) transfers knowledge from the document-level sentiment analysis task to aspect-based sentiment analysis. It designs an aspect routing approach and extends the vanilla dynamic routing approach by adapting it to the transfer learning framework.
Although previous works can improve the ABSA task by tackling different issues, the effectiveness of the context modeling processes of LSTM, GCN and BERT for the ABSA task has never been inspected. In this work, we discover the aspect-agnostic problem, from which the context encoders of existing works widely suffer, and we propose aspect-aware encoders to tackle this problem, improving ABSA from a new perspective. Motivated by the observation and analysis of the aspect-agnostic problem, we propose to introduce explicit aspect semantics into the context encoder to make the context modeling process aspect-aware, as shown in Fig 4. In this paper, we propose three streams of aspect-aware context encoders as solutions, whose backbones are vanilla LSTM, GCN and BERT, respectively. In the following sections, we will introduce the details of the proposed aspect-aware context encoders.

Aspect-Aware LSTM
Vanilla LSTM utilizes three gates (input gate, forget gate, and output gate) to model the dependencies within the input word sequence and retain important long-range dependencies in the forward context modeling process. We argue that the information of the given aspect should be introduced into the LSTM cells to help regulate the information flow. Additionally, it is intuitive that at every time step the degree to which the semantics of the given aspect is integrated into the three gates of classic LSTM should be dynamically adjusted according to the aspect information and the current semantic state. Therefore, we design three aspect gates that control how much the aspect vector is integrated into the input gate, forget gate, and output gate, respectively. The aspect gate mechanism takes the previous hidden state and the aspect vector as input. In this way, AALSTM can dynamically optimize the information flow in LSTM cells according to the given aspect, and then generate effective and beneficial aspect-specific hidden states for the ABSA task. Figure 5 illustrates the overall architecture of our proposed AALSTM, which can be formalized as follows:

a_i = σ(W_{a_i} [A; h_{t-1}])   (1)
I_t = σ(W_I x_t + U_I h_{t-1} + W_{Ia} (a_i ⊙ A) + b_I)   (2)
a_f = σ(W_{a_f} [A; h_{t-1}])   (3)
f_t = σ(W_f x_t + U_f h_{t-1} + W_{fa} (a_f ⊙ A) + b_f)   (4)
C̃_t = tanh(W_C x_t + U_C h_{t-1} + b_C)   (5)
C_t = f_t ⊙ C_{t-1} + I_t ⊙ C̃_t   (6)
a_o = σ(W_{a_o} [A; h_{t-1}])   (7)
o_t = σ(W_o x_t + U_o h_{t-1} + W_{oa} (a_o ⊙ A) + b_o)   (8)
h_t = o_t ⊙ tanh(C_t)   (9)

where x_t denotes the current input context word embedding, A is the aspect vector, h_{t-1} is the previous hidden state, h_t is the hidden state of the current time step, σ and tanh are the sigmoid and hyperbolic tangent functions, ⊙ stands for element-wise multiplication, [;] denotes vector concatenation, and a_i, a_f, a_o ∈ R^{d_a} stand for the aspect-input gate, aspect-forget gate, and aspect-output gate, respectively. They determine the extent to which the aspect information is integrated into the input gate, forget gate, and output gate, respectively. AALSTM takes two strands of input: the context word embeddings and the aspect vector. At each time step, the context word entering the AALSTM varies according to the sequence of words in the sentence, while the aspect vector is identical.
Specifically, the aspect vector is the target representation in ATSA and the aspect embedding in ACSA. Next, we describe the different components of our proposed AALSTM in detail.

Input Gates
The input gate I_t controls how much new information from the input context word embedding can be transferred into the cell state, while the aspect-input gate a_i controls how much the aspect information should be integrated into the input gate I_t. The difference between I_t in AALSTM and vanilla LSTM lies in the weighted aspect vector input into I_t.
The aspect-input gate a_i is computed from h_{t-1} and A (Eq. 1). h_{t-1} can be regarded as the semantic representation of the partial sentence processed in the past time steps. Hence, the aspect-input gate a_i is controlled by the previous semantic representation and the aspect vector A. In I_t, the dynamically weighted aspect information a_i ⊙ A is added to the original internal value calculated from x_t and h_{t-1} (Eq. 2). Thereby, the dynamically adjusted disturbance from the given aspect can guide I_t in determining what information from the current input context word embedding should be transferred into the cell state.

Forget Gates
The forget gate f_t abandons trivial information and retains key information from the previous cell state C_{t-1}, while the aspect-forget gate a_f controls how much the aspect vector should be integrated into the forget gate f_t. The difference between AALSTM and vanilla LSTM in f_t is the introduction of the weighted aspect vector. The aspect-forget gate a_f is computed from h_{t-1} and A (Eq. 3). Therefore, the extent of the integration of aspect information into f_t is decided by the previous semantic representation and the aspect vector A. In f_t, a_f ⊙ A is added to the original internal value calculated from x_t and h_{t-1} (Eq. 4). Thereby, the dynamically adjusted disturbance from the given aspect can guide f_t to select aspect-related information from the previous cell state and retain it in the current cell state.
In the meantime, aspect-irrelevant information is abandoned.

Candidate Cell and Current Cell
The candidate cell C̃_t represents the alternative input content. The current cell C_t updates its state by selecting important information from the previous cell state C_{t-1} and the candidate cell C̃_t. From Eq. 5 we can observe that two kinds of information are contained in the alternative input content C̃_t: the previous hidden state h_{t-1} and the current input context embedding x_t. The information in the current cell state C_t consists of information from the previous cell state C_{t-1} and the candidate cell C̃_t, as shown in Eq. 6. Considering that the information in h_{t-1} comes from the previous cell state C_{t-1}, the only source of the information contained in the cell state C_t and hidden state h_t is the input context word embeddings. Hence, our proposed AALSTM only leverages the given aspect information to regulate the information flow in LSTM cells, rather than fusing the aspect information into the cell states or hidden states.

Output Gates
The output gate o_t controls what information of the current cell state should be output as the hidden state of the current context word. The aspect-output gate a_o controls what fraction of the aspect should be integrated into the output gate o_t. The difference between our proposed AALSTM and vanilla LSTM in o_t is the integration of the weighted aspect vector into o_t. The aspect-output gate a_o is computed from h_{t-1} and A (Eq. 7). Therefore, the degree to which the aspect information is integrated into o_t is decided by the previous semantic representation and the aspect vector A. In o_t, the dynamically weighted aspect information vector a_o ⊙ A is added to the original internal value calculated from x_t and h_{t-1} (Eq. 8). Thereby, the optimized disturbance from the given aspect can guide o_t to select the appropriate information from the current cell state as the hidden state of the current input context word.
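To make the gate computations above concrete, the following NumPy sketch implements one step of an AALSTM cell. The weight names and shapes, and the projection `P` that maps the gated aspect vector into the hidden dimension, are our assumptions for illustration, not the authors' released implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class AALSTMCell:
    """One step of an aspect-aware LSTM cell (a sketch under assumed shapes)."""

    def __init__(self, d_in, d_h, d_a, seed=0):
        rng = np.random.default_rng(seed)
        def mat(*shape):
            return rng.normal(scale=0.1, size=shape)
        # standard LSTM parameters: input (i), forget (f), output (o) gates
        # and the candidate cell (c)
        self.W = {g: mat(d_h, d_in) for g in "ifoc"}
        self.U = {g: mat(d_h, d_h) for g in "ifoc"}
        self.b = {g: np.zeros(d_h) for g in "ifoc"}
        # aspect gates, computed from [A; h_{t-1}], one per LSTM gate
        self.Wa = {g: mat(d_a, d_a + d_h) for g in "ifo"}
        # assumed projection of the gated aspect vector into the hidden dim
        self.P = {g: mat(d_h, d_a) for g in "ifo"}

    def step(self, x, A, h_prev, c_prev):
        ah = np.concatenate([A, h_prev])
        gates = {}
        for g in "ifo":
            a_g = sigmoid(self.Wa[g] @ ah)   # aspect gate for this LSTM gate
            # LSTM gate with the weighted aspect disturbance added
            gates[g] = sigmoid(self.W[g] @ x + self.U[g] @ h_prev
                               + self.P[g] @ (a_g * A) + self.b[g])
        c_tilde = np.tanh(self.W["c"] @ x + self.U["c"] @ h_prev + self.b["c"])
        c = gates["f"] * c_prev + gates["i"] * c_tilde   # new cell state
        h = gates["o"] * np.tanh(c)                      # new hidden state
        return h, c
```

Note that the aspect vector A only modulates the gates; it is never written into the cell state or hidden state directly, matching the discussion of the candidate and current cells above.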

Aspect-Aware Graph Convolutional Networks
In ABSA, the function of GCN is to encode the local syntactic connections represented by the adjacency matrix derived from the syntax graph predicted by an off-the-shelf dependency parser. The graph convolution operation of vanilla GCN can be written as:

h_i^{l+1} = ReLU( Σ_{j ∈ N_i} (W_g h_j^l) / d_i + b_g )

where l denotes the l-th GCN layer, N_i denotes the set of node i's neighbor nodes, d_i is the degree of node i in the syntax tree, W_g is a weight matrix, and b_g is a bias.
To tackle the aspect-agnostic problem of GCN, we propose the aspect-aware GCN (AAGCN) by augmenting vanilla GCN with an aspect-aware convolution gate. The architecture of AAGCN is illustrated in Figure 6. Our AAGCN can be formulated as:

g_j = σ(W_ac [A; h_j^l] + b_ac)
h_i^{l+1} = ReLU( Σ_{j ∈ N_i} g_j ⊙ (W_g h_j^l) / d_i + b_g )

where A denotes the aspect representation, and W_ac and b_ac are the weight matrix and bias, respectively. The input of the aspect-aware convolution gate is the aspect representation and the hidden state of the neighbor node. Considering the semantics of the specific aspect and the hidden state of the neighbor node, the aspect-aware convolution gate outputs a vector which determines what and how much information from the neighbor node should be transferred into the current node. In this way, AAGCN can aggregate the aspect-related information and eliminate aspect-irrelevant information harmful for ABSA in the process of aspect-aware graph convolution. Then AAGCN can generate better hidden states which can improve the performance of ABSA.

Figure 6: The architecture of our proposed aspect-aware GCN. The black dashed line denotes the message passing from layer l to layer l + 1. The red dashed line denotes that the aspect semantics helps control what and how much information should be passed from layer l to layer l + 1.
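A minimal NumPy sketch of one aspect-aware graph convolution layer following the gating described above; the parameter names and the exact degree normalization are our assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def aagcn_layer(H, adj, A, Wg, bg, Wac, bac):
    """One aspect-aware graph convolution (a sketch; parameters assumed).

    H:   (n, d) node hidden states at layer l
    adj: (n, n) 0/1 adjacency matrix from the dependency parse
    A:   (d_a,) aspect representation
    Wg:  (d, d_out) convolution weight; Wac: (d_out, d_a + d) gate weight
    """
    n, _ = H.shape
    deg = adj.sum(axis=1) + 1e-9          # node degrees (avoid divide-by-zero)
    out = np.zeros_like(H @ Wg)
    for i in range(n):
        acc = np.zeros(Wg.shape[1])
        for j in range(n):
            if adj[i, j]:
                # aspect-aware convolution gate over neighbor j's state
                gate = sigmoid(Wac @ np.concatenate([A, H[j]]) + bac)
                acc += gate * (H[j] @ Wg)
        out[i] = np.maximum(acc / deg[i] + bg, 0.0)   # ReLU
    return out
```

The gate depends on both the aspect and the neighbor's state, so different aspects in the same sentence induce different message passing over the same syntax graph.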

Aspect-Aware BERT
Before demonstrating our proposed AABERTs, we first introduce the input and output formats of BERT0 and BERT1, as illustrated in Fig. 7 (Devlin, Chang, Lee, & Toutanova, 2019). To keep fine-tuning consistent with pre-training, previous works adopted BERT (CLS), rather than BERT (Pool) or BERT (SEP), as a strong baseline. In this work, we study not only BERT (CLS) but also BERT (Pool) and BERT (SEP) for the ABSA task. As for the Sentiment Reasoning Head, we simply use a fully connected layer, whose output is the sentiment class probability distribution formed as a 3-d vector.
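The (CLS), (Pool) and (SEP) variants differ only in which hidden state is fed to the fully connected head; a sketch (the function and parameter names here are ours, not from the paper):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def sentiment_head(hidden, sep_index, W, b, mode="cls"):
    """Three pooling choices feeding a 3-way sentiment classifier (a sketch).

    hidden:    (n, d) BERT output hidden states for the input sequence
    sep_index: position of the final [SEP] token
    mode: 'cls' uses hidden[0]; 'pool' mean-pools; 'sep' uses hidden[sep_index]
    """
    if mode == "cls":
        v = hidden[0]
    elif mode == "pool":
        v = hidden.mean(axis=0)
    else:  # 'sep'
        v = hidden[sep_index]
    # fully connected layer -> probabilities over {negative, neutral, positive}
    return softmax(W @ v + b)
```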
When designing AABERTs, the original language modeling capacity of pre-trained BERT should be preserved, so we do not modify its internal network architecture. However, to achieve aspect-awareness in BERT, we have to introduce the semantics of the given aspect into the intra-sentence dependency modeling process of BERT. In this work, our intuition is to provide BERT with the signal of the given aspect without thoroughly separating the context and aspect into two individual sentences. To achieve this, we try to break the isolation of the context (s1) and the concatenated aspect (s2), providing a more appropriate signal of the given aspect for BERT. We modify the input format of BERT, more specifically, the settings of the segment embedding and the [SEP] token. Note that although there are many works aiming to improve BERT, most of them focus on designing different pre-training tasks (Liu, Ott, Goyal, Du, Joshi, Chen, Levy, Lewis, Zettlemoyer, & Stoyanov, 2019; Zhang, Han, Liu, Jiang, Sun, & Liu, 2019; Joshi, Chen, Liu, Weld, Zettlemoyer, & Levy, 2020) or position embeddings (Wang, Shang, Lioma, Jiang, Yang, Liu, & Simonsen, 2021), while the segment embedding and [SEP] token are yet to be studied. Next we introduce the proposed three aspect-aware BERTs, whose differences lie only in their input formats.

AABERT1

The input and output format of AABERT1 is illustrated in Fig. 9. In AABERT1, the context word sequence is concatenated with the word sequence of the given aspect. They are not separated by a [SEP] token, and the segment embeddings of their tokens are identical. As a result, the input tokens of the context and aspect are not separated in the embedding space. The intuition of AABERT1 is to provide the signal of the given aspect without separating the context and aspect. There are two clues of aspect-awareness for intra-sentence context modeling.
First, since the review is a single sentence, the punctuation mark (usually '.') at the end of every review acts as a weak separator hinting that the tokens between it and the end mark ([SEP]) are the given aspect. Second, the co-occurrence of the aspect mentioned in the context (s1) and the aspect concatenated in position s2 provides another aspect signal.

AABERT2
The input and output format of AABERT2 is illustrated in Fig. 10. The difference between AABERT2 and AABERT1 lies in the segment embedding. In AABERT2, the segment embeddings of aspect tokens are set differently from those of context tokens, so a signal separating context and aspect is introduced in the input embedding space. Therefore, there are three clues indicating the specific aspect of the current sample: (1) the final punctuation mark of the review context; (2) the co-occurrence of the concatenated aspect and the one included in the context; (3) the different semantic signals of the segment embeddings. More importantly, the concatenated aspect is not fully regarded as a sentence separated from the context. Accordingly, the intra-sentence dependencies between the aspect and its related context words can be captured, and the generated hidden states can then contain more useful aspect-related information.

Figure 11: Illustration of AABERT3 (CLS), AABERT3 (Pool) and AABERT3 (SEP).

AABERT3
The input and output format of AABERT3 is illustrated in Fig. 11. AABERT3 adopts the explicit separator token [SEP] to separate the context and aspect in the concatenated input sequence. However, AABERT3 does not set different segment embeddings for context tokens and aspect tokens, so they are not separated in the latent space. Hence, there are two clues of aspect-awareness for intra-sentence context modeling: one is the [SEP] token, and the other is the co-occurrence of the aspect included in the context and the aspect concatenated in position s2. As in AABERT1 and AABERT2, the concatenated aspect is not fully regarded as a sentence separated from the context, so the intra-sentence dependencies between the aspect and its related context words can be captured.
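A minimal sketch (hypothetical helper, not the released implementation) summarizing how the three AABERT input formats differ only in the [SEP] token and segment ids:

```python
def aabert_input(context_tokens, aspect_tokens, variant):
    # The three AABERTs keep BERT's architecture and only change the input:
    #   AABERT1: no [SEP] between context and aspect, identical segment ids.
    #   AABERT2: no [SEP] between them, but distinct segment ids for the aspect
    #            (whether the final [SEP] shares the aspect's segment id is an
    #            assumption here).
    #   AABERT3: a [SEP] between them, identical segment ids everywhere.
    sep = ["[SEP]"] if variant == 3 else []
    tokens = ["[CLS]"] + context_tokens + sep + aspect_tokens + ["[SEP]"]
    if variant == 2:
        segments = [0] * (len(context_tokens) + 1) + [1] * (len(aspect_tokens) + 1)
    else:
        segments = [0] * len(tokens)
    return tokens, segments
```

In every variant the aspect is appended to the context rather than treated as a fully separate second sentence, which is what lets BERT model intra-sentence dependencies between the aspect and its trigger words.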

Task Definition
We conduct experiments on the two cases of aspect-based sentiment analysis: aspect term sentiment analysis (ATSA) and aspect category sentiment analysis (ACSA). The former infers the sentiment polarity of a given aspect term explicitly included in the context word sequence. The latter infers the sentiment polarities of generic aspects such as service or food, which may not be explicitly found in the context word sequence and belong to a predefined set. In this paper, both ATSA and ACSA are classification tasks; concretely, they are respectively subtask 2 (ST2) and subtask 4 (ST4) of SemEval-2014 Task 4 (Pontiki, Galanis, Pavlopoulos, Papageorgiou, Androutsopoulos, & Manandhar, 2014).

Settings Datasets
We evaluate the performances of our models on the SemEval 2014 Task 4 datasets (Pontiki et al., 2014), which consist of laptop and restaurant reviews. These two datasets are widely used benchmarks in many previous works (Wang et al., 2016; Chen et al., 2017; Wang et al., 2020); consistent with them, we remove the reviews having no aspect and the aspects with the sentiment polarity "conflict". The resulting datasets consist of reviews with at least one aspect labeled with a sentiment polarity of positive, neutral, or negative. For ATSA (ST2), we adopt the Laptop and Restaurant datasets; for ACSA (ST4), only the Restaurant dataset is available. Full statistics of the datasets are given in Table 1.

Evaluation
We adopt both Macro-F1 and Accuracy (Acc) to evaluate the models' performances. Generally, higher Acc can verify the effectiveness of the system, but it is biased toward the majority class; Macro-F1 provides more indicative information about the average performance across all classes.
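For reference, the two metrics as used here can be computed with the following plain-Python sketch over the three sentiment labels:

```python
def accuracy(y_true, y_pred):
    # Fraction of exactly matching predictions; biased toward the majority class.
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred, classes=("positive", "neutral", "negative")):
    # Macro-F1 averages per-class F1, so a minority class (e.g. neutral)
    # weighs exactly as much as the majority class.
    f1s = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)
```

A classifier that ignores the neutral class entirely can still score a high Acc on these skewed datasets, which is why both metrics are reported.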
Following previous works (Tang et al., 2020; Zhang & Qian, 2020), we train the models several times and report the average of the best results of each run. Concretely, we train the models 10 times with individual random seeds, and the number of epochs is 30.

Base Models and Compared Models
Firstly, we compare our AALSTM with LSTM as single models. LSTM, LSTM (Pool), and Bi-LSTM denote that the last hidden state of LSTM, the pooling of all hidden states of LSTM, and the pooling of all hidden states of Bi-LSTM are taken as the final representation for classification, respectively. The same notation applies to AALSTM, as shown in Table 2. The output of Bi-AALSTM is the sequence of concatenated hidden states of two AALSTMs running in opposite directions.
In Section 1, we divide existing LSTM-based ABSA models into four categories according to their context modeling process. To verify the superiority of the proposed AALSTM and AAGCN over vanilla LSTM and GCN, we choose some representative models as backbones and replace their original vanilla LSTM and GCN with our proposed AALSTM and AAGCN. We select one representative model from each of these categories for experiments. Accordingly, ATAE-LSTM (Wang et al., 2016), IAN (Ma et al., 2017), RAM, and ASGCN are chosen as the representatives of the four categories because their architectures are novel and they are widely used as compared models in previous works. We introduce the four LSTM-based backbones as follows:
• Attention-based LSTM with Aspect Embedding (ATAE-LSTM). It concatenates the aspect embedding to the word embeddings of context words as the initial context word representation input into the LSTM layer. In the attention mechanism, the aspect embedding is utilized to produce the attention vector. We refer to this model as ATAE for short.
• Interactive Attention Networks (IAN). It models the context and target separately. In the interactive attention mechanism, for the ATSA task, the context and target leverage the average of each other's hidden states to produce their attention vectors. The representations of the context and aspect are concatenated as the final aspect-based sentiment representation. For the ACSA task, the target modeling module is not available because the aspect category is not a word sequence, so we use the aspect embedding to produce the attention weights of context words.
• Recurrent Attention Network on Memory (RAM). Aimed at the ATSA task, it utilizes the aspect-relative location to assign weights to the original hidden states, then produces the attention vector in a recurrent attention mechanism consisting of GRU cells. It cannot be applied to the ACSA task because the aspect category has no location information: it is not included in the context word sequence.
• Aspect-specific Graph Convolution Networks (ASGCN). Aimed at the ATSA task, it employs a GCN to encode the local connections of the context's syntax graph, and uses aspect-aware attention to extract the aspect-related semantics for classification. This model cannot be applied to the ACSA task either, for the same reason as RAM.
Besides evaluating BERT0, BERT1, and our three AABERTs as single models, we also compare them as context encoders in X+BERT experiments by replacing the original BERT encoder with AABERTs. We choose R-GAT+BERT (Wang et al., 2020), a recently proposed BERT-based model for ABSA, as our backbone for the X+BERT experiments.
• Relational Graph Attention Network with BERT Encoder. Augmenting the graph attention network (Veličković et al., 2018) with relation embeddings, Wang et al. (2020) propose the relational graph attention network (R-GAT), which can capture the relations between the aspect and each context word by operating on the star-shaped aspect-oriented dependency graph. R-GAT+BERT uses BERT1 to model context and aspect together in the sentence-pair manner and takes the output hidden state of the [CLS] token as the aspect representation. For fair comparison, we leave out the baselines that utilize external resources (Gao, Feng, Song, & Wu, 2019; Xu et al., 2019; Yang et al., 2021).

Implementation Details
For LSTM-based models, we initialize all word embeddings with GloVe vectors (Pennington, Socher, & Manning, 2014), and the embeddings of out-of-vocabulary words are sampled from the uniform distribution U(−0.1, 0.1). All embedding dimensions are set to 300. For BERT-based models, we adopt the BERT-base uncased English version (Devlin et al., 2019). Initial values of all weight matrices are sampled from the uniform distribution U(−0.1, 0.1), and all biases are initialized to zero. The batch size is set to 16. We minimize the cross-entropy loss to train our models using the Adam optimizer (Kingma & Ba, 2015), with the learning rate set to 0.001 for LSTM-based models and 10−5 for BERT-based models. To avoid overfitting, for LSTM-based models we apply dropout to the input context word embedding layer and the final aspect-based sentiment representation with p = 0.5; for BERT-based models we apply dropout to the BERT output hidden states with p = 0.3. Besides, the weight decay strategy and L2 regularization are adopted for BERT-based and LSTM-based models, respectively.
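The hyperparameters above can be collected into a single configuration sketch. The values are taken from the text; the dictionary keys are illustrative, not from the released code:

```python
# Hyperparameter summary for reproducing the settings described above.
CONFIG = {
    "embedding_dim": 300,        # GloVe vectors for LSTM-based models
    "batch_size": 16,
    "epochs": 30,                # per run; 10 runs with individual seeds
    "lr_lstm": 1e-3,             # Adam learning rate, LSTM-based models
    "lr_bert": 1e-5,             # Adam learning rate, BERT-based models
    "dropout_lstm": 0.5,         # on input embeddings and final representation
    "dropout_bert": 0.3,         # on BERT output hidden states
    "init_range": (-0.1, 0.1),   # uniform init for weights / OOV embeddings
}
```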
As for the aspect vector (A) input into AALSTM or AAGCN, we set it as follows:
• ATSA (ST2): For AALSTM, we generally use the average of the word embeddings of the target words as A, while for IAN, we adopt the average of the hidden states of the aspect words as A. For AAGCN, the average pooling of the hidden states of the aspect is taken as A.
For BERT-based models, the aspect category is regarded as a phrase just like the aspect term which is concatenated to the end of context in the input sequence.
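A minimal sketch of how the aspect vector A is obtained as an average (plain Python over toy 2-d vectors; the actual models average 300-d GloVe embeddings or encoder hidden states as described above):

```python
def aspect_vector(vectors):
    # A = element-wise average of the aspect tokens' embeddings
    # (or hidden states), used to condition AALSTM / AAGCN.
    dim = len(vectors[0])
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(dim)]

# e.g. a two-token aspect term such as "battery life":
A = aspect_vector([[1.0, 2.0], [3.0, 4.0]])
```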

LSTM
Experimental results of LSTM-based models are illustrated in Table 2. First of all, we can find that the single AALSTM significantly outperforms vanilla LSTM and ATAE, and is even close to IAN and RAM. On the ATSA(ST2)-Res14 dataset, AALSTM surpasses LSTM by 4.6% in F1 score. It is worth mentioning that ATAE, IAN, and RAM all adopt an attention mechanism to extract aspect-related information, while AALSTM and LSTM only model the context and take the last hidden state for prediction. The satisfying performance of the single AALSTM proves that its generated last hidden state contains better and more sentiment-indicative information about the given aspect than the last hidden state generated by vanilla LSTM. Generally, the tokens mentioning the given aspect are not located at the end of the review context, so this shows that after training, AALSTM is able to transmit the aspect-related semantics in different context words to the last hidden state along the time steps. We can also observe that the pooling variants of LSTM and AALSTM generally perform better than the ones adopting the last hidden state for classification. We speculate that some important information is not contained in the last hidden state, and the pooling variant can aggregate the important information in all hidden states into the final representation, which leads to better performance. Besides, we can find that Bi-AALSTM obtains a significant improvement merely by concatenating two unidirectional AALSTMs. This is because Bi-AALSTM can aggregate the aspect-specific semantics from both directions, making the final representation more sufficient.
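The two readout strategies compared above can be sketched as follows (toy 2-d hidden states; the real models use 300-d vectors):

```python
def last_state(hidden_states):
    # Last-hidden-state readout: whatever the encoder carried
    # to the final time step is the whole representation.
    return hidden_states[-1]

def mean_pool(hidden_states):
    # Mean pooling aggregates information spread over all time steps,
    # which the last hidden state alone may have lost.
    dim = len(hidden_states[0])
    n = len(hidden_states)
    return [sum(h[i] for h in hidden_states) / n for i in range(dim)]
```

Mean pooling weights every time step equally; the observation above is that this recovers information the last state discards, for both LSTM and AALSTM.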
To further verify the superiority of AALSTM over vanilla LSTM, we replace the original LSTM in the four backbones (ATAE, IAN, RAM, and ASGCN) with our proposed AALSTM. In our implementation, the only difference between the original models and their AA or Bi-AA variants is the replacement of vanilla LSTM or Bi-LSTM, so the performance improvement directly demonstrates the effectiveness of our proposed AALSTM and Bi-AALSTM. We can observe that (Bi-)AALSTM significantly improves the performances of IAN, RAM, and ASGCN, especially on Macro-F1. On the ATSA(ST2)-Lap14 dataset, IAN (AA) obtains 2.8% and 1.2% improvements in F1 and Acc, respectively. On the ACSA task, IAN (AA) gets an improvement of 2.6% in F1 compared with IAN, and IAN (Bi-AA) outperforms IAN (Bi) on ACSA(ST4) by 3% in F1. Compared with AS-Capsule, IAN (AA) and IAN (Bi-AA) respectively obtain 2% and 3% improvements in F1.
To analyze the significant improvements in F1, we dissect the performances of IAN and IAN (AA) on the three sentiment classes, with AS-Capsule also compared on the ACSA task. The comparisons are shown in Fig. 12. We can observe that AALSTM improves IAN's performance on all sentiment classes, and especially significantly on the Neutral class; IAN (AA) surpasses AS-Capsule on the Neutral class by a large margin. As shown in previous works (Wang et al., 2016; Tay et al., 2018; Wang et al., 2019), sentiment prediction for the Neutral class is much harder than for the other two sentiment classes. This is caused by two main issues. First, as shown in Table 1, the Neutral class has far fewer samples than the other two sentiment classes, so neural ABSA models do not have enough opportunities to learn how to effectively identify and extract the features important for correctly inferring Neutral sentiment. Second, the key information about a given aspect with Neutral sentiment may be discarded during the context modeling process of vanilla LSTM, especially in the multi-aspect situation where other aspects carry Positive or Negative sentiment. Compared with the tokens expressing Positive or Negative sentiment toward other aspects in the same review, the tokens modifying a Neutral aspect are prone to be neglected by vanilla LSTM cells to some extent, because their semantic information seems too general against the overall semantic background of the context. The comparisons in Fig. 12 prove that by taking the semantics of the given aspect into consideration when modeling the context, AALSTM can effectively alleviate the above two issues. However, we find that AALSTM brings diminished performance for the ATAE model. ATAE represents a category of models that utilize aspect embedding concatenation (AEC) for the joint modeling of aspect and context.
We suspect that the aspect information fused into the context word representations input to AALSTM may adversely affect the aspect-aware mechanism's ability to distinguish the sentiment-indicative information from the noise information with respect to the given aspect (this conjecture is verified in Sect. 7.1).
To investigate the efficacy of our proposed AAGCN, we replace the vanilla GCN in ASGCN with it. We can observe that ASGCN (Bi, AAGCN) significantly outperforms ASGCN (Bi), which proves the superiority of AAGCN over vanilla GCN. The aspect-aware graph convolution process of AAGCN can aggregate more aspect-related semantics into the current node from its neighbor nodes while discarding useless information. Therefore, AAGCN generates better node representations, which improves performance. With the adoption of both Bi-AALSTM and AAGCN, ASGCN (Bi-AA, AAGCN) obtains even better performance, outperforming up-to-date state-of-the-art models by large margins. This proves that AALSTM works well with AAGCN, generating much better hidden states that contain not only aspect-related semantics but also syntactic information. Additionally, we add AAGCN to RAM (Bi-AA) and find that its performance is significantly improved, even surpassing ASGCN (Bi-AA, AAGCN). This is because the addition of AAGCN not only introduces syntactic information but also further regulates the information flow with respect to the specific aspect.

We list the input formats of BERT0, BERT1, and our three AABERTs in Table 3 to help distinguish them. Experimental results of BERT-based models are illustrated in Table 4. We can observe that BERT0 (CLS) and BERT0 (Pool) obtain the worst results, because no aspect semantics is introduced. Among the other single BERT models, BERT1 (CLS) achieves the worst results on the ATSA task, although it is widely used for sentence classification. Our proposed AABERTs show significant superiority over BERT1 on the ATSA task. Remarkably, AABERT2 (Pool) even surpasses some state-of-the-art models in terms of F1 on the ATSA(ST2)-Lap14 dataset. This proves its strong capability of extracting aspect-related information from context words, which stems from effectively modeling the intra-sentence dependencies between the given aspect and context words.
On the ACSA task, we find that BERT1 (SEP) slightly outperforms our proposed AABERTs. This is because aspect categories are not included in the context word sequence; in this case, AABERTs' advantage in intra-sentence dependency modeling is hindered, since the aspect is not a word sequence included in the sentence.

BERT
To evaluate the effectiveness of AABERTs as context encoders, we conduct experiments on the recently proposed R-GAT+BERT by replacing the original BERT with AABERTs. Note that R-GAT+BERT is only available for the ATSA task, and in its original architecture the BERT is BERT1 and the hidden state of the [CLS] token is taken as the aspect representation. We can find that R-GAT+AABERTs generally outperform R-GAT+BERT1 and R-GAT+BERT0, while on the ATSA-Res14 dataset R-GAT+AABERTs slightly underperform R-GAT+BERT1.
As R-GAT+BERT takes the hidden state of the [CLS] token as the aspect representation, we suspect that the generated hidden states of [CLS] tokens may lose some important aspect-related information when aggregating the (aspect-related) context information. Since average pooling retains the original information as much as possible, we propose R-GAT2+BERT, which takes the average pooling of all context hidden states as the aspect representation. The performances of R-GAT2+BERTs are demonstrated in Table 4. We can observe that performance is improved merely by changing the way of obtaining the aspect representation from CLS to Pool. In addition, this operation costs no extra time or computation at implementation.
Another interesting phenomenon is that the collaboration of R-GAT and AABERTs (BERT0/BERT1) does not bring improvements over AABERTs (BERT1) on the ATSA(ST2)-Lap14 dataset. We suspect the reason is that the grammatical correctness of the reviews in the ATSA(ST2)-Lap14 dataset is lower than that of the reviews in the ATSA(ST2)-Res14 dataset, as revealed in (Zheng, Zhang, Mensah, & Mao, 2020). R-GAT heavily relies on the precision of dependency graphs, which may be parsed incorrectly for ungrammatical sentences. As a result, the performances of R-GAT+BERTs on the ATSA(ST2)-Lap14 dataset are not promising enough.
To evaluate the efficacy of combining AAGCN and AABERT, we design three models: AAGCN + AABERT3, ASGCN (AAGCN) + AABERT3, and RAM + AAGCN + AABERT3, in which the original LSTM encoder is replaced with our proposed AABERT3 encoder. AAGCN + AABERT3 takes the average pooling of hidden states as the final representation for classification. We can observe that it achieves new state-of-the-art performance on ACSA(ST4). This proves that although AABERT3 has a strong ability to model context and retain aspect-related semantics, AAGCN can further improve the performance by integrating syntactic information and further aggregating aspect-centric semantics. We can also find that RAM + AAGCN + AABERT3 achieves new state-of-the-art performance on ATSA(ST2). Although RAM was proposed in 2017, with the power of our proposed AAGCN and AABERT3, it beats up-to-date models proposed in 2020 and 2021.
It is worth mentioning that all of the improvements of AABERTs are achieved by skillfully modifying the input format of BERT, which is quite easy to implement in practice. Our proposed AABERTs incur no extra computation or time cost because they have the same architecture and parameters as vanilla BERT. Due to this, the improvements of AABERTs over vanilla BERT are not very significant. However, AAGCN works well with AABERTs to generate high-quality hidden states, which significantly improves performance and leads to new state-of-the-art results.

Computation Efficiency of AALSTM-and AABERT-Based Models
From the results demonstrated above, we can observe that BERT has absolute superiority over LSTM: even the basic BERT1 significantly outperforms the best LSTM-based model, RAM (Bi-AA) + AAGCN. However, this does not mean that LSTM-based models should be abandoned. In real-world scenarios, not only model performance but also computational efficiency is crucial. Table 5 shows the computational efficiency of our proposed RAM (Bi-AA) + AAGCN and RAM + AAGCN + AABERT3. We can find that the AABERT-based model costs 3 times the training and inference time of the AALSTM-based model, and even requires 6 times the GPU memory, while it only surpasses the AALSTM-based model by about 5% in performance. Therefore, LSTM-based models are much more computationally efficient and consume far fewer resources. In some cases, especially those demanding fast inference without adequate computing resources, BERT-based models are impractical to use. In contrast, LSTM-based models can effectively handle the task with far fewer parameters and faster inference speed.

Effect of Aspect Embedding Concatenation
In Section 6, we observe that ATAE (AA) performs worse than ATAE. We speculate the reason is that aspect embedding concatenation (AEC) adversely affects the aspect-aware gate mechanism of AALSTM, which is characterized by its ability to identify the beneficial and adverse information with respect to the given aspect at every time step. From Eqs. 1, 3, and 7 we can find that the aspect gate mechanism is controlled by A and h_{t-1}, which are concatenated together. The objective of the aspect-aware gate mechanism is to model the relation between the given aspect and the current semantic state, then determine how much information from the given aspect should be integrated into the three gates of LSTM. But because of AEC, much information about the given aspect is fused into the generated hidden states, so A and h_{t-1} overlap considerably, which greatly increases the difficulty of modeling their relation. Moreover, from Eqs. 2, 4, and 8 we can find that the weighted A is added to the internal values of the three gates of LSTM. The objective of this operation is to regulate the information flow of LSTM cells by adding a disturbance from A to the original internal value calculated from x_t and h_{t-1}. Without AEC, there is little aspect information in x_t and h_{t-1}, so the disturbance from A can effectively adjust the three gates of LSTM with respect to the given aspect. But if AEC is adopted, much aspect information is contained in x_t and h_{t-1}; as a result, the influence of the disturbance from A is small, and even invalid according to the experimental results. To confirm our conjecture, we conduct a set of ablation experiments based on the ATAE model to study the effect of AEC. Table 6 demonstrates the experimental results. We can find that if AEC is removed from ATAE (AA), the model's performance increases significantly. It is precisely the adoption of AEC that blocks the advantage of AALSTM.
The evaluation results of this set of ablation experiments provide solid evidence that AEC has an adverse influence on AALSTM. Based on the experimental results, we conclude that the purity of the input initial context word representations is critical for AALSTM to retain the important information and eliminate the adverse information according to the given aspect. Purity here means that the word embeddings input to AALSTM should contain only the semantic information of the context word itself. In fact, the ability to identify the important and adverse information in the overall semantic space of the whole review according to the given aspect is the foundation of AALSTM's aspect-aware gate mechanism and the reason why it can generate more effective hidden states tailored for the ABSA task.
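A scalar toy sketch of the aspect-aware gate idea discussed above (the real gates in Eqs. 1–8 use weight matrices and vector-valued states; the function and parameter names here are illustrative):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def aspect_aware_gate(x_t, h_prev, A, w_x, w_h, w_a, b):
    # A vanilla LSTM gate would be sigmoid(w_x * x_t + w_h * h_prev + b).
    # The aspect-aware variant adds a weighted disturbance term from the
    # aspect vector A, so the given aspect can modulate the gate value.
    # If A's information already leaks into x_t and h_prev (as happens with
    # aspect embedding concatenation), this extra term loses its effect.
    return sigmoid(w_x * x_t + w_h * h_prev + w_a * A + b)
```

With pure context inputs, varying A visibly shifts the gate; with AEC, x_t and h_prev already encode A, so the shift adds little new signal, which matches the ablation above.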

Investigation on the Effect of Aspect Numbers
In both real-world scenarios and the SemEval 2014 datasets, it is common for multiple aspects to be mentioned in the same review. As revealed in previous work, the adverse interaction between different aspects contained in the same review limits ABSA models' performance to a large extent. The fundamental cause of the multi-aspect problem is that the sentiment-indicative information of different aspects is intertwined and mutually interfering in the context modeling process. Even a perfect attention mechanism cannot resolve this problem: when predicting the sentiment of a given aspect, information harmful to it (but relevant to other aspects) is already contained in the generated hidden states, and the attention mechanism cannot change these states. In AALSTM, the aspect gate mechanism helps the three LSTM gates retain only the beneficial information and filter out the adverse information regarding the given aspect. In AABERTs, our designed input formats make BERT model the intra-sentence dependencies between the given aspect and its sentiment trigger words, retaining aspect-related information and eliminating aspect-irrelevant information in the generated hidden states. Theoretically, AALSTM and AABERTs can effectively alleviate the adverse multi-aspect interaction and improve ABSA models' performance in the multi-aspect situation. To prove this, we evaluate the models' performances on different test set groups divided by the number of aspects in one context. Table 7 lists the statistics of the divided test sets. To reduce result uncertainty, we group the instances of the #6, #7, and #13 subsets of the ATSA-Restaurant test set into one subset (>5), and remove the #6 subset from the ATSA-Laptop test set as well as the #4 subset from the ACSA-Restaurant test set. The performance comparisons are illustrated in Fig. 13 and Fig. 14. As illustrated in Fig.
13, AALSTM boosts IAN's performance in the multi-aspect situation. Across the different aspect-number groups, AALSTM is stably superior to vanilla LSTM. On the ATSA-Restaurant test set, when the aspect number rises from 1 to 2, as well as from 4 to 5, IAN and IAN (AA) show dramatically opposite trends: IAN's performance drops while IAN (AA)'s increases. On the ATSA-Laptop test set, when the aspect number changes from 1 to 2, IAN (AA)'s accuracy remains nearly unchanged, but IAN's drops significantly. Remarkably, on the ACSA task, when the aspect number reaches 3, IAN's performance declines significantly, whereas IAN (AA)'s not only does not decline but even improves, nearly linearly, as the aspect number increases. From Fig. 14, we can observe that on the ATSA task the performance trends of R-GAT+BERT1 and R-GAT2+BERT2 are similar as the aspect number increases. However, the performance of R-GAT+BERT1 fluctuates in the multi-aspect situation, while R-GAT2+BERT2 is more stable and generally performs better. On the ACSA task, we can also observe that AABERT1 (CLS) is more stable than BERT1 (CLS) and obtains better results.

Pre-trained from scratch on massive open-domain general texts, BERT is designed to transfer its abundant learned linguistic knowledge to the specific domains of downstream tasks (Devlin et al., 2019). We design a set of cross-domain evaluation experiments to test BERT's and our proposed AABERTs' abilities to learn general, domain-independent sentiment semantics. Evaluation results are demonstrated in Table 8. We can find that AABERTs are more robust for domain transfer, for two reasons. First, AABERTs can more effectively model the intra-sentence dependencies between aspects and their sentiment trigger words. Second, domain specificity is mainly characterized by the difference between the aspects of different domains.
Without thoroughly separating the context and aspect in the embedding space, the semantic spaces of context and aspect in AABERTs are smoother.

Conclusion
In this paper, we discover and define the aspect-agnostic problem, which widely exists in the context modeling process of ABSA models. We then argue that the semantics of the given aspect should be considered as a new clue out of context when modeling the context. We propose three streams of aspect-aware context encoders: an aspect-aware LSTM (AALSTM), an aspect-aware GCN (AAGCN), and three aspect-aware BERTs (AABERTs), which generate aspect-aware hidden states tailored for the ABSA task. Specifically, AALSTM adopts the aspect-aware gate mechanism, which dynamically adds a beneficial disturbance from the aspect to the original internal values of the three LSTM gates. In this way, the given aspect dynamically helps regulate the information flow in LSTM cells along the time steps. As a result, AALSTM can retain the sentiment-indicative information of the given aspect in the generated hidden states and eliminate useless information regarding the given aspect. We augment the vanilla GCN with an aspect-aware convolution gate, which regulates the information flow so as to aggregate aspect-related information from neighbor nodes into the current node while discarding useless information. Based on BERT, we flexibly modify the settings of the segment embedding and [SEP] token to avoid thoroughly isolating the context and the concatenated aspect as two individual sentences. Compared with BERT, our AABERTs can effectively model the aspect-aware intra-sentence dependencies in the context modeling process. Experimental evaluations verify that AALSTM, AAGCN, and AABERTs significantly outperform their vanilla counterparts on the ABSA task, demonstrating their capability of alleviating the aspect-agnostic problem. Additionally, experimental results show that our AAGCN works well with AALSTM and AABERTs, and our aspect-aware models achieve new state-of-the-art results among both LSTM-based and BERT-based models.
To our knowledge, this is the first work that focuses on the shortcomings of context encoders and leverages the aspect as a new clue out of context for context modeling in the ABSA task. We address the ABSA task from a new perspective, and this first investigation paves the way for aspect-aware extensions of context encoders based on other architectures, such as convolutional neural networks and memory networks. Our work also provides insights for other NLP tasks, such as relation classification, which aims to predict the relation between two given entities included in a sentence. In relation classification, the sentence encoder is generally not aware of the entity pair; therefore, an 'entity-pair agnostic' problem, a variant of the aspect-agnostic problem we point out in this paper, exists in that task. In the future, we hope to see more advances in solving the aspect-agnostic problem in the ABSA task, and we are keen to explore our aspect-aware mechanisms for tackling similar problems in other NLP tasks.