We cannot fully assess the correctness of explanations by evaluating only a single token. To make the verification more reliable, we run a second test (implementation is here). The aim is to predict the key pair of tokens: a pair that, when masked, changes the model prediction.
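To make the definition of a key pair concrete, here is a minimal sketch, not the package's actual implementation. The names `MASK`, `mask_tokens`, `find_key_pairs`, and `is_key_pair` are hypothetical, and `predict` stands in for any function that maps a token list to a predicted label.

```python
from itertools import combinations
from typing import Callable, List, Sequence, Tuple

MASK = "[MASK]"  # assumed mask token; the real model may use a different one


def mask_tokens(tokens: Sequence[str], positions: Tuple[int, ...]) -> List[str]:
    """Return a copy of `tokens` with the given positions replaced by MASK."""
    return [MASK if i in positions else token for i, token in enumerate(tokens)]


def find_key_pairs(
    tokens: Sequence[str],
    predict: Callable[[Sequence[str]], str],
) -> List[Tuple[int, int]]:
    """Exhaustively list index pairs whose masking changes the prediction."""
    baseline = predict(tokens)
    return [
        pair
        for pair in combinations(range(len(tokens)), 2)
        if predict(mask_tokens(tokens, pair)) != baseline
    ]


def is_key_pair(
    tokens: Sequence[str],
    pair: Tuple[int, int],
    predict: Callable[[Sequence[str]], str],
) -> bool:
    """Check whether the pair chosen by a recognizer is a key pair."""
    return predict(mask_tokens(tokens, pair)) != predict(tokens)


# Toy stand-in for a sentiment model (hypothetical, for illustration only).
def toy_predict(tokens: Sequence[str]) -> str:
    return "positive" if {"good", "great"} & set(tokens) else "neutral"


example = ["good", "and", "great", "food"]
key_pairs = find_key_pairs(example, toy_predict)
```

In this toy example, only masking both "good" and "great" changes the prediction, so the single key pair is `(0, 2)`. The exhaustive search needs one model call per pair, which is why it serves here as a test oracle rather than as a recognizer.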
In the table below, we compare four pattern recognizers, as we did in the previous test. For example, around 41% of the restaurant test examples have at least one key token pair (we filter out the others). In around 49% of those cases, the token pair chosen by the basic pattern recognizer is a key pair. The basic pattern recognizer is still the most accurate, but its advantage over the other methods has diminished. Note that this test covers the previous key token test, so the results are correlated.
It is usually practically impossible to retrieve the ground truth (there are too many combinations to check), which is an unfortunate consequence of not knowing the model's reasoning. In most test cases, we cannot say exactly how accurate pattern recognizers are, but we can still compare them. Below, we check the correctness of the basic pattern recognizer by comparing it against the other methods (in the restaurant domain). This is an alternative approach to measuring the performance of a pattern recognizer.
In the matrices, on-diagonal values count cases in which the two recognizers behave similarly: both choose a pair that, when masked, either flips the model decision (the bottom-right cell) or does not (the upper-left cell). Off-diagonal values are more revealing because they expose differences. The bottom-left cell counts examples in which the basic recognizer uncovers a key pair while the other recognizer does not, and the upper-right cell counts the opposite. The upper-right value also helps to estimate an upper bound on the precision of the basic recognizer, because it counts cases in which the basic recognizer demonstrably misses a key pair. To sum up, the basic recognizer should maximize the bottom-left value and minimize the upper-right value. From this perspective as well, the basic pattern recognizer stands out from the other methods (more test results are here).
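The matrix layout described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the code behind the reported results: `comparison_matrix` is a hypothetical helper, and each input list is assumed to hold one boolean per test example, saying whether the pair chosen by that recognizer flipped the model decision.

```python
from typing import List, Sequence


def comparison_matrix(
    basic_hits: Sequence[bool],
    other_hits: Sequence[bool],
) -> List[List[int]]:
    """Count agreements and disagreements between two recognizers.

    Rows index the basic recognizer and columns the other one
    (0 = chosen pair does not flip the decision, 1 = it does), so:
      matrix[0][0] upper-left:   neither pair flips the decision
      matrix[0][1] upper-right:  only the other recognizer finds a key pair
      matrix[1][0] bottom-left:  only the basic recognizer finds a key pair
      matrix[1][1] bottom-right: both recognizers find a key pair
    """
    if len(basic_hits) != len(other_hits):
        raise ValueError("Both recognizers must be scored on the same examples.")
    matrix = [[0, 0], [0, 0]]
    for basic, other in zip(basic_hits, other_hits):
        matrix[int(basic)][int(other)] += 1
    return matrix


# Made-up outcomes for five examples, for illustration only.
basic = [True, True, False, True, False]
other = [True, False, False, False, True]
matrix = comparison_matrix(basic, other)
```

With these made-up outcomes, the bottom-left cell `matrix[1][0]` is 2 and the upper-right cell `matrix[0][1]` is 1, so the basic recognizer wins two of the three disagreements.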