Causes of anti-efficient encoding
We explore the roots of anti-efficiency by looking at the behavior of untrained Speakers and Listeners. Earlier work conjectured that ZLA emerges from the competing pressures to communicate in a perceptually distinct and articulatorily efficient manner [Zipf, 1949, Kanwal et al., 2017]. For our networks, there is a clear pressure from Listener in favor of ease of message discriminability, but Speaker has no obvious reason to save on “articulatory” effort. We thus predict that the observed pattern is driven by a Listener-side bias.
Untrained Speaker behavior
For each input i drawn from the power-law distribution without replacement, we get a message m from 90 distinct untrained Speakers (30 Speakers for each hidden size in [100, 250, 500]). We experiment with two different association processes. In the first, we associate the first generated m with i, irrespective of whether it was already associated with another input. In the second, we keep generating a message m for i until we get one that was not already associated with a different input. The second version is closer to the MT process (see Section 2.4.2). Moreover, message uniqueness is a reasonable constraint, since, in order to succeed, Speakers need first of all to keep messages denoting different inputs apart.
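The two association processes can be sketched as follows. This is only an illustration under stated assumptions: the untrained Speaker is replaced by a uniform random symbol generator (the stand-in the text itself compares it to), and the names `EOS`, `sample_message` and `associate`, as well as the way `max_len` is enforced, are hypothetical choices rather than the authors' implementation.

```python
import random

EOS = 0  # hypothetical id for the end-of-sequence symbol


def sample_message(a, max_len, rng):
    """Sample symbols uniformly from range(a) and stop at EOS.

    Assumption: if EOS has not appeared after max_len - 1 symbols,
    it is appended, so every message has between 1 and max_len symbols.
    """
    message = []
    while len(message) < max_len - 1:
        symbol = rng.randrange(a)
        message.append(symbol)
        if symbol == EOS:
            return tuple(message)
    message.append(EOS)
    return tuple(message)


def associate(inputs, a, max_len, unique, rng, max_tries=100_000):
    """Map each input to a message using one of the two processes."""
    mapping = {}
    used = set()
    for i in inputs:
        m = sample_message(a, max_len, rng)
        tries = 1
        # Second process only: resample until the message is unused.
        while unique and m in used:
            if tries >= max_tries:
                raise RuntimeError("no unused message found")
            m = sample_message(a, max_len, rng)
            tries += 1
        used.add(m)
        mapping[i] = m
    return mapping


rng = random.Random(0)
inputs = range(1000)  # stand-in for inputs in the order they were drawn
first_wins = associate(inputs, a=40, max_len=30, unique=False, rng=rng)
all_unique = associate(inputs, a=40, max_len=30, unique=True, rng=rng)
```

With `unique=False` several inputs may share a message, whereas `unique=True` guarantees one distinct message per input, which is the property the second process is meant to enforce.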
Figure 2 shows that untrained Speakers have no prior toward outputting long sequences of symbols. More precisely, the untrained Speakers’ average message length coincides with the one produced by the random process defined in Eq. 3 with p = 1/a (see footnote 4). In other words, untrained Speakers are equivalent to a random generator with uniform probability over symbols. Consequently, when message uniqueness is imposed, untrained Speakers become identical to MT. Hence, Speakers faced with the task of producing distinct messages for the inputs would, if the vocabulary size is not too large, naturally produce a ZLA-obeying distribution, which is radically altered in joint Speaker-Listener training.
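To make the comparison with a uniform random generator concrete, the sketch below computes the average message length of such a generator. It rests on assumptions that are not taken from Eq. 3 itself (which is not reproduced in this section): the end-of-sequence symbol is emitted with probability 1/a at every step, length counts that symbol, and messages are truncated at max_len. The function names are hypothetical.

```python
def expected_length(a, max_len):
    """Closed form for E[min(G, max_len)], with G geometric with p = 1/a."""
    p = 1.0 / a
    return (1.0 - (1.0 - p) ** max_len) / p


def expected_length_by_sum(a, max_len):
    """Same quantity, summed explicitly over all possible lengths."""
    p = 1.0 / a
    total = 0.0
    for length in range(1, max_len):
        # EOS appears for the first time at position `length`.
        total += length * p * (1.0 - p) ** (length - 1)
    # No EOS in the first max_len - 1 positions: the message is cut off.
    total += max_len * (1.0 - p) ** (max_len - 1)
    return total


short_alphabet = expected_length(a=40, max_len=30)
large_alphabet = expected_length(a=1000, max_len=30)
```

Under these assumptions the average length grows with a and approaches max_len for large alphabets, since the stopping symbol becomes rare.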
Untrained Listener behavior
Having shown that untrained Speakers do not favor long messages, we ask next if the emergent anti-efficient language is easier to discriminate by untrained Listeners than other encodings. To this end, we compute the average pairwise L2 distance of the hidden representations produced by untrained Listeners in response to messages associated to all inputs. Messages that are further apart in the representational space of the untrained Listener should be easier to discriminate. Thus, if Speaker associates such messages to the inputs, it will be easier for Listener to distinguish them.
Footnote 4: Note that we did not use the uniqueness-of-messages constraint to define P_l.
Specifically, we use 50 distinct untrained Listeners with 100-dimensional hidden size. We test four different encodings: (1) emergent messages (produced by trained Speakers), (2) MT messages (25 runs), (3) OC messages, and (4) human languages. Note that MT is equivalent to untrained Speaker, as their messages share the same length and alphabet distribution (see Section 3.2.1). We study Listeners’ biases with max_len = 30 while varying a, as messages are more distinct from the reference distributions in that case (see Figure A3 in Appendix A.1.4). Results are reported in Figure 3. Representations produced in response to the emergent messages have the highest average distance. MT only approximates the emergent language for a = 1000, where, as seen in Figure 1 above, MT is anti-efficient. The trained Speaker messages are hence a priori easier to discriminate for untrained Listeners. The length of these messages could thus be explained by an intrinsic Listener bias, as conjectured above. Interestingly, natural languages are not easy for Listeners to process. This suggests that the emergence of “natural” languages in LSTM agents is unlikely without imposing ad hoc pressures.
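The discriminability measure described above can be illustrated with a small sketch. Note the assumptions: the Listeners in the text are LSTMs, whereas the code uses a tiny randomly initialized tanh recurrent network as a stand-in, and all names (`mean_pairwise_l2`, `make_untrained_rnn`, `encode`) and sizes are hypothetical.

```python
import math
import random
from itertools import combinations


def mean_pairwise_l2(vectors):
    """Average Euclidean distance over all unordered pairs of vectors."""
    pairs = list(combinations(vectors, 2))
    if not pairs:
        raise ValueError("need at least two vectors")
    return sum(math.dist(u, v) for u, v in pairs) / len(pairs)


def make_untrained_rnn(a, hidden_size, rng):
    """Random symbol embeddings and recurrent weights (never trained)."""
    bound = 1.0 / math.sqrt(hidden_size)
    embed = [[rng.uniform(-bound, bound) for _ in range(hidden_size)]
             for _ in range(a)]
    recur = [[rng.uniform(-bound, bound) for _ in range(hidden_size)]
             for _ in range(hidden_size)]
    return embed, recur


def encode(message, embed, recur):
    """Final hidden state after reading the message symbol by symbol."""
    h = [0.0] * len(recur)
    for symbol in message:
        h = [math.tanh(embed[symbol][j]
                       + sum(w * x for w, x in zip(recur[j], h)))
             for j in range(len(recur))]
    return h


rng = random.Random(0)
embed, recur = make_untrained_rnn(a=5, hidden_size=8, rng=rng)
messages = [(1, 2, 0), (3, 3, 3, 0), (4, 0)]  # toy encoding of three inputs
score = mean_pairwise_l2([encode(m, embed, recur) for m in messages])
```

A larger `score` for one encoding than for another would indicate that its messages land further apart in the untrained network's hidden space, which is the sense of "easier to discriminate" used in the text.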
Adding a length minimization pressure
We next impose an artificial pressure on Speaker to produce short messages, to counterbalance Listener’s preference for longer ones. Specifically, we add a regularizer disfavoring longer messages to the original loss:
$\mathcal{L}'(i, L(m), m) = \mathcal{L}(i, L(m)) + \alpha \times |m|$ (4)
where $\mathcal{L}(i, L(m))$ is the cross-entropy loss used before, |·| denotes length, and α is a hyperparameter. The non-differentiable term α × |m| is handled seamlessly, as it depends only on Speaker’s parameters $\theta_s$ (which specify the distribution of the messages m), and the gradient of the loss w.r.t. $\theta_s$ is estimated via a REINFORCE-like term (Eq. 1). Figure 4 shows the emergent message length distribution under this objective, comparing it to the reference distributions in the most human-language-like setting (max_len = 30, a = 40). The same pattern is observed elsewhere (see Appendix A.1.8, which also evaluates the impact of the α hyperparameter). The emergent messages clearly follow ZLA. Speaker now assigns messages of ascending length to the 40 most frequent inputs. For the remaining ones, it chooses messages with relatively similar, but notably longer, lengths (always much shorter than MT messages). Still, the encoding is not as efficient as the one observed in natural language (and OC). Also, when adding length regularization, we noted slower convergence and a smaller number of successful runs, which diminishes further as α increases.
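The penalized objective of Eq. 4 can be sketched per sample as below. This is a minimal illustration, not the authors' training code: the function names are hypothetical, the message length is assumed to count every emitted symbol, Eq. 1 is not reproduced in this section so the surrogate follows the generic REINFORCE form, and the optional `baseline` argument is a common variance-reduction device that the text does not mention.

```python
import math


def cross_entropy(listener_probs, target):
    """Negative log-probability that Listener assigns to the true input."""
    return -math.log(listener_probs[target])


def penalized_loss(listener_probs, target, message, alpha):
    """Eq. 4: cross-entropy plus alpha times the message length."""
    return cross_entropy(listener_probs, target) + alpha * len(message)


def reinforce_surrogate(loss_value, message_log_prob, baseline=0.0):
    """Scalar whose gradient w.r.t. Speaker parameters gives the estimate.

    loss_value is treated as a constant; only message_log_prob, the
    log-probability of the sampled message under Speaker, carries gradient.
    """
    return (loss_value - baseline) * message_log_prob


listener_probs = [0.5, 0.25, 0.25]  # toy Listener output over three inputs
short_message = (3, 0)
long_message = (3, 3, 3, 3, 0)
loss_short = penalized_loss(listener_probs, 0, short_message, alpha=0.5)
loss_long = penalized_loss(listener_probs, 0, long_message, alpha=0.5)
```

For the same Listener output, the longer message incurs a loss that is larger by α times the difference in length, so the Speaker update pushes probability mass toward shorter messages that remain discriminable.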
Symbol distributions in the emergent code
We conclude with a high-level look at what the long emergent messages are made of. Specifically, we inspect symbol unigram and bigram frequency distributions in the messages produced by trained Speaker in response to the 1K inputs (the eos symbol is excluded from counts). For direct comparability with natural language, we report results in the (max_len = 30, a = 40) setting, but the patterns are general. We observe in Figure 5(a) that, even if at initialization Speaker starts with a uniform distribution over its alphabet (not shown here), by the end of training it has converged to a very skewed one.
Figure 4: Mean length of messages across successful runs as a function of input frequency rank for max_len = 30, a = 40, α = 0.5. Natural language distributions are smoothed as in Fig. 1.
(See Figure 8(a) in Appendix A.2.1 for the entropy analysis.) We then investigate message structure by looking at the symbol bigram distribution. To this end, we build 25 randomly generated control codes, constrained to have the same mean length and unigram symbol distribution as the emergent code. Intriguingly, we observe in Figure 5(b) a significantly more skewed emergent bigram distribution, compared to the controls. This suggests that, despite the lack of phonetic pressures, Speaker is respecting “phonotactic” constraints that are even sharper than those reflected in the natural language bigram distributions (see Figure 8(b) in Appendix A.2.1 for entropy analysis). In other words, the emergent messages are clearly not built out of random unigram combinations. Looking at the pattern more closely, we find the skewed bigram distribution to be due to a strong tendency to repeat the same character over and over, well beyond what is expected given the unigram symbol skew (see typical message examples in Appendix A.2). More quantitatively, across all runs with max_len = 30, if we denote the 10 most probable symbols by $s_1, \dots, s_{10}$, then we observe $P(s_r, s_r) > P(s_r)^2$ for $r \in \{1, \dots, 10\}$ in more than 97.5% of runs. We leave a better understanding of the causes and implications of these distributions to future work.
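The repetition test and the control codes can be illustrated with the sketch below. It assumes messages are sequences with the eos symbol already removed; shuffling symbols across messages is just one possible way to build a control with the same lengths and unigram counts, not necessarily the procedure used for the 25 controls, and all function names are hypothetical.

```python
import random
from collections import Counter


def ngram_probs(messages):
    """Relative frequencies of symbols and of adjacent symbol pairs."""
    unigrams, bigrams = Counter(), Counter()
    for m in messages:
        unigrams.update(m)
        bigrams.update(zip(m, m[1:]))
    n_uni, n_bi = sum(unigrams.values()), sum(bigrams.values())
    p_uni = {s: c / n_uni for s, c in unigrams.items()}
    p_bi = {b: c / n_bi for b, c in bigrams.items()} if n_bi else {}
    return p_uni, p_bi


def repetition_excess(messages, top_k=10):
    """P(s, s) - P(s)^2 for each of the top_k most probable symbols."""
    p_uni, p_bi = ngram_probs(messages)
    top = sorted(p_uni, key=p_uni.get, reverse=True)[:top_k]
    return {s: p_bi.get((s, s), 0.0) - p_uni[s] ** 2 for s in top}


def shuffled_control(messages, rng):
    """Control code: same lengths and symbol counts, random order."""
    pool = [s for m in messages for s in m]
    rng.shuffle(pool)
    control, start = [], 0
    for m in messages:
        control.append(tuple(pool[start:start + len(m)]))
        start += len(m)
    return control


repetitive = ["aaaa", "bbbb"]   # toy code that repeats symbols
alternating = ["abab", "baba"]  # toy code that never repeats a symbol
excess_rep = repetition_excess(repetitive)
excess_alt = repetition_excess(alternating)
control = shuffled_control([(1, 1, 1), (2, 2)], random.Random(0))
```

A positive value for a symbol means it follows itself more often than its unigram probability alone would predict, which is the inequality reported in the text; comparing against `shuffled_control` outputs gives a baseline that shares the unigram skew but not the ordering.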
We found that two neural networks faced with a simple communication task, in which they have to learn to generate messages to refer to a set of distinct inputs that are sampled according to a power-law distribution, produce an anti-efficient code where more frequent inputs are significantly associated to longer messages, and all messages are close to the allowed maximum length threshold. The results are stable across network and task hyperparameters (although we leave it to further work to replicate the finding with different network architectures, such as transformers or CNNs). Follow-up experiments suggest that the emergent pattern stems from an a priori preference of the listener network for longer, more discriminable messages, which is not counterbalanced by a need to minimize articulatory effort on the side of the speaker. Indeed, when an artificial penalty against longer messages is imposed on the latter, we see a ZLA distribution emerging in the networks’ communication code.
From the point of view of AI, our results stress the importance of controlled analyses of language emergence. Specifically, if we want to develop artificial agents that naturally communicate with humans, we want to ensure that we are aware of, and counteract, their unnatural biases, such as the one we uncovered here in favor of anti-efficient encoding. We presented a proof-of-concept example of how to get rid of this specific bias by directly penalizing long messages in the cost function, but future work should look into less ad hoc ways to condition the networks’ language. Getting the encoding right seems particularly important, as efficient encoding has been observed to interact in subtle ways with other important properties of human language, such as regularity and compositionality [Kirby, 2001]. We also emphasize the importance of using power-law input distributions when studying language emergence, as such distributions are a universal property of human language [Zipf, 1949, Baayen, 2001] that has been largely ignored in previous simulations, which assume uniform input distributions.
ZLA is observed in all studied human languages. As mentioned above, some animal communication systems violate it [Heesen et al., 2019], but such systems (1) are limited in their expressivity and (2) do not display a significantly anti-efficient pattern. We complemented this earlier comparative research with an investigation of emergent language among artificial agents that need to signal a large number of different inputs. We found that the agents develop a successful communication system that does not exhibit ZLA, and is actually significantly anti-efficient. We connected this to an asymmetry in speaker vs. listener biases. This in turn suggests that, in general, ZLA in communication does not emerge from trivial statistical properties, but from a delicate balance of speaker and listener pressures. Future work should investigate emergent distributions in a wider range of artificial agents and environments, trying to understand which factors determine them.
Table of contents:
1.1 Universal language properties
1.2 Why neural networks?
1.3 Signaling Game
2 Word Length
2.1 Anti-efficient encoding in emergent communication
2.1.5 Supplementary Material
2.2 “LazImpa”: Lazy and Impatient neural agents learn to communicate efficiently
2.2.3 Analytical method
2.2.6 Supplementary Material
3 Word Order
3.2 Related Work
3.6 Supplementary Material
4 Semantic Categorization – Color Naming
4.2 Color-naming task
4.3 Evaluating the accuracy/complexity trade-off
4.4 Experiments and Results
4.6 Materials and Methods
4.7 Supplementary Material
5.4 Generalization emerges “naturally” if the input space is large
5.5 Generalization does not require compositionality
5.6 Compositionality and ease of transmission
5.8 Supplementary Material
6 General Discussion
6.1 Universal language properties in emergent languages
6.2 More interpretable AI
6.3 Future directions