While the experimental literature on ambiguity is vast, only a few experimental papers look at ambiguous signals as we do (beyond Epstein and Halevy, we are only aware of Fryer et al. (2019)). Note, though, that our experiment has a distinctive feature not present in previous experiments on ambiguous signals. In our setting, the nature of the ambiguity of the received signals (feedback) is endogenously shaped by the choices of subjects: if the Red urn is only chosen in state 1, there is no ambiguity, as the feedback about Red urns is then informative only about the composition of the Red urn in state 1; by contrast, ambiguity is arguably maximal if the Red urn is picked with the same frequency in the two states. As far as we know, this endogenous character of the ambiguity has no counterpart in previous experiments on ambiguity.
Our paper is related to other strands of literature beyond the references already mentioned. A first line of research related to our study is the framework of case-based decision theory as axiomatized by Gilboa and Schmeidler (1995). Compared to case-based decision theory, in the valuation equilibrium approach the similarity weights given to the various actions in the various states are endogenously shaped by the strategy used by the subjects, an equilibrium feature that is absent from the subjective perspective adopted in Gilboa and Schmeidler.
Another line of research related to our study considers the possibility that the strategy used by subjects does not distinguish behaviors across different states (Samuelson (2001) and Mengel (2012) for theory papers, and Grimm and Mengel (2012), Cason et al. (2012) or Cownden et al. (2018) for experiments). Our study differs from that line of research in that subjects do adjust their behavior to the state but mix the payoff consequences of some actions (the unfamiliar ones) obtained over different states, thereby revealing that our approach cannot be captured by a restriction on the strategy space. Another line related to our study is that of the analogy-based expectation equilibrium (Jehiel (2005) and Jehiel and Koessler (2008)), in which beliefs about other players' behaviors are aggregated over different states. Our study differs from that literature in that we consider decision problems and not games. Yet, viewing nature as a player would reveal closer connections between the two approaches. To the best of our knowledge, no experiment in the vein of the analogy-based expectation equilibrium has considered environments similar to the one considered here.
A related experimental literature includes a recent strand concerned with selection neglect. Experimental papers in this vein include Esponda and Vespa (2018), Enke (2019) or Barron et al. (2019). These papers conclude, in various applications, that subjects tend to ignore that the data they see are selected. In our setting, the data related to Red are selected, and one can argue that subjects, by behaving in agreement with the (generalized) valuation equilibrium, do not seem to account for selection. Another related recent strand of experimental literature is concerned with the failure of contingent reasoning and/or some form of correlation neglect (see Enke and Zimmerman (2019), Martinez-Marquina et al. (2019) or Esponda and Vespa (2019)). Some of these papers (see in particular Martinez-Marquina et al.) conclude that hypothetical thinking is more likely to fail in the presence of uncertainty, which agrees with our finding that, in the presence of aggregate feedback, subjects find it hard to disentangle the value of choosing Red in the two states.
There are a number of contributions comparing reinforcement learning models to belief-based learning models in normal form games. While some of these contributions conclude that reinforcement learning models explain the observed experimental data better than belief-based learning models (Roth and Erev 1998, Camerer and Ho 1999), others suggest that it is not so easy to cleanly distinguish between these models (Salmon 2001, Hopkins 2002, Wilcox 2006). Our study is only loosely related to this debate, to the extent that we consider decision problems and not games and that subjects do not immediately experience the payoff consequences of their choices (the feedback received concerns all subjects in the lab, and subjects are only informed at the end how much they themselves earned). Relatedly, the feedback received about some possible choices is aggregated over different states, which was not considered in the previous experimental literature. Despite these differences, if one relates Bayesian learning models to belief-based learning models, our results suggest that the former perform less well than their reinforcement learning counterparts in our context, as in these other works.
Finally, one should mention the experimental work of Charness and Levin (2005), who consider decision problems in which, after seeing a realization of payoff in one urn, subjects have to decide whether or not to switch their choice of urn. In an environment in which subjects have probabilistic knowledge about how payoffs are distributed across choices and states (but have to infer the state from initial information), Charness and Levin observe that when there is a conflict between Bayesian updating and reinforcement learning, there are significant deviations from optimal choices. While the conclusion that subjects may rely on reinforcement learning more than on Bayesian reasoning is common to their study and our experiment, the absence of ex ante statistical knowledge about the distribution of payoffs across states in our experiment makes it clearly distinct from Charness and Levin's experiment. In our view, the absence of ex ante statistical knowledge better fits the motivating economic examples mentioned above.
Background and theory
In the context of our experiment, this section defines a generalization of the valuation equilibrium allowing for noisy best-responses in the vein of the quantal response equilibrium (McKelvey and Palfrey, 1995). We next propose two families of learning models: a similarity-based reinforcement learning model (allowing for coarse feedback on some alternatives and an ambiguity discount attached to those) and a generalized Bayesian model (also allowing for noisy best-responses and a discount on alternatives associated with coarse feedback). The learning models will then be estimated and compared in terms of fit in light of our experimental data.
Quantal valuation equilibrium
In the context of our experiment, there are two states, $s = 1$ and $s = 2$, that are equally likely. In state $s = 1$, the choice is between Blue and $Red_1$. In state $s = 2$, the choice is between Green and $Red_2$. The payoffs attached to these four alternatives are denoted by $v_{Blue} = 0.3$, $v_{Red_1}$, $v_{Red_2}$ and $v_{Green} = 0.7$, where $v_{Red_1}$ and $v_{Red_2}$ are left as free variables to accommodate the payoff specifications of the various treatments.
A strategy for the decision maker can be described as $\sigma = (p_1, p_2)$, where $p_i$ denotes the probability that $Red_i$ is picked in state $s = i$ for $i = 1, 2$. Following the spirit of the valuation equilibrium (Jehiel and Samet, 2007), a single valuation is attached to $Red_1$ and $Red_2$ so as to reflect that subjects in the experiment only receive aggregate feedback about the payoff obtained when a Red urn is picked, whether in state $s = 1$ or $s = 2$. Accordingly, let $v(Red)$ be the valuation attached to Red. Similarly, we denote by $v(Blue)$ and $v(Green)$ the valuations attached to the Blue and Green urns, respectively.
In equilibrium, we require that the valuations are consistent with the empirical observations as dictated by the equilibrium strategy $\sigma = (p_1, p_2)$. This implies that $v(Blue) = v_{Blue}$, $v(Green) = v_{Green}$ and, more interestingly, that
$$v(Red) = \frac{p_1 v_{Red_1} + p_2 v_{Red_2}}{p_1 + p_2} \qquad (1)$$
whenever $p_1 + p_2 > 0$. That is, $v(Red)$ is a weighted average of $v_{Red_1}$ and $v_{Red_2}$, where the relative weight given to $v_{Red_1}$ is $p_1/(p_1 + p_2)$, given that the two states $s = 1$ and $s = 2$ are equally likely and $Red_i$ is picked with probability $p_i$ for $i = 1, 2$. Based on the valuations $v(Red)$, $v(Blue)$ and $v(Green)$, the decision maker is viewed as picking a noisy best-response, for which we consider the familiar logit parameterization (with coefficient $\lambda$). Formally,
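Definition: A strategy $\sigma = (p_1, p_2)$ is a quantal valuation equilibrium if there exists a valuation system $(v(Blue), v(Green), v(Red))$ where $v(Blue) = 0.3$, $v(Green) = 0.7$, $v(Red)$ satisfies (1), and
$$p_1 = \frac{e^{\lambda v(Red)}}{e^{\lambda v(Red)} + e^{\lambda v(Blue)}}, \qquad p_2 = \frac{e^{\lambda v(Red)}}{e^{\lambda v(Red)} + e^{\lambda v(Green)}}.$$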
It should be stressed that $v(Red)$, $p_1$ and $p_2$ are jointly determined as a fixed point: the strategy $\sigma = (p_1, p_2)$ affects $v(Red)$ through (1), and $v(Red)$ determines the strategy $\sigma = (p_1, p_2)$ through the two logit equations in the definition.
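The fixed point can be approximated numerically. The following is a minimal sketch, not the authors' procedure: it assumes the logit form of the choice probabilities with a coefficient named `lam`, and it uses a damped iteration on $v(Red)$ whose starting point, damping factor and tolerance are arbitrary illustrative choices. The function names are hypothetical.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(1 / (1 + exp(-x)))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def quantal_valuation_equilibrium(v_red1, v_red2, lam, v_blue=0.3, v_green=0.7,
                                  damping=0.5, tol=1e-12, max_iter=10_000):
    """Return (p1, p2, v_red) found by damped fixed-point iteration on v(Red)."""
    v = 0.5 * (v_red1 + v_red2)  # arbitrary starting valuation (assumption)
    for _ in range(max_iter):
        log_p1 = log_sigmoid(lam * (v - v_blue))   # logit choice of Red in state 1
        log_p2 = log_sigmoid(lam * (v - v_green))  # logit choice of Red in state 2
        # Relative weight p1 / (p1 + p2), computed in logs to avoid underflow.
        w1 = math.exp(log_sigmoid(log_p1 - log_p2))
        v_new = w1 * v_red1 + (1.0 - w1) * v_red2  # consistency condition (1)
        if abs(v_new - v) < tol:
            return math.exp(log_p1), math.exp(log_p2), v
        v = (1.0 - damping) * v + damping * v_new
    raise RuntimeError("fixed-point iteration did not converge")


if __name__ == "__main__":
    # Payoff specifications of the three treatments, with an illustrative lam.
    for name, (r1, r2) in {"T1": (0.4, 0.8), "T2": (1.0, 0.6), "T3": (0.1, 0.9)}.items():
        p1, p2, v = quantal_valuation_equilibrium(r1, r2, lam=20.0)
        print(f"{name}: p1={p1:.3f}, p2={p2:.3f}, v(Red)={v:.3f}")
```

Iterating on the single number $v(Red)$ rather than on the pair $(p_1, p_2)$ keeps the sketch short; uniqueness of the fixed point is not checked here.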
We now briefly review what the quantal valuation equilibria look like in the payoff specifications corresponding to the various treatments. In this review, we consider the limiting case in which $\lambda$ goes to $\infty$ (thereby corresponding to the valuation equilibria as defined in Jehiel and Samet, 2007).
Treatment 1: $v_{Red_1} = 0.4$ and $v_{Red_2} = 0.8$
In this case, clearly $v(Red) > v(Blue) = 0.3$ (because $v(Red)$ is a convex combination of 0.4 and 0.8). Hence, optimality of the strategy in state $s = 1$ requires that the Red urn is always picked in state $s = 1$ ($p_1 = 1$). Regarding state $s = 2$, even if $Red_2$ were picked with probability 1, the resulting $v(Red)$ satisfying (1) would only be $\frac{0.4 + 0.8}{2} = 0.6$, which would lead the decision maker to pick the Green urn in state $s = 2$ given that $v(Green) = 0.7$. It follows that the only valuation equilibrium in this case requires $p_2 = 0$, so that the Red urn is only picked in state $s = 1$ (despite the Red urns being payoff superior in both states $s = 1$ and $s = 2$). In this equilibrium, consistency (i.e., equation (1)) implies that $v(Blue) < v(Red) = 0.4 < v(Green)$.
Treatment 2: $v_{Red_1} = 1$ and $v_{Red_2} = 0.6$
In this case too, $v(Red) > v(Blue) = 0.3$ (because any convex combination of 0.6 and 1 is larger than 0.3), and thus $p_1 = 1$. Given that $v_{Red_2} < v_{Red_1}$, this implies that the lowest possible valuation of Red is $\frac{1 + 0.6}{2} = 0.8$ (obtained when $p_2 = 1$). Given that this value is strictly larger than $v(Green) = 0.7$, it must be that $p_2 = 1$, thereby implying that the Red urns are picked in both states. Valuation equilibrium requires that $p_1 = p_2 = 1$, and consistency implies that $v(Blue) < v(Green) < v(Red) = 0.8$.
Treatment 3: $v_{Red_1} = 0.1$ and $v_{Red_2} = 0.9$
In this case, we will show that the Red urns are picked neither in state 1 nor in state 2. To see this, assume by contradiction that the Red urn is (sometimes) picked in at least one state. This implies that $v(Red) \geq v(Blue)$ (as otherwise the Red urns would never be picked in either state). If $v(Red) < v(Green)$, one should have $p_2 = 0$, thereby implying by consistency that $v(Red) = v_{Red_1} = 0.1$. But this would contradict $v(Red) \geq v(Blue) = 0.3$. If $v(Red) \geq v(Green)$, then $p_1 = 1$ (given that $v(Red) > v(Blue)$), and thus by consistency $v(Red)$ would be at most $\frac{0.1 + 0.9}{2} = 0.5$ (obtained when $p_2 = 1$). Given that $v(Green) = 0.7 > 0.5$, we get a contradiction, thereby implying that no Red urn can be picked in a valuation equilibrium.
As explained above, the value of $v(Red)$ in the valuation equilibrium varies from being below $v(Blue)$ in treatment 3, to lying between $v(Blue)$ and $v(Green)$ in treatment 1, to being above $v(Green)$ in treatment 2, thereby offering markedly different predictions for long-run choices across treatments. Allowing for noisy as opposed to exact best-responses would still differentiate the behaviors across the treatments, but in a less extreme form (clearly, if $\lambda = 0$, behaviors are random and follow the 50:50 lottery in every state and every treatment, but for any $\lambda > 0$, behaviors differ across treatments).
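The three limiting cases can be cross-checked by brute force. The sketch below enumerates the four pure strategies and keeps those that pass the consistency condition (1) and exact best-response. It rests on two assumptions that are ours: ties in valuations are ignored, and when Red is never picked (so that (1) is silent) $v(Red)$ is allowed to be any convex combination of the two Red payoffs. The function name is hypothetical.

```python
from itertools import product


def pure_valuation_equilibria(v_red1, v_red2, v_blue=0.3, v_green=0.7):
    """List pure strategies (p1, p2) passing consistency and best-response checks."""
    equilibria = []
    for p1, p2 in product((0, 1), repeat=2):
        if p1 + p2 > 0:
            # Consistency condition (1): weighted average of the Red payoffs.
            v_red = (p1 * v_red1 + p2 * v_red2) / (p1 + p2)
            best_p1 = 1 if v_red > v_blue else 0    # Red vs Blue in state 1
            best_p2 = 1 if v_red > v_green else 0   # Red vs Green in state 2
            if (p1, p2) == (best_p1, best_p2):
                equilibria.append((p1, p2))
        elif min(v_red1, v_red2) < v_blue:
            # Red never picked: ASSUME v(Red) can be any convex combination of
            # the Red payoffs, so (0, 0) survives if one lies below v(Blue).
            equilibria.append((p1, p2))
    return equilibria


if __name__ == "__main__":
    for name, payoffs in {"T1": (0.4, 0.8), "T2": (1.0, 0.6), "T3": (0.1, 0.9)}.items():
        print(name, pure_valuation_equilibria(*payoffs))
```

Under these assumptions the enumeration singles out one pure strategy per treatment, matching the case-by-case arguments in the text.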
We will consider two families of learning models to explain the choice data observed in the various treatments of the experiment: a similarity-based version of the reinforcement learning model, in which choices are made on the basis of the valuations attached to the various colors of urns and valuations are updated based on the observed feedback, and a Bayesian learning model, in which subjects update their prior belief about the composition of the Red urns based on the feedback they receive. In each case, we will assume that subjects care only about their immediate payoff and do not take into account the information that exploring beyond what maximizes their current payoff could bring. We believe this is justified to the extent that in the experiment there are twenty subjects making choices in parallel and the feedback is anonymous, making the informational value of experimentation by a single subject rather small (it would be exactly 0 if we were to consider infinitely large populations of subjects, and we are confident it is negligible when there are twenty subjects).
Similarity-based reinforcement learning
Standard reinforcement learning models assume that strategies are reinforced as a function of the payoff obtained from them. In the context of our experiment, subjects receive feedback about how the choices made by all subjects in the previous period translated into black (positive payoff) or white (null payoff) draws. More precisely, the feedback concerns the number of Black balls drawn when a Blue, Green or Red urn was picked in the previous period, as well as the number of times an urn with that color was then picked. Unlike in standard reinforcement learning, the payoff feedback obtained for some actions is coarse in our setting, hence the label similarity-based reinforcement. Accordingly, at each time $t = 2, \ldots, 70$, one can define for each possible color $C = B, R, G$ (for Blue, Red, Green) of urn(s) that was picked at least once at $t - 1$:
$$UC_t = \frac{\#(\text{Black balls drawn in urns with color } C \text{ at } t-1)}{\#(\text{an urn with color } C \text{ picked at } t-1)}. \qquad (2)$$
$UC_t$ represents the strength of urn(s) with color $C$ as reflected by the feedback received at $t$ about urns with such a color. Note the normalization by #(an urn with color $C$ picked at $t - 1$), so that $UC_t$ is comparable to a single payoff attached to choosing an urn with color $C$.
In other words, the value attached to color $C$ at $t$ is a convex combination of the value attached at $t - 1$ and the strength of $C$ as observed in the feedback at $t$. Observe that we allow the weight assigned to the feedback to differ between the Red urns on the one hand and the Blue and Green urns on the other. This reflects the idea that when a choice is better known, as is the case for more familiar alternatives (here identified with the Blue and Green urns), new feedback may be considered less important for determining its value. Accordingly, we would expect $\rho_F$ to be larger than $\rho_U$, and we will examine whether this is the case in our estimations.
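The updating step can be sketched as follows. This is an illustration under stated assumptions rather than the estimated model: `feedback_weight` is a hypothetical parameter giving the weight on the new feedback, and how it maps to the paper's familiar and unfamiliar weight parameters is deliberately left open. Keeping the old value when a color was not picked is also our assumption.

```python
def feedback_strength(black_draws: int, times_picked: int):
    """Share of black balls among draws from urns of one color, as in (2).

    Returns None when no urn of that color was picked in the previous period.
    """
    if times_picked == 0:
        return None
    return black_draws / times_picked


def update_valuation(previous_value: float, strength, feedback_weight: float) -> float:
    """Convex combination of last period's value and the new feedback strength."""
    if strength is None:
        # ASSUMPTION: with no feedback on this color, the valuation is unchanged.
        return previous_value
    return (1.0 - feedback_weight) * previous_value + feedback_weight * strength


if __name__ == "__main__":
    # Example: Red urns were picked 4 times last period and yielded 3 black balls.
    strength_red = feedback_strength(black_draws=3, times_picked=4)
    print(update_valuation(previous_value=0.5, strength=strength_red, feedback_weight=0.2))
```

Using a separate `feedback_weight` for Red and for the familiar colors reproduces the distinction drawn in the text between the two groups of urns.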
Given that the feedback concerning the Red urns is aggregated over states $s = 1$ and $s = 2$, there is extra ambiguity as to how well $BR_t$ represents the valuation of $Red_1$ or $Red_2$, as compared to how well $BB_t$ or $BG_t$ represent the valuations of Blue and Green. The valuation equilibrium (or its quantal extension as presented above) assumes that $BR_t$ is used to assess the strength of $Red_s$ whatever the state $s = 1, 2$. In line with the literature on ambiguity aversion as experimentally initiated by Ellsberg (1961), it is reasonable to assume that when assessing the urn $Red_s$, $s = 1, 2$, subjects apply a discount $\delta \geq 0$ to $BR_t$. Allowing for noisy best-responses in the vein of the logit specification, this leads to probabilities $p_{1t}$ and $p_{2t}$ of choosing $Red_1$ and $Red_2$.
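To make the role of the discount concrete, here is a minimal sketch of a discounted logit choice rule. It assumes, purely for illustration, that the discount is subtracted from the Red valuation before the logit is applied; the paper's exact expression is not reproduced here. The function and argument names (`logit_red_probability`, `lam`, `discount`) are hypothetical.

```python
import math


def logit_red_probability(red_value: float, familiar_value: float,
                          lam: float, discount: float = 0.0) -> float:
    """Probability of picking Red against one familiar urn under a logit rule."""
    # ASSUMPTION: the ambiguity discount enters subtractively on the Red valuation.
    gap = lam * ((red_value - discount) - familiar_value)
    # Stable logistic function of the scaled valuation gap.
    if gap >= 0:
        return 1.0 / (1.0 + math.exp(-gap))
    z = math.exp(gap)
    return z / (1.0 + z)


if __name__ == "__main__":
    # State 1 compares Red with Blue, state 2 compares Red with Green.
    print(logit_red_probability(0.55, 0.3, lam=5.0, discount=0.1))
    print(logit_red_probability(0.55, 0.7, lam=5.0, discount=0.1))
```

With `discount=0.0` the rule reduces to the plain logit response to valuations; a positive discount lowers the probability of choosing Red in both states.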
The learning model just described is parameterized by $(\rho_U, \rho_F, \delta, \lambda, BR_{init})$. In the next Section, these parameters will be estimated by maximum likelihood, pooling the data across all three treatments. Particular attention will be devoted to whether $\delta > 0$ is needed to better explain the data, whether $\rho_F > \rho_U$ as common sense suggests, as well as to the estimated value of $\lambda$ and the obtained likelihood, for comparison with the Bayesian model described next.
Generalized Bayesian Learning Model
As an alternative learning model, subjects could form some initial prior belief regarding the compositions of $Red_1$ and $Red_2$, say about the chance that there are $k_i$ black balls out of 10 in $Red_i$, and update these beliefs after seeing the feedback using Bayes' law.
Let us call $\beta_{init}(k_1, k_2)$ the initial prior belief of subjects that there are $k_i$ black balls out of 10 in $Red_i$. In the estimations, we will allow the subjects to consider that the number of black balls in either of the two Red urns can vary between $k_{inf}$ and $k_{sup}$, with $0 \leq k_{inf} \leq k_{sup} \leq 10$, and we will consider the uniform distribution over the various possibilities. That is, for any $(k_1, k_2) \in [k_{inf}, k_{sup}]^2$, $\beta_{init}(k_1, k_2) = 1/(k_{sup} - k_{inf} + 1)^2$.
To simplify the presentation a bit, we assume there is no learning on the Blue and Green urns, for which there is substantial initial information. At time $t + 1$, the feedback received by a subject can then be formulated as $(b, g, n)$, where $b$ and $g$ are the numbers of Blue and Green urns, respectively, that were picked at $t$, and $n$ is the number of black balls drawn from the Red urns. In the robustness checks, we also allow for Bayesian updating on the compositions of the Blue and Green urns, and find that adding learning on those urns does not change our conclusion.
To further simplify the presentation, we assume that subjects take the feedback they are exposed to as coming from an equal number of state $s = 1$ and state $s = 2$ decisions (allowing the subjects to treat these numbers as resulting from a Bernoulli distribution would not alter our conclusions; see the robustness check section for elaborations). In this case, the feedback can be presented in a simpler way, because knowing $(b, g, n)$ now allows subjects to infer that $m_1 = 10 - b$ choices of Red urns come from state $s = 1$ and $m_2 = 10 - g$ choices of Red urns come from state $s = 2$. Accordingly, we represent the feedback as $(m_1, m_2, n)$, where $m_i$ represents the number of $Red_i$ that were picked. Clearly, the probability of observing $(m_1, m_2, n)$ when there are $k_1$ and $k_2$ black balls in $Red_1$ and $Red_2$, respectively, is given by:
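$$\Pr(m_1, m_2, n \mid k_1, k_2) = \sum_{n_1 + n_2 = n} \binom{m_1}{n_1} \left(\frac{k_1}{10}\right)^{n_1} \left(1 - \frac{k_1}{10}\right)^{m_1 - n_1} \binom{m_2}{n_2} \left(\frac{k_2}{10}\right)^{n_2} \left(1 - \frac{k_2}{10}\right)^{m_2 - n_2},$$
where the sum runs over all $(n_1, n_2)$ with $0 \leq n_i \leq m_i$.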
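The Bayesian step can be sketched in a few lines. The code below is illustrative only and rests on assumptions that we state explicitly: draws are treated as independent with success probability $k_i/10$, so the likelihood of $n$ black balls is a convolution of two binomials; the prior is uniform on integer pairs between `k_inf` and `k_sup`; all function names are hypothetical.

```python
from math import comb


def binom_pmf(n: int, m: int, q: float) -> float:
    """Probability of n successes in m independent draws with success chance q."""
    if n < 0 or n > m:
        return 0.0
    return comb(m, n) * q ** n * (1.0 - q) ** (m - n)


def likelihood(m1: int, m2: int, n: int, k1: int, k2: int, balls: int = 10) -> float:
    """Chance of n black balls in total from m1 Red1 draws and m2 Red2 draws."""
    q1, q2 = k1 / balls, k2 / balls
    # Sum over every split of the n black balls between the two Red urns.
    return sum(binom_pmf(n1, m1, q1) * binom_pmf(n - n1, m2, q2)
               for n1 in range(0, min(m1, n) + 1))


def uniform_prior(k_inf: int, k_sup: int) -> dict:
    """Uniform belief over integer compositions (k1, k2) in [k_inf, k_sup]^2."""
    ks = range(k_inf, k_sup + 1)
    mass = 1.0 / len(ks) ** 2
    return {(k1, k2): mass for k1 in ks for k2 in ks}


def bayes_update(prior: dict, m1: int, m2: int, n: int) -> dict:
    """Posterior over (k1, k2) after observing the feedback (m1, m2, n)."""
    unnormalised = {k: p * likelihood(m1, m2, n, *k) for k, p in prior.items()}
    total = sum(unnormalised.values())
    if total == 0.0:
        raise ValueError("feedback has zero probability under the prior")
    return {k: w / total for k, w in unnormalised.items()}


if __name__ == "__main__":
    posterior = bayes_update(uniform_prior(0, 10), m1=6, m2=4, n=7)
    print(max(posterior, key=posterior.get))  # most likely composition pair
```

Because only the total $n$ is observed, the posterior cannot separate $k_1$ from $k_2$ unless $m_1$ and $m_2$ differ across periods, which is the aggregation problem the text emphasizes.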
Further Description of the Experimental Design
The computerized experiments were conducted in the Laboratory at Maison de Sciences Economiques (MSE) between March 2015 and November 2016, with some additional sessions run in March 2017. Upon arrival at the lab, subjects sat down at a computer terminal to start the experiment. Instructions were handed out and read aloud before the start of each session. The experiment consisted of three main treatments, which varied in the payoffs of the Red urns as explained above. In addition, we had two other treatments, referred to as controls, in which subjects received state-specific feedback about the Red urns, i.e., the feedback for $Red_1$ and $Red_2$ now appeared in two separate columns, for the two payoff specifications of treatments 1 and 2. The purpose of these control treatments was to check whether convergence to optimal choices was observed in such more standard feedback scenarios.
Each session involved 18-20 subjects, and four sessions were run for each treatment and control. Overall, 235 subjects drawn from the participant pool at the MSE, who were mostly students, participated in the experiment. Each session had seventy rounds. In all treatments, all sessions, and all rounds, subjects were split equally between two states, State 1 and State 2. Subjects were randomly assigned to a new state at the start of each round. The subjects knew the state they were assigned to, but did not know the payoff attached to the available actions in each state. In each state, players were asked to choose between two actions, as detailed in Figure 1. The feedback structure for the main treatments was as explained above. For the control group, the information structure was disaggregated. We use this as a baseline to show that under a simpler feedback structure, individuals learn to choose the best available option.
Subjects were paid a show-up fee of 5 €. In addition, they were given the opportunity to earn up to 10 € depending on their choices in the experiment. Specifically, for each subject, two rounds were drawn at random, and the subject earned an extra 5 € for each black ball that was drawn from their chosen urn in these two rounds. The average payment was around 11 € per subject, including the show-up fee. All of the sessions lasted between 1 hour and 1.5 hours, and subjects took longer to consider their choices at the start of the experiment.
We first present descriptive statistics and next present the structural analysis. In Figure 3, we report how the choices of urns vary with time and across treatments. Across all these sessions, initially, subjects are more likely to choose the Red urn than the Blue urn in state 1 and they are more likely to choose the Green urn than the Red urn in state 2. This is, of course, consistent with most theoretical approaches including the ones discussed above given that the Green urn is more rewarding than the Blue urn and the Red urns look (at least initially) alike in states 1 and 2.
The more interesting question concerns the evolution of choices. Roughly, in state 1, we see toward the final rounds a largely dominant choice of the Red urns in treatments 1 and 2, whereas Red in state 1 is chosen less than half the time in treatment 3.
Table of contents :
1 General Introduction
2 Multi-state choices with Aggregate Feedback on Unfamiliar Alternatives
2 Related Literature
3 Background and theory
3.1 Quantal valuation equilibrium
3.2 Learning Models
3.3 Similarity-based reinforcement learning
3.4 Generalized Bayesian Learning Model
4.1 Further Description of the Experimental Design
4.2 Preliminary findings
4.3 Statistical estimations
4.4 Comparing the Reinforcement learning model to the data
4.5 Individual level heterogeneity
5 Robustness Checks
3 Multi-state choices with Aggregate Feedback on Unfamiliar Alternatives: a Followup
1 Introduction: Why a follow up?
2 Experimental design: How is it different?
3.1 Preliminary findings
3.2 Statistical estimations
3.3 Individual level heterogeneity
4 Endogenous Institutions: a Network Experiment in Nepal
2.1 Overview: Networks and Data
2.2 Experimental context
3.1 Preliminary findings and limitations
3.2 Statistical Estimation
4 The model
List of Figures
List of Tables