
Incentives to Learn Calibration

This chapter is a joint work with Guillaume Hollard.


Over the past decades, economists and psychologists have documented a long list of biases, i.e. substantial and systematic deviations from the predictions of standard economic theory. Many economists argue that these biases only matter if they survive in an economic environment: if the right incentives are provided, subjects should realize that they are making costly mistakes and change the way they approach subsequent decision tasks. In this chapter, we test this claim for a particular bias, namely miscalibration. We create an experimental setting that provides strong incentives (decisions have monetary consequences, successful others can be imitated, feedback is provided, trials are repeated, etc.). We then test in a subsequent decision task whether subjects still display miscalibration.

What is miscalibration and why is it important to economists?

Calibration refers to an individual's capacity to choose a given level of risk. In a typical experiment designed to measure miscalibration, subjects are asked to provide subjective confidence intervals for a set of questions. For example, if the question is « What was the unemployment rate in France in the first quarter of 2007? » and the subject provides the 90% confidence interval [7%, 15%], it means that the subject thinks there is a 90% chance that this interval contains the correct answer. A perfectly calibrated subject's intervals should contain the correct answer 90% of the time. A robust finding, however, is that almost all subjects are miscalibrated: on average, 90% subjective confidence intervals only contain the correct answer between, say, 30% and 50% of the time. 1 Glaser et al. (2005) found an even stronger miscalibration among professional traders.
Miscalibration is a bias with important economic consequences, since miscalibrated people suffer losses on experimental markets (Bonnefon et al., 2005; Biais et al., 2005). Furthermore, it is likely that this bias affects the behavior of real traders acting on real markets. It therefore makes sense for economists to try to reduce miscalibration and to study the best incentives for doing so.
Lichtenstein & Fischhoff (1980) attempted to reduce miscalibration by providing subjects with feedback on their performance. They found that 23 sessions, each lasting about an hour, were required to substantially improve subjects' calibration. Several other psychologists have used various techniques to reduce miscalibration (Pickhardt & Wallace, 1974; Adams & Adams, 1958), with little success so far.

Miscalibration thus appears to be a very robust bias.

Since miscalibration is a bias linked to overconfidence, one could expect men and women to be affected by it differently. Indeed, Lundeberg et al. (1994) found that men are generally more overconfident than women, even though this depends very much on the task. In financial matters, Prince (1993) and Barber & Odean (2001) found that men exhibit more overconfident behavior than women. As far as calibration per se is concerned, Pallier (2003)'s findings suggest that men are more confident in the accuracy of their knowledge. However, his way of measuring calibration is quite different from ours: his subjects had to answer multiple-choice questions and assess a level of confidence, while ours have to provide confidence intervals at a specified level. These two types of calibration tasks are sufficiently different to produce distinct results.
This chapter proposes to provide as many incentives as possible to reduce miscalibration. As it constitutes one of the authors' first attempts at experimental economics, the design is not as sound as it should have been. In particular, it does not allow one to disentangle the effects of the several incentives implemented. The main result is that our experimental setting slightly reduces overconfident miscalibration, but only for males. It could be that different incentives have opposite effects on subjects' miscalibration, which might explain the weakness of the results, but our experimental design does not allow us to check this possibility.
The remainder of the chapter is organized as follows. Section 2 presents the experimental design, Section 3 presents the results, Section 4 discusses them, and Section 5 provides some concluding remarks.

Experimental design

The measure of miscalibration and associated overconfidence relies on a now standard protocol. Subjects have to provide 90% subjective confidence intervals for a set of 10 quiz questions. On average, a perfectly calibrated subject's intervals should capture the correct answer 9 times out of 10; if this is not the case, the subject is miscalibrated. Subjects are then asked to estimate their hit rate. The difference between their estimated hit rate and their actual one is a classical measure of overconfidence. This protocol thus serves as a benchmark for measuring miscalibration and overconfidence in our experiment.
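The two quantities the protocol produces, miscalibration and overconfidence, can be sketched as follows (a minimal illustration with made-up intervals and answers, not data from the experiment):

```python
def hit_rate(intervals, answers):
    """Fraction of correct answers that fall inside the stated intervals."""
    return sum(lo <= a <= hi for (lo, hi), a in zip(intervals, answers)) / len(answers)

# Ten 90% intervals and the true answers (purely illustrative numbers).
intervals = [(2, 9), (10, 30), (1, 4), (50, 80), (0, 5),
             (100, 300), (7, 15), (20, 25), (3, 12), (40, 60)]
answers = [5, 35, 2, 90, 3, 250, 22, 23, 8, 45]

actual = hit_rate(intervals, answers)   # 7/10 = 0.7 here
miscalibration = actual - 0.9           # actual minus required hit rate: -0.2
estimated = 0.8                         # the subject's own guess of his hit rate
overconfidence = estimated - actual     # classical overconfidence measure: 0.1
```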
The experimental subjects were divided into two groups. Subjects in the first group attended a training session and then performed a baseline treatment aimed at measuring their miscalibration according to the standard protocol. The principle of this training session is to offer a whole set of experimental incentives that enhance learning (monetary incentives, tournament, feedback, loss framing). The second group, the control group, performed the baseline treatment only.
At first glance, testing the effect of incentives seems possible by simply rewarding performance in the basic miscalibration task used as a benchmark. This seems natural but cannot be implemented, since there is no simple incentive scheme that rewards correct calibration. Think, for example, of an incentive scheme that pays a high reward if the difference between the required hit rate, say 90%, and the actual hit rate (measured over a set of 10 questions) is small. A rational subject can use very wide intervals for 9 questions and a very narrow one for the remaining question. He is thus certain to appear correctly calibrated, while he is not. Cesarini et al. (2006) chose to provide incentives for the evaluation of the calibration task only (subjects had to guess how many correct answers fell inside the intervals provided by themselves and by their peers) and made miscalibrated subjects go through the task again. We chose to use a task similar to the calibration task in which we can provide the necessary incentives. This task, described in the following section, aims at making subjects realize that they have a hard time calibrating the level of risk they wish to take. After completing this training task, subjects had to complete a standard calibration task for which, as in Cesarini et al. (2006), we only provided incentives for the subsequent self-evaluation of their performance. A control group that did not go through the training task also completed the calibration task, enabling us to measure the effect of training.
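The gaming argument can be made concrete with a short sketch (hypothetical numbers; the point is only that a scheme rewarding a small gap between required and actual hit rates is exploitable):

```python
import random

def gamed_intervals(n=10):
    """9 intervals so wide they surely contain the answer, plus one point
    interval far outside the plausible range (a certain miss): a guaranteed
    9/10 hit rate with zero knowledge of the questions."""
    wide = (-10**9, 10**9)
    point = (2 * 10**9, 2 * 10**9)
    return [wide] * (n - 1) + [point]

# Whatever the true answers turn out to be, the apparent hit rate is 90%.
answers = [random.uniform(-1000, 1000) for _ in range(10)]
hits = sum(lo <= a <= hi for (lo, hi), a in zip(gamed_intervals(), answers))
print(hits / 10)  # 0.9: looks perfectly calibrated at the 90% level, but is not
```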

The training period

In the training period, the participants were asked to answer a set of twenty questions: ten questions on general knowledge followed by ten questions on economic knowledge.
Some of the general-knowledge questions were taken from Biais et al. (2005)'s experiment. Half of the subjects had to answer the 20 questions in a given order; the other half saw the questions in reverse order. This enabled us to check for learning effects during the training period.
In this training period, the subjects were provided with a reference interval for each question which they could be 100% sure contained the correct answer. Subjects had to give an interval included in the reference interval. Payoffs were expressed in experimental currency units (ECUs), converted into euros at the end of the experiment at the rate of 1 euro for 100 ECUs. Each player received an initial endowment of 2000 ECUs after receiving the instructions but before beginning to answer the questions. They were told that 100 ECUs were at stake for each of the twenty questions, resulting in a loss framing.
According to this formula, the payoff is maximal and equal to 0 when the interval provided by the subject is a single value equal to the right answer. In this case, the subject keeps the full 100 ECUs at stake for the question. The payoff is equal to -100 (the subject loses the 100 ECUs) if the subject provides the reference interval and consequently takes no risk at all.
There is therefore a trade-off between risk taking and the amount of ECUs a subject keeps when the correct answer falls inside his interval. High risk taking is rewarded by a small loss (the subject keeps most of the ECUs at stake) when the answer belongs to the interval provided. Conversely, a subject who takes little risk keeps only a few ECUs (losing most of the ECUs at stake) even if the correct answer does belong to his interval.
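The exact scoring formula is not reproduced in this excerpt. The following is a plausible reconstruction, an assumption on our part, consistent with the two boundary cases just described (a point interval at the correct answer loses nothing; the full reference interval loses the whole 100-ECU stake), with the loss proportional to the chosen interval's width relative to the reference interval:

```python
def payoff(interval, reference, answer, stake=100):
    """Loss-framed payoff in ECUs (reconstructed rule, not the authors' verbatim
    formula): lose the whole stake on a miss, otherwise lose a share of the
    stake proportional to the chosen interval's relative width."""
    lo, hi = interval
    rlo, rhi = reference
    if not (lo <= answer <= hi):
        return -stake                       # answer missed: the 100 ECUs are lost
    return -stake * (hi - lo) / (rhi - rlo)

assert payoff((8.0, 8.0), (0.0, 20.0), 8.0) == 0      # exact point guess: keeps everything
assert payoff((0.0, 20.0), (0.0, 20.0), 8.0) == -100  # no risk taken: loses everything
```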
Subjects had 60 seconds to answer each question, indicated by a timer. We applied this time constraint so that the fastest subjects would not have to wait too long before moving on: every subject had to answer a question before the group moved to the next one, so that feedback could be provided about the intervals given for each question. The time limit corresponds to the time it took most subjects to answer a question in the pilot experiment, where there was no time constraint. When time was up, if the subject had not validated his interval, the 100 ECUs at stake were lost and the next question was displayed.
Subjects then received feedback showing the intervals chosen by all participants (including themselves), ranked from narrowest to widest, together with the payoff corresponding to each interval. They could infer from this feedback whether they had taken too much risk compared to the others. They could also see the ranking of everybody's score after each question, so as to trigger a sense of competition 2.
After they had answered all 20 questions, subjects were asked to write a comment about their strategy. They then received general feedback about the first step of the experiment.
Since people are miscalibrated, we expected them to realize it when they saw the correct answer falling outside their intervals more or less often than they had expected, resulting in a loss of money. As a result, we expected them to better adjust the level of risk they wished to take on the following questions. For instance, a subject who is quite confident he knows the answer and consequently provides a narrow interval is likely to be more cautious when he realizes he did not catch the right answer. Conversely, a subject who decides to play it safe and provides a wide interval for a question he thinks he knows the answer to will tend to be less cautious on the following questions when he realizes he could have kept more ECUs by giving a narrower interval. Subjects could also infer information about the right level of risk to take by looking at what others did and how it paid. This task has the advantage of being as close as possible to the calibration task and, as a result, provides a good measure of risk attitude in the context of the experiment. Alternatively, we could have made subjects take the test of Holt & Laury (2002).

The standard calibration task

In the next stage, the subjects who had participated in the training period were asked to answer a set of ten questions (five on general knowledge followed by five on economic knowledge) by giving their best estimate of the answer and then providing 10%, 50% and 90% confidence intervals. Subjects in the control group had to complete the same task. After the pilot experiment, we removed and replaced the most difficult questions, those for which subjects seemed to have no clue about the answer.
Before the beginning of the task, the meaning of 10%, 50% and 90% confidence intervals was explained to subjects in detail. They were also told that they would be paid for this task but that they would only learn later how the remuneration was determined. There was no feedback between questions, and subjects could proceed at their own pace.

Evaluation of miscalibration

As in Cesarini et al. (2006), the remuneration for the calibration task depended on the evaluation subjects were asked to make afterwards of their own and the average subject's performance during the task. Specifically, they were asked how many correct answers they thought fell inside the 10%, 50% and 90% confidence intervals they had provided, and how many fell inside the intervals given by the average subject.
After that, they were asked to make two choices between two bets. The first choice was between betting that at least one correct answer out of ten fell outside the subject's 90% intervals and betting that at least one correct answer fell inside the subject's 10% intervals. For a perfectly calibrated subject, each of these bets would be true with probability 1 − 0.9^10 ≈ 0.65. The second choice was between betting that, out of three questions randomly drawn from the ten, at least one correct answer belonged to the subject's 50% intervals (1 − 0.5^3 = 0.875) and betting that the correct answer to a randomly chosen question belonged to the 90% interval provided by the subject (0.9). Each successful bet was rewarded with 300 extra ECUs. These bets were designed to tell us how aware people are of their miscalibration.
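For a perfectly calibrated subject, the probabilities quoted above follow from simple independence arithmetic:

```python
# Probability that at least one of ten answers falls outside the 90% intervals
# (each answer is inside its 90% interval with probability 0.9). By symmetry,
# the same value applies to "at least one answer inside a 10% interval".
p_first_bets = 1 - 0.9**10   # ~0.651

# Second pair of bets:
p_three_at_50 = 1 - 0.5**3   # at least one of three answers inside a 50% interval
p_one_at_90 = 0.9            # one randomly drawn answer inside its 90% interval

print(round(p_first_bets, 3), p_three_at_50, p_one_at_90)  # 0.651 0.875 0.9
```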



Results

The experiment took place at the laboratory of experimental economics of the University of the Sorbonne (Paris 1) in July 2007. 87 subjects, most of whom were students, participated in the experiment. 53 subjects went through the training period before completing the calibration task, while the control group was composed of 34 subjects. The average subject was 22.42 years old in the control group and 22.71 years old in the trained group. The proportion of men was 41.18% in the control group and 41.5% in the trained group. The average earnings were 11.16 euros: on average, subjects in the control group earned 10.62 euros, including a 5-euro show-up fee, while trained subjects earned 14.24 euros (8.42 for the training period and 5.82 for the calibration task), with no show-up fee.
In the following section, we distinguish between two measures of confidence. The first is the difference between the actual hit rate and the required hit rate for the 10%, 50% and 90% confidence intervals; this difference measures miscalibration. The second is the difference between the subject's estimated hit rate and his actual hit rate; this difference represents confidence in the calibration task and is thus another measure of overconfidence.

General results on calibration

We find that subjects from the control group exhibit a high level of miscalibration. Far more than one correct answer out of ten belongs to the 10% intervals, while fewer than five correct answers out of ten fall inside the 50% confidence intervals and far fewer than nine out of ten fall inside the 90% intervals. The average hit rates in the control group at the 10%, 50% and 90% levels are respectively 2.03, 3.32 and 4.81, while the corresponding median hit rates are 2, 3 and 5. T-tests show that the observed hit rates differ significantly (p<0.001 for all three tests) from the expected hit rates (respectively 1, 5 and 9 at the 10%, 50% and 90% levels). At the 10% level, people are found to be underconfident, meaning that they provide intervals that are too wide; as a result, the correct answer belongs to the 10% intervals too often. This result was anticipated by Cesarini et al. (2006).
At the 50% and 90% levels, conversely, subjects display overconfidence, as their intervals are too narrow; this is all the more the case for the 90% confidence intervals. The fact that far fewer than 90% of correct answers belong to the subjects' 90% confidence intervals is in line with the results of Glaser et al. (2005).
A surprising feature is that, when asked to evaluate how many correct answers belong to their intervals, the average answers at the 10%, 50% and 90% levels are respectively 3.47, 5.56 and 8.04 for the control group; subjects exhibit overconfidence for the calibration task, thinking that they were more cautious than they actually were (see Figure 1). Note, nevertheless, that subjects do predict that their calibration is far from perfect; otherwise, their evaluations would have been 1, 5 and 9.
These results indicate that not only are people unable to adjust the width of their intervals to the indicated risk level (they are miscalibrated), but they are also unable to predict their bias correctly (they are over- or underconfident). Nevertheless, they seem aware that they provide too wide 10% intervals and too narrow 90% intervals.
The choices of bets indicate that subjects are more aware of their overcautious miscalibration at the 10% level than of their overconfident miscalibration at the 90% level. Indeed, subjects chose much more often to bet that at least one correct answer fell inside their 10% intervals than that at least one correct answer fell outside their 90% intervals. However, given the actual hit rates, the first option provided a 99.84% chance of winning while the second offered a 91.77% probability of success. As for the second choice, half the subjects chose to bet that, out of three questions randomly drawn from the ten, at least one correct answer belonged to their 50% intervals, while the other half chose to bet that the correct answer to a randomly chosen question belonged to their 90% interval. Given the actual hit rates, the first bet provided a 74.27% chance of success, while the second bet provided a 52.4% probability of winning. Once again, people do not seem to be aware of the extent of their miscalibration at the 90% level.
To sum up, people seem to overestimate their underconfidence and underestimate their overconfidence.

The effect of training on miscalibration and confidence in calibration

The general picture

The main purpose of this chapter was to see whether a training period, during which several incentives aiming at improving people's calibration and decreasing their overconfidence were provided, would be effective.
Trained subjects have only slightly higher hit rates at the 10%, 50% and 90% levels than subjects from the control group (see Figure 2). The differences in hit rates between the control and the trained group are not significant at any reasonable level. 3
We find that the median 10% interval width is larger for the trained group than for the control group for 7 questions out of ten. For the 3 remaining questions, the median interval width is equal across treatments. Note that this goes in the direction of a worsening of the underconfident miscalibration observed at the 10% level, as people already tend to provide too wide intervals at 10%. The same result is found when we compare median widths of the 50% intervals (wider intervals in the trained group than in the control group for 7 questions, the reverse for 1 question, and equal median widths across treatments for the 2 remaining questions). As for the 90% intervals, the interval width is larger for the trained group for six questions out of ten, while the control group provided wider intervals than the trained group for 1 question. 4

3. The hit rates at the 10%, 50% and 90% levels are respectively 2.03, 3.32 and 4.81 for the control group and 2.40, 3.80 and 5.33 for the trained group. To get an idea of the levels of miscalibration found in other studies, note that Russo & Schoemaker (1992) obtained hit rates at the 90% level between 4.2 and 6.2, while Klayman et al. (1999) found 4.3. However, the level of miscalibration is obviously very sensitive to the set of questions used. Since half the questions we used were taken from Biais et al. (2005), we can compare the level of miscalibration we found to that of their study. Using no incentive, the average 90% hit rate in their study is 3.6, while we find respectively 4.8 and 5.3 in our control group (where subjects know they will get a payment but have to wait until the end of the calibration task to find out how it will be calculated) and our trained group (where subjects are in the same situation and previously went through the training period). It therefore seems that the presence of incentives does increase hit rates.

A different impact between genders

This general picture masks some heterogeneity across subjects. We can control for several sources of heterogeneity; however, the gender variable captures almost all of it. We observe that there is virtually no improvement in women's calibration, especially when we compare the median hit rates across treatments, while men increase their median hit rate by 0.5 point at the 50% level and by 1 point at the 10% and 90% levels (see Figure 3).
The difference in interval width between the control and the training treatments seems larger for men than for women, indicating that men learned more than women to reduce their overconfidence. Using Wilcoxon-Mann-Whitney tests, we find that the 10% confidence intervals are significantly wider for the trained group for five questions out of ten for men and for zero questions out of ten for women.
Note that in the trained group, both men and women had more than one correct answer inside their 10% intervals, exhibiting underconfident miscalibration; widening the 10% intervals therefore aggravates this underconfidence. For the 50% intervals, the width increases significantly between the control and the training treatments for two and six questions out of ten respectively. Finally, concerning the 90% intervals, the difference is significant in three cases out of ten for women and four for men.
It may be interesting to study the link between the « theoretical » distribution of hit rates of a perfectly calibrated subject (who has a 90% chance of any answer falling into each of his 90% confidence intervals…) and the one we actually observe. We report two figures showing the theoretical and actual distributions of 90%-interval hit rates for women and men. These figures make the miscalibration very prominent. We then ran a two-sample median test, separately for women and men, on the distributions of hit rates in the control and the training groups. We find that our training has a significant effect on men's 90% calibration (p=0.089), while no significant effect is found for women. The improvement in men's 90% calibration can be seen in Figure 1 in the shift of the distribution of hit rates between the control and the training treatments. No effect is found for miscalibration at the 10% and 50% levels.
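The « theoretical » distribution mentioned above is, for ten independent questions, simply a binomial distribution; at the 90% level, a perfectly calibrated subject's hit count follows Binomial(10, 0.9):

```python
from math import comb

def binom_pmf(k, n=10, p=0.9):
    """P(exactly k of n answers fall inside the stated intervals) for a
    perfectly calibrated subject."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

theoretical = [binom_pmf(k) for k in range(11)]
# Nearly three quarters of the probability mass sits at 9 or 10 hits,
# far from the observed median hit rate of about 5 reported in the text.
print(round(binom_pmf(9), 3), round(binom_pmf(10), 3))  # 0.387 0.349
```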

What happened during the training session ?

The training period seems to have had some impact on men but almost none on women. It is therefore interesting to use the results from the training period to gain insight into the nature of the learning process that took place. In order to measure learning during the training period, the order of the 20 questions was reversed for half of the subjects.
It appears that some learning took place during the training period. We measured learning at this stage by comparing the widths of the intervals provided for the same question by subjects from the two groups corresponding to the two orders of appearance of the questions. We found a significant difference in interval width between the two groups for seven questions out of twenty, each difference going in the direction of wider intervals for the group that answered the question later in the training session. For example, the intervals provided for question 18 were wider for the group that saw the questions in the regular order than for those who saw them in reverse order (for whom question 18 was actually the third question they had to answer). It seems noteworthy that six of the seven questions which subjects with more training answered with wider intervals were economic-knowledge questions.
We regressed the ratio of the chosen interval width to the reference interval width on an intercept, a dummy indicating gender (« Female »), age, level of education, a dummy indicating whether the question appeared early in the training session (« Exp »), the interaction between « Exp » and « Female » (« Exp*Female »), the ranking announced to the subject after he answered the previous question (« Rank-1 »), the gap between the midpoint of the interval provided and the correct answer, divided by the correct answer, as a proxy for ignorance (« Gap »), and dummies for the different questions (see Table 3). We included « Gap » among the regressors so as to capture the effect of knowledge on the choice of interval width. Any residual effect of « Rank-1 » can therefore be attributed to competition, i.e. the effect of the announced rank on the decision to take more or less risk.
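A regression of this shape can be sketched with ordinary least squares on synthetic data (all values below are hypothetical stand-ins, not the experiment's data; age, education and question dummies are omitted for brevity):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Hypothetical stand-ins for the regressors described in the text.
female = rng.integers(0, 2, n)       # gender dummy
exp_early = rng.integers(0, 2, n)    # "Exp": question seen early in the session
rank_prev = rng.integers(1, 20, n)   # "Rank-1": rank announced after previous question
gap = rng.random(n)                  # "Gap": |midpoint - answer| / answer

# Dependent variable: chosen interval width over reference interval width
# (coefficients below are arbitrary, chosen only to generate data).
y = (0.5 - 0.05 * exp_early + 0.03 * exp_early * female
     + 0.01 * rank_prev + 0.20 * gap + rng.normal(0, 0.05, n))

X = np.column_stack([np.ones(n), female, exp_early,
                     exp_early * female, rank_prev, gap])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
names = ["const", "Female", "Exp", "Exp*Female", "Rank-1", "Gap"]
print(dict(zip(names, beta.round(3))))
```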

Table of contents :

0.1 Advantages and drawbacks of experimental economics
0.2 Gender differences in self-confidence, competitiveness and the need for challenges: a review of existing experimental results
0.3 The origins of preferences: nature or nurture?
0.4 Economic policy implications
0.5 Outline of the dissertation
General Introduction 
0.6 Advantages and drawbacks of experimental economics
0.7 Gender differences in self-confidence, competitiveness and need for challenges: A review of experimental results
0.8 Nature or nurture: On the origins of preferences
0.9 Policy implications
0.10 Outline of the dissertation
1 Incentives to Learn Calibration 
1.1 Introduction
1.2 Experimental design
1.2.1 The training period
1.2.2 The standard calibration task
1.2.3 Evaluation of miscalibration
1.3 Results
1.3.1 General results on calibration
1.3.2 The effect of training on miscalibration and confidence in calibration
1.3.3 What happened during the training session ?
1.4 Discussion
1.5 Conclusion
2 Men too sometimes shy away from competition: The case of team competition
2.1 Introduction
2.2 Experimental Design
2.2.1 What Needs to be Controlled for
2.2.2 The Tasks
2.3 Results
2.3.1 Gender Differences in Performance and in Entry in the Individual Tournament
2.3.2 Gender Differences in Entry in the Team Tournament
2.3.3 Explanations for the Changes in Tournament Entry Between the Individual Tournament and the Team Tournament
2.4 Consequences on Efficiency of the Type of Competition
2.5 Conclusion
3 Group Identity and Competitiveness 
3.1 Introduction
3.2 Experimental Design
3.2.1 Identity sessions
3.2.2 Benchmark sessions
3.3 Results
3.3.1 Group identity building activities
3.3.2 The effect of social identity on performance, confidence and entry in the individual tournament
3.3.3 The effect of social identity on entry in the team tournament
3.3.4 Explanations for the changes in decision to enter the team tournament
3.3.5 How male vs female competition affects men and women’s competitive behavior
3.4 Discussion
3.5 Conclusion
General Conclusion 

