Believability of virtual players
Powerful anti-cheat systems, as well as entertaining and engaging NPCs and virtual players, are among the features that make a video game successful. However, since the gaming industry is very competitive, companies are subject to severe time constraints and are expected to deliver high-quality results in a very short time. They cannot necessarily take the risk of investing time and effort in developing innovative and sustainable solutions for all these components of a game. For this reason, the scientific community plays an important role in bringing new, advanced and original solutions that create value both for video game companies and for their consumers.
The expectations of today’s gamers have evolved with improvements in game design. They now expect truly believable and realistic gaming environments with complex stories, characters and actions. Our work focuses on the believability of virtual players (or bots) in multiplayer video games. Whereas the realism of a character depends heavily on its visual aspect, its believability depends on its actions and strategies. Loyall (1997) clearly illustrates the difference between these two aspects with the example of the Flying Carpet in the Disney animated film Aladdin: “It has no way of being realistic, it is a totally fantastic creature. In addition, it does not have many of the normal avenues of expression: it has no eyes, limbs nor even a head. It is only a carpet that can move. And yet, it has a definite personality with its own goals, motivations and motions.” Virtual players in video games are considered believable when the other players have the impression that they are controlled by another human player (Tencé, Buche, et al., 2010). Many researchers have made the point that players enjoy a game more if they believe that their opponent is another human represented in the game by an avatar, rather than a computer-controlled player. For example, in (Weibel et al., 2008), players who were convinced that they were playing against human opponents in the video game Neverwinter Nights (an online role-playing game) reported a greater sense of immersion, engagement and flow, as well as greater enjoyment. In (Lim and Reeves, 2010), the researchers reported that players exhibited greater physiological arousal when the opponent was introduced as a human rather than a bot. Also, in (Soni and Hingston, 2008), bots trained using examples of human play traces were found to be more challenging and enjoyable opponents than the standard scripted bots.
Over the years, different approaches have been used to implement such bots. However, most of the time, these bots were either not assessed or were evaluated using different protocols. Yet, in order to make progress in the development of believable bots, a generic and rigorous evaluation needs to be set up that allows new systems to be compared with existing ones. According to Clark and Etzioni (2016), “standardised tests are an effective and practical assessment of many aspects of machine intelligence, and should be part of any comprehensive measure of AI progress”. Although bots’ performance can be evaluated through objective measures (such as comparing scores or the time spent completing a level), the evaluation of bots’ believability is complex because of its subjective nature.
The Turing test is widely considered a pioneering landmark for believability assessment (Marcus et al., 2016). Proposed by Turing in 1950, it tests the ability of a chatbot to exhibit intelligent behaviour indistinguishable from that of a human. In its standard interpretation, a human judge converses, via a text-only channel, with a human confederate and a computer program. If the judge, using only the responses to written questions, cannot reliably tell the chatbot from the human, the chatbot is said to have passed the test.
One way of evaluating AI is to organise competitions. According to Togelius (2016), the advantage of competitions is that they provide fair, transparent and reusable benchmarks. Most competitions in computer games are aimed at the development of superhuman-level opponents such as the famous chess program Deep Blue (M. Campbell et al., 2002) or the more recent Go program AlphaGo (Silver et al., 2016). In recent years we have seen the emergence of competitions oriented toward the implementation of human-like (or believable) opponents, such as the 2K BotPrize competition (Philip Hingston, 2009) or the Turing Test track of the Mario AI Championship (Shaker et al., 2013).
The BotPrize is particularly interesting as it has evolved significantly over the years. It was held annually between 2008 and 2014 (except in 2013) at the IEEE Conference on Computational Intelligence and Games. It is a variant of the Turing test (Turing, 1950) that uses the “Deathmatch” game mode of the video game Unreal Tournament 2004 (UT2004), developed by Epic Games. UT2004 is a first-person shooter (FPS) in which the objective is to kill as many opponents as possible in a given time (and to be killed as few times as possible). The different versions of the BotPrize are described below:
First version: The first two editions were held in 2008 and 2009 (Philip Hingston, 2009) and used the same protocol (as illustrated in Figure 2.1a). They were run in five rounds of ten minutes. In each round, each human judge was matched against a human confederate and a bot. The confederates were all instructed to play the game as they normally would. At the end of each round, the judges were asked to evaluate the two opponents on a rating scale (from “1: This player is a not very human-like bot” to “5: This player is human”) and to record their observations. In order to pass the test, a candidate (that is, the entity being evaluated, e.g. a bot or a human player) was required to be rated 5 (“This player is human”) by four of the five judges.
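The pass criterion of the first version can be illustrated with a minimal Python sketch. The function name `passes_test` and its parameters are hypothetical, and the sketch assumes that "four of the five judges" means at least four; the source does not say whether a fifth rating of 5 would change the outcome.

```python
from typing import Sequence

HUMAN_RATING = 5  # "5: This player is human" on the 1-5 scale


def passes_test(ratings: Sequence[int], required: int = 4) -> bool:
    """Return True if at least `required` judges gave the top rating.

    `ratings` holds one rating per judge for a single candidate.
    Assumption: "four of the five judges" is read as "at least four".
    """
    for rating in ratings:
        if not 1 <= rating <= HUMAN_RATING:
            raise ValueError(f"rating out of range: {rating}")
    top_votes = sum(1 for rating in ratings if rating == HUMAN_RATING)
    return top_votes >= required


# A candidate rated 5 by four judges passes; one rated 5 by three does not.
print(passes_test([5, 5, 5, 5, 3]))
print(passes_test([5, 5, 5, 4, 4]))
```

Under this reading, a single rating of 4 ("probably human") does not count toward the threshold, which makes the criterion strict.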
Logistically, this competition was quite difficult to implement. There were two rooms: one with a computer for each server and for each confederate, and another with a computer for each judge. No communication between the two rooms was possible other than through the organizers or via gameplay. Spectators were able to come to the judges’ room to watch the games in progress.
Second version: In 2010, a new design was implemented (see Figure 2.1b) (Philip Hingston, 2010), born from the desire to make the judging process part of the game. The organisation of this new version was much simpler, since there was no need for confederates or a secret room. Only one server ran continuously, and human judges and bots could connect to it at any time. One of the game’s weapons (the Link Gun) was modified for the judging process. This weapon had two firing modes (one for each mouse button) that could be used to tag an opponent as human or bot (the vote was final). If the judgement was correct, the target died; if it was incorrect, the judge’s avatar died. Both bots and humans were equipped with the judging gun and could vote. This modification introduced a bias in the evaluation process, as the gameplay was adversely affected. Whereas players previously moved quickly in order not to present an easy target, in the new competition human players were easily spotted because they were tempted to stop and observe their opponents in order to make a judgement (Thawonmas et al., 2011). Furthermore, judges may be inclined to attempt to communicate through movements and shooting patterns (Polceanu, 2013). This kind of behaviour would not occur naturally in normal gameplay.
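The judging-gun rule of the second version can be sketched in a few lines of Python. This is only an illustration of the rule as described above, not the BotPrize implementation; the names `Kind` and `resolve_vote` are hypothetical.

```python
from enum import Enum


class Kind(Enum):
    """The two labels a player can assign with the judging gun."""
    HUMAN = "human"
    BOT = "bot"


def resolve_vote(target_kind: Kind, vote: Kind) -> str:
    """Return who dies after a (final) vote.

    A correct judgement kills the target; an incorrect one kills
    the avatar of the player who voted.
    """
    return "target" if vote == target_kind else "judge"


# Tagging a bot as a bot kills the bot; tagging a human as a bot
# kills the voter's own avatar.
print(resolve_vote(Kind.BOT, Kind.BOT))
print(resolve_vote(Kind.HUMAN, Kind.BOT))
```

Because a wrong vote is penalised, voters have an incentive to observe an opponent before tagging it, which is consistent with the stop-and-observe behaviour reported for this version.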
Assessment’s Characteristics Analysis
The application used for the evaluation process can be pre-existing or developed specially for the test. Implementing a sample game can be necessary when no open-source games are available (Bernacchia and Hoshino, 2014), but it needs to be well thought out in order not to introduce bias unintentionally. A good example from the domain of character believability is Mac Namee’s (2004) simulation of a bar. Two virtual bars were used, each populated by autonomous agents who could buy and drink beer, talk to friends, or go to the toilet. In the first simulation, the agents had long-term goals, whilst in the second they selected a new goal at random every time they completed an action. Mac Namee noticed a difference in the results that was probably due to cultural effects: for the Italian subject, the random selection seemed more believable, since he found it unrealistic for agents to return to sit at the same table time after time, whereas for the other subjects (from Ireland), this behaviour seemed more believable. A bar environment was not necessarily an ideal choice for the evaluation, as subjects had diverse expectations as to how a human would behave. This problem of cultural difference is well known to researchers interested in the development of virtual agents. Many approaches have been proposed to design agents that can adapt their behaviour to the cultural context in which they are used (De Rosis et al., 2004; Rehm et al., 2007; Lugrin et al., 2017). However, research on understanding the cultural nuances of game players is lacking (Chakraborty and Norcio, 2009). Lee and Wohn (2012) showed that culture has a small effect on behaviours in social network games. They found that cultural orientations affect people’s expected outcomes (social interaction, recognition, relaxation, or relief from boredom), which in turn affect their usage patterns (giving and offering gifts to game friends, advancing in the game, customizing their avatar, publishing game status, ...). Further research on video games is required to examine the effects of culture on motives and behaviour in the virtual world (Jackson and Wang, 2013). Moreover, for this study we could only investigate the French population, so we decided not to focus on the possible effects that cultural differences could have on the evaluation of bots’ believability.
Judges’ and confederates’ expertise
The level of the judges is sometimes taken into account in the experiment. As noted by Mac Namee (2004), players’ experience with video games can introduce differences between subjects. In general, an experienced player will recognise a bot more quickly and easily than a novice player. For example, in Laird and Duchi’s paper (2001), only the expert player made no mistakes in differentiating between bots and humans. Novice players, on the other hand, might not fully know the rules of the game or the set of actions available to the players, which could make the whole experience too confusing for them to evaluate the players’ behaviours sensibly (Daniel Livingstone, 2006). Another interesting element that has been taken into account in (Laird and Duchi, 2001; Acampora et al., 2012; Shaker et al., 2013) is the level of the confederates. They play a major role in the assessment, as their behaviours directly influence the judges’ evaluation. For example, an expert confederate with high performance could easily be mistaken for a bot by non-expert judges (Polceanu, 2013). Conversely, novice confederates who are still learning how to play the game and how to use the controls might behave in ways that expert players could mistake for those of a weak bot. Confederates should therefore be given sufficient time to become familiar with the game rules and commands before the evaluation starts. Philip Hingston (2009) avoided these potential problems by choosing confederates who all had a reasonable level of experience, i.e. who were neither experts nor novices.
Table of contents:
List of Figures
List of Tables
1.1 Different Types of Bots
1.2 Believability of virtual players
1.4 Manuscript Organisation
2 Related Works
2.1 Defining Believability
2.2 Assessing Believability
2.3 Assessment’s Characteristics Analysis
3 Blinding the Judges
3.3 Experiment Methodology
4 Influence of the Judges’ Expertise
4.1 Model Modifications
4.2 Experiment Methodology
5 Reporting Suspected Cheaters
5.2 Experiment Methodology
6 Conclusion and Future Work
6.2 Future Work
A Unreal Tournament 2004 Tutorial (in French)
B Questionnaire for the Experiment No. 1 (in French)
C Final Questionnaire for the Experiment No. 1 (in French)
D Questionnaire to Evaluate the Level of Expertise in the Experiment No. 2 (in French)
E Material for the Experiment No. 3 (in French)
E.1 Pre-Experiment Questionnaire
E.2 Screenshot of the Experiment in Process
E.3 Post-Experiment Questionnaire