Improving Dynamic NPC Behavior through Artificial Intelligence Techniques
Bryan
Boise, Idaho
This is a paper I wrote for an artificial intelligence class at BSU.
Abstract
In this article, I extend methods from previous articles that attempt to create human-like, dynamic NPC behavior.
The techniques that are unique to this article include learning Action Responses; learning Action and Action Response Behavior Alignment (either positive or negative, depending on the alignment of the local, community, and/or global environments); and using physical conditions, non-verbal communication, mood, the historical average of action and response behavior alignment, political and religious affiliations, personality, etc., to determine the appropriate action/response to execute.
These multi-varied criteria make the actions and action responses complex and varied, which keeps a game interesting and dynamic, with the possibility of unexpected events and results.
Introduction
There have been numerous studies of how to create realistic human behavior in non-played characters (NPCs) in computer video games. However, most of them focus on learning to choose optimal actions from a set of pre-defined actions. For example, the traditional method of defining behavior for NPCs in a game is through the use of scripting (Bakker, 2004). Scripted behavior is repetitive, uninteresting, and robotic; it does not effectively mimic realistic, human-like behavior.
Whether or not NPCs demonstrate realistic, human-like behavior can be assessed through the use of the Turing test, a test used to measure the ability of a machine to demonstrate intelligence. The basic idea is that a human judge, or in the case of a computer game, a human player, plays the game, engaging in communication and personal contact with NPCs, and experiences the same or similar contact with another character that is played by a human. The player tries to distinguish the human-played character from the NPC, and if the player is unable to do so, then the NPC passes the Turing test (Turing Test, Wikipedia).
In this article, I propose using human-like characteristics, environmental conditions, and the NPC's current condition of life to (1) determine whether an action, or an action response, should be positive or negative, expressed as an Action Behavior Alignment or an Action Response Behavior Alignment, and (2) use that Behavior Alignment, again together with human-like characteristics, environmental conditions, and the NPC's current condition of life, to determine which action (or action response) the NPC should select.
Furthermore, non-verbal communication is taken into consideration and is recommended for inclusion in the action/response space. Non-verbal communication is a major part of human-like communication, and excluding it from actions/responses would greatly limit a game's ability to support complex character interaction and would decrease the level of interest experienced while playing the game.
Including non-verbal communication would introduce the ability to infer the intentions behind actions, greatly increasing the complexity of character interaction, where characters do not make decisions based solely on what is said. Because of the great role non-verbal communication can play in increasing a game’s quality of play, it will be covered in some detail within this article.
The action, or action response, should be selected from an action (or action response) space that reflects the actions, or responses, relevant to the current state of the NPC. Actions differ from responses in the order in which they occur within a series of exchanges. Actions are usually initiating activities, and responses are executed as a result of the initiated activity.
In a complex game, most actions will be the result of some type of historical action, and therefore most types of actions can be thought of as an action response. However, a random attack from one NPC on another because of the mood or environmental stress of the attacking NPC would be an example of an initiated action, and not an action response.
The process of learning to select, and then selecting, Action Responses follows the same structure as that for Actions. The purpose of including a separate type of action, the Action Response, is that the Action Response is directly related to the Action of another NPC and/or another group of NPCs. Furthermore, the Action Response may come from an individual NPC or from a group of NPCs.
This could result in a recursive, possibly infinite, loop of actions and action responses. Some controls must be used to determine the limits, extent, and intensity of actions. Such controls may include the death of a perpetrator, the exhaustion of physical stamina, actions being monitored and controlled through the influence of other NPCs, etc. It can become very complex very quickly, and therefore the more NPC controls that are implemented, the more controlled the results will be. However, the possibilities are limited only by the resources of the computer, which are becoming increasingly available; therefore, a complex and thorough implementation of this method of dynamic NPC behavior is very plausible.
The reinforcement learning algorithm is a variation of the SARSA (State-Action-Reward-State-Action) algorithm, which “is an algorithm for learning a Markov decision process policy.” The algorithm will be used for both Actions and Action Responses. The rewards will be either positive or negative, and will be used in updating the current complex state of an NPC.
Action and Action Response Behavior Alignment
The amount of influence that environmental stress levels and political/social influences have on actions and action responses depends on the following attributes. The influence can be local (personal), at a community level, or at a global level. The levels of influence have an order of precedence: local is at the bottom, and global is at the top. This helps make the environments more true to life.
For example, imagine a world that is aligned positive at the global level. Positive is defined as treating NPCs with respect and respecting their property and established personal rights. At a specific community level within the global environment, there exists a large amount of political corruption. Therefore, this community level has a negative alignment, and criminal or negative actions, that is, actions contrary to the global level, are considered good in the political community. However, a small group of politicians wants to follow the global influence, and at their local level, their alignment is positive. Their actions against the political community are negative at the community level, but positive according to the global level. These NPCs experience a small level of “guilt” for actions contrary to the political community level, but would experience much greater “guilt” for actions contrary to the global level.
The influence of experienced “guilt” diminishes with the number of actions taken that would provoke a “guilty” response. This is an attempt to introduce a sense of “consciousness” to the NPCs, in a manner that models human-like behavior.
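The precedence of levels and the diminishing “guilt” described above could be sketched as follows. This is only an illustration under stated assumptions: the class name Conscience, the numeric level weights, and the decay formula are all hypothetical and are not taken from any referenced work.

```python
from dataclasses import dataclass, field

# Hypothetical base weights: a higher-precedence level produces more guilt.
LEVEL_WEIGHT = {"local": 1.0, "community": 2.0, "global": 4.0}


@dataclass
class Conscience:
    """Tracks how often an NPC has acted against each level's alignment."""
    decay: float = 0.5  # assumed rate at which repeated violations dull guilt
    violations: dict = field(default_factory=lambda: {k: 0 for k in LEVEL_WEIGHT})

    def guilt_for(self, action_alignment: int, level_alignments: dict) -> float:
        """Return the guilt felt for an action of alignment +1 or -1.

        level_alignments maps each level name to +1 (positive) or -1 (negative).
        """
        total = 0.0
        for level, alignment in level_alignments.items():
            if alignment != action_alignment:
                # Guilt shrinks as violations of this level accumulate.
                total += LEVEL_WEIGHT[level] / (1.0 + self.decay * self.violations[level])
                self.violations[level] += 1
        return total


# The politician example: positive global and local alignment, corrupt community.
levels = {"local": 1, "community": -1, "global": 1}
politician = Conscience()
first = politician.guilt_for(+1, levels)   # acts against the community only
second = politician.guilt_for(+1, levels)  # same act again, less guilt
```

With these assumed weights, acting against the community costs less guilt than acting against the global level, and each repetition costs less than the one before.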
The attributes themselves are also used in selecting actions and action responses that should be part of the action or action response space for a specific NPC state.
The attributes are not listed by order of importance or weighted influence.
Attribute set:
• Sex - static
• Age – Dynamic. Mostly influenced at young or old ages. After adulthood, the effect should not be great until middle-age or old-age has been reached. Depends on the expected lifespan of an individual.
• Property/Wealth/Equipment - Dynamic
• Political Status/Association – Can be dynamic or static
• Social Status/Association – Can be dynamic or static – Impoverished, Middle-Classed, Rich, loner, friends here and there, popular, etc.
• Health – Stamina, physical strength, resistance to infections and diseases, eyesight, hearing, age – all of these should be considered dynamic traits.
• Physical Stature – Static – small-weak, athletic, strong, or unfit; medium- weak, athletic, strong, or unfit; large- weak, athletic, strong, or unfit
• Intelligence – Slowly Dynamic after adulthood reached – Dim-witted,
• Wisdom - Slowly Dynamic after adulthood reached – careful, reckless, thoughtful, spontaneous, etc.
• Trade – Dynamic or static
• Skill-level in career – dynamic to a point
• Self-esteem – High is less prone to political/social influences, low is more prone. Also, high is more prone to positive action responses, low is more prone to negative action responses.
• Cultural and ethnic background - Static
• Religious belief/association – Dynamic or static
• Personality – Charisma, Speech craft, self-esteem(dynamic) - problem-solver, partier, hard-working, lazy, moody, violent, peaceful, friendly, hostile, take-charge, follow, serious, class-clown, etc.
• Mood – Dynamic – happy, sad, grumpy, angry, pleasant, stressed, at ease, serious, jovial, discouraged, overwhelmed, confident, scared, etc.
• Existing Alignment influence for alignments in control at local, community, or global levels – depends on the situation – can be either positive or negative – This affects the level of “guilt” an NPC could experience for going against this alignment, and the amount of influence diminishes with the number of actions taken contrary to a particular alignment.
• Average Positive Actions or Responses - Dynamic
• Average Negative Actions or Responses – Dynamic
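As a rough illustration of how such an attribute set might be turned into numeric inputs, the sketch below uses a reduced subset of the attributes. The class name NPCAttributes, the choice of fields, and the scaling of each value to the range [-1, 1] or [0, 1] are assumptions made for the example only.

```python
from dataclasses import dataclass, astuple


@dataclass
class NPCAttributes:
    """A reduced, hypothetical subset of the attribute set as numeric values."""
    age: float                   # 0 = newborn, 1 = expected lifespan reached
    wealth: float                # 0 = impoverished, 1 = rich
    health: float                # 0 = very poor, 1 = excellent
    intelligence: float
    wisdom: float
    self_esteem: float
    mood: float                  # -1 = very negative mood, +1 = very positive
    local_alignment: float       # -1 or +1
    community_alignment: float   # -1 or +1
    global_alignment: float      # -1 or +1
    avg_positive_actions: float  # historical average, 0 to 1
    avg_negative_actions: float  # historical average, 0 to 1

    def to_inputs(self) -> list:
        """Flatten the attributes into a list of floats for a network input layer."""
        return [float(x) for x in astuple(self)]


npc = NPCAttributes(0.3, 0.5, 0.9, 0.6, 0.4, 0.7, -0.2, 1, -1, 1, 0.6, 0.4)
inputs = npc.to_inputs()
```

Static attributes such as sex or cultural background would need their own encoding, which is left out of this sketch.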
Determining the Behavior Alignment
The behavior alignment is determined through the use of a trained Artificial Neural Network (ANN). The learning, or training, of the ANN is performed using back-propagation. For each training case, the expected output of either positive or negative will be known, and used to determine the error. The output range is -1 to 1, and an output of 0 is always treated as an error, as a Behavior Alignment cannot be both positive and negative.
The actual resulting value will be used as an intensity of the action or response, a -1 being the most intense negative action/response, and a 1 being the most intense positive action/response.
The inputs to the ANN will be the attribute set described in the previous section. Each of these attributes will be weighted with respect to the actual influence the attribute has in determining a Behavior Alignment. This will require some serious fine-tuning of the neural network characteristics (i.e. input weights, number of hidden nodes, number of layers, etc.) in order to get a “well-oiled” machine.
As an example, if a neural network were to have 3 layers, the following would be the pseudo-code for the algorithm (referenced from Wikipedia’s Back propagation):
Initialize the weights in the network (to the default initial values determined for each attribute)
Do
    For each action/response e in the training set
        O = ANN output (network, e); forward pass
        E = Expected output for e
        Calculate error (E – O) at the output units
        Compute delta_wi for all weights from hidden layer to output layer; backward pass
        Compute delta_wi for all weights from input layer to hidden layer; continued backward pass
        Update the weights in the network
Loop until all examples classified correctly or stopping criterion satisfied
Return the network
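The pseudo-code above can be made concrete with a small, self-contained sketch. It assumes one hidden layer, tanh activations (so that the output falls between -1 and 1), random initial weights rather than per-attribute defaults, and a made-up toy training set; the function name train_alignment_network and all parameter values are hypothetical.

```python
import math
import random


def train_alignment_network(samples, n_hidden=4, rate=0.1, epochs=2000, seed=0):
    """Train a 3-layer (input, hidden, output) tanh network with back-propagation.

    samples is a list of (inputs, expected) pairs, with expected equal to -1 or +1.
    Returns a predict(inputs) function whose result lies between -1 and 1.
    """
    rng = random.Random(seed)
    n_in = len(samples[0][0])
    # Each row holds one unit's weights; the last entry in a row is the bias.
    w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_hidden)]
    w_out = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]

    def forward(inputs):
        hidden = [math.tanh(sum(w * x for w, x in zip(row, inputs)) + row[-1])
                  for row in w_hidden]
        output = math.tanh(sum(w * h for w, h in zip(w_out, hidden)) + w_out[-1])
        return hidden, output

    for _ in range(epochs):
        for inputs, expected in samples:
            hidden, output = forward(inputs)                     # forward pass
            delta_out = (expected - output) * (1 - output ** 2)  # error at output unit
            # Backward pass: hidden deltas use the output weights before they change.
            delta_hidden = [delta_out * w_out[j] * (1 - hidden[j] ** 2)
                            for j in range(n_hidden)]
            for j in range(n_hidden):                            # hidden -> output weights
                w_out[j] += rate * delta_out * hidden[j]
            w_out[-1] += rate * delta_out
            for j, row in enumerate(w_hidden):                   # input -> hidden weights
                for i in range(n_in):
                    row[i] += rate * delta_hidden[j] * inputs[i]
                row[-1] += rate * delta_hidden[j]
        # Stop once every training case has the expected sign.
        if all(forward(x)[1] * e > 0 for x, e in samples):
            break

    return lambda inputs: forward(inputs)[1]


# Toy training set (made up): inputs are (mood, self_esteem), target is the alignment.
samples = [([0.9, 0.8], 1), ([0.7, 0.9], 1), ([-0.8, -0.6], -1), ([-0.9, -0.2], -1)]
predict = train_alignment_network(samples)
intensity = predict([0.8, 0.7])  # sign gives the alignment, magnitude the intensity
```

A real implementation would take the full attribute set as input and would likely need a stricter stopping criterion than "every case has the right sign" if the magnitude is to be used as an intensity.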
A visual example of a possible ANN with the described input attributes is shown in the following:
Actions and Action Response
There are two high-level categories of actions and action responses: positive and negative. Action responses can be triggered by actions directed at an individual NPC, NPC-A, or by actions directed at an NPC, or group of NPCs, within a determined proximity of NPC-A. Thus, individual or concurrent action responses may be triggered at any given time, depending on the type, content, and nature of the directed action.
Actions can be directed at a single NPC or a group of NPCs, and are determined by the current state of the initiating NPC, the consequence, and the resulting state. The response is determined likewise. The difference is that an action is a first-strike concept, whereas action responses are taken as a result of first-strike and subsequent actions.
The responses need not come immediately after an action was initiated. To keep things interesting, depending on the action dealt, and the state of both parties, an NPC can respond swiftly, stealthily, openly, secretly, or not at all. With this in mind, an NPC should be able to plot and plan, ambush, manipulate, and blackmail. The ability to do so should be controlled by the state of the NPC, and its ability to control its emotions. A hot-headed NPC with low wisdom and low intelligence might not have the control to react at a more “convenient” time, but might instead react immediately, harshly and rashly, with little or no regard for the consequences.
The idea is to have an environment in which the NPC population is in a continuous dynamic state, keeping the game-play interesting. Decisions a played-character may make can influence one, many, and even potentially all NPC's. Furthermore, a played-character should be able to stand and watch the dynamic behavior of those in “near” proximity (depends on the eyesight abilities of a character...?). The amount and intensity of the changes in behavioral action responses of NPC's in proximity should also be dynamic as a result of the amount and intensity of the actions dealt to them. If they are just walking, working, relaxing, playing, etc, and nothing is really happening to change their behavioral responses and/or behavior in general, then the amount and intensity of changes should be minimal.
Considerations of Non-Verbal Communication as Actions/Responses
With around 90% of interpersonal communication among humans existing in a non-verbal form, a low or minimized effort in implementing the use of non-verbal communication in NPC's drastically reduces the amount of realistic, life-like communication from the user or player with NPC's, and among NPC's themselves. (Adubato, pg. 13)
In other words, without the use of realistic non-verbal communication, only around 10% or so of the actual life-like communication is being transmitted or received. This provides an enormous area for improving the ability for a game to be life-like and interesting.
In an ideally designed game, a player or user should be able to pick out non-verbal communication and use it to make decisions and judgments regarding the NPC's mood, confidence, trustworthiness, composure, friendliness (or the lack thereof), etc. Without an NPC saying anything at all, a user could pick up important messages that would otherwise have to be communicated through some sort of verbal communication.
Furthermore, this same increase in communication could and should be applied to NPC interaction. Actions/responses would include non-verbal communication. For example, an NPC, NPC1, is approaching another NPC, NPC2, in an isolated environment (for the sake of simplicity). NPC1 changes its facial expression from a non-expressive state to an unfriendly, cold expression, with eyes narrowing and its jaw muscles clenching. NPC2 detects the non-verbal changes in the state of NPC1, and formulates an appropriate reaction to the unfriendly approach of NPC1.
Another example of NPC non-verbal communication could be where a player or user approaches an NPC and engages in a conversation with the NPC. As the player verbally communicates a question to the NPC, the NPC continuously shifts its gaze away from the player, avoiding direct eye contact, frequently shifts its stance, and wrings its hands constantly. The player then might inquire if the NPC was troubled by something, and by successfully picking up on the non-verbal communication and reacting to it, the player unlocks a new mission or vital information for the game that might otherwise have been missed if the player was not paying close enough attention to the NPC's non-verbal communication.
Creating and recognizing non-verbal communication among NPCs in a computer video game is more easily accomplished than implementing a computer program or a type of robot to recognize actual non-verbal communication from a real living organism. In the computer game, each attribute that is capable of non-verbal communication (such as changes in facial features, hand motions, relaxed or cautious composures, etc.) could be represented as an input variable when calculating the state of an NPC.
For example, the condition of an NPC's mouth could be enumerated using specific values, where a smile is zero, a frown is one, a closed and tight mouth is two, an open and gaping mouth is three, etc.
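The mouth enumeration just described could be written as follows. The names MouthState and one_hot are hypothetical, and the one-hot encoding is an added assumption: it is one possible way to feed such a value into a neural network without implying that a gaping mouth is "more" than a frown.

```python
from enum import IntEnum


class MouthState(IntEnum):
    """Integer codes for the mouth conditions listed in the text."""
    SMILE = 0
    FROWN = 1
    CLOSED_TIGHT = 2
    OPEN_GAPING = 3


def one_hot(state: MouthState) -> list:
    """Encode a mouth state as a vector with a single 1.0, one slot per state."""
    return [1.0 if state == member else 0.0 for member in MouthState]


encoded = one_hot(MouthState.CLOSED_TIGHT)
```

Other non-verbal channels, such as gaze direction or hand motion, could be enumerated in the same way and appended to the NPC state.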
The use of non-verbal communication also helps in trying to deduce the intentions behind actions and verbalized communication. Being able to analyze non-verbal communication would help NPCs make calculated, interesting actions/responses, rather than rash, impulsive actions/reactions that are unfounded because the true intentions behind the actions/responses or verbal communication of another NPC were misinterpreted. Non-verbal communication offers the possibility of picking up an indication as to why an action/response was made, and can also warn of a pending action/response. The resulting increase in character complexity would make a game much more interesting and replayable than a game with a minimal amount of non-verbal communication.
How Actions and Action Responses are selected
First, sets of actions and sets of action responses, organized into positive or negative categories, should be created. The type of action response should also relate to the type of action. What signifies positive or negative should depend on the nature of the existing political, social, and religious environment that is in control at the local, community, and/or global level(s). This means that the definition of positive and negative could also potentially be dynamic at the local or personal, community, and/or global level(s), in which case actions and action responses could shift from being positive to negative, and vice versa. This could really keep things interesting.
The actions and action responses can be further broken down into being friendly/hostile, weak/mild/strong, cautious/bold/out-of-control, etc.
Here are some action/action response selection methods that could use the action response behavior alignment (ARBA) values for selecting actions/action responses, whether positive or negative. The results of these methods will create the action or action response space.
• Decision-Trees – Utilize a forest, each tree consisting of proper actions, or action responses, relating to the current state of the NPC, the NPC’s calculated Behavior Alignment, and actions that have occurred in proximity to the NPC or directly at the NPC (if any). The leaves of the tree will be the actions/responses that will exist in the action/response space.
• Artificial Neural Networks – Use the current NPC state, the calculated Behavior Alignment, and the most recent actions (if any) that have occurred in proximity to an NPC, or directly at an NPC, as inputs, and then use priority weights to learn which actions the NPC should respond to and how. This same technique can also be used to select actions instead of action responses.
• e-greedy policy – Select the action or response with the largest estimated Q(s,a) with probability 1 – e, and select randomly from all actions or responses with probability e. (Cutumisu et al., 2008)
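Of the methods listed above, the e-greedy policy is the simplest to show in code. The sketch below assumes the Q-values are stored in a dictionary keyed by (state, action) pairs, with unseen pairs treated as 0; the function name, the state label, and the action names are hypothetical.

```python
import random


def epsilon_greedy(q_values, state, actions, epsilon, rng=random):
    """Pick the best-known action with probability 1 - epsilon, else a random one."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    # Unseen (state, action) pairs are assumed to have a Q-value of 0.
    return max(actions, key=lambda a: q_values.get((state, a), 0.0))


q_values = {("threatened", "flee"): 0.4, ("threatened", "fight"): 0.7}
choice = epsilon_greedy(q_values, "threatened", ["flee", "fight", "negotiate"], epsilon=0.1)
```

A larger value of e makes an NPC more exploratory and less predictable, which may itself be a useful personality parameter.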
The learning should be ongoing, with rewards, punishments, and influence of law and control to influence learning and to influence determining which action response to select. The selected Actions result in creating the action space for the NPC for the existing state of the NPC.
Markov Decision Process
As described in the Wikipedia article for the Markov Decision Process, the following is an explanation of the formal description. I am including this description with the added variations that are necessary to make the process work for both actions and action responses.
A Markov decision process is a list of four objects, $(S, A, P_a(\cdot,\cdot), R_a(\cdot,\cdot))$, where
- $S$ is the state space. A state will be either positive or negative depending on the state of the NPC, and also the state of the local, community, and/or global environments.
- $A$ is the action space or action response space. Whether the action or action response is positive or negative depends on the state of the NPC, and also the state of the local, community, and/or global environments. This will be part of the action or action response selection function.
- $P_a(s, s') = \Pr(s_{t+1} = s' \mid s_t = s, a_t = a)$ is the probability that action $a$, or action response $a$, in state $s$ at time $t$ will lead to state $s'$ at time $t+1$.
- $R_a(s, s')$ is the immediate reward (or expected immediate reward) received after transition to state $s'$ from state $s$ with transition probability $P_a(s, s')$. This expected reward, or what I would rather call consequence, is not always what will actually happen. This means that each and every NPC can have individual and unique dynamic behavior, even if they choose to involve themselves in the actions of a collective group.
The goal is to maximize some cumulative function of the rewards, typically the discounted sum over a potentially infinite horizon:
$$\sum_{t=0}^{\infty} \gamma^{t} R_{a_t}(s_t, s_{t+1})$$
where $\gamma$ is the discount rate and satisfies $0 < \gamma \le 1$. It is typically close to 1.
Since the probabilities are unknown, the following function corresponds to taking action or action response $a$, and then continuing optimally (either positive or negative according to the current NPC action and/or action response behavior alignment):
$$Q(s, a) = \sum_{s'} P_a(s, s') \left( R_a(s, s') + \gamma V(s') \right)$$
where $V(s')$ denotes the value of continuing optimally from state $s'$.
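As a small numeric check of the discounted sum of rewards described in this section, the sketch below evaluates it for a short, finite sequence of consequences. The function name and the reward values are made up for illustration.

```python
def discounted_return(rewards, gamma):
    """Sum the rewards, weighting the reward at step t by gamma ** t."""
    total = 0.0
    for t, reward in enumerate(rewards):
        total += (gamma ** t) * reward
    return total


# A positive consequence, then a negative one, then a larger positive one.
value = discounted_return([1.0, -1.0, 2.0], gamma=0.9)
```

With a discount rate of 0.9 this evaluates to approximately 1 - 0.9 + 1.62 = 1.72, which shows how later consequences count for less than immediate ones.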
Experience during learning is based on (s,a) pairs (together with the outcome s’); or in other words, “The NPC was in state s, tried doing action a, or action response a in response to an action directed at the NPC, and state s’ occurred as a result.” An array Q of states and actions, or states and action responses, is created, and experience is used to update it directly. This is known as Q-learning (see Wikipedia reference for Q-learning). The details of these algorithms have been covered in previous journals and other sources of information that can be found on the Web, such as Wikipedia.
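The direct update of the Q array from experience could look like the following sketch of the standard Q-learning rule. The learning rate alpha, the discount rate gamma, and the state and action names are assumptions; any alignment-dependent sign is assumed to be already folded into the reward value.

```python
from collections import defaultdict


def q_learning_update(q, state, action, reward, next_state, next_actions,
                      alpha=0.1, gamma=0.9):
    """Move Q(state, action) toward reward + gamma * best Q in the next state."""
    best_next = max((q[(next_state, a)] for a in next_actions), default=0.0)
    q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
    return q[(state, action)]


# Q table with every unseen (state, action) pair starting at 0.
q_table = defaultdict(float)
updated = q_learning_update(q_table, "calm", "greet", 1.0, "friendly", ["greet", "attack"])
```

Each call records one experience of the form "in state s, action a was tried, and state s’ occurred with this consequence".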
The definition of how the reward function is maximized is also dynamic, because the expected rewards or consequences will either maximize towards a positive or a negative result. Whether it is positive or negative depends on the state of the NPC, and also the state of the local, community, and/or global environments. This will be part of the reward function.
Extended SARSA (State-Action-Reward-State-Action) Algorithm
The details of the SARSA algorithm covered here can be referenced on the SARSA Wikipedia page. The details will be varied according to applying the algorithm to the variations in this article.
The algorithm’s name reflects the fact that the main function for updating the Q-value depends on the current state (S1) of the NPC, the action or action response the NPC selects, the reward or consequence that the NPC experiences for selecting the action or action response, the resulting state (S2) that the NPC will now be in after executing the action or action response, and the subsequent action or action response that the NPC will select as a result of the new state.
Q-values are part of what is called the Q-decomposition method. This method requires each NPC to indicate a value, from its specific perspective, or its specific current state, for every action/response. Each NPC gives an action/response value, and the actions that maximize the sum of the individual action/response values are selected. (Russell and Zimdars, 2003) This process of selecting actions/responses can be performed at a global, community, and/or local level.
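The selection step described above, in which each NPC reports a value for every action/response and the action with the largest sum is chosen, might be sketched as below. The function name, the action names, and the numeric values are hypothetical and only illustrate the summation and maximization.

```python
def select_by_q_decomposition(npc_values, actions):
    """Return the action whose summed value across all NPCs is largest.

    npc_values is a list with one dict per NPC, mapping each action to that
    NPC's value for it from its own perspective and current state.
    """
    return max(actions, key=lambda a: sum(values[a] for values in npc_values))


actions = ["help", "ignore", "attack"]
npc_values = [
    {"help": 0.6, "ignore": 0.1, "attack": -0.5},
    {"help": -0.2, "ignore": 0.3, "attack": 0.4},
    {"help": 0.5, "ignore": 0.2, "attack": -0.9},
]
group_choice = select_by_q_decomposition(npc_values, actions)
```

Restricting npc_values to the NPCs of one community, or of one local group, would give the community-level or local-level variant of the same selection.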
Q-values represent the possible reward received in the next time step for taking action a, or action response a, in state s, plus the discounted future reward received from the next state-action, or state-action response, observation.
At each step of the learning process, the state of the NPC is determined, and an action/response is selected to be executed. The selected action/response is performed and, based on the results, the reward and the resulting state are determined. The subsequent action is then selected, and the process repeats. (Cutumisu et al., 2008)
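The learning loop just described corresponds to the standard SARSA update, which can be sketched as follows. The environment function toy_env_step, the state and action names, and the parameter values are all made up for illustration; the extended algorithm in this article would also feed the Behavior Alignment into the state.

```python
import random
from collections import defaultdict


def sarsa_episode(env_step, start_state, actions, q, alpha=0.1, gamma=0.9,
                  epsilon=0.1, max_steps=50, rng=random):
    """Run one episode, updating q with the State-Action-Reward-State-Action rule."""
    def choose(state):
        # e-greedy selection over the current Q estimates.
        if rng.random() < epsilon:
            return rng.choice(actions)
        return max(actions, key=lambda a: q[(state, a)])

    state, action = start_state, choose(start_state)
    for _ in range(max_steps):
        reward, next_state, done = env_step(state, action)
        next_action = choose(next_state)
        # On-policy target: uses the action actually selected for the next state.
        target = reward if done else reward + gamma * q[(next_state, next_action)]
        q[(state, action)] += alpha * (target - q[(state, action)])
        if done:
            break
        state, action = next_state, next_action
    return q


def toy_env_step(state, action):
    """Hypothetical one-step environment: greeting is rewarded, attacking is punished."""
    reward = 1.0 if action == "greet" else -1.0
    return reward, "done", True


q = defaultdict(float)
rng = random.Random(0)
for _ in range(200):
    sarsa_episode(toy_env_step, "calm", ["greet", "attack"], q, rng=rng)
```

The only difference from the Q-learning update is the target: SARSA uses the value of the next action the NPC actually selects rather than the best available one.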
A SARSA NPC will interact with the environment and update the policy based on the actions or action responses taken. This is known as an on-policy learning algorithm. In on-policy reinforcement learning, the decisions made by the algorithm are directly reflected in its output. (Cho et al., 2006)
Reflections on Dynamic Behavior Simulation
Calculating Behavior Alignment and selecting actions/responses with respect to environmental and physical stresses, consequences, and rewards can be compared to human-like mental activity. No human has direct and complete access into the mental state of activity of another. Mental activity is performed on an individualistic basis. However, the mental activity as a thought or decision is formulated through concurrent influence from everything with which and everyone with whom the human interacts, and has interacted.
Furthermore, the human's experience over its lifespan, and its history of previous mental thought, activity, and decisions, affect and influence the mental thought, activity, and decisions at any given current state for that human.
In a similar manner, actions and/or responses are selected by an NPC on an individual basis through concurrent influence by the population and environment as a whole. To a bystander, the population's actions/responses can appear to be concurrent. In reality, however, the actions of all NPCs are being influenced concurrently, but each NPC selects its actions/responses individually.
While the selection of actions/responses is individual, the selection is not unbiased. I again stress that instead of executing one function, or one single algorithm, to select the actions for a group or population of NPCs, each NPC should select its own actions. The concurrency of the population lies in the influence that each NPC and the environment in which the NPC exists, whether at a local, community, and/or global level(s), have on one another.
This in effect creates an environment of Dynamic Behavior simulation for NPCs that more closely follows the model of human-like life and reality. The effect of influence that one or many actions, responses, and/or environmental attributes have on an NPC depends on the individual's state at the time of being influenced, since each individual's experience can potentially be unique. However, if the intensity of influence is great enough, in terms of the actions/responses of other NPCs and environmental, political, and social influence at any level, individuals could potentially be influenced to select similar or identical actions. Even so, the actual selection of actions/responses by an NPC is still made individually. This creates a form of “free-agency” in the computer video-game environment, which follows realistic human-like behavior.
The significance of this type of individualistic NPC behavior brings an element of many unexpected results and possibilities in the game environment. The goal of keeping the game interesting through creating a more human-like experience is then achieved. The game may then be considered to be a more worthwhile purchase, as it could be replayed many times, and each time the characters in the game may act and react in a different manner, with unexpected twists and turns.
The individualism of Dynamic NPC Behavior should be the goal and direction for the development of artificial intelligence techniques in computer video-games, since the individual development and calculated behavior of NPCs will make the computer video-game experience much more fulfilling, dynamic, and interesting. In the computer video-games that I have personally played throughout my life, and games for which I have read reviews, the more interesting the interaction with non-played characters, whether positive or negative in alignment, the more interesting and enjoyable I found the game to be. Without interesting NPC interaction, the NPCs tended to annoy, being rather repetitive and static in their actions/responses, and once the game was played through, I had minimal desire to ever play it again. Therefore, the goal of making a game interesting, and keeping it interesting with each replay, is one that I most strongly encourage, and I think that most gamers have the same goal in mind when deciding whether a game would be worth their time.
Conclusions
Attempting to mimic human-like behavior through artificial intelligence methods can become very complex. However, with the advances in computer hardware and software, the complexity of artificial intelligence techniques has become less of an issue.
In this article, I have presented how to make non-played characters (NPCs) in computer video games more human-like. This was accomplished through extending the SARSA machine learning algorithm to be used not just for actions, but also for action responses.
It was described that actions are more of a “first-strike”, initiated activity, and action responses are executed in response to an action that has occurred at some point in history.
Furthermore, in determining which actions and which action responses should exist in the action space for any given NPC state, I introduced the idea of Action and Action Response Behavior Alignment, which can be either positive or negative. The Behavior Alignment is either positive or negative with respect to the environment alignment at the local, community, and/or global levels, in that order of increasing precedence.
Along with the idea of the Action and Action Response Behavior Alignment, non-verbal communication was discussed, along with how it can be used to increase the character complexity of NPCs. An example of increased NPC complexity is how non-verbal communication can be used to give an indication of the intentions behind actions/responses, and can also be used to introduce information, quests, or missions into the game that would otherwise be missed if a player is not paying close attention to the NPC's non-verbal communication.
The Behavior Alignment is learned using back-propagation in an artificial neural network. The resulting value is then used along with other state information to determine the category from which actions or action responses should be selected for inclusion in the action or action response space.
The action and action response categories are separated into two main groups of being either actions or action responses. They are further separated by being either positive or negative. And then each positive or negative category is subdivided into sub-categories, including the following examples: friendly/hostile, weak/mild/strong, cautious/bold/out-of-control, etc.
Following the SARSA method, the state of the NPC is determined, which includes determining the Behavior Alignment, and then the action or response is selected. The expected reward or consequence for the action/response and the resulting state are examined. If they are favorable, then the action is executed, the resulting state is analyzed, and a subsequent action/response, if there will be one, is analyzed for execution.
References
[1] SARSA Machine Learning Algorithm January 3, 2009
http://en.wikipedia.org/wiki/SARSA
[2] Markov Decision Process April 20, 2009
http://en.wikipedia.org/wiki/Markov_decision_process
[3] Q-Learning – A reinforcement learning technique April 13, 2009
http://en.wikipedia.org/wiki/Q-learning
[4] Cho, B., Jung, S., Shim, K., Swong, Y., Oh, H. 2006 Reinforcement Learning of Intelligent Characters in Fighting Action Games In Proceedings of the Fifth International Conference on Entertainment Computing
[5] Cutumisu, M., Szafron, D., Bowling, M., Sutton, R. 2008 Agent Learning Using Action-Dependent Learning Rates in Computer Role-Playing Games In Proceedings of the Fourth Artificial Intelligence and Interactive Digital Entertainment Conference
[6] Sprague, N., Ballard, D. 2003 Multiple-Goal Reinforcement Learning with Modular Sarsa(0) Technical Report at the University of Rochester, Computer Science Department
[7] Backpropagation – A method of training artificial neural networks. March 26, 2009
http://en.wikipedia.org/wiki/Backpropagation
[8] Adubato, Steve. “Make the Connection – Improve your Communication at Home and at Work” Barnes & Noble, Inc. 2007
[9] Russell, Stuart, Zimdars, Andrew L. 2003 Q-Decomposition for Reinforcement Learning Agents. International Conference on Machine Learning
[10] Bakker, Jorn. 2004 Online Adaptive Learning in a Role Playing Game. Doctoral Thesis Cognitive Artificial Intelligence, University of Utrecht
[11] Turing Test – A test of a machine’s ability to demonstrate intelligence. May 12, 2009
http://en.wikipedia.org/wiki/Turing_test