2.4 Planning and decision-making models

KBIAs rely on planning and decision-making models to solve problems effectively. Agents need these models to formulate strategies and policies, make informed choices, and achieve their goals.

2.4.1 Markov decision processes.

Markov Decision Processes (MDP) and Partially Observable Markov Decision Processes (POMDP) are models used in KBIA systems for planning and decision-making in uncertain environments. They provide a structured framework in which agents make optimal or near-optimal decisions when the outcomes of actions are known only as transition probabilities between states.

In an MDP, the agent interacts with the environment over successive time steps. At each step, the agent chooses an action, and the environment moves from the current state to a new state. An MDP is defined by five components [12 p.610].

States (S). The finite set of possible states of the agent and the environment.
Actions (A). The finite set of actions an agent can perform.
Transition probabilities (T). The probability distribution over next states, given the current state and the chosen action.
Rewards (R). The immediate reward that an agent receives for a certain action in a certain state.
Discount factor (γ). A parameter that represents the preference an agent places on immediate rewards over future rewards.

Agents seek a policy that maps states to actions and maximizes the expected cumulative discounted reward. When the model is known, such a policy can be computed with dynamic programming algorithms such as value iteration and policy iteration.
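As an illustration, value iteration can be sketched for a small MDP. The two states, two actions, and all probabilities and rewards below are hypothetical and chosen only to keep the example readable; they are not taken from [12].

```python
# Minimal value-iteration sketch for a hypothetical two-state MDP.
# All states, actions, probabilities and rewards are illustrative assumptions.
STATES = ["low", "high"]
ACTIONS = ["wait", "work"]
GAMMA = 0.9  # discount factor

# T[s][a] is a list of (next_state, probability) pairs.
T = {
    "low": {"wait": [("low", 1.0)], "work": [("high", 0.8), ("low", 0.2)]},
    "high": {"wait": [("high", 0.9), ("low", 0.1)], "work": [("high", 1.0)]},
}
# R[s][a] is the immediate reward for taking action a in state s.
R = {
    "low": {"wait": 0.0, "work": -1.0},
    "high": {"wait": 2.0, "work": 1.0},
}


def q_value(s, a, V):
    """Expected return of taking a in s and then following the values V."""
    return R[s][a] + GAMMA * sum(p * V[s2] for s2, p in T[s][a])


def value_iteration(tol=1e-8, max_iters=10_000):
    V = {s: 0.0 for s in STATES}
    for _ in range(max_iters):
        # Bellman optimality backup applied to every state.
        new_V = {s: max(q_value(s, a, V) for a in ACTIONS) for s in STATES}
        delta = max(abs(new_V[s] - V[s]) for s in STATES)
        V = new_V
        if delta < tol:
            break
    # Greedy policy with respect to the converged values.
    policy = {s: max(ACTIONS, key=lambda a: q_value(s, a, V)) for s in STATES}
    return V, policy


V, policy = value_iteration()
```

The returned policy maps each state to the action with the highest expected discounted return. Policy iteration would reach the same policy by alternating policy evaluation and greedy improvement.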

A POMDP extends the MDP to situations where the agent's observations are uncertain or incomplete. In a POMDP, the agent does not directly observe the true state of the environment, but receives observations that are probabilistically related to the underlying state. A POMDP consists of the following components [12 p.645].

States (S). Same as in MDP: the possible underlying states of the agent and the environment.
Actions (A). Same as in MDP: the actions available to the agent.
Observations (O). The finite set of observations that the agent can receive, depending on the underlying state and the action taken.
Transition probabilities (T). Same as in MDP: the probability distribution over next states, given the current state and the chosen action.
Observation probabilities (Z). The probability distribution over observations, given the underlying state and the action taken.
Rewards (R). Same as in MDP: the immediate reward that an agent receives for a certain action in a certain state.
Discount factor (γ). Same as in MDP: a parameter that represents the preference an agent places on immediate rewards over future rewards.

POMDP introduces the concept of belief states, which are probability distributions over the underlying states given the agent's history of actions and observations. Solving a POMDP means finding a policy that maps belief states to actions, so the agent has to reason over belief states rather than over directly observed states. Exact solutions are usually intractable, so approximate approaches are used: point-based value iteration to approximate the value function, and particle filters to approximate the belief state.
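The belief update itself can be sketched as a discrete Bayes filter: the new belief in state s' is proportional to Z(o | s', a) multiplied by the sum over s of T(s' | s, a) b(s). The two-state model below, the action name "listen", and the 0.85 sensor accuracy are hypothetical values used only for illustration.

```python
# Minimal belief-update sketch for a hypothetical two-state POMDP.
# The states, the "listen" action and the 0.85 accuracy are illustrative assumptions.
STATES = ["left", "right"]


def T(s, a, s2):
    """Transition probability T(s2 | s, a); here the action never changes the state."""
    return 1.0 if s == s2 else 0.0


def Z(s2, a, o):
    """Observation probability Z(o | s2, a); the observation is correct 85% of the time."""
    return 0.85 if o == s2 else 0.15


def belief_update(belief, action, observation):
    unnormalized = {}
    for s2 in STATES:
        # Prediction step: push the current belief through the transition model.
        predicted = sum(T(s, action, s2) * belief[s] for s in STATES)
        # Correction step: weight by the likelihood of the received observation.
        unnormalized[s2] = Z(s2, action, observation) * predicted
    total = sum(unnormalized.values())
    if total == 0.0:
        raise ValueError("observation has zero probability under the model")
    return {s: p / total for s, p in unnormalized.items()}


b0 = {"left": 0.5, "right": 0.5}
b1 = belief_update(b0, "listen", "left")
b2 = belief_update(b1, "listen", "left")
```

Starting from a uniform belief, each consistent observation shifts more probability mass toward the "left" state. A particle filter approximates the same update with samples when the state space is too large to enumerate.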

2.4.2 Bayesian networks.

Bayesian networks (BN), also known as Bayesian belief networks, are a class of probabilistic graphical models and a versatile, widely used knowledge representation and reasoning tool for planning and decision-making in KBIA systems. BNs allow agents to model relationships and dependencies between variables probabilistically and to reason about them under uncertainty [12 p.510].

Key concepts of BN.

A directed acyclic graph (DAG). A BN is represented as a DAG, where nodes correspond to random variables and directed edges represent probabilistic dependencies.
Nodes (random variables). Each node of the graph represents a random variable that may or may not be observed.
Conditional probability tables (CPT). The CPT of each node specifies the conditional probabilities of its values given the values of its parents in the graph.
Bayes' theorem. In BN, Bayes' theorem is used to update beliefs and draw conclusions based on observed evidence.
Inference algorithms. To perform probabilistic inference in BN, algorithms such as variable elimination and the junction tree algorithm are used, which allow beliefs to be updated efficiently.
Learning. BN can be learned from data using such methods as parameter estimation and structure learning.
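How CPTs and Bayes' theorem combine can be shown with a minimal sketch. The three-node network (Rain and Sprinkler as parents of WetGrass) and all probabilities are hypothetical. The posterior is computed by plain enumeration over the hidden variable, which is simpler than variable elimination or the junction tree algorithm and is practical only for very small networks.

```python
# Inference by enumeration in a hypothetical three-node Bayesian network:
# Rain -> WetGrass <- Sprinkler. All probabilities are illustrative assumptions.
P_RAIN = {True: 0.2, False: 0.8}
P_SPRINKLER = {True: 0.1, False: 0.9}
# CPT of WetGrass: P(wet = True | rain, sprinkler).
P_WET = {
    (True, True): 0.99,
    (True, False): 0.9,
    (False, True): 0.8,
    (False, False): 0.0,
}


def joint(rain, sprinkler, wet):
    """Joint probability as the product of each node's CPT entry given its parents."""
    p_wet = P_WET[(rain, sprinkler)]
    return P_RAIN[rain] * P_SPRINKLER[sprinkler] * (p_wet if wet else 1.0 - p_wet)


def posterior_rain(wet):
    """P(Rain | WetGrass = wet): sum out Sprinkler, then normalize (Bayes' theorem)."""
    scores = {
        rain: sum(joint(rain, sprinkler, wet) for sprinkler in (True, False))
        for rain in (True, False)
    }
    total = sum(scores.values())
    return {rain: p / total for rain, p in scores.items()}


posterior = posterior_rain(wet=True)
```

Observing wet grass raises the probability of rain from the prior of 0.2 to roughly 0.74 under these assumed numbers.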

BNs enable KBIA systems to reason about uncertainty, make informed decisions based on probabilities, and model causal relationships, helping agents to understand the impact of various factors on outcomes. Inference algorithms are efficient for many network structures, including fairly large sparse networks, although exact inference is NP-hard in the general case. BN models are especially valuable in situations where probabilistic dependencies and uncertainty play a significant role in decision-making.

2.4.3 Reinforcement learning (RL).

RL is an approach that can be applied in KBIA systems for planning and decision-making in dynamic and uncertain environments. RL allows agents to learn optimal or near-optimal policies and strategies through interaction with the environment, maximizing cumulative rewards. This approach is important in settings where explicit knowledge or predetermined rules are not sufficient, so agents need to adapt their policies and strategies based on feedback from the environment [12 p.830, 15].

Key concepts of RL.

Agent and environment. An agent interacts with the environment: it performs actions, and the environment responds by transitioning to new states and providing feedback in the form of rewards.
States. Represent the condition of the environment at a given moment and encapsulate all relevant information that an agent needs to make decisions.
Actions. Choices made by an agent to influence the environment. The agent's goal is to formulate policies and strategies that determine the best action in each state.
Rewards. Numerical values that indicate the desirability of an agent's action in a certain state. Agents seek to maximize the cumulative reward they receive over time.
Policy. A strategy that maps states to actions. The agent formulates policies to make decisions that optimize its long-term reward.
Value function. Estimates the expected cumulative reward that an agent can obtain from a given state by following a given policy.
Q-function. Also known as the action-value function, it estimates the expected cumulative reward that an agent can obtain by taking a given action in a given state and following a given policy afterwards.
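In the notation of Section 2.4.1, the last two concepts can be written out as follows. This is the standard formulation for a stochastic policy π(a | s) with discount factor γ; the symbols π, s_t and a_t are introduced here for illustration and do not appear elsewhere in the text.

```latex
V^{\pi}(s) = \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t} R(s_t, a_t) \,\middle|\, s_0 = s\right]

Q^{\pi}(s, a) = \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t} R(s_t, a_t) \,\middle|\, s_0 = s,\ a_0 = a\right]

V^{\pi}(s) = \sum_{a \in A} \pi(a \mid s)\, Q^{\pi}(s, a)
```

The third identity shows that the value function is the policy-weighted average of the Q-function over the actions available in a state.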

Methods that the agent can use in RL.

Q-learning. A model-free algorithm that learns an estimate of the optimal Q-function from experience, balancing exploration and exploitation when choosing actions.
Policy Gradient Methods. Directly optimize a parameterized policy by gradient ascent on the expected cumulative reward.
Actor-Critic Methods. Combine policy-based and value-based approaches, using an actor to update policies and a critic to estimate value functions.
Deep RL. Combines RL algorithms with deep neural networks to handle high-dimensional state spaces and complex environments.
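Of the methods listed, tabular Q-learning is the simplest to sketch. The five-state chain environment, the `step` function, and the hyperparameter values below are hypothetical and serve only as a toy example with an ε-greedy behavior policy.

```python
import random

# Tabular Q-learning on a hypothetical five-state chain (states 0..4).
# The environment and all hyperparameters are illustrative assumptions.
N_STATES = 5          # state 4 is the terminal goal state
ACTIONS = (0, 1)      # 0 = move left, 1 = move right
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.1


def step(state, action):
    """Deterministic chain dynamics: reward 1.0 only when the goal is reached."""
    next_state = max(state - 1, 0) if action == 0 else state + 1
    done = next_state == N_STATES - 1
    reward = 1.0 if done else 0.0
    return next_state, reward, done


def train(episodes=500, max_steps=200, seed=0):
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state = 0
        for _ in range(max_steps):
            if rng.random() < EPSILON:
                # Exploration: pick a random action.
                action = rng.choice(ACTIONS)
            else:
                # Exploitation: pick the greedy action, breaking ties randomly.
                action = max(ACTIONS, key=lambda a: (Q[state][a], rng.random()))
            next_state, reward, done = step(state, action)
            # Off-policy target: bootstrap from the best action in the next state.
            target = reward if done else reward + GAMMA * max(Q[next_state])
            Q[state][action] += ALPHA * (target - Q[state][action])
            state = next_state
            if done:
                break
    return Q


Q = train()
greedy_policy = [max(ACTIONS, key=lambda a: Q[s][a]) for s in range(N_STATES - 1)]
```

After training, the greedy policy moves right in every non-terminal state. The agent never uses the transition model directly, only sampled transitions, which is what makes the method model-free.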

RL allows KBIA systems to adapt to changing environments and to form optimal strategies without explicit rules, by interacting with the environment, which makes it suitable for scenarios where detailed knowledge is lacking. This approach makes it possible to solve complex decision-making tasks with large state spaces and uncertain dynamics.

When modeling agents with RL methods, the following must be taken into account: exploration and exploitation have to be balanced; a large number of interactions with the environment is usually needed to form an effective policy; the design of an appropriate reward function can be critical to achieving the desired behavior; suboptimal or unsafe policies may be generated before optimal or near-optimal policies are found, which can be a threat in safety-critical applications.