## Abstract

Specialization and hierarchical organization are important features of efficient collaboration in economic, artificial, and biological systems. Here, we investigate the hypothesis that both features can be explained by the fact that each entity of such a system is limited in a certain way. We propose an information-theoretic approach based on a free energy principle in order to computationally analyze systems of bounded rational agents that deal with such limitations optimally. We find that specialization allows a focus on fewer tasks, thus leading to a more efficient execution, but in turn, it requires coordination in hierarchical structures of specialized experts and coordinating units. Our results suggest that hierarchical architectures of specialized units at lower levels that are coordinated by units at higher levels are optimal, given that each unit's information-processing capability is limited and conforms to constraints on complexity costs.

## 1 Introduction

The question of how to combine a given set of individual entities in order to perform a certain task efficiently is a long-standing question shared by many disciplines, including economics, neuroscience, and computer science. Although the explicit nature of a single individual might differ between these fields (for example, an employee of a company, a neuron in a human brain, or a computer or processor as part of a cluster), they have one important feature in common that usually prevents them from functioning in isolation: they are all limited. In fact, this was the driving idea that inspired Herbert A. Simon's early work on decision making within economic organizations (Simon, 1943, 1955), which earned him a Nobel prize in 1978. He suggested that a scientific behavioral grounding of economics should be based on bounded rationality, which has remained an active research topic (Russell & Subramanian, 1995; Lipman, 1995; Aumann, 1997; Kaelbling, Littman, & Cassandra, 1998; DeCanio & Watkins, 1998; Gigerenzer & Selten, 2001; Jones, 2003; Sims, 2003; Burns, Ruml, & Do, 2013; Ortega & Braun, 2013; Acerbi, Vijayakumar, & Wolpert, 2014; Gershman, Horvitz, & Tenenbaum, 2015). Subsequent studies in management theory have been built on Simon's basic observation, because “if individual managers had unlimited access to information that they could process costlessly and instantaneously, there would be no role for organizations employing multiple managers” (Geanakoplos & Milgrom, 1991). In neuroscience and biology, similar concepts have been used to explore the evolution of specialization and modularity in nature (Kashtan & Alon, 2005; Wagner, Pavlicev, & Cheverud, 2007). In modern computer science, the terms *parallel computing* and *distributed computing* denote two separate fields that share the concept of decentralized computing (Radner, 1993), that is, the combination of multiple processing units in order to decrease the time of computationally expensive calculations.

Despite their success, most approaches to the organization of decision-making units based on bounded rationality also have shortcomings. As DeCanio and Watkins (1998) point out, existing agent-based methods (including their own) do not use an overarching optimization principle but are tailored to the specific types of calculations the agents are capable of, and therefore lack generality. Moreover, it is usually imposed as a separate assumption that there are two types of units, specialized operational units and coordinating nonoperational units, which was expressed by Knight (1921) as “workers do, and managers figure out what to do.”

Here, we use a free energy optimization principle in order to study systems of bounded rational agents, extending the work in Ortega and Braun (2011, 2013), Genewein and Braun (2013) and Genewein, Leibfried, Grau-Moya, and Braun (2015) on decision making, hierarchical information processing, and abstraction in intelligent systems with limited information-processing capacity, that has precursors in the economic and game-theoretic literature (McKelvey & Palfrey, 1995; Ochs, 1995; Mattsson & Weibull, 2002; Wolpert, 2006; Spiegler, 2011; Howes, Lewis, & Vera, 2009; Todorov, 2009; Still, 2009; Tishby & Polani, 2011; Kappen, Gómez, & Opper, 2012; Vul, Goodman, Griffiths, & Tenenbaum, 2014; Lewis, Howes, & Singh, 2014). Note that the free energy optimization principle of information-theoretic bounded rationality is connected to the free energy principle used in variational Bayes and active inference (Friston, Levin, Sengupta, & Pezzulo, 2015; Friston, Rigoli et al., 2015; Friston, Lin, Frith, & Pezzulo, 2017; Friston, Parr, & de Vries, 2017), but has a conceptually distinct interpretation and some formal differences (see section 6.3 for a detailed comparison).

By generalizing the ideas in Genewein and Braun (2013) and Genewein et al. (2015) on two-step information processing to an arbitrary number of steps, we arrive at a general free energy principle that can be used to study systems of bounded rational agents. The advantages of our approach can be summarized as follows:

- There is a unifying free energy principle that allows for a multiscale problem formulation for an arbitrary number of agents distributed among the steps of general multistep processes (see sections 3.3 and 4.2).

- The computational nature of the optimization principle allows explicitly calculating and comparing optimal performances of different agent architectures for a given set of objectives and resource constraints (see section 5).

- The information-theoretic description implies the existence of the two types of units already mentioned, nonoperational units (selector nodes) that coordinate the activities of operational units. Depending on their individual resource constraints, the free energy principle assigns each unit to a region of specialization that is part of an optimal partitioning of the underlying decision space (see section 4.3).

## 2 Preliminaries

### 2.1 Notation

We use curly letters (e.g., $\mathcal{W}$, $\mathcal{X}$, $\mathcal{A}$) to denote sets of finite cardinality, in particular the underlying spaces of the corresponding random variables (e.g., $W$, $X$, $A$), whereas the values of these random variables are denoted by lowercase letters: $w\in\mathcal{W}$, $a\in\mathcal{A}$, and $x\in\mathcal{X}$, respectively. We denote the space of probability distributions on a given set $\mathcal{X}$ by $\mathbb{P}_{\mathcal{X}}$. Given a probability distribution $p\in\mathbb{P}_{\mathcal{X}}$, the expectation of a function $f:\mathcal{X}\to\mathbb{R}$ is denoted by $\langle f\rangle_p:=\sum_x p(x)f(x)$. If the underlying probability measure is clear without ambiguity, we just write $\langle f\rangle$.

For a function $g$ with multiple arguments (e.g., for $g:\mathcal{X}\times\mathcal{Y}\to\mathbb{R},(x,y)\mapsto g(x,y)$), we denote the function $\mathcal{X}\to\mathbb{R},x\mapsto g(x,y)$ for fixed $y\in\mathcal{Y}$ by $g(\cdot,y)$ (partial application), that is, the dot indicates the variable of the new function. Similarly, for fixed $y\in\mathcal{Y}$, we denote a conditional probability distribution on $\mathcal{X}$ with values $p(x|y)$ by $p(\cdot|y)$. This notation shows the dependencies clearly without giving up the original function names and thus allows writing more complicated expressions in a concise form. For example, if $F$ is a functional defined on functions of one variable, such as $F[f]:=\sum_x f(x)$ for all functions $f:\mathcal{X}\to\mathbb{R}$, then evaluating $F$ on the function $g$ in its first variable, while keeping the second variable fixed, is simply denoted by $F[g(\cdot,y)]$. Here, the dot indicates on which argument of $g$ the functional $F$ is acting, and at the same time it records that the resulting value (which equals $\sum_x g(x,y)$ in the case of the example) does not depend on a particular $x$ but on the fixed $y$.
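The dot notation corresponds to partial application in code. The following minimal sketch evaluates $F[g(\cdot,y)]$ for the example functional $F[f]=\sum_x f(x)$; the concrete function `g`, the set `X`, and the value of `y` are hypothetical and chosen only for illustration.

```python
def F(f, domain):
    """Functional F[f] := sum of f(x) over all x in the domain."""
    return sum(f(x) for x in domain)


def g(x, y):
    """Hypothetical two-argument function, used only for illustration."""
    return x * y


X = [0, 1, 2, 3]  # hypothetical finite set
y = 10            # the fixed second argument

# F[g(., y)]: F acts on the first argument of g while y stays fixed.
value = F(lambda x: g(x, y), X)
```

The result depends on the fixed `y` but not on any particular `x`, which is what the dot records.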

### 2.2 Decision Making

Here, we consider (multitask) decision making as the process of observing a world state $w\in\mathcal{W}$, sampled from a given distribution $\rho\in\mathbb{P}_{\mathcal{W}}$, and choosing a corresponding action $a\in\mathcal{A}$ drawn from a posterior policy $P(\cdot|w)\in\mathbb{P}_{\mathcal{A}}$. If the joint distribution of $W$ and $A$ is given by $p(a,w):=\rho(w)P(a|w)$, then $P$ is the conditional probability distribution of $A$ given $W$. Unless stated otherwise, the capital letter $P$ always denotes a posterior, while the lowercase letter $p$ denotes the joint distribution or a marginal of the joint (i.e., a dependent variable).

*agent*. An agent is rational if its posterior policy $P$ maximizes the expected utility,
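Written out with the notation of section 2.1 and the joint distribution $p(a,w)=\rho(w)P(a|w)$ of section 2.2, the expected utility takes the following form. This is a reconstruction from those definitions, assuming a utility function $U:\mathcal{W}\times\mathcal{A}\to\mathbb{R}$ as in section 3.1; it is not a verbatim copy of the original equation.

```latex
\langle U \rangle_p
  \;=\; \sum_{w\in\mathcal{W}} \rho(w) \sum_{a\in\mathcal{A}} P(a|w)\, U(w,a)
```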

### 2.3 Bounded Rational Agents

Given a world state $w$, the information processing consists of transforming a prior $q$ to a world-state specific posterior distribution $P(\cdot|w)$. Since $D_{\mathrm{KL}}(P(\cdot|w)\,\|\,q)$ measures by how much $P(\cdot|w)$ diverges from $q$, the upper bound $D_0$ in equation 2.2 characterizes the limitation of the agent's average information-processing capability: If $D_0$ is close to zero, the posterior must be close to the prior for all world states, which means that $A$ contains only little information about $W$, whereas if $D_0$ is large, the posterior is allowed to deviate from the prior by larger amounts and therefore $A$ contains more information about $W$. We use the KL divergence as a proxy for any resource measure, as any resource must be monotone in processed information, which is measured by the KL divergence between prior and posterior.
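To make the resource measure concrete, the following sketch computes the average KL divergence between world-state specific posteriors and a shared prior, measured in bits. The function names and the two-state toy distributions are assumptions made for illustration.

```python
import math


def kl_divergence(posterior, prior):
    """D_KL(posterior || prior) in bits for discrete distributions.

    Terms with posterior probability zero contribute nothing.
    """
    return sum(p * math.log2(p / q) for p, q in zip(posterior, prior) if p > 0)


def expected_information_cost(rho, posteriors, prior):
    """Average of D_KL(P(.|w) || q) over world states w drawn from rho."""
    return sum(r * kl_divergence(post, prior) for r, post in zip(rho, posteriors))


rho = [0.5, 0.5]    # toy distribution over two world states
prior = [0.5, 0.5]  # toy prior over two actions

# A deterministic policy that picks a different action for each world state.
deterministic = [[1.0, 0.0], [0.0, 1.0]]
# A policy that ignores the world state and keeps the prior.
uninformed = [[0.5, 0.5], [0.5, 0.5]]

cost_deterministic = expected_information_cost(rho, deterministic, prior)
cost_uninformed = expected_information_cost(rho, uninformed, prior)
```

In this toy case the deterministic policy processes one bit on average, whereas the policy that never leaves the prior processes none, so only the latter would satisfy a bound $D_0$ close to zero.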

Technically, maximizing expected utility under constraint 2.2 is the same as minimizing expected complexity cost under the constraint of a minimal expected performance, where complexity is given by the expected KL divergence between prior and posterior and performance by expected utility. Minimizing complexity means minimizing the number of bits required to generate the actions.

### 2.4 Free Energy Principle

*free energy* $F$ of the corresponding decision-making process. In this form, the optimal posterior can be explicitly derived by determining the zeros of the functional derivative of $F$ with respect to $P$, yielding the Boltzmann-Gibbs distribution,
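A Boltzmann-Gibbs posterior of this kind can be sketched numerically. The sketch assumes the standard form $P(a|w)\propto q(a)\,e^{\beta U(w,a)}$, where the trade-off parameter `beta` (an assumed name) weights utility against the KL cost; the toy prior and utilities are hypothetical.

```python
import math


def boltzmann_posterior(prior, utilities, beta):
    """Posterior proportional to prior(a) * exp(beta * U(w, a)) for one fixed w."""
    weights = [q * math.exp(beta * u) for q, u in zip(prior, utilities)]
    normalizer = sum(weights)
    return [weight / normalizer for weight in weights]


prior = [0.25, 0.25, 0.25, 0.25]  # hypothetical uniform prior over four actions
utilities = [1.0, 0.0, 0.0, 0.0]  # hypothetical utilities U(w, a) for a fixed w

# beta = 0: no resources are spent, so the posterior equals the prior.
low_resource = boltzmann_posterior(prior, utilities, beta=0.0)
# Large beta: the posterior concentrates on the utility-maximizing action.
high_resource = boltzmann_posterior(prior, utilities, beta=50.0)
```

Small values of `beta` correspond to a tight information bound, and large values approach the fully rational choice.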

### 2.5 Optimal Prior
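For a fixed posterior, the prior that minimizes the expected KL divergence is the marginal $p(a)=\sum_w\rho(w)P(a|w)$, a standard result from rate-distortion theory. The following minimal sketch alternates between a Boltzmann-Gibbs posterior and this marginal prior in the style of the Blahut-Arimoto algorithm. The function names, the parameter `beta`, the fixed iteration count, and the toy utility are assumptions, and the sketch is not the authors' implementation.

```python
import math


def boltzmann_posterior(prior, utilities, beta):
    """Posterior proportional to prior(a) * exp(beta * U(w, a)) for one fixed w."""
    weights = [q * math.exp(beta * u) for q, u in zip(prior, utilities)]
    normalizer = sum(weights)
    return [weight / normalizer for weight in weights]


def marginal(rho, posteriors):
    """Marginal p(a) = sum_w rho(w) P(a|w)."""
    n_actions = len(posteriors[0])
    return [sum(r * post[a] for r, post in zip(rho, posteriors))
            for a in range(n_actions)]


def alternate_prior_posterior(rho, utility, beta, iterations=200):
    """Alternate posterior and prior updates, starting from a uniform prior."""
    n_actions = len(utility[0])
    prior = [1.0 / n_actions] * n_actions
    posteriors = []
    for _ in range(iterations):
        posteriors = [boltzmann_posterior(prior, row, beta) for row in utility]
        prior = marginal(rho, posteriors)
    return prior, posteriors


rho = [0.5, 0.5]
# Hypothetical utility: the third action is never the best response.
utility = [[1.0, 0.0, 0.0],
           [0.0, 1.0, 0.0]]
prior, posteriors = alternate_prior_posterior(rho, utility, beta=5.0)
```

In this toy example the prior mass of the action that is never useful shrinks toward zero, so the optimized prior concentrates on the two actions that are actually needed.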

### 2.6 Multistep and Multiagent Systems

When multiple random variables are involved in a decision-making process, such a process constitutes a multistep system (see section 3). Consider the case of a prior over $\mathcal{A}$ that is conditioned on an additional random variable $X$ with values $x\in\mathcal{X}$, that is, $q(\cdot|x)\in\mathbb{P}_{\mathcal{A}}$ for all $x\in\mathcal{X}$. Remember that we introduced a bounded rational agent as a decision-making unit that, after observing a world state $w$, transforms a single prior policy over a choice space $\mathcal{A}$ to a posterior policy $P(\cdot|w)\in\mathbb{P}_{\mathcal{A}}$. Therefore, in the case of a conditional prior, the collection of prior policies $\{q(\cdot|x)\}_{x\in\mathcal{X}}$ can be considered as a collection or ensemble of agents, or a multiagent system, where for a given $x\in\mathcal{X}$, the prior $q(\cdot|x)$ is transformed to a posterior $P(\cdot|x,w)\in\mathbb{P}_{\mathcal{A}}$ by exactly one agent. Note that a single agent deciding about both $X$ and $A$ would instead be modeled by a prior of the form $q(x,a)$ with $x\in\mathcal{X}$ and $a\in\mathcal{A}$.

Hence, in order to combine multiple bounded rational agents, we are first splitting the full decision-making process into multiple steps by introducing additional intermediate random variables (see section 3), which then will be used to assign one or more agents to each of these steps (see section 4). In this view, we can regard a multiagent decision-making system as performing a sequence of successive decision steps until an ultimate action is selected.

## 3 Multistep Bounded Rational Decision Making

### 3.1 Decision Nodes

Let $W$ and $A$ denote the random variables describing the full decision-making process for a given utility function $U:\mathcal{W}\times\mathcal{A}\to\mathbb{R}$, as described in section 2. In order to separate the full process into $N>1$ steps, we introduce internal random variables $X_1,\dots,X_{N-1}$, which represent the outputs of additional intermediate bounded rational decision-making steps. For each $k$, let $\mathcal{X}_k$ denote the target space and $x_k\in\mathcal{X}_k$ a particular value of $X_k$. We call a random variable that is part of a multistep decision-making system a *(decision) node*. For simplicity, we assume that all intermediate random variables are discrete (just like $W$ and $A$).

### 3.2 Two Types of Nodes: Inputs and Prior Selectors

A specific multistep architecture is characterized by specifying the explicit dependencies on the preceding variables for each node's prior and posterior or, better, the missing dependencies. For example, in a given multistep system, the posterior of the node $X_3$ might depend explicitly on the outputs of $X_1$ and $X_2$ but not on $W$, so that $P(x_3|x_2,x_1,w)=P(x_3|x_2,x_1)$. If its prior has the form $q(x_3|x_1)$, then $X_3$ has to process the output of $X_2$. Moreover, in this case, the actual prior policy $q(\cdot|x_1)\in\mathbb{P}_{\mathcal{X}_3}$ that is used by $X_3$ for decision making is selected by $X_1$ (see Figure 1).

Specifying the sets $X_k^{\mathrm{sel}}$ and $X_k^{\mathrm{in}}$ of selectors and inputs for each node in the system then uniquely characterizes a particular multistep decision-making system. Note that we always have $(X_1^{\mathrm{sel}},X_1^{\mathrm{in}})=(\{\},\{X_0\})$.

### 3.3 Multistep Free Energy Principle

### 3.4 Example: Two-Step Information Processing

The cases of serial and parallel information processing studied in Genewein and Braun (2013) are special cases of the multistep decision-making systems introduced above. Both cases are two-step processes ($N=2$) involving the variables $X_0=W$, $X_1=X$, and $X_2=A$. The serial case is characterized by $(X_2^{\mathrm{sel}},X_2^{\mathrm{in}})=(\{\},\{X_1\})$ and the parallel case by $(X_2^{\mathrm{sel}},X_2^{\mathrm{in}})=(\{X_1\},\{X_0\})$. There is a third possible combination for $N=2$, given by $(X_2^{\mathrm{sel}},X_2^{\mathrm{in}})=(\{\},\{X_0,X_1\})$. However, it can be shown that this case is equivalent to the (one-step) rate distortion case from section 2, because if $A$ has direct world state access, then any extra input to the final node $A=X_2$ that is not a prior selector contains redundant information.
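The two-step cases above can be written down as plain data: each node lists the indices of its prior selectors and of its inputs. The sketch below assumes this dictionary encoding, the helper names, and a hypothetical cardinality of 9 for the intermediate variable. It counts one agent per joint value of a node's selectors, following the ensemble view of conditional priors in section 2.6.

```python
# Node k is described by the indices of its prior selectors ("sel") and of its
# inputs ("in"); index 0 stands for the world state W, the last index for A.
SERIAL = {
    1: {"sel": set(), "in": {0}},
    2: {"sel": set(), "in": {1}},
}
PARALLEL = {
    1: {"sel": set(), "in": {0}},
    2: {"sel": {1}, "in": {0}},
}


def is_feedforward(architecture):
    """Every selector and input of node k must be a node with a smaller index."""
    return all(m < node
               for node, spec in architecture.items()
               for m in spec["sel"] | spec["in"])


def agents_per_node(architecture, cardinality):
    """One agent (one prior) per joint value of the node's selectors."""
    counts = {}
    for node, spec in architecture.items():
        n_agents = 1
        for selector in spec["sel"]:
            n_agents *= cardinality[selector]
        counts[node] = n_agents
    return counts


# Hypothetical cardinalities of the target spaces of X0, X1, X2.
cardinality = {0: 24, 1: 9, 2: 24}
serial_agents = agents_per_node(SERIAL, cardinality)
parallel_agents = agents_per_node(PARALLEL, cardinality)
```

With these assumed cardinalities, the serial system consists of one agent per node, while in the parallel system the single agent in the first node selects among nine agents in the second.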

## 4 Systems of Bounded Rational Agents

### 4.1 From Multistep to Multiagent Systems

As explained in section 2.6, a single random variable $X_k$ that is part of an $N$-step decision-making system can represent a single agent or a collection of multiple agents, depending on the cardinality of $X_k^{\mathrm{sel}}$, that is, whether $X_k$ has multiple priors selected by the nodes in $X_k^{\mathrm{sel}}$. Therefore, an $N$-step bounded rational decision-making system with $N>1$ represents a bounded rational multiagent system (of depth $N$).

### 4.2 Multiagent Free Energy Principle

### 4.3 Specialization

Although a given multiagent architecture predetermines the underlying set of choices for each agent, only a small part of such a set might be used by a given agent in the optimized system. For example, all agents in the final step potentially can perform any action $a\in\mathcal{A}$ (see Figure 2 and the example in section 4.4). However, depending on their individual information-processing capabilities, the optimization over the agents' priors can result in a (soft) partitioning of the full action space $\mathcal{A}$ into multiple chunks, where each of these chunks is given by the support of the prior of a given agent $x$, $\mathrm{supp}(p(\cdot|x))\subset\mathcal{A}$. Note that the resulting partitioning is not necessarily disjoint, since agents might still be sharing a number of actions, depending on their available information-processing resources. If the processing capability is low compared to the number of possible actions in the full space and if there are enough agents at the same level, then this partitioning allows each agent to focus on a smaller number of options to choose from, provided that the coordinating agents have enough resources to decide among the partitions reliably.
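The soft partitioning can be read off from the agents' priors by collecting their supports. In the following sketch, the prior values, the agent labels, and the numerical `threshold` used to decide whether an action belongs to a support are hypothetical.

```python
def support(prior, threshold=1e-3):
    """Indices of actions whose prior probability exceeds the threshold."""
    return {action for action, p in enumerate(prior) if p > threshold}


# Hypothetical optimized priors of two agents over four actions.
priors = {
    "x=0": [0.5, 0.5, 0.0, 0.0],
    "x=1": [0.0, 0.1, 0.45, 0.45],
}

supports = {agent: support(prior) for agent, prior in priors.items()}

# The partitioning is soft: both agents still share one action.
shared_actions = supports["x=0"] & supports["x=1"]
# Together, the supports cover the full action space.
covered_actions = supports["x=0"] | supports["x=1"]
```

Each agent only needs to choose among the actions in its own support, which is smaller than the full action space.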

### 4.4 Example: Hierarchical Multiagent System with Three Levels

Given a world state $w\in\mathcal{W}$, the agent in $X_1$ decides which of the three agents in $X_2$ obtains $w$ as an input. This narrows down the possible choices for the selected agent in $X_2$ to two out of the six agents in $A$. The selected agent performs the final decision by choosing an action $a\in\mathcal{A}$. Depending on its degree of specialization, which is a result of its own and the coordinating agents' resources, this agent will choose its action from a certain subset of the full space $\mathcal{A}$.

## 5 Optimal Architectures

Here, we show how the framework we have described can be used to determine optimal architectures of bounded rational agents. Summarizing the assumptions made in the derivations, the multiagent systems that we analyze must fulfill the following requirements:

(i) The information flow is feedforward. An agent in $X_k$ can obtain information directly from another agent that belongs to $X_m$ only if $m<k$.

(ii) Intermediate agents cannot be end points of the decision-making process. The information flow always starts with the processing of $W$ and always ends with a decision $a\in\mathcal{A}$.

(iii) A single agent is not allowed to have multiple prior policies. Agents are the smallest decision-making unit, in the sense that they transform a prior to a posterior policy over a set of actions in one step.

The performance of the resulting architectures is measured with respect to the expected utility they are able to achieve under a given set of resource constraints. To this end, we need to specify (1) the objective for the full decision-making process, (2) the number $N$ of decision-making steps in the system, (3) the maximal number $n$ of agents to be distributed among the nodes, and (4) the individual resource constraints $\{D_1,\dots,D_n\}$ of those agents. We illustrate these specifications with a toy example in section 5.2 by showcasing and explicitly explaining the differences in performance of several architectures. Moreover, we provide a broad performance comparison in section 5.3, where we systematically vary a set of objective functions and resource constraints in order to determine which architectural features most affect the overall performance. For simplicity, in all simulations, we are limiting ourselves to architectures with $N\leqslant 3$ nodes and $n\leqslant 10$ agents. In the following section, we start by describing how we characterize the architectures conforming to the three requirements.

### 5.1 Characterization of Architectures

#### 5.1.1 Type

In order to be able to reference the architectures resulting from *i–iii*, we label an $N$-step decision-making process with $N>1$ by a tuple $(i_1,\dots,i_{N-1})$ of length $N-1$, which we call the *type* of the architecture, where $i_{N-2}$ characterizes the relation between the first $N$ variables $X_0,\dots,X_{N-1}$, and $i_{N-1}$ determines how these variables are connected to $X_N$. Note that the explicit mapping between an index and the corresponding relation of random variables is arbitrary.

#### 5.1.2 Shape

After the number of nodes has been fixed, the remaining property that characterizes a given architecture is the number of agents per node. For most architectures, there are multiple possibilities to distribute a given number of agents among the nodes, even when neglecting individual differences in resource constraints. We call such a distribution a shape, denoted by $[n_1,n_2,\dots]$, where $n_k$ denotes the number of agents in node $k$. Note that not all architectures will be able to use the full number of available agents, most notably the one-step rate distortion case (one agent) and the two-step serial case (two agents). For these systems, we always use the agents with the highest available resources in our simulations.

For example, for $N\leqslant 3$, the resulting shapes for a maximum of $n=10$ agents are as follows:

- [1] for ($-1$,), [1, 1] for (0,), and [1, 9] for (1,)

- [1, 1, 1] for (0, 0) and (2, 1)

- [1, 1, 8] for (0, 2), (0, 3), (0, 5), (2, 2)

- [1, 1, (2, 4)] and [1, 1, (4, 2)] for (0, 4) and (2, 4)

- [1, 8, 1] for (1, 0) and (1, 1)

- [1, 4, 4] for (1, 2)

- [1, 2, 7], [1, 3, 6], [1, 4, 5], [1, 5, 4], [1, 6, 3], [1, 7, 2] for (1, 3) and (1, 5)

- [1, 2, (2, 3)] and [1, 3, (3, 2)] for (1, 4),

where a tuple inside the shape means that two different nodes are deciding about the agents in that spot; for example, $[1,1,(2,4)]$ means that there are eight agents in the last node, labeled by the values $(x_1,x_2)\in\mathcal{X}_1\times\mathcal{X}_2$ with $|\mathcal{X}_1|=2$ and $|\mathcal{X}_2|=4$. In Figure 4, we visualize one example architecture for each of the above three-step shapes, except for the shapes of type $(1,4)$, of which one example is shown in Figure 2.

Together, the type $(i,\dots)$ and shape $[n_1,\dots]$ uniquely characterize a given multiagent architecture, denoted by $(i,\dots)[n_1,\dots]$.
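The number of agents used by a shape follows directly from this convention: a plain entry counts the agents in a node, and a tuple entry contributes the product of its components. A minimal sketch, assuming shapes are encoded as Python lists with tuples for jointly selected nodes (the function name is hypothetical):

```python
from math import prod


def agents_in_shape(shape):
    """Total number of agents in a shape such as [1, 1, (2, 4)]."""
    total = 0
    for entry in shape:
        # A tuple (a, b) stands for a * b agents labeled by pairs of selector values.
        total += prod(entry) if isinstance(entry, tuple) else entry
    return total


shapes = [[1], [1, 1], [1, 9], [1, 1, (2, 4)], [1, 3, (3, 2)], [1, 2, (2, 3)]]
agent_counts = [agents_in_shape(shape) for shape in shapes]
```

For the shapes in this list, the counts never exceed the maximum of $n=10$ agents, and some shapes, such as $[1,2,(2,3)]$, leave one agent unused.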

### 5.2 Example: Call Center

Consider the operation of a company's call center as a decision-making process, where customer calls (world states) must be answered with an appropriate response (action) in order to achieve high customer satisfaction (utility). The utility function shown on the left of Figure 5 can be viewed as a simplistic model for a real-world call center of a big company such as a communication service provider. In this simplification, there are 24 possible customer calls that belong to three separate topics—for example, questions related to telephone, Internet, or television—which can be further subdivided into two subcategories—for example, consisting of questions concerning the contract or problems with the hardware. (See the Figure 5 caption for the explicit utility values.)

Handling all possible phone calls perfectly by always choosing the corresponding response with maximum utility requires $\log_2(24)\approx 4.6$ bits (see Figure 5). However, in practice, a single agent is usually not capable of knowing the optimal answers to every type of question. For our example, this means that the call center has access only to agents with information processing capability less than 4.6 bits. It is then required to organize the agents in a way so that each agent has to deal with only a fraction of the customer calls. This is often realized by first passing the phone call through several filters in order to forward it to a specialized agent. Arranging these selector or filter units in a strict hierarchy then corresponds to architectures of the form of $(1,4)$ or $(1,5)$ (see below for a comparison of these two), where at each stage, a single operator selects how a call is forwarded. In contrast, architectures of the form of $(2,4)$ allow for multiple independent filters working in parallel, for example, realized by multiple trained neural networks, where each is responsible for a particular feature of the call (say, one node deciding about the language of the call and another node deciding about the topic). In the following, we do not discriminate between human and artificial decision makers, since both can qualify equally well as information-processing units.

Assume that there are $n=10$ bounded rational agents available. Considering the given utility function, the architectures $(1,4)[1,3,(3,2)]$ (shown in Figure 2) and $(1,5)[1,3,6]$ (shown in Figure 4) might be obvious choices as they represent the hierarchical structure of the utility function. With an information bound of 1.6 bits ($\approx\log_2(3)$) for the first agent and 0.1 bits for the rest, the optimal prior policies for $(1,5)[1,3,6]$ obtained by our free energy principle are shown in Figure 6. We can see that for this architecture, the choice $x_1$ of the agent at the first step corresponds to the general topic of the phone call, and the decisions $x_2$ of the three agents at the second stage correspond to the subcategory in which one of the six agents at the final stage is specialized. This agent then makes the decision about the final response $a$ by picking one of the four actions in the support of its prior.
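The bit counts in this example can be checked with a short calculation. The sketch assumes that calls are uniformly distributed and that each call has exactly one optimal response, so that a perfect decision among $k$ equally likely options costs $\log_2(k)$ bits; the variable names are hypothetical.

```python
import math

topics = 3         # telephone, Internet, television
subcategories = 2  # contract or hardware
responses = 4      # responses handled by one specialized agent

n_calls = topics * subcategories * responses

# Bits a single agent would need to answer every call perfectly.
bits_single_agent = math.log2(n_calls)

# Bits needed per stage when the decision is split along the hierarchy.
bits_per_stage = [math.log2(topics), math.log2(subcategories), math.log2(responses)]
```

The per-stage costs add up to the single-agent cost, so the hierarchy does not reduce the total amount of processing but distributes it over agents with smaller individual bounds.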

We can see in Figure 7 on the left that a hierarchical structure as in $(1,5)[1,3,6]$ or $(1,4)[1,3,(3,2)]$ is indeed superior compared with the architecture $(2,4)[1,1,(2,4)]$, because there is no good selector for the second filter. We have also added two architectures to the comparison that have a bottleneck of the information flow at either end of the decision-making process, $(0,3)[1,1,8]$ and $(1,0)[1,8,1]$ (see Figure 4 for a visualization), which perform considerably worse than the others: in $(0,3)[1,1,8]$, the first agent is the only one who has direct contact with the customer and passes the filtered information on to everybody else, whereas in $(1,0)[1,8,1]$, the customer talks to multiple agents; however, they cannot make any decisions and pass on the information to a final decision node that has to select from all possible options. Interestingly, as can be seen on the right side of Figure 7, when changing the resource bounds such that the first agent has only $D_1=1$ bit instead of 1.6 and the second agent has $D_2=0.5$ bits instead of 0.1, then the strictly hierarchical architectures $(1,5)[1,3,6]$ and $(1,4)[1,3,(3,2)]$ are outperformed by the architecture $(2,4)[1,1,(2,4)]$, because their first agent is not able to perfectly distinguish among the three topics anymore. This is an ideal situation for $(2,4)[1,1,(2,4)]$, since here, the total information processing for filtering the phone calls is split efficiently between the first two agents in the system.

Note that $(1,4)$ and $(1,5)$ do not necessarily perform identically (as can be seen on the right in Figure 7), even though the structure of the utility function might suggest that it is ideal for $(1,5)[1,3,6]$ to always have the optimal priors shown in Figure 6. However, this crucially depends on the given information-processing bounds. In Figure 8, we illustrate the difference between the two types in more detail by showing the processed information that can actually be achieved per agent in the respective architecture for an information bound of $D=(0.4,2.6,2.6,2.6,0.4,\cdots ,0.4)$. When the first agent in the hierarchy has low capacity, then the rigid structure of $(1,4)$ is penalized because the agents at the second stage cannot compensate the errors of the first agent, regardless of their capacity. In contrast, for $(1,5)$, the connection between the second stage and the executing stage can be changed freely, which leads to ignoring the first agent and letting the three agents in the second stage determine the distribution of phone calls completely. In this sense, $(1,5)$ is more robust to errors in the first filter than $(1,4)$.

### 5.3 Systematic Performance Comparison

In this section, we move away from an explicit toy example to a broad performance comparison of all architectures for $N\leqslant 3$, averaged over multiple types of utility functions and a large number of resource constraints (as defined below). In section 6.1, this is supplemented with an analysis of the architectural features that best explain the performances.

#### 5.3.1 Objectives

We compare all possible architectures for 12 different utility functions, $\{U_k\}_{k=1}^{12}$, defined on a world and action space of $|\mathcal{W}|=|\mathcal{A}|=20$ elements, and we assume the same cardinality for the range of all hidden variables. Note that the cardinality of the target set $\mathcal{X}$ for selector nodes $X\in X^{\mathrm{sel}}$ is given by the number of agents it decides about. In particular, we consider three kinds of utility functions (one-to-one, many-to-one, one-to-many) that we vary in a $2\times 2$ paradigm, where the first dimension is the number of maximum utility peaks (single, multiple) and the second dimension is the range of utility values (binary, multivalued). The utility functions are visualized in Figure 9, where the three kinds of functions correspond to the three rows of the plot. A one-to-one scenario applies to a needle-in-a-haystack situation where each world state affords only a unique action and, vice versa, each optimal action allows uniquely identifying the world state, for example, an absolute identification task. A many-to-one scenario allows for abstractions in the world states, for example, in categorization when multiple instances are judged to belong to the same class (e.g., vegetables are boiled; fruit is eaten raw). A one-to-many scenario allows for abstractions in the action space, for example, in hierarchical motor control when a grasp action can be performed in many different ways.

#### 5.3.2 Resource Limitations

We are considering three schemes of resource constraints:

1. Same constraints for all agents

2. Same constraints for all agents but one, which has a higher limit than the other agents

3. Same constraints for all but two agents, which can have a different limit and have higher limits than all the other agents

For the first scheme, we compare 20 sets of constraints $\{D_0,D_1,\dots\}$ with $D_i$ equally spaced in the range between 0 and 3 bits; for the second, we compare 39 sets in the same range but with the high-resource agent having 1, 2, or 3 bits; and for the third, we allow 89 sets with constraints similar to those in the second scheme but with additional combinations for the second high-resource agent.

#### 5.3.3 Simulation Results

The performance of an architecture is given by its expected utility with respect to a given objective and a given information bound as defined above. In Figure 10, we show which of the architectures won at least one condition, together with the proportion of conditions won by each of these architectures. We can see that $(2,4)[1,1,(2,4)]$ overall outperforms all the other systems (see Figure 4 for a visualization). When all agents have the same resource constraints, the architecture $(1,4)[1,3,(3,2)]$ is a strong second winner; however, this is not the case if one or two agents have more resources than the rest. It is not surprising that in these situations, the parallel case with one high-resource agent distributing the work among the low-resource agents, and even the case of a single agent that does everything by itself, both perform well.

A better understanding of their performances under different resource constraints can be gathered from the remaining graphs in Figure 11. In the second row, we can see that the top three overall architectures also perform best for almost all utility functions when averaged over the information bounds. The last three graphs in Figure 11 show the expected utility of each architecture averaged over all utility functions for each information bound. We can see how the expected utility increases with higher information bounds for some architectures more than for others. The top three architectures perform differently for most of the bounds, with spans of bounds where each of them clearly outperforms the others.

## 6 Discussion

### 6.1 Analysis of the Simulations

Plenty of factors influence the performance of each of the given architectures. Here, we attempt to unfold the features that determine their performances in the clearest way. To this end, we compare the architectures with respect to the following quantities:

- *Average specialization of operational agents*: The specialization (see equation 4.5) averaged over all agents in the final stage of the architecture

- *Hierarchical*: Boolean value that specifies whether an architecture is hierarchical or not, meaning that consecutive nodes are occupied by an increasing number of agents

- *Agents with direct $w$-access*: The number of agents with direct world state access

- *Operational agents with direct $w$-access*: The number of agents in the last node of the architecture

- *Number of $w$-bottlenecks*: The total number of nodes that are missing direct access to the world state

As can be seen from Figure 12, we found that these architectural features explain the differences in performance quite well. More precisely, the architectures can be roughly grouped into different categories (indicated by slightly different color saturations in Figure 12). The poorest-performing group consists of architectures that have one or two $w$-bottlenecks, and therefore have only a few agents with direct $w$-access; in particular, none of their operational agents has direct $w$-access. Moreover, in this group, most architectures are not hierarchical at all, and their operational agents have low specialization, with two exceptions that both have two $w$-bottlenecks.

The architectures with medium performance have at most one $w$-bottleneck, and many of them are hierarchical. Here, systems that have operational units with high specialization are missing direct $w$-access, and the systems that have operational units with direct $w$-access have low specialization.

All architectures in the top group have many agents with direct world-state access, and they have no $w$-bottlenecks. Interestingly, the best six architectures are all strictly hierarchical. Moreover, the order of performance is almost in direct accordance with the average specialization of the operational agents.

Overall, we can say that it is best to have as many operational units as needed to discriminate the actions well, as long as the coordinating agents have enough resources to discriminate among them properly. The architecture $(1,4)[1,1,(2,4)]$ has eight operational agents managed by two coordinating units, which need at most two bits (for choosing among four agents) and one bit (for choosing among two agents) in order to perform well. The other two of the top three architectures, $(1,5)[1,3,6]$ and $(1,4)[1,3,(3,2)]$, have six operational agents managed by three coordinating units, so that each of them needs at most one bit. But compared to $(1,4)[1,1,(2,4)]$, there are fewer agents to spare for the operational stage. Hence, if the operational units have low resources, there is always a trade-off between the number of operational units and the resources of the coordinating ones.

Another way to see why the architecture $(1,4)[1,1,(2,4)]$ overall outperforms all the other high-ranked systems might be its lower average choice-per-agent ratio, that is, the average number of options for the decision of each agent in the system. In $(1,4)[1,1,(2,4)]$, the second agent also directly observes the world state; moreover, the choice space of eight agents at the operational stage is split into two and four choices. Therefore, there are only $\frac{2+4+20}{10}=2.6$ choices per agent on average, whereas for $(1,5)[1,3,6]$ and $(1,4)[1,3,(3,2)]$, there are $\frac{3+6+20}{10}=2.9$.
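The arithmetic behind these two comparisons can be checked with a short script. This is only an illustrative sketch: the function names are hypothetical, and the summands (including the value 20 for the operational stage) are copied from the formulas in the text rather than derived from the architecture notation.

```python
from math import log2


def bits_to_choose(n_options):
    """Bits needed to select deterministically among n_options alternatives."""
    return log2(n_options)


def choices_per_agent(choice_counts, n_agents):
    """Average number of options per agent: summed choice counts over agent count."""
    return sum(choice_counts) / n_agents


# Coordinating units choosing among four and among two subordinates.
print(bits_to_choose(4), bits_to_choose(2))

# Summands taken verbatim from the text; each architecture has 10 agents.
ratio_a = choices_per_agent([2, 4, 20], 10)  # (1,4)[1,1,(2,4)]
ratio_b = choices_per_agent([3, 6, 20], 10)  # (1,5)[1,3,6] and (1,4)[1,3,(3,2)]
print(ratio_a, ratio_b)
```

The ratio is only a coarse summary, since it ignores how unevenly the choices are distributed across agents.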

### 6.2 Limitations of Our Analysis

The analysis we have presented provides only a rough explanation of the differences in performance. Which architecture is optimal depends strongly on the actual information bounds of each agent. In all of our conditions, we assumed that most agents have the same processing capabilities, which is why there is a certain bias toward architectures that perform well under this assumption (low variance in choice-per-agent ratio across the agents).

Due to the large number of Lagrange parameters in the free energy principle (see equation 4.2), the data generation was done by running the Blahut-Arimoto-type algorithm for 10,000 different combinations of parameters for each of the architectures, for each of the three types of resource limitations in section 5.3, and for each of the utility functions defined in section 5.3. For a given information bound, the corresponding parameters were determined by looking for the points with the highest free energy that still respect the bound. A better approach would be to enhance the global parameter search by a more finely grained local search. Another possibility is to use an evolutionary algorithm, where each population is given by multiple sets of parameters and the information constraints are built in by a method similar to Chehouri, Younes, Perron, and Ilinca (2016). This works well but requires significantly more time to process.
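The selection procedure can be illustrated in a strongly simplified setting. The following sketch assumes a single agent with one Lagrange parameter `beta`, a uniform world-state distribution, and an identity utility; it is not the multi-agent algorithm used for the simulations. All function names, the grid of `beta` values, and the example utility are assumptions made for illustration.

```python
import numpy as np


def blahut_arimoto(U, p_w, beta, n_iter=200):
    """Alternate between p(a|w) ∝ p(a) exp(beta U(w,a)) and p(a) = sum_w p(w) p(a|w)."""
    n_a = U.shape[1]
    p_a = np.full(n_a, 1.0 / n_a)
    for _ in range(n_iter):
        logits = np.log(np.maximum(p_a, 1e-300))[None, :] + beta * U
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        p_cond = np.exp(logits)
        p_cond /= p_cond.sum(axis=1, keepdims=True)
        p_a = p_w @ p_cond
    return p_cond


def mutual_information_bits(p_cond, p_w):
    """I(W;A) in bits for the joint p(w) p(a|w)."""
    joint = p_w[:, None] * p_cond
    indep = p_w[:, None] * (p_w @ p_cond)[None, :]
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log2(joint[mask] / indep[mask])))


def best_under_bound(U, p_w, betas, bound_bits):
    """Among grid points respecting the bound, keep the one with highest free energy."""
    best = None
    for beta in betas:
        p_cond = blahut_arimoto(U, p_w, beta)
        info = mutual_information_bits(p_cond, p_w)
        if info > bound_bits + 1e-9:
            continue  # violates the information bound
        exp_u = float(np.sum(p_w[:, None] * p_cond * U))
        free_energy = exp_u - info * np.log(2) / beta  # information cost in nats
        if best is None or free_energy > best["free_energy"]:
            best = {"beta": float(beta), "info_bits": info,
                    "expected_utility": exp_u, "free_energy": free_energy}
    return best


# Hypothetical toy task: four world states, utility 1 for the matching action.
U = np.eye(4)
p_w = np.full(4, 0.25)
betas = np.linspace(0.1, 20.0, 200)
result_1bit = best_under_bound(U, p_w, betas, bound_bits=1.0)
result_2bit = best_under_bound(U, p_w, betas, bound_bits=2.0)
print(result_1bit)
print(result_2bit)
```

With several agents, the grid ranges over combinations of parameters instead of a single `beta`, which is why the number of evaluations grows so quickly and a local refinement of the grid becomes attractive.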

Since the Blahut-Arimoto type of algorithm is not guaranteed to converge to a global maximum, the resulting values for the expected utility and mutual information for a given set of parameters can depend on the initialization of the algorithm. In practice, this variation is small enough that it influences the average performance over multiple conditions by only a negligible amount. However, direct comparisons of architectures for a given information bound and utility function should be repeated multiple times to make sure that the results are stable.

### 6.3 Relation to Variational Bayes and Active Inference

Another interesting interpretation of equation 6.2 is that here, the hidden variable $X$ can be thought of as an action causing observed outcomes $y$. This is close to the framework of active inference (Friston, Rigoli et al., 2015; Friston, Parr et al., 2017), where actions directly cause transitions of hidden states, which generate outcomes that are observed by the actor. More precisely, there the real-world process generating observable outcomes is distinguished from an internal generative model describing the beliefs about the external generative process (e.g., a Markov decision process). Observations are generated from transitions of hidden states, which depend on the decision maker's actions. Decision making is given by the optimization of a variational free energy analogous to equation 6.2, where the log likelihood is given by the generative model, which describes beliefs about the hidden and control states of the generative process. This way, utilities are absorbed into a (desired) prior (Ortega & Braun, 2015). There are several differences to our approach. First, the structure of the free energy principle of bounded rationality originates from the maximization of a given predefined external utility function under information constraints, whereas the free energy principle of active inference aims to minimize surprise (equivalently, to maximize Bayesian model evidence), effectively minimizing the divergence between approximate and true posterior. Second, in active inference, utility is transformed into preferences in terms of prior beliefs, while in bounded rationality, prior policies over actions can be part of the optimization process, which results in specialization and abstraction. In contrast, active inference compounds utilities and priors into a single desired prior, which is fixed and does not allow separately optimizing utility and action priors.

## 7 Conclusion

In this work, we have presented an information-theoretic framework to study systems of decision-making units with limited information-processing capabilities. It is based on an overarching free energy optimization principle that, on the one hand, allows computing the optimal performances of explicit architectures and, on the other hand, produces optimal partitions of the involved choice spaces into regions of specialization. In order to combine a given set of bounded rational agents, the full decision-making process is split into multiple decision steps by introducing intermediate decision variables, and then a given set of agents is distributed among these variables. We have argued that this leads to two types of agents, nonoperational units that distribute the work among subordinates and operational units that are doing the actual work in the sense of choosing a particular action that either serves as an input for another agent in the system or represents the final decision of the full process. This “vertical” specialization is enhanced by optimizing over the agents' prior policies, which leads to an optimal soft partitioning of the underlying choice space of each step in the system, resulting in a “horizontal” specialization as well.

In order to illustrate the proposed framework, we have simulated and analyzed the performances under a number of different resource constraints and tasks for all possible three-step architectures whose information flow starts by observing a given world state and ends with the selection of a final decision. Although the relative architecture performances depend crucially on the explicit information-processing constraints, the overall best-performing architectures tend to be hierarchical systems of nonoperational manager units at higher hierarchical levels and operational worker units at the lowest level.

Our approach is based on earlier work on information-theoretic bounded rationality (Ortega & Braun, 2011, 2013; Genewein & Braun, 2013; Genewein et al., 2015). In particular, the $N$-step decision-making systems introduced in section 3 generalize the two-step processes studied in Genewein and Braun (2013) and Genewein et al. (2015). According to Simon (1979), there are three different bounded rational procedures that can transform intractable into tractable decision problems: (1) looking for satisfactory choices instead of optimal ones, (2) replacing global goals with tangible subgoals, and (3) dividing the decision-making task among many specialists. From this point of view, the decision-making process of a single agent, given by the one-step case of information-theoretic bounded rationality (Ortega & Braun, 2011, 2013) described in section 2, corresponds to the first procedure, while the bounded rational multistep and multiagent decision-making processes introduced in sections 3 and 4 can be attributed to the second and third procedures.

The main advantage of a purely information-theoretic treatment is its universality. To our knowledge, this work is the first systematic theory-guided approach to the organization of agents with limited resources in the generality of information theory. In other approaches, more specific methods, tailored to each particular focus of study, are used instead. In particular, bounded rationality usually has a very specific meaning, often being implemented by simply restricting the cardinality of the choice space. For example, in management theory, the well-known results by Graicunas from the 1930s (Graicunas, 1933) suggest that managers must have a limited span of control in order to be efficient. By counting the number of possible relationships between managers and their subordinates, he concludes that there is an explicit upper bound of five or six subordinates. Of course, there are many cases of successful companies today that disagree with Graicunas's claim; for example, Apple's CEO has 17 managers reporting directly to him. However, current management experts think that the optimal number is somewhere between 5 and 12. The idea of restricting the cardinality of the space of decision making is also studied for operational agents. For example, Camacho and Persky (1988) explore the hierarchical organization of specialized producers with a focus on production. Although their treatment is more abstract and more general than many preceding studies, their take on bounded rationality is very explicit and based on the assumption that the number of elementary parts that form a product, as well as the number of possibilities of each part, are larger than a single individual can handle. Similarly, in most game-theoretic approaches that are based on automaton theory (Neyman, 1985; Abreu & Rubinstein, 1988; Hernández & Solan, 2016), the boundedness of an agent's rationality is expressed by a bound on the number of states of the automaton.
Most of these non-information-theoretic treatments consider cases when there is a hard upper bound on the number of options, but they usually lack a probabilistic description of the behavior in cases when the number of options is larger than the given bound.

The work by Geanakoplos and Milgrom (1991) uses “information” to describe the limited attention of managers in a firm. But there, the term is used more informally, and not in the classical information-theoretic sense. However, one of their results suggests that “firms with more prior information about parameters … will employ less able managers, or give their managers wider spans of control” (Geanakoplos & Milgrom, 1991, p. 207). This observation is in line with information-theoretic bounded rationality, since by optimizing over priors in the free energy principle, the required processing information is decreased compared to the case of nonoptimal priors, so that less able agents can perform a given task or, similarly, an agent with a higher information bound can have a larger choice space.

In neuroscience, the variational Bayes approach explained in section 6.3 has been proposed as a theoretical framework to understand brain function in terms of active inference (Friston, 2009, 2010; Friston, Levin et al., 2015; Friston, Rigoli et al., 2015; Friston, Lin et al., 2017; Friston, Parr et al., 2017), where perception is modeled as variational Bayesian inference over hidden causes of observations. There, a processing node (usually a neuron) is limited in the sense that it can only linearly combine a set of input signals into a single output signal. Decision making is modeled by approximating Bayes' rule in terms of these basic operations and then tuning the weights of the resulting linear transformations in order to optimize the free energy (see equation 6.2). Hence, there, the free energy serves as a tool to computationally simplify Bayesian inference on the neuronal level, whereas our free energy principle is a tool to computationally trade off expected utility and processing costs, providing an abstract probabilistic description of the best possible choices when the information-processing capability is limited.

In the general setting of approximate Bayesian inference, there are many interesting algorithms and belief update schemes, for example, belief propagation in terms of message passing on factor graphs (see Yedidia, Freeman, & Weiss, 2005). These algorithms make use of the notion of the Markov boundary (minimal Markov blanket) of a node $X$, which consists of the nodes that share a common factor with $X$ (so-called neighbors). Conditioned on its Markov boundary, a given random variable is independent of all other variables in the system, which allows approximating marginal probabilities in terms of local messages between neighbors. These approximations are generally exact only on tree-like factor graphs without loops (Mézard & Montanari, 2009, theorem 14.1). This raises the interesting question of whether such algorithms could also be applied to our setting. First, it should be noted that variational Bayesian inference constitutes only a subclass of problems that can be expressed by utility optimization with information constraints. In this subclass, all random variables have to appear either in utility functions (they have to be given as log likelihoods) or in marginal distributions that are kept fixed; see, for example, the definition of the utility in the inference example above where $U(a,x_1,x_2,y)=\log p(y|a,x_1,x_2,S)$ compared to the utility functions of the form $U(w,a)$ used throughout the letter that leave all intermediate random variables $X_1,\ldots,X_{N-1}$ unspecified. Second, while it may be possible to exploit the notion of Markov blankets by recursively computing free energies among the nodes in a similar fashion to message passing, there can also be contributions from outside the Markov boundary, for example, when the action node has to take an expectation over possible world states that lie outside the Markov boundary.
Finally, it may be interesting to study whether message-passing algorithms can be extended to deal with our general problem setting and at least to approximately generate the same kind of solutions as Blahut-Arimoto, even though in general, we do not have tree-structured graphs.
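To make the notion of a Markov boundary on a factor graph concrete, the following minimal sketch collects all variables that share a factor with a given node. The representation of factors as tuples of variable names and the example graph are hypothetical and chosen only for illustration; they do not correspond to a specific architecture studied above.

```python
def markov_boundary(node, factors):
    """Return all variables that appear in at least one factor together with node."""
    neighbors = set()
    for scope in factors:
        if node in scope:
            neighbors.update(scope)
    neighbors.discard(node)  # a node is not its own neighbor
    return neighbors


# Hypothetical factor graph: each tuple lists the variables of one factor.
factors = [("W",), ("W", "X1"), ("X1", "A"), ("A", "Y")]

boundary_x1 = markov_boundary("X1", factors)
boundary_a = markov_boundary("A", factors)
print(boundary_x1, boundary_a)
```

In this chain-like example, the world state `W` lies outside the boundary of the action node `A`, which is the kind of situation in which an expectation over world states contributes from outside the Markov boundary.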

There are plenty of other possible extensions of the basic framework introduced in this work. Marschak and Reichelstein (1998) study multiagent systems in terms of communication cost minimization, while ignoring the actual decision-making process. One could combine our model with the information bottleneck method (Tishby, Pereira, & Bialek, 1999) and explicitly include communication costs in order to study more general agent architectures, in particular, systems with nondirected information flow. Moreover, we have seen in our simulations that specialization of operational agents is an important feature shared among all of the best-performing architectures. In the biological literature, specialization is often paired with modularity. For example, Kashtan and Alon (2005) and Wagner et al. (2007) show that modular networks are an evolutionary consequence of modularly varying goals. Similarly, it would be interesting to study the effects of changing environments on specialization, abstraction, and optimal network architectures of systems of bounded rational agents.

## Appendix: Proof of Equation 3.5

## Acknowledgments

This study was funded by the European Research Council (ERC-StG-2015-ERC Starting Grant, Project ID: 678082, “BRISC: Bounded Rationality in Sensorimotor Coordination”).

## References

*Lecture Notes in Computer Science: Vol. 6830. Artificial General Intelligence*