Hey, y’all! I realize it’s been a while since my last post; classes and work started up again, so I’ve been pretty busy. I’ve also spent a while trying to settle on a topic, to no avail: everything I thought up or attempted, I ended up scrapping for one reason or another. Usually the math didn’t work out quite how I expected, and so I didn’t want (or feel qualified) to talk about the subject.

Well, I finally figured out what I wanted to discuss: a neat little thing called a Markov chain, and a series of games I’ve been playing a lot of recently. Now, for anyone not too familiar with the Mass Effect Trilogy, you assume the role of Commander Shepard, a soldier for humanity’s space-faring military, the Alliance. In this universe, you travel across the galaxy, dealing with intergalactic politics, unethical scientific experiments, hordes of synthetic armies…the works, really. Along the way, you grapple with a lot of moral dilemmas, such as whether or not synthetic races, i.e. machines, which we would consider highly advanced artificially intelligent beings (AI), deserve to be treated the same as sentient, organic creatures. I am not going to attempt any sort of philosophical argument to sway you one way or the other on that particular question, but I am going to talk about some of the mathematics at play behind these fictional AI.

Now, those familiar with the series will remember their first friendly in-game encounter with an AI: EDI (otherwise known as the **E**nhanced **D**efense **I**ntelligence), installed aboard the Normandy SR-2. Her capabilities are vastly different from the consensus-reaching software of the Geth; in short, her singular personality allows her to develop “preferences”. That is, she is able to tailor her programmed priorities and reward mechanism to updated information through the use of machine-learning principles, in a way that makes her seem more like an organic being (i.e. more “human”, though in the Mass Effect universe there are many other races besides humans. Anyway, I digress…).

How exactly does this work? How are such decisions within machines made and what allows them to alter the decisions they make in the first place? Can this be truly developed to create dynamical, quick-thinking, ever-evolving machines?

Also: what the heck are Markov chains? Let’s start there:

**Markov chains** are discrete-time (specifically, integer-time) stochastic processes in which the random variable (r.v.) *X*_{n} has the “Markov property”. That means that as long as you have information on the current state of your system, *X*_{n}, any further information about the past is completely irrelevant for predicting the next state, *X*_{n+1}. Let’s talk about this within the context of an example:

Remember that salarian who got kicked out of The Flux at the Citadel for basically rigging the gambling terminals? That game called “quasar”? We can model a similar situation as a gambler’s ruin problem. Consider a game of high-stakes quasar in which, on any turn, you win 500 credits with probability p = 0.4 or lose 500 credits with probability 1 − p = 0.6. Suppose further that you adopt the rule that you quit playing once your fortune reaches *N* credits (of course, if your fortune reaches 0 credits, then that Volus manager is gonna kick you out, but hear me out here).

Let *X*_{n} be the amount of credits you have after *n* plays. Your fortune, *X*_{n}, has the Markov property described above. To check this for the gambler’s ruin chain, note that if you’re still playing at time *n*, i.e. your fortune is *X*_{n} = *i* with 0 < *i* < *N*, then for any possible history of your wealth *i*_{n-1}, *i*_{n-2}, …, *i*_{1}, *i*_{0},

P(*X*_{n+1} = *i* + 500 | *X*_{n} = *i*, *X*_{n-1} = *i*_{n-1}, …, *X*_{0} = *i*_{0}) = 0.4,

since to increase your wealth by one increment (that is, another 500 credits) you have to win your next bet.
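
If you’d like to play with this yourself, here’s a minimal Python sketch of the chain. The starting fortune, the seed, and the function name are just my choices for illustration:

```python
import random

def quasar_ruin(start=1500, bet=500, p_win=0.4, goal=2500, seed=42):
    """Simulate one run of the high-stakes quasar chain.

    Returns the sequence of fortunes X_0, X_1, ... until the chain is
    absorbed at 0 (broke) or at `goal` (you walk away a winner).
    """
    rng = random.Random(seed)
    fortune = start
    history = [fortune]
    while 0 < fortune < goal:
        # Markov property: the next fortune depends only on the current
        # one, not on the path that led here.
        fortune += bet if rng.random() < p_win else -bet
        history.append(fortune)
    return history

print(quasar_ruin())
```

Each printed run either climbs to the goal or crashes to zero, 500 credits at a time.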

Now, a Markov chain has a specified **transition matrix**, p(*i*, *j*), if for any *j*, *i*, *i*_{n-1}, …, *i*_{0},

P(*X*_{n+1} = *j* | *X*_{n} = *i*, *X*_{n-1} = *i*_{n-1}, …, *X*_{0} = *i*_{0}) = p(*i*, *j*).

This example is simplified further by the fact that the transition probability does not depend on the time *n* (the chain is “time-homogeneous”). Intuitively, the transition probabilities give the rules of the game; they are the basic information needed to describe a Markov chain. In this particular example of high-risk quasar, the transition probability has, for 0 < *i* < *N*,

p(*i*, *i* + 500) = 0.4 and p(*i*, *i* − 500) = 0.6,

with the absorbing endpoints p(0, 0) = 1 and p(*N*, *N*) = 1.

When *N* = 2500 (that is, 5 bets of 500 credits), the states are 0, 500, …, 2500 and the matrix is:

|          | 0   | 500 | 1000 | 1500 | 2000 | 2500 |
|----------|-----|-----|------|------|------|------|
| **0**    | 1.0 | 0   | 0    | 0    | 0    | 0    |
| **500**  | 0.6 | 0   | 0.4  | 0    | 0    | 0    |
| **1000** | 0   | 0.6 | 0    | 0.4  | 0    | 0    |
| **1500** | 0   | 0   | 0.6  | 0    | 0.4  | 0    |
| **2000** | 0   | 0   | 0    | 0.6  | 0    | 0.4  |
| **2500** | 0   | 0   | 0    | 0    | 0    | 1.0  |
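
Here’s a quick way to build that matrix programmatically; the dict representation and the function name below are just one possible sketch:

```python
def quasar_transition_matrix(goal=2500, bet=500, p_win=0.4):
    """Build the gambler's-ruin transition matrix p(i, j) as a dict.

    States are the fortunes 0, bet, 2*bet, ..., goal; the endpoints
    0 and goal are absorbing, so the chain stays put once it hits them.
    """
    states = list(range(0, goal + bet, bet))
    p = {(i, j): 0.0 for i in states for j in states}
    for i in states:
        if i in (0, goal):
            p[(i, i)] = 1.0               # absorbing: game over
        else:
            p[(i, i + bet)] = p_win       # win the next bet
            p[(i, i - bet)] = 1 - p_win   # lose the next bet
    return states, p

states, p = quasar_transition_matrix()
# Sanity check: every row of a transition matrix must sum to 1.
assert all(abs(sum(p[(i, j)] for j in states) - 1.0) < 1e-12 for i in states)
```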

So now you might be wondering: what do a game of high-stakes quasar, a cheating salarian, and EDI’s preference mechanism have in common? Let’s say we have a Markov chain where each state, *i*, is associated with a **reward**, *r*_{i}. As the Markov chain proceeds from state to state, there is an associated sequence of rewards that are not independent of one another, but rather are related through the statistics of the Markov chain. This concept of a reward in each state (rewards are really associated with the transitions between states, though their results are seen in each subsequent state) is what allows EDI to make the decisions she does.
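
As a concrete aside, the expected immediate reward of a single bet in any interior state is easy to compute. Treating the ±500-credit swing as the transition reward is my framing here, not anything canonical:

```python
def expected_step_reward(p_win=0.4, bet=500):
    """Expected reward of one transition from an interior state:
    +bet with probability p_win, -bet with probability 1 - p_win."""
    return bet * p_win - bet * (1 - p_win)

print(expected_step_reward())  # -100.0 credits: the house edge per bet
```

A negative expected reward per bet is exactly why a rigged quasar terminal would have been worth the salarian’s trouble.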

Within EDI’s software, there is something of a “decision maker”, or “control”, that analyzes and modifies both the transition probabilities and the rewards expressed at each state of the Markov chain. This decision-making algorithm attempts to maximize the expected reward, but it typically faces a compromise between immediate rewards and long-term rewards (which arise from choosing transition probabilities that lead to high-reward states). This is the basis of EDI’s acquired “preferences”.
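
One common way to picture that compromise is geometric discounting, where a reward *t* steps in the future is scaled by γ^t. The discount factor and the toy reward streams below are assumptions of this sketch, not anything stated about EDI:

```python
def discounted_value(rewards, gamma=0.9):
    """Present value of a reward stream: the immediate reward counts
    fully, and each later reward is scaled down by gamma per step."""
    return sum(r * gamma ** t for t, r in enumerate(rewards))

# Grab a big reward now, versus forgo it for a steadier payoff later.
greedy = discounted_value([10, 1, 1, 1, 1])
patient = discounted_value([0, 5, 5, 5, 5])
print(greedy, patient)  # the patient stream wins despite its slow start
```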

Ok, cool. So what if there is more than one reward to choose between? That is, instead of a single *r*_{i} for each state *i*, there is a choice between some number *K*_{i} of different rewards, *r*_{i}^{(1)}, *r*_{i}^{(2)}, …, *r*_{i}^{(K_i)}, and a corresponding choice between *K*_{i} different sets of transition probabilities, {P_{ij}^{(1)}; 1 ≤ *j* ≤ *M*}, {P_{ij}^{(2)}; 1 ≤ *j* ≤ *M*}, …, {P_{ij}^{(K_i)}; 1 ≤ *j* ≤ *M*}. At each time *n*, EDI’s control, given *X*_{n} = *i*, selects one of these *K*_{i} possible choices for state *i*. Seems easy enough. Now, the set of rules EDI’s control uses to select an alternative at each time is called a **policy**. Basically, EDI considers the expected aggregate reward over *m* steps of the Markov chain as a function of the policy used by her control.
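
That *m*-step calculation can be sketched as backward induction: at every step and in every state, pick the alternative that maximizes the immediate reward plus the expected future reward. The toy states and choices below are invented purely for illustration:

```python
def optimal_aggregate_reward(states, choices, m):
    """Backward induction over m steps of the chain.

    choices[i] is a list of (reward, probs) pairs -- the K_i
    alternatives in state i, where probs maps each next state j to
    P_ij.  Returns v (v[i] = maximum expected aggregate reward from i
    over m steps) and the maximizing choice in each state at step 0.
    """
    v = {i: 0.0 for i in states}       # value with zero steps to go
    policy = {i: None for i in states}
    for _ in range(m):
        new_v = {}
        for i in states:
            best_q, best_k = None, None
            for k, (r, probs) in enumerate(choices[i]):
                q = r + sum(pij * v[j] for j, pij in probs.items())
                if best_q is None or q > best_q:
                    best_q, best_k = q, k
            new_v[i], policy[i] = best_q, best_k
        v = new_v
    return v, policy

# Toy example: in state 'a', choice 0 pays 1 and stays put, while
# choice 1 pays nothing now but jumps to 'b', where every step pays 2.
choices = {
    "a": [(1, {"a": 1.0}), (0, {"b": 1.0})],
    "b": [(2, {"b": 1.0})],
}
v, policy = optimal_aggregate_reward(["a", "b"], choices, m=3)
print(v, policy)
```

With three steps to play, the control forgoes the immediate reward in "a" and jumps to "b", trading a sure 1 now for 2 per step later.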

Now, what is this set of rules? How does a policy come about? And is there a way to optimize across policies in order to maximize reward? The first two questions are pretty difficult to answer, and come down to how EDI “ranks” her preferences. The optimization falls under what we call **shortest-path problems**; as the name suggests, we essentially minimize the number of states the system passes through on the way to the desired state, and thus the time required to reach it. It’s analogous to how we, as humans, tend to favor the activities that demand the least time and effort, in order to expend less energy. That is: we are naturally lazy beings, and so is EDI.
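
Here’s one way to sketch that idea as a stochastic shortest-path computation: if every step costs one unit of time, value iteration converges to the minimum expected number of steps to the goal. The tiny graph below is made up for illustration:

```python
def min_expected_steps(states, goal, actions, sweeps=100):
    """Shortest-path value iteration: v[i] converges to the minimum
    expected number of steps from state i to `goal`, where actions[i]
    lists the transition-probability dicts available in state i."""
    v = {i: 0.0 for i in states}
    for _ in range(sweeps):
        for i in states:
            if i == goal:
                continue  # already at the target: zero steps left
            v[i] = min(1 + sum(p * v[j] for j, p in probs.items())
                       for probs in actions[i])
    return v

# Toy graph: from 0 you may hop to 1 or 2; 1 reaches the goal 3
# directly, while 2 must detour back through 1.
actions = {
    0: [{1: 1.0}, {2: 1.0}],
    1: [{3: 1.0}],
    2: [{1: 1.0}],
}
v = min_expected_steps([0, 1, 2, 3], goal=3, actions=actions)
print(v)  # the lazy choice from 0 is the hop straight to 1
```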

There is more to be said about policy optimization and dynamic programming, but I’d like to make that a separate post, which will end up being much more rigorous. Hopefully, this gives you a better idea of how EDI (and perhaps other synthetics in the ME universe) operates, and gives you a bit of background about the plethora of science that is touched upon in the games.

Like always, thanks for reading!
