The Mathematics of Linguistics

Hey, y’all! It’s certainly been a while. I’ve been busy the entire semester, but with the lull between the Fall & Spring terms, I figured I would try to write up something interesting I had been thinking about for some time.

Those who know me know that I am absolutely crazy about natural language applications – pieces of technology that use your speaking voice to control some aspect of a device. They have a wide range of uses, which I’ll leave to a different post discussing examples of these technologies. This post is about a topic a bit abstracted from that: the underlying ideas that make these applications possible come from computational linguistics, a unique & really cool subfield within the scope of AI.

Development of Linguistics as a Scientific Field

The idea of linguistics being a subject of mathematical research comes as a surprise to a lot of people I talk to who do not do much computer-science-type work; they have no idea that there are a number of mathematical models used directly in the field of linguistics, including n-grams & notions of entropy.
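To make the n-gram & entropy connection concrete, here’s a quick sketch – my own toy example, not anything from a real linguistics pipeline – that counts bigrams in a tiny corpus & computes the Shannon entropy of its word distribution:

```python
from collections import Counter
from math import log2

# Toy corpus; a real model would be trained on vastly more text.
corpus = "the cat sat on the mat the cat ate the rat".split()

# Bigrams: contiguous word pairs (the n = 2 case of an n-gram).
bigrams = Counter(zip(corpus, corpus[1:]))

# Unigram distribution & its Shannon entropy in bits:
unigrams = Counter(corpus)
total = sum(unigrams.values())
entropy = -sum((c / total) * log2(c / total) for c in unigrams.values())

print(bigrams.most_common(2))  # most frequent pair is ('the', 'cat')
print(round(entropy, 3))       # entropy of the word distribution, in bits
```

With more text, these counts turn into the probability estimates that drive the statistical language models discussed later in this post.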


Basic wordcloud of Wikipedia page on linguistics. Generated using Mathematica.

More indirectly, the foundational ideas behind both of these applications of probability theory come, perhaps unsurprisingly, from the logic developed by Cantor & reconciled by Russell in set theory, as well as from various mathematicians during the Golden Age of Logic – specifically Gödel & his incompleteness theorems. These ideas concerned the linguistic patterns of mathematical axioms, & they set the stage for analyzing everyday languages as communicative systems as they exist at any given point in time, regardless of their history, with ‘axioms’ replaced by ‘grammars’.

A Basic Idea of Set Theory

Set theory provided a needed versatility when identifying patterns at higher levels of abstraction, with the first whole theory being developed by Georg Cantor, who relied on Boole’s algebraic notations when working with syllogisms. The theory begins by establishing an arithmetic for sets:

If A and B are sets, let AB denote the set consisting of all members common to A and B, and let A + B denote the set consisting of all members of A together with all members of B.

This is just a formalization of Boole’s work, the only difference being that here, A & B refer to any sets, not just those arising as a consequence of propositional logic. To avoid confusion, a contemporary notation was developed in which AB becomes A ∩ B (read as ‘A intersection B’) & A + B becomes A ∪ B (read as ‘A union B’). The arithmetic developed by Boole still holds in this generalization, except for the axioms involving the object 1, because there is no need for such an object in set theory. Sets of objects are subject to the same kinds of properties – the same kind of arithmetic – which may, but need not, exclude other objects from belonging to the same set.
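This generalized arithmetic is exactly what modern set implementations expose. A minimal sketch in Python (my own example, with arbitrary elements):

```python
A = {1, 2, 3, 4}
B = {3, 4, 5}
C = {4, 5, 6}

print(A & B)  # A ∩ B = {3, 4}
print(A | B)  # A ∪ B = {1, 2, 3, 4, 5}

# Boole's laws survive the generalization to arbitrary sets:
assert A & B == B & A                    # commutativity
assert A & (B | C) == (A & B) | (A & C)  # distributivity
```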


Sets are intuitively represented as Venn diagrams. The left is the union of both sets, whereas the right is the intersection of the given sets.

Small sets of objects are typically listed explicitly, while larger or infinite sets are written with ellipses, e.g.:

  • {1, 2, 3}, a set consisting of the elements 1, 2, & 3
  • {1, 2, 3, …}, the infinite set of all natural numbers, also represented as ℕ

The introduction of sets allows us to examine patterns at varying levels of abstraction: the natural numbers sit within the set of integers, which sit within the set of rational numbers, which sit within the set of real numbers, which sit within the set of complex numbers. We typically refer to sets within sets as subsets. In notation, this is

ℕ ⊂ ℤ ⊂ ℚ ⊂ ℝ ⊂ ℂ

Mathematicians have used this idea to answer questions regarding the nature of numbers. To answer a question like, “what is a number?”, we have to look at how members of certain sets of numbers can be described in terms of the sets that precede them – that is, how properties of numbers within a less abstract set can be combined in some fashion so as to represent elements within a more abstract set. The lowest level – the set of natural numbers – can be described in terms of axioms (namely, you can construct the notion of the natural numbers using solely the empty set – the set with no elements – denoted as ∅).
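That parenthetical construction is worth sketching. Here’s a toy Python rendering (my own illustration) of the von Neumann encoding, where 0 is ∅ and each successor is n ∪ {n}:

```python
# 0 is the empty set; each successor is n + 1 = n ∪ {n}.
zero = frozenset()

def succ(n):
    """Von Neumann successor: n + 1 = n ∪ {n}."""
    return n | frozenset([n])

one = succ(zero)    # {∅}
two = succ(one)     # {∅, {∅}}
three = succ(two)   # {∅, {∅}, {∅, {∅}}}

# Under this encoding, the number n is a set with exactly n elements,
# and m < n corresponds to both m ∈ n and m being a proper subset of n.
print(len(three))    # 3
print(one in three)  # True
print(one < three)   # True: proper subset
```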

Inconsistency & Reconciliation with Axiomatic Set Theory

For all the usefulness of set theory, there was a major flaw in the framework. Before I discuss what the flaw was, it is worth discussing why having one is such a big deal.

Of all the things that can be wrong with a particular axiom system – & you would hope there are none – inconsistency is definitely the worst. It is possible to work with axioms that are hard to understand; we see this every day: the Peano axioms that characterize the natural numbers are not widely known by most people, yet we use natural numbers constantly. It is also possible to work with axioms that are counterintuitive, which we, again, see every day (shoutout to statisticians). It is even possible to work with axioms that do not accurately describe the system you are intending to represent, since those axioms could be useful elsewhere in mathematics. But inconsistent axioms? No way, José.

The particular inconsistency, found by Bertrand Russell, was one relating to properties. Recall that I said that elements satisfying particular properties can be grouped together into sets. Well, what happens when you have properties that are not properties of the elements within sets, but properties of the sets themselves? Is a set a member of itself? And if certain sets do not fulfill this property, are they not members of themselves? That is to say:

Consider the set R of all sets that are not members of themselves. If R ∈ R, then by definition R ∉ R; but if R ∉ R, then by definition R ∈ R. Either way, we arrive at a contradiction in the definition of a set itself. Russell examined this question in a more nuanced way, which I will not go into detail about here, but arrived at the same impasse.

The solution? The development of axiomatic set theory, which added new axioms to those that already existed. It is not as simple & succinct as Cantor’s original theory, & so it was only with reluctance that the original was abandoned, but it just goes to show that even the most intuitive ideas need to be critically examined for flaws.

Gödel’s Incompleteness Theorem

A similar type of foundational flaw was found in the axiomatic approach to mathematics itself. The reason we use axioms as building blocks for particular mathematical structures is that doing so makes it possible to separate the idea of being able to prove something from the idea of something being objectively true. If you had a proposition that was provable, then you could use a sequence of arguments to deduce whether or not the proposition was true, so long as your axioms were assumed true. Separating these ideas allowed mathematicians to avoid dealing with the major philosophical implications that “objective truth” entails.

Now, since figuring out which axioms to use is crucial to formalized mathematics, this implies – within the notion of formalization – that you will, at some point, find all the axioms you need to be able to prove something. This is the idea of completeness of an axiomatic system. But, as noted with Russell, finding all the axioms needed can be a really difficult undertaking. How do we know if we have found them all? How can we have a complete set of axioms?

In comes Gödel, who showed that such an axiom system must be incomplete: there are questions that cannot be answered on the basis of the axioms alone. The big idea here is that consistency precludes completeness: if your axioms are consistent & strong enough to express basic arithmetic – known operations like addition, multiplication, etc. – then there exist statements that are true but not provable.


Kurt Gödel (1906-1978)

The proof of Gödel’s Theorem is better left to the entire books that are dedicated to its uses & abuses as it pertains to mathematical logic. A book I would recommend, if you are curious, is Gödel’s Theorem: An Incomplete Guide to Its Use and Abuse by Torkel Franzén.

Grammatical Approach to Linguistics

From what we have seen previously, it is not so far-fetched, then, to examine the study of everyday language with the same axiomatic basis we have seen before with the notion of sets. I suppose another question that arises in this examination is why: why would we want to analyze language with this paradigm? Let us look at an example:

  1. People love going to the movie theater.
  2. Cats are interested in mayonnaise and like resting in catacombs.
  3. Lamp soft because mom to runs.

It should come as no surprise that the third sentence is not proper English, but the first & second sentences are. Both are grammatically correct, yet the second is nonsensical. How can that be the case? Why can a nonsensical sentence still be classified as a genuine English sentence? It is because of the way in which the words are stitched together – the structure of the sentence.

Sentence structures – much like sets – are abstract constructs containing elements that adhere to specific properties. These elements – in this case, particular words – are things you can explicitly point to. Because sentences sit a level of abstraction above words, the rules governing them cannot be pointed at directly. The only thing you can do is observe the repetitive behavior expressed by sentences & then come up with a way to generalize that behavior. These patterns constitute rules much like the axioms we have encountered before; the type of “arithmetic” applied to this set of axioms forms the basis of a grammar for a language.

Inspired by the advancements of logic in the Golden Age, this new mathematically based linguistics attempted to reduce all meaningful statements to a combination of propositional logic & information conveyed by the five human senses. A step further, Noam Chomsky attempted to do what could not be done with set theory: to design a process of finding all axioms that described the syntactic structure of language.

  • S → DNP VP
  • VP → V DNP
  • PP → P DNP


  • DNP = definite noun phrase
  • VP = verb phrase
  • S = sentence
  • V = verb
  • P = preposition
  • PP = prepositional phrase

These are just a few pieces of the formalism developed by Chomsky. Using the above grammar – & typically a parse tree – it is easy to apply it to the English lexicon, as each word corresponds to one of the given types. This formal grammar, built on axiomatic principles, captures some of the structure of the English language.
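To see the grammar in action, here’s a toy recognizer – my own sketch, using the CYK algorithm; the lexical rules for Det & N are my additions, since the rule list above only covers phrases – that checks whether a string of words can be rewritten all the way up to S:

```python
from itertools import product

# Chomsky-normal-form grammar sketched from the rules above; the
# lexical categories (Det, N) are my own additions for illustration.
binary_rules = {            # (left child, right child): parent
    ("DNP", "VP"): "S",
    ("V", "DNP"): "VP",
    ("Det", "N"): "DNP",
}
lexicon = {"the": "Det", "rat": "N", "cheese": "N", "ate": "V"}

def recognizes(words):
    """CYK: return the set of categories spanning the whole sentence."""
    n = len(words)
    # table[i][j] holds the categories that derive words[i:j]
    table = [[set() for _ in range(n + 1)] for _ in range(n)]
    for i, w in enumerate(words):
        table[i][i + 1].add(lexicon[w])
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for left, right in product(table[i][k], table[k][j]):
                    if (left, right) in binary_rules:
                        table[i][j].add(binary_rules[(left, right)])
    return table[0][n]

print(recognizes("the rat ate the cheese".split()))  # {'S'}: grammatical
print(recognizes("the ate rat the cheese".split()))  # set(): not derivable
```

The table-filling mirrors the parse tree: small constituents (Det + N → DNP) are built first, then combined into larger ones until – for a grammatical sentence – the whole string reduces to S.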


Parse tree of a simple sentence. The sentence, “the rat ate cheese”, is broken down into its constituents.

The Advent of Natural Language Processing

It should be noted that the use of parse trees is highly developed in computer science & used extensively in natural language processing (NLP). Chomsky’s algebraic work in grammar has allowed computer scientists to break down human speech patterns in a way that machines can analyze, understand, & ultimately interpret; NLP considers the hierarchical structure of language.


Amazon’s family of natural language devices. From left to right: the Tap, the Echo, the Dot. Each device utilizes Alexa Voice Services, Amazon’s unique natural language processor.

However, there are still major issues to be resolved within the field. One of these is the ambiguity that arises out of spoken language — English in particular. Consider the following basic sentence:

  • I didn’t take her purse.

Depending on where the emphasis is placed, this sentence conveys different messages – & thus different pieces of information. Emphasis is not easily reconciled within machines, because it was not an aspect of language captured in Chomsky’s mathematical treatment of sentences; most processors do not know how to handle these kinds of ambiguities unless the ambiguity somehow disappears. Crudely, this is done through hard-coding the particular linguistic patterns you want your machine to interpret. However, there have been strides toward probabilistic methods of interpreting speech, largely thanks to the introduction of machine learning algorithms, on which most NLP techniques are now based.

Machine Learning in a Nutshell

Machine learning is a discipline in its own right, which I’ve touched upon in other posts (but have yet to give a real in-depth treatment). It is characterized by the examination of large data sets – dubbed training sets – & the making of statistical inferences from those data. Generally, the more data you analyze, the more accurate your model becomes. A popular application of machine learning is social media analysis.

There are entire texts dedicated to the subject, one of my favorites being this one, which combines definitions with visualizations.

Linguistics as Portrayed in Popular Culture

Perhaps the most recent portrayal of modern linguistics can be seen in the recent sci-fi blockbuster Arrival, which is about a linguistics professor attempting to converse with aliens. The movie is a nod to researchers at SETI, as well as to Freudenthal, who developed Lincos – an attempt at creating a language based on the perceived commonalities in mathematics that we would have with an alien species. His work, while pivotal, had some serious pitfalls, which you can read about in the attached article.


Dr. Louise Banks attempting to communicate basic words to alien visitors in the latest sci-fi film Arrival.

During the film, the audience was able to see some of the computational methods that modern linguists use in analyzing speech, & this technology is written about extensively by Stephen Wolfram (in fact, it is his technology they’re using in the film).

As a whole, the film helped to showcase linguistics as a highly technical & scientific field of study, as opposed to a degree that people get because they “speak like five languages or something” (actual quote by someone I know, who will remain nameless).

Not the Beginning, but Not the End Either

While linguistics has been around for centuries, it was only during the mid-20th century that it took a turn from the historical to the mathematical; the major advancements made by Gödel, building on the groundwork set by Cantor, Russell & Boole (as well as Aristotle, technically), allowed the study of patterns to be applied to fields that were previously outside the scope of mathematics. The advent of computers ushered in new questions about language & artificial intelligence, which are now bustling fields of research. Yet for all our progress, we have yet to answer fundamental questions about what makes language the way it is. These – & other questions – suggest we have a ways to go, but it would seem as if we have a good foundation to base our future findings on. Though, if anything is certain about mathematics, it is that intuition does not always reign supreme; language is constantly evolving, & it is good practice to examine foundations for flaws, or else we might just end up at the same philosophical standstill that many mathematicians have fallen victim to before.

Markov Chains and the Mass Effect Universe

Hey, y’all! I realize it’s been a while since my last post; classes and work started up again, so I’ve been pretty busy. I’ve also been trying to figure out a topic to talk about for a while now, but to no avail. Everything I had thought up or attempted I ended up scrapping for some reason or another – usually the math didn’t end up working quite like how I expected, and so I didn’t want to (or feel qualified to) talk about the subject.

Well, I finally figured out what I wanted to discuss: a neat little thing called a Markov chain, and a series of games I’ve been playing a lot of recently. Now, for anyone not too familiar with the Mass Effect Trilogy, you assume the role of Commander Shepard, a soldier for humanity’s space-faring military, the Alliance. In this universe, you travel across the galaxy, dealing with intergalactic politics, unethical scientific experiments, hordes of synthetic armies…the works, really. Along the way, you grapple with a lot of moral dilemmas, such as whether or not synthetic races, aka machines – which we would consider highly advanced artificially intelligent beings (AI) – deserve to be treated the same as sentient, organic creatures. I am not going to attempt any sort of philosophical argument to get you to think one way or another about this particular question, but I am going to talk about some of the mathematics at play regarding these fictional AI.

Now, those familiar with the series will remember their first friendly encounter with AI in-game as EDI (otherwise known as the Enhanced Defense Intelligence), which was installed aboard the Normandy SR-2. Her AI capabilities are vastly different from the consensus-reaching software of the Geth; in short, her singular personality allows her to develop “preferences”. That is, she is able to tailor her programmed priorities/reward mechanism to updated information through the use of machine learning principles, in a way that makes her seem more like an organic being (aka more “human”, though in the Mass Effect universe there are many other races besides humans. Anyway, I digress…).


This is EDI, after she assumes the platform of the fake Dr. Eva Coré in Mass Effect 3.

How exactly does this work? How are such decisions within machines made, and what allows them to alter the decisions they make in the first place? Can this truly be developed to create dynamic, quick-thinking, ever-evolving machines?

Also: what the heck are Markov chains? Let’s start there:

Markov chains are integer-time (implicitly, discrete) processes in which a random variable (r.v.), X_n, has the “Markov property”. That means that as long as you have information on the current state of your system, X_n, any other information about the past is completely irrelevant for predicting the next state, X_{n+1}. Let’s talk about this within the context of an example:

Remember that salarian who got kicked out of The Flux at the Citadel for basically rigging the gambling terminals? That game called “quasar”? We can model a similar situation as a gambler’s ruin problem. Consider a game of high-stakes quasar in which, on any turn, you win 500 credits with probability p = 0.4 or lose 500 credits with probability 1 − p = 0.6. Suppose further that you adopt the rule that you quit playing if your fortune reaches N credits (of course, if your fortune reaches 0 credits, then that Volus manager is gonna kick you out, but hear me out here).

Quasar gambling machines at The Flux


Let X_n be the amount of credits you have after n plays. Your fortune, X_n, has the Markov property described above. To check this for the gambler’s ruin chain, note that if you’re still playing at time n, i.e. your fortune X_n = i with 0 < i < N, then for any possible history of your wealth i_{n−1}, i_{n−2}, …, i_1, i_0,

P(X_{n+1} = i + 500 | X_n = i, X_{n−1} = i_{n−1}, …, X_0 = i_0) = 0.4

since to increase your wealth by one increment (that is, another 500 credits) you have to win your next bet.
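To see the chain in action, here’s a quick Monte Carlo sketch of this quasar game (my own illustration; the p = 0.4 and 500-credit figures come from the setup above, while the starting fortune of 1000 credits is my choice):

```python
import random

def play_quasar(start=1000, goal=2500, bet=500, p_win=0.4):
    """Play until fortune hits 0 (ruin) or the goal; return final fortune."""
    fortune = start
    while 0 < fortune < goal:
        # One step of the chain: only the current fortune matters.
        fortune += bet if random.random() < p_win else -bet
    return fortune

random.seed(0)
trials = 100_000
wins = sum(play_quasar() == 2500 for _ in range(trials))
print(f"P(reach 2500 before ruin) ≈ {wins / trials:.3f}")
```

For comparison, the classical gambler’s-ruin formula gives (1 − 1.5²)/(1 − 1.5⁵) ≈ 0.19 for this starting fortune, so the simulated estimate should land close to that.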

Now, a Markov chain has a specified transition matrix, p(i, j), if for any j, i, i_{n−1}, …, i_0,

P(X_{n+1} = j | X_n = i, X_{n−1} = i_{n−1}, …, X_0 = i_0) = p(i, j)

This example is very simplified, as the transition probability does not depend on the time n. Intuitively, we can sort of surmise that the transition probabilities give the rules of the game; they’re the basic information needed to describe a Markov chain. In this particular example of high-stakes quasar, the transition probabilities are:

p(i, i + 500) = 0.4 and p(i, i − 500) = 0.6, for 0 < i < N
p(0, 0) = 1 and p(N, N) = 1

When N = 2500 (that is, 5 bets of 500 credits) the matrix is:

        0     500   1000  1500  2000  2500
0       1.0   0     0     0     0     0
500     0.6   0     0.4   0     0     0
1000    0     0.6   0     0.4   0     0
1500    0     0     0.6   0     0.4   0
2000    0     0     0     0.6   0     0.4
2500    0     0     0     0     0     1.0
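Here’s a sketch (my own) that encodes this chain in Python; iterating the one-step matrix gives the n-step transition probabilities, and after many steps the chain is almost surely absorbed at 0 (ruin) or 2500 (walking away a winner):

```python
# States are the fortunes {0, 500, ..., 2500}, indexed 0..5.
# Rows 0 and 5 are absorbing: once there, you stay there.
P = [
    [1.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.6, 0.0, 0.4, 0.0, 0.0, 0.0],
    [0.0, 0.6, 0.0, 0.4, 0.0, 0.0],
    [0.0, 0.0, 0.6, 0.0, 0.4, 0.0],
    [0.0, 0.0, 0.0, 0.6, 0.0, 0.4],
    [0.0, 0.0, 0.0, 0.0, 0.0, 1.0],
]

def matmul(A, B):
    """Multiply two square matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

# Repeated multiplication approximates the long-run absorption probabilities.
Pn = P
for _ in range(200):
    Pn = matmul(Pn, P)

start = 2  # begin with 1000 credits
print(f"P(ruin)      ≈ {Pn[start][0]:.3f}")
print(f"P(walk away) ≈ {Pn[start][5]:.3f}")
```

Starting from 1000 credits, the house edge bites hard: the chain lands in the ruin state roughly four times out of five.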

So now you might be wondering: what do a game of high-stakes quasar, a cheating Salarian, and EDI’s preference mechanism have in common? Let’s say we have a Markov chain where each state, i, is associated with a reward, r_i. As the Markov chain proceeds from state to state, there is an associated sequence of rewards that are not independent of one another, but rather are related through the statistics of the Markov chain. The concept of rewards in each state (which are actually more associated with the transitions between states, though whose results are seen in each consequent state) is what allows EDI to make the decisions she does.

Within EDI’s software, there is something of a “decision maker” or “control” that analyzes and modifies both the transition probabilities and the rewards expressed at each state of the Markov chain. This decision-making algorithm attempts to maximize the expected reward, but it is typically faced with a compromise between immediate rewards and long-term rewards (which arise from choosing transition probabilities that lead to high-reward states). This is the basis behind EDI’s acquired “preferences”.

Ok, cool. So what if there is more than one reward to choose between? That is, instead of a single r_i for each state i, there is a choice among some number K_i of different rewards, r_i^(1), r_i^(2), …, r_i^(K_i), and a corresponding choice among K_i different sets of transition probabilities, {p_ij^(1); 1 ≤ j ≤ M}, {p_ij^(2); 1 ≤ j ≤ M}, …, {p_ij^(K_i); 1 ≤ j ≤ M}, where M is the number of states. At each time n, EDI’s control, given X_n = i, selects one of these K_i possible choices for state i. Seems easy enough. Now, the set of rules that EDI’s control uses to select an alternative at each time is called a policy. Basically, EDI considers the expected aggregate reward over m steps of the Markov chain as a function of the policy used by her control. Here’s an idea of what to picture:


A Markov decision problem in which the control can choose between two possible decisions in state 2 (K2 = 2), and has no freedom of choice in state 1 (K1 = 1). This situation depicts the familiar tradeoff between instant gratification (alternative 2, lower graph) and long-term gratification (alternative 1, upper graph).

Now, what are these rules? How does a policy come about? Is there a way to optimize among policies in order to maximize rewards? The first two questions are pretty difficult to answer, and depend more or less on how EDI “ranks” her preferences. The optimization falls to what we call shortest-path problems; as the name suggests, we essentially decrease the number of states the system passes through in order to reach the desired state as quickly as possible. It’s analogous to how we, as humans, tend to perform activities that require the least time and effort, in order to expend less energy. That is: we are naturally lazy beings, and so is EDI.
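To tie the policy idea together, here’s a toy finite-horizon value iteration over a two-state decision problem in the spirit of the figure above. Every reward and probability here is invented for illustration; alternative 1 in state 1 is the “long-term gratification” option and alternative 2 the “instant gratification” one:

```python
# Hypothetical two-state Markov decision problem (all numbers invented).
# Each alternative is (reward, [P(next = 0), P(next = 1)]).
choices = [
    [(2.0, [0.9, 0.1])],   # state 0: no choice, steady payoff
    [(0.0, [1.0, 0.0]),    # state 1, alt 1: forgo reward, head for state 0
     (1.0, [0.0, 1.0])],   # state 1, alt 2: grab 1 now, stay put
]

def value_iteration(m):
    """Maximize expected aggregate reward over m steps; return values & policy."""
    v = [0.0, 0.0]
    policy = [0, 0]
    for _ in range(m):
        new_v = []
        for i, alts in enumerate(choices):
            # Pick the alternative maximizing reward-to-go from state i.
            best = max(range(len(alts)),
                       key=lambda k: alts[k][0]
                       + sum(p * vj for p, vj in zip(alts[k][1], v)))
            policy[i] = best
            r, probs = alts[best]
            new_v.append(r + sum(p * vj for p, vj in zip(probs, v)))
        v = new_v
    return v, policy

# With a 1-step horizon, the control grabs the instant reward in state 1;
# with a long horizon, it pays to head for the high-reward state 0.
_, short = value_iteration(1)
_, long_ = value_iteration(10)
print(short[1], long_[1])  # chosen index in state 1: 1 (instant), then 0 (long-term)
```

The switch in policy as the horizon m grows is exactly the instant-versus-long-term tradeoff the figure depicts.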


There, there EDI. Being lazy isn’t so bad.

There is more to be said about policy optimization and dynamic programming, but I’d like to make that a separate post, which will end up being much more rigorous. Hopefully this gives you a better idea of how EDI (and perhaps other synthetics in the ME universe) operates, and a bit of background on the plethora of science touched upon in the game.

Like always, thanks for reading!