Using Python to Perform Lexical Analysis on a Short Story

One of the more interesting things going on in the Big Data world right now is some of the quantitative analysis being done on how people use the English language. This is widely known as Natural Language Processing—“NLP”—a field that can be thought of as a fusion between artificial intelligence and linguistics.

A useful framework for getting started in NLP is the Natural Language Toolkit—“NLTK”—a Python library full of tools and approaches for performing basic NLP. It also comes with a free guidebook of example code and explanations that can help anybody get started with only a basic knowledge of programming.

NLP is interesting to me because it opens the door for anybody with some computer skills to perform a large-scale analysis on language of any kind. I’ve written about how I enjoy following the media industry because it’s so central to how people communicate with each other, and my interest in language comes from that same place. Harnessing the power of data analysis on language could be used to discover patterns in language that we hadn’t considered before.

There’s a lot of room for creativity here. An interesting application of this might be a project that collects political news articles from two different newspapers over the course of a month, and then, using various NLP statistics, compares and contrasts the two newspapers. Do the two newspapers use certain adjectives differently? What does that tell us about the “slant” of that newspaper? There are lots of interesting questions you could ponder.

I decided to pursue a small creative NLP project just for the hell of it. The idea was to perform basic NLP on a 2,955-word short story I wrote in a Creative Writing class last semester, and to just get a feel for what a basic exploratory analysis might look like. The short story is titled Ron Rockwell. I am going to include the Python code I wrote at each step with some added commentary, so you can follow along and reproduce the study if you want. To download all the code I used at once, just see the script I have on GitHub.

Before doing anything else, I need to import the external libraries we’ll use for this exercise, which I do by running the following code.
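Something like the following would do the trick (assuming you’ve installed NLTK and matplotlib, e.g. with pip):

```python
# External libraries for this exercise. nltk and matplotlib are
# third-party (pip install nltk matplotlib); the rest is standard library.
import csv                       # exporting results at the end
import re                        # simple tokenizing
import nltk                      # the Natural Language Toolkit
import matplotlib.pyplot as plt  # plotting frequency distributions
```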

I wrote my short story in Microsoft Word (download link). To convert it to a text file, I copied and pasted the story from Word into a .txt file in Notepad++, then manually deleted any strange characters that appeared in translation. Now we have an easy-to-work-with format: the entirety of Ron Rockwell in a plain text file (download link). Let’s open it with Python.
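A sketch of that step, wrapped in a small helper (the filename in the comment is just a stand-in for whatever you named your text file):

```python
def load_story(path):
    """Read the story's text file and return its contents as one string."""
    with open(path) as f:
        return f.read()

# 'ron_rockwell.txt' is a stand-in for the actual filename:
# raw = load_story('ron_rockwell.txt')
```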

Next, I’ll write a function that further preprocesses the text. The output is an object containing all the words found in Ron Rockwell as word “tokens”.
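Here’s a rough version of that function. I’m using a simple regex as a stand-in for NLTK’s fancier tokenizers; splitting on non-letters is consistent with contractions showing up in the output as pieces like “don” and “t”:

```python
import re

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens.

    Splitting on anything that isn't a letter is why contractions
    like "don't" come out as two tokens: 'don' and 't'.
    """
    return re.findall(r"[a-z]+", text.lower())

# tokens = tokenize(raw)
# tokens[:20]
```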

['\xef', 'that', 's', 'it', 'for', 'today', 'remember', 'homework', 'is', 'due', 'next', 'class', 'and', 'don', 't', 'forget', 'we', 'have', 'a', 'test']

The output above displays the first 20 of our tokens.

Let’s define a short function to compute an introductory metric for our story. Lexical Diversity is the ratio of unique words to the total number of words in the story.
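A minimal version of that function:

```python
def lexical_diversity(tokens):
    """Ratio of unique tokens to total tokens (between 0 and 1)."""
    return len(set(tokens)) / len(tokens)

# lexical_diversity(tokens)
```

The sanity check that follows is just `len(tokens)` for the total and `len(set(tokens))` for the unique count.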

0.3222842139809649

Let’s check our math. How many tokens are there, total?

3047

How many unique tokens are there in that set?

982

982/3047 is equal to a .32 Lexical Diversity score.

NLTK’s FreqDist() function turns our list of tokens into a Frequency Distribution object, giving us the frequencies of all tokens in my story.
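The call itself is a one-liner; here’s a sketch with a tiny stand-in token list so the snippet runs on its own. FreqDist behaves like a dict of token → count, and its `.most_common(n)` method is what produces the top-n lists later on:

```python
from nltk import FreqDist

# `tokens` here is a tiny stand-in; in the real script it would be
# the full token list built from the story.
tokens = ['the', 'cat', 'sat', 'on', 'the', 'mat']
fdist = FreqDist(tokens)
# fdist is dict-like: fdist['the'] == 2
# fdist.most_common(20) returns the top-20 (token, count) pairs
```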

FreqDist: 'the': 108, 'and': 105, 'to': 98, 'a': 82, 'of': 69, 'ron': 61, 'in': 56, 'i': 47, 'was': 47, 'his': 45, ...

We can plot this frequency distribution:
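Something like this would do it, using FreqDist’s built-in plotting helper (the choice of 30 tokens is a guess at how many the original chart showed):

```python
import matplotlib  # FreqDist.plot uses matplotlib under the hood
from nltk import FreqDist

def plot_frequencies(fdist, n=30):
    """Plot the n most common tokens with FreqDist's built-in helper.

    n=30 is a guess at how many tokens the original chart showed.
    """
    fdist.plot(n)

# plot_frequencies(fdist)
```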

[Figure 1: frequency plot of the most common tokens]

Not many of those words are useful for analysis. “The”, “and”, “to”, “a”, and “of” are all used in the English language to provide basic sentence structure. We want to filter out these kinds of words and focus only on the words that give us real information about the story. These filler words are called “stopwords”, and we are going to eliminate them from our list of word tokens by importing a pre-made list of stopwords.
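A sketch of the filtering step. The helper takes the stopword list as an argument; with NLTK, the pre-made English list comes from `nltk.corpus.stopwords.words('english')` (after a one-time `nltk.download('stopwords')`):

```python
def remove_stopwords(tokens, stops):
    """Return only the tokens that are not in the given stopword collection."""
    stops = set(stops)  # set membership tests are fast
    return [t for t in tokens if t not in stops]

# With NLTK's pre-made list:
# from nltk.corpus import stopwords
# clean_tokens = remove_stopwords(tokens, stopwords.words('english'))
```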

['\xef',
'today',
'remember',
'homework',
'due',
'next',
'class',
'forget',
'test',
'friday',
'everything',
've',
'learned',
'chain',
'rule',
'll',
'office',
'wednesday',
'twelve',
'two']

How many total words are left?

1542

How many unique words made the cut?

876

Let’s turn our stopword-free list of tokens into a Frequency Distribution object.


FreqDist: 'ron': 61, 'molly': 25, 'calculus': 16, 'students': 15, 'class': 13, 'time': 13, 'mathematics': 10, 'really': 10, 'wasn': 10, 'know': 9, ...

Now we’re getting somewhere. Let’s see a list of the top 20 most common non-stopwords in my short story.


[('ron', 61),
('molly', 25),
('calculus', 16),
('students', 15),
('class', 13),
('time', 13),
('mathematics', 10),
('really', 10),
('wasn', 10),
('know', 9),
('life', 9),
('office', 9),
('professor', 9),
('took', 9),
('college', 8),
('wanted', 8),
('felt', 7),
('get', 7),
('like', 7),
('things', 7)]

This makes a lot more sense. The top two most common words are the names of the two main characters, Ron and Molly. The third most common word, “calculus”, describes how those characters are connected: Ron is the professor of Molly’s calculus class. The next four most common words, “students”, “class”, “time”, and “mathematics” seem to make sense in this context.

We can also plot this frequency distribution:

[Figure 2: frequency plot of the most common non-stopwords]

OK, that’s pretty neat. Now let’s take a measure of how rich our vocabulary was with the stopwords removed.

0.5680933852140078

That’s higher than the .32 score we saw earlier. Removing all those common stopwords like “a” and “the” is responsible for this increase.

OK, so in order to save some of this data for later (in this case, the top 20 most common non-stopwords), we’ll turn it into a plain list.

Finally, we can take this data and export it as a .csv file for further analysis in R or Excel.
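A sketch of the export step (the header row and the helper’s name are my own choices, not necessarily what the original script did):

```python
import csv

def write_counts_csv(pairs, path):
    """Write (token, count) pairs out as a two-column .csv file."""
    with open(path, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['token', 'count'])  # header row (my choice)
        writer.writerows(pairs)

# write_counts_csv(fdist.most_common(20), 'newfile.csv')
```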

You can download the files here as proof that this worked:

tokens.csv
newfile.csv

So that’s that. Obviously I didn’t do anything groundbreaking, but I did set up a platform for processing language files into tokens for NLP analysis. And we discovered that, after removing the stopwords, the three most common words were the names of the story’s two main characters (“Ron” and “Molly”), and the primary characteristic they have in common (“calculus”). SPOILER ALERT: the short story is about Ron, Molly, and their relationship.

So, while I haven’t done much more than rookie-level NLP here, I think this kind of analysis can deliver some cool insights if it gets applied to an interesting dataset.

Facebook, Social Media, and The News Industry

EDITED TO ADD: As of May 13, 2015, Facebook has started testing this feature on their website.

Every Friday in our Strategic Management class, students from the class give a short presentation about their take on a business-related article they saw in the news recently. I like this part of the class because it’s a chance for students to share something that interests them, as opposed to other school presentations where, in most cases, you have to present on something you don’t care about.

Anyway, I did my presentation on the NYT report that Facebook is in talks with content publishers, including BuzzFeed, The New York Times, and National Geographic (among others) to agree on a deal where these content publishers would host their content inside of Facebook’s ecosystem in exchange for a cut of the advertising revenues from Facebook.

The media industry is at the heart of information flow and communication; essentially, media is the “first draft of history”, and it shapes our perception of the modern world. I find the industry to be especially interesting in today’s world since the internet has changed so much of it, so quickly. When I was younger, people still read newspapers and magazines. A typical citizen’s news diet might have consisted of just The New York Times (a newspaper), CNN (a TV network), and The New Yorker (an upscale magazine).

Today, that entire paradigm is almost completely gone. Easy internet access has shattered the barrier to producing and sharing content. Fred from North Dakota could write a blog post and get 100k page views if enough people found his take to be worth sharing. Social media—in particular, Facebook—has completely changed the way people consume journalism and other content. No longer do we rely on a small number of content publishers to deliver us everything we need to know. Instead, we read what our friends and colleagues are sharing, and then share the content that we find interesting.

Sharing is the key, here. For the longest time, newspapers like The New York Times were making enough money from advertising such that they didn’t need to worry as much about what their readers wanted to see. They were the gatekeepers of information: the editors decided to report and focus on whatever they thought to be newsworthy. They’d certainly be ashamed to print silly low-content articles like “18 Cats Who Need To Get A Grip”. But things have changed. Now, more and more, the people are deciding what is newsworthy by sharing content with their social networks.

There are a couple of interesting ways to look at this. The first is the social theorist/political perspective. Is it, generally, a good thing that the people are deciding what is newsworthy? Should the masses have so much responsibility? If you believe in the power of our collective consciousness and are against censorship, then you might think this is a good thing. But there’s an argument to be made that leaving newsworthy stories to the discretion of the people could eliminate the incentive for important kinds of investigative reporting, such as Watergate-type reporting on political scandal. Sometimes, the medicine tastes bad, but you still need to take it.

The other way is from a business perspective, where the primary assumption has to be this: we can’t change what’s happening, so we have to adjust to it. And right now, there’s no sign that social media is going to lose influence on content distribution in the near future. Which is the real reason why these companies are “giving in” to the power of Facebook by letting them host their content: they have no other choice. Not doing so is an existential threat. If on one afternoon, Facebook’s algorithm decides to bury content from The New York Times, The Times’ core business would immediately be in major danger.

So it’s going to be interesting to see what happens to journalism now that social media has won this battle. Will the public become more conscious of the so-called “important things” like government, politics, science, and current events? Or will celebrity gossip, sports talk, and other low-content stories become even more pervasive than they already are?

These are interesting questions to think about while looking at the media industry in the coming years.

I’ll leave you to ponder EPIC: 2014, a short video made by Robin Sloan in 2004. It looks at how the internet changed the world in its early years, and then what might happen in the next ten. It didn’t quite nail the details, but the big-picture predictions are almost startlingly accurate.

Credit to Ben Thompson over at Stratechery, who has influenced my thinking on this subject. His analysis of this topic was my inspiration for choosing it in the first place.

Be Specific, Be Precise

Try and think of a person in your life that really pisses you off.

Maybe it’s somebody at work. Maybe it’s some douchebag in your fraternity. Or maybe it’s some guy named Dave who spews totally ridiculous political opinions like they’re gospel.

Now, think about why you can’t stand this person. But actually *really* think about it. Each time you think of a reason, try to challenge yourself and see if you can get more specific. Let’s call this “The ‘Being Specific’ Game” (clever, I know).

Anyway, let’s play the game and try to understand why we disagree with Dave.

This, for example, would be a bad way to disagree:

1. “I can’t stand Dave because he’s a stupid liberal.”

This is better:

2. “Dave posts a lot of things on Twitter about stopping war and criticizing America’s anti-terror programs. I don’t agree with how he seems to be so oblivious to danger.”

However, we can still improve that. How about:

3. “I can see where Dave is coming from when he criticizes war and anti-terror programs. He probably feels bad about all of the death and destruction that comes with war. But several independent studies have shown that our tactics have had [insert concrete mathematical results], and I think that these benefits outweigh the costs of our war and anti-terror initiatives.”

Let’s look at what’s going on in each of these examples.

The first example resorts to vicious name-calling without offering any insight to the discussion. Even if Dave is in the wrong, there’s no attempt at understanding or compromise. Instead, we go on the offensive by calling him a “stupid liberal”. The sad part about this situation is that it signals to Dave that we’re completely disregarding the possibility that he has any redeeming human qualities simply because we disagree with him.

The second example digs a little deeper. It identifies and communicates the source of disagreement, which has to do with the level of danger (from terrorists, presumably) that each person perceives. What this example fails to do is provide any evidence or reasoning to address that point. Instead, we assume that we’re correct, which is why we can’t understand why Dave might feel differently.

The third example goes a few steps further. It begins with an attempt to see things from Dave’s perspective, which shows compassion and sets the tone for a serious discussion without making things personal. Then it counters the central point of contention by citing precise facts and figures that support the conclusion. And there’s nothing long-winded about taking this approach; in our example, it only took three sentences of effort.

This is all basic stuff that most people learned in school. Remember being told, over and over, to “be specific” while answering essay prompts?

However, this never became fully clear to me until I was shown Paul Graham’s hierarchy of disagreement:

Paul Graham's Hierarchy of Disagreement

This chart demonstrates that there are tons of different ways to disagree with people, and each method signals something completely different to your audience. If your goal is to be a dick, then maybe name-calling is the way to go. If you want to be a shifty politician when you know your position is weak, then maybe you’ll use charismatic contradiction to convince people to vote for you.

But let’s play the game again. Why do you want to be a dick? “Because I can’t stand that person.” Why can’t you stand that person? “Because his opinions are stupid.” Why are his opinions stupid?

(…you probably see where this is going, especially if you’ve ever been through any kind of therapy.)

There are two reasons why I wanted to share my thoughts on this topic:

1. Exploring your feelings and reasons for believing certain things is an avenue towards personal growth and connection with other people. Disagreement doesn’t have to come with hate, anger, and (in some cases) violence. Disagreement can lead to understanding, which is a step towards peace and away from violence. This is the position taken by people like Gandhi and Martin Luther King, Jr.

2. Letting people get away with vague and imprecise reasoning can have dangerous consequences. This happens a lot in politics. Think about all of the increased measures of mass surveillance and “enhanced interrogation” we’ve justified in the name of “national security”, without any evidence to show that these methods are effective at achieving their goals. “State secrets” are really just a way for people in power to avoid being specific about their actions. Is that because being specific about their actions would spark outrage? It’s hard to say.

We need more educated and civil discussion about the important issues of our time. The rise of mathematics and data analysis in discourse makes me hopeful for the future, simply because these are more precise methods of communication than we’ve used in the past. But more people need to buy into this attitude. That means demanding that people be specific and precise about their thoughts and feelings, starting with yourself.

And the only way to spread the message is to keep tweeting, keep writing and keep talking.

(Just remember the rules of rocking the boat.)

Ultra 2015 Trip Report

The Main Stage at Ultra 2015

It’s funny—each time you go to a music festival, your tastes change. This was my third Ultra in as many years, which made me the veteran among my group of friends. Most of my group was going for the second time, and one guy was making his rookie appearance.

The majority of my group was primarily concerned with posting up at the main stage, and who can blame them? The big names and the light show are hard to pass up—after all, it’s the safest choice, and nobody is going to give you a hard time for making that choice. But just like with everything else in life, taking intelligent risks comes with the highest potential reward, and this was my approach to Ultra 2015.

FRIDAY – DAY 1

It was sunny and hot as we made our way through security and entered the gates a bit after 5 PM. Our crew went directly to the Main Stage to catch the end of 3LAU’s performance, followed by future house hotshot Oliver Heldens (who is almost two years younger than me, by the way). I was pleasantly surprised by both of these sets. I was drinking beer, the sun was shining and it seemed like it was going to be a great first day.

However, late into Heldens’ set, dark clouds rolled in and the Florida palms started blowing. I checked my phone to discover a severe weather warning for hail and heavy rain for the next 45 minutes. I told my friends the story and we sprinted over to the Megastructure in hopes of finding a safe haven before the rain started. Fortunately for us, we arrived a few minutes before the torrential downpour and stayed mostly dry as we enjoyed some techno beats from Adam Beyer and Ida Engberg for about 45 minutes.

After the skies seemed to clear up, we made our way back to the Main Stage to see the last part of Dash Berlin, who had the crowd dancing to Darude’s classic track Sandstorm. Just before sunset, Afrojack took over the decks, and the Main Stage light show took over the sky. It seemed like it was going to be a great night…

…but only a few tracks into the set, the wind picked up again, and I checked my phone to see a massive green spot approaching South Florida on the radar map. One of my buddies and I decided to beat it and run for cover again, while the rest of our group took the risk and kept their spot at the Main Stage.

Sure enough, the rain started up again, and it continued to pour for about two and a half hours. Any part of the festival with the slightest bit of cover turned into a crowded, sweaty mess. I decided to get a $12 mega-burrito, because I didn’t feel like being both wet and hungry.

The rain stopped around 10:15, right as GTA’s set started at the Worldwide Stage where I had been seeking cover. He injected some energy into the crowd with his signature trap-style music, and after dealing with miserable weather for hours, people seemed excited to dance themselves dry. I know I was.

GTA at the Worldwide Stage

Around 11, our group met back up at the Megastructure to check out the legendary Carl Cox. The Megastructure got an upgrade this year; there’s a new state-of-the-art laser light system, and the back and side walls have a lot more surface area covered by screen. The outstanding visuals allow you to totally immerse yourself in the audio-visual experience, and that’s what we did for 45 minutes as I drifted into another world—until we left to beat the exit rush.

Carl Cox

SATURDAY – DAY 2

After heading to Denny’s for a major-league lunch, we hit the Metro and got into the festival by close to 4 PM. Our group again started out at the Main Stage, and we saw performances from W&W and Fedde Le Grand. I wasn’t particularly impressed with either of them, and my displeasure ended up on Twitter:

I had originally planned on going over to Stage 7 at 8:15 to catch a performance from the leader of the Pure Trance movement, Solarstone. However, after getting bored of the predictable big-room acts on the Main Stage, I made my move at 6:10 and told my friends to text me a little later. I came to the festival to enjoy good music, so I made the executive decision to fly solo and see the acts I really wanted to see.

It was a fantastic decision.

Stage 7 was probably the smallest stage in the entire festival; it was built on an elevated platform above the bathrooms, so Ultra kept a limit on the number of people on the stage at any given time. I believe the exact number was 227 guests, so anytime there were 227 people on Stage 7, you had to wait in a line that wouldn’t move until other people left the stage.

It’s a good thing I discovered this at 6:15, as opposed to later in the night.

Anyway, I grabbed a Red Bull, mixed it with the vodka pouches I smuggled in, and waited in line for a few minutes before I got on to catch the start of Sied van Riel’s set. Sied is a trance artist who I’d heard good things about, though I’d never gotten around to experiencing much of his music. But boy oh boy, was I impressed with the up-tempo and uplifting sounds from this man, and I knew I had made a great decision.

Next up was Jochen Miller, but I needed to use the restroom and get more drinks because I didn’t want to be waiting in line for Solarstone. After doing both of those things, I got back up to hear the second half of Miller’s big-beat trance/house music as the sun went down and the beautiful Miami skyline lit up.

View from Stage 7

Both Solarstone and his protégé Giuseppe Ottaviani absolutely hit it out of the park, just as I had hoped they would. Solarstone opened with his remix of London Grammar’s “Wasting my Young Years”, just as he did for the last trance show I attended (Future Sound of Egypt: 350 in New York, NY). I had forgotten just how much fun it is to lose yourself in the lights and the music at 138 beats per minute.

Stage 7

I met up with my buddy and his girlfriend towards the end of Giuseppe’s set and went over to the Live Stage to get a good spot for Bassnectar. I had heard great things about Bassnectar’s live sets, even though his hard-charging music normally isn’t exactly my cup of tea. Bassnectar certainly delivered the goods, dropping bomb after bomb, including Seven Lions’ “Nepenthe”, which is a personal favorite of mine that I never expected to hear at a show.

All in all, Day 2 was a great day filled with great music, even if I had to do most of it by myself.

Bassnectar

SUNDAY – DAY 3

Sunday at Ultra means A State of Trance in the Megastructure.

Once again, our crew loaded up at Denny’s for a pre-festival meal. This time, I splurged an extra few bucks on a Philly Cheesesteak omelet. This was a big day, and I needed a big meal to start it off the right way.

We got to the festival around 2:30ish, and there were noticeably fewer people in line than usual. I guess everybody was struggling to get started after two days of hard partying. However, this resulted in noticeably tougher body searches upon entrance to the festival. Fortunately, I wasn’t doing drugs for this festival, and I probably wouldn’t have gotten caught even if I was, but I thought it was interesting that they were giving the most thorough inspections on the final day of the festival.

So, we got inside and immediately split up—my Miami friends went to the Main Stage, while I went with my friends from out of town to take a walk and explore the festival.

Sunday was a gorgeous day, with clear sunny skies and a perfect temperature hovering around 76 degrees. We walked out behind the underground-themed Resistance Stage towards the bay side of the park. Apparently, this is where all the police were hanging out while off duty. I guess that this is where they would take you if you got arrested at the festival.

We enjoyed a beer next to Biscayne Bay, as we checked out the cruise ships hanging out in the Port of Miami.

Chilling by Biscayne Bay

We walked around to the other side of the festival towards the Live Stage and the UMF Radio Stage. I got a nice shot of the Main Stage without much of a crowd.

Mainstage

Over by UMF Radio, there was a guy doing some painting as the festival raged on.

Painting at a Music Festival

We parked ourselves by the Live Stage for a little to enjoy the weather. I made another homemade Vodka RedBull, while my buddy went to the merchandise store to grab some souvenirs.

Gorgeous Sunday in Miami

Around 4, we went down to the Megastructure, where we would be parked for about the next six hours. Andrew Rayel was finishing up, so I grabbed some more beers to accelerate the drinking pace.

Miami native Markus Schulz was next on the decks, and he brought some big-room trance/house to truly get the party started.

A few drinks later, Andrew Bayer brought his AnjunaBeats vibe to the Megastructure, to set the stage for the most anticipated act of the day…

…ERIC PRYDZ

Eric Prydz

What’s special about Prydz is that his sound is unique to the point where you could classify all of his progressive house and techno music into its own genre. And while most DJs play tracks from a variety of artists, Prydz will play sets composed entirely of his own unreleased and unnamed tracks. In the Megastructure, with the lasers and the sunset…it was perfect.

I could try to write more about how special these 90 minutes were, but you should just listen for yourself.

Mr. Armin van Buuren followed Mr. Prydz.

Back when I was a newbie to electronic music, Armin was my favorite artist. I saw him perform for the first time two years ago in the same exact spot, and since then I’ve seen him three more times, including one performance as Gaia. I have to admit that lately I’ve become disillusioned with Armin. He seemed to be moving more and more towards the pop-EDM sound, and away from the beautiful trance sound that made him so famous to begin with. So I was definitely skeptical of whether or not this set would be worth staying for. I actually missed the first 10 minutes to make one final beer run.

Armin van Buuren

But Trance God Armin decided to show up this time, and it made my day. He brought the heat right out of the gates, and progressed into an up-tempo, 138 BPM finish, including a play of what I believe to be the track of the year, Will Atkinson’s “Numb the Pain”.

After Armin’s set, our group met up at the Live Stage to catch the first half hour of Kygo’s tropical house performance, which was a nice break after soaring at high speed with Armin. Then we made our way to the Main Stage to catch the last half hour of Skrillex and his clown show, including cameos from Diplo, Puff Daddy, and Justin Bieber.

TOP SETS
1. Eric Prydz – Sunday, ASOT: 700, Megastructure
2. Armin van Buuren – Sunday, ASOT: 700, Megastructure
3. Giuseppe Ottaviani – Saturday, Stage 7
4. Sied van Riel – Saturday, Stage 7
5. Bassnectar – Saturday, Live Stage

HONORABLE MENTION
Solarstone – Saturday, Stage 7
GTA – Friday, Worldwide Stage
Markus Schulz – Sunday, ASOT: 700, Megastructure
Oliver Heldens – Friday, Main Stage

Rocking The Boat

One of the great paradoxes of human society is that nobody seems to doubt themselves, even though pretty much nobody has any idea what the hell they’re doing. This phenomenon is eloquently postulated as the Dunning-Kruger Effect:

“The Dunning–Kruger effect is a cognitive bias wherein unskilled individuals suffer from illusory superiority, mistakenly assessing their ability to be much higher than is accurate. This bias is attributed to a metacognitive inability of the unskilled to recognize their ineptitude” — Wikipedia

Wait, what was that?

Basically, it takes some level of competence for a person to realize that the extent of their knowledge has a limit.

“Real knowledge is to know the extent of one’s ignorance” – Confucius

Suddenly, it makes sense. The more you know, the more you realize how much you don’t know. Doubt is a predictor of knowledge and understanding. Confidence is for fools. What a concept!

“One of the painful things about our time is that those who feel certainty are stupid, and those with any imagination and understanding are filled with doubt and indecision.” – Bertrand Russell

The problem with this phenomenon is that, naturally, nobody wants to believe it. Human society has a tendency to want it to be one way, when in reality, it’s the other way. People can’t deal with uncertainty and nuance. We’d rather believe something completely ridiculous than accept the possibility that we’re simply incapable of comprehending it. I can think of no better example of Dunning-Kruger manifesting on a macroscopic scale than the existence of organized religion, which is simply a fairy tale that helps people deal with their mortality. But dogma isn’t limited to religion; it exists almost everywhere you look.

So what does this all mean?

First of all, it demonstrates the existence of opportunities to do things better. Just because we’re doing something a certain way doesn’t mean that it’s the optimal way of doing it. People generally take the path of least resistance and do things that are easiest for them. In business situations, there’s usually a ton of low-hanging fruit in terms of cutting costs, increasing sales, and making processes more efficient. You just have to do a little work in order to find the answer.

But knowledge alone isn’t enough to drive change. Most things worth doing involve groups of people—including stupid, ignorant people who suffer from illusory superiority. That means you have to convince these people that your way is the right way. And it’s not easy to get these people to question themselves and think about being wrong, as we’ve just seen.

So how do you get people to trust you and your super-special ideas? What’s the key to making shit happen?

(You actually think that I know the answer?)

Politics is a tough game. Play it too conservatively and you’ll never get your way. But rock the boat too much (relative to your level of power and status) and the group will chew you up and spit you out, because they don’t give a shit about you or your crazy ideas. Clawing your way through group dynamics is very much an art that can’t be solved with a mathematical model or a proof.

And so this is where you should spend most of your time refining your skills if you want to drive change in your organization. Understanding the problem and being enlightened is difficult and takes effort, but it’s less than half the battle. Influencing groups of people has been, and always will be, the key to getting things to go your way. Being right doesn’t entitle you to anything, no matter how frustrating and unfair that might be.

So you have to rock the boat.

Just be careful to not get thrown overboard.

America’s Royal Families

Recently, I fired a few hot political takes into the Twittersphere:

It seems that the 2016 presidential election news cycle is well underway, and I’ve been feeling a little uneasy about the fact that the early favorites for the Democratic and Republican nominations are Hillary Clinton and Jeb Bush.

While I have qualms with both candidates’ official stances on various issues, I’m more concerned about the emergence of what are clearly, at this point, political dynasties in the United States.

As most people know, Jeb Bush is the brother of former President George W. Bush, and the son of former President George H.W. Bush (who served eight years as Vice President under Ronald Reagan). Hillary Clinton is the wife of former President Bill Clinton. She also narrowly lost the 2008 Democratic nomination to President Barack Obama, and served as Secretary of State for four years in the Obama Administration.

Those facts should be enough to raise some eyebrows.

Here are more fun facts: the Republicans have not won a presidential election without a Bush on the ticket since 1972, when Richard Nixon won the presidency. The Republicans have not won a presidential election without a Bush or Nixon on the ticket since 1928. While I do study applied math, it doesn’t take much sophisticated analysis to detect a pattern here.

To be clear: this post is not a criticism of the Clinton family or the Bush family, but rather a criticism of political dynasties, the current political environment in the United States, and partisan politics.

Which is why it’s so alarming to me that we are staring at another Bush vs. Clinton race. As an analyst, my first instinct when I see a pattern like this is to ask questions and get to the bottom of it. What is it about these two families that makes them such attractive executive candidates? Is it something genetic? Do they know the right people? Are they simply the most genuine and sincere families in America? What exactly makes them more “qualified” than any of the other highly educated, successful, high-profile Americans out there?

(Maybe the question we should be focusing on is this: What makes somebody qualified to be President of the United States in the first place?)

“Okay Steve, but who would you rather put in office? You’re really good at telling me who we shouldn’t elect, but give me some examples of people we should elect!”

I agree that it’s easier to shoot somebody down than it is to actually come up with a reasonable solution on your own. Though none of these people are running for president, I’d like to hear some of the reasons why we should elect a Bush or a Clinton instead of successful modern business executives like Paul Allen, Tim Cook, or Larry Page.

My big fear with these dynasties is that they only strengthen the emotions that drive partisan politics. The whole idea behind our democratic republic is that we give a vote to every citizen because that’s the fairest way to elect the candidates that represent our interests. Power is balanced and decentralized with checks, balances, and elections. And an important assumption behind it all is that citizens will vote for the candidates that best represent the country’s political views and interests.

But partisan politics makes it such that people aren’t voting on the issues as much as they are voting for their political party (we’d be better off calling them “political teams”). And political dynasties only make it more personal. Instead of rooting for a party, now you’re rooting for a family—a family that *probably* prioritizes their own interests over those of the voting population.

Nothing screams “centralization of power” like the existence of royal families. America was born out of contempt for the king. It’s eerie to see it end up like this.

(But it’s not too late.)

Using R & ggplot2 For My Research Project

(This is my first post via RMarkdown, which is a way of communicating code and graphs from R in a clean HTML format.)

One of my three-credit courses this semester is a Directed Study in which I work on a research project and hopefully come up with a report and some cool results. After taking a Computer Simulation Systems class last semester (with ARENA Simulation Software) and familiarizing myself with concepts in simulation and optimization of complex systems, I was fortunate enough to get the opportunity to work on another independent simulation project this semester.

The system I'm studying is a commercial parking lot with 100 spots that earns money in two ways: by selling a guaranteed parking space for a monthly fee, or by charging an hourly rate for customers who show up without a monthly pass. The goal of the study is to come up with a flexible management strategy for this system that ultimately maximizes profit.

In exchange for a flat monthly fee, the monthly parking pass guarantees subscribers a parking space each day. We’ll call these subscribers “monthly” customers. Part of the management strategy involves determining how many spaces in the lot should be reserved exclusively for these monthly subscribers. We could reserve one for every subscriber, but that would probably leave some profit on the table, since we don’t expect these monthly passholders to show up every single day. Still, anytime a monthly subscriber is turned away due to a lack of spots, our company incurs a hefty expense, since we failed on our end of the deal.

On the other hand, cars that show up to the lot without a monthly pass aren’t guaranteed a parking spot. If a car shows up when the lot’s unreserved spots are full, that car is turned away, and our simulation adds the lost revenue to a statistic called “Unrealized Profit”. These customers are called the “hourly” customers.

Part of the reason we use simulations is to see how the entire system's behavior changes when we adjust only one parameter. If we want to see how the amount of monthly parking spots we reserve impacts our net profit, we can do that by holding everything fixed and running simulations for different numbers of reserved spots. When we get the results, we make plots to observe and understand the behavior.
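As a rough illustration of that experiment, here’s a toy Monte Carlo version in Python. Every rate and price below is a made-up placeholder (the real study was built in ARENA, and its parameters aren’t reproduced here); the point is just the shape of the experiment: hold everything fixed, sweep one parameter, and compare net profit.

```python
import random

# Toy stand-in for the parking-lot simulation. All numbers here are
# hypothetical placeholders, not the values from the actual ARENA model.
TOTAL_SPOTS = 100
SUBSCRIBERS = 60        # monthly pass holders
P_SHOW = 0.7            # chance a subscriber shows up on a given day
MONTHLY_FEE = 200       # flat fee per subscriber per month
HOURLY_REVENUE = 10     # average revenue collected per hourly car
TURNAWAY_COST = 150     # penalty for turning away a monthly subscriber

def simulate_month(reserved, days=30, seed=42):
    """Net profit for one simulated month with `reserved` monthly-only spots."""
    rng = random.Random(seed)
    profit = SUBSCRIBERS * MONTHLY_FEE          # subscription revenue is fixed
    for _ in range(days):
        monthly_today = sum(rng.random() < P_SHOW for _ in range(SUBSCRIBERS))
        hourly_today = rng.randint(60, 100)     # hourly demand for the day
        unreserved = TOTAL_SPOTS - reserved
        # Simplification: hourly cars grab the unreserved spots first, so a
        # subscriber is turned away once reserved + leftover spots run out.
        hourly_parked = min(hourly_today, unreserved)
        spots_for_monthly = reserved + (unreserved - hourly_parked)
        turned_away = max(0, monthly_today - spots_for_monthly)
        profit += hourly_parked * HOURLY_REVENUE
        profit -= turned_away * TURNAWAY_COST
    return profit

# Sweep the one parameter we care about and watch net profit respond.
for reserved in (0, 20, 40, 60):
    print(reserved, simulate_month(reserved))
```

With a $150 turnaway cost, the toy model behaves the way the real one did: reserve too few spots and the turnaway penalties eat the hourly revenue.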

Anyway, the point of this post is to make some cool plots in R. In order to build the plots, I turned to the ggplot2 library. While making simple plots in R isn't difficult, ggplot2 allows us to make beautiful plots with custom colors, fonts, and other nifty features.

After running our simulation and gathering the data, I loaded the results into R.

The first plot I developed is exactly what I just described above: we examine the relationship between Monthly Spots Reserved and Net Profit, while also varying the cost (Monthly Turnaway Cost) we incur for turning away monthly subscribers to see if that has any impact.

Here is the resulting plot:

[Plot: Net Profit vs. Monthly Spots Reserved, with one line per Monthly Turnaway Cost]

As you can see, when we have a greater number of Monthly Spots Reserved, the Monthly Turnaway Cost doesn't have much of an impact on our Net Profit. However, when we reserve fewer spots, more and more monthly customers get turned away, increasing the impact that Monthly Turnaway Cost has on our profits. When the Monthly Turnaway Cost is $150, limiting the number of Monthly Spots Reserved looks like a terrible idea. However, when the Monthly Turnaway Cost is only $25, it becomes less of an issue. Of course, the business reality here is that when a monthly customer shows up to no parking spot, that's on us, so we probably want to pay more attention to the lines associated with larger Monthly Turnaway Costs.

The next plot will show the same thing, but instead of adjusting Monthly Spots Reserved, we keep that fixed and adjust the Hourly Arrival Rate, which describes how often hourly customers generally show up to the lot.

[Plot: Net Profit vs. Hourly Arrival Rate, with one line per Monthly Turnaway Cost]

When the Hourly Arrival Rate is small, fewer hourly customers show up to the parking lot, resulting in more spots available for the monthly subscribers and, ultimately, fewer monthly turnaways. However, when the Hourly Arrival Rate increases, there is less room for incoming monthly subscribers should all of the Monthly Spots Reserved be occupied, ultimately resulting in more monthly turnaways digging into our profits.

The ggplot2 package is great for adding artistic expression and personality to your reports and data visualizations. Visually appealing charts can go a long way in telling an effective story with your data.

The Philosophy of Learning From Data


Rarely am I more excited to go to class than I am on Tuesday nights for my weekly Data Mining lecture. While I’m already familiar with some of the concepts, our professor is doing a really good job of tying everything together around the central philosophy of practical machine learning, which is to make predictions and generalize well.

We’ve been told from day one that we don’t care about p-values or traditional statistics—not that they aren’t important, but in this class we can actually keep score of how well our model predicts whatever it’s trying to predict. This represents a fundamental shift in how people perceive and understand the world around them. No longer are we simply testing hypotheses and running basic stats to judge their plausibility; instead, more and more, we are using machines that can learn from data.

A lot of people get tripped up in their everyday thinking by looking at situations as either black or white. Just because only one story actually happened doesn’t mean we can’t use what we know from the past to judge which scenarios were most likely. All the time, I listen to people who claim to know *exactly* what happened in a scenario when it’s literally impossible for anybody to know for sure. A lot of these conversations happen on Saturday mornings after a late night out. Others involve government conspiracies. But this kind of thinking happens all the time, and if you look for it, you’ll start to see it everywhere. And it has a name, too: confirmation bias.

“Confirmation bias is the tendency to search for, interpret, or recall information in a way that confirms one’s beliefs or hypotheses.”

You see, most of the time, we humans tend to find whatever we’re looking for, even when it’s not actually there. It seems like a silly thing to do, but it’s been essential to our survival for a long time. In order to achieve your goals, you need to rationalize to yourself that whatever steps you take to achieve those goals are necessary. If you start questioning your beliefs or your motives, then your goals might never get accomplished. People in science might have a problem with that, but people in business and other fast-moving enterprises (like a sports team or the military) really understand the value of good heuristics and guiding principles for making time-sensitive decisions.

Chess is a game where time is (usually) a factor in making decisions. A trait of strong chess players is that they consider only a small set of good moves when making decisions. This is because the advanced player has learned a general idea of what a good move looks like. Beginners, on the other hand, have no concept of what makes a good move. When deciding on a move, beginners are deciding between a massive set of possible moves, because they don’t have the heuristics to narrow down their selection space to moves that generalize well later in the game. As a result, the beginner will generally pick more bad moves than the stronger player who picks from a smaller set of quality moves.

What we’re actually doing by “learning from data” is building better heuristics for our decision-making. We’re getting better at generalizing in situations where the best choice is not obvious, which is the fundamental goal of machine learning. Humans are doing less and less interpretation and hypothesis testing because it’s not a winning strategy—the possible set of explanations for whatever problem you’re trying to solve is simply too complicated for you to rely upon intuitive reasoning. Instead, it’s best to collect and analyze massive amounts of data in order to observe what’s actually happening on a more detailed level. When you do this, you can quickly build better heuristics, and as a result, make better decisions.

Heuristics are moving out of our “gut” and onto our monitors. This is where the world is going, and it’s going to impact a wide range of industries in the coming years. There are data scientists who can easily build excellent models on topics for which they have almost no domain knowledge. And data hotshot Jeremy Howard is excited about the implications of AI progress, but he’s concerned that the labor market will take a pretty big hit. Fortune Magazine recently hypothesized an Algorithmic CEO. I might not know much, but if your plan is to become a high-performing and productive member of society, I’d pay attention to how data is going to impact your industry.

Decision-Making Breakdowns: Challenger Edition

I’m a big Twitter reader, and lately I’ve been seeing more and more examples of good journalism and storytelling that use data analysis and quantitative evidence to support a thesis. The IQuantNY tumblr is a great example of this. You can read more about it here, but basically, a guy named Ben Wellington is writing stories about the discoveries he makes while digging through public data provided by New York City. He’s not writing with academic jargon; he’s telling a story that can only be told with data skills, but in a way that’s accessible to anybody.

The pursuit of these skills has me taking a graduate-level Data Mining course this semester. This course is primarily about the analysis of data with statistical and machine learning techniques to build models and, at the end of the day, make predictions.

Because this is a business school class, the focus is on practicality; we want to know when to use certain models, and why they are useful to us. We’ve learned that sometimes, the purpose of building models is to use them as hard rules in the decision-making process. But in most cases, we use them as heuristics—general guides to provide context for solving a larger problem.

Tonight, we learned about generalized linear models, and in particular, logistic regression. The purpose of logistic regression is to build a model with input (predictor) variables that generates a probability as an output. That probability is the chance that a particular instance falls into one of the two possible output classes.

One example in tonight’s class was predicting space shuttle O-Ring erosion with outside temperature at launch time as our predictor. If you take a look at the graph below from data on 23 shuttle missions prior to the Challenger disaster, there seems to be a relationship between Temperature and O-Ring damage; as launch temperature gets colder, the probability of O-Ring damage increases.

[Plot: O-Ring damage vs. launch temperature for the 23 shuttle missions prior to Challenger]

Using logistic regression on this data, a model was constructed to predict the level of O-Ring erosion given temperature as an input. The model predicted that, at Challenger’s launch temperature (31°F), the probability of any one O-Ring failing was 0.99, and 5.95 out of 6 O-Rings were expected to fail. Using this model as a heuristic, you would get the idea that launching Challenger in that weather was a catastrophic risk.

[Plot: fitted logistic curve extrapolated down to Challenger’s 31°F launch temperature]
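To get a feel for how a model like that works, here’s a minimal logistic-regression fit in plain Python. The data points are illustrative stand-ins shaped like the O-Ring record (damage clustered at colder launches), not the actual 23-mission dataset, so the fitted numbers won’t match the published analysis.

```python
import math

# Illustrative (temperature °F, damage?) pairs -- NOT the real 23-mission
# dataset, just numbers with the same cold-equals-damage pattern.
temps  = [53, 57, 58, 63, 66, 67, 68, 70, 70, 72, 73, 75, 76, 78, 79, 81]
damage = [ 1,  1,  1,  1,  0,  0,  0,  1,  0,  0,  0,  0,  0,  0,  0,  0]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit(xs, ys, lr=0.02, steps=30000):
    """Fit p(damage) = sigmoid(b0 + b1*x) by gradient ascent on the
    log-likelihood. Inputs are centered and scaled for stable steps."""
    b0 = b1 = 0.0
    zs = [(x - 70.0) / 10.0 for x in xs]
    for _ in range(steps):
        g0 = g1 = 0.0
        for z, y in zip(zs, ys):
            err = y - sigmoid(b0 + b1 * z)   # residual drives both gradients
            g0 += err
            g1 += err * z
        b0 += lr * g0
        b1 += lr * g1
    return b0, b1

b0, b1 = fit(temps, damage)

def p_damage(temp):
    return sigmoid(b0 + b1 * (temp - 70.0) / 10.0)

# Extrapolating far below every observed launch temperature:
print(round(p_damage(31), 3), round(p_damage(75), 3))
```

Even on fake data, the shape of the conclusion is the same: 31°F sits far outside anything the model has seen, and the predicted damage probability is close to 1.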

This is interesting not just because of the logistic regression involved, but because it shows a clear and fundamental breakdown in the decision-making process at NASA. And though the Challenger disaster was almost thirty years ago, I’m going to guess that this still happens all the time for regular people–you know, people who aren’t rocket scientists.

An investigation into NASA indeed revealed a broken decision-making process. This has become a classic case study in the consequences of groupthink, which occurs when a group of people makes a sub-optimal decision because it prefers avoiding conflict over careful, critical analysis.

This makes too much sense to me, and it should to anybody who is part of an organization focused on achieving certain goals. Conflict is often frowned upon in our society, to the point that I’ll often choose to forgo conflict because it’s in my best interest. Engaging in conflict takes effort, and it can hurt feelings or damage the relationships of the people involved. So it’s easy to see how groupthink can plague group decision-making processes.

What’s the solution? I’m not sure. Increasingly, predictive modeling and machine learning are becoming commonplace and more accessible to businesses, sports teams, and other organizations. Because of this, we can build better heuristics, and ultimately make better decisions. But heuristics don’t make decisions—people do. If the powerful people in your organization refuse to listen to the data, your data mining skills won’t get you anywhere.

It’ll be interesting to see how long it takes managers in different sectors to embrace data-driven decision-making. I already wrote about Mike McCarthy and how his conservative playcalling took points off the board for the Green Bay Packers, ultimately costing them their season. It just goes to show you that there is a lot of low-hanging fruit to be picked through simple improvements in decision-making processes everywhere.

Blame McCarthy for Lighting Points on Fire

Yesterday, Packers coach Mike McCarthy made more than a few questionable in-game decisions during the course of his team’s epic collapse to the Seattle Seahawks, costing them what seemed to be an all-but-guaranteed trip to Phoenix for Super Bowl XLIX. Bill Barnwell has a nice take today up on Grantland, which is quite explicit about the shortcomings of McCarthyism.

The whole situation was enough to get me a little agitated, as I expressed via Twitter.

Let’s all just sit here and think about it for a second to review the basics. There are really only two levers of control that a team has in influencing the outcome of an NFL football game.

One is the performance of the players on the field. Through coaching, training, and practice throughout each week, players prepare to perform to the best of their abilities to shift the outcome of the game in their favor.

The second sphere of influence is in-game decision-making. Play-calling, timeout strategy, substitutions, and offensive tempo all fall under this umbrella. These are the things that you might not explicitly see with your eyes unless you are looking closely.

When it comes to the performance of players on the field, there are only so many things you can do without breaching the rules and getting suspended, fined, or penalized in some way. For example, performance-enhancing drugs are explicitly illegal under the current CBA & league rules. If these rules weren’t in place, it’s reasonable to assume that most players would take these substances in order to increase their influence on the outcome of a game (actually, some players do this in spite of the rules). To take it a step further, let’s pretend that robots and other wearable technology were allowed on the field to influence the outcome of a game. If that were the case, don’t you think teams would be putting some serious capital investment and R&D into these areas? A robot that could play quarterback better than Aaron Rodgers would probably be a hot commodity on the free-agent market. But unfortunately, robots are illegal on the field.

What happens when we extend this line of thinking to in-game decision-making? Well, unless I’m mistaken, there are no rules that prohibit robots from helping coaches with play-calling, timeout strategy, substitutions, and how to set the offensive tempo. In fact, the New York Times already has a robot that makes pretty good judgments on fourth-down decisions. Why are more teams in the NFL not taking advantage of these legal, performance-enhancing methods? Your guess is as good as mine.
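The arithmetic a fourth-down bot performs isn’t mysterious: it’s a comparison of expected points. Here’s a sketch with completely hypothetical probabilities and point values, just to show the structure of the calculation (real models estimate these quantities from play-by-play data).

```python
# Expected points of a decision = probability-weighted average of outcomes.
def expected_points(p_success, pts_if_success, pts_if_failure):
    return p_success * pts_if_success + (1 - p_success) * pts_if_failure

# 4th-and-1 near midfield, all numbers hypothetical:
go_for_it = expected_points(0.65, 2.8, -1.5)  # convert vs. turnover on downs
punt = expected_points(1.00, 0.2, 0.0)        # rough field-position value

decision = "go for it" if go_for_it > punt else "punt"
print(decision, round(go_for_it - punt, 2))
```

With these made-up numbers, going for it is worth about a point more than punting; a coach who punts anyway is, in expected-points terms, lighting value on fire.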

Listen, I know that this stuff shouldn’t drive me crazy. I know that I’m really fighting the good fight by blogging about fourth-down decision-making on Martin Luther King, Jr. Day. But when an NFL coach lights 2.6 expected points on fire in the first quarter of a high-stakes game where his team is the underdog, there should be some immediate consequence, especially for the team’s fans who contribute so much time, money, and emotion to that coach’s employer. If people can make such destructive and preventable decisions in broad daylight without any real consequence from the people charged with holding them accountable, what does that say about the society we’re all a part of?