content/blog/2015/11/happy-little-words.markdown @ 954bb5f0facd

Update Makefile
author Steve Losh <steve@stevelosh.com>
date Fri, 07 Oct 2016 12:50:39 +0000
parents e7bc59b9ebda
children fdf01e99fd51
+++
title = "Happy Little Words"
snip = "Analyzing the Bob Ross Twitch chat."
date = 2015-11-20T18:43:00Z
draft = false

+++

In late October the video game streaming site Twitch.tv [launched "Twitch
Creative"][twitch-creative], essentially giving people permission to stream
non-video game related creative content on the site.  To celebrate the launch
they streamed all 403 episodes of [The Joy of Painting with Bob Ross][joy] in
a giant marathon.

The Bob Ross channel has its own chat room, and it quickly became packed with
folks watching Bob paint.  The chat spawned its own memes and conventions within
days, mostly taking gamer slang (e.g. "gg" for "good game") and applying it to
the show (people spam "gg" in the chat whenever Bob finishes a painting).

Sadly that marathon has ended, but they've kept the dream alive by having ["Bob
Ross Night" on Mondays][mondays].  Every Monday they're going to stream a season
of the show twice (once at a Europe-friendly time and again for American folks).
Last Monday I scraped the Twitch chat during the marathon(s) of Season 2 and
decided to have some fun poking around at the data.

[twitch-creative]: http://blog.twitch.tv/2015/10/introducing-twitch-creative/
[joy]: https://en.wikipedia.org/wiki/The_Joy_of_Painting
[mondays]: http://blog.twitch.tv/2015/11/monday-night-is-bob-ross-night/

{{% toc %}}

## Scraping

Scraping the chat was pretty easy.  Twitch has an IRC gateway for chats, so
I just ran an IRC client ([weechat][]) on a VPS and had it log the channel like
any other.  Once the marathon finished I just `scp`'ed down the 8mb log and
started working with it.

First I trimmed both ends to only leave messages from about an hour and a half
before and after the marathons started and ended.  So the data I'm going to work
with runs from 2015-11-16 14:30 to 2015-11-17 07:30 (all times are in UTC),
or 17 hours.

Then I cleaned it up to
remove some of the cruft (status messages from the client and such) and
lowercase everything:

    cat data/raw | grep -E '^[^\t]+\t <' | gsed -e 's/./\L\0/g' > data/log

Then I made an ugly little Python script to massage the data into something
a bit easier to work with later:

```python
import datetime, sys, time

def datetime_to_epoch(dt):
    return int(time.mktime(dt.timetuple()))

for line in sys.stdin:
    timestamp, nick, msg = (s.strip() for s in line.split('\t', 2))

    timestamp = datetime_to_epoch(
        datetime.datetime.strptime(timestamp, '%Y-%m-%d %H:%M:%S'))

    # strip off <>'s
    nick = nick[1:-1]

    print(timestamp, nick, msg)
```

This results in a file with one message per line, in the format:

    timestamp nick message goes here...

On a side note: I tried out [Mosh][] for persisting a connection to the server
(instead of using tmux or screen to persist a session) and it worked pretty
well.  I might start using it more often.

[weechat]: https://weechat.org/
[Mosh]: https://mosh.mit.edu/

## Volume

Now that we've got a nice clean corpus, let's start playing with it!

The obvious first question: how many messages did people send in total?

     ><((°> cat data/messages | wc -l
      165368

That's almost 10,000 messages per hour!  And since there were periods of almost
no activity before, between, and after the two marathons it means the rate
*during* them was well over that!

Who talked the most?

    ><((°> cat data/messages | cuts -f 2 | sort | uniq -c | sort -nr | head -5
     269 fuscia13
     259 almightypainter
     239 sabrinamywaifu
     235 roudydogg1
     201 ionone

Some talkative folks (though honestly I expected a bit higher numbers here).
[cuts][] is "**cut** on **s**paces" — a little function I use so I don't have
to type `-d ' '` all the time.

[cuts]: https://bitbucket.org/sjl/dotfiles/src/default/fish/functions/cuts.fish

## N-grams

The chat has spawned a bunch of its own memes and jargon.  I made another ugly
Python script to split up messages into [n-grams] so we can analyze them more
easily:

```python
import sys
import nltk

def window(coll, size):
    '''Generate a "sliding window" of tuples of size l over coll.

    coll must be sliceable and have a fixed len.

    '''
    coll_len = len(coll)
    for i in range(coll_len):
        if i + size > coll_len:
            break
        else:
            yield tuple(coll[i:i+size])

for line in sys.stdin:
    timestamp, nick, msg = line.split(' ', 2)

    n = int(sys.argv[1])

    for ngram in set(window(nltk.word_tokenize(msg), n)):
        print(timestamp, nick, '__'.join(ngram))
```

This lets us easily split a message into unigrams:

    ><((°> echo "1447680000 sjl beat the devil out of it" | python src/split.py 1
    1447680000 sjl it
    1447680000 sjl the
    1447680000 sjl beat
    1447680000 sjl of
    1447680000 sjl out
    1447680000 sjl devil

The order of n-grams within a message isn't preserved because the splitting
script uses a `set` to remove duplicate n-grams.  I wanted to remove dupes
because it turns out people frequently copy and paste the same word many times
in a single message and I didn't want that to throw off the numbers.

Bigrams are just as easy — just change the parameter to `split.py`:

    ><((°> echo "1447680000 sjl beat the devil out of it" | python src/split.py 2
    1447680000 sjl of__it
    1447680000 sjl the__devil
    1447680000 sjl beat__the
    1447680000 sjl devil__out
    1447680000 sjl out__of

N-grams are joined with double underscores to make them easier to plot later.

So what are the most frequent unigrams?

    ><((°> cat data/words | cuts -f3 | sort | uniq -c | sort -nr | head -15
    19523 bob
    14367 ruined
    11961 kappaross
    11331 !
    10989 gg
    7666 is
    6592 the
    6305 ?
    6090 i
    5376 biblethump
    5240 devil
    5122 saved
    5075 rip
    4813 it
    4727 a

Some of these are expected, like "Bob" and stopwords like "is" and "the".

The chat loves to spam "RUINED" whenever Bob makes a drastic change to the
painting that looks awful at first, and then spam "SAVED" once he applies a bit
more paint and it looks beautiful.  This happens frequently with mountains.

"KappaRoss" and "BibleThump" are [Twitch emotes][kappa] that produce small
images in the chat.

When Bob cleans his brush he beats it against the leg of the easel to remove the
paint thinner, and he often smiles and says "just beat the devil out of it".  It
didn't take long before chat started spamming "RIP DEVIL" every time he cleans
the brush.

How about the most frequent bigrams and trigrams?

    ><((°> cat data/bigrams | cuts -f3 | sort | uniq -c | sort -nr | head -15
    3731 rip__devil
    3153 !__!
    2660 bob__ross
    2490 hi__bob
    1844 <__3
    1838 kappaross__kappaross
    1533 bob__is
    1409 bob__!
    1389 god__bless
    1324 happy__little
    1181 van__dyke
    1093 gg__wp
    1024 is__back
    908 i__believe
    895 ?__?

    ><((°> cat data/trigrams | cuts -f3 | sort | uniq -c | sort -nr | head -15
    2130 !__!__!
    1368 kappaross__kappaross__kappaross
    678 van__dyke__brown
    617 bob__is__back
    548 ?__?__?
    503 biblethump__biblethump__biblethump
    401 bob__ross__is
    377 hi__bob__!
    376 beat__the__devil
    361 bob__!__!
    331 bob__<__3
    324 <__3__<
    324 3__<__3
    303 i__love__you
    302 son__of__a

Looks like lots of love for Bob and no sympathy for the devil.  It also seems
like [Van Dyke Brown][vdb] is Twitch chat's favorite color by a landslide.

Note that the exact n-grams depend on the tokenization method.  I used NLTK's
`word_tokenize` because it was easy and worked pretty well.
`wordpunct_tokenize` also works, but it splits up basic punctuation a bit too
much for my liking (e.g. it turns `bob's` into three tokens `bob`, `'`, and `s`,
where `word_tokenize` produces just `bob` and `'s`).

[n-grams]: https://en.wikipedia.org/wiki/N-gram
[kappa]: https://fivethirtyeight.com/features/why-a-former-twitch-employee-has-one-of-the-most-reproduced-faces-ever/
[vdb]: https://www.bobross.com/ProductDetails.asp?ProductCode=VanDykeBrown

## Graphing

Pure numbers are interesting, but [can be misleading][aq].  Let's make some
graphs to get a sense of what the data feels like.  I'm using [gnuplot][] to
make the graphs.

What does the overall volume look like?  We'll use minute-wide buckets in the
x axis to make the graph a bit easier to read.

[![Photo](/media/images/blog/2015/11/hlw-total.png)](/media/images/blog/2015/11/hlw-total-large.png)

Can you tell where the two marathons start and end?

Let's try to identify where episodes start and finish.  Chat usually spams "hi
bob" when an episode starts and "gg" when it finishes, so let's plot those.
We'll use 30-second x buckets here because a minute isn't a fine enough
resolution for the events we're looking for.  To make it easier to read we'll
just look at the first half of the first marathon.

[![Photo](/media/images/blog/2015/11/hlw-higg.png)](/media/images/blog/2015/11/hlw-higg-large.png)

This works pretty well!  The graph starts with a big spike of "hi bob", then as
each episode finishes we see a (huge) spike of "gg", followed immediately by
a round of "hi bob" as the next episode starts.

Can we find all the times Bob cleaned his brush?

[![Photo](/media/images/blog/2015/11/hlw-ripdevil.png)](/media/images/blog/2015/11/hlw-ripdevil-large.png)

Looks like the devil isn't having a very good time.  It's encouraging that the
two seasons have roughly the same structure (three main clusters of peaks).

Note that there are a couple of smaller peaks between the two showings.  Twitch
showed another streamer painting between the two marathons, so it's likely that
she cleaned her brush a couple of times and the chat responded.  Fewer people
were watching the stream during the break, hence the smaller peaks.

When did Bob get the most love?  We'll use 5-minute x bins here because we just
want a general idea.

[![Photo](/media/images/blog/2015/11/hlw-love.png)](/media/images/blog/2015/11/hlw-love-large.png)

Lots of love all around, but especially as he signed off at the end.

One of my favorite moments was when Bob said something about "changing your mind
in mid **stream**" and the chat started spamming conspiracy theories about how
he somehow knew about the stream 30 years in the past:

[![Photo](/media/images/blog/2015/11/hlw-heknew.png)](/media/images/blog/2015/11/hlw-heknew-large.png)

[aq]: https://en.wikipedia.org/wiki/Anscombe%27s_quartet
[gnuplot]: http://www.gnuplot.info/

## Up Next

Poking around at this chat corpus was a lot of fun (and *definitely* counts as
studying for my NLP final, *definitely*).  I'll probably record the chat during
next week's marathon and do some more poking, specifically around finding unique
events (e.g. his son Steve coming on the show) by comparing rate percentiles.

If you've got other ideas for things I should graph, [let me know][twitter].

[twitter]: http://twitter.com/stevelosh