--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/content/blog/2015/11/happy-little-words.html Fri Nov 20 18:44:20 2015 +0000
@@ -0,0 +1,314 @@
+ {% extends "_post.html" %}
+ {% hyde
+ title: "Happy Little Words"
+ snip: "Analyzing the Bob Ross Twitch chat."
+ created: 2015-11-20 18:43:00
+ %}
+ {% block article %}
+In late October the video game streaming site Twitch.tv [launched "Twitch
+Creative"][twitch-creative], essentially giving people permission to stream
+non-video game related creative content on the site. To celebrate the launch
+they streamed all 403 episodes of [The Joy of Painting with Bob Ross][joy] in
+a giant marathon.
+The Bob Ross channel has its own chat room, and it quickly became packed with
+folks watching Bob paint. The chat spawned its own memes and conventions within
+days, mostly taking gamer slang (e.g. "gg" for "good game") and applying it to
+the show (people spam "gg" in the chat whenever Bob finishes a painting).
+Sadly that marathon has ended, but they've kept the dream alive by having ["Bob
+Ross Night" on Mondays][mondays]. Every Monday they're going to stream a season
+of the show twice (once at a Europe-friendly time and again for American folks).
+Last Monday I scraped the Twitch chat during the marathon(s) of Season 2 and
+decided to have some fun poking around at the data.
+[twitch-creative]: http://blog.twitch.tv/2015/10/introducing-twitch-creative/
+[joy]: https://en.wikipedia.org/wiki/The_Joy_of_Painting
+[mondays]: http://blog.twitch.tv/2015/11/monday-night-is-bob-ross-night/
+## Scraping
+Scraping the chat was pretty easy. Twitch has an IRC gateway for chats, so
+I just ran an IRC client ([weechat][]) on a VPS and had it log the channel like
+any other. Once the marathon finished I just `scp`'ed down the 8 MB log and
+started working with it.
+First I trimmed both ends to only leave messages from about an hour and a half
+before and after the marathons started and ended. So the data I'm going to work
+with runs from 2015-11-16 14:30 to 2015-11-17 07:30 (all times are in UTC), or
+17 hours.
+Then I cleaned it up to remove some of the cruft (status messages from the
+client and such) and lowercase everything:
+ cat data/raw | grep -E '^[^\t]+\t <' | gsed -e 's/./\L\0/g' > data/log
+Then I made an ugly little Python script to massage the data into something
+a bit easier to work with later:
+    :::python
+    import datetime, sys, time
+
+    def datetime_to_epoch(dt):
+        return int(time.mktime(dt.timetuple()))
+
+    for line in sys.stdin:
+        # each log line is: timestamp<TAB><nick><TAB>message
+        timestamp, nick, msg = (s.strip() for s in line.split('\t', 2))
+        timestamp = datetime_to_epoch(
+            datetime.datetime.strptime(timestamp, '%Y-%m-%d %H:%M:%S'))
+
+        # strip off <>'s
+        nick = nick[1:-1]
+
+        print(timestamp, nick, msg)
+This results in a file with one message per line, in the format:
+ timestamp nick message goes here...
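+One caveat worth noting: `time.mktime` interprets its argument in the machine's
+local timezone. Assuming the timestamps in the log are UTC (an assumption; it
+depends on how the logging machine was configured), a timezone-independent
+conversion could be sketched like this. The name `utc_string_to_epoch` is
+hypothetical and not part of the script above.

```python
import calendar
import datetime


def utc_string_to_epoch(timestamp):
    """Parse 'YYYY-MM-DD HH:MM:SS' as UTC and return integer epoch seconds."""
    dt = datetime.datetime.strptime(timestamp, '%Y-%m-%d %H:%M:%S')
    # timegm treats the time tuple as UTC, unlike time.mktime (local time).
    return calendar.timegm(dt.timetuple())


print(utc_string_to_epoch('2015-11-16 13:20:00'))
```

+On a machine whose clock is already set to UTC the two approaches give the same
+result, so this only matters when the script runs somewhere else.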
+On a side note: I tried out [Mosh][] for persisting a connection to the server
+(instead of using tmux or screen to persist a session) and it worked pretty
+well. I might start using it more often.
+[weechat]: https://weechat.org/
+[Mosh]: https://mosh.mit.edu/
+## Volume
+Now that we've got a nice clean corpus, let's start playing with it!
+The obvious first question: how many messages did people send in total?
+ ><((°> cat data/messages | wc -l
+ 165368
+That's almost 10,000 messages per hour! And since there were periods of almost
+no activity before, between, and after the two marathons it means the rate
+*during* them was well over that!
+Who talked the most?
+ ><((°> cat data/messages | cuts -f 2 | sort | uniq -c | sort -nr | head -5
+ 269 fuscia13
+ 259 almightypainter
+ 239 sabrinamywaifu
+ 235 roudydogg1
+ 201 ionone
+Some talkative folks (though honestly I expected somewhat higher numbers here).
+[cuts][] is "**cut** on **s**paces" -- a little function I use so I don't have
+to type `-d ' '` all the time.
+[cuts]: https://bitbucket.org/sjl/dotfiles/src/default/fish/functions/cuts.fish
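+For anyone who would rather stay in Python, the same count could be sketched
+with `collections.Counter`. This is an illustrative equivalent of the
+`sort | uniq -c | sort -nr` pipeline, not what was actually run; the function
+name `top_talkers` and the sample lines are made up.

```python
from collections import Counter


def top_talkers(lines, n=5):
    """Count messages per nick in 'timestamp nick message' lines."""
    counts = Counter()
    for line in lines:
        parts = line.split(' ', 2)
        if len(parts) >= 2:
            # The nick is the second space-separated field.
            counts[parts[1]] += 1
    return counts.most_common(n)


# Made-up sample data in the same format as data/messages.
sample = [
    '1447680000 sjl beat the devil out of it',
    '1447680005 someone happy little trees',
    '1447680009 sjl gg',
]
print(top_talkers(sample))
```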
+## N-grams
+The chat has spawned a bunch of its own memes and jargon. I made another ugly
+Python script to split up messages into [n-grams] so we can analyze them more
+easily:
+    :::python
+    import sys
+    import nltk
+
+    def window(coll, size):
+        '''Generate a "sliding window" of tuples of the given size over coll.
+
+        coll must be sliceable and have a fixed len.
+
+        '''
+        coll_len = len(coll)
+        for i in range(coll_len):
+            if i + size > coll_len:
+                break
+            else:
+                yield tuple(coll[i:i+size])
+
+    for line in sys.stdin:
+        timestamp, nick, msg = line.split(' ', 2)
+        n = int(sys.argv[1])
+
+        # a set drops n-grams repeated within a single message
+        for ngram in set(window(nltk.word_tokenize(msg), n)):
+            print(timestamp, nick, '__'.join(ngram))
+This lets us easily split a message into unigrams:
+ ><((°> echo "1447680000 sjl beat the devil out of it" | python src/split.py 1
+ 1447680000 sjl it
+ 1447680000 sjl the
+ 1447680000 sjl beat
+ 1447680000 sjl of
+ 1447680000 sjl out
+ 1447680000 sjl devil
+The order of n-grams within a message isn't preserved because the splitting
+script uses a `set` to remove duplicate n-grams. I wanted to remove dupes
+because it turns out people frequently copy and paste the same word many times
+in a single message and I didn't want that to throw off the numbers.
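+If the original order ever matters, duplicates can be dropped without losing
+it. The following is a minimal sketch, not the script used for this post: it
+relies on `dict` preserving insertion order (Python 3.7+), uses plain
+`str.split` instead of NLTK to stay self-contained, and `unique_ngrams` is
+a hypothetical helper name.

```python
def window(coll, size):
    """Yield each run of `size` consecutive items in coll as a tuple."""
    for i in range(len(coll) - size + 1):
        yield tuple(coll[i:i + size])


def unique_ngrams(tokens, n):
    """Return the distinct n-grams of tokens in first-seen order."""
    # dict keys are unique and keep insertion order (Python 3.7+).
    return list(dict.fromkeys(window(tokens, n)))


# A spammy message: the repeated "gg" should only count once.
tokens = 'gg gg gg bob gg'.split()
for ngram in unique_ngrams(tokens, 1):
    print('__'.join(ngram))
```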
+Bigrams are just as easy -- just change the parameter to `split.py`:
+ ><((°> echo "1447680000 sjl beat the devil out of it" | python src/split.py 2
+ 1447680000 sjl of__it
+ 1447680000 sjl the__devil
+ 1447680000 sjl beat__the
+ 1447680000 sjl devil__out
+ 1447680000 sjl out__of
+N-grams are joined with double underscores to make them easier to plot later.
+So what are the most frequent unigrams?
+ ><((°> cat data/words | cuts -f3 | sort | uniq -c | sort -nr | head -15
+ 19523 bob
+ 14367 ruined
+ 11961 kappaross
+ 11331 !
+ 10989 gg
+ 7666 is
+ 6592 the
+ 6305 ?
+ 6090 i
+ 5376 biblethump
+ 5240 devil
+ 5122 saved
+ 5075 rip
+ 4813 it
+ 4727 a
+Some of these are expected, like "Bob" and stopwords like "is" and "the".
+The chat loves to spam "RUINED" whenever Bob makes a drastic change to the
+painting that looks awful at first, and then spam "SAVED" once he applies a bit
+more paint and it looks beautiful. This happens frequently with mountains.
+"KappaRoss" and "BibleThump" are [Twitch emotes][kappa] that produce small
+images in the chat.
+When Bob cleans his brush he beats it against the leg of the easel to remove the
+paint thinner, and he often smiles and says "just beat the devil out of it". It
+didn't take long before chat started spamming "RIP DEVIL" every time he cleans
+the brush.
+How about the most frequent bigrams and trigrams?
+ ><((°> cat data/bigrams | cuts -f3 | sort | uniq -c | sort -nr | head -15
+ 3731 rip__devil
+ 3153 !__!
+ 2660 bob__ross
+ 2490 hi__bob
+ 1844 <__3
+ 1838 kappaross__kappaross
+ 1533 bob__is
+ 1409 bob__!
+ 1389 god__bless
+ 1324 happy__little
+ 1181 van__dyke
+ 1093 gg__wp
+ 1024 is__back
+ 908 i__believe
+ 895 ?__?
+ ><((°> cat data/trigrams | cuts -f3 | sort | uniq -c | sort -nr | head -15
+ 2130 !__!__!
+ 1368 kappaross__kappaross__kappaross
+ 678 van__dyke__brown
+ 617 bob__is__back
+ 548 ?__?__?
+ 503 biblethump__biblethump__biblethump
+ 401 bob__ross__is
+ 377 hi__bob__!
+ 376 beat__the__devil
+ 361 bob__!__!
+ 331 bob__<__3
+ 324 <__3__<
+ 324 3__<__3
+ 303 i__love__you
+ 302 son__of__a
+Looks like lots of love for Bob and no sympathy for the devil. It also seems
+like [Van Dyke Brown][vdb] is Twitch chat's favorite color by a landslide.
+Note that the exact n-grams depend on the tokenization method. I used NLTK's
+`word_tokenize` because it was easy and worked pretty well.
+`wordpunct_tokenize` also works, but it splits up basic punctuation a bit too
+much for my liking (e.g. it turns `bob's` into three tokens `bob`, `'`, and `s`,
+where `word_tokenize` produces just `bob` and `'s`).
+[n-grams]: https://en.wikipedia.org/wiki/N-gram
+[kappa]: https://fivethirtyeight.com/features/why-a-former-twitch-employee-has-one-of-the-most-reproduced-faces-ever/
+[vdb]: https://www.bobross.com/ProductDetails.asp?ProductCode=VanDykeBrown
+## Graphing
+Pure numbers are interesting, but [can be misleading][aq]. Let's make some
+graphs to get a sense of what the data feels like. I'm using [gnuplot][] to
+make the graphs.
+What does the overall volume look like? We'll use minute-wide buckets on the
+x axis to make the graph a bit easier to read.
+[](/media/images{{ parent_url }}/hlw-total-large.png)
+Can you tell where the two marathons start and end?
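+The bucketing itself is simple: round each timestamp down to the start of its
+bucket and count. Here is a rough sketch of one way to do it, assuming the
+`timestamp nick message` format from earlier; `bucket_counts` and its `width`
+parameter are hypothetical names, and this is not necessarily how the graphs
+in this post were produced.

```python
from collections import Counter


def bucket_counts(lines, width=60):
    """Count messages per time bucket, keyed by bucket start (epoch seconds)."""
    counts = Counter()
    for line in lines:
        timestamp = int(line.split(' ', 1)[0])
        # Round down to the start of the bucket containing this message.
        counts[timestamp - timestamp % width] += 1
    return sorted(counts.items())


# Made-up sample data in the same format as data/messages.
sample = [
    '1447680000 sjl hi bob',
    '1447680030 sjl gg',
    '1447680061 sjl rip devil',
]
for start, count in bucket_counts(sample):
    print(start, count)
```

+Passing a different `width` (30 for 30-second buckets, 300 for 5-minute ones)
+changes the resolution, and the two-column output is a shape that plotting
+tools generally accept.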
+Let's try to identify where episodes start and finish. Chat usually spams "hi
+bob" when an episode starts and "gg" when it finishes, so let's plot those.
+We'll use 30-second x buckets here because a minute isn't a fine enough
+resolution for the events we're looking for. To make it easier to read we'll
+just look at the first half of the first marathon.
+[](/media/images{{ parent_url }}/hlw-higg-large.png)
+This works pretty well! The graph starts with a big spike of "hi bob", then as
+each episode finishes we see a (huge) spike of "gg", followed immediately by
+a round of "hi bob" as the next episode starts.
+Can we find all the times Bob cleaned his brush?
+[](/media/images{{ parent_url }}/hlw-ripdevil-large.png)
+Looks like the devil isn't having a very good time. It's encouraging that the
+two showings have roughly the same structure (three main clusters of peaks).
+Note that there are a couple of smaller peaks between the two showings. Twitch
+showed another streamer painting between the two marathons, so it's likely that
+she cleaned her brush a couple of times and the chat responded. Fewer people
+were watching the stream during the break, hence the smaller peaks.
+When did Bob get the most love? We'll use 5-minute x bins here because we just
+want a general idea.
+[](/media/images{{ parent_url }}/hlw-love-large.png)
+Lots of love all around, but especially as he signed off at the end.
+One of my favorite moments was when Bob said something about "changing your mind
+in mid **stream**" and the chat started spamming conspiracy theories about how
+he somehow knew about the stream 30 years in the past:
+[](/media/images{{ parent_url }}/hlw-heknew-large.png)
+[aq]: https://en.wikipedia.org/wiki/Anscombe%27s_quartet
+[gnuplot]: http://www.gnuplot.info/
+## Up Next
+Poking around at this chat corpus was a lot of fun (and *definitely* counts as
+studying for my NLP final, *definitely*). I'll probably record the chat during
+next week's marathon and do some more poking, specifically around finding unique
+events (e.g. his son Steve coming on the show) by comparing rate percentiles.
+If you've got other ideas for things I should graph, [let me know][twitter].
+[twitter]: http://twitter.com/stevelosh
+ {% endblock article %}
--- a/layout/_post.html Thu Sep 17 09:20:42 2015 -0700
+++ b/layout/_post.html Fri Nov 20 18:44:20 2015 +0000
@@ -16,7 +16,7 @@
<span class="timeago"
- title="{{ page.created|date:"Y-m-d" }}T{{ page.created|date:"H:i:s" }}-0400">
+ title="{{ page.created|date:"Y-m-d" }}T{{ page.created|date:"H:i:s" }}-0000">
on {{ page.created|date:"F j, Y" }}.