content/blog/2015/11/happy-little-words.markdown @ 8b28ba4b6528
Add cl-pcg, fix bitbucket bullshit
| author | Steve Losh <steve@stevelosh.com> | 
|---|---|
| date | Thu, 06 Apr 2017 14:27:05 +0200 | 
| parents | fdf01e99fd51 | 
| children | f5556130bda1 | 
+++ title = "Happy Little Words" snip = "Analyzing the Bob Ross Twitch chat." date = 2015-11-20T18:43:00Z draft = false +++ In late October the video game streaming site Twitch.tv [launched "Twitch Creative"][twitch-creative], essentially giving people permission to stream non-video game related creative content on the site. To celebrate the launch they streamed all 403 episodes of [The Joy of Painting with Bob Ross][joy] in a giant marathon. The Bob Ross channel has its own chat room, and it quickly became packed with folks watching Bob paint. The chat spawned its own memes and conventions within days, mostly taking gamer slang (e.g. "gg" for "good game") and applying it to the show (people spam "gg" in the chat whenever Bob finishes a painting). Sadly that marathon has ended, but they've kept the dream alive by having ["Bob Ross Night" on Mondays][mondays]. Every Monday they're going to stream a season of the show twice (once at a Europe-friendly time and again for American folks). Last Monday I scraped the Twitch chat during the marathon(s) of Season 2 and decided to have some fun poking around at the data. [twitch-creative]: http://blog.twitch.tv/2015/10/introducing-twitch-creative/ [joy]: https://en.wikipedia.org/wiki/The_Joy_of_Painting [mondays]: http://blog.twitch.tv/2015/11/monday-night-is-bob-ross-night/ <div id="toc"></div> ## Scraping Scraping the chat was pretty easy. Twitch has an IRC gateway for chats, so I just ran an IRC client ([weechat][]) on a VPS and had it log the channel like any other. Once the marathon finished I just `scp`'ed down the 8mb log and started working with it. First I trimmed both ends to only leave messages from about an hour and a half before and after the marathons started and ended. So the data I'm going to work with runs from 2015-11-16 14:30 to 2015-11-17 07:30 (all times are in UTC), or 17 hours. Then I cleaned it up to remove some of the cruft (status messages from the client and such) and lowercase everything: cat data/raw | grep -E '^[^\t]+\t <' | gsed -e 's/./\L\0/g' > data/log Then I made an ugly little Python script to massage the data into something a bit easier to work with later: ```python import datetime, sys, time def datetime_to_epoch(dt): return int(time.mktime(dt.timetuple())) for line in sys.stdin: timestamp, nick, msg = (s.strip() for s in line.split('\t', 2)) timestamp = datetime_to_epoch( datetime.datetime.strptime(timestamp, '%Y-%m-%d %H:%M:%S')) # strip off <>'s nick = nick[1:-1] print(timestamp, nick, msg) ``` This results in a file with one message per line, in the format: timestamp nick message goes here... On a side note: I tried out [Mosh][] for persisting a connection to the server (instead of using tmux or screen to persist a session) and it worked pretty well. I might start using it more often. [weechat]: https://weechat.org/ [Mosh]: https://mosh.mit.edu/ ## Volume Now that we've got a nice clean corpus, let's start playing with it! The obvious first question: how many messages did people send in total? ><((°> cat data/messages | wc -l 165368 That's almost 10,000 messages per hour! And since there were periods of almost no activity before, between, and after the two marathons it means the rate *during* them was well over that! Who talked the most? ><((°> cat data/messages | cuts -f 2 | sort | uniq -c | sort -nr | head -5 269 fuscia13 259 almightypainter 239 sabrinamywaifu 235 roudydogg1 201 ionone Some talkative folks (though honestly I expected a bit higher numbers here). [cuts][] is "**cut** on **s**paces" — a little function I use so I don't have to type `-d ' '` all the time. [cuts]: https://bitbucket.org/sjl/dotfiles/src/default/fish/functions/cuts.fish ## N-grams The chat has spawned a bunch of its own memes and jargon. I made another ugly Python script to split up messages into [n-grams] so we can analyze them more easily: ```python import sys import nltk def window(coll, size): '''Generate a "sliding window" of tuples of size l over coll. coll must be sliceable and have a fixed len. ''' coll_len = len(coll) for i in range(coll_len): if i + size > coll_len: break else: yield tuple(coll[i:i+size]) for line in sys.stdin: timestamp, nick, msg = line.split(' ', 2) n = int(sys.argv[1]) for ngram in set(window(nltk.word_tokenize(msg), n)): print(timestamp, nick, '__'.join(ngram)) ``` This lets us easily split a message into unigrams: ><((°> echo "1447680000 sjl beat the devil out of it" | python src/split.py 1 1447680000 sjl it 1447680000 sjl the 1447680000 sjl beat 1447680000 sjl of 1447680000 sjl out 1447680000 sjl devil The order of n-grams within a message isn't preserved because the splitting script uses a `set` to remove duplicate n-grams. I wanted to remove dupes because it turns out people frequently copy and paste the same word many times in a single message and I didn't want that to throw off the numbers. Bigrams are just as easy — just change the parameter to `split.py`: ><((°> echo "1447680000 sjl beat the devil out of it" | python src/split.py 2 1447680000 sjl of__it 1447680000 sjl the__devil 1447680000 sjl beat__the 1447680000 sjl devil__out 1447680000 sjl out__of N-grams are joined with double underscores to make them easier to plot later. So what are the most frequent unigrams? ><((°> cat data/words | cuts -f3 | sort | uniq -c | sort -nr | head -15 19523 bob 14367 ruined 11961 kappaross 11331 ! 10989 gg 7666 is 6592 the 6305 ? 6090 i 5376 biblethump 5240 devil 5122 saved 5075 rip 4813 it 4727 a Some of these are expected, like "Bob" and stopwords like "is" and "the". The chat loves to spam "RUINED" whenever Bob makes a drastic change to the painting that looks awful at first, and then spam "SAVED" once he applies a bit more paint and it looks beautiful. This happens frequently with mountains. "KappaRoss" and "BibleThump" are [Twitch emotes][kappa] that produce small images in the chat. When Bob cleans his brush he beats it against the leg of the easel to remove the paint thinner, and he often smiles and says "just beat the devil out of it". It didn't take long before chat started spamming "RIP DEVIL" every time he cleans the brush. How about the most frequent bigrams and trigrams? ><((°> cat data/bigrams | cuts -f3 | sort | uniq -c | sort -nr | head -15 3731 rip__devil 3153 !__! 2660 bob__ross 2490 hi__bob 1844 <__3 1838 kappaross__kappaross 1533 bob__is 1409 bob__! 1389 god__bless 1324 happy__little 1181 van__dyke 1093 gg__wp 1024 is__back 908 i__believe 895 ?__? ><((°> cat data/trigrams | cuts -f3 | sort | uniq -c | sort -nr | head -15 2130 !__!__! 1368 kappaross__kappaross__kappaross 678 van__dyke__brown 617 bob__is__back 548 ?__?__? 503 biblethump__biblethump__biblethump 401 bob__ross__is 377 hi__bob__! 376 beat__the__devil 361 bob__!__! 331 bob__<__3 324 <__3__< 324 3__<__3 303 i__love__you 302 son__of__a Looks like lots of love for Bob and no sympathy for the devil. It also seems like [Van Dyke Brown][vdb] is Twitch chat's favorite color by a landslide. Note that the exact n-grams depend on the tokenization method. I used NLTK's `word_tokenize` because it was easy and worked pretty well. `wordpunct_tokenize` also works, but it splits up basic punctuation a bit too much for my liking (e.g. it turns `bob's` into three tokens `bob`, `'`, and `s`, where `word_tokenize` produces just `bob` and `'s`). [n-grams]: https://en.wikipedia.org/wiki/N-gram [kappa]: https://fivethirtyeight.com/features/why-a-former-twitch-employee-has-one-of-the-most-reproduced-faces-ever/ [vdb]: https://www.bobross.com/ProductDetails.asp?ProductCode=VanDykeBrown ## Graphing Pure numbers are interesting, but [can be misleading][aq]. Let's make some graphs to get a sense of what the data feels like. I'm using [gnuplot][] to make the graphs. What does the overall volume look like? We'll use minute-wide buckets in the x axis to make the graph a bit easier to read. [](/media/images/blog/2015/11/hlw-total-large.png) Can you tell where the two marathons start and end? Let's try to identify where episodes start and finish. Chat usually spams "hi bob" when an episode starts and "gg" when it finishes, so let's plot those. We'll use 30-second x buckets here because a minute isn't a fine enough resolution for the events we're looking for. To make it easier to read we'll just look at the first half of the first marathon. [](/media/images/blog/2015/11/hlw-higg-large.png) This works pretty well! The graph starts with a big spike of "hi bob", then as each episode finishes we see a (huge) spike of "gg", followed immediately by a round of "hi bob" as the next episode starts. Can we find all the times Bob cleaned his brush? [](/media/images/blog/2015/11/hlw-ripdevil-large.png) Looks like the devil isn't having a very good time. It's encouraging that the two seasons have roughly the same structure (three main clusters of peaks). Note that there are a couple of smaller peaks between the two showings. Twitch showed another streamer painting between the two marathons, so it's likely that she cleaned her brush a couple of times and the chat responded. Fewer people were watching the stream during the break, hence the smaller peaks. When did Bob get the most love? We'll use 5-minute x bins here because we just want a general idea. [](/media/images/blog/2015/11/hlw-love-large.png) Lots of love all around, but especially as he signed off at the end. One of my favorite moments was when Bob said something about "changing your mind in mid **stream**" and the chat started spamming conspiracy theories about how he somehow knew about the stream 30 years in the past: [](/media/images/blog/2015/11/hlw-heknew-large.png) [aq]: https://en.wikipedia.org/wiki/Anscombe%27s_quartet [gnuplot]: http://www.gnuplot.info/ ## Up Next Poking around at this chat corpus was a lot of fun (and *definitely* counts as studying for my NLP final, *definitely*). I'll probably record the chat during next week's marathon and do some more poking, specifically around finding unique events (e.g. his son Steve coming on the show) by comparing rate percentiles. If you've got other ideas for things I should graph, [let me know][twitter]. [twitter]: http://twitter.com/stevelosh