# HG changeset patch # User Steve Losh # Date 1448899750 0 # Node ID c5f5f1d86f24197bf334aaf9d77b53f3f714a3b8 # Parent 2db5b638dc685c8558df5f5908b1db04ef35c433 just beat the data out of it diff -r 2db5b638dc68 -r c5f5f1d86f24 content/blog/2015/11/beat-the-data.html --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/content/blog/2015/11/beat-the-data.html Mon Nov 30 16:09:10 2015 +0000 @@ -0,0 +1,271 @@ + {% extends "_post.html" %} + + {% hyde + title: "Just Beat the Data Out of It" + snip: "Round two of the Bob Ross Twitch chat analysis." + created: 2015-11-30 16:10:00 + %} + + {% block article %} + +[Last week][last-week] we played around with a transcript of the Bob Ross Twitch +chat during the Season 2 marathon. I scraped the chat again last Monday to get +the transcript for the Season 3 marathon, so let's pick up where we left off. + +[last-week]: blog/2015/11/happy-little-words/ + +[TOC] + +## Volume Comparison + +Was this week busier or quieter than last week? + +[![Season 2 and 3 chat volume comparison](/media/images{{ parent_url }}/btd-volume-comparison.png)](/media/images{{ parent_url }}/btd-volume-comparison-large.png) + +Note the separate x axes to line up the start and end times of the logs. Also +two-minute buckets were used to make things a bit cleaner to look at on this +crowded graph (see the y axis label). + +Seems like this was a bit quieter than last week. It's encouraging that the +basic structure looks the same -- this hints that there are some patterns +waiting to be discovered. + +## Spiky N-grams + +Last week we looked at graphs of various ngrams and saw that some of them show +pretty clear patterns. The end of each episode brings a flood of `gg`, and when +Bob's son Steve comes on the show we get a big spike in `steve`: + +[![Plot of "gg" and "steve" unigrams in Season 2](/media/images{{ parent_url }}/btd-s2-ggsteve.png)](/media/images{{ parent_url }}/btd-s2-ggsteve-large.png) + +It's reasonable to expect the same behavior this week. 
What did we get? + +[![Plot of "gg" and "steve" unigrams in Season 3](/media/images{{ parent_url }}/btd-s3-ggsteve.png)](/media/images{{ parent_url }}/btd-s3-ggsteve-large.png) + +Looks pretty similar! In fact the `steve` plot is even more obvious this week. +And in both cases the second streaming of the season repeats the pattern seen +in the first. + +Each week between the two seasons the channel "hosts" another painter. This +just means that it "pipes through" another streamer's channel so people don't +get bored. + +This week whoever is in charge of picking the guest stream did a shitty job. +After the first showing ended viewers were assaulted with the most loud, +obnoxious manchild on the planet. + +The chat was not pleased: + +[![The Douche-o-Meter™](/media/images{{ parent_url }}/btd-s3-douche.png)](/media/images{{ parent_url }}/btd-s3-douche-large.png) + +Thankfully whoever manages Bob's channel mercy-killed the hosting after 10 +minutes or so, and we enjoyed the blissful silence. + +So we've seen that the rates of certain n-grams have clear patterns. If we're +interested in a particular n-gram that's great -- we can graph it and take +a look. But what if we want to *find* interesting n-grams to look at, without +having to watch the whole marathon (or comb through the logs)? + +## Percentiles + +[Percentiles][] are a really useful measurement in a lot of fields, so let's +take a look at them here. We'll start with a relatively common n-gram like +"the": + +[![Percentile graph of "the" in Season 3](/media/images{{ parent_url }}/btd-s3-percentile-the.png)](/media/images{{ parent_url }}/btd-s3-percentile-the-large.png) + +Here we've got a pretty smooth gradation from the lower percentiles up to the +higher ones. Note that these are rates of `the` per minute, so the value `11` +at `50` means that half of all 2-minute bins recorded had eleven or fewer +instances of `the`. 
This seems low for English text, but a lot of the messages +in the Bob Ross chat are one or two-word slang -- full sentences are rare. + +If we go back to the normal n-gram plot of `the` we can see that it's not a very +"spiky" word: + +[![Plot of "the" unigram in Season 3](/media/images{{ parent_url }}/btd-s3-the.png)](/media/images{{ parent_url }}/btd-s3-the-large.png) + +Let's look at another common word, `bob`: + +[![Percentile graph of "bob"](/media/images{{ parent_url }}/btd-s3-percentile-bob.png)](/media/images{{ parent_url }}/btd-s3-percentile-bob-large.png) + +Pretty smooth, though it's a little bit steeper at the end (probably because of +the deluge of `hi bob` when an episode starts). N-gram plot for comparison: + +[![Plot of "bob" unigram in Season 3](/media/images{{ parent_url }}/btd-s3-bob.png)](/media/images{{ parent_url }}/btd-s3-bob-large.png) + +What about an n-gram we *know* represents a mostly-unique event, like `steve`? +We would expect the graph of percentiles to look steeper, because the lower and +middle percentiles would be very low and the highest few would skyrocket. + +[![Percentile graph of "steve" in Season 3](/media/images{{ parent_url }}/btd-s3-percentile-steve.png)](/media/images{{ parent_url }}/btd-s3-percentile-steve-large.png) + +[![Plot of "steve" unigram in Season 3](/media/images{{ parent_url }}/btd-s3-steve.png)](/media/images{{ parent_url }}/btd-s3-steve-large.png) + +We've tentatively identified another pattern in the data, but how can it help us +find new interesting terms? 
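If you'd like to poke at percentiles yourself, here's a minimal Python sketch. It is not the code behind the graphs in this post: the `percentile` helper, the nearest-rank method, and the toy per-bin counts are all assumptions made up for illustration.

```python
import math


def percentile(counts, p):
    """Nearest-rank percentile of a list of per-bin counts (an assumed method)."""
    if not counts:
        raise ValueError("need at least one bin")
    ordered = sorted(counts)
    # Rank of the smallest value that has at least p% of the bins at or below it.
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical per-bin counts, NOT taken from the real chat logs.
steady = [9, 11, 10, 12, 11, 10, 13, 11, 12, 10]  # a common word like "the"
spiky = [0, 0, 0, 0, 0, 0, 0, 0, 1, 40]           # a one-off event like "steve"

for name, counts in (("steady", steady), ("spiky", spiky)):
    print(name, [percentile(counts, p) for p in (50, 90, 100)])
```

In this toy data the steady word's percentiles climb gently, while the spiky word sits at or near zero until the very top, which is the shape we saw in the `steve` graph.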
+ +[percentiles]: https://en.wikipedia.org/wiki/Percentile + +## Spikiness Scores + +If we look at the percentiles for a few known-spiky terms we can see a pattern: + +[![Percentile graph of "steve" in Season 3](/media/images{{ parent_url }}/btd-s3-percentile-steve.png)](/media/images{{ parent_url }}/btd-s3-percentile-steve-large.png) + +[![Percentile graph of "drugs" in Season 3](/media/images{{ parent_url }}/btd-s3-percentile-drugs.png)](/media/images{{ parent_url }}/btd-s3-percentile-drugs-large.png) + +[![Percentile graph of "cringe" in Season 3](/media/images{{ parent_url }}/btd-s3-percentile-cringe.png)](/media/images{{ parent_url }}/btd-s3-percentile-cringe-large.png) + +The top percentile or two have some volume, but it quickly drops away to +nothingness within five or ten percent. So let's try to define a really basic +"spikiness score" that we can work out for all n-grams: + +$$ {\text{Spikiness}}(w) = \frac{P_{100}(w)}{P_{95}(w) + 0.1} $$ + +We'll start by saying that the spikiness score of a word is the value of the +100th percentile for that word, divided by the 95th percentile (plus a small +smoothing factor to avoid division by zero). Let's try some words: + + :::text + the 1.78 + bob 2.39 + steve 4.67 + drugs 30.00 + cringe 60.00 + +This doesn't look too terrible. The words we consider spiky are all scored +higher than the non-spiky ones, but it's not quite there yet. `steve` is rated +pretty low even though we consider it to be spiky. + +When we made our initial formula we arbitrarily picked the 100th and 95th +percentiles out of thin air. What if we choose the 99th and 90th instead? + +$$ {\text{Spikiness}}(w) = \frac{P_{99}(w)}{P_{90}(w) + 0.1} $$ + + :::text + bob 3.56 + the 1.42 + steve 77.27 + cringe 20.00 + drugs 10.00 + +This has changed the scores quite a bit, and now they're more like what we want. +But again, we just picked the two percentiles out of thin air. 
It would be nice +if we could get a feel for how the choice of percentiles affects our spikiness +scores. Once again, let's turn to gnuplot. We'll generalize our function: + +$$ {\text{Spikiness}}(w, L, U) = \frac{P_{U}(w)}{P_{L}(w) + 0.1} $$ + +And graph it for all the combinations of percentiles for a couple of words we +know: + +[![Spikiness percentile sensitivity plot for "the"](/media/images{{ parent_url }}/btd-ssp-the.png)](/media/images{{ parent_url }}/btd-ssp-the-large.png) + +[![Spikiness percentile sensitivity plot for "bob"](/media/images{{ parent_url }}/btd-ssp-bob.png)](/media/images{{ parent_url }}/btd-ssp-bob-large.png) + +[![Spikiness percentile sensitivity plot for "steve"](/media/images{{ parent_url }}/btd-ssp-steve.png)](/media/images{{ parent_url }}/btd-ssp-steve-large.png) + +[![Spikiness percentile sensitivity plot for "rip devil"](/media/images{{ parent_url }}/btd-ssp-rip__devil.png)](/media/images{{ parent_url }}/btd-ssp-rip__devil-large.png) + +These graphs are approaching the point of being impossible to read, but we can +definitely see a pattern. In the first two graphs (common words) the only way +to get a high spikiness score is to choose our formula's lower percentile to be +*really* low (15th percentile or lower). + +In the second two graphs (spiky words) we can see that the score is high when +the upper percentile is 99th or 100th, and the lower percentile is beneath the +90th (or thereabouts). + +Now that we have a hypothesis let's try a couple more plots to see if it still +holds: + +[![Spikiness percentile sensitivity plot for "gg"](/media/images{{ parent_url }}/btd-ssp-gg.png)](/media/images{{ parent_url }}/btd-ssp-gg-large.png) + +`gg` does come in spikes, but it happens so often that we need to select +a smaller lower percentile if we want it to be considered spiky. Whether we +want to depends on what we're looking for -- if we want *rare* events then we +probably want to exclude it. 
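To get a feel for the generalized function without squinting at the plots, here's a rough Python sketch of `Spikiness(w, L, U)`. The `0.1` smoothing factor comes from the formula above; everything else (the nearest-rank `percentile` helper, the `sweep` function, and the toy counts) is an assumption for illustration, not the code that produced these graphs.

```python
import math

SMOOTHING = 0.1  # small constant from the formula, avoids division by zero


def percentile(counts, p):
    """Nearest-rank percentile of per-bin counts (an assumed method)."""
    ordered = sorted(counts)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def spikiness(counts, lower, upper):
    """P_upper / (P_lower + 0.1) for one term's per-bin counts."""
    return percentile(counts, upper) / (percentile(counts, lower) + SMOOTHING)


def sweep(counts, lowers=(10, 50, 90, 95), uppers=(99, 100)):
    """Score one term for every (lower, upper) percentile pair."""
    return {(lo, up): spikiness(counts, lo, up) for up in uppers for lo in lowers}


# Hypothetical per-bin counts, NOT taken from the real chat logs.
steady = [9, 11, 10, 12, 11, 10, 13, 11, 12, 10]
spiky = [0, 0, 0, 0, 0, 0, 0, 0, 1, 40]

for name, counts in (("steady", steady), ("spiky", spiky)):
    for (lo, up), score in sorted(sweep(counts).items()):
        print(f"{name:6} L={lo:3} U={up:3} {score:8.2f}")
```

With only ten bins the higher percentiles all land on the same bin, so treat the numbers as a toy. The useful part is watching how the score for a spiky term collapses once the lower percentile creeps up into the spike itself.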
+ +`ruined` gets spammed so much that it's certainly not rare, and isn't even +particularly spiky in any way: + +[![Spikiness percentile sensitivity plot for "ruined"](/media/images{{ parent_url }}/btd-ssp-ruined.png)](/media/images{{ parent_url }}/btd-ssp-ruined-large.png) + +[![Plot of "ruined" unigram in Season 3](/media/images{{ parent_url }}/btd-s3-ruined.png)](/media/images{{ parent_url }}/btd-s3-ruined-large.png) + +So it looks like we're at least on a reasonable track here. Let's settle on the +100th and 90th for now and see where they lead. + +There's one other addition to our spikiness formula we should make before moving +on: if the 100th percentile of a term is small (e.g. less than 5) then while it +might technically be spiky, we probably don't care about it. So we'll just drop +those on the floor and not really worry about them. + +$$ {\text{Spikiness}}(w) = \begin{cases} 0& {\text{if}}\ P_{100}(w) < 5 \\ \frac{P_{100}(w)}{P_{90}(w) + 0.1}& {\text{otherwise}} \end{cases} $$ + +## Results + +Now that we've got a way to measure a term's spikiness, we can calculate it for +all n-grams and sort to find some interesting ones. Let's try it with bigrams: + + :::text + mouth__noises 680.00 + (__mouth 520.00 + soft__music 480.00 + elevator__music 480.00 + noises__) 470.00 + believe__biblethump 460.00 + cool__elevator 450.00 + soft__rock 390.00 + smooth__soft 390.00 + smooth__jazz 380.00 + relaxing__guitar 360.00 + guitar__music 360.00 + son__of 330.00 + music__) 330.00 + (__soft 330.00 + a__gun 320.00 + (__relaxing 320.00 + big__shaft 300.00 + super__steve 290.00 + jazz__music 280.00 + crazy__day 280.00 + zoop__zoop 270.00 + the__heck 270.00 + (__smooth 260.00 + flat__trees 240.00 + steve__! 220.00 + hi__steve 220.00 + ... + +We can get similar results for unigrams, trigrams, etc. Let's graph a couple of +these highly-spiky terms. 
Twitch chat definitely loves innuendo: + +[![Plot of vaguely sexual n-grams in Season 3](/media/images{{ parent_url }}/btd-s3-innuendo.png)](/media/images{{ parent_url }}/btd-s3-innuendo-large.png) + +Something new this week was the addition of captions, which sometimes included +things like `(soft music)` and `(mouth noises)`. The chat liked to poke fun at +those: + +[![Plot of "soft music" and "mouth noises" bigrams in Season 3](/media/images{{ parent_url }}/btd-s3-mouthnoises.png)](/media/images{{ parent_url }}/btd-s3-mouthnoises-large.png) + +We can also see some particular elements of paintings: + +[![Plot of subject n-grams in Season 3](/media/images{{ parent_url }}/btd-s3-subjects.png)](/media/images{{ parent_url }}/btd-s3-subjects-large.png) + +The lists aren't perfect. They contain a lot of redundant stuff (e.g. `(soft +music)` produces 3 separate bigrams that are all equally spiky), and there's +a bunch of stuff we don't care about as much. But if you're looking to find +some interesting terms they can at least give you a starting point. + +## Join the Fun + +I'm posting this right as the Season 4 marathon is going live on [the Bob Ross +Twitch channel][brtwitch]. If you've got some time feel free to pull up your +comfy computer chair and join a few thousand other people for a relaxing evening +with Bob! 
+ +[brtwitch]: http://twitch.tv/BobRoss + + {% endblock article %} diff -r 2db5b638dc68 -r c5f5f1d86f24 layout/skeleton/_base.html --- a/layout/skeleton/_base.html Sat Nov 21 11:51:21 2015 +0000 +++ b/layout/skeleton/_base.html Mon Nov 30 16:09:10 2015 +0000 @@ -27,6 +27,7 @@ + {% block extra_js %}{% endblock %} {% endblock %} diff -r 2db5b638dc68 -r c5f5f1d86f24 media/images/blog/2015/11/btd-s2-ggsteve-large.png Binary file media/images/blog/2015/11/btd-s2-ggsteve-large.png has changed diff -r 2db5b638dc68 -r c5f5f1d86f24 media/images/blog/2015/11/btd-s2-ggsteve.png Binary file media/images/blog/2015/11/btd-s2-ggsteve.png has changed diff -r 2db5b638dc68 -r c5f5f1d86f24 media/images/blog/2015/11/btd-s3-bob-large.png Binary file media/images/blog/2015/11/btd-s3-bob-large.png has changed diff -r 2db5b638dc68 -r c5f5f1d86f24 media/images/blog/2015/11/btd-s3-bob.png Binary file media/images/blog/2015/11/btd-s3-bob.png has changed diff -r 2db5b638dc68 -r c5f5f1d86f24 media/images/blog/2015/11/btd-s3-douche-large.png Binary file media/images/blog/2015/11/btd-s3-douche-large.png has changed diff -r 2db5b638dc68 -r c5f5f1d86f24 media/images/blog/2015/11/btd-s3-douche.png Binary file media/images/blog/2015/11/btd-s3-douche.png has changed diff -r 2db5b638dc68 -r c5f5f1d86f24 media/images/blog/2015/11/btd-s3-ggsteve-large.png Binary file media/images/blog/2015/11/btd-s3-ggsteve-large.png has changed diff -r 2db5b638dc68 -r c5f5f1d86f24 media/images/blog/2015/11/btd-s3-ggsteve.png Binary file media/images/blog/2015/11/btd-s3-ggsteve.png has changed diff -r 2db5b638dc68 -r c5f5f1d86f24 media/images/blog/2015/11/btd-s3-innuendo-large.png Binary file media/images/blog/2015/11/btd-s3-innuendo-large.png has changed diff -r 2db5b638dc68 -r c5f5f1d86f24 media/images/blog/2015/11/btd-s3-innuendo.png Binary file media/images/blog/2015/11/btd-s3-innuendo.png has changed diff -r 2db5b638dc68 -r c5f5f1d86f24 media/images/blog/2015/11/btd-s3-mouthnoises-large.png Binary file 
media/images/blog/2015/11/btd-s3-mouthnoises-large.png has changed diff -r 2db5b638dc68 -r c5f5f1d86f24 media/images/blog/2015/11/btd-s3-mouthnoises.png Binary file media/images/blog/2015/11/btd-s3-mouthnoises.png has changed diff -r 2db5b638dc68 -r c5f5f1d86f24 media/images/blog/2015/11/btd-s3-percentile-bob-large.png Binary file media/images/blog/2015/11/btd-s3-percentile-bob-large.png has changed diff -r 2db5b638dc68 -r c5f5f1d86f24 media/images/blog/2015/11/btd-s3-percentile-bob.png Binary file media/images/blog/2015/11/btd-s3-percentile-bob.png has changed diff -r 2db5b638dc68 -r c5f5f1d86f24 media/images/blog/2015/11/btd-s3-percentile-cringe-large.png Binary file media/images/blog/2015/11/btd-s3-percentile-cringe-large.png has changed diff -r 2db5b638dc68 -r c5f5f1d86f24 media/images/blog/2015/11/btd-s3-percentile-cringe.png Binary file media/images/blog/2015/11/btd-s3-percentile-cringe.png has changed diff -r 2db5b638dc68 -r c5f5f1d86f24 media/images/blog/2015/11/btd-s3-percentile-drugs-large.png Binary file media/images/blog/2015/11/btd-s3-percentile-drugs-large.png has changed diff -r 2db5b638dc68 -r c5f5f1d86f24 media/images/blog/2015/11/btd-s3-percentile-drugs.png Binary file media/images/blog/2015/11/btd-s3-percentile-drugs.png has changed diff -r 2db5b638dc68 -r c5f5f1d86f24 media/images/blog/2015/11/btd-s3-percentile-steve-large.png Binary file media/images/blog/2015/11/btd-s3-percentile-steve-large.png has changed diff -r 2db5b638dc68 -r c5f5f1d86f24 media/images/blog/2015/11/btd-s3-percentile-steve.png Binary file media/images/blog/2015/11/btd-s3-percentile-steve.png has changed diff -r 2db5b638dc68 -r c5f5f1d86f24 media/images/blog/2015/11/btd-s3-percentile-the-large.png Binary file media/images/blog/2015/11/btd-s3-percentile-the-large.png has changed diff -r 2db5b638dc68 -r c5f5f1d86f24 media/images/blog/2015/11/btd-s3-percentile-the.png Binary file media/images/blog/2015/11/btd-s3-percentile-the.png has changed diff -r 2db5b638dc68 -r 
c5f5f1d86f24 media/images/blog/2015/11/btd-s3-ruined-large.png Binary file media/images/blog/2015/11/btd-s3-ruined-large.png has changed diff -r 2db5b638dc68 -r c5f5f1d86f24 media/images/blog/2015/11/btd-s3-ruined.png Binary file media/images/blog/2015/11/btd-s3-ruined.png has changed diff -r 2db5b638dc68 -r c5f5f1d86f24 media/images/blog/2015/11/btd-s3-steve-large.png Binary file media/images/blog/2015/11/btd-s3-steve-large.png has changed diff -r 2db5b638dc68 -r c5f5f1d86f24 media/images/blog/2015/11/btd-s3-steve.png Binary file media/images/blog/2015/11/btd-s3-steve.png has changed diff -r 2db5b638dc68 -r c5f5f1d86f24 media/images/blog/2015/11/btd-s3-subjects-large.png Binary file media/images/blog/2015/11/btd-s3-subjects-large.png has changed diff -r 2db5b638dc68 -r c5f5f1d86f24 media/images/blog/2015/11/btd-s3-subjects.png Binary file media/images/blog/2015/11/btd-s3-subjects.png has changed diff -r 2db5b638dc68 -r c5f5f1d86f24 media/images/blog/2015/11/btd-s3-the-large.png Binary file media/images/blog/2015/11/btd-s3-the-large.png has changed diff -r 2db5b638dc68 -r c5f5f1d86f24 media/images/blog/2015/11/btd-s3-the.png Binary file media/images/blog/2015/11/btd-s3-the.png has changed diff -r 2db5b638dc68 -r c5f5f1d86f24 media/images/blog/2015/11/btd-ssp-bob-large.png Binary file media/images/blog/2015/11/btd-ssp-bob-large.png has changed diff -r 2db5b638dc68 -r c5f5f1d86f24 media/images/blog/2015/11/btd-ssp-bob.png Binary file media/images/blog/2015/11/btd-ssp-bob.png has changed diff -r 2db5b638dc68 -r c5f5f1d86f24 media/images/blog/2015/11/btd-ssp-gg-large.png Binary file media/images/blog/2015/11/btd-ssp-gg-large.png has changed diff -r 2db5b638dc68 -r c5f5f1d86f24 media/images/blog/2015/11/btd-ssp-gg.png Binary file media/images/blog/2015/11/btd-ssp-gg.png has changed diff -r 2db5b638dc68 -r c5f5f1d86f24 media/images/blog/2015/11/btd-ssp-rip__devil-large.png Binary file media/images/blog/2015/11/btd-ssp-rip__devil-large.png has changed diff -r 2db5b638dc68 
-r c5f5f1d86f24 media/images/blog/2015/11/btd-ssp-rip__devil.png Binary file media/images/blog/2015/11/btd-ssp-rip__devil.png has changed diff -r 2db5b638dc68 -r c5f5f1d86f24 media/images/blog/2015/11/btd-ssp-ruined-large.png Binary file media/images/blog/2015/11/btd-ssp-ruined-large.png has changed diff -r 2db5b638dc68 -r c5f5f1d86f24 media/images/blog/2015/11/btd-ssp-ruined.png Binary file media/images/blog/2015/11/btd-ssp-ruined.png has changed diff -r 2db5b638dc68 -r c5f5f1d86f24 media/images/blog/2015/11/btd-ssp-steve-large.png Binary file media/images/blog/2015/11/btd-ssp-steve-large.png has changed diff -r 2db5b638dc68 -r c5f5f1d86f24 media/images/blog/2015/11/btd-ssp-steve.png Binary file media/images/blog/2015/11/btd-ssp-steve.png has changed diff -r 2db5b638dc68 -r c5f5f1d86f24 media/images/blog/2015/11/btd-ssp-the-large.png Binary file media/images/blog/2015/11/btd-ssp-the-large.png has changed diff -r 2db5b638dc68 -r c5f5f1d86f24 media/images/blog/2015/11/btd-ssp-the.png Binary file media/images/blog/2015/11/btd-ssp-the.png has changed diff -r 2db5b638dc68 -r c5f5f1d86f24 media/images/blog/2015/11/btd-volume-comparison-large.png Binary file media/images/blog/2015/11/btd-volume-comparison-large.png has changed diff -r 2db5b638dc68 -r c5f5f1d86f24 media/images/blog/2015/11/btd-volume-comparison.png Binary file media/images/blog/2015/11/btd-volume-comparison.png has changed