
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/content/blog/2015/11/beat-the-data.html	Mon Nov 30 16:09:10 2015 +0000
@@ -0,0 +1,271 @@
+    {% extends "_post.html" %}
+
+    {% hyde
+        title: "Just Beat the Data Out of It"
+        snip: "Round two of the Bob Ross Twitch chat analysis."
+        created: 2015-11-30 16:10:00
+    %}
+
+    {% block article %}
+
+[Last week][last-week] we played around with a transcript of the Bob Ross Twitch
+chat during the Season 2 marathon.  I scraped the chat again last Monday to get
+the transcript for the Season 3 marathon, so let's pick up where we left off.
+
+[last-week]: blog/2015/11/happy-little-words/
+
+[TOC]
+
+## Volume Comparison
+
+Was this week busier or quieter than last week?
+
+[![Season 2 and 3 chat volume comparison](/media/images{{ parent_url }}/btd-volume-comparison.png)](/media/images{{ parent_url }}/btd-volume-comparison-large.png)
+
+Note the separate x axes to line up the start and end times of the logs.  Also,
+two-minute buckets were used to make things a bit cleaner to look at on this
+crowded graph (see the y axis label).
+
+Seems like this was a bit quieter than last week.  It's encouraging that the
+basic structure looks the same -- this hints that there are some patterns
+waiting to be discovered.
+
+## Spiky N-grams
+
+Last week we looked at graphs of various n-grams and saw that some of them show
+pretty clear patterns.  The end of each episode brings a flood of `gg`, and when
+Bob's son Steve comes on the show we get a big spike in `steve`:
+
+[![Plot of "gg" and "steve" unigrams in Season 2](/media/images{{ parent_url }}/btd-s2-ggsteve.png)](/media/images{{ parent_url }}/btd-s2-ggsteve-large.png)
+
+It's reasonable to expect the same behavior this week.  What did we get?
+
+[![Plot of "gg" and "steve" unigrams in Season 3](/media/images{{ parent_url }}/btd-s3-ggsteve.png)](/media/images{{ parent_url }}/btd-s3-ggsteve-large.png)
+
+Looks pretty similar!  In fact the `steve` plot is even more obvious this week.
+And in both cases the second streaming of the season repeats the pattern seen
+in the first.
+
+Each week between the two seasons the channel "hosts" another painter.  This
+just means that it "pipes through" another streamer's channel so people don't
+get bored.
+
+This week whoever is in charge of picking the guest stream did a shitty job.
+After the first showing ended viewers were assaulted with the most loud,
+obnoxious manchild on the planet.
+
+The chat was not pleased:
+
+[![The Douche-o-Meter™](/media/images{{ parent_url }}/btd-s3-douche.png)](/media/images{{ parent_url }}/btd-s3-douche-large.png)
+
+Thankfully whoever manages Bob's channel mercy-killed the hosting after 10
+minutes or so, and we enjoyed the blissful silence.
+
+So we've seen that the rates of certain n-grams have clear patterns.  If we're
+interested in a particular n-gram that's great -- we can graph it and take
+a look.  But what if we want to *find* interesting n-grams to look at, without
+having to watch the whole marathon (or comb through the logs)?
+
+## Percentiles
+
+[Percentiles][] are a really useful measurement in a lot of fields, so let's
+take a look at them here.  We'll start with a relatively common n-gram like
+"the":
+
+[![Percentile graph of "the" in Season 3](/media/images{{ parent_url }}/btd-s3-percentile-the.png)](/media/images{{ parent_url }}/btd-s3-percentile-the-large.png)
+
+Here we've got a pretty smooth gradation from the lower percentiles up to the
+higher ones.  Note that these are rates of `the` per minute, so the value `11`
+at the 50th percentile means that half of all 2-minute bins recorded had eleven or fewer
+instances of `the`.  This seems low for English text, but a lot of the messages
+in the Bob Ross chat are one- or two-word slang -- full sentences are rare.
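+
To make the percentile idea concrete, here's a minimal Python sketch.  Nothing in it comes from the scripts behind these graphs: the function names, the plain second-based timestamps, and the nearest-rank percentile definition are all assumptions made for illustration.

```python
import math
from collections import Counter


def bucket_counts(timestamps, bucket_seconds=120):
    """Count events per fixed-width time bucket, keeping empty buckets as 0.

    `timestamps` is a list of event times in seconds (hypothetical input format).
    """
    if not timestamps:
        return []
    start = min(timestamps)
    counts = Counter(int((t - start) // bucket_seconds) for t in timestamps)
    last_bucket = max(counts)
    return [counts.get(i, 0) for i in range(last_bucket + 1)]


def percentile(values, p):
    """Nearest-rank percentile: the smallest value such that at least
    p percent of the values are less than or equal to it."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("percentile of an empty list is undefined")
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Four sightings of a term, spread over five two-minute buckets.
counts = bucket_counts([0, 10, 130, 500])
median = percentile(counts, 50)
```

Here the bucket counts come out as `[2, 1, 0, 0, 1]` and the 50th percentile is `1`: at least half of the buckets saw one or fewer instances of the term.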
+
+If we go back to the normal n-gram plot of `the` we can see that it's not a very
+"spiky" word:
+
+[![Plot of "the" unigram in Season 3](/media/images{{ parent_url }}/btd-s3-the.png)](/media/images{{ parent_url }}/btd-s3-the-large.png)
+
+Let's look at another common word, `bob`:
+
+[![Percentile graph of "bob"](/media/images{{ parent_url }}/btd-s3-percentile-bob.png)](/media/images{{ parent_url }}/btd-s3-percentile-bob-large.png)
+
+Pretty smooth, though it's a little bit steeper at the end (probably because of
+the deluge of `hi bob` when an episode starts).  N-gram plot for comparison:
+
+[![Plot of "bob" unigram in Season 3](/media/images{{ parent_url }}/btd-s3-bob.png)](/media/images{{ parent_url }}/btd-s3-bob-large.png)
+
+What about an n-gram we *know* represents a mostly-unique event, like `steve`?
+We would expect the graph of percentiles to look steeper, because the lower and
+middle percentiles would be very low and the highest few would skyrocket.
+
+[![Percentile graph of "steve" in Season 3](/media/images{{ parent_url }}/btd-s3-percentile-steve.png)](/media/images{{ parent_url }}/btd-s3-percentile-steve-large.png)
+
+[![Plot of "steve" unigram in Season 3](/media/images{{ parent_url }}/btd-s3-steve.png)](/media/images{{ parent_url }}/btd-s3-steve-large.png)
+
+We've tentatively identified another pattern in the data, but how can it help us
+find new interesting terms?
+
+[percentiles]: https://en.wikipedia.org/wiki/Percentile
+
+## Spikiness Scores
+
+If we look at the percentiles for a few known-spiky terms we can see a pattern:
+
+[![Percentile graph of "steve" in Season 3](/media/images{{ parent_url }}/btd-s3-percentile-steve.png)](/media/images{{ parent_url }}/btd-s3-percentile-steve-large.png)
+
+[![Percentile graph of "drugs" in Season 3](/media/images{{ parent_url }}/btd-s3-percentile-drugs.png)](/media/images{{ parent_url }}/btd-s3-percentile-drugs-large.png)
+
+[![Percentile graph of "cringe" in Season 3](/media/images{{ parent_url }}/btd-s3-percentile-cringe.png)](/media/images{{ parent_url }}/btd-s3-percentile-cringe-large.png)
+
+The top percentile or two have some volume, but it quickly drops away to
+nothingness within five or ten percent.  So let's try to define a really basic
+"spikiness score" that we can work out for all n-grams:
+
+$$ {\text{Spikiness}}(w) = \frac{P_{100}(w)}{P_{95}(w) + 0.1} $$
+
+We'll start by saying that the spikiness score of a word is the value of the
+100th percentile for that word, divided by the 95th percentile (plus a small
+smoothing factor to avoid division by zero).  Let's try some words:
+
+    :::text
+    the     1.78
+    bob     2.39
+    steve   4.67
+    drugs  30.00
+    cringe 60.00
+
+This doesn't look too terrible.  The words we consider spiky are all scored
+higher than the non-spiky ones, but it's not quite there yet.  `steve` is rated
+pretty low even though we consider it to be spiky.
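+
As a rough sketch, the first formula might be computed like this in Python.  The names are hypothetical, and the nearest-rank percentile is an assumption; a different percentile definition would shift the scores somewhat.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile (an assumed definition)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def spikiness(counts, upper=100, lower=95, smoothing=0.1):
    """P_upper / (P_lower + smoothing) over a term's per-bucket counts."""
    return percentile(counts, upper) / (percentile(counts, lower) + smoothing)


# A made-up term that is silent in 99 buckets and shows up 6 times in one.
rare = [0] * 99 + [6]
# A made-up term that hovers around 10 per bucket the whole time.
steady = [9, 10, 11, 10] * 25

rare_score = spikiness(rare)      # 6 / (0 + 0.1), roughly 60
steady_score = spikiness(steady)  # 11 / (11 + 0.1), just under 1
```

Passing `upper=99, lower=90` gives the variant with the 99th and 90th percentiles.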
+
+When we made our initial formula we arbitrarily picked the 100th and 95th
+percentiles out of thin air.  What if we choose the 99th and 90th instead?
+
+$$ {\text{Spikiness}}(w) = \frac{P_{99}(w)}{P_{90}(w) + 0.1} $$
+
+    :::text
+    bob     3.56
+    the     1.42
+    steve  77.27
+    cringe 20.00
+    drugs  10.00
+
+This has changed the scores quite a bit, and now they're more like what we want.
+But again, we just picked the two percentiles out of thin air.  It would be nice
+if we could get a feel for how the choice of percentiles affects our spikiness
+scores.  Once again, let's turn to gnuplot.  We'll generalize our function:
+
+$$ {\text{Spikiness}}(w, L, U) = \frac{P_{U}(w)}{P_{L}(w) + 0.1} $$
+
+And graph it for all the combinations of percentiles for a couple of words we
+know:
+
+[![Spikiness percentile sensitivity plot for "the"](/media/images{{ parent_url }}/btd-ssp-the.png)](/media/images{{ parent_url }}/btd-ssp-the-large.png)
+
+[![Spikiness percentile sensitivity plot for "bob"](/media/images{{ parent_url }}/btd-ssp-bob.png)](/media/images{{ parent_url }}/btd-ssp-bob-large.png)
+
+[![Spikiness percentile sensitivity plot for "steve"](/media/images{{ parent_url }}/btd-ssp-steve.png)](/media/images{{ parent_url }}/btd-ssp-steve-large.png)
+
+[![Spikiness percentile sensitivity plot for "rip devil"](/media/images{{ parent_url }}/btd-ssp-rip__devil.png)](/media/images{{ parent_url }}/btd-ssp-rip__devil-large.png)
+
+These graphs are approaching the point of being impossible to read, but we can
+definitely see a pattern.  In the first two graphs (common words) the only way
+to get a high spikiness score is to choose our formula's lower percentile to be
+*really* low (15th percentile or lower).
+
+In the second two graphs (spiky words) we can see that the score is high when
+the upper percentile is 99th or 100th, and the lower percentile is beneath the
+90th (or thereabouts).
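+
One way to produce the data behind a sensitivity plot like these is to evaluate the generalized score over a grid of lower and upper percentiles.  This is a minimal Python sketch, assuming nearest-rank percentiles, a 5-point step, and only pairs where the upper percentile exceeds the lower one; none of those choices is confirmed by the plots themselves.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile (an assumed definition)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def spikiness(counts, lower, upper, smoothing=0.1):
    """Generalized score: P_upper / (P_lower + smoothing)."""
    return percentile(counts, upper) / (percentile(counts, lower) + smoothing)


def sensitivity_grid(counts, step=5):
    """Map every (lower, upper) pair with lower < upper to its score."""
    points = range(step, 101, step)
    return {
        (lower, upper): spikiness(counts, lower, upper)
        for lower in points
        for upper in points
        if upper > lower
    }


# A made-up spiky term: quiet for 90 buckets, then 10 per bucket for 10 buckets.
grid = sensitivity_grid([0] * 90 + [10] * 10)
high = grid[(90, 100)]  # lower percentile still sits in the quiet part
low = grid[(95, 100)]   # lower percentile has climbed into the spike
```

For this made-up term the score collapses from about 100 to just under 1 as soon as the lower percentile moves into the spike, which is the same cliff the spiky-word plots show.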
+
+Now that we have a hypothesis let's try a couple more plots to see if it still
+holds:
+
+[![Spikiness percentile sensitivity plot for "gg"](/media/images{{ parent_url }}/btd-ssp-gg.png)](/media/images{{ parent_url }}/btd-ssp-gg-large.png)
+
+`gg` does come in spikes, but it happens so often that we need to select
+a smaller lower percentile if we want it to be considered spiky.  Whether we
+want to depends on what we're looking for -- if we want *rare* events then we
+probably want to exclude it.
+
+`ruined` gets spammed so much that it's certainly not rare, and isn't even
+particularly spiky in any way:
+
+[![Spikiness percentile sensitivity plot for "ruined"](/media/images{{ parent_url }}/btd-ssp-ruined.png)](/media/images{{ parent_url }}/btd-ssp-ruined-large.png)
+
+[![Plot of "ruined" unigram in Season 3](/media/images{{ parent_url }}/btd-s3-ruined.png)](/media/images{{ parent_url }}/btd-s3-ruined-large.png)
+
+So it looks like we're at least on a reasonable track here.  Let's settle on the
+100th and 90th percentiles for now and see where they lead.
+
+There's one other addition to our spikiness formula we should make before moving
+on: if the 100th percentile of a term is small (e.g. less than 5) then while it
+might technically be spiky, we probably don't care about it.  So we'll just drop
+those on the floor and not really worry about them.
+
+$$ {\text{Spikiness}}(w) = \begin{cases} 0&amp; {\text{if}}\ P_{100}(w) &lt; 5 &#92;&#92; \frac{P_{100}(w)}{P_{90}(w) + 0.1}&amp; {\text{otherwise}} \end{cases} $$ <!-- ._ fuckin markdown -->
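+
In code, the thresholded score could look something like the following Python sketch.  The names and the nearest-rank percentile are assumptions; under nearest-rank the 100th percentile is simply the maximum bucket count.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile (an assumed definition)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def spikiness(counts, min_peak=5, lower=90, smoothing=0.1):
    """Return 0 for terms whose peak is below `min_peak`,
    otherwise P_100 / (P_lower + smoothing)."""
    peak = percentile(counts, 100)
    if peak < min_peak:
        return 0.0
    return peak / (percentile(counts, lower) + smoothing)


too_small = spikiness([0] * 99 + [4])  # peak of 4 is under the cutoff
kept = spikiness([0] * 99 + [6])       # 6 / (0 + 0.1), roughly 60
```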
+
+## Results
+
+Now that we've got a way to measure a term's spikiness, we can calculate it for
+all n-grams and sort to find some interesting ones.  Let's try it with bigrams:
+
+    :::text
+    mouth__noises 680.00
+    (__mouth 520.00
+    soft__music 480.00
+    elevator__music 480.00
+    noises__) 470.00
+    believe__biblethump 460.00
+    cool__elevator 450.00
+    soft__rock 390.00
+    smooth__soft 390.00
+    smooth__jazz 380.00
+    relaxing__guitar 360.00
+    guitar__music 360.00
+    son__of 330.00
+    music__) 330.00
+    (__soft 330.00
+    a__gun 320.00
+    (__relaxing 320.00
+    big__shaft 300.00
+    super__steve 290.00
+    jazz__music 280.00
+    crazy__day 280.00
+    zoop__zoop 270.00
+    the__heck 270.00
+    (__smooth 260.00
+    flat__trees 240.00
+    steve__! 220.00
+    hi__steve 220.00
+    ...
+
+We can get similar results for unigrams, trigrams, etc.  Let's graph a couple of
+these highly-spiky terms.  Twitch chat definitely loves innuendo:
+
+[![Plot of vaguely sexual n-grams in Season 3](/media/images{{ parent_url }}/btd-s3-innuendo.png)](/media/images{{ parent_url }}/btd-s3-innuendo-large.png)
+
+Something new this week was the addition of captions, which sometimes included
+things like `(soft music)` and `(mouth noises)`.  The chat liked to poke fun at
+those:
+
+[![Plot of "soft music" and "mouth noises" bigrams in Season 3](/media/images{{ parent_url }}/btd-s3-mouthnoises.png)](/media/images{{ parent_url }}/btd-s3-mouthnoises-large.png)
+
+We can also see some particular elements of paintings:
+
+[![Plot of subject n-grams in Season 3](/media/images{{ parent_url }}/btd-s3-subjects.png)](/media/images{{ parent_url }}/btd-s3-subjects-large.png)
+
+The lists aren't perfect.  They contain a lot of redundant stuff (e.g. `(soft
+music)` produces 3 separate bigrams that are all equally spiky), and there's
+a bunch of stuff we don't care about as much.  But if you're looking to find
+some interesting terms they can at least give you a starting point.
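+
A ranking like the bigram list above could be produced roughly as follows.  This Python sketch is only an illustration: the `(timestamp, text)` message format, the regex tokenizer that splits punctuation into its own tokens, the nearest-rank percentile, and every function name are assumptions rather than the actual pipeline.

```python
import math
import re
from collections import defaultdict

# Assumed tokenizer: runs of word characters, or single punctuation marks.
TOKEN = re.compile(r"\w+|[^\w\s]")


def percentile(values, p):
    """Nearest-rank percentile (an assumed definition)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def spikiness(counts, min_peak=5, lower=90, smoothing=0.1):
    peak = percentile(counts, 100)
    if peak < min_peak:
        return 0.0
    return peak / (percentile(counts, lower) + smoothing)


def bigrams(text):
    """Lowercase, tokenize, and join adjacent tokens with a double underscore."""
    tokens = TOKEN.findall(text.lower())
    return ["__".join(pair) for pair in zip(tokens, tokens[1:])]


def bigram_series(messages, bucket_seconds=120):
    """Turn (timestamp_seconds, text) pairs into {bigram: [count per bucket]}."""
    messages = list(messages)
    start = min(t for t, _ in messages)
    end = max(t for t, _ in messages)
    n_buckets = int((end - start) // bucket_seconds) + 1
    series = defaultdict(lambda: [0] * n_buckets)
    for t, text in messages:
        bucket = int((t - start) // bucket_seconds)
        for gram in bigrams(text):
            series[gram][bucket] += 1
    return series


def rank_spiky(series):
    """Sort bigrams by score (highest first), dropping the zero-scored ones."""
    scored = ((gram, spikiness(counts)) for gram, counts in series.items())
    return sorted((gs for gs in scored if gs[1] > 0), key=lambda gs: (-gs[1], gs[0]))


# Made-up chat: "hi bob" five times in every bucket, one burst of a caption joke.
messages = [(b * 120, "hi bob") for b in range(20) for _ in range(5)]
messages += [(7 * 120 + 3, "(Mouth noises)")] * 6
ranked = rank_spiky(bigram_series(messages))
```

In this made-up chat the three bigrams from `(mouth noises)` tie at the top and `hi__bob` lands at the bottom, which also shows the redundancy problem: one phrase yields several equally spiky entries.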
+
+## Join the Fun
+
+I'm posting this right as the Season 4 marathon is going live on [the Bob Ross
+Twitch channel][brtwitch].  If you've got some time, feel free to pull up your
+comfy computer chair and join a few thousand other people for a relaxing evening
+with Bob!
+
+[brtwitch]: http://twitch.tv/BobRoss
+
+    {% endblock article %}
--- a/layout/skeleton/_base.html	Sat Nov 21 11:51:21 2015 +0000
+++ b/layout/skeleton/_base.html	Mon Nov 30 16:09:10 2015 +0000
@@ -27,6 +27,7 @@
             <script data-cfasync="false" src="/media/js/jquery.timeago.js" type="text/javascript"></script>
             <script data-cfasync="false" src="/media/js/sjl.js" type="text/javascript"></script>
             <script data-cfasync="false" src="/media/js/print.js" type="text/javascript"></script>
+            <script data-cfasync="false" type="text/javascript" src="//cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML"></script>

             {% block extra_js %}{% endblock %}
         {% endblock %}
Binary file media/images/blog/2015/11/btd-s2-ggsteve-large.png has changed
Binary file media/images/blog/2015/11/btd-s2-ggsteve.png has changed
Binary file media/images/blog/2015/11/btd-s3-bob-large.png has changed
Binary file media/images/blog/2015/11/btd-s3-bob.png has changed
Binary file media/images/blog/2015/11/btd-s3-douche-large.png has changed
Binary file media/images/blog/2015/11/btd-s3-douche.png has changed
Binary file media/images/blog/2015/11/btd-s3-ggsteve-large.png has changed
Binary file media/images/blog/2015/11/btd-s3-ggsteve.png has changed
Binary file media/images/blog/2015/11/btd-s3-innuendo-large.png has changed
Binary file media/images/blog/2015/11/btd-s3-innuendo.png has changed
Binary file media/images/blog/2015/11/btd-s3-mouthnoises-large.png has changed
Binary file media/images/blog/2015/11/btd-s3-mouthnoises.png has changed
Binary file media/images/blog/2015/11/btd-s3-percentile-bob-large.png has changed
Binary file media/images/blog/2015/11/btd-s3-percentile-bob.png has changed
Binary file media/images/blog/2015/11/btd-s3-percentile-cringe-large.png has changed
Binary file media/images/blog/2015/11/btd-s3-percentile-cringe.png has changed
Binary file media/images/blog/2015/11/btd-s3-percentile-drugs-large.png has changed
Binary file media/images/blog/2015/11/btd-s3-percentile-drugs.png has changed
Binary file media/images/blog/2015/11/btd-s3-percentile-steve-large.png has changed
Binary file media/images/blog/2015/11/btd-s3-percentile-steve.png has changed
Binary file media/images/blog/2015/11/btd-s3-percentile-the-large.png has changed
Binary file media/images/blog/2015/11/btd-s3-percentile-the.png has changed
Binary file media/images/blog/2015/11/btd-s3-ruined-large.png has changed
Binary file media/images/blog/2015/11/btd-s3-ruined.png has changed
Binary file media/images/blog/2015/11/btd-s3-steve-large.png has changed
Binary file media/images/blog/2015/11/btd-s3-steve.png has changed
Binary file media/images/blog/2015/11/btd-s3-subjects-large.png has changed
Binary file media/images/blog/2015/11/btd-s3-subjects.png has changed
Binary file media/images/blog/2015/11/btd-s3-the-large.png has changed
Binary file media/images/blog/2015/11/btd-s3-the.png has changed
Binary file media/images/blog/2015/11/btd-ssp-bob-large.png has changed
Binary file media/images/blog/2015/11/btd-ssp-bob.png has changed
Binary file media/images/blog/2015/11/btd-ssp-gg-large.png has changed
Binary file media/images/blog/2015/11/btd-ssp-gg.png has changed
Binary file media/images/blog/2015/11/btd-ssp-rip__devil-large.png has changed
Binary file media/images/blog/2015/11/btd-ssp-rip__devil.png has changed
Binary file media/images/blog/2015/11/btd-ssp-ruined-large.png has changed
Binary file media/images/blog/2015/11/btd-ssp-ruined.png has changed
Binary file media/images/blog/2015/11/btd-ssp-steve-large.png has changed
Binary file media/images/blog/2015/11/btd-ssp-steve.png has changed
Binary file media/images/blog/2015/11/btd-ssp-the-large.png has changed
Binary file media/images/blog/2015/11/btd-ssp-the.png has changed
Binary file media/images/blog/2015/11/btd-volume-comparison-large.png has changed
Binary file media/images/blog/2015/11/btd-volume-comparison.png has changed
author	Steve Losh <steve@stevelosh.com>
date	Mon, 30 Nov 2015 16:09:10 +0000
parents	2db5b638dc68
children	47f9b4b91599
branches/tags	(none)
files	content/blog/2015/11/beat-the-data.html layout/skeleton/_base.html media/images/blog/2015/11/btd-s2-ggsteve-large.png media/images/blog/2015/11/btd-s2-ggsteve.png media/images/blog/2015/11/btd-s3-bob-large.png media/images/blog/2015/11/btd-s3-bob.png media/images/blog/2015/11/btd-s3-douche-large.png media/images/blog/2015/11/btd-s3-douche.png media/images/blog/2015/11/btd-s3-ggsteve-large.png media/images/blog/2015/11/btd-s3-ggsteve.png media/images/blog/2015/11/btd-s3-innuendo-large.png media/images/blog/2015/11/btd-s3-innuendo.png media/images/blog/2015/11/btd-s3-mouthnoises-large.png media/images/blog/2015/11/btd-s3-mouthnoises.png media/images/blog/2015/11/btd-s3-percentile-bob-large.png media/images/blog/2015/11/btd-s3-percentile-bob.png media/images/blog/2015/11/btd-s3-percentile-cringe-large.png media/images/blog/2015/11/btd-s3-percentile-cringe.png media/images/blog/2015/11/btd-s3-percentile-drugs-large.png media/images/blog/2015/11/btd-s3-percentile-drugs.png media/images/blog/2015/11/btd-s3-percentile-steve-large.png media/images/blog/2015/11/btd-s3-percentile-steve.png media/images/blog/2015/11/btd-s3-percentile-the-large.png media/images/blog/2015/11/btd-s3-percentile-the.png media/images/blog/2015/11/btd-s3-ruined-large.png media/images/blog/2015/11/btd-s3-ruined.png media/images/blog/2015/11/btd-s3-steve-large.png media/images/blog/2015/11/btd-s3-steve.png media/images/blog/2015/11/btd-s3-subjects-large.png media/images/blog/2015/11/btd-s3-subjects.png media/images/blog/2015/11/btd-s3-the-large.png media/images/blog/2015/11/btd-s3-the.png media/images/blog/2015/11/btd-ssp-bob-large.png media/images/blog/2015/11/btd-ssp-bob.png media/images/blog/2015/11/btd-ssp-gg-large.png media/images/blog/2015/11/btd-ssp-gg.png media/images/blog/2015/11/btd-ssp-rip__devil-large.png media/images/blog/2015/11/btd-ssp-rip__devil.png media/images/blog/2015/11/btd-ssp-ruined-large.png media/images/blog/2015/11/btd-ssp-ruined.png 
media/images/blog/2015/11/btd-ssp-steve-large.png media/images/blog/2015/11/btd-ssp-steve.png media/images/blog/2015/11/btd-ssp-the-large.png media/images/blog/2015/11/btd-ssp-the.png media/images/blog/2015/11/btd-volume-comparison-large.png media/images/blog/2015/11/btd-volume-comparison.png