I read this article the other day and found it interesting for a variety of reasons: I approve of people performing detailed analysis of their data; I like it when someone with access to large data sets takes the time to tell us what they see in them; and I'm always interested in seeing nice visualisations of large, complex data sets.
I woke up this morning with an uneasy feeling swirling around the back of my (still clouded) mind and traced it to this article. Something was bugging me about it, so I went back and took another read.
Jakob advocates using a log-log plot of traffic data to highlight how page views behave at the low-frequency end of the distribution. He notes that a simple linear plot of the page view data seems to indicate that the traffic follows a Zipf distribution, but that on the log-log plot you can clearly see the observed values fall away from the predicted ones at the low end.
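To make the log-log argument concrete, here is a minimal sketch of why a Zipf distribution shows up as a straight line on such a plot. The numbers are invented for illustration (they are not Jakob's data), and the function names and the parameters `top_views` and `exponent` are my own assumptions.

```python
import math

def zipf_views(n_pages, top_views=10000.0, exponent=1.0):
    """Hypothetical page-view counts for an ideal Zipf law: views = top_views / rank**exponent."""
    return [top_views / rank ** exponent for rank in range(1, n_pages + 1)]

def loglog_slopes(views):
    """Slope between each pair of consecutive (log rank, log views) points."""
    slopes = []
    for rank in range(1, len(views)):
        dx = math.log(rank + 1) - math.log(rank)
        dy = math.log(views[rank]) - math.log(views[rank - 1])
        slopes.append(dy / dx)
    return slopes

# For an ideal Zipf law every local slope equals -exponent,
# which is what "a straight line on a log-log plot" means.
ideal = zipf_views(1000)
slopes = loglog_slopes(ideal)
print(min(slopes), max(slopes))
```

Real traffic that "falls away" at the low end would show these local slopes getting steeper (more negative) as the rank increases, rather than staying constant.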
And here's where he lost me: he then goes on to assert that this is evidence that the site in question is failing to meet the demand of the site's audience, based on the divergence from the Zipf distribution... 'So what?' I hear you ask. Well, at no stage has he shown - even loosely - that the Zipf distribution is ever a good model for site traffic, even if the site owner goes crazy with the content development. Is there even one example that can be used to show that this model is a good predictor of traffic patterns to a site? Anything?
So, based on the evidence provided (nil), I'd have to reject this advice and look for alternative explanations. Some exist already: the traffic distribution shown by Jakob in his example could actually be an occurrence of a lognormal distribution. It may be that the 'natural' distribution for page view data on a Web site is the lognormal, and not the Zipf at all.
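As a rough illustration of why a lognormal could produce the same picture, the sketch below builds a rank-ordered set of page views from lognormal quantiles and compares the log-log slope at the head with the slope at the tail. The parameters `mu` and `sigma` and the function names are assumptions chosen for the example, not values fitted to any real site.

```python
import math
from statistics import NormalDist

def lognormal_views(n_pages, mu=5.0, sigma=2.0):
    """Hypothetical page views, ranked from most to least viewed, taken from lognormal quantiles."""
    std_normal = NormalDist()
    views = []
    for rank in range(1, n_pages + 1):
        # Rank 1 sits at the top quantile, rank n_pages at the bottom.
        quantile = 1.0 - (rank - 0.5) / n_pages
        views.append(math.exp(mu + sigma * std_normal.inv_cdf(quantile)))
    return views

def loglog_slope(views, rank_a, rank_b):
    """Slope of the line joining two ranks in (log rank, log views) space."""
    dy = math.log(views[rank_b - 1]) - math.log(views[rank_a - 1])
    dx = math.log(rank_b) - math.log(rank_a)
    return dy / dx

views = lognormal_views(1000)
head_slope = loglog_slope(views, 10, 20)     # popular pages
tail_slope = loglog_slope(views, 900, 990)   # rarely viewed pages
print(head_slope, tail_slope)
```

The tail slope comes out much steeper than the head slope, so a lognormal curve droops below a straight Zipf line at the low end without any appeal to missing content.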
But what factors might contribute to the occurrence of the one versus the other? Jakob argues that we are seeing something similar to the lognormal distribution because of the scarcity of specialised, low-view content, which would otherwise populate the low end of the Zipf distribution. That is, in the presence of additional content, Web site visitors would 'naturally' view these pages and boost the number of views to match the values predicted by the Zipf distribution.
I'm not convinced. For additional reading you might like to peruse Chris Anderson's (of the Long Tail fame) look at movie distribution figures for the US here. Chris attributes the presence of the lognormal distribution to the finite number of movie screens available in the US. Chris provides a much stronger argument for the Zipf distribution as the 'natural' curve of movie revenues in the absence of constraints.
For the Web page view analysis at least one question remains: is the constraint the scarcity of content, or the scarcity of visitors?