I have been thinking about this and totally get @codinghorror’s concern regarding information porn. It’s a tough problem to tackle with many different customers.
The immediate problem we need to tackle is this:
This is a very complex problem because Discourse has a rather “different” view of what a page view means.
If a user visits a topic and then loads more posts is that 1 page view or 2?
If the answer to the above is 1, what about crawlers that make 2 distinct non-JSON requests?
Why not just count every web request as a page view?
The questions just pile up, and answering them is complex.
I deployed this system about 5 hours ago; since then we have served 113 thousand background requests and a total of 176 thousand web requests. That is not even counting static assets.
Even if we only count GET requests to topics, we already ate through 5,500 in the last 5 hours, so it’s likely we would hit 750k requests a month on topics alone, which would blow the business plan just for meta.
So, besides probably needing to adjust our limits, we first need a sane and easy-to-explain stat.
###What do I think a page view should be?
- A non-ajax HTTP GET request that is served successfully
- An ajax HTTP GET request that is “decorated” by the Ember router on route transition.
a. When you move from topic list to topic, we count a page view
b. When you move from the topic page to the user page we count a view
c. When you switch filters on the user page we do not count a new page view
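To make the definition above concrete, here is a minimal sketch of the classification logic. It assumes the Ember router sets some flag (a custom ajax header or similar, as mentioned at the end of this post) on route transitions; the method and parameter names are illustrative, not the actual implementation.

```ruby
# Hypothetical sketch: decide whether a request counts as a page view.
# `tracked_header` stands in for whatever flag the Ember router would
# attach on route transitions (the real header name is TBD).
def page_view?(method:, status:, xhr:, tracked_header: false)
  return false unless method == "GET" && (200..299).cover?(status)
  return true unless xhr # successful non-ajax GET: always a page view
  tracked_header         # ajax GET: only if the router flagged the transition
end

page_view?(method: "GET", status: 200, xhr: false)                      # => true  (full page load)
page_view?(method: "GET", status: 200, xhr: true, tracked_header: true) # => true  (route transition)
page_view?(method: "GET", status: 200, xhr: true)                       # => false (background ajax)
```

The key point is that background ajax (polling, filter switches within a page) falls through to `false` unless the router explicitly marks the request.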
Once we have this out of the way, we can simply define a table for “page views” and count anon, crawler, logged-in, and total page views.
This information should always be there on all instances and will easily allow people to do some basic capacity planning: four simple-to-explain numbers that carry through to our buy page.
###Longer term we need performance counters
People are having a really tough time figuring out if they are under-provisioned. I think there are a bunch of stats we can add to a “performance” table to help answer that:
- How many GET req took longer than 200ms?
- How many GET requests took longer than 1 second?
- How many GET requests took longer than 5 seconds?
- How many GET requests total?
- How many server errors?
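These five metrics are all simple threshold counters, so the bookkeeping is trivial. A rough sketch (class and attribute names are made up for the example; note the duration buckets are cumulative, mirroring how the questions above are phrased):

```ruby
# Illustrative sketch of the proposed performance counters.
class PerfCounters
  attr_reader :over_200ms, :over_1s, :over_5s, :total, :errors

  def initialize
    @over_200ms = @over_1s = @over_5s = @total = @errors = 0
  end

  # Called once per GET request with its duration and response status.
  def record(duration_seconds, status)
    @total      += 1
    @over_200ms += 1 if duration_seconds > 0.2
    @over_1s    += 1 if duration_seconds > 1.0
    @over_5s    += 1 if duration_seconds > 5.0
    @errors     += 1 if status >= 500
  end
end

counters = PerfCounters.new
counters.record(0.05, 200) # fast request: only total increments
counters.record(0.7,  200) # slower than 200ms
counters.record(6.2,  500) # very slow, and a server error: hits all buckets
```

Because each request only bumps a handful of integers, this costs almost nothing per request, unlike percentile tracking (more on that below).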
Given these 5 simple metrics, people can easily establish a performance baseline and track updates that improve performance. Additionally, when/if they migrate to us, they can quickly tell how much “better” stuff has become. Sure, it’s nowhere near as comprehensive as New Relic or other solutions, but it can be enough to get a good gauge on general performance.
However, simple performance counters will take quite a while to figure out. It may be more interesting to gather “median” and “99th percentile” timings, but we would need a new system for that.
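To illustrate why percentiles need a new system: unlike the counters above, they require retaining individual timings (or a sampling/approximation structure). A toy nearest-rank version over a retained sample:

```ruby
# Toy nearest-rank percentile over retained samples; real systems would
# use sampling or sketches rather than keeping every timing in memory.
def percentile(samples, pct)
  sorted = samples.sort
  rank = (pct / 100.0 * sorted.length).ceil - 1
  sorted[[rank, 0].max]
end

timings = [0.1, 0.2, 0.3, 0.4, 1.5]
percentile(timings, 50) # => 0.3 (median)
percentile(timings, 99) # => 1.5
```

The storage and aggregation this implies is exactly the part that does not fit into a simple counter table.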
So, the performance piece is going to have to wait for 1.3.
###General server usage
I still think there is room for a simple counter of
- 2XX requests (successful web requests)
- 3XX requests (redirects)
- 4XX requests (client errors)
- 5XX requests (server errors)
- Background requests (split out from all above)
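Tallying these is a one-liner per request: split out background requests first, then bucket by status class. A minimal sketch (the `background` flag here is illustrative; how a request gets marked as background is a separate question):

```ruby
# Minimal sketch of the opt-in general usage counters: tally responses
# by HTTP status class, with background requests split out first.
def tally(requests)
  counts = Hash.new(0)
  requests.each do |status, background|
    next counts[:background] += 1 if background
    counts["#{status / 100}xx".to_sym] += 1
  end
  counts
end

counts = tally([[200, false], [302, false], [404, false], [500, false], [200, true]])
counts[:"2xx"]       # => 1
counts[:background]  # => 1
```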
I find the general usage very interesting, but it’s not everyone’s cup of tea. I think this particular data can be an opt-in thing hidden behind a site setting; when diagnosing capacity issues we can enable it. The amount of traffic we are handling is quite staggering.
So to recap… for 1.2 we will only have 4 counters:
- anon page views
- logged-in page views
- crawler page views
- total page views
A performance dashboard should be queued for 1.3 and beyond; aggressive monitoring of all traffic will be opt-in (it is already there, so it is easy to add for 1.2).
Complex analysis is going to need to come from a plugin or Google Analytics, so I don’t really think we need to split API requests out from the rest of the pile right now; the more we segment the data, the harder the numbers become to consume and understand.
cc @eviltrout (in particular about the page view definition - which we need to wire into the transition spot using a magic ajax HTTP header or something like that)