There are a lot of great comments about some amazing exceptions on this thread.
BTW - I'm Randy Farmer, and I've been advising the team on this and other issues. Here are my qualifications:
Especially interesting in this case is the whole of Chapter 10, which you can read for free here:
Anyway - Q&A is different than forums, so we'll be adapting and experimenting here - so this feedback is great!
The most important thing about reputation scores is that they are in context. "Flagging" reputation should be its own (internal) score. That is the score that goes up or down based on how accurate you are at flagging content, not your general trust score. As @tszynalski points out, significantly modifying your general trust confuses things.
At Yahoo! Answers we learned that people won't report marginal calls even when only their flagging reputation is at risk, much less when a bad call could hurt their overall reputation.
Users definitely were hiding the worst of the worst content. All the content that violated the terms of service was getting hidden (along with quite a bit of the backlog of older items). But not all the content that violated the community guidelines was getting reported. It seemed that users weren't reporting items that might be considered borderline violations or disputable. For example, answers with no content related to the question, such as chatty messages or jokes, were not being reported. No matter how Ori tweaked the model, that didn't change.
In hindsight, the situation was easy to understand. The reputation model penalized disputes (in the form of appeals): if a user hid an item but the decision was overturned on appeal, the user would lose more reputation than he'd gained by hiding the item. That was the correct design, but it had the side effect of nurturing risk avoidance in abuse reporters. Another lesson in the difference between the bad (low-quality content) and the ugly (content that violates the rules) - they each require different tools to mitigate.
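To make that trade-off concrete, here's a rough sketch in Python. Everything in it is an assumption for illustration: the names (`UserReputation`, `record_flag_outcome`, `expected_change`) and the gain/loss values are made up, and this is not the actual Yahoo! Answers model. It only shows two ideas from above: the flagging score lives in its own context, and an overturned flag costs more than an upheld flag earns.

```python
from dataclasses import dataclass, field

# Hypothetical values, chosen only to show the asymmetry:
# an overturned flag costs more than an upheld flag earns.
UPHELD_GAIN = 1.0
OVERTURNED_LOSS = 2.0


@dataclass
class UserReputation:
    # One score per context; there is deliberately no combined "trust" number.
    scores: dict = field(
        default_factory=lambda: {"flagging": 0.0, "answers": 0.0}
    )


def record_flag_outcome(user: UserReputation, upheld: bool) -> float:
    """Adjust only the flagging score once a flag is upheld or overturned."""
    delta = UPHELD_GAIN if upheld else -OVERTURNED_LOSS
    user.scores["flagging"] += delta
    return delta


def expected_change(p_upheld: float) -> float:
    """Expected flagging-score change for a flag upheld with probability p_upheld."""
    return p_upheld * UPHELD_GAIN - (1.0 - p_upheld) * OVERTURNED_LOSS


if __name__ == "__main__":
    user = UserReputation()
    record_flag_outcome(user, upheld=True)    # clear-cut spam, flag stands
    record_flag_outcome(user, upheld=False)   # borderline call, overturned
    print(user.scores)
    print(expected_change(0.95), expected_change(0.6))
```

With these made-up numbers a flag only pays off, on average, when the flagger is more than two-thirds sure it will stand. That is one way to read the risk avoidance described above: obvious spam gets flagged, borderline stuff doesn't.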
Discourse will need to track multiple reputations, including "flagger" quality - and this has been shown to work to get rid of the very worst (spam/troll) content. It doesn't deal with the marginal cases (we're still debating about how to handle "off-topic") - thoughts on that based on operational experience are most welcome!