Data analytics by complex networks and text mining


(Renato Fabbri) #1

Dear users and developers,

Looking through the Discourse site, it seems to me that an interface for further data analytics would enhance
our experience.
I can tackle this using complex network and text mining techniques,
because I have done research on these topics.
Immediate ideas are:
*) Derive networks from user interactions and vocabulary usage, and deliver graphical interfaces for exploring these structures.
*) Count words, tags, terms and user activity.
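
As a rough illustration of the counting idea, here is a minimal sketch in Python. The `posts` list, its `user` and `text` fields, and the `tokenize` helper are all assumptions made for the example; they are not a Discourse data format or API.

```python
import re
from collections import Counter

# Hypothetical sample posts; a real run would read them from a Discourse export.
posts = [
    {"user": "alice", "text": "Networks and text mining for forum data"},
    {"user": "bob", "text": "Text mining needs clean forum data"},
    {"user": "alice", "text": "Interaction networks are informative"},
]


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


# Word frequencies over all posts, and number of posts per user.
word_counts = Counter(tok for p in posts for tok in tokenize(p["text"]))
activity = Counter(p["user"] for p in posts)

print(word_counts.most_common(3))
print(activity.most_common())
```

A real version would probably also drop stop words and handle tags separately, but the counting itself stays this simple.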

I understand it might be late for GSoC, but if you find it suitable,
I might apply.
My apologies for not making contact earlier, but I handed in my doctoral
dissertation a few days ago and could not concentrate as needed until now.
Some information about my research and software development efforts is gathered here:
https://pastebin.com/iNNuN4fy
Anyway, this topic might be of use to the Discourse community as a whole and for
developments outside GSoC.

Best Regards!
Renato Fabbri


(Christoph) #2

Personally, I find your thesis interesting as I also intend to do some analysis on my forum at some point in the future. But I am not sure what exactly the feature for Discourse would be. Is this supposed to be a tool for admins to be able to identify types of users? If so, what for, and don’t you think existing stats are sufficient? Or is it supposed to be a feature for users so they can compare themselves to others like on fitness tracking sites? If so, I again wonder: what for? Aren’t badges and likes enough to allow comparison and perhaps provide some incentive? Maybe I am completely misunderstanding you, so please explain.


(Renato Fabbri) #3

I had in mind a tool for general users,
and I think its usefulness can be glimpsed through
the following questions:
*) Do the current stats make clear who the all-time and recently most active users are?
*) Which users relate to each other beyond what can be grasped by browsing through individual topics?
*) What does the overall interaction network (or networks) look like? What characteristics can a participant take advantage of, and how can Discourse encourage fruitful interactions?
*) What are the most used words and terms in Discourse, and how do they relate to each other?
*) Do we have interactive and interesting graphical interfaces to the analytics?
*) How do linguistic traces differ across user groups, and how can participants make use of this?

Anyway, I think that these kinds of analyses can help users get interested in the forum's legacy
and help admins showcase it or make reports.


(Erlend Sogge Heggen) #4

I think some really cool data could come out of this, but it’s too experimental as a GSoC project, because it’s hard to tell exactly what we’d get out of it. At the very least we’d need a proof of concept to pique our interest :wink:

What do you think is the one most interesting piece of data that could be derived from these techniques?


(Renato Fabbri) #5

The most interesting piece for Discourse IMHO is the interaction network,
because it is simple, informative and eye-catching.
There are a number of proofs of concept in the documentation linked
in my starting message.
I will be happy to make some images and measurements from Discourse data
if you can send me a database dump or direct me to an interface.
I might not build a JavaScript interface with interactive D3.js graphs right now,
although they are cool to contemplate and useful for investigation.
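
To give an idea of the kind of measurement I mean, here is a minimal sketch in Python that builds a weighted, directed interaction network from reply pairs and counts how many replies each user sent and received. The `replies` list and the user names are made up for illustration; real pairs would have to come from Discourse data.

```python
from collections import Counter, defaultdict

# Hypothetical (author, replied_to) pairs; real ones would come from Discourse data.
replies = [
    ("bob", "alice"),
    ("carol", "alice"),
    ("alice", "bob"),
    ("bob", "alice"),
]

# Weighted directed edges: how many times each author replied to each user.
edge_weights = Counter(replies)

# In-strength = replies received, out-strength = replies sent.
in_strength = defaultdict(int)
out_strength = defaultdict(int)
for (author, target), weight in edge_weights.items():
    out_strength[author] += weight
    in_strength[target] += weight

for user in sorted(set(in_strength) | set(out_strength)):
    print(user, "received", in_strength[user], "sent", out_strength[user])
```

The same edge list could later feed a graph library or a D3.js view; the sketch only covers the counting step.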


(Renato Fabbri) #6

Just dropping by to see if there really is no feedback on getting access to Discourse data for making the proof of concept. Anyway, thank you for your time and the nice interaction.


(Mittineague) #7

I have seen topics here over the years where members expressed interest in having various data other than what can be seen on the /users, /dashboard and /report pages. Search here for “statistics” and you can find some.
AFAIK the typical approach has been to use crafted queries with the Data Explorer Plugin.

There is also this plugin I haven’t tried yet:


Perhaps the way forward would be to collaborate with @saiqulhaq ?


(Saiqul Haq) #8

The Admin Statistic Digest plugin generates statistics reports in a simple way, with no multi-threaded processing.
It retrieves data from the database and then does the calculations. If the data retrieval takes longer than a specified time (maybe 1 or 2 minutes), the process is aborted.
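
The abort-on-timeout idea can be sketched roughly like this. This is Python written only for illustration, not the plugin's actual code; `retrieve_with_budget`, `TimeBudgetExceeded` and the sample batches are hypothetical names and data.

```python
import time


class TimeBudgetExceeded(Exception):
    """Raised when data retrieval runs longer than the allowed time."""


def retrieve_with_budget(batches, budget_seconds):
    """Collect rows batch by batch, aborting once the time budget is spent."""
    deadline = time.monotonic() + budget_seconds
    rows = []
    for batch in batches:
        # Check the clock before each batch; give up if we are past the deadline.
        if time.monotonic() > deadline:
            raise TimeBudgetExceeded(f"retrieval exceeded {budget_seconds} s")
        rows.extend(batch)
    return rows


# Hypothetical in-memory batches standing in for database query results.
sample_batches = [[1, 2], [3], [4, 5]]
rows = retrieve_with_budget(sample_batches, budget_seconds=60)
print(len(rows))
```

Note that this only checks the clock between batches, so a single slow query would still have to finish before the abort takes effect.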

However, it would be cool if this feature could be implemented.

I am wondering: should the text mining process be executed on the same machine as the Discourse server? Does it have high hardware requirements?