Suggested Topics - Title & Content based Suggestions

Would be pretty fantastic if suggested topics took topic titles and content into account. E.g. if you’re reading a topic about ‘analytics’, topics that contain similar content could be mixed into the current suggested results.

If this could be done efficiently, I’m confident this would be a sure win for increasing engagement for both new visitors and active users. Even if these suggestions only used the topic titles (instead of title + content).

3 Likes

I created a similar thread but it is locked.
Currently I have disabled this feature and replaced it with my own.

1 Like

This is a very complicated feature to build efficiently, and the rabbit hole is enormous. At some point you start considering machine learning, and the rabbit hole goes deeper and deeper.

I see the appeal of using AI to determine related content; it could be very handy on #support topics, which could almost be made “self service”. You post a question, ML picks 10 candidates and posts them as a reply you can either accept or delete. Similarly, when anon hits something for support, it is handy to show related content.

That said… gigantic enormous task. Not on our roadmap for next year. But @eviltrout is keen to experiment with AI/ML at some point and this is the type of project that could relate.

@codinghorror remains a massive fan of DJ random, cause he reckons it can outperform many many fancy algorithms (and it often does)

6 Likes

DJ random is fantastic, though we do gate it by time and category / tag — that matters.

3 Likes

Hi,

New here, so sorry if I’m beating a dead horse.

I agree with @sam that there is a rabbit hole, but on the other hand, topic modeling technology is now pretty mature, and pretty good off-the-shelf tools exist. A recent project of mine analyzed ~5 million patent titles and abstracts; analyzing on the order of thousands of topics on my spiffy new Discourse site would be a piece of cake. Moreover, my community might have the energy to make it happen.

From the experts: I would like advice on whether I should be designing a plugin, or messing with the Discourse source (which I have downloaded from GitHub)?

Found this on scraping Discourse topics with Python, but haven’t got it to work yet. Something like it should let me pull the data offline and build the model, which could then be loaded for querying afterwards.

Most of the good tools are in python, FWIW…

3 Likes

Functionally it fits best in the “your topic is similar to…” panel when you’re composing a new topic.

1 Like

I would certainly recommend a plugin here over hacking the source code. Odds are extremely low we could ship something like this in core cause there would be a massive python dependency required and tons of UI for training and so on.

There is a lot of work around the mechanics of training and so on. Do you have a rundown of the mechanics by which you would perform the training? What exact models would you recommend using? What happens when a topic has 100 posts? 1000 posts?

What would you use for signal, and what strength would you give to each thing (views/category/tag and so on)?

I am extremely fond of this project, but I feel it is a somewhat huge task.

2 Likes

There is a lot of work around the mechanics of training and so on. Do you have a rundown of the mechanics by which you would perform the training? What exact models would you recommend using?

The current tools my team uses come from gensim. It has a standard Python module interface and has been pretty well tested over many years.

The setup that comes to my mind would be:

  • First: choose the document set: it could be all topic roots, or all posts.

From time to time (e.g. once per week? once per month? depending on forum traffic), build the doc2vec model:

  • scrape the Discourse topics into a file (or files) of Markdown text, title + topic body, now thinking of each topic as a doc, or “document”, for the gensim algorithms
  • run standard NLP tools to process the docs, stemming words, etc.
  • use doc2vec (from gensim implementation) to build a model that maps each doc into a vector in a d-dimensional space. You have to choose the meta-param d by experimenting; Google uses d=40 for its patent models; not sure what d is used by Google scholar. I typically use d=200. Each dimension of the space may be thought of as a “feature” related to the semantic content.
    • (FYI: the doc2vec algorithm builds the feature space by training a neural net targeted to learn word sequences; the nnet has a d-dimensional hidden layer; the outputs of the hidden layer form the latent space of features)
  • Building the model is the heavy-weight task, depending on how many docs you have. 38 years of patents = 5 million docs; the doc2vec model takes overnight on an oldish machine with 8 cores.
  • Optional interesting further task: cluster the cloud of docs in the d-dim feature space.
    • off the shelf tools for clustering, e.g. from python sklearn library may be used.
    • the clustering gives an emergent classification; interesting research questions include how these classifications overlap with keyword (or discourse tag) categories.

This would happen offline. Then online:

  • The model would be loaded.
  • Once the model is loaded, a rather light-weight task is to parse a new doc and query the model for its location in the d-dim feature space.
    • note this new doc would not trigger a rebuilding of the model. The model would be static for the online queries. The new doc would be incorporated in the next (e.g. weekly) build of the model
  • Then the last light-weight task is to ask what are the nearby docs in feature space. There are gensim tools for getting a list of nearby docs, but you can also use numpy directly to load up all the doc vectors into a structure like a kd-tree that enables fast query of nearby points directly.

What happens when a topic has 100 posts? 1000 posts?

The offline part scales more or less linearly with the number of docs, but should be very manageable for 10k–100k docs. Even millions of docs are OK for a weekly batch.

What would you use for signal, and what strength would you give to each thing (views/category/tag and so on)?

In this context ‘signal strength’ for a new topic is directly interpreted as (inverse) distance from the new topic’s vector-space embedding to existing doc vecs. One could dress this signal up with other considerations (likes, views, etc), but these are additional frills to the basic algorithm I am describing.

Once I (or someone) get scraping to work, the offline bit described above is pretty easy and mechanical.

The hard bit (for me) would be the online bit, which would require interfacing Discourse’s Rails code with a handful of Python calls (e.g. to the gensim tools). Any examples of this sort of interface would be helpful for me to look at.

3 Likes

@Bcat: I would be very interested to see how you ‘replaced it with yours’. Do you have a plugin or repo that I could check out?

The tricky performance piece is the RPC mechanism here. You don’t want to launch a brand new Python process for every single topic view.

Even an HTTP call may be too slow.

Perhaps … populate a related_topics (topic_id, related_topic_id, rank) table? You could then lean on WebHooks to get the table updated quickly when people post new topics and Ruby does not need to call Python.

On the Discourse side, implementation would be pretty easy: you would simply rewrite this method to look up the information in your new related_topics table.

1 Like

The old way didn’t work, so I replaced it with Google ads. The topics that Google suggests are very clever.
As for the old way of doing things, I turned off the default suggestions and replaced them with a JS snippet that calls /search and returns a list of topics.

2 Likes

Thanks for the pointer to the table implementation. Not sure the table approach scales, though. For N topics, we need a table of size N^2, so for 10^4 topics the table would have 10^8 entries.

I don’t see how to escape needing a Python call to parse a new topic, embed it, and find the nearest neighbors. I have in the past built a web interface, but I would probably be tempted here to just run a Python process on the side and communicate with Discourse through a socket or pipe, making it look more or less like reading and writing a file rather than an actual Python call. (It’s all running on my server, after all…)
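That side-process idea could look something like this toy sketch (a `socketpair` stands in for a real Unix domain socket between Rails and the Python process, and the lookup function is a fake in place of the model query):

```python
import json
import socket
import threading

# Fake lookup: topic text -> related topic ids (stand-in for the model).
def related_for(text):
    return [101, 102] if "analytics" in text else []

def serve(conn):
    # Protocol: one JSON request per line in, one JSON response per line out.
    with conn, conn.makefile("rwb") as f:
        for line in f:
            req = json.loads(line)
            resp = {"related": related_for(req["text"])}
            f.write(json.dumps(resp).encode() + b"\n")
            f.flush()

# socketpair stands in for a Unix domain socket to the side process.
server_sock, client_sock = socket.socketpair()
threading.Thread(target=serve, args=(server_sock,), daemon=True).start()

with client_sock, client_sock.makefile("rwb") as f:
    f.write(json.dumps({"text": "analytics broke"}).encode() + b"\n")
    f.flush()
    reply = json.loads(f.readline())
```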

Sorry I think I am completely misunderstanding here?

If you have 100 topics and each topic shows 5 related topics, why would the table need to be larger than 500?

1 Like

N topics => N points in the vector-space representation.
The matrix of distances between each pair of points has N^2 entries (the matrix is symmetric, so N(N-1)/2 independent values). This is the N^2 I was referring to.

But clever data structures (e.g. a kd-tree) enable finding nearest neighbors without a brute-force search of the N^2 table of distances.

Anyway, I know how to do all this in python, returning the small table you refer to, N x 5 for 5 nearest topics.
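For illustration, that N × 5 table can be produced with scipy’s `cKDTree` (random vectors below stand in for real doc2vec embeddings; note that kd-trees degrade toward brute force in high dimensions, so approximate nearest-neighbour libraries may be preferable at scale):

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
n_topics, d = 300, 50                          # small stand-in sizes
doc_vecs = rng.standard_normal((n_topics, d))  # fake doc2vec embeddings
topic_ids = list(range(1, n_topics + 1))       # fake topic ids

tree = cKDTree(doc_vecs)
# Ask for k=6 neighbours: each point's nearest neighbour is itself,
# so drop it and keep the next 5.
_, idx = tree.query(doc_vecs, k=6)
related = {topic_ids[i]: [topic_ids[j] for j in row if j != i][:5]
           for i, row in enumerate(idx)}
```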

1 Like

Then if you run that daily in Python you could just connect the Python directly to the Discourse DB and have it generate this cache table.

Then the Discourse plugin part of things is kind of trivial. Instead of selecting from location X it selects from location Y (a different table).

You no longer need to fight with pipelines that straddle two programming languages for a single request.
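A sketch of that daily batch (an in-memory sqlite3 database stands in for the real Discourse Postgres connection; the `related_topics` columns follow the earlier suggestion, and the neighbour data is made up):

```python
import sqlite3

# Stand-in for the real Discourse Postgres connection.
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE IF NOT EXISTS related_topics (
    topic_id INTEGER, related_topic_id INTEGER, rank INTEGER)""")

# Output of the offline nearest-neighbour step: topic -> 5 related topics.
neighbours = {101: [102, 105, 99, 7, 42]}

db.execute("DELETE FROM related_topics")  # rebuild the cache wholesale
db.executemany(
    "INSERT INTO related_topics VALUES (?, ?, ?)",
    [(tid, rid, rank)
     for tid, rids in neighbours.items()
     for rank, rid in enumerate(rids, start=1)])
db.commit()

# The plugin-side query is then a trivial ordered lookup.
rows = db.execute(
    "SELECT related_topic_id FROM related_topics "
    "WHERE topic_id = 101 ORDER BY rank").fetchall()
```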