There is a lot of work around the mechanics of training and so on. Would you have a rundown of the mechanics by which you would perform the training? What exact models would you recommend using?
The current tools my team uses come from gensim. It has a standard Python module interface and has been pretty well tested for many years.
The setup that comes to mind would be:
- First: choose the document set: it could be all topic roots, or it could be all posts.
- From time to time (e.g. once per week or once per month, depending on forum traffic), build the model:
- scrape the Discourse topics into a file (or files) of Markdown text, title + topic body, thinking of each topic as a doc, or “document”, for the gensim algorithms
- run standard NLP tools to preprocess the docs: tokenizing, stemming words, etc.
- run doc2vec (the gensim implementation) to build a model that maps each doc to a vector in a d-dimensional space. You have to choose the hyperparameter d by experimenting; Google uses d=40 for its patent models; I am not sure what d Google Scholar uses. I typically use d=200. Each dimension of the space may be thought of as a “feature” related to the semantic content. (A sketch of these offline steps follows the list.)
- (FYI: the doc2vec algorithm builds the feature space by training a neural net to predict word sequences; the net has a d-dimensional hidden layer, and the outputs of that hidden layer form the latent space of features)
- Building the model is the heavyweight task, and the cost depends on how many docs you have. 38 years of patents = 5 million docs; training the doc2vec model on those takes overnight on an oldish machine with 8 cores.
- Optional interesting further task: cluster the cloud of docs in the d-dim feature space.
- off-the-shelf clustering tools, e.g. from the Python sklearn library, may be used.
- the clustering gives an emergent classification; interesting research questions include how these classifications overlap with keyword (or Discourse tag) categories.
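To make the offline steps concrete, here is a minimal sketch, assuming gensim 4.x and scikit-learn. The file layout, the iter_topics helper, and the hyperparameter values (window, min_count, epochs, n_clusters) are illustrative assumptions, not settled choices:

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from gensim.utils import simple_preprocess
import glob
import os

def iter_topics(dump_dir="scraped_topics"):
    # hypothetical layout: one markdown file per scraped topic, named <topic_id>.md
    for path in glob.glob(os.path.join(dump_dir, "*.md")):
        topic_id = os.path.splitext(os.path.basename(path))[0]
        with open(path, encoding="utf-8") as f:
            yield topic_id, f.read()

# one TaggedDocument per topic; simple_preprocess lowercases and tokenizes
# (a stemmer, e.g. nltk's PorterStemmer, could be applied to the tokens here)
corpus = [TaggedDocument(words=simple_preprocess(text), tags=[topic_id])
          for topic_id, text in iter_topics()]

# the heavyweight step: train doc2vec; vector_size is the "d" discussed above
model = Doc2Vec(corpus, vector_size=200, window=5, min_count=2,
                workers=8, epochs=20)
model.save("forum_doc2vec.model")

# optional further task: cluster the doc cloud with an off-the-shelf sklearn tool
from sklearn.cluster import KMeans
cluster_labels = KMeans(n_clusters=25, n_init=10).fit_predict(model.dv.vectors)
```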
This would happen offline. Then online:
- The model would be loaded.
- Once the model is loaded, a rather lightweight task is to parse a new doc and query the model for its location in the d-dim feature space.
- note this new doc would not trigger a rebuild of the model. The model would stay static for the online queries; the new doc would be incorporated in the next (e.g. weekly) build of the model.
- Then the last lightweight task is to ask which docs are nearby in feature space. There are gensim tools for getting a list of nearby docs, but you can also use numpy directly to load all the doc vectors into a structure like a k-d tree that enables fast queries of nearby points. (A sketch follows this list.)
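A minimal sketch of those online queries, again assuming gensim 4.x (plus scipy for the k-d tree variant); new_text stands in for the freshly posted topic:

```python
from gensim.models.doc2vec import Doc2Vec
from gensim.utils import simple_preprocess

model = Doc2Vec.load("forum_doc2vec.model")  # loaded once, then held static

# lightweight: parse the new doc and infer its location in the d-dim feature space
new_text = "title and body of the freshly posted topic"  # stand-in text
new_vec = model.infer_vector(simple_preprocess(new_text))

# gensim route: cosine-similarity ranking over the stored doc vectors
neighbors = model.dv.most_similar([new_vec], topn=10)  # [(topic_id, score), ...]

# numpy/scipy route: a k-d tree over the raw vectors for fast nearest-point queries
# (k-d trees use Euclidean distance; normalizing the vectors first makes the
#  ordering match cosine similarity)
import numpy as np
from scipy.spatial import cKDTree

vecs = model.dv.vectors
vecs = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
tree = cKDTree(vecs)
dists, idxs = tree.query(new_vec / np.linalg.norm(new_vec), k=10)
neighbor_ids = [model.dv.index_to_key[i] for i in idxs]
```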
What happens when a topic has 100 posts? 1000 posts?
The offline part scales more or less linearly with the number of docs, but should be very manageable for 10k-100k docs. Even millions of docs are OK for a weekly batch.
What would you use for signal, and what strength would you give to each thing (views/category/tag and so on)?
In this context, ‘signal strength’ for a new topic is directly interpreted as (inverse) distance from the new topic’s vector-space embedding to the existing doc vectors. One could dress this signal up with other considerations (likes, views, etc.), but those are additional frills on the basic algorithm I am describing. (A small sketch follows.)
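For concreteness, a small sketch of that basic signal, with cosine similarity standing in for (inverse) distance; this is the same measure gensim's most_similar uses:

```python
import numpy as np

def signal_strength(new_vec, doc_vec):
    # cosine similarity in [-1, 1]: higher means closer in feature space,
    # i.e. a stronger relatedness signal for the new topic
    return float(np.dot(new_vec, doc_vec) /
                 (np.linalg.norm(new_vec) * np.linalg.norm(doc_vec)))
```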
Once I (or someone) get the scraping to work, the offline bit described above is pretty easy and mechanical.
The hard bit (for me) would be the online bit, which would require interfacing Discourse’s Rails code with a handful of Python calls (e.g. to the gensim tools). Any examples of this sort of interface would be helpful for me to look at.
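For what it's worth, one common pattern is a small Python HTTP service that the Rails side calls, so Ruby never embeds Python directly. A minimal sketch, assuming Flask; the /similar route and the JSON shape are made up for illustration, not an existing Discourse interface:

```python
from flask import Flask, request, jsonify
from gensim.models.doc2vec import Doc2Vec
from gensim.utils import simple_preprocess

app = Flask(__name__)
model = Doc2Vec.load("forum_doc2vec.model")  # loaded once at startup

@app.route("/similar", methods=["POST"])
def similar():
    # Rails would POST {"text": "<title + body>"} and get back the nearest topics
    text = request.get_json()["text"]
    vec = model.infer_vector(simple_preprocess(text))
    hits = model.dv.most_similar([vec], topn=10)
    return jsonify([{"topic_id": tag, "score": float(score)}
                    for tag, score in hits])

if __name__ == "__main__":
    app.run(port=5000)
```

On the Rails side the call would then just be an HTTP request (e.g. via Net::HTTP or Faraday) rather than a direct Ruby-to-Python bridge; I have not tried this against Discourse itself.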