Automatically re-categorize topics from an imported forum by targeting keywords in titles?

Hi,

I’m importing an old and large forum about unicycling.

The old categories weren’t the best, and a lot of different stuff was mixed together.

So, I’m re-organizing categories.

At first, I was thinking of manually re-categorizing the most recent few hundred topics and keeping the old ones as they are.
The idea would be to aim at the future, not at the past. It doesn’t matter much if old topics are badly categorized; what matters most is that they are still available.

But I’m wondering whether re-categorizing topics automatically by targeting keywords could, in fact, do a good job.

Currently, the vast majority of our topics, more than half of the total, are in a single category (:scream:).

I could target these keywords in the titles: “learn”, “learning”, “train”, “training”, “posture”, etc… And put all these topics in a category #riding-advice.
The same could go with “frame”, “wheel”, “tire”, “saddle”, etc… That would go in #unicycles-and-equipments.

I’ll target words surrounded by spaces, try to anticipate multi-word expressions, and prevent some of the “false positives”. Example: “wheelwalking” is a unicycle trick that should probably end up in #riding-advice, so if I target only “wheel” without thinking much, there will be false positives that could have been easily avoided (that said, I could move topics with “wheel” from A to B, and then move topics with “wheelwalking” from B to C…).
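
A minimal Python sketch of that idea, under stated assumptions: the rule lists and the `categorize` helper are hypothetical illustrations, not part of any import script. Rules are checked in order, so specific expressions such as “wheelwalking” are tested before the general “wheel”, and `\b` word boundaries stand in for the check on surrounding spaces so that punctuation and the start or end of a title are handled too.

```python
import re

# Hypothetical rules, checked in order: put specific expressions first so
# that "wheelwalking" is claimed by riding-advice before "wheel" is tried.
RULES = [
    ("riding-advice", ["wheelwalking", "wheel walking", "learn", "learning",
                       "train", "training", "posture"]),
    ("unicycles-and-equipments", ["frame", "wheel", "tire", "saddle"]),
]


def compile_rule(keywords):
    # \b anchors each keyword at word boundaries, so "wheel" does not match
    # inside "wheelwalking"; the optional "s" also accepts simple plurals.
    alternation = "|".join(re.escape(k) for k in keywords)
    return re.compile(r"\b(?:" + alternation + r")s?\b", re.IGNORECASE)


COMPILED = [(category, compile_rule(kws)) for category, kws in RULES]


def categorize(title):
    """Return the first category whose keywords appear as whole words, else None."""
    for category, pattern in COMPILED:
        if pattern.search(title):
            return category
    return None
```

With these rules, a title like “Wheelwalking tips” lands in riding-advice, “Which wheels for muni?” lands in unicycles-and-equipments, and a title matching nothing returns `None` and can simply stay where it is.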

Has anyone here done something like this? Do you have suggestions or ideas to minimize the risk of “false positives”? Are there obvious (or not so obvious) things that I need to know before doing this?

About 70,000 topics need to be looked at.

2 Likes

One bit of advice: do not view this as having to be done right the first time.

Your idea of seeking keywords is the same first approach I would take. Do not be afraid to throw out all of the work you did in your first attempt. If the result is not what you seek, take what you learned from the first attempt and start over from scratch.


EDIT

While doing a quick search for free word-analysis tools, I found this information page on Text Analysis. Nice read.

3 Likes

I have previously approached similar projects with unsupervised learning, using K-means clustering. That would be a pretty cool experiment and maybe the algorithm even comes up with a better categorization :wink:

You can read about such an approach here: Applying Machine Learning to classify an unsupervised text document | by vishabh goel | Towards Data Science
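
To give a rough feel for what K-means does with titles, here is a toy sketch in plain Python. It is not taken from the linked article: the function names are hypothetical, the vectors are simple word counts, and the first `k` titles seed the clusters so the result is deterministic. On a real forum you would more likely reach for an existing machine learning library and TF-IDF weighting.

```python
import math
import re
from collections import Counter


def vectorize(title, vocabulary):
    """Bag-of-words vector: one count per vocabulary word."""
    counts = Counter(re.findall(r"[a-z0-9]+", title.lower()))
    return [float(counts[word]) for word in vocabulary]


def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(vectors, k, iterations=20):
    """Plain Lloyd's algorithm; the first k vectors seed the centroids."""
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each vector joins its nearest centroid.
        labels = [min(range(k), key=lambda c: distance(v, centroids[c]))
                  for v in vectors]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Made-up titles for illustration only.
titles = ["new saddle", "learning to idle", "saddle review", "idle tips"]
vocabulary = sorted({w for t in titles for w in re.findall(r"[a-z0-9]+", t.lower())})
labels = kmeans([vectorize(t, vocabulary) for t in titles], k=2)
```

Here the two saddle titles end up with one label and the two idling titles with the other. The clusters come without names, so someone still has to look at each one and decide which category it corresponds to.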

Just like @EricGT said: don’t be afraid to iterate, but close enough is close enough, and maybe have some TL3 users ready to re-categorize where necessary.

7 Likes

That’s interesting!

I probably won’t have the time or the skills to try this approach though (the forum has been down for more than a month, and I still have a lot of work to do!).

After a first try, manually choosing keywords seems to give fairly good results, though I haven’t re-categorized anything yet and have just played with SQL queries.

-- Titles in category 10 that contain none of the keywords below
select title from topics
where category_id = 10
-- equipment keywords
and lower(title) not like '%saddle%'
and lower(title) not like '%crank%'
and lower(title) not like '%pedal%'
and lower(title) not like '%rim%'
and lower(title) not like '%carbon%'
and lower(title) not like '%spoke%'
and lower(title) not like '%wheel%'
and lower(title) not like '%frame%'
and lower(title) not like '%hub%'
and lower(title) not like '%tubeless%'
and lower(title) not like '%disk%'
and lower(title) not like '%hydraulic%'
and lower(title) not like '%duro%'
and lower(title) not like '%dominator%'
and lower(title) not like '%torker%'
and lower(title) not like '%nimbus%'
and lower(title) not like '%bearing%'
and lower(title) not like '%pad%'
and lower(title) not like '%repair%'
and lower(title) not like '%handlebar%'
and lower(title) not like '%kh%'
and lower(title) not like '%kris holm%'
and lower(title) not like '%coker%'
and lower(title) not like '%tube%'
and lower(title) not like '%build%'
and lower(title) not like '%29er%'
and lower(title) not like '%36er%'

-- riding keywords
and lower(title) not like '%backwards%'
and lower(title) not like '%riding%'
and lower(title) not like '%foot%'
and lower(title) not like '%train%'
and lower(title) not like '%training%'
and lower(title) not like '%learn%'
and lower(title) not like '%learning%'
and lower(title) not like '%dismount%'
and lower(title) not like '%habit%'
and lower(title) not like '%idle%'
and lower(title) not like '%idling%'
and lower(title) not like '%freemount%'
and lower(title) not like '%free mount%'
and lower(title) not like '%free mounting%';

This query returns 33000 topics of 52000 from the main category that could be re-categorized. The number seems realistic, but I still probably need to add more keywords.
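
One way to find the missing keywords is to count which words show up most often in the titles that no keyword matched yet. The sketch below assumes those titles have been exported to a Python list; `candidate_keywords` and the stop list are hypothetical and only meant as a starting point.

```python
import re
from collections import Counter

# Hypothetical stop list; extend it with whatever filler words show up.
STOP_WORDS = {"a", "an", "the", "to", "my", "for", "and", "of", "in", "on",
              "is", "i", "how", "what", "which"}


def candidate_keywords(unmatched_titles, top=20):
    """Count in how many unmatched titles each word appears, most frequent first."""
    counts = Counter()
    for title in unmatched_titles:
        # A set counts each word once per title, so a single repetitive
        # title cannot inflate a word's score.
        words = set(re.findall(r"[a-z0-9]+", title.lower()))
        counts.update(words - STOP_WORDS)
    return counts.most_common(top)
```

Words near the top of the list are either good additions to an existing keyword group or a hint that another category is missing.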

The method seems reliable enough.

2 Likes

What did you end up doing here?

If you have unique enough keywords in the topics (I assume you are iterating through all the topic replies and counting keywords in each post), it could be viable to automatically categorize a topic based on the presence of enough unique, specific keywords in that topic.
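
As a rough illustration of that idea (not what was actually run on this forum), the Python sketch below counts keyword hits over every post of a topic and only picks a category when the evidence is clear. The keyword sets, the thresholds, and the `categorize_topic` name are all assumptions.

```python
import re
from collections import Counter

# Hypothetical keyword sets per category.
KEYWORDS = {
    "riding-advice": {"learn", "learning", "idle", "idling", "freemount", "dismount"},
    "unicycles-and-equipments": {"saddle", "crank", "pedal", "hub", "spoke", "frame"},
}


def categorize_topic(posts, min_hits=3, min_margin=2):
    """Pick the category with the most keyword hits over all posts of a topic.

    Returns None unless the winner has at least min_hits hits and leads the
    runner-up by at least min_margin, so ambiguous topics stay where they are.
    """
    words = Counter()
    for post in posts:
        words.update(re.findall(r"[a-z0-9]+", post.lower()))
    scores = {cat: sum(words[k] for k in kws) for cat, kws in KEYWORDS.items()}
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    (best, best_score), (_, second_score) = ranked[0], ranked[1]
    if best_score >= min_hits and best_score - second_score >= min_margin:
        return best
    return None
```

Raising `min_hits` or `min_margin` trades coverage for fewer false positives, which is probably the right trade when nobody reviews the result topic by topic.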

(This is primarily useful for migrations, though, since on a live forum you’d want the topic in the correct category at the outset.)

2 Likes

I moved topics to other categories by checking keywords in their titles. It worked well enough to be better than the mess it was before.

3 Likes

That’s a good point; a certain specific word consistently appearing in a lot of topic titles is strong evidence that a new category is needed. :thinking:

4 Likes

Did you do this with a query? If so, what was the query template? Were any other steps needed after running the query to ensure database integrity?

1 Like

It looks like it was done with an import script, so the script was modified to infer a category from the title.

Are you importing? From which software? If it’s already in Discourse, you can do it from Rails.

1 Like

As far as I know, since I helped him with much of his Discourse-related work, he used a Rails script after the import. He selected topics by keywords in their titles, then used the officially documented commands to move them, such as those in Administrative Bulk Operations.

I also remember that when moving topics that had tags, the official commands and rake tasks did not fully update some tables, and neither did the related periodic Sidekiq job.
I don’t know if this is still the case, but it may be something to watch out for; see Bulk tagged topics, then moved topics into another category, but the category tag selector doesn't show tags - #3 by Canapin.

I hope this helps!

1 Like