Scrape a Discourse forum with Python

mermaldad · October 2, 2018, 2:28am

I am looking to semi-automate the process of turning a nominations thread into a table of nominations. I am using Python (specifically the Beautiful Soup library) to parse the HTML. The nominations thread consists of a bunch of posts in which users link to a thread or topic that they like. I have successfully written code that scrapes the nominations thread to find the username of the nominator, the link, and the picture(s) of the project. My routine can even handle posts with more than one link.

The roadblock I have reached is that if I follow the links, the resulting page also contains a number of posts from before the post being linked. I assume this is so the user has enough earlier context to scroll up after following the link. I can’t figure out how to spot the linked post or, alternatively, how to modify the link so that it shows only the post that was linked. Anyone got any suggestions?

P.S. it’s a little rough right now, but I’ll be happy to share my code when I get it working.

pfaffman · October 2, 2018, 1:27pm

If you access the raw Markdown pages, you won’t have to parse HTML and will have the URLs without accessing the counters. This topic is https://meta.discourse.org/raw/98520.
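As a rough sketch of the raw-page idea, a nomination link can be rewritten into its raw URL with a small helper. The function name `raw_url` is made up for illustration, and the single-post form `/raw/<topic_id>/<post_number>` is an assumption about how the forum routes raw pages, so check it against your own site before relying on it.

```python
import re
from urllib.parse import urlsplit

# Matches /t/<slug>/<topic_id> with an optional /<post_number>.
# The slug is optional, but an all-digit segment is never treated as a slug.
_TOPIC_PATH = re.compile(r"^/t/(?:(?!\d+/)[^/]+/)?(\d+)(?:/(\d+))?/?$")


def raw_url(link):
    """Turn a topic or post link into the matching raw Markdown URL.

    Hypothetical helper: the /raw/<topic_id>/<post_number> form for a
    single post is an assumption to verify on the target forum.
    """
    parts = urlsplit(link)
    match = _TOPIC_PATH.match(parts.path)
    if match is None:
        raise ValueError(f"not a Discourse topic link: {link}")
    topic_id, post_number = match.groups()
    url = f"{parts.scheme}://{parts.netloc}/raw/{topic_id}"
    if post_number:
        # Only append the post number when the link pointed at one post.
        url += f"/{post_number}"
    return url
```

If the single-post form works on your forum, fetching `raw_url(link)` for each nominated link would return only the linked post, which sidesteps the problem of the earlier posts appearing on the page.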

Edit: like others say, the API is really what you want. Here’s the scraper I wrote: pfaffman/discourse-downloader on GitHub (“Download a Discourse topic for offline analysis”).

rbrlortie · October 2, 2018, 1:37pm

You might want to use the API. Also, remember that almost all URLs will respond to a .json suffix.

E.g. this thread https://meta.discourse.org/t/scrape-a-discourse-forum-with-python/98520.json
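To illustrate how the `.json` route helps with spotting the linked post, here is a minimal sketch that picks one post out of an already-parsed topic response by its post number. It assumes the response keeps its posts under `post_stream` → `posts` and that each post carries `post_number`, `username`, and `cooked` (the rendered HTML); `find_post` is a made-up name, and the sample below is hand-written to show the shape, not real forum output.

```python
import json


def find_post(topic, post_number):
    """Return the post dict with the given post_number, or None if absent.

    `topic` is the parsed body of a topic's .json response. The
    post_stream/posts layout is an assumption to confirm on your forum.
    """
    for post in topic.get("post_stream", {}).get("posts", []):
        if post.get("post_number") == post_number:
            return post
    return None


# Hand-made sample shaped like a topic response (contents are placeholders).
sample = json.loads("""
{
  "id": 98520,
  "post_stream": {
    "posts": [
      {"post_number": 1, "username": "mermaldad", "cooked": "<p>placeholder 1</p>"},
      {"post_number": 2, "username": "pfaffman", "cooked": "<p>placeholder 2</p>"}
    ]
  }
}
""")

linked = find_post(sample, 2)
print(linked["username"], linked["cooked"])
```

The post number is the last path segment of a link such as `/t/<slug>/<topic_id>/<post_number>`. A long topic may not include every post in a single response, so a `None` result can mean the post simply was not in the page that was fetched.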

mcmcclur · October 2, 2018, 2:08pm

I also recommend that you use the API, rather than the HTML. That’s how I set up the archival tool described in this topic.

mermaldad · October 3, 2018, 1:33am

Thanks, all, your suggestions are very helpful! I particularly like that examples were offered in two languages. I’m going to go with the Python, but it’s cool to see the Ruby example as well.
