Scrape a Discourse forum with Python


#1

I am looking to semi-automate the process of turning a nominations thread into a table of nominations. I am using Python (specifically the Beautiful Soup library) to parse the HTML. The nominations thread consists of a series of posts in which users link to a thread or topic that they like. I have successfully written code that scrapes the nominations thread to find the username of the nominator, the link, and the picture(s) of the project. My routine can even handle posts with more than one link.

The roadblock I have reached is that if I follow the links, the resulting page contains a number of posts from before the post being linked. I assume this is to provide enough earlier context that the user can scroll up after following the link. I can’t figure out how to spot the linked post or, alternatively, modify the link so that it shows only the post that was linked. Anyone got any suggestions?

P.S. it’s a little rough right now, but I’ll be happy to share my code when I get it working.


(Jay Pfaffman) #2

If you access the raw Markdown pages, you won’t have to parse HTML and will have the URLs without accessing the counters. This topic is https://meta.discourse.org/raw/98520.
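
As a rough sketch of the raw-Markdown approach in Python: the `/raw/<topic_id>` pattern comes from the URL above, while the optional `/<post_number>` suffix for fetching a single post is an assumption about the forum's routes that is worth checking on your own site. The helper names `raw_url`, `fetch_raw`, and `extract_links` are made up for illustration.

```python
import re
from urllib.request import urlopen

# Matches http(s) URLs up to whitespace or a bracket/parenthesis,
# so Markdown links like [text](https://...) yield just the URL.
URL_PATTERN = re.compile(r"https?://[^\s<>()\[\]]+")


def raw_url(base, topic_id, post_number=None):
    """Build the raw-Markdown URL for a topic.

    The post_number suffix is an assumption; verify it on your forum.
    """
    url = f"{base.rstrip('/')}/raw/{topic_id}"
    if post_number is not None:
        url += f"/{post_number}"
    return url


def fetch_raw(url):
    """Download the raw Markdown as text (performs a network request)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8")


def extract_links(markdown):
    """Return every URL found in the Markdown, minus trailing punctuation."""
    return [m.rstrip(".,;:!?") for m in URL_PATTERN.findall(markdown)]
```

Something like `extract_links(fetch_raw(raw_url("https://meta.discourse.org", 98520)))` would then list the links in a topic. URLs that themselves contain parentheses would be cut short by this simple pattern.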

Edit: like others say, the API is really what you want. Here’s the scraper I wrote: GitHub - pfaffman/discourse-downloader: Download a Discourse topic for offline analysis


#3

You might want to use the API. Also, remember that almost all URLs will respond to a .json suffix.

E.g. this thread https://meta.discourse.org/t/scrape-a-discourse-forum-with-python/98520.json
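
For the original question of spotting the linked post, here is a minimal sketch built on the `.json` suffix. It assumes that links look like `/t/<slug>/<topic_id>/<post_number>` and that the topic JSON holds its posts under `post_stream` → `posts`, each with a `post_number` field; check both against a real response from your forum. The function names are hypothetical.

```python
import json
from urllib.parse import urlparse
from urllib.request import urlopen


def parse_post_link(url):
    """Return (topic_id, post_number) from a topic link.

    Assumes /t/<slug>/<topic_id>/<post_number>, where the slug and the
    post number are optional. A missing post number defaults to 1.
    """
    parts = [p for p in urlparse(url).path.split("/") if p]
    if not parts or parts[0] != "t":
        raise ValueError(f"not a topic link: {url}")
    rest = parts[1:]
    # Drop a trailing .json suffix if the link already has one.
    if rest and rest[-1].endswith(".json"):
        rest[-1] = rest[-1][: -len(".json")]
    # Skip the slug when present (assumes slugs are not purely numeric).
    if rest and not rest[0].isdigit():
        rest = rest[1:]
    if not rest or not rest[0].isdigit():
        raise ValueError(f"no topic id in link: {url}")
    topic_id = int(rest[0])
    post_number = int(rest[1]) if len(rest) > 1 and rest[1].isdigit() else 1
    return topic_id, post_number


def json_url(base, topic_id, post_number):
    """Build the JSON URL for the part of a topic around one post."""
    return f"{base.rstrip('/')}/t/{topic_id}/{post_number}.json"


def fetch_json(url):
    """Download and decode a JSON page (performs a network request)."""
    with urlopen(url) as response:
        return json.load(response)


def find_post(topic_json, post_number):
    """Return the post dict with the given post_number, or None."""
    for post in topic_json.get("post_stream", {}).get("posts", []):
        if post.get("post_number") == post_number:
            return post
    return None
```

With these, `find_post(fetch_json(json_url(base, *parse_post_link(link))), parse_post_link(link)[1])` should pick out just the linked post, even though the response also carries its neighbours.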


(Mark McClure) #4

I also recommend that you use the API, rather than the HTML. That’s how I set up the archival tool described in this topic.


#5

Thanks, all, your suggestions are very helpful! I particularly like that examples were offered in two languages. I’m going to go with the Python, but it’s cool to see the Ruby example as well.