Accessing Postgres Database for Analytics Pipeline

Hello all, I didn’t quite find a similar post to this, so here goes:

My org is building out our analytics dashboards (within Snowflake), and we're hoping to integrate the user data we have in Discourse.

Currently, we are self-hosting Discourse on an EC2 instance within Docker. It looks like there are a couple of options here:

  1. Use AWS Database Migration Service to access the Postgres server hosted on the EC2 instance. This would require exposing the Postgres port (which I don't believe is configured to be publicly accessible by default) and creating a new, read-only Postgres user (see the sketch after this list), but otherwise this seems like a pretty good solution.
  • A straight SQL dump to an S3 bucket (how backups work in Discourse) won't work, as SQL isn't accepted as a data format by Snowflake.
  2. Install the Data Explorer plugin, write the queries that output the data we want, then run and export the resulting data as CSVs/JSONs that we can then import into Snowflake.
  • This solution looks to have more steps than option 1 above, but it would also do the trick and has less of a chance of screwing up the Discourse DB.
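For option 1, here's a rough Python/psycopg2 sketch of the kind of read-only user I'd create. The host, role name, and passwords are placeholders, and SELECT-only privileges cover a DMS full-load copy; ongoing replication (CDC) would need additional replication privileges:

```python
# Sketch: create a read-only Postgres role for an external reader such as DMS.
# Host, role name, and passwords are placeholders -- adjust for your setup.
import os
import psycopg2

conn = psycopg2.connect(
    host="your-ec2-host",           # placeholder
    port=5432,
    dbname="discourse",             # Discourse's default database name
    user="postgres",
    password=os.environ["PG_SUPERUSER_PASSWORD"],
)
conn.autocommit = True

with conn.cursor() as cur:
    # psycopg2 interpolates the password client-side, so %s works here too
    cur.execute(
        "CREATE ROLE analytics_ro LOGIN PASSWORD %s;",
        (os.environ["ANALYTICS_RO_PASSWORD"],),
    )
    cur.execute("GRANT CONNECT ON DATABASE discourse TO analytics_ro;")
    cur.execute("GRANT USAGE ON SCHEMA public TO analytics_ro;")
    cur.execute("GRANT SELECT ON ALL TABLES IN SCHEMA public TO analytics_ro;")
    # Also cover tables this role creates later in the public schema
    cur.execute(
        "ALTER DEFAULT PRIVILEGES IN SCHEMA public "
        "GRANT SELECT ON TABLES TO analytics_ro;"
    )

conn.close()
```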

I would appreciate input or hearing anyone else’s tales of getting their DB data into an analytics pipeline. Thank you!

If Snowflake can talk to Postgres, then I'd open a Postgres port (with an appropriate firewall rule to limit access) and let it talk to the database.
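On the firewall point: if the EC2 instance sits behind a security group, the restriction can be expressed as an ingress rule limited to a known CIDR. A minimal boto3 sketch, assuming the group ID, region, and allowed CIDR are placeholders you'd fill in (and that the Postgres port has already been exposed from the Discourse container):

```python
# Sketch: allow inbound Postgres traffic on 5432 only from a specific CIDR.
# The security group ID, region, and CIDR below are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # assumed region

ec2.authorize_security_group_ingress(
    GroupId="sg-0123456789abcdef0",  # placeholder: the instance's security group
    IpPermissions=[
        {
            "IpProtocol": "tcp",
            "FromPort": 5432,
            "ToPort": 5432,
            # placeholder CIDR: restrict to whatever addresses actually need access
            "IpRanges": [
                {"CidrIp": "203.0.113.0/24", "Description": "analytics ingestion"}
            ],
        }
    ],
)
```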

If Snowflake can read CSV/JSON then you could use Data Explorer and retrieve the data automatically via the API.
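Something along these lines: a Python sketch that runs a saved Data Explorer query over the API and writes the result as CSV for Snowflake staging. The site URL, API key/username, and query ID are placeholders, and the `/run` endpoint and response shape are assumptions based on the Data Explorer plugin, so verify them against your installed version:

```python
# Sketch: pull a saved Data Explorer query via the Discourse API, save as CSV.
import csv
import os
import requests

SITE = "https://forum.example.com"   # placeholder
QUERY_ID = 7                          # placeholder: a saved Data Explorer query

resp = requests.post(
    f"{SITE}/admin/plugins/explorer/queries/{QUERY_ID}/run",
    headers={
        "Api-Key": os.environ["DISCOURSE_API_KEY"],
        "Api-Username": "system",     # placeholder admin username
    },
    data={"limit": "ALL"},            # assumption: return all rows, not one page
)
resp.raise_for_status()
payload = resp.json()

# Assumed response shape: a list of column names plus a list of row tuples.
columns = payload["columns"]
rows = payload["rows"]

with open(f"query_{QUERY_ID}.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(columns)
    writer.writerows(rows)
```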


This tracks. I'm probably going to try the direct route (my option 1). Storage is cheap, and dumping the whole DB into Snowflake now so we can define all of our analytics queries in one repo feels like a lot less overhead than defining some queries in Data Explorer for one set of analytics requests, then another set when they ask for more, and so on…