Searching for image URLs in topics

I have searched and read “searching for content effectively”, but have not found an answer.

I have a Discourse site where I am a moderator (I have no access to the backend). Someone has posted numerous topics where images have been hotlinked from a third-party hosting provider (in this case, Google Docs). They left the company, and all those image links are now broken.

I have manually gone through some of their posts to find and fix (thanks, Internet Archive) broken images, but that’s laborious. I’d like to get a list of every topic containing these broken image URLs so we can collectively fix them by re-uploading the images to the site.

I can of course search for with:images #tutorials, but I cannot search inside the image URLs for (for example) googleusercontent. Is that possible without API or backend rake access?

2 Likes

An admin could create a data explorer query that finds those posts.

But if the admin had wanted this not to happen, they’d have download images to local turned on. It’s a problem they have created, and it’s not really a moderator’s job to fix it.

3 Likes

Does that mean you can’t install data explorer either? That would be the tool of choice for this.

How are the images formatted in the posts? Do they only show the plain URL, or use [img], <img>, ![](url)…?

Just to illustrate your issue: a post could contain a broken image URL, such as https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiNaW4QQ43EQ-8qqQPntDP7so6Cg19PVSLN9bXv3ZhQqHZtomb8CGY3XArx3GIaZ04d0p9K3V-buaf73-M5dpq2wPuvnjsapStHdTkTVoPj2q9RAmcdczmE12HYz57PNOdVuft1/s1600-h/eastern_coastal_pcn_ap.jpg

The post contains the URL, but it returns something like:

```html
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiNaW4QQ43EQ-8qqQPntDP7so6Cg19PVSLN9bXv3ZhQqHZtomb8CGY3XArx3GIaZ04d0p9K3V-buaf73-M5dpq2wPuvnjsapStHdTkTVoPj2q9RAmcdczmE12HYz57PNOdVuft1/s1600-h/eastern_coastal_pcn_ap.jpg" target="_blank" rel="noopener" class="onebox">
    <img src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiNaW4QQ43EQ-8qqQPntDP7so6Cg19PVSLN9bXv3ZhQqHZtomb8CGY3XArx3GIaZ04d0p9K3V-buaf73-M5dpq2wPuvnjsapStHdTkTVoPj2q9RAmcdczmE12HYz57PNOdVuft1/s1600-h/eastern_coastal_pcn_ap.jpg" width="" height="" loading="lazy">
</a>
```

That markup contains no text that can be searched for.

Is that what happens?

3 Likes

Correct, I cannot install plugins.

They are formatted using standard markdown ![](url) where googleusercontent is part of the URL. For example:

![|312x416](https://lh7-rt.googleusercontent.com/docsz/AD_4nXeWkc1cZH8jtNveMhet36oWsLDlUxA-2QadGepx8Uuw1naq6vx5JAd6oyQ2pSmLJkKN97ZnTlV2txMqdNb0QMDCqV0xu-0xOFzePw2hnrNPUNbHoHMWh60KJpP3QkLq2E3Gp0-cKrf3tSWjML8oIQ3I9JQ?key=7YTVKNzk_oQvl95Fd_BKLQ)

If I search for googleusercontent, zero results are returned. Yet I can find posts that contain images referenced by a URL containing the text googleusercontent. I don’t know whether it is a bug or a feature that Discourse doesn’t search the URLs of markdown-formatted image links.

1 Like

I believe Discourse search is performed on the processed post, which contains HTML.
The search ignores HTML tags, and <img> tags contain no text, so it can’t return what you’re looking for.
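
As a rough illustration of why that happens (this is not Discourse’s actual indexing code, just a sketch using Python’s standard `html.parser`), stripping the tags from a processed post leaves none of the image URL behind. The `cooked` HTML and the shortened URL in it are made up for the example:

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collects only the text between tags; tag attributes are dropped."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def visible_text(html):
    """Return the text content of an HTML fragment, without any tags."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks).strip()


# Hypothetical processed post: the image URL lives only in the src attribute.
cooked = (
    '<p>See the diagram: <img src="https://lh7-rt.googleusercontent.com/'
    'docsz/example" alt="" width="312" height="416"></p>'
)

text = visible_text(cooked)
print(text)
print("googleusercontent" in text)
```

The URL only ever appears inside an attribute, so a search over the text alone has nothing to match.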

Why can’t you use the API?

You could create a local script that runs a search query for the user’s posts containing images, iterates through the results (slowly enough to stay under the rate limits; you can also fetch each post’s raw content if needed), and outputs the posts containing the substring you’re looking for.

Maybe there’s a simpler solution, but that’s what I would go for with no other option. Fairly simple to do.

Because the admin won’t give her an api key?

Because she’s not a programmer?

It seems like a problem that the admin created and isn’t interested in solving.

1 Like

Yeah, I mean, a key isn’t required for the API search and post endpoints needed for my suggestion, unless I’m mistaken?

And sure, it would ideally require basic programming knowledge, even if AIs can probably output a good base script.

The mentioned problem surely isn’t ideal to solve without admin access.

2 Likes

I haven’t requested an API key (bureaucracy), and I wasn’t aware that I’d need one to do what I perceived as a “simple” search query. I wasn’t aware it doesn’t peek into the HTML tags in the content. So that’s explained that, thank you.

It’s not a problem the admin created. It’s just a situation the admins and content creators were not aware of until someone left the company and access to Google Docs was shut off for that account, making the images disappear/break.

I agree that I could ask for an API key, or write something locally to scrape the site and find the offending posts. I’ll do one of those things.

Thanks for the responses. :pray:

2 Likes

You don’t need an API key to do searches.

I’m not sure having an API key could help you resolve your issue more easily.

Here’s an :robot: example Python script that loops over my posts on meta (1 post every 3 seconds) and prints those whose raw content contains the substring upload://:

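```python
import time

import requests

BASE_URL = "https://meta.discourse.org"
PAGE_SIZE = 50  # the script assumes a full page of search results holds 50 posts


def fetch_posts(page):
    """Return one page of search results for posts by @cocoquark."""
    url = f"{BASE_URL}/search.json?q=%40cocoquark&page={page}"
    response = requests.get(url)
    response.raise_for_status()  # stop on HTTP errors such as rate limiting
    return response.json()


def fetch_post_content(post_id):
    """Return the raw (unprocessed Markdown) content of a post."""
    url = f"{BASE_URL}/posts/{post_id}/raw"
    response = requests.get(url)
    return response.text


def process_posts():
    page = 1
    while True:
        print(f"page {page}")
        # Guard against a response that has no "posts" key.
        posts = fetch_posts(page).get("posts", [])

        for post in posts:
            content = fetch_post_content(post["id"])
            if "upload://" in content:
                print(f"{BASE_URL}/posts/{post['id']}")
            time.sleep(3)  # one post every 3 seconds to stay under rate limits

        # A short page means this was the last one.
        if len(posts) < PAGE_SIZE:
            print("No more results.")
            break

        page += 1


if __name__ == "__main__":
    process_posts()
```
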
Output:

```
page 1
https://meta.discourse.org/posts/1682015
https://meta.discourse.org/posts/1677389
https://meta.discourse.org/posts/1679834
https://meta.discourse.org/posts/1678673
https://meta.discourse.org/posts/1679833
https://meta.discourse.org/posts/1678629
https://meta.discourse.org/posts/1678229
https://meta.discourse.org/posts/1676531
https://meta.discourse.org/posts/1674982
https://meta.discourse.org/posts/1670250
https://meta.discourse.org/posts/1674421
https://meta.discourse.org/posts/1671959
https://meta.discourse.org/posts/1674355
https://meta.discourse.org/posts/1673357
https://meta.discourse.org/posts/1669322
https://meta.discourse.org/posts/1665519
page 2
https://meta.discourse.org/posts/1674153
https://meta.discourse.org/posts/1670613
https://meta.discourse.org/posts/1666606
https://meta.discourse.org/posts/1674992
https://meta.discourse.org/posts/1672811
https://meta.discourse.org/posts/1672050
https://meta.discourse.org/posts/1686260
https://meta.discourse.org/posts/1684497
https://meta.discourse.org/posts/1680692
https://meta.discourse.org/posts/1675012
page 3
No more results.
```
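
For the original question, the plain `"upload://" in content` check could be swapped for something that only looks at Markdown image links on a given host. A minimal sketch, assuming the images use the standard `![](url)` syntax mentioned earlier in the thread; `find_image_urls` and the sample `raw` text (including its shortened URL) are hypothetical names and data made up for illustration:

```python
import re

# Matches Markdown image syntax ![alt](url ...) and captures the URL.
# Simplification: a URL containing ")" or whitespace would be cut short.
IMAGE_PATTERN = re.compile(r"!\[[^\]]*\]\(\s*([^)\s]+)")


def find_image_urls(raw, needle):
    """Return the image URLs in raw Markdown that contain `needle`."""
    return [url for url in IMAGE_PATTERN.findall(raw) if needle in url]


# Made-up raw post content with one hotlinked and one uploaded image.
raw = (
    "Intro text\n"
    "![|312x416](https://lh7-rt.googleusercontent.com/docsz/example?key=abc)\n"
    "![local|100x100](upload://abc123.png)\n"
)

print(find_image_urls(raw, "googleusercontent"))
```

In the script above, `if find_image_urls(content, "googleusercontent"):` would then take the place of the `upload://` test, so a post that merely mentions the host in its text is not flagged.
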
1 Like

Wonderful, thanks @CocoQuark !
I do love a bit of python :pray:
I have now identified all threads with broken images with your help.
Much appreciated.

2 Likes

You don’t need an API key to do a simple search, but I don’t see the point in “using the API” to do a simple search.

Perhaps I misunderstood the issue. It sounded like an issue that wouldn’t have happened if download remote images to local had been on, and it’s on by default. But it’s also likely that the admin turned it off for some bureaucratic reason. I think it’s going to be unnecessarily hard to solve your problem without the data explorer plugin or access to Rails.

Dude! You rock!

2 Likes