Topic "popular links" panel domain extraction doesn't handle country TLDs

I think it is what we want. Just a hint where the link going – some domain in Japan. That’s enough, and correct.

The perfect is the enemy of the good, my friends.

(turns out Kumamoto is a prefecture in Japan, I guess analogous to a “state” in the “united states” so that’s plenty good. The US equivalent might be tulsa.oklahoma.us which means seeing a link as oklahoma.us is absolutely correct!)

IMHO as long as the link works I don’t have any strong preferences what shows as link text.

Would be easy and good enough to simply do “between // and the next first / when there is one”?

The “authority” portion.

The expectations of what is correct here is alittle confusing for me.

community.seqta.com.au seqta.com.au

This provides me with a hint that I’ll be heading to a domain registered in Australia at the seqta domain

vs

www.city.amakusa.kumamoto.jp kumamoto.jp

This provides me with a hint that I’ll be heading to a domain registered in the Kumamoto prefecture in Japan

Should I be reverting

https://github.com/discourse/discourse/commit/7690cc6ca50d48c86449dd9c7bf16301aa31ebc4

first? Otherwise, we’ll be showing the full domain for all links.

I strongly urge you to implement the algorithm I suggested.

  • it is simple
  • it works for the common cases
  • it is closer to the original intent of the way the feature was written
  • it fixes the country code TLD problem that was originally brought up

Can I confirm that the original intent is to “Extract the domain without the subdomains” even though the approach was too simplistic?

It only fixes that particular case though. If we introduce that solution, we’ll end up showing the full domain for domains like m.abv.bg which is what you didn’t want.

Isn’t this patently obvious?

  • meta.discourse.orgdiscourse.org
  • www.youtube.comyoutube.com

That sounds correct to me, the goal is to shorten, to reduce redundant or over specific information, but if it’s already quite short to begin with, why bother?

The actual edge condition would be something like

crazilylongthinghere.abv.bg

and I am willing to bet that is incredibly rare.

1 Like

There are differences between the two goals here to me :confused:

the goal is to shorten, to reduce redundant or over specific information

vs

Extract the domain without the subdomains

The first goal being subjective while the second has a definite answer.

Assuming that a C extension to this problem is already available, are we against pulling it in?

Yes, I am ABSOLUTELY against it. Far too much complexity and overhead for something we can handle with a simple heuristic. Please implement the heuristic I suggested.

Yet again, perfect is the enemy of good.

https://meta.discourse.org./

:arrow_right: Discourse Meta meta.discourse.org.

I really, really object to this :poop:

So at minimum we PLEASE need to suppress the part of the string beginning with www. from the display here… this is so hideous. I do not care about TLD perfection, what I care about is hideous awful repetitive output appearing for users.

1 Like

OK this is done in

https://github.com/discourse/discourse/commit/6593f0355f80a13f0f62993c7d776654141b61b4

5 Likes

Much better thank you @tgxworld

I built mini_suffix today which is a super simple wrapper around libpsl. Currently, it only exposes psl_registrable_domain for now because that is all we need.

Benchmarks

https://github.com/discourse/mini_suffix/blob/master/benchmark/bench.rb

PublicSuffix.domain total allocated memsize: 6574255
PublicSuffix.domain total retained memsize: 1133266
MiniSuffix.domain total allocated memsize: 8000
MiniSuffix.domain total retained memsize: 0

Warming up --------------------------------------
 PublicSuffix.domain     4.503k i/100ms
   MiniSuffix.domain    77.107k i/100ms
Calculating -------------------------------------
 PublicSuffix.domain     47.521k (± 2.2%) i/s -    238.659k in   5.024541s
   MiniSuffix.domain    875.595k (± 3.8%) i/s -      4.395M in   5.027237s

Comparison:
   MiniSuffix.domain:   875594.7 i/s
 PublicSuffix.domain:    47521.2 i/s - 18.43x  slower

As this introduces a dependency on libpsl, I’ll need to introduce libpsl to our base image first before I can add mini_suffix to Discourse and use it.

9 Likes

Yay feeling pretty happy about this, it is finally perfect :grin:

https://github.com/discourse/discourse/commit/8491c5fba55bd680e8478b2a4cbbe43c50adb32b

https://github.com/discourse/discourse/commit/9fbe1436b6fe5aa3c1dfb3b711cf3af1801fab20

9 Likes

This topic was automatically closed after 4 days. New replies are no longer allowed.