Topic "popular links" panel domain extraction doesn't handle country TLDs

It looks as though the domain extraction logic doesn’t understand country TLDs in domains – so it’s considering .com.au as a domain, rather than the more-appropriate seqta.com.au that the link uses.

This is, of course, a really tiny issue – at worst, it’s a bit confusing or meaningless. :slight_smile:

7 Likes

Our current logic only extracts the last two levels of the domain name:

https://github.com/discourse/discourse/blob/99abbc2e2d8a5a9050688c346b02d1e21b3c221d/app/assets/javascripts/discourse/widgets/topic-map.js.es6#L160-L163

I think it’ll be easier and clearer if we just show the domain instead of trying to figure out what the root domain is.
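A minimal sketch of the "just show the domain" approach (in Ruby for illustration; the actual code linked above is in the JS widget) would simply take the host from the URL verbatim, with no root-domain guessing:

```ruby
require "uri"

# Show the full host as-is; no attempt to find the "root" domain.
def link_domain(url)
  URI.parse(url).host
end

link_domain("https://community.seqta.com.au/some/topic")
# => "community.seqta.com.au"
```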

5 Likes

I agree that simply showing the full domain is probably okay.

If you want to keep the current “identify the actual domain” behavior, the Public Suffix List is probably a good place to start :slight_smile:

4 Likes

I vote against carrying a giant library or case statement just to remove a www once in a while.

My vote is to simply show the domain and do away with this magic.

If we MUST … keep the magic for domains that end with .com and .org

6 Likes

Fixed in

https://github.com/discourse/discourse/commit/7690cc6ca50d48c86449dd9c7bf16301aa31ebc4

5 Likes

Hmm can you provide some examples of old and new here?

1 Like

This is the new version where we show the full domain.

1 Like

Hmm that’s pretty nasty… can’t say I am a fan.

Couldn’t we have a simple regex that allows a few 2 and 3 letter dotted phrases at the end?

\w{3,}\.\w{1,3}(\.\w{1,3})?$

TLDs are a pain though, if they are long like funky.community… stuff is gonna break. I guess the general logic would be

  • grab the rightmost period and word chars next to it
  • if it is too short, grab the next leftward period and word too

This would handle com.au, as it is clearly way too short to be a real domain. com.com is also too short, I think. So the threshold is “must be more than 7 chars with just one period”.

I did some more research and it seems like the only practical way to solve this problem is to match the domains against the Public Suffix List.

If we want to, we could include the list server side (it’s only 188kb) and send it down to the client.

3 Likes

Why does the client need this? The server can just send the split-off domain and handle that in the serializer.

My issue with the public suffix gem, though, is that it bloats the Ruby process with A LOT of strings: the list file is big and stored in memory, at 1 RVALUE per domain minimum. publicsuffix-ruby/data/list.txt at main · weppos/publicsuffix-ruby · GitHub

6 Likes

Oops what I meant is we will determine the domain name server side. I don’t mean send down the entire list :sweat_smile:

I am totally open to including public suffix if we build a simple gem that uses https://github.com/rockdaboot/libpsl to perform these lookups :slight_smile: should only take a day or so to build and will help the entire Ruby community.

I am strongly against carrying the Ruby implementation here; it is a memory hog (and adds tons of RVALUEs to our heaps).

5 Likes

This is pointless @tgxworld – can you explain why my simple suggested logic is not sufficient? I don’t see why we need to check “real” tlds.

I discussed this with him and there are mountains of edge cases.

  • sam.github.io (should pick sam.github.io) - github.io is a public suffix
  • www.nytimes.com, mobile.nytimes.com (should pick nytimes.com as it’s not a public suffix)
  • community.smh.com.au (should pick smh.com.au)
  • bob.blogspot.com (should pick bob.blogspot.com) - blogspot.com is a public suffix

Something has to give here or we will junk the wrong part… it’s nice to properly attribute domains and shorten as much as possible.

For context, it appears Hacker News follows public suffix rules.
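For illustration, here is a toy Ruby sketch of how a Public Suffix List lookup resolves the cases above. The suffix set is a tiny hand-picked fragment; a real implementation would load the full list (via the gem or libpsl):

```ruby
require "set"

# Hand-picked fragment of the Public Suffix List, for illustration only.
PUBLIC_SUFFIXES = %w[com io au com.au github.io blogspot.com].to_set

# Find the longest matching public suffix, then keep one extra label
# to its left (the "registrable" part).
def registrable_domain(host)
  parts = host.split(".")
  parts.each_index do |i|
    next unless PUBLIC_SUFFIXES.include?(parts[i..-1].join("."))
    return parts[[i - 1, 0].max..-1].join(".")
  end
  host
end
```

With this lookup, `sam.github.io` and `bob.blogspot.com` survive intact (their suffixes are public), while `www.nytimes.com` collapses to `nytimes.com` and `community.smh.com.au` to `smh.com.au`.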

4 Likes

My logic covers all the listed cases.

I disagree that showing blogspot.com vs bob.blogspot.com is incorrect.

The whole point is that you want a hint of where you will be going, there is no rule saying it must be perfectly predictive. Showing blogspot.com and github.io is correct in this case.

I don’t agree it is correct, the whole reason for public suffix is so “blogger” and various other providers can provide “public suffixes”. That way it is clear that you are linking to my blog vs some random blog on blogger.

There are plenty of examples of public suffixes (github.io, blogger); Japan seems to be really into this, and the list goes on and on.

I am fine to shelve this as too hard for now, but the regex you have there is way too optimistic. If we are going to hack this I would just special-case it to:

  • Take last 3 parts

e.g. d.co.il (yellow pages in Israel) would otherwise show up as co.il, which is back to square one here.

2 Likes

That is NOT the point, the point is

where does this go?

versus

where does this go? blogger.com

The fact that it goes to blogger.com tells me it’s a blog; that’s the top-level domain this will lead me to if I click. That’s what I needed to know. I do NOT need to know that it goes to slappy.blogger.com.

You are scope creeping the feature far beyond what was intended and I strongly disagree. I believe the simple heuristic I described:

  • grab the rightmost period and word chars next to it
  • if it is too short (7 chars or less), grab the next leftward period and word too

… not a regex but an if-then … will be good enough, and more analogous to what was already there, versus covertly scope-creeping this up to perfection.

You are missing my bigger point: you are suggesting a very aggressive regex. If we want to cut corners and take a shortcut here, then fine.

I am fine with a shortcut that culls domains to three parts [part 1].[part 2].[part 3]

  • I prefer to err on the side of caution here, which is particularly good for international domains, and always take the last 3 parts. This adds more text but is a lot less edge-casey with international domains. … yes, this sucks for mobile.nytimes.com, but it is good for d.co.il, abc.net.au, and lots of other short internationals.

  • You are suggesting aggressively culling out [part 1], which works fine for .com and .org domains but is a lot less friendly to .co.uk and .com.au domains and so on.
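The “always take the last 3 parts” option is essentially a one-liner, and the trade-off above falls straight out of it (a sketch, not actual Discourse code):

```ruby
# Conservative shortcut: always keep (up to) the last three labels.
def last_three_labels(host)
  host.split(".").last(3).join(".")
end
```

This recovers `seqta.com.au` from `community.seqta.com.au` and leaves `d.co.il` intact, but it also leaves `mobile.nytimes.com` unshortened.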

EDIT

Just reread the suggested algorithm; always filling a buffer to a minimum of 7 chars, picking up to 3 segments, may work.

Not suggesting a regex at all. Just simple logic based on periods and string length.

example.co.uk

  1. Locate the rightmost period → .
  2. Add all non-period characters to the right and left of it → co.uk
  3. Is this string more than 7 chars? If yes, you are done. If not, add the next period to the left and the non-period characters before it → example.co.uk

And for jumbo.com

  1. .
  2. jumbo.com
  3. done, string is > 7 chars
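In Ruby, that if-then might look like this (a sketch of the heuristic as described, not actual Discourse code):

```ruby
# Heuristic: take the last two labels; if the result is 7 chars or
# fewer (too short to be a real domain), take a third label as well.
def short_domain(host)
  parts = host.split(".")
  candidate = parts.last(2).join(".")
  candidate = parts.last(3).join(".") if candidate.length <= 7
  candidate
end
```

`jumbo.com` clears the threshold and stays as-is, while `example.co.uk` and `community.seqta.com.au` get a third label pulled in.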

The problem here is that there is no good length that we can use to get all the cases right.

Let’s take www.city.amakusa.kumamoto.jp for example,

The right output we should get is

Where does this go? city.amakusa.kumamoto.jp

Note that just displaying amakusa.kumamoto.jp or kumamoto.jp is incorrect here because it is as good as displaying com.au, where we provide no indication of where the site is going.

Assuming we determine that 7 chars is a good length, the heuristic algorithm will only produce kumamoto.jp, which is not what we want. Just to get this case right, the length we use will have to be 17 chars excluding the periods, and we have to start considering the number of periods in the domain. If we bump the number of chars too much, we’ll end up displaying the full domain, like community.seqta.com.au, which brings us back to square one.
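The bind can be made concrete with a generalized version of the heuristic (a sketch, assuming it keeps adding labels right-to-left until the candidate exceeds some threshold, counting periods):

```ruby
# Generalized heuristic: add labels right-to-left until the candidate
# is longer than `threshold` characters (periods included).
def heuristic_domain(host, threshold)
  parts = host.split(".")
  (2..parts.length).each do |n|
    candidate = parts.last(n).join(".")
    return candidate if candidate.length > threshold
  end
  host
end
```

With a threshold of 7, `www.city.amakusa.kumamoto.jp` collapses to `kumamoto.jp`. Raising the threshold to 19 fixes that case but turns `community.seqta.com.au` back into the full domain: no single number gets both right.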

3 Likes