Topic "popular links" panel domain extraction doesn't handle country TLDs

barryvan · March 30, 2017, 4:32am

It looks as though the domain extraction logic doesn’t understand country TLDs in domains – so it’s considering .com.au as a domain, rather than the more-appropriate seqta.com.au that the link uses.

This is, of course, a really tiny issue – at worst, it’s a bit confusing or meaningless.

tgxworld · March 30, 2017, 6:33am

Our current logic only extracts the last two levels of the domain name:

https://github.com/discourse/discourse/blob/99abbc2e2d8a5a9050688c346b02d1e21b3c221d/app/assets/javascripts/discourse/widgets/topic-map.js.es6#L160-L163
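To make the failure mode concrete, here is a minimal Python sketch of a "last two levels" extraction. It is not the code at the link above (which is JavaScript); the helper name `last_two_levels` is hypothetical and only illustrates the behaviour described.

```python
def last_two_levels(hostname: str) -> str:
    """Return the last two dot-separated labels of a hostname (hypothetical helper)."""
    labels = hostname.lower().strip(".").split(".")
    return ".".join(labels[-2:])


print(last_two_levels("www.nytimes.com"))         # nytimes.com
print(last_two_levels("community.seqta.com.au"))  # com.au
```

For a plain .com host the result is useful, but for a .com.au host only the suffix survives.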

I think it’ll be easier and clearer if we just show the domain instead of trying to figure out what the root domain is.

fefrei · March 30, 2017, 9:32am

I agree that simply showing the full domain is probably okay.

If you want to keep the current “identify the actual domain” behavior, the Public Suffix List is probably a good place to get started.

sam · March 30, 2017, 1:44pm

I vote against carrying a giant library or case statement just to remove a www once in a while.

My vote is to simply show the domain and do away with this magic.

If we MUST … keep the magic for domains that end with .com and .org

tgxworld · March 31, 2017, 9:25am

Fixed in

https://github.com/discourse/discourse/commit/7690cc6ca50d48c86449dd9c7bf16301aa31ebc4

codinghorror · March 31, 2017, 9:34am

Hmm can you provide some examples of old and new here?

tgxworld · March 31, 2017, 9:48am

This is the new version where we show the full domain.

codinghorror · March 31, 2017, 9:53am

Hmm that’s pretty nasty… can’t say I am a fan.

Couldn’t we have a simple regex that allows a few 2 and 3 letter dotted phrases at the end?

\w{3,}\.\w{1,3}(\.\w{1,3})$
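A quick way to see what this pattern accepts is to run it against a few hostnames. The Python sketch below tests the pattern exactly as posted, plus a variant with the trailing group made optional. That variant is an assumption about the intent, not something stated in the post, and the helper name `extract` is hypothetical.

```python
import re
from typing import Optional

# The pattern exactly as posted: the trailing group is mandatory.
AS_POSTED = re.compile(r"\w{3,}\.\w{1,3}(\.\w{1,3})$")
# Assumed variant (not from the post): the trailing group made optional.
OPTIONAL_TAIL = re.compile(r"\w{3,}\.\w{1,3}(\.\w{1,3})?$")


def extract(pattern: re.Pattern, hostname: str) -> Optional[str]:
    """Return the part of hostname matched by pattern, or None if nothing matches."""
    match = pattern.search(hostname)
    return match.group(0) if match else None


for host in ("community.seqta.com.au", "www.nytimes.com", "funky.community"):
    print(host, extract(AS_POSTED, host), extract(OPTIONAL_TAIL, host))
```

As posted, the pattern only matches hosts that end in two short labels, so www.nytimes.com yields no match. With the optional tail it yields nytimes.com, and funky.community matches neither version.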

TLDs are a pain though, if they are long like funky.community… stuff is gonna break. I guess the general logic would be

grab the rightmost period and word chars next to it
if it is too short, grab the next leftward period and word too

This would handle com.au, as it is clearly way too short to be a real domain. com.com is also too short, I think. So the threshold is “must be more than 7 chars with just one period”.
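The two steps above can be sketched in Python as a plain if-then. The function name `short_domain` and the `threshold` parameter are hypothetical; the sketch assumes "too short" means 7 characters or fewer, counting the single period.

```python
def short_domain(hostname: str, threshold: int = 7) -> str:
    """Two-step heuristic: last two labels, widened to three if too short."""
    labels = hostname.lower().strip(".").split(".")
    # Step 1: the rightmost period and the word characters on both sides of it.
    candidate = ".".join(labels[-2:])
    # Step 2: if that is too short, also take the next label to the left.
    if len(candidate) <= threshold and len(labels) >= 3:
        candidate = ".".join(labels[-3:])
    return candidate


print(short_domain("community.seqta.com.au"))  # seqta.com.au
print(short_domain("www.nytimes.com"))         # nytimes.com
```

Note that the sketch widens the result only once, so it never returns more than three labels.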

tgxworld · March 31, 2017, 2:35pm

I did some more research and it seems like the only practical way to solve this problem is to match the domains against the Public Suffix List.

If we want to, we could include the list server side, only 188kb, and send it down to the client.

sam · March 31, 2017, 2:37pm

Why does the client need this? The server can just send the split-off domain and handle doing that in the serializer.

My issue with the public suffix gem, though, is that it bloats the Ruby process with A LOT of strings: this file is big and stored in memory, at least 1 RVALUE per domain (see publicsuffix-ruby/data/list.txt in weppos/publicsuffix-ruby on GitHub).

tgxworld · March 31, 2017, 2:50pm

Oops what I meant is we will determine the domain name server side. I don’t mean send down the entire list

sam · March 31, 2017, 3:09pm

I am totally open to including public suffix if we build a simple gem that uses https://github.com/rockdaboot/libpsl to perform these lookups. It should only take a day or so to build and will help the entire Ruby community.

I am strongly against carrying the Ruby implementation here, which is a memory hog (and adds tons of RVALUEs to our heaps).

codinghorror · March 31, 2017, 7:45pm

This is pointless @tgxworld – can you explain why my simple suggested logic is not sufficient? I don’t see why we need to check “real” tlds.

sam · March 31, 2017, 7:52pm

I discussed this with him and there are mountains of edge cases.

sam.github.io (should pick sam.github.io) - github.io is a public suffix

www.nytimes.com, mobile.nytimes.com (should pick nytimes.com as it's not a public suffix)

community.smh.com.au (should pick smh.com.au)

bob.blogspot.com (should pick bob.blogspot.com) blogspot.com is a public suffix

Something has to give here or we will junk the wrong part… it's nice to properly attribute domains and shorten as much as possible.

For context, it appears Hacker News follows public suffix rules.
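The four cases listed above can be reproduced with a suffix lookup. The Python sketch below uses a tiny hand-picked set as a stand-in for the Public Suffix List; the set contents, the name `PUBLIC_SUFFIXES` and the function `registrable_domain` are illustrative assumptions, and the real list also has wildcard and exception rules that are ignored here.

```python
# Tiny hand-picked stand-in for the Public Suffix List (illustrative only).
PUBLIC_SUFFIXES = {"com", "au", "com.au", "io", "github.io", "blogspot.com"}


def registrable_domain(hostname: str) -> str:
    """Return the longest matching public suffix plus one label to its left."""
    labels = hostname.lower().strip(".").split(".")
    # Try the longest candidate suffix first so "com.au" wins over "au".
    for start in range(len(labels)):
        suffix = ".".join(labels[start:])
        if suffix in PUBLIC_SUFFIXES:
            # Keep one extra label in front of the suffix, if there is one.
            return ".".join(labels[max(start - 1, 0):])
    # No known suffix: fall back to the full hostname.
    return ".".join(labels)


for host in ("sam.github.io", "www.nytimes.com",
             "community.smh.com.au", "bob.blogspot.com"):
    print(host, "->", registrable_domain(host))
```

The lookup itself is simple; the cost discussed in this thread comes from carrying the full list in memory.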

codinghorror · March 31, 2017, 7:53pm

My logic covers all the listed cases.

I disagree that showing blogspot.com vs bob.blogspot.com is incorrect.

The whole point is that you want a hint of where you will be going; there is no rule saying it must be perfectly predictive. Showing blogspot.com and github.io is correct in this case.

sam · March 31, 2017, 8:11pm

I don’t agree it is correct, the whole reason for public suffix is so “blogger” and various other providers can provide “public suffixes”. That way it is clear that you are linking to my blog vs some random blog on blogger.

There are plenty of examples of public suffixes: github.io, blogger, Japan seems to be really into this, and the list goes on and on.

I am fine to shelve this as too hard for now, but the regex you have there is way optimistic. If we are going to hack this I would just special case to

Take last 3 parts

eg: d.co.il (yellow pages in Israel) would show up as co.il which is back to square one here.
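The "take last 3 parts" fallback is simple enough to sketch in a few lines of Python. The function name `last_three_parts` is hypothetical.

```python
def last_three_parts(hostname: str) -> str:
    """Keep at most the last three dot-separated labels of a hostname."""
    labels = hostname.lower().strip(".").split(".")
    return ".".join(labels[-3:])


print(last_three_parts("d.co.il"))               # d.co.il
print(last_three_parts("community.smh.com.au"))  # smh.com.au
print(last_three_parts("mobile.nytimes.com"))    # mobile.nytimes.com
```

Short international domains come out intact, at the price of keeping the subdomain on ordinary .com hosts.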

codinghorror · March 31, 2017, 8:56pm

That is NOT the point, the point is

where does this go?

versus

where does this go? blogger.com

The fact that it goes to blogger.com tells me it’s a blog, the top level domain this will lead me to if I click. That’s what I needed to know, I do NOT need to know that it goes to slappy.blogger.com.

You are scope creeping the feature far beyond what was intended and I strongly disagree. I believe the simple heuristic I described:

grab the rightmost period and word chars next to it
if it is too short (7 chars or less), grab the next leftward period and word too

… not a regex but an if-then … will be good enough, and more analogous to what was already there versus hidden scope creeping this up to perfect.

sam · March 31, 2017, 9:05pm

You are missing my bigger point, you are suggesting a very aggressive regex, if we want to cut corners and do a shortcut here, then fine.

I am fine with a shortcut that culls domains to three parts [part 1].[part 2].[part 3]

I prefer to err on the side of caution here, which is particularly good for international domains, and always take the last 3 parts. This adds more text but is a lot less edge-casey with international domains. Yes, this sucks for mobile.nytimes.com, but it is good for d.co.il, abc.net.au and lots of other short internationals.
You are suggesting aggressively culling out [part 1], which works fine for .com and .org domains but is a lot less friendly to .co.uk, .com.au and similar domains.

EDIT

Just reread the algorithm suggested: always filling up a buffer to a minimum of 7 chars, picking up to 3 segments, may work.

codinghorror · March 31, 2017, 9:08pm

Not suggesting a regex at all. Just simple logic based on periods and string length.

example.co.uk

Locate the rightmost period → .
Add all non-period characters to the right and left of it → co.uk
Is this string more than 7 chars? If yes, you are done. If not, add the next period to the left and the non-period characters before it → example.co.uk

And for jumbo.com

.
jumbo.com
done, string is > 7 chars

tgxworld · March 31, 2017, 11:42pm

The problem here is that there is no good length that we can use to get all the cases right.

Let’s take www.city.amakusa.kumamoto.jp for example,

The right output we should get is

Where does this go? city.amakusa.kumamoto.jp

Note that just displaying amakusa.kumamoto.jp or kumamoto.jp is incorrect here because it is as good as displaying com.au where we don’t provide any indication of where the site is going.

Assuming we determine that 7 chars is a good length, the heuristic algorithm will only produce kumamoto.jp which is not what we want. Just to get this case right, the length that we use will have to be 17 chars excluding the periods and we have to start considering the number of periods in the domain. If we bump the number of chars too much, we’ll end up displaying the full domain like community.seqta.com.au which brings us back to square one.
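The threshold trade-off can be checked with a small Python sketch. It generalises the heuristic into a loop that keeps adding labels from the right until more than `min_chars` characters are kept, counting characters excluding periods as in the paragraph above. The function name `shorten` and the looping behaviour are assumptions made for illustration, not the heuristic exactly as proposed.

```python
def shorten(hostname: str, min_chars: int) -> str:
    """Add labels from the right until more than min_chars non-period chars are kept."""
    labels = hostname.lower().strip(".").split(".")
    kept = []
    for label in reversed(labels):
        kept.insert(0, label)
        # Always keep at least two labels; count characters excluding periods.
        if len(kept) >= 2 and sum(len(part) for part in kept) > min_chars:
            break
    return ".".join(kept)


for limit in (7, 17):
    print(limit, shorten("www.city.amakusa.kumamoto.jp", limit),
          shorten("community.seqta.com.au", limit))
```

With a limit of 7 the Japanese host collapses to kumamoto.jp, while a limit of 17 gets that host right but returns community.seqta.com.au in full.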
