It looks as though the domain extraction logic doesn’t understand country TLDs in domains – so it’s considering .com.au as a domain, rather than the more-appropriate seqta.com.au that the link uses.
Couldn’t we have a simple regex that allows a few 2 and 3 letter dotted phrases at the end?
\w{3,}\.\w{1,3}(\.\w{1,3})$
TLDs are a pain though, if they are long like funky.community… stuff is gonna break. I guess the general logic would be
grab the rightmost period and word chars next to it
if it is too short, grab the next leftward period and word too
This would handle com.au, as it is clearly way too short to be a real domain. com.com is also too short, I think, so the threshold is “must be more than 7 chars with just one period”.
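A minimal Ruby sketch of the if-then heuristic described above, for illustration only. The method name `short_domain` and the `min_length` parameter are hypothetical and not part of any existing codebase. It assumes "rightmost period and word chars next to it" means the last two labels, and that the length check counts the period.

```ruby
# Hypothetical helper sketching the heuristic; not an existing implementation.
# Assumption: the candidate is measured including its period, so "com.com"
# (7 chars) counts as too short and "example.com" (11 chars) does not.
def short_domain(host, min_length: 8)
  labels = host.downcase.split(".")
  return host.downcase if labels.length <= 2

  # Step 1: grab the rightmost period and the words on either side of it.
  candidate = labels.last(2).join(".")

  # Step 2: if that is too short, grab the next leftward period and word too.
  candidate = labels.last(3).join(".") if candidate.length < min_length
  candidate
end

puts short_domain("community.seqta.com.au")
puts short_domain("mobile.nytimes.com")
puts short_domain("d.co.il")
```

Under these assumptions, `community.seqta.com.au` is shortened to `seqta.com.au`, while `mobile.nytimes.com` is shortened to `nytimes.com`.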
I am totally open to including public suffix if we build a simple gem that uses https://github.com/rockdaboot/libpsl to perform these lookups. It should only take a day or so to build and will help the entire Ruby community.
I am strongly against carrying the Ruby implementation here, which is a memory hog (and adds tons of RVALUEs to our heaps).
The whole point is that you want a hint of where you will be going, there is no rule saying it must be perfectly predictive. Showing blogspot.com and GitHub.io is correct in this case.
I don’t agree it is correct. The whole reason for public suffix is so “blogger” and various other providers can provide “public suffixes”. That way it is clear that you are linking to my blog vs some random blog on blogger.
There are plenty of examples of public suffixes: github.io, blogger, Japan seems to be really into this, and the list goes on and on.
I am fine to shelve this as too hard for now, but the regex you have there is way optimistic. If we are going to hack this, I would just special-case it to:
Take last 3 parts
e.g. d.co.il (yellow pages in Israel) would show up as co.il, which is back to square one here.
The fact that it goes to blogger.com tells me it’s a blog, the top level domain this will lead me to if I click. That’s what I needed to know, I do NOT need to know that it goes to slappy.blogger.com.
You are scope creeping the feature far beyond what was intended and I strongly disagree. I believe the simple heuristic I described:
grab the rightmost period and word chars next to it
if it is too short (7 chars or less), grab the next leftward period and word too
… not a regex but an if-then … will be good enough, and more analogous to what was already there, rather than quietly scope creeping this up to perfect.
You are missing my bigger point: you are suggesting a very aggressive regex. If we want to cut corners and take a shortcut here, then fine.
I am fine with a shortcut that culls domains to three parts [part 1].[part 2].[part 3]
I prefer to err on the side of caution here, which is particularly good for international domains, and always take the last 3 parts. This adds more text but has far fewer edge cases with international domains. … yes, this sucks for mobile.nytimes.com but is good for d.co.il, abc.net.au and lots of other short internationals.
You are suggesting aggressively culling out [part 1], which works fine for .com and .org domains but is a lot less friendly to .co.uk, .com.au domains and so on.
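For comparison, a rough Ruby sketch of the "always take the last 3 parts" shortcut. The method name `last_three_parts` is hypothetical, used only to illustrate the trade-off described above.

```ruby
# Hypothetical helper: always keep at most the last three dot-separated parts.
def last_three_parts(host)
  host.downcase.split(".").last(3).join(".")
end

%w[mobile.nytimes.com d.co.il abc.net.au community.seqta.com.au].each do |host|
  puts "#{host} -> #{last_three_parts(host)}"
end
```

With this rule `mobile.nytimes.com` stays as it is (the extra text mentioned above), `d.co.il` and `abc.net.au` are kept whole, and `community.seqta.com.au` is trimmed to `seqta.com.au`.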
EDIT
Just reread the suggested algorithm: always filling a buffer to a minimum of 7 chars, picking up to 3 segments, may work.
Note that just displaying amakusa.kumamoto.jp or kumamoto.jp is incorrect here because it is as good as displaying com.au where we don’t provide any indication of where the site is going.
Assuming we determine that 7 chars is a good length, the heuristic algorithm will only produce kumamoto.jp which is not what we want. Just to get this case right, the length that we use will have to be 17 chars excluding the periods and we have to start considering the number of periods in the domain. If we bump the number of chars too much, we’ll end up displaying the full domain like community.seqta.com.au which brings us back to square one.
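The arithmetic above can be checked with a small Ruby sketch. `fill_buffer` is a hypothetical helper that keeps adding segments from the right while the character count, excluding periods, is at or below the threshold. The leading label `example` is made up for illustration, and no cap on the number of segments is assumed.

```ruby
# Hypothetical helper: add labels right-to-left until the buffer, counted
# without periods, exceeds the threshold.
def fill_buffer(host, threshold)
  picked = []
  host.downcase.split(".").reverse_each do |label|
    break if picked.sum(&:length) > threshold
    picked.unshift(label)
  end
  picked.join(".")
end

# "amakusa" (7) + "kumamoto" (8) + "jp" (2) = 17 chars excluding periods.
puts fill_buffer("example.amakusa.kumamoto.jp", 7)
puts fill_buffer("example.amakusa.kumamoto.jp", 17)
puts fill_buffer("community.seqta.com.au", 17)
```

With a threshold of 7 the first call stops at `kumamoto.jp`. Raising it to 17 picks up the label to the left of `amakusa.kumamoto.jp`, but the same threshold returns `community.seqta.com.au` in full.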