The State of JavaScript on Android in 2015 is... poor

Your data doesn’t cover a very important problem: Apple cheated to get their performance.

Architecture lock-in is a very real software problem. It’s one reason for using Java code in Android and is the reason why Google specifically targets 3 different ISAs. While the ARM lock on mobile isn’t nearly as strong as the MS and x86 lock on desktops, it is still very strong.

Even when better chips were available (e.g. the MIPS proAptiv, which was faster, half the die area, and lower power than the A15, with better CoreMark performance), the market still chose ARM, showing that the lock-in is real. With ARM, that’s not a terrible situation because you can either purchase their designs or create your own as long as you pay them – companies definitely prefer this to x86.

If you aren’t familiar, it takes 4-5 years to create a new CPU architecture, test/verify it, and tape it out. Not even the most powerful players in the game like Intel or IBM can beat this (they instead overlap design teams to keep the changes flowing). ARMv8 launched in Oct 2011.

Apple managed to not only tape out, but ship a chip using the new ISA in TWO years. Either PA Semi had super-engineers or something strange is going on. ARM’s simplest design (the A53 – based on an existing micro-architecture) didn’t ship until late 2014, and their “high-end” A57 didn’t get out the door until 2015. The A57 is a poor design with a lot of issues, indicating it was rushed out the door.

The A72 (shipping mid 2016) is a fixed version of the A57, restoring a lot of improvements that had to be sacrificed to get the A57 out the door, and it should come as no surprise that the fixed version finally starts to catch up with Apple’s design from 2013. Qualcomm is also taking 5 years to launch their ground-up redesigned processor (Snapdragon Kryo). Samsung’s custom core also doesn’t arrive until next year. Denver managed to catch up part-way, but only because it’s a Transmeta-based design and the biggest changes were to the firmware (and it’s been in progress for years – originally intended to get around x86 patents until they struck a deal with Intel).

If you ask around at these companies, pretty much everyone was beyond shocked when Apple not only launched in just two years, but shipped a chip that was actually good. After all, ARM would run afoul of quite a few laws if it could ever be proven that they had given Apple a head start. How did Apple do this? How did a third party launch 3 years before ARM’s own A72?

Apple’s changed ISAs before (PowerPC to x86), and it’s not easy (especially when all the code is native). Apple also has some loose ties to ARM, since they co-founded the company in the 90s (they sold the shares in the early 2000s, but shares aren’t the only important thing in business). With x86 being a one-horse race (AMD has basically been out of it for years), Apple needs a different desktop ISA.

MIPS was on the market and is a great ISA. The only problem is that switching away from ARM would be hugely disruptive. Instead, it would be nice if ARM would simply clean up their ISA – which they did. The resulting ISA is extremely similar to MIPS, which perhaps explains why ARM was so keen to spend hundreds of millions to get access to the MIPS patent portfolio after Imagination bought them.

The rest is supposition based on these (and a couple other) facts. Apple was one of the big companies rumored to be looking at getting MIPS. They start building a micro-architecture with a MIPS-like ISA in mind. They then come up with a cleaned-up ARM ISA that is very similar to MIPS, so they can continue work and go either direction late in the design game (in fact, they could probably design both control blocks at the same time as the micro-arch and the uncore would remain unchanged).

All that’s left is to talk to the ARM guys. Either ARM loses one of their biggest, most influential customers, or they negotiate to adopt the new ISA. Apple has around 18 months to finalize and tape out, with another 8 months to ramp up for the launch. This also explains ARM’s rushed A57 and the much longer delay by the other companies.

The big win for Apple is that they outmaneuver their competitors for at least 2 years and most likely 7-8 years.

7 Likes

Cordova + Crosswalk is a nice way to work around that problem. Cordova makes it possible to build apps with just HTML + CSS + JavaScript (and you can optionally add native “plugins”); Crosswalk is a plugin that makes the app use an optimized, recent version of Chromium instead of the built-in browser.

The Android built-in browser’s performance is abysmal at best; recent Chromium via e.g. Crosswalk gives a significant performance boost.

Your codebase needs to diverge only minimally, which is to say you can ship the same code for web and app, with the platform-specific code (if any is needed at all) wrapped in something like if (window.cordova) { ... }.
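As a minimal sketch of that pattern (the `share*` helpers are hypothetical app functions; only `window.cordova` and the `deviceready` event are real Cordova conventions):

```javascript
// Share one codebase between web and Cordova app: branch on window.cordova.
function isCordovaApp() {
  return typeof window !== "undefined" && !!window.cordova;
}

function share(text) {
  if (isCordovaApp()) {
    // Inside the app: wait for the native bridge, then call a plugin.
    document.addEventListener("deviceready", function () {
      shareViaPlugin(text); // hypothetical wrapper around a sharing plugin
    });
  } else {
    shareViaWebFallback(text); // hypothetical plain-web code path
  }
}
```

Everything else in the app ships identically to both targets; only this thin branch differs.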

They also turned out a smartphone that beat their entire competition (BlackBerry) back in the day. Normal chip manufacturing usually follows the schedules of the market; you don’t want to waste time researching a faster chip if nobody’s buying. The fact that Apple turned around a chip in 2 years is just a matter of want.

This isn’t a conspiracy; it’s the fact that someone was willing to pay for the effort. Google benefits from the fact that they use the same chips and can aim for a cheaper market using whatever gets pushed out of the ARM division. To put it another way, if Google were really keen on raw performance and put as much emphasis on, say, AMD, you’d probably see a pretty competitive mobile chip from them as well.

Really though, the issue is that the built-in browsers on Android suck.

Where are the profiling graphs? Final numbers mean very little without the flame charts. Where are the actual bottlenecks in the code?

1 Like

Wow, thanks! I actually wasn’t expecting this from my quasi-rhetorical question; I’m so jaded by speculations.

To look at this laterally, would it not make sense for Discourse to either hire devs or, internally, focus on optimization of the framework they rely on? If performance is an issue, which it seems to be given these threads, would it not make sense to appoint people whose sole job would be to deal with this, be it internal performance or upstream?

Having a Discourse fork wouldn’t be a bad thing for Ember.

We work on performance all the time: we upstream fixes, report bugs, and so on. In fact, front-page performance was not at an acceptable level until I moved us to raw rendering for each row.

“Just hire more devs”, as appealing as it sounds, has the real-world constraint of … hmmm … money.

All our devs are able to work on this problem, but we need to balance feature work and customer support and hmmm all those other things it takes to run a business :slight_smile:

The big :tm: problem (for Google) though is that somehow Safari has a roughly 40% edge over Chrome when it comes to running Ember stuff… on top of that, Apple somehow managed to put desktop-level performance into people’s pockets with the 6S.

1 Like

I realize how client-esque that sounds. What I mean is, if it’s a major concern, it needs at least one dedicated specialist. Hospitals don’t usually have their GPs doing cancer research between operations.

Not even every iPhone user is running these chips, so while this may be the future, it’s certainly not the present. Fixing Ember render waves is a bigger win regardless of chipset; hell, it’ll help folks on laptops get a few more minutes of battery life. Everything counts.

1 Like

I recently bought a Doogee X5 because I like pain. I found meta completely usable on it. First load is painful, and topic show could be faster, but it is still usable. I am the Discourse performance tsar; I spend a lot of time on performance. It says a lot about a company of ~7 full-time staff that it can keep this level of focus on performance. I blog about this stuff; I am constantly looking at performance.

I cannot afford to work on the Android Chrome 40% performance hole compared to Apple’s WebKit, or somehow magically make ARM chips on Android up their game. These are things Google needs to focus on, and they have a few more than 7 developers :slight_smile:

We also cannot afford, at the moment, to ship someone into a sealed room to redesign a brand new front-end framework for us; even if we could afford it, it would end in tears.

Instead, our approach is to rebuild our boat while it’s floating in the big, treacherous ocean. We want to keep sailing; we do not want to take it to the dock.

12 Likes

If you are suggesting that the reason Apple could bring a 64-bit ARM into production a couple of years before anyone else was that Apple blackmailed ARM to adopt an Apple ISA then I think you are wrong, and I will explain why. And I apologise for not understanding your post if that isn’t what you are suggesting.

To get one thing out of the way, I believe Apple did have a significant input into the 64-bit ARM architecture. And I’m certain they were not the only large company that had significant input. ARM works like that. They talk to some “partners” who are both interested in their future products and who may guide ARM to improve those products.

The whole issue of the timing of the introduction of ARM’s 64-bit is interesting. I’m sure they were working on their 64-bit architecture for a long time, initially to prepare the way for when they would need it, and then for real. I’m guessing here, but it seems likely that ARM’s timescale was dictated by when they thought 64 bits would be needed for consumer products, rather than servers. Having now observed a couple of word-length transitions (16 to 32, and 32 to 64), I can say the transition always takes longer than expected, and then happens very quickly. It takes longer because architects come up with mechanisms to extend addressability without increasing word length (c.f. PDP-11, x86, ARM A15). However, once the longer word-length machines are available, and the software is available, the transition happens quickly – because software development is easier.

I believe that Apple’s move to 64 bits so quickly was a huge surprise to their competitors. One reason that ARM’s processors took so long to develop is that ARM didn’t think they were needed so early. After all, why would you move to 64-bit if you could wait? Wouldn’t 64-bit just be bigger, more power hungry, and slower (after all, you need more memory bandwidth to support those 64-bit pointers)? Apple came to market with a processor on which (the important) 64-bit software ran faster than (the same) 32-bit software – because Apple exploited the extra bits in a pointer to significantly improve the performance of their run-time system (see mikeash.com: Friday Q&A 2013-09-27: ARM64 and You).
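The pointer trick can be illustrated with a toy sketch (purely conceptual, using BigInt arithmetic to stand in for 64-bit addresses – this is not Apple’s actual implementation): because heap objects are aligned, the low bits of a real pointer are always zero, so a runtime can reuse them to store small values inline and skip a heap allocation entirely.

```javascript
// Conceptual tagged-pointer sketch. With 16-byte-aligned objects, the
// low 4 bits of any real object address are always 0b0000, so a nonzero
// low nibble can mark a "pointer" that actually carries an inline value.
const TAG_BITS = 4n;
const TAG_MASK = (1n << TAG_BITS) - 1n; // 0b1111

// Pack a small integer payload into a tagged "pointer" (tag must be nonzero).
function makeTagged(payload, tag) {
  return (payload << TAG_BITS) | tag;
}

// Real pointers end in 0000; anything else is an inline tagged value.
function isTagged(ptr) {
  return (ptr & TAG_MASK) !== 0n;
}

// Recover the inline payload by stripping the tag bits.
function payloadOf(ptr) {
  return ptr >> TAG_BITS;
}
```

The win is that common small values (short strings, small numbers) never touch the allocator or the garbage collector, which is the kind of run-time speedup the mikeash article describes.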

Apple’s design capability extends not only to the processor (the design that you could licence from ARM) but also to the whole SoC and its software. This means that Apple can make tradeoffs which other chip companies cannot – for example, playing off cache size against clock frequency, or the number of processors against cache sizes. Knowing how your software works makes a huge difference here. If you are not vertically integrated, this is very difficult to do. To optimise across the system you would need deep cooperation between (for example) ARM (processor design), ST (SoC) and Google (Android).

If we look at the history of the Apple SoCs we can see increasing amounts of Apple’s design capability being deployed over time. Apple A4 (March 2010) used an ARM A8. Apple A5 (March 2011) used a dual ARM A9 with - I believe from the die photos - an Apple designed L2 cache. [This would be perfectly possible under a standard ARM licence as the A9 processor pair have a standard AXI bus interface to the L2 cache]. With the A6 (September 2012), Apple introduced their own 32-bit processor design, Swift, followed a year later (September 2013) by their first 64-bit processor (Cyclone), with Typhoon and Twister following.

A full implementation and verification of the ARMv7s architecture is pretty complex – there is a huge amount of cruft. There are also parts of the architecture which are difficult to implement very efficiently (e.g. the conditional behaviour, which looked like a really good idea back in the 1980s). It’s possible that Apple were able to back off on some parts of the microarchitecture in the knowledge that they didn’t affect (Apple’s) performance much. But Swift remains a very impressive processor; if I recall, it was earlier into production and higher performance than ARM’s A15.

I don’t think an ARMv8 implementation is much harder than a v7s. Especially if you are judicious in your choice of what to implement (and hence verify). I suspect that you can choose to have only user mode available in 32-bit mode, along with a 64-bit user mode and the rest of the exception levels in 64-bit mode. I don’t know whether that’s what Apple have done, but it would speed up the production of a 64-bit processor. By the way, ARM have not chosen to do this for A57 and A53 (I have no certain knowledge of more recent 64-bit processors), so I assume they have a harder job than Apple.

So, to summarise. I think Apple (and others) had input to ARMv8. I think Apple took (in particular) Qualcomm by surprise when it introduced and exploited its 64-bit processor as soon as it did. So one reason for Apple’s lead is that they chose to move quickly and allocate their resources accordingly. I also think Apple have been guided by their deep knowledge of the software that runs on their products to pay attention to the areas which matter most, and Apple have probably chosen to develop only a subset of the functionality that ARM are developing. Finally, it is probably the case that Apple are better at processor development than ARM. Don’t get me wrong – ARM are a great company – a great IP licensing company.

6 Likes

Even the iPhone 6 is substantially faster than anything on Android. The 6s widens the gap even further. Android handsets are mostly competitive-ish with the iPhone 5s, if they are very new.

Also @samsaffron it is not a 40% performance hole, it is a 3x to 5x performance hole – 300% to 500%, not 40%. We mitigate this by sending half the posts / topics to Android, so that cuts it to 150% – 250%.

More specifically:

  • When comparing devices with similar Geekbench single threaded scores, like the iPhone 5s and the Nexus 6p (both get around 1350) the performance difference is 1.6x in favor of the iPhone 5s. Of course one of those devices is from 2013, and one is from two months ago…

  • When comparing current flagship devices, it is not even close – the iPhone 6s is almost 5x faster than the Nexus 6p on render complex list in Ember.

1 Like

I guess that is my point: Apple has a huge edge CPU-wise, and there is an artificial 40%-60% that can be made up in software, just because of constant deopts in Chrome.

I think this discussion is heading in the wrong direction. It does not matter what the future will hold or what the current state of the art is. The focus should be on what is out there. Unfortunately, this means that you must assume T-3 years (i.e. 2012 for now) performance metrics for both mobile and desktop. On the other hand, this means there is an opportunity to optimize Discourse for any user on any device, including desktop, which means increased responsiveness and battery life for everyone.

Sidenote on devices/CPUs/market share: Single-core performance is unlikely to improve massively. It’s cheaper and more efficient to pack more cores into a chip than to max out existing ones. Apple does not follow this approach, since they seem to optimize on a dual-core stack (which works well for them). And while iPhones are #1 in the Western hemisphere, there is not only a shortage in every other region but also a gigantic tail of other devices – keep in mind that Apple is not the market leader; it’s just the iPhone. And this includes a huge number of older iPhones, including the 4s and 5/5C, which were sold for a long time even after their successors (and their successors) were announced (and are still selling in non-developed countries to target the lower-price market) – which means lower performance for a huge share of the population. Also mind that market share is not equal to mind share.

As for Discourse, I’d like to see improvements on the overall performance before or while thinking about a “lite” version. While I like a lot what I see so far, there are opportunities for reducing the workload on slower/older devices (again, both mobile and desktop!) without thinking about a greater architectural change in Discourse:

  • Update Ember to v1.13, which includes a new renderer similar to what React does. I expect this to help a lot with Discourse. [currently Ember v1.12 is used AFAICS]
  • Update Ember to v2.2, which (in addition to performance improvements) removes unnecessary cruft and thus lightens the load on network, memory and CPU. You can even use jQuery 2 with Ember v2 to further reduce amount of code and improve performance! (Ember upstream dependency update: https://github.com/emberjs/ember.js/pull/12321)
  • Pre-compile partials on the server
  • Delay everything that is not needed at render time (this may exclude pre-loading smaller payloads over the network but they should not be rendered unless really needed!) [partially done]
  • Do not render anything when a tab is in the background; aggressively throttle polls (both local and over the network) [partially done]
  • Reduce complexity of markup and layout (so both HTML and CSS), think about creating smaller, more flexible components [partially done, always optimizable]
  • Use push (through SSE or Websockets) instead of polling [this may be done already, no idea]
  • Be less aggressive with updates, i.e. update less often (let the browser GC and the CPU go to sleep), or maybe do not update at all but show indicators that something has changed and load the new data on user request only (i.e. by pushing a “load more” button) – this can be improved by heuristically determining whether the computing resources are available to update the site in-place vs. just showing indicators.
  • Think about parts that can be out-sourced to a (couple of) WebWorker(s) so multi-core CPUs are used more efficiently.
  • Reduce amount of data and generally optimize memory usage. Pushing less data through the hardware always makes a system more efficient.
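The background-throttling bullet above can be sketched like this. It’s a toy poll scheduler, with the hidden-state check injected as a function so the logic isn’t tied to the browser; in a real page you would pass `() => document.hidden` and also listen for `visibilitychange`:

```javascript
// Toy poll scheduler: skip work and back off aggressively while hidden.
function nextPollDelay(baseMs, isHidden, missedWhileHidden) {
  if (!isHidden()) return baseMs;
  // Exponential backoff while hidden, capped at 2 minutes.
  return Math.min(baseMs * 2 ** (missedWhileHidden + 1), 120000);
}

function startPolling(pollFn, baseMs, isHidden) {
  let missed = 0;
  function tick() {
    if (isHidden()) {
      missed += 1; // do no rendering or network work in the background
    } else {
      missed = 0;
      pollFn(); // foreground: poll at the normal rate
    }
    setTimeout(tick, nextPollDelay(baseMs, isHidden, missed));
  }
  setTimeout(tick, baseMs);
}
```

The point is that a hidden tab costs almost nothing (letting the CPU sleep), yet recovers the normal rate one tick after the user returns.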

While I don’t know what the team has done until now or what can still be optimized code-wise (I didn’t really look through the sources :confused:), I hope this list provides some points you can think about. Especially the Ember update should help!

3 Likes

Well, let’s check what happens on my blazing fast Skylake desktop PC and 64-bit Google Chrome latest stable:

http://emberperf.eviltrout.com/ render complex list (this test is the most representative of Discourse real world performance)

Ember version 1.11.3 – 39ms, 16.5% error
Ember version 1.12.0 – 46ms, 15% error
Ember version 1.13.10 – 97ms, 191% error
Ember version 2.0.2 – 94ms, 181% error
Ember “latest release” – 72ms, 29% error
Ember “latest beta” – 77ms, 26% error
Ember “latest canary” – 79ms, 32% error

Of course on Android you need to multiply these numbers by 5, on iOS multiply them by 2 (for iPhone 6s) or 3 (for iPhone 6).

We are at Ember 1.12 now, and moving to any later version would cause us to be almost twice as slow.

1 Like

Nexus 6p (flagship android) vs Doogee X5 (ultra cheap) vs iPhone 6 (prev generation apple), initial page load:

13 Likes

Interesting. Is this because the new Ember renderer is not optimized, or because the test suite / Discourse is not optimized for the new renderer? The Ember v1.13 post mentions the issue with too many observers (or keeping track of those), which (AFAICR) @sam mentioned a couple of times in this discussion as being an issue for Discourse.

That benchmark actually has no observers and is based entirely on rendering a bunch of nested views.

The glimmer engine in 1.13 is focused on re-render performance, not initial performance which is benchmarked in that test. Since then, the Ember team has been working hard on improving initial rendering performance and we’ve seen some minor improvements in the latest canary which is good.

I should add that the vast majority of Discourse templates don’t have any problems, and we could render much slower without anyone ever noticing. It’s mainly a problem on our topic lists and topic views. We’ve hacked around the topic list by rendering raw handlebars templates.

The topic view has a lot of string rendering now which is tricky to maintain and we’ve maxed out the performance gains on that approach. We’ve got a promising idea for improving render performance on that view though with a bunch of custom code. So, the good news is even if Ember performance doesn’t improve we will likely be able to make Discourse much faster.
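The raw-rendering idea can be illustrated with a toy sketch (none of this is Discourse’s actual code): instead of creating a bound view or component per row, build one HTML string for the whole list and hand it to the DOM in a single assignment.

```javascript
// Toy "raw rendering": plain strings instead of a bound view per row.
// No data binding, so an update means re-rendering the string, but
// initial render avoids per-row component setup entirely.
function escapeHtml(s) {
  return String(s)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

function renderTopicRow(topic) {
  return (
    '<tr class="topic-row">' +
    "<td>" + escapeHtml(topic.title) + "</td>" +
    "<td>" + topic.replyCount + "</td>" +
    "</tr>"
  );
}

function renderTopicList(topics) {
  // One concatenated string + one innerHTML assignment in the browser.
  return "<table>" + topics.map(renderTopicRow).join("") + "</table>";
}
```

The maintenance cost mentioned above comes from exactly this style: escaping, conditionals, and updates all have to be handled by hand instead of by the framework.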

10 Likes

Ember Version: 1.11.3
User Agent: Mozilla/5.0 (Linux; Android 6.0; Nexus 6P Build/MDB08L) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.76 Mobile Safari/537.36

Using Chrome stable on my Nexus 6P, I get the following. I’m using Ember 1.11 to try to match the benchmarks run in Jeff’s original blog post.

Any idea why the error is so high?

.---------------------------------------------------------.
|             Ember Performance Suite - Results           |
|---------------------------------------------------------|
| Name                | Speed | Error  | Samples |  Mean  |
|---------------------|-------|--------|---------|--------|
| Render Complex List |  3.49 | 414.73 |      43 | 286.72 |
| RCL (HTML)          |  3.32 | 530.24 |      40 | 301.63 |
'---------------------------------------------------------'

The error is so high because V8 screws up optimizing the JS for Ember (and Angular, etc) and causes constant deoptimizations. You can read the bug reports already filed on this in their bug tracker for more info.
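A toy illustration of the kind of pattern involved (illustrative only, not the actual Ember code paths): V8 specializes a hot property access for the object shapes it has seen, and a call site that keeps seeing new shapes goes megamorphic, running much slower and potentially deoptimizing.

```javascript
// V8 optimizes property access per "hidden class" (object shape).
// A loop that only ever sees one shape stays monomorphic and fast.
function sumWidths(items) {
  let total = 0;
  for (const item of items) total += item.width; // the hot property load
  return total;
}

// Same shape every time: monomorphic, V8-friendly.
const uniform = [{ width: 1 }, { width: 2 }, { width: 3 }];

// Same data, but four different shapes (extra properties, different
// property orders): the loop's inline cache sees many hidden classes.
// Frameworks can hit this when objects are built in varying orders.
const mixed = [
  { width: 1 },
  { width: 2, height: 5 },
  { height: 5, width: 3 },
  { id: 7, width: 4 },
];
```

Both arrays produce correct sums; the difference shows up only in the optimizer, which is why the benchmark timings swing so wildly between samples.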

Yikes, that’s a bit of an oversight for the company that made Angular, though obviously different divisions of the company. The “mean” looks alright, though, if it can be trusted with such a high error rate. At least we’re finally beating the iPhone 5 :confused: