Mali T760 - Architecture

The Midgard GPU architecture has been around for over two years now since we saw the first Mali T604MP4 implemented in the Exynos 5250. Since then ARM has steadily done iterative improvements to the GPU IP resulting in what we now find in the Exynos 5433: the Mali T760. Over the last summer we've had the pleasure of getting an architectural briefing and disclosure from ARM, which Ryan has covered in his excellent piece: ARM’s Mali Midgard Architecture Explored. It's a must-read if you're interested in the deeper workings of the current generation Mali GPUs employed by Samsung and other vendors.

The T760's predecessor, the T628, has been now used for over a year in the Exynos 5420, 5422, 5430, 5260 and HiSilicon Hi3630, and regrettably it wasn't as competitive as we would have wished, in terms of performance and in power efficiency. Hoping that the T760 brings the Mali GPU in better competitive shape, there are two main points that we should be expecting from the T760 in terms of advancements over the T628.

  1. Improved power efficiency due to internal wiring optimizations between the cores and the L2 cache.
  2. Improved performance and power efficiency due to both reduced bandwidth usage and increases in effective usable bandwidth by help of ARM Frame-Buffer Compression (AFBC).

Mali T760 - Synthetic Performance

To be able to do an architectural comparison, I lock the T628 and T760 to 550MHz so that we can get a directly comparable perf/MHz figure for both architectures. Since both SoCs have the same internal bus architecture, same memory speed, and even same cache sizes on the GPU, performance differences should be solely defined by efficiency gains of the new architectural additions of the new Midgard GPU in the Exynos 5433. To do this we use GFXBench as the base comparison benchmark as it is our de-facto benchmark for graphics.

 
GFXBench @ 550MHz Offscreen
  Mali T628MP6 Mali T760MP6 % Advantage
Manhattan 12.8 fps 14.5 fps 13.2%
T-Rex 31.5 fps 35.1 fps 11.1%
ALU performance 42.4 fps 41.6 fps -1.8%
Alpha-blending 3006.5 MB/s 3819 MB/s 27%
Driver overhead 29.1 fps 45.05 fps 55.3%
Fill-rate 2907 MTexels/s 3120 MTexels/s 7.3%

If we begin with the synthetic numbers first, we see that there is a non-change in ALU throughput, with only a minor 1.8% decrease in performance over the T628MP6. The tri-pipe design hasn't changed on the T760 so this was an expected result. The small decrease could be attributed to many things but I don't think it's worth any serious attention.

It's on the alpha-blending test things get interesting: performance jumps 27% compared to the T628 at the same frequency. I initially thought this would have been an effect of AFBC and the resulting increased bandwidth, but ARM has explained this is an architectural improvement of the new Midgard generation on the T7XX series. The 7.3% fill-rate increase might also be a gain of this particular addition to the T760.

Completely unrelated to any architectural changes is a massive 55% boost in the driver overhead score. The Galaxy Alpha shipped with an r4p0 driver release while the Note 4 sports r5p0 Mali drivers. This is an outstanding increase for Android as it not only gives the T760 and ARM's new drivers a big advantage over its predecessors, but it also now suddenly leads all Android devices in that particular benchmark by a comfortable margin.

We've seen that Android devices have traditionally suffered in this test while iOS phones and tablets lead by a factor of 3x. This confirms a long-standing suspicion that we're still a long way from achieving acceptable driver overhead on the platform. I was curious as to what ARM did here to achieve such a big increase, so I reached out to them and they happily responded. The r5pX drivers improve on three main areas:

  1. Workload efficiency: The shader compiler was updated to make better use of the shader pipelines and achieve better cycle efficiency; in the case of the T760 there are also optimizations to AFBC usage.
  2. Improved driver job scheduling: Improvements to the kernel job scheduler making sure that hardware utilization is as close to 100% as possible.
  3. Overall optimizations to reduce CPU load, allowing for more performance in CPU-bound scenarios and otherwise lower power consumption.

At the same time it should be noted that CPU performance certainly also plays a part in driver overhead scores - when measuring overhead we're essentially measuring how much CPU time is being spent on the drivers - though how big of a role is up for debate. A57 in particular offers some solid performance increases over our A15-backed Exynos 5430, which can influence this score. However in testing, by lowering the CPU performance I was only able to get down to 29fps in the driver overhead test when limiting the frequency to 1 GHz on the A57 cores, giving much lower performance than the what the 1.8GHz A15 would achieve.

Based on the above, I believe that if ARM can bring these driver improvements to previous generation devices there should be a sizable performance boost to be had. It won't be as great due to the lack of the corresponding A57 CPU, but it should still be a solid boost.

Overall, the aggregate of these new technologies can be measured in the Manhattan and T-Rex practical tests. Here we see an 11-13% increase of performance per clock. Ultimately this means that the raw performance gains from the T760 alone are not going to be huge, but then as T760 is an optimization pass on Midgard rather than a massive overhaul, there aren't any fundamental changes in the design to drive a greater performance improvement. What we do get then from T760 is a mix of performance and rendering efficiency increases, power efficiency increases, and the possibility to be a bit more liberal on the clock speeds.

Power Consumption

I was first interested to see how ARM's GPU would scale in power and performance depending on core count. To do this I locked the system again to the little cores as to avoid any power overhead the A57 cores might have on the measurement. I locked the GPU to 500MHz as I wanted to do a direct comparison in terms of scaling with the T628 and chose that frequency to run both devices; however, that ended up with bogus numbers for the T628 as there seems to be some misbehavior in the drivers of the Galaxy Alpha when trying to limit the core count to less than the full 6-core configuration.

In any case, I was more looking to get an impression of the inevitable overhead of the CPU and rest-of-SoC and how power consumption scales between cores. I'm using GFXBench's T-Rex Offscreen test to measure these power figures, again with already-accounted screen and system idle power consumption.

As we see in the chart, using additional cores on the Mali T760 seems to scale quite linearly. Adding an additional core adds on average another 550mW at 500MHz to the total power consumption of the system. If we actually go subtract this figure from the MP1 power value we end up with a difference of around 550mW that should be part CPU, RoS (Rest-of-SoC), memory, and common blocks on the GPU that are not the shader cores. Of course this is all assuming there is no power increase on the CPU and RoS due to increased load of the GPU. It's quite hard to accurately attribute this number without doing hardware modifications to the PCB board to add sense resistors to at least five voltage rails of the PMIC, but for our purposes it gives a good enough estimate for the now following measurements.

As mentioned I wanted to include the T628 in the same graphs, but due to driver issues I can only give the full-core power consumption numbers. The Exynos 5430 used 3.74W at the same 500MHz lock; this 110mW difference could very well be just the increased consumption of the A53 cores over the A7 cores, so it seems both GPUs consume the same amount of power at a given frequency.

In terms of performance, the Mali T760 scaled in a similar way as the power consumption:

Mali T760 Core Performance Scaling @ 500MHz
  MP1 MP2 MP3 MP4 MP5 MP6
fps 5.8 11.5 17 22.5 27.6 32.2
fps per core 5.8 5.75 5.66 5.62 5.52 5.36

We see a slight degradation of ~1-1.5% on the per-core fps with each additional core but this seems within reasonable overhead of the drivers and hardware. All in all, the scaling on Mali T760 seems to be excellent and we should be seeing predictable numbers from future SoCs that employ more shader cores. The T760 supports up to an MP16 configuration, and while I don't expect vendors to implement this biggest setup any time in the near future, we should definitely see higher core numbers in future SoCs this year.

Again, I can only give the MP6 number for the T628 here: 26.2fps. This means that the T760 is roughly 23% more efficient in terms of perf/W than its predecessor while operating at the same frequency. In the aspect of silicon efficiency, this is also an improvement. The Mali T760MP6 on the 5433 comes in at 30.90mm² 25mm² of the total SoC area, only a meager 0.29mm² more than the T628MP6 in the Exynos 5430.

While until now I was measuring performance and power consumption at a fixed frequency for facultative reasons, it's time to see what the GPU is actually capable of at it's full frequency. But there's a catch.

While I initially reported that the T760 on the 5433 was running at 700MHz, and that frequency is indeed used as a maximum of the DVFS mechanism, there's also a secondary maximum frequency of 600MHz. How can this be? In fact the DVFS mechanisms of both the 5430 and 5433 have a new addition in their logic that is both very unusual and maybe, dare I say, innovative in the mobile space.

Until now, and for all other SoCs other than these two, classical DVFS systems would base their scaling decision on the general load metric provided by the GPU drivers. This is normally computed by reading out some hardware performance counters from the hardware. The DVFS logic would take this load metric and apply simple up- and down-thresholds of certain percentages to either increase or decrease the frequency.

In the case of both 5430 and 5433, Samsung goes deeper into the architecture of the Mali GPU and retrieves individual performance counters of each element of a tri-pipe.

It evaluates the individual load of the (combined) arithmetic pipelines (ALUs), load/stores unit (LS), and the texture (TEX) unit. What it does is that when the load on the LS unit exceeds a certain threshold, the higher DVFS frequency of 700MHz is blocked off from being reached. Similarly it also denies the higher frequency when the ALU utilization is greater than 70% of the LS utilization.

In practice, what this does is that workloads that are more ALU heavy are allowed to reach the full 700MHz while anything that has a more balanced or texture-heavy usage stays at 600MHz. This decision making is only valid for the last frequency jump in the DVFS table, as anything below that uses the classical utilization metric supplied by the ARM drivers. This is the main reason why I needed to cap the GPU frequency to a P-state that is not affected by this mechanism, as it's not possible to disable it by traditional means and I needed to run the same frequency on both the T628 and T760 for the synthetic performance benchmarks.

Now as to why this matters for the general user, the 700MHz state in the Exynos 5433 comes with a vastly increased power requirement, as it's clearly running a high overdrive state at 1075mV versus 975mV at 600MHz. This means that the 700MHz state is 42% more power hungry than the 600MHz state, and in fact, we'll see this in the full power figures when running under the default DVFS configurations.

On the T628 this overdrive state happens on the 550MHz to 600MHz frequency jump. Here the power penalty should be less heavy as the voltage and frequency jump account for a "mere" theoretical 13% increase.

I've always suspected that the Manhattan test was more of an ALU heavy test compared to past Kishonti benchmarks. What this peculiar DVFS mechanism on the 5433 allowed me to do is actually confirm this as I could see exactly what is happening on the pipelines of the GPU. While the texture load is more or less the same on both Manhattan and T-Rex, the load/store unit utilization is practically double on the T-Rex test. On the other hand, Manhattan is up to 55% more ALU heavy depending on the scene of the benchmark.

The result is that power draw in Manhattan on the 5433 is coming in at a whopping 6.07W, far ahead of the 5.35W that is consumed in the T-Rex test. Manhattan should actually be the less power hungry test on this architecture if the tests were to be run at the same frequency and voltages. We can see this in the Exynos 5430, as the power draw on Manhattan is lower than on T-Rex, even though it's using that 13% more power intensive 600MHz hardware state.

Why Samsung chose to bother to implement this state is a mystery to me. My initial theory was that it would allow more ALU heavy loads to gain performance while still keeping the same power budget. ARM depicts the ALU inside the tri-pipe diagram in a smaller footprint as the LS or TEX units, which traditionally pointed out to a very rough correlation of the actual size of the blocks in silicon. The huge jump in power however does not make this seem viable in any realistic circumstance and definitely not worth the performance boost. I can only assume Samsung was forced to do this to close the gap between the Mali T760 and Qualcomm's Adreno 420 in the Snapdragon 805 variant of the Note 4.

I have yet to do a full analysis of today's current games to see how many of them actually have a heavy enough ALU workload to trigger this behavior, but I expect it wouldn't be that many. BaseMark X for example is just on this threshold as half the scenes run in the overdrive state while the other half aren't ALU heavy enough. For everything else the 600MHz state is the true frequency where the device will cap out.

Exynos 5430 vs Exynos 5433
T-Rex Offscreen Power Efficiency
  FPS Energy Avg. Power Perf/W
Exynos 5430 (stock) 31.3 80.7mWh 5.19W 6.03fps/W
Exynos 5433 (stock) 37.7 91.0mWh 5.85W 6.44fps/W
Exynos 5430 (little lock) 29.3 73.2mWh 4.71W 6.23fps/W
Exynos 5433 (little lock) 37.3 88.6mWh 5.70W 6.55fps/W

The end power/W advantage for the T760 at its shipped frequencies falls down to under 6% because of the higher clocks and voltages. Another peculiarity I encountered while doing my measurements was that the Exynos 5430 was affected much more by limiting the system to the little cores. Here the Mali hits CPU-bound scenarios immediately as the scores drop by 7% on the off-screen test and a staggering 26% on the on-screen results of T-Rex. Usually GFXBench is a relatively CPU light benchmark that doesn't require that much computational power, but the A7 cores are still not able to cope with simple 3D driver overhead. The 5433 on the other hand is basically not affected by this. This is a very good improvement as it means much less big core activity while gaming, one of big.LITTLE's most damning weak-points.

I'm still not satisfied with the power consumption on these phones, a little over 5.7W TDP is certainly unsustainable in a phone factor and will inevitably lead to thermal throttling (as we'll see in the battery rundown). Apple's SoCs here definitely have a power efficiency advantage no matter how you look at it. I'm curious to see how an Imagination GPU would fare under the same test conditions.

All in all, as the T700 series is largely an optimization pass on Midgard, the resulting T760 brings some improvements over its T628 counterpart but it's not as much of an improvement as I'd like to see. Since doing these measurements I've received a Meizu MX4Pro that also sports an Exynos 5430, but it came with an average binned chip as opposed to the almost worst-case found in the Galaxy Alpha that I've used for comparison here. In preliminary measurements I've seen almost a 1W reduction in power over the Alpha.

This leaves me with a blurred conclusion on the T760. There is definitely an improvement, but I'm left asking myself just how much of this is due to the newer drivers on the Note 4 or just discrepancies between chip bins.

Cortex A57 - Performance and Power CPU and System Performance
Comments Locked

135 Comments

View All Comments

  • ddriver - Tuesday, February 10, 2015 - link

    I'd like to see A57 performance without being so crippled by a ram bottleneck.
  • blanarahul - Wednesday, February 11, 2015 - link

    Loved this article. Only thing missing was gaming fps and power consumption comparison b/w LITTLE cluster only, big cluster only and big.LITTLE modes.
  • ddriver - Thursday, February 12, 2015 - link

    Also in true 64bit mode, cuz a lot of the perf improvements in v8 are not available in legacy 32bit mode.

    It is a shame really, samsung decided the uArch improvements would be enough to barely pass this chip as "incremental", they didn't bother to feed a higher throughput chip with a wider memory bus. As much as it pains me, apple did better in that aspect by not crippling their A7 chip, even if only because they needed it for a "wow factor" after so many generations of mediocre hardware, especially given the many exclusive initial shipment deals they secured to stay relevant.
  • thegeneral2010 - Wednesday, February 18, 2015 - link

    i like wat u say and i really like to see note 4 running on 64bit this would give samsung processors a great push forward and trust of consumers.
  • bigstrudel - Tuesday, February 10, 2015 - link

    If it wasn't completely obvious already:

    Apple A Series stands alone years ahead of the rest of the pack.
  • Flunk - Tuesday, February 10, 2015 - link

    But if they don't sell it to anyone else, it doesn't really matter does it?

    Apple doesn't compete with Samsung or Qualcomm when it comes to selling SoCs because they don't sell SoCs to other companies. A slight lead in CPU performance is not going to get people to buy an iPhone over and Android, if that's what they're set on buying.
  • xype - Tuesday, February 10, 2015 - link

    It does matter insofar as to be a benchmark of what is possible (as long as they are ahead). And let’s not pretend Apple’s CPUs sucking wouldn’t invite the same kind of comments—just like every situation where 2 competing technologies are compared.

    Platform/fanboy trolling aside, that’s something Android users benefit from as well. Apple being "stubborn" about 2 core CPUs, for example, is a nice counterweight to the 8 cores and 8 mini-cores and 8 quasi-cores trend that some CPU vendors seem to have a hard-on for, and it gives a nice real-world example of how such an approach to mobile CPU design works out, no?

    If Apple stays ahead in the mobile CPU game, the people using non-Apple phones will always have a target to point to and demand equality with. Otherwise they’d just have to live with whatever Qualcomm et al feed them.
  • bigstrudel - Tuesday, February 10, 2015 - link

    My comment isn't fanboy jingo-ism. Its fact.

    There's not a single Android ARM core on the market that can even match the power of the Apple A7's Cyclone cores much less A8's 2nd gen design.

    Were still waiting for anything custom to come out of the Android camp aside from the frankensteinish design of Nvidia's Denver core.

    I really shouldn't need to explain why to people on Anandtech.
  • ergo98 - Tuesday, February 10, 2015 - link

    The Tegra K1 64 bit is faster, core per core, versus the A8 (you do realize that the K1-64 has only 2 cores, right? I'm going to have to guess no, or you just are completely unable to read a chart). The A8x offers marginal per core performance advantages over the A8, and the primary benefit is the third core. The K1 64 is a A57 derivative, *exactly like the A8*.

    Your comments can only be construed as trolling. Can't match the A7? Give me a break.
  • tipoo - Tuesday, February 10, 2015 - link

    Ergo, you're completely off. The Denver K1 is a VLIW code morphing architecture - it has nothing to do with the Cortex A57, nor does the Apple Cyclone, they're both custom architectures.

    The K1 offers better performance in benchmarks, but as a result of code morphing, it can be hit or miss in real world, causing jank.

Log in

Don't have an account? Sign up now