NVIDIA Tegra X1 Preview & Architecture Analysis
by Joshua Ho & Ryan Smith on January 5, 2015 1:00 AM EST- Posted in
- SoCs
- Arm
- Project Denver
- Mobile
- 20nm
- GPUs
- Tablets
- NVIDIA
- Cortex A57
- Tegra X1
In the past few years, we’ve seen NVIDIA shift their mobile strategy dramatically with time. With Tegra 2 and 3, we saw multiple design wins in the smartphone space, along with the tablet space. These SoCs often had dedicated GPUs that were quite unlike what we saw in NVIDIA’s desktop and laptop GPUs, with a reduced feature set and unique architecture. However, with Tegra K1 we saw a distinct shift in NVIDIA’s SoC strategy, as the Tegra K1 was the first mobile SoC to achieve parity in GPU architecture with desktop GPUs. In the case of the Tegra K1, this meant a single Kepler SMX which made for truly incredible GPU performance. However, in the time since we’ve seen companies like Apple release new SoCs such as the A8X, which managed to bring largely similar performance with less power.
NVIDIA of course is never content to stay idle for too long and let themselves be surpassed, which has led to the company developing their Tegra SoCs on a roughly yearly cadence. In NVIDIA’s development timeline, 2014 brought us the Tegra K1-32, the company’s first SoC to integrate a modern high-performance GPU architecture (Kepler), and later on the TK1-64, a K1 with NVIDIA’s own in-house developed ARMv8 Denver CPU in place of the 32bit ARM A15 CPU.
NVIDIA's GTC 2014 Tegra Roadmap
2014 also brought a revision to NVIDIA’s public Tegra roadmap: for 2015 NVIDIA would bump the previously announced Parker SoC and release a new in its place, Erista. This schedule change ended up being quite a surprise due to the fact that NVIDIA had up until this time released their roadmaps roughly 2 years out, whereas Erista was put on the map less than a year before it was to be announced. More unusual, NVIDIA offered up minimal details of Erista; it would be powered by a Maxwell GPU with no additional information on the CPU or manufacturing process. Parker on the other hand – NVIDIA’s planned Denver + Maxwell + 16nm FinFet part – fell off the schedule, with NVIDIA not officially canceling it but also saying little else about it.
Now in 2015 and with the launch of the Tegra X1, we can finally begin putting the picture together. Erista as it turns out is something of a rapid release product for NVIDIA; what had been plans to produce a 16nm FF part in 2015 became plans to produce a 20nm part, with Erista to be that part. To pull together Erista NVIDIA would go for a quick time-to-market approach in SoC design, pairing up a Maxwell GPU with ARM Cortex A57 & A53 GPUs, to be produced on TSMC’s 20nm SoC process.
CPU
We’ll kick off our look at Erista then with a look at Erista’s CPUs, where one of the bigger surprises in Erista is the CPU configuration. While Tegra K1 had a version with NVIDIA’s custom Denver core for the CPU, NVIDIA has elected to use ARM’s Cortex A57 and A53 in the Tegra X1. The A57 CPU cluster has 2MB of L2 cache shared across the four cores, with 48KB/32KB L1s (I$+D$) per core. The A53 cluster has 512KB of L2 cache shared by all four cores and 32KB/32KB L1s (I$+D$) per core. NVIDIA representatives stated that this was done for time to market reasons.
However, rather than a somewhat standard big.LITTLE configuration as one might expect, NVIDIA continues to use their own unique system. This includes a custom interconnect rather than ARM’s CCI-400, and cluster migration rather than global task scheduling which exposes all eight cores to userspace applications. It’s important to note that NVIDIA’s solution is cache coherent, so this system won't suffer from the power/performance penalties that one might expect given experience with previous SoCs that use cluster migration.
Although cluster migration is usually a detriment to power and performance (especially without a cache coherent interconnect), NVIDIA claims that Tegra X1 significantly outperforms Samsung System LSI’s Exynos 5433 in performance per watt with 1.4x more performance at the same amount of power or half the power for the same performance on the CPU. It's likely that this difference comes from optimizing the implemented logic and differences between Samsung and TSMC 20nm processes rather than differences in thread management. NVIDIA continues to use System EDP (Electrical Design Point) management to control throttling and turbo rather than ARM's IPA (Intelligent Power Allocation) drivers, and in general it seems that while the CPU architecture is ARM's reference design it's quite clear that everything else is distinctly NVIDIA in design.
All of this of course raises the question of where’s Denver, and what is to happen to it? The use of the ARM Cortex A57 and A53, as NVIDIA tells it, was based on a time-to-market decision, and that NVIDIA could bring an off-the-shelf Cortex-based SoC to the market sooner than they could another Denver SoC. On the surface there is at least some evidence to agree with this, as NVIDIA would be able to buy a 20nm design from ARM versus spending the time porting Denver over from TSMC’s 28nm process to their 20nm process. At the same time however integrating an ARM CPU core into an SoC is not an easy task – logic LEGO this is not – so it’s hard to say whether this actually saved NVIDIA much time.
In any case, much like the still in the dark Parker design, NVIDIA is holding their cards close to their chest and hinting that Denver will still be in future generation products. If Parker is still part of NVIDIA’s plans – and they are without a doubt working on some kind of 16nm FF Tegra SoC – then we may still see Denver as part of the original Parker plan. Otherwise at this point while there’s no reason not to believe NVIDIA about the fate of future Denver SoCs, it’s anyone’s guess when we might see it again.
Uncore
Outside of the CPU and GPU, NVIDIA has also dramatically improved the rest of Tegra X1 in comparison with Tegra K1. We see a move from 64-bit wide LPDDR3 to 64-bit wide LPDDR4 on the memory interface, which improves peak memory bandwidth from 14.9 GB/s to 25.6 GB/s and improves power efficiency by around 40%. In addition, the maximum internal display resolution moves from 3200x2000 at 60 Hz to 3840x2160 at 60 Hz with support for VESA’s display stream compression. For external displays, we see a significant improvement with support for HDMI 2.0 and HDCP 2.2, which means that 4K60 is supported in contrast with the Tegra K1, which only supported 4K30.
The ISP is largely similar to the one we see in the Tegra K1 when it comes to feature set, but JPEG encode and decode rate is now five times as fast, going from 120 MP/s to 600 MP/s. For video encode and decode, we see support for 4K60 H.265 and VP9, with support for 10 bit color on decode on H.265. In addition to support for new codecs in the hardware blocks, we see that H.264 and VP8 now support 4K60, a notable improvement over the 4K30 that was the maximum for Tegra K1. We also see that the storage controller now supports eMMC 5.1 for faster storage performance.
194 Comments
View All Comments
chizow - Monday, January 5, 2015 - link
Careful, you do mean A8X right? Because Denver K1 is an actual product that absolutely stomps A8, only after Apple somewhat unexpectedly "EnBiggened" their A8 by increasing transistors and functional units 50%, did they manage to match K1's GPU and edge the CPU in multi-core (by adding a 3rd core).To say Denver K1 didn't deliver is a bit of a joke, since it is miles ahead of anything on the Android SoC front, and only marginally bested in CPU due to Apple's brute-force approach with A8X while leveraging 20nm early. We see that once the playing field has been leveled 20nm, its no contest in favor of Tegra X1.
Jumangi - Monday, January 5, 2015 - link
I mean a product that is widely available to CONSUMERS dude. And please stop with the "stomping" stuff. It means nothing about its performance with its also vastly higher power consumption. The A8 can exist in a smartphone. What smartphones have the K1? Oh that's right none because you would get a hour of use before your battery was dead. Mobile is about performance and speed. You can diss Apple all you want but from an SoC perspective they do it better than anyone else right now.pSupaNova - Tuesday, January 6, 2015 - link
@Jumangi try to comprehend what he is saying.Apple used a superior process on its A8X and more transistors to just edge the K1 in some CPU benchmarks. While core for core Nvida's is actually more powerful.
The GPU in the K1 also has near desktop parity etc OpenGL 4.4. Features like Hardware Tessellation are absent from the A8X.
Alexey291 - Tuesday, January 13, 2015 - link
That's great. It really is but lets be honest. A8x is faster than K1.And end of the day that is sadly all that matters to the vaaaaaast majority of consumers.
Frankly even that barely matters. What does though is that games run better on my tablet than they do on yours so to speak. (Actually likely they run better on yours since I'm still using a nexus 10 xD)
But sure the new paper launch from nv late this year our early next year will be great and the 2.5 devices that x1 will appear in will be amazing. Making sales in hundreds of thousands.
SM123456 - Sunday, February 1, 2015 - link
The point is that the Tegra K1 Denver on 28nm beats the Apple A8 fairly comprehensively on 20nm with the same number of cores. Apple stuck on 50% more cores and 50% more transistors to allow the A8X on 20nm to have a slight edge over the Tegra K1 Denver. This means if Tegra K1 is put on 20nm, it will beat the 3 core Apple A8X with two cores, and the same thing will happen when both move to 16nm.utferris - Monday, April 13, 2015 - link
Oh. Really? Denver K1 is not even as fast as A8X. Do not mention that it uses more than 2 times energy. I really do not understand people like you going around and saying how good nvidia shit is.eanazag - Wednesday, January 7, 2015 - link
It'll likely be in the next Shield.name99 - Monday, January 5, 2015 - link
(1) I wouldn't rave too enthusiastically about Denver. You'll notice nV didn't...Regardless of WHY Denver isn't on this core, the fact that it isn't is not a good sign. Spin it however you like, but it shows SOMETHING problematic. Maybe Denver is too complicated to shift processes easily? Maybe it burns too much power? Maybe it just doesn't perform as well as ARM in the real world (as opposed to carefully chosen benchmarks)?
(2) No-one gives a damn about "how many GPU cores" a SoC contains, given that "GPU core" is a basically meaningless concept that every vendor defines differently. The numbers that actually matter are things like performance and performance/watt.
(3) You do realize you're comparing a core that isn't yet shipping with one that's been shipping for three months? By the time X1 actually does ship, that gap will be anything from six to nine months. Hell, Apple probably have the A9/A9X in production TODAY at the same level of qualification as X1 --- they need a LONG manufacturing lead time to build up the volumes for those massive iPhone launches. You could argue that this doesn't matter since the chip won't be released until September except that it is quite likely that the iPad Pro will be launched towards the end of Q1, and quite likely that it will be launched with an A9X, even before any Tegra X1 product ships.
chizow - Tuesday, January 6, 2015 - link
@Name991) Huh? Denver is still one of Nvidia's crowning achievements and the results speak for themselves, fastest single-core ARM performance on the planet, even faster than Apple's lauded Cyclone. Why it isn't in this chip has already been covered, its a time to market issue. Same reason Nvidia released a 32-bit ARM early and 64-bit Denver version of Tegra K1 late, time to market. Maybe, in the tight 6 month window they would have needed between bringing Denver and working on Erista, they simply didn't have enough time for another custom SoC? I'm not even an Apple fan and I was impressed with Cyclone when it was first launched. But suddenly, fastest single-core and a dual-core outperforming 4 and even 8-core SoC CPUs is no longer an impressive feat! That's interesting!
2) Actually, anyone who is truly interested does care, because on paper, a 6-core Rogue XT was supposed to match the Tegra K1 in theoretical FLOPs performance. And everyone just assumed that's what the A8X was when Apple released the updated SoC that matched TK1 GPU performance. The fact it took Apple a custom 8-core variant is actually interesting, because it shows Rogue is not as efficient as claimed, or conversely, Tegra K1 was more efficient (not as likely since real world synthetics match their claimed FLOPs counts). So if 6 core was supposed to match Tegra K1 but it took 8 cores, Rogue XT is 33% less efficient than claimed.
3) And you do realize, only a simpleton would expect Nvidia to release a processor at the same performance level while claiming a nearly 2x increase in perf/w right? There's live demos and benchmarks of their new X1 SoC for anyone at CES to test, but I am sure the same naysayers will claim the same as they did for the Tegra K1 a year ago, saying it would never fit into a tablet, it would never be as fast as claimed yada yada yada.
Again, the A9/A9X may be ready later this year, but the X1 is just leveling the playing field at 20nm, and against the 20nm A8/X we see it is no contest. What trick is Apple going to pull out of its hat for A9/A9X since they can't play the 20nm card again? 16nm FinFET? Possible, but that doesn't change the fact Apple has to stay a half step ahead just to remain even with Nvidia in terms of performance.
lucam - Wednesday, January 7, 2015 - link
1) He was saying: why NV didn't continue with Denver design? Being so efficient and only 2 cores why don't shift at 20nn easily? Because they can't and that's it. The other things are speculations.2) You still compare apple (not Apple) with pears. Any vendors put inside his proprietary technology with their market strategy, important is to figure how GFLOPS and Texel is capable at same frequency and watt. You don't even know how Img cluster is built and nobody does and you still compare with NV cuda cores. Rogue XT frequency is set at 200mhz, Tegra K1 at 950mhz. Again what the heck you re talking about.
3) it is still a prototype type with a fan and nobody could check all the real frequency even though 1ghz seem reasonable. Hod dare you can compare a tablet with a reference board?
Again A9/A9X already exist now as prototypes, Apple doesn't sell chips and doesn't to any those sort of market. They need to see their product in a cycle year life. You live in another planet to not understand that.