Planet VideoLAN

Welcome to Planet VideoLAN. This page gathers the blogs and feeds of VideoLAN's developers and contributors. As such, it doesn't necessarily represent the opinion of all the developers, the VideoLAN project, ...


June 22, 2020

dav1d 0.7.1

Jean-Baptiste Kempf

Release 0.7.1

We just did a small release of dav1d called 0.7.1, just one month after 0.7.0.

It is a quick release that fixes a couple of bugs and adds more optimizations for ARM32 and SSE2.

ARM 32-bit

After spending a lot of time on ARM64 during 0.5.0 and 0.7.0, we're spending some time on the people who are stuck with older phones, still running on 32-bit platforms.

With these new optimizations, we're 28% faster than before when decoding the Chimera sample on a Snapdragon 835.

The result is that we're only 20%-25% slower in 32-bit compared to 64-bit, which is quite a feat.
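As a quick sanity check on how these percentages combine, here is a minimal Python sketch. The baseline of 50 fps is a made-up number, not a measurement from this release, and the sketch assumes that "20%-25% slower" means 32-bit runs at 75%-80% of the 64-bit frame rate.

```python
def faster_by(percent: float) -> float:
    """Speed ratio (new / old) for a decoder that is `percent` % faster."""
    return 1.0 + percent / 100.0


def slower_by(percent: float) -> float:
    """Speed ratio (slow / fast) for a decoder that is `percent` % slower."""
    return 1.0 - percent / 100.0


# Hypothetical baseline, NOT a measurement from this post.
old_fps_32bit = 50.0

# "28% faster than before" scales the old frame rate by 1.28.
new_fps_32bit = old_fps_32bit * faster_by(28)

# Assumption: "20%-25% slower than 64-bit" means 32-bit reaches 75%-80%
# of the 64-bit frame rate, so the implied 64-bit speed is the 32-bit
# speed divided by that ratio.
fps_64bit_low = new_fps_32bit / slower_by(20)
fps_64bit_high = new_fps_32bit / slower_by(25)

print(f"32-bit after the release: {new_fps_32bit:.1f} fps")
print(f"implied 64-bit range: {fps_64bit_low:.1f}-{fps_64bit_high:.1f} fps")
```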

Compared to gav1, we're now 2x-2.4x faster, in 32bit mode.

[Graph: dav1d vs gav1 on ARM32]

When comparing with numerous threading options, on a Galaxy S5, from 2014, we can see the following:

[Graph: dav1d vs gav1 on ARM32, by thread count]

With dav1d 0.7.1, we're able to decode the AV1 Chimera 1080p sample at more than 24 fps on a Galaxy S5 from 2014 on Android (32-bit)! Reaching 24fps does not even use the full CPU!
Once again, we see that the gav1 library has issues with threading.
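To put the 24 fps figure in perspective, the sketch below works out the per-frame time budget and how much capacity is left over when a decoder runs faster than real time. The 30 fps decode speed is a hypothetical value chosen for illustration, not a number read from the graph.

```python
def frame_budget_ms(target_fps: float) -> float:
    """Time available to decode one frame when playing at target_fps."""
    return 1000.0 / target_fps


def headroom(decode_fps: float, target_fps: float) -> float:
    """Fraction of decoding capacity left unused at target_fps.

    Negative when the decoder is too slow for real-time playback.
    """
    return 1.0 - target_fps / decode_fps


budget = frame_budget_ms(24)
# 30 fps is a hypothetical decode speed, used only as an example.
spare = headroom(decode_fps=30.0, target_fps=24.0)

print(f"budget per frame: {budget:.1f} ms, spare capacity: {spare:.0%}")
```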


On the desktop, we did some SSE2 optimizations for people whose CPUs lack SSSE3; they should see quite a bump in decoding speed.

We also did optimizations for the scaled mode, in AVX2. (This is used only by bitstreams that use the spatial scalability feature).


See you soon, for more speed improvements!

PS: thanks again to Nathan for the graphs.

June 22, 2020 09:37 AM

May 21, 2020

dav1d 0.7.0: mobile focus

Jean-Baptiste Kempf


New dav1d release:

  • 10% faster on Intel CPUs with 25% less RAM, assembly finished for 8bit
  • ARM64 assembly mostly done for 10/12bit in addition to 8bit
  • dav1d is twice as fast as gav1 on ARM CPUs and 4 times faster for 10b
  • 1080p AV1 decodable in real time with 2 LITTLE cores on Pixel 1

A few reminders about dav1d

[Graph: dav1d cores]

If you follow this blog, you should know everything about dav1d.

The VideoLAN, VLC and FFmpeg communities have been working on a new AV1 decoder, dav1d, in order to create the best and fastest decoder.

A new very fast release

0.7.0 is a major new release, whose focus is, once again, speed. It is doubly interesting because the improvements matter for both computers and smartphones.

The ref_mv rewrite

For once, the biggest speed improvement for desktops and laptops is not coming from writing more assembly code, but from Ronald's rewrite of the ref_mv algorithm.

This new algo gives an 8-12% speed improvement measured on Haswell machines while reducing memory usage by 25%.
We're talking about 10% faster for the complete AV1 decoding, which is a bigger impact than much of the assembly we wrote.

x86 Assembly

With the 0.7.0 release, the assembly for x86 CPUs (32-bit and 64-bit) is now totally complete for the 8-bit bitdepth.

We finished up all the small optimizations that remained for SSSE3 and AVX2, notably film grain, during the 0.6.0 and 0.7.0 development cycles. We added more AVX-512 assembly, for those with very recent CPUs.
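For readers wondering how several instruction sets coexist in one binary, the usual approach is to pick the most capable code path the CPU supports at startup. The Python sketch below only illustrates that selection idea; the kernel names and flag labels are hypothetical, and this is not dav1d's actual dispatch code.

```python
from typing import Callable, List, Set, Tuple

Kernel = Callable[[List[int]], int]


# Hypothetical stand-ins for hand-written kernels. They all compute the
# same result; in a real decoder only their speed would differ.
def sum_plain(values: List[int]) -> int:
    return sum(values)


def sum_ssse3(values: List[int]) -> int:
    return sum(values)


def sum_avx2(values: List[int]) -> int:
    return sum(values)


def sum_avx512(values: List[int]) -> int:
    return sum(values)


# Ordered from most to least capable; the flag labels are illustrative.
KERNELS: List[Tuple[str, Kernel]] = [
    ("avx512", sum_avx512),
    ("avx2", sum_avx2),
    ("ssse3", sum_ssse3),
]


def select_kernel(cpu_flags: Set[str]) -> Tuple[str, Kernel]:
    """Return the first (most capable) kernel whose flag the CPU reports."""
    for flag, kernel in KERNELS:
        if flag in cpu_flags:
            return flag, kernel
    return "plain", sum_plain  # portable fallback when nothing matches


name, kernel = select_kernel({"ssse3", "avx2"})
print(name, kernel([1, 2, 3]))
```

Keeping the table ordered means that adding a new instruction set is a one-line change, and older CPUs automatically fall through to the best path they support.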

In the future, getting faster on those Intel CPUs is going to be very difficult (I know I said that already many times, but this time it's true).

dav1d is still around 3x to 5x faster than aomdec on normal computers, and we are now even faster :). See older posts for more information.

ARM Assembly

As with 0.6.0, an important focus of dav1d 0.7.0 was ARM assembly, notably for the 10-bit bitdepth cases.

As of 0.7.0, most assembly you should care about is done for 8bit/10bit/12bit on ARM64, and this makes decoding AV1 on phones affordable.

ARM speed vs gav1

gav1 is an open source decoder made by Google to compete with dav1d on Android and ARM.

As of 0.7.0, dav1d is between 1.8x and 2.5x faster on 8b content and 2.4x to 5x faster on 10b content than gav1 on different CPUs.

[Graph: dav1d vs gav1]

This graph was made on an ODroid N2, for example.

Deep dive on ARM cores and performance

ARM CPUs for mobile devices have an architecture with both LITTLE and big cores, which offer different speeds and different power usage.

Using different types of cores lets a device consume only the power it needs for normal tasks, while still being able to go to maximum power when requested.

It is therefore extremely important to analyze the performance of our ARM code on both types of cores, and when mixing them.
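To measure each type of core separately on a Linux or Android device, one option is to group cores by their maximum frequency and pin the benchmark to one group. This is a rough sketch, assuming that the slowest cluster is the LITTLE one and that the cpufreq sysfs files are readable; the frequencies in the example are made up.

```python
import glob
import os
import re
from typing import Dict, Set


def read_max_freqs() -> Dict[int, int]:
    """Read each core's maximum frequency (kHz) from Linux cpufreq sysfs."""
    freqs: Dict[int, int] = {}
    pattern = "/sys/devices/system/cpu/cpu[0-9]*/cpufreq/cpuinfo_max_freq"
    for path in glob.glob(pattern):
        cpu = int(re.search(r"cpu(\d+)/cpufreq", path).group(1))
        with open(path) as f:
            freqs[cpu] = int(f.read().strip())
    return freqs


def little_cores(max_freqs: Dict[int, int]) -> Set[int]:
    """Assumption: the cores with the lowest maximum frequency are LITTLE."""
    if not max_freqs:
        return set()
    lowest = min(max_freqs.values())
    return {cpu for cpu, freq in max_freqs.items() if freq == lowest}


def pin_current_process(cores: Set[int]) -> None:
    """Restrict the calling process to the given cores (Linux only)."""
    os.sched_setaffinity(0, cores)


# Hypothetical 4-core layout in kHz, not read from a real device.
example = {0: 1600000, 1: 1600000, 2: 2200000, 3: 2200000}
print("LITTLE cores:", sorted(little_cores(example)))
```

On a real device one would call `pin_current_process(little_cores(read_max_freqs()))` before launching the decoder, so that child processes inherit the affinity mask.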

So let's have a look at how dav1d and gav1 compare on Chimera, the reference AV1 sample made by Netflix, on the SnapDragon 821 (Pixel 1 phone):

[Two graphs: dav1d cores]


What we can learn from those graphs is the following:

  • dav1d can decode this sample at 24fps, in all the above configurations, starting from 2 threads
  • gav1 is never able to decode that sample at 24fps, in LITTLE, big and big.LITTLE configurations
  • threading in gav1 is catastrophic: the more threads you add, the less efficient the decoding is
  • threading in dav1d is quite good: it always increases the performance, when you add more threads
  • max performance is around 2.3x faster in dav1d than gav1
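One way to quantify the threading behaviour described above is parallel efficiency: the speedup over one thread divided by the thread count. The numbers below are invented to show the two patterns; they are not taken from the graphs.

```python
from typing import Dict


def parallel_efficiency(fps_by_threads: Dict[int, float]) -> Dict[int, float]:
    """Efficiency per thread count: (fps_n / fps_1) / n. 1.0 is ideal scaling."""
    base = fps_by_threads[1]
    return {n: (fps / base) / n for n, fps in sorted(fps_by_threads.items())}


def always_improves(fps_by_threads: Dict[int, float]) -> bool:
    """True if every extra thread count yields a strictly higher frame rate."""
    ordered = [fps for _, fps in sorted(fps_by_threads.items())]
    return all(later > earlier for earlier, later in zip(ordered, ordered[1:]))


# Hypothetical frame rates (threads -> fps), for illustration only.
scales_well = {1: 20.0, 2: 38.0, 4: 70.0}
scales_badly = {1: 10.0, 2: 14.0, 4: 13.0}

for label, data in (("scales well", scales_well), ("scales badly", scales_badly)):
    print(label, parallel_efficiency(data), always_improves(data))
```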

For 10b, the situation is even worse for gav1.

I want to emphasize that dav1d can decode Chimera with 2 threads on the Pixel 1, from 2016, using only the LITTLE cores.

Focus on LITTLE cores on Android

What's interesting is to look at the performance of the LITTLE cores on Android, to see the actual speed of the decoder in low-power conditions.

Here, we tested all the thread configurations on the following Android devices:

  • Google Pixel 1 (SnapDragon 821) (2016)
  • Google Pixel 2 (SnapDragon 835) (2017)
  • Google Pixel 3 (SnapDragon 845) (2018)
  • Xiaomi Mi 9T Pro (SnapDragon 855) (2019)

Here are the results:

[Graph: dav1d cores]

Once again, we can see, on LITTLE cores:

  • dav1d is always at least 2x faster than gav1
  • we still see the previously mentioned threading issues on gav1
  • dav1d can decode Chimera at 24fps starting with 2 threads on the LITTLE cores, gav1 cannot
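The "starting with 2 threads" observation amounts to finding the smallest thread count whose frame rate reaches the target. A small sketch, using hypothetical frame rates rather than the measured ones:

```python
from typing import Dict, Optional


def min_threads_for(fps_by_threads: Dict[int, float],
                    target_fps: float = 24.0) -> Optional[int]:
    """Smallest thread count reaching target_fps, or None if none does."""
    for n in sorted(fps_by_threads):
        if fps_by_threads[n] >= target_fps:
            return n
    return None


# Hypothetical measurements (threads -> fps), for illustration only.
decoder_a = {1: 15.0, 2: 27.0, 3: 36.0, 4: 44.0}
decoder_b = {1: 7.0, 2: 12.0, 3: 15.0, 4: 16.0}

print("decoder A:", min_threads_for(decoder_a))
print("decoder B:", min_threads_for(decoder_b))
```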

AV1 10bit on LITTLE

For the sake of completeness, here are the results for 10b on the LITTLE cores:

[Graph: dav1d cores]

You can find all the details here, in the spreadsheet done by Nathan.

Conclusion and Thanks

dav1d is now a very fast decoder on desktops and laptops, but above all on mobile, where it shows very impressive performance on 8b and 10b content. It can decode 1080p with a couple of cores on mobile.

Thanks a lot to Nathan Egge, from Mozilla, who gathered all the data required for this post. He therefore did all the work for this blogpost.

May 21, 2020 03:36 PM