Welcome to Planet VideoLAN. This page gathers the blogs and feeds of VideoLAN's developers and contributors. As such, it doesn't necessarily represent the opinion of all the developers, the VideoLAN project, ...
We just made a small release of dav1d, 0.7.1, only one month after 0.7.0.
It is a quick release that fixes a couple of bugs and adds more optimizations for ARM32 and SSE2.
After spending a lot of time on ARM64 during 0.5.0 and 0.7.0, we're now spending some time on the people who are stuck with older phones that still run 32-bit platforms.
With these new optimizations, we're 28% faster than before when decoding the Chimera sample on a Snapdragon 835.
The result is that we're only 20%-25% slower in 32-bit compared to 64-bit, which is quite a feat.
Compared to gav1, we're now 2x-2.4x faster in 32-bit mode.
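For readers who want to redo this kind of comparison on their own numbers, here is a minimal Python sketch of how "X% faster", "X% slower" and "Nx faster" relate to measured frame rates. The function names and the fps values are made up for illustration; they are not measurements from this post.

```python
def percent_faster(new_fps: float, old_fps: float) -> float:
    """How much faster the new build is, relative to the old one, in percent."""
    return (new_fps / old_fps - 1.0) * 100.0


def percent_slower(slow_fps: float, fast_fps: float) -> float:
    """How much slower one configuration is, relative to the faster one, in percent."""
    return (1.0 - slow_fps / fast_fps) * 100.0


def speedup(fps_a: float, fps_b: float) -> float:
    """Speed ratio of decoder A over decoder B (2.0 means 'A is 2x faster')."""
    return fps_a / fps_b


if __name__ == "__main__":
    # Hypothetical frame rates, chosen only to illustrate the arithmetic.
    print(round(percent_faster(32.0, 25.0), 1), "% faster than the previous release")
    print(round(percent_slower(24.0, 30.0), 1), "% slower in 32-bit than in 64-bit")
    print(round(speedup(48.0, 20.0), 2), "x faster than the other decoder")
```

Note that the two percentages are not symmetric: being 20% slower than a reference is the same as the reference being 25% faster, so it matters which side is used as the baseline.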
When comparing numerous threading options on a Galaxy S5 from 2014, we can see the following:
With dav1d 0.7.1, we're able to decode the AV1 Chimera 1080p sample at more than 24 fps on a Galaxy S5 from 2014 on Android (32-bit)!
Reaching 24 fps does not even use the full CPU!
Once again, we see that the gav1 library has issues with threading.
On the desktop, we did some SSE2 optimizations for people whose CPUs lack SSSE3; they should see quite a bump in decoding speed.
We also did optimizations for the scaled mode, in AVX2 (this is used only by bitstreams that use the spatial scalability feature).
See you soon, for more speed improvements!
PS: thanks again to Nathan for the graphs.
Dav1d new release:
If you follow this blog, you should know everything about dav1d.
The VideoLAN, VLC and FFmpeg communities have been working on a new AV1 decoder, dav1d, in order to create the best and fastest decoder.
0.7.0 is a major new release, whose focus is, once again, speed. It is doubly interesting, because the improvements matter for both computers and smartphones.
For once, the biggest speed improvement for desktops and laptops is not coming from writing more assembly code, but from Ronald's rewrite of the ref_mv algorithm.
This new algo gives an 8-12% speed improvement measured on Haswell machines while reducing memory usage by 25%.
We're talking about roughly 10% faster for the complete AV1 decoding; that's a bigger impact than a lot of the assembly we wrote.
With the 0.7.0 release, the assembly for x86 CPUs (32-bit and 64-bit) is now totally complete for the 8-bit bitdepth.
We finished up all the small optimizations that remained for SSSE3 and AVX2, notably film grain, during the 0.6.0 and 0.7.0 development cycles. We added more AVX-512 assembly, for those with very recent CPUs.
In the future, getting faster on those Intel CPUs is going to be very difficult (I know I said that already many times, but this time it's true).
dav1d is still around 3x to 5x faster than aomdec on normal computers, and we are now even faster :). See older posts for more information.
As with 0.6.0, an important focus of dav1d 0.7.0 was ARM assembly, notably for the 10-bit bitdepth cases.
As of 0.7.0, most of the assembly you should care about is done for 8bit/10bit/12bit on ARM64, and this makes decoding AV1 on phones affordable.
gav1 is an open source decoder made by Google to compete with dav1d on Android and ARM.
As of 0.7.0, dav1d is between 1.8x and 2.5x faster on 8b content and 2.4x to 5x faster on 10b content than gav1 on different CPUs.
This graph, for example, was made on an ODroid N2.
ARM CPUs for mobile devices have an architecture with both LITTLE and big cores, which offer different speeds and different power usage.
Using different types of cores lets the device consume only the power it needs for normal tasks, while still being able to go to maximum power when required.
It is therefore extremely important to analyze the performance of our ARM code on both types of cores, and when mixing them.
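To benchmark on one type of core only, you first need to know which CPU ids belong to which cluster. Here is a small Python sketch of one way to guess that on a Linux-based system, assuming the kernel exposes per-core maximum frequencies under the cpufreq sysfs directory. The helper names and the frequencies in the example are hypothetical, and this is not how the measurements in this post were set up.

```python
from pathlib import Path


def read_max_freqs(root: str = "/sys/devices/system/cpu") -> dict[int, int]:
    """Read each core's maximum frequency (kHz) from sysfs, keyed by CPU id.

    Assumes a Linux kernel with cpufreq enabled; returns {} otherwise.
    """
    freqs = {}
    for path in Path(root).glob("cpu[0-9]*/cpufreq/cpuinfo_max_freq"):
        cpu_id = int(path.parent.parent.name[3:])  # "cpu5" -> 5
        freqs[cpu_id] = int(path.read_text().strip())
    return freqs


def split_clusters(freqs: dict[int, int]) -> tuple[set[int], set[int]]:
    """Split CPU ids into (little, big) by maximum frequency.

    Heuristic: cores with the lowest maximum frequency are treated as LITTLE,
    everything else as big. If all cores share one frequency, they all land
    in the first set and the second set is empty.
    """
    if not freqs:
        return set(), set()
    lowest = min(freqs.values())
    little = {cpu for cpu, freq in freqs.items() if freq == lowest}
    return little, set(freqs) - little


if __name__ == "__main__":
    # Hypothetical 2+2 layout, used when sysfs is not available.
    example = {0: 1_600_000, 1: 1_600_000, 2: 2_150_000, 3: 2_150_000}
    little, big = split_clusters(read_max_freqs() or example)
    print("LITTLE cores:", sorted(little))
    print("big cores:", sorted(big))
```

On Linux, the resulting set can then be passed to `os.sched_setaffinity(0, little)` to restrict the current process to those cores before launching a benchmark.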
So let's have a look at how dav1d and gav1 compare on Chimera, the reference AV1 sample made by Netflix, on the Snapdragon 821 (Pixel 1 phone):
What we can learn from those graphs is the following:
For 10b, the situation is even worse for gav1.
I want to emphasize that dav1d can decode Chimera with 2 threads on the Pixel 1, from 2016, using only the LITTLE cores.
So, what's interesting is to look at the performance of the LITTLE cores on Android to see the actual speed of the decoder in low-power conditions.
Here we tested all the thread configurations on the following Android devices:
Here are the results:
Once again, we can see, on LITTLE cores:
For the sake of completeness, here are the results for 10b on the LITTLE cores:
You can find all the details here, in the spreadsheet done by Nathan.
dav1d is now a very fast decoder on desktops and laptops, but above all on mobile, where it shows very impressive performance on 8b and 10b. It can decode 1080p with a couple of cores on mobile.
Thanks a lot to Nathan Egge, from Mozilla, who gathered all the data required for this post. He therefore did all the work for this blogpost.