libjpeg-turbo-devel Mailing List for libjpeg-turbo (Page 20)
SIMD-accelerated libjpeg-compatible JPEG codec library
Brought to you by:
dcommander
|
From: Atul A. <sea...@gm...> - 2011-09-09 15:20:58
|
Hello everybody, I am trying to fix a Firefox bug (https://bugzilla.mozilla.org/show_bug.cgi?id=666621) which reports that several warnings are emitted when libjpeg-turbo is compiled. I have seen these warnings myself while compiling the code (using gcc 4.6.1, though I really doubt the gcc version has anything to do with it), and I am interested in fixing them. The warnings occur because several variables in the code are named 'main'. What is the process for submitting a patch to libjpeg-turbo so that these warnings can be fixed? The fix will be simple: rename 'main' to some other meaningful name. It will, however, touch several files. Thanks a lot. -- Regards, Atul Aggarwal |
|
From: Tom G. <tom...@li...> - 2011-09-02 19:54:31
|
On Fri, Sep 2, 2011 at 10:51 AM, Christian Robottom Reis <ki...@li...> wrote:
> On Fri, Sep 02, 2011 at 10:34:47AM -0500, Tom Gall wrote:
>> The initial port of Android's extensions to the current libjpeg-turbo
>> codebase is complete.
>
> Tom, you've done a spectacular job carrying the -turbo work into the
> developer platform and now Android. I'm thrilled to see this come to
> fruition now, and the CyanogenMod test is a great bonus -- seeing this
> used on a real form-factor device means I can actually believe in it.
Thanks! It's been fun and the fun shall continue!
>> The code can currently be found in git at:
>>
>> http://git.linaro.org/gitweb?p=people/tomgall/libjpeg-turbo/libjpeg-turbo.git;a=summary
>>
>> from the 1.2-beta-linaro-andoid branch.
>>
>> Be sure to read the ANDROID.txt file for build instructions. This
>> branch is SPECIFICALLY for android.
>
> Can you give a summary of what's being added here? i.e. what happened to
> Android when using libjpeg-turbo without these added patches?
For a more technical summary, the write-up that accompanies the initial draft of the submitted patch is probably more interesting. It's perhaps still a little too high level, but for a patch of this size it's hard not to write a book. In the interest of release early, release often, there it is:
https://sourceforge.net/tracker/?func=detail&aid=3403461&group_id=303195&atid=1278160
> I see some pretty major changes in:
>
> http://git.linaro.org/gitweb?p=people/tomgall/libjpeg-turbo/libjpeg-turbo.git;a=commitdiff;h=ff1f5e7ce17701b48e53b7fcb3509f40715fd2e4
>
> It looks to me like the main changes are protected by the following
> defines, which probably answers part of my question above:
>
> ENABLE_ANDROID_NULL_CONVERT
> ANDROID_TILE_BASED_DECODE
> ANDROID_RGB
Yes, everything is bounded by #ifdef ANDROID or akin. I've mostly kept to the Android define philosophy for now, but I think it is reasonable to evolve the patch a bit and just go all ANDROID, with the exception of testing for, say, Android NEON or support for ash.
> (And yes, I have read ANDROID.txt which confirms the above but doesn't
> tell me much else ;-)
:-) Well it does say how to build it.
> Also, I see you adding config.h and jconfig.h files which I don't think
> you want to have committed, right?
I do. Those are both generated by the autotools. Android does not use the autotools at all, but these files must exist. Given there is no ./configure step in Android's build, the next best thing is putting copies into the android directory and expecting builders to copy/move them to the right place at build time. In my git tree this step is not necessary; however, that branch in the git tree is only for Android.
> Do you really want an android/ config subdirectory?
Unfortunately. There might be a better way around this. I don't know my Android build foo well enough yet to see another method, but perhaps some Android types might comment.
>> I have not yet started to do performance comparisons between the old
>> jpeg and libjpeg-turbo on android. That needs to be done.
>
> This might be Zach's next favorite demo!
I have two Nexus Ones that I am going to install side by side. I think it might have some YouTube potential :-)
> You might want to try out the toolchain guys' latest -O3 and assorted
> optimization madness to see if they make a difference when you do that.
Indeed. There's more to do.
>> Also from the android extensions, support for ash and one optimization
>> for armv6 was not included. Both however are reasonable optimizations
>> and I can see including them at a future date.
>
> What is the "support for ash" piece?
Ash is Android Shared memory.
http://elinux.org/Android_Kernel_Features#ashmem
> Good job!
Thanks.
> --
> Christian Robottom Reis, Engineering VP
> Brazil (GMT-3) | [+55] 16 9112 6430 | [+1] 612 216 4935
> Linaro.org: Open Source Software for ARM SoCs
>
--
Regards,
Tom
"We want great men who, when fortune frowns will not be discouraged." - Colonel Henry Knox
Linaro.org │ Open source software for ARM SoCs
w) tom.gall att linaro.org
w) tom_gall att vnet.ibm.com
h) tom_gall att mac.com
|
|
From: Tom G. <tom...@li...> - 2011-09-02 15:34:53
|
The initial port of Android's extensions to the current libjpeg-turbo codebase is complete. With Chao Yang's help we've validated using both the Linaro Android LEB panda as well as CyanogenMod 7 on the Nook. (I'd validate on my Nexus One too but I'm short of microSD cards.) It runs, the image encode and decode tests pass, and use of the library by Android components such as the browser and skia appears to be fine with no issues. The past bugs seen with the 1.1.1-based proof of concept are gone.
The code can currently be found in git at:
http://git.linaro.org/gitweb?p=people/tomgall/libjpeg-turbo/libjpeg-turbo.git;a=summary
from the 1.2-beta-linaro-andoid branch. Be sure to read the ANDROID.txt file for build instructions.
This branch is SPECIFICALLY for Android. It's not a wise idea (yet) to take these contents and use them for a traditional Linux system. (I have also tested on Linux with the extensions compiled in, but I have not yet tested with all the extensions turned off to make sure that a pre-Android build and a post-Android build with Android support disabled are equal.)
As soon as SourceForge sends me an email so I can confirm my account there, I'll get a patch posted to the libjpeg-turbo patch tracker for review. I anticipate there will be a comment or two and some slight rework here and there, but generally speaking the code should be in good shape.
I have not yet started to do performance comparisons between the old jpeg and libjpeg-turbo on Android. That needs to be done. Furthermore, NEON hardware support is assumed; there will likely be a patch to enable or disable the optimizations that depend on NEON.
Also, from the Android extensions, support for ash and one optimization for ARMv6 were not included. Both, however, are reasonable optimizations and I can see including them at a future date.
I would include a pointer to the original Android jpeg source but android.git.kernel.org is still down.
--
Regards,
Tom
"We want great men who, when fortune frowns will not be discouraged." - Colonel Henry Knox
Linaro.org │ Open source software for ARM SoCs
w) tom.gall att linaro.org
w) tom_gall att vnet.ibm.com
h) tom_gall att mac.com
|
|
From: Siarhei S. <sia...@gm...> - 2011-08-30 17:18:09
|
On Tue, Aug 30, 2011 at 6:50 PM, DRC <dco...@us...> wrote:
> On 8/29/11 2:24 AM, Siarhei Siamashka wrote:
>> It seems to be easily fixable by putting this conversion code into an
>> inline helper function, which can be called for each color format case
>> with these offsets passed as compile time constants.
>
> Can you show me an example of what you're talking about?
Sure, I have submitted a patch:
https://sourceforge.net/tracker/?func=detail&aid=3400891&group_id=303195&atid=1278160
Right now it can benefit Tegra2, which has ARM Cortex-A9 processor,
but does not have NEON. That is until somebody submits some good ARMv6
optimized color conversion code to libjpeg-turbo.
Benchmarking with gcc 4.5.2 and not using SIMD optimizations.
$ ./configure --disable-shared --without-simd && make
$ ./tjbench nightshot_iso_100.ppm 75 -rgb -quiet
=== Intel Core i7 @2.8GHz (64-bit) ===
=== before ===
Bitmap Bitmap JPEG JPEG Image Image Comp Comp Decomp
Format Order Subsamp Qual Width Height Perf Ratio Perf
RGB TD GRAY 75 3136 2352 106.4 58.51 175.6
RGB TD 4:2:0 75 3136 2352 63.49 48.08 73.06
RGB TD 4:2:2 75 3136 2352 50.59 41.85 67.18
RGB TD 4:4:4 75 3136 2352 39.84 34.09 60.53
=== after ===
Bitmap Bitmap JPEG JPEG Image Image Comp Comp Decomp
Format Order Subsamp Qual Width Height Perf Ratio Perf
RGB TD GRAY 75 3136 2352 105.6 58.51 175.9
RGB TD 4:2:0 75 3136 2352 63.47 48.08 88.36
RGB TD 4:2:2 75 3136 2352 50.55 41.85 79.38
RGB TD 4:4:4 75 3136 2352 39.79 34.09 70.79
=== ARM Cortex-A9 @1GHz ===
=== before ===
Bitmap Bitmap JPEG JPEG Image Image Comp Comp Decomp
Format Order Subsamp Qual Width Height Perf Ratio Perf
RGB TD GRAY 75 3136 2352 15.36 58.51 34.67
RGB TD 4:2:0 75 3136 2352 8.255 48.08 11.36
RGB TD 4:2:2 75 3136 2352 6.778 41.85 10.37
RGB TD 4:4:4 75 3136 2352 5.235 34.09 9.733
=== after ===
Bitmap Bitmap JPEG JPEG Image Image Comp Comp Decomp
Format Order Subsamp Qual Width Height Perf Ratio Perf
RGB TD GRAY 75 3136 2352 15.34 58.51 34.70
RGB TD 4:2:0 75 3136 2352 8.239 48.08 12.66
RGB TD 4:2:2 75 3136 2352 6.774 41.85 11.43
RGB TD 4:4:4 75 3136 2352 5.236 34.09 10.74
--
Best regards,
Siarhei Siamashka
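To put the before/after tables in perspective, the relative change in the Decomp Perf column can be worked out directly. The following is only a sketch: the numbers are copied from the tables above, while the names `decomp_row`, `percent_gain` and `print_gains` are made up for illustration and are not part of tjbench or libjpeg-turbo.

```c
#include <assert.h>
#include <stdio.h>

/* Decompression "Perf" figures copied from the tjbench tables above
   (before and after the patch), in whatever unit tjbench reports. */
struct decomp_row {
  const char *cpu;
  const char *subsamp;
  double before;
  double after;
};

static const struct decomp_row rows[] = {
  { "Core i7",   "GRAY",  175.6, 175.9 },
  { "Core i7",   "4:2:0", 73.06, 88.36 },
  { "Core i7",   "4:2:2", 67.18, 79.38 },
  { "Core i7",   "4:4:4", 60.53, 70.79 },
  { "Cortex-A9", "GRAY",  34.67, 34.70 },
  { "Cortex-A9", "4:2:0", 11.36, 12.66 },
  { "Cortex-A9", "4:2:2", 10.37, 11.43 },
  { "Cortex-A9", "4:4:4", 9.733, 10.74 },
};

/* Relative change in percent; positive means "after" is faster. */
double percent_gain(double before, double after)
{
  assert(before > 0.0);
  return (after / before - 1.0) * 100.0;
}

/* Hypothetical helper: print one line per table row. */
void print_gains(FILE *out)
{
  size_t i;
  for (i = 0; i < sizeof(rows) / sizeof(rows[0]); i++)
    fprintf(out, "%-10s %-6s %+6.1f%%\n", rows[i].cpu, rows[i].subsamp,
            percent_gain(rows[i].before, rows[i].after));
}
```

Calling `print_gains(stdout)` shows the 4:2:0 decompression rows improving by roughly 21% on the Core i7 and roughly 11% on the Cortex-A9, while the GRAY rows stay essentially flat, which seems consistent with a change limited to YCbCr-to-RGB color conversion.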
|
|
From: DRC <dco...@us...> - 2011-08-30 15:57:59
|
On 8/29/11 2:24 AM, Siarhei Siamashka wrote:
> It seems to be easily fixable by putting this conversion code into an
> inline helper function, which can be called for each color format case
> with these offsets passed as compile time constants.
Can you show me an example of what you're talking about?
|
|
From: Siarhei S. <sia...@gm...> - 2011-08-29 08:25:05
|
On Mon, Aug 29, 2011 at 6:10 AM, DRC <dco...@us...> wrote:
> On 8/28/11 8:49 PM, Tom Gall wrote:
>>> The benchmarks show that libjpeg-turbo is a bit slower. And the main
>>> culprit is the C implementation of 'ycc_rgb_convert' function (33186
>>> vs. 41153 samples difference is too high to be explained by just
>>> random variation). Looking at the code, the primary suspect is the use
>>> of local variables instead of the constants known at compile time in
>>> jpeg-6b:
>>>
>>> int rindex = rgb_red[cinfo->out_color_space];
>>> int gindex = rgb_green[cinfo->out_color_space];
>>> int bindex = rgb_blue[cinfo->out_color_space];
>>>
>>> /* snip */
>>>
>>> outptr[rindex] = range_limit[y + Crrtab[cr]];
>>> outptr[gindex] = range_limit[y + ((int) RIGHT_SHIFT(Cbgtab[cb] + Crgtab[cr],
>>> SCALEBITS))];
>>> outptr[bindex] = range_limit[y + Cbbtab[cb]];
>
> Yes, those are the libjpeg-turbo colorspace extensions. The non-SIMD
> color conversion routines should perform a bit better in 1.2 than in
> 1.1, but yes, they are slower than jpeg-6b, for precisely the reasons
> you cite.
It seems to be easily fixable by putting this conversion code into an inline helper function, which can be called for each color format case with these offsets passed as compile time constants.
Anyway, if libjpeg-turbo intends to be a good drop-in replacement for IJG libjpeg, then performance may also matter on hardware that doesn't have SIMD optimizations yet. This includes older ARM processors without NEON (I guess Linaro is also interested in good performance for these), MIPS, PPC and any other platforms. IIRC, one of the other C parts which differs from jpeg-6b is quantization. I have not tried to benchmark it though.
> However, the overall compression/decompression performance of
> the non-SIMD code should be better than jpeg-6b, because we have
> hand-optimized Huffman encoding/decoding routines that more than make up
> for the color conversion routines.
> If you're not seeing that, then yes, something is amiss.
The decoding performance seemed to be a little bit worse overall for libjpeg-turbo in that setup. It's hard to say how much Huffman decoding contributes, but this is definitely worth investigating. The compiler version and/or the optimization options may also play a major role here. A somewhat radical solution would be to reimplement some parts of the Huffman decoder in hand-optimized assembly for ARM, but only if the compiler is doing a really poor job there.
--
Best regards,
Siarhei Siamashka
|
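The inline-helper idea proposed above can be sketched roughly as follows. This is not the submitted patch: `pixel_format`, `clamp255`, `ycc_to_rgb_row` and `ycc_convert_row` are hypothetical names, the lookup tables of the real code are replaced by direct fixed-point arithmetic (jpeg-6b style constants scaled by 2^16), and right shift of a negative int is assumed to be arithmetic, as jpeg-6b's RIGHT_SHIFT assumes by default.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical output layouts, standing in for the colorspace extensions. */
typedef enum { FMT_RGB, FMT_BGR, FMT_RGBX, FMT_BGRX } pixel_format;

static unsigned char clamp255(int v)
{
  return (unsigned char)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* Shared loop body. Every caller passes literal offsets, so once this is
   inlined the compiler can treat r, g, b and step as constants instead of
   loading them from per-image variables on every pixel. */
static inline void ycc_to_rgb_row(const unsigned char *y,
                                  const unsigned char *cb,
                                  const unsigned char *cr,
                                  unsigned char *out, int width,
                                  const int r, const int g, const int b,
                                  const int step)
{
  int i;
  for (i = 0; i < width; i++) {
    int yy = y[i], u = cb[i] - 128, v = cr[i] - 128;
    /* 1.402, 0.34414, 0.71414 and 1.772 scaled by 65536, plus rounding. */
    out[r] = clamp255(yy + ((91881 * v + 32768) >> 16));
    out[g] = clamp255(yy + ((-22554 * u - 46802 * v + 32768) >> 16));
    out[b] = clamp255(yy + ((116130 * u + 32768) >> 16));
    out += step;
  }
}

/* One call site per format, each with compile-time constant offsets. */
void ycc_convert_row(pixel_format fmt, const unsigned char *y,
                     const unsigned char *cb, const unsigned char *cr,
                     unsigned char *out, int width)
{
  assert(y != NULL && cb != NULL && cr != NULL && out != NULL);
  switch (fmt) {
  case FMT_RGB:  ycc_to_rgb_row(y, cb, cr, out, width, 0, 1, 2, 3); break;
  case FMT_BGR:  ycc_to_rgb_row(y, cb, cr, out, width, 2, 1, 0, 3); break;
  case FMT_RGBX: ycc_to_rgb_row(y, cb, cr, out, width, 0, 1, 2, 4); break;
  case FMT_BGRX: ycc_to_rgb_row(y, cb, cr, out, width, 2, 1, 0, 4); break;
  }
}
```

The trade-off is code size: the loop is duplicated once per format, in exchange for removing the variable index loads from the inner loop. Whether that pays off on a given compiler and CPU is something only a benchmark can settle.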
|
From: Siarhei S. <sia...@gm...> - 2011-08-29 08:04:49
|
On Mon, Aug 29, 2011 at 5:49 AM, Tom Gall <tom...@li...> wrote:
> Hi Siarhei,
>
> I wouldn't read too much into the numbers yet. That's where I was
> late in the day this last friday and you're right, something doesn't
> seem right but haven't had a chance to dig into it as I discovered
> that oprofile wasn't working on the latest linaro kernels for my
> panda.
OK, that's understandable. Regarding oprofile on your panda, there may
be some problems getting performance counters to work properly on
omap4:
http://groups.google.com/group/pandaboard/browse_thread/thread/9541e4328094ff88?pli=1
But I have not looked into it too much, because I'm generally just
using the timer interrupt mode as it is the only way to get non-bogus
oprofile data on ARM Cortex-A8.
If you can recompile your kernel, then getting oprofile up and running
in the timer interrupt mode is very simple. If you need some help,
just ping me on irc. Also tweaking the samples collection frequency
for the oprofile timer mode would be a good idea as explained in:
http://ssvb.github.com/2011/08/23/yet-another-oprofile-tutorial.html
> So a little digging to do come tomorrow morning.
Great.
> Also the named image files I'm using are online now at
> http://people.linaro.org/~tgall
OK, thanks. And this is again a bit strange. I can't seem to confirm
your results on a 1GHz ARM Cortex-A9 system with gcc 4.5.2:
$ wget http://launchpad.net/libjpeg-turbo/1.2/1.1.90/+download/libjpeg-turbo-1.1.90.tar.gz
$ wget http://people.linaro.org/~tgall/Saturn.jpg
$ tar -xzf libjpeg-turbo-1.1.90.tar.gz
$ cd libjpeg-turbo-1.1.90
$ autoreconf -fiv
$ ./configure --disable-shared && make -j2
$ time ./djpeg ../Saturn.jpg > /dev/null
$ time ./djpeg ../Saturn.jpg > /dev/null
real 0m1.148s
user 0m1.140s
sys 0m0.005s
$ time JSIMD_FORCE_NO_SIMD=1 ./djpeg ../Saturn.jpg > /dev/null
real 0m2.302s
user 0m2.280s
sys 0m0.010s
$ mkdir tmp
$ sudo mount -t tmpfs -o size=250M tmpfs tmp
$ cp ../Saturn.jpg tmp
$ cp djpeg tmp
$ cd tmp
$ time ./djpeg Saturn.jpg > Saturn.ppm
real 0m1.447s
user 0m1.140s
sys 0m0.300s
$ time JSIMD_FORCE_NO_SIMD=1 ./djpeg Saturn.jpg > Saturn.ppm
real 0m2.619s
user 0m2.325s
sys 0m0.280s
$ sudo umount tmp
The worst result for decoding this 7227x3847 Saturn.jpg image with the
default settings and without SIMD is 2.619s for me, which is similar
to libjpeg62 decoding time from your table. Are you sure that
libjpeg-turbo could not have been compiled with optimizations totally
disabled (something like -O0 option)?
Also even with SIMD enabled, the linaro packaged
libjpeg-turbo-1.1.90.tar.gz does not have ARM NEON optimized ISLOW
iDCT yet, and it makes a major difference for decoding with the
default settings. ISLOW iDCT is typically used for the normal high
quality jpeg images. And IFAST iDCT is important for mjpeg decoding in
gstreamer.
--
Best regards,
Siarhei Siamashka
|
|
From: DRC <dco...@us...> - 2011-08-29 03:11:06
|
On 8/28/11 8:49 PM, Tom Gall wrote:
>> The benchmarks show that libjpeg-turbo is a bit slower. And the main
>> culprit is the C implementation of 'ycc_rgb_convert' function (33186
>> vs. 41153 samples difference is too high to be explained by just
>> random variation). Looking at the code, the primary suspect is the use
>> of local variables instead of the constants known at compile time in
>> jpeg-6b:
>>
>> int rindex = rgb_red[cinfo->out_color_space];
>> int gindex = rgb_green[cinfo->out_color_space];
>> int bindex = rgb_blue[cinfo->out_color_space];
>>
>> /* snip */
>>
>> outptr[rindex] = range_limit[y + Crrtab[cr]];
>> outptr[gindex] = range_limit[y + ((int) RIGHT_SHIFT(Cbgtab[cb] + Crgtab[cr],
>> SCALEBITS))];
>> outptr[bindex] = range_limit[y + Cbbtab[cb]];
Yes, those are the libjpeg-turbo colorspace extensions. The non-SIMD color conversion routines should perform a bit better in 1.2 than in 1.1, but yes, they are slower than jpeg-6b, for precisely the reasons you cite. However, the overall compression/decompression performance of the non-SIMD code should be better than jpeg-6b, because we have hand-optimized Huffman encoding/decoding routines that more than make up for the color conversion routines. If you're not seeing that, then yes, something is amiss.
|
|
From: Tom G. <tom...@li...> - 2011-08-29 02:49:57
|
Hi Siarhei,
I wouldn't read too much into the numbers yet. That's where I was late in the day this last Friday, and you're right, something doesn't seem right, but I haven't had a chance to dig into it as I discovered that oprofile wasn't working on the latest Linaro kernels for my panda. So a little digging to do come tomorrow morning.
Also the named image files I'm using are online now at http://people.linaro.org/~tgall
On Sun, Aug 28, 2011 at 5:29 PM, Siarhei Siamashka <sia...@gm...> wrote:
> Hello,
>
> The suspicious benchmark results from
> https://wiki.linaro.org/TomGall/LibJpegTurbo caught my attention.
> Apparently NEON optimizations did not get enabled for some reason, but
> the fact that libjpeg-turbo performed worse than jpeg-6b even running
> C code was still also very suspicious. One possible guess could be
> different optimization settings (-O2 vs. -O3 or something like this).
> But anyway, I decided to run some benchmark myself.
>
> Test setup:
> ARM Cortex-A8 @1GHz, gcc 4.5.2, CFLAGS="-O2"
>
> Test image:
> http://mynokiablog.files.wordpress.com/2010/07/4807153298_fbc3c69796_o.jpg from
> http://mynokiablog.com/2010/07/19/photos-new-12mp-samples-from-the-nokia-n8/
>
> Running tests in the following way in order to disable NEON optimizations:
> $ JSIMD_FORCE_NO_SIMD=1 ./djpeg 4807153298_fbc3c69796_o.jpg > /dev/null
>
> The benchmarks show that libjpeg-turbo is a bit slower. And the main
> culprit is the C implementation of 'ycc_rgb_convert' function (33186
> vs. 41153 samples difference is too high to be explained by just
> random variation). Looking at the code, the primary suspect is the use
> of local variables instead of the constants known at compile time in
> jpeg-6b:
>
> int rindex = rgb_red[cinfo->out_color_space];
> int gindex = rgb_green[cinfo->out_color_space];
> int bindex = rgb_blue[cinfo->out_color_space];
>
> /* snip */
>
> outptr[rindex] = range_limit[y + Crrtab[cr]];
> outptr[gindex] = range_limit[y + ((int) RIGHT_SHIFT(Cbgtab[cb] + Crgtab[cr],
> SCALEBITS))];
> outptr[bindex] = range_limit[y + Cbbtab[cb]];
>
> And finally the profiling logs are below:
>
> == jpeg-6b ==
>
> real 0m1.507s
> user 0m1.492s
> sys 0m0.008s
>
> Profiling through timer interrupt
> samples  %  image name  symbol name
> 53171  42.7781  djpeg  jpeg_idct_islow
> 33186  26.6994  djpeg  ycc_rgb_convert
> 15131  12.1735  djpeg  h2v1_fancy_upsample
> 13937  11.2128  djpeg  decode_mcu
> 2505  2.0154  djpeg  jpeg_fill_bit_buffer
> 2125  1.7096  djpeg  decompress_onepass
> 1445  1.1626  libc-2.12.2.so  memset
> 1095  0.8810  no-vmlinux  /no-vmlinux
> 1003  0.8070  libc-2.12.2.so  _wordcopy_fwd_dest_aligned
> 154  0.1239  djpeg  jpeg_huff_decode
> 82  0.0660  libc-2.12.2.so  fwrite
> 67  0.0539  libc-2.12.2.so  write
> 43  0.0346  djpeg  sep_upsample
> 37  0.0298  libc-2.12.2.so  new_do_write
> 34  0.0274  libc-2.12.2.so  _IO_file_write@@GLIBC_2.4
> 28  0.0225  libc-2.12.2.so  mempcpy
> 27  0.0217  libc-2.12.2.so  _IO_file_xsputn@@GLIBC_2.4
> 17  0.0137  djpeg  process_data_simple_main
> 17  0.0137  libc-2.12.2.so  _IO_default_xsputn
> 14  0.0113  djpeg  jpeg_read_scanlines
> 14  0.0113  djpeg  jzero_far
> 13  0.0105  libc-2.12.2.so  _IO_file_xsgetn
> 11  0.0088  libc-2.12.2.so  _dl_addr
> 11  0.0088  libc-2.12.2.so  fread
> 10  0.0080  libc-2.12.2.so  _IO_sgetn
> 10  0.0080  libc-2.12.2.so  read
>
> == linaro libjpeg-turbo-1.1.90 ==
>
> real 0m1.586s
> user 0m1.563s
> sys 0m0.016s
>
> Profiling through timer interrupt
> samples  %  image name  symbol name
> 51961  40.0063  djpeg  jpeg_idct_islow
> 41153  31.6849  djpeg  ycc_rgb_convert
> 15208  11.7091  djpeg  decode_mcu
> 15058  11.5936  djpeg  h2v1_fancy_upsample
> 2050  1.5784  djpeg  decompress_onepass
> 1438  1.1072  libc-2.12.2.so  memset
> 1094  0.8423  no-vmlinux  /no-vmlinux
> 1013  0.7799  libc-2.12.2.so  _wordcopy_fwd_dest_aligned
> 297  0.2287  djpeg  jpeg_fill_bit_buffer
> 90  0.0693  libc-2.12.2.so  fwrite
> 53  0.0408  libc-2.12.2.so  write
> 46  0.0354  libc-2.12.2.so  new_do_write
> 41  0.0316  djpeg  sep_upsample
> 40  0.0308  libc-2.12.2.so  mempcpy
> 34  0.0262  libc-2.12.2.so  _IO_file_xsputn@@GLIBC_2.4
> 33  0.0254  libc-2.12.2.so  _IO_file_write@@GLIBC_2.4
> 24  0.0185  djpeg  jpeg_read_scanlines
> 23  0.0177  djpeg  jzero_far
> 22  0.0169  djpeg  jpeg_huff_decode
> 20  0.0154  djpeg  main
>
> --
> Best regards,
> Siarhei Siamashka
Thanks for passing this along.
--
Regards,
Tom
"We want great men who, when fortune frowns will not be discouraged." - Colonel Henry Knox
Linaro.org │ Open source software for ARM SoCs
w) tom.gall att linaro.org
w) tom_gall att vnet.ibm.com
h) tom_gall att mac.com
|
|
From: Siarhei S. <sia...@gm...> - 2011-08-28 22:30:01
|
Hello,
The suspicious benchmark results from https://wiki.linaro.org/TomGall/LibJpegTurbo caught my attention. Apparently NEON optimizations did not get enabled for some reason, but the fact that libjpeg-turbo performed worse than jpeg-6b even when running C code was still very suspicious. One possible guess could be different optimization settings (-O2 vs. -O3 or something like this). But anyway, I decided to run some benchmarks myself.
Test setup:
ARM Cortex-A8 @1GHz, gcc 4.5.2, CFLAGS="-O2"
Test image:
http://mynokiablog.files.wordpress.com/2010/07/4807153298_fbc3c69796_o.jpg from
http://mynokiablog.com/2010/07/19/photos-new-12mp-samples-from-the-nokia-n8/
Running tests in the following way in order to disable NEON optimizations:
$ JSIMD_FORCE_NO_SIMD=1 ./djpeg 4807153298_fbc3c69796_o.jpg > /dev/null
The benchmarks show that libjpeg-turbo is a bit slower. The main culprit is the C implementation of the 'ycc_rgb_convert' function (the 33186 vs. 41153 samples difference is too high to be explained by just random variation). Looking at the code, the primary suspect is the use of local variables instead of the constants known at compile time in jpeg-6b:
int rindex = rgb_red[cinfo->out_color_space];
int gindex = rgb_green[cinfo->out_color_space];
int bindex = rgb_blue[cinfo->out_color_space];
/* snip */
outptr[rindex] = range_limit[y + Crrtab[cr]];
outptr[gindex] = range_limit[y + ((int) RIGHT_SHIFT(Cbgtab[cb] + Crgtab[cr],
                                                    SCALEBITS))];
outptr[bindex] = range_limit[y + Cbbtab[cb]];
And finally the profiling logs are below:
== jpeg-6b ==
real 0m1.507s
user 0m1.492s
sys 0m0.008s
Profiling through timer interrupt
samples  %  image name  symbol name
53171  42.7781  djpeg  jpeg_idct_islow
33186  26.6994  djpeg  ycc_rgb_convert
15131  12.1735  djpeg  h2v1_fancy_upsample
13937  11.2128  djpeg  decode_mcu
2505  2.0154  djpeg  jpeg_fill_bit_buffer
2125  1.7096  djpeg  decompress_onepass
1445  1.1626  libc-2.12.2.so  memset
1095  0.8810  no-vmlinux  /no-vmlinux
1003  0.8070  libc-2.12.2.so  _wordcopy_fwd_dest_aligned
154  0.1239  djpeg  jpeg_huff_decode
82  0.0660  libc-2.12.2.so  fwrite
67  0.0539  libc-2.12.2.so  write
43  0.0346  djpeg  sep_upsample
37  0.0298  libc-2.12.2.so  new_do_write
34  0.0274  libc-2.12.2.so  _IO_file_write@@GLIBC_2.4
28  0.0225  libc-2.12.2.so  mempcpy
27  0.0217  libc-2.12.2.so  _IO_file_xsputn@@GLIBC_2.4
17  0.0137  djpeg  process_data_simple_main
17  0.0137  libc-2.12.2.so  _IO_default_xsputn
14  0.0113  djpeg  jpeg_read_scanlines
14  0.0113  djpeg  jzero_far
13  0.0105  libc-2.12.2.so  _IO_file_xsgetn
11  0.0088  libc-2.12.2.so  _dl_addr
11  0.0088  libc-2.12.2.so  fread
10  0.0080  libc-2.12.2.so  _IO_sgetn
10  0.0080  libc-2.12.2.so  read
== linaro libjpeg-turbo-1.1.90 ==
real 0m1.586s
user 0m1.563s
sys 0m0.016s
Profiling through timer interrupt
samples  %  image name  symbol name
51961  40.0063  djpeg  jpeg_idct_islow
41153  31.6849  djpeg  ycc_rgb_convert
15208  11.7091  djpeg  decode_mcu
15058  11.5936  djpeg  h2v1_fancy_upsample
2050  1.5784  djpeg  decompress_onepass
1438  1.1072  libc-2.12.2.so  memset
1094  0.8423  no-vmlinux  /no-vmlinux
1013  0.7799  libc-2.12.2.so  _wordcopy_fwd_dest_aligned
297  0.2287  djpeg  jpeg_fill_bit_buffer
90  0.0693  libc-2.12.2.so  fwrite
53  0.0408  libc-2.12.2.so  write
46  0.0354  libc-2.12.2.so  new_do_write
41  0.0316  djpeg  sep_upsample
40  0.0308  libc-2.12.2.so  mempcpy
34  0.0262  libc-2.12.2.so  _IO_file_xsputn@@GLIBC_2.4
33  0.0254  libc-2.12.2.so  _IO_file_write@@GLIBC_2.4
24  0.0185  djpeg  jpeg_read_scanlines
23  0.0177  djpeg  jzero_far
22  0.0169  djpeg  jpeg_huff_decode
20  0.0154  djpeg  main
--
Best regards,
Siarhei Siamashka
|
|
From: DRC <dco...@us...> - 2011-08-19 23:59:28
|
Sounds good. Siarhei has just submitted a new patch for implementing accelerated ISLOW decoding, and he plans to tweak that over the coming days. Unless anyone sees a reason not to, I would like to release the official libjpeg-turbo 1.2 beta in September or early October.
On 7/22/64 1:59 PM, Tom Gall wrote:
> All,
>
> The current 1.1.90 code with mandeep's reworked change that was
> accepted yesterday passes all make test and correctly displays the
> android reference image that was showing quality problems with the
> older 1.1.1 androidized proof of concept.
>
> Given this situation for Linaro's 11.08 release we are going to ship
> the upstream 1.1.90 version. I do not believe we should do any further
> development with the older 1.1.x branch of code. This works well for
> linux and will make Monday's RC build. (for those on the
> libjpeg-turbo-devel list, Linaro ships a reference image every month)
>
> For android the situation is a little less clear. Basically we'll need
> to re-forward-port the android specific changes to the 1.1.90 code. It
> will take some time and we will submit this upstream to the
> libjpeg-turbo project of course. I don't want a hack like the POC for
> android was. So for the android team short term I think it's your
> choice, you can continue to include the 1.1.1 POC in your builds but
> for things that might be busted, you get to keep both halves. I think
> the POC has served its purpose and now's the time to focus on what
> will benefit both the Linaro Android WG and the upstream libjpeg-turbo
> community longer term.
>
> Thanks!
>
|
|
From: Tom G. <tom...@li...> - 2011-08-19 17:33:54
|
All,
The current 1.1.90 code, with mandeep's reworked change that was accepted yesterday, passes all of make test and correctly displays the Android reference image that was showing quality problems with the older 1.1.1 androidized proof of concept.
Given this situation, for Linaro's 11.08 release we are going to ship the upstream 1.1.90 version. I do not believe we should do any further development with the older 1.1.x branch of code. This works well for Linux and will make Monday's RC build. (For those on the libjpeg-turbo-devel list, Linaro ships a reference image every month.)
For Android the situation is a little less clear. Basically we'll need to re-forward-port the Android-specific changes to the 1.1.90 code. It will take some time and we will submit this upstream to the libjpeg-turbo project, of course. I don't want a hack like the POC for Android was. So for the Android team, short term, I think it's your choice: you can continue to include the 1.1.1 POC in your builds, but for things that might be busted, you get to keep both halves. I think the POC has served its purpose and now's the time to focus on what will benefit both the Linaro Android WG and the upstream libjpeg-turbo community longer term.
Thanks!
--
Regards,
Tom
"We want great men who, when fortune frowns will not be discouraged." - Colonel Henry Knox
Linaro.org │ Open source software for ARM SoCs
w) tom.gall att linaro.org
w) tom_gall att vnet.ibm.com
h) tom_gall att mac.com
|
|
From: Siarhei S. <sia...@gm...> - 2011-07-21 18:24:02
|
On Wed, Jul 20, 2011 at 7:13 AM, Tom Gall <tom...@li...> wrote: > Hi, > > At Linaro I'll be more involved with libjpeg-turbo and seeing that the > arm implementation continues to improve. > > My git tree which has the 1.1.1 code base + the past linaro work is > located here: > > http://git.linaro.org/gitweb?p=people/tomgall/libjpeg-turbo/libjpeg-turbo.git;a=summary > > Or if you're partial to launchpad.net there is > http://launchpad.net/libjpeg-turbo. If I understand it correctly, the code there contains the following parts combined in one big patch: 1. build scripts tweaks to add neon support 2. neon yuv->rgb 3. neon idct 4. 'ycck_rgb_convert' and 'cmyk_rgb_convert' 5. something in 'jsimdcfg.inc' and PROFILE_DECODING for djpeg Now some comments for each part separately: 1. build scripts tweaks to add neon support The decision whether to include arm neon optimizations is done at compile time in your patch. And neon always gets enabled for arm by default unless --without-simd configure option is provided. This may cause some confusion for unprepared users, who might want to compile the library for some older arm processors assuming that providing suitable CFLAGS would be enough. The missing runtime detection is an important practical problem. Relatively popular Tegra2 based devices use ARM Cortex-A9 without NEON. And the need to support systems both with and without NEON adds some headache for Mozilla, because they have no better choice than doing NEON detection in runtime for their Android builds. There are some alternatives, for example in linux it is possible to provide multiple versions of the libraries and rely on hwcaps-based shared library selection in libc, like was done for ffmpeg here: https://bugs.launchpad.net/ubuntu/+source/ffmpeg/+bug/383240 But the biggest problem is that you are a bit late. 
It happened that libjpeg-turbo svn trunk already got support for arm neon in the build scripts and also runtime neon detection before linaro rolled out any libjpeg patches to public. So right now I would suggest to have a look at libjpeg-turbo trunk and submit patches against it if you need to tweak build scripts for your purposes. For example, this already happened earlier and libjpeg-turbo should also support iOS: http://sourceforge.net/tracker/?func=detail&aid=3303840&group_id=303195&atid=1278160 2. neon yuv->rgb The check for 'cinfo->out_color_space' is missing and the code is always trying to use 'yvup2bgr888_venum', which looks wrong. This problem could be detected by 'make test', but my understanding is that 'make test' is not supposed to work with your patches yet because you are not trying to keep decoder results identical to the results provided by C implementation. 3. neon idct This looks like some neon optimized idct function was found somewhere and plugged into libjpeg-turbo via some heavyweight wrapper code. I have not benchmarked it, but just based on how it looks, this wrapper could be easily as expensive as the idct function itself. Which is surely not good for performance. Also this optimization does not try to provide bitexact results, but just claims IEEE 1180 compliance in the comments. This bring the question about how this code can be tested and how well it fits the existing test suite. In any case, modern codecs like Theora, VP8 and H.264 now specify bitexact iDCT because the idea of giving the freedom of selecting precision to implementers turned out to be not so great: http://guru.multimedia.cx/the-mpeg124-and-h26123-idct/ This surely affects jpeg decoders to much smaller extent, but I myself do not think that it is a good idea to add even more (i)DCT variants to libjpeg without a really good reason. 
Not to mention the extra problems, not just for libjpeg-turbo automated tests, but also for the tests of other applications which rely on certain deterministic output from the jpeg decoder (e.g. Mozilla).

That said, an example of where it *could* be a good idea to actually replace iDCT is the case of scaled decoding:
http://sourceforge.net/tracker/?func=detail&aid=3291302&group_id=303195&atid=1278160

Mostly not just to speed up the iDCT functions (they are already quite fast with SIMD optimizations), but because this provides an opportunity for additional performance improvements in the huffman decoder part, which is a real bottleneck. I have suspended this work for now just not to cause any extra confusion, but still want to revisit it later.

4. 'ycck_rgb_convert' and 'cmyk_rgb_convert'

I tried to search a bit, and openoffice also seems to have some similar code:
http://lists.fedoraproject.org/pipermail/scm-commits/2010-January/380342.html

Now I'm curious about what the difference is and whether the linaro 'ycck_rgb_convert' and 'cmyk_rgb_convert' functions are better than their openoffice counterparts. Also, what was the motivation to add this code to libjpeg-turbo? Was it specifically done to help applications such as openoffice? But given the differences, I wonder if the openoffice folks would actually be happy with your implementation or would have to maintain patches replacing it with their own implementation from now on (assuming that they start using libjpeg-turbo). So I would appreciate a little bit more information about this code.

--
Best regards,
Siarhei Siamashka |
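Editorial sketch for point 1 of the message above (runtime NEON detection). This is a minimal illustration only, assuming a Linux system whose /proc/cpuinfo contains a "Features" line listing CPU capabilities as space-separated tokens. The function names features_line_has_neon and cpu_has_neon are hypothetical and are not taken from libjpeg-turbo or the linaro patches; other mechanisms (for example the ELF auxiliary vector) may be preferable where available.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: return 1 if a single /proc/cpuinfo line is a
 * "Features" line that lists "neon" as a whole token, 0 otherwise. */
int features_line_has_neon(const char *line)
{
    assert(line != NULL);

    /* Only the "Features" line is of interest. */
    if (strncmp(line, "Features", 8) != 0)
        return 0;

    const char *p = strchr(line, ':');
    if (p == NULL)
        return 0;
    p++;

    while (*p != '\0') {
        /* Skip separators between tokens. */
        while (*p == ' ' || *p == '\t' || *p == '\n')
            p++;
        const char *start = p;
        /* Advance to the end of the current token. */
        while (*p != '\0' && *p != ' ' && *p != '\t' && *p != '\n')
            p++;
        if ((size_t)(p - start) == 4 && strncmp(start, "neon", 4) == 0)
            return 1;
    }
    return 0;
}

/* Hypothetical helper: scan /proc/cpuinfo once. Returns 0 (no NEON) if the
 * file cannot be read, so callers fall back to the plain C code paths.
 * Assumption: the "Features" line fits in the buffer below. */
int cpu_has_neon(void)
{
    FILE *f = fopen("/proc/cpuinfo", "r");
    char line[4096];
    int found = 0;

    if (f == NULL)
        return 0;
    while (!found && fgets(line, sizeof(line), f) != NULL)
        found = features_line_has_neon(line);
    fclose(f);
    return found;
}
```

A library would typically call something like cpu_has_neon() once at initialization and then pick either the NEON routine or the C routine through a function pointer, so that a single binary runs on both NEON and non-NEON devices such as the Tegra2 mentioned above.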
|
From: Siarhei S. <sia...@gm...> - 2011-07-20 22:14:43
|
On Wed, Jul 20, 2011 at 7:13 AM, Tom Gall <tom...@li...> wrote:
> Hi,
>
> At Linaro I'll be more involved with libjpeg-turbo and seeing that the
> arm implementation continues to improve.
>
> My git tree which has the 1.1.1 code base + the past linaro work is
> located here:
>
> http://git.linaro.org/gitweb?p=people/tomgall/libjpeg-turbo/libjpeg-turbo.git;a=summary
>
> Or if you're partial to launchpad.net there is
> http://launchpad.net/libjpeg-turbo.

Thanks for joining this mailing list and providing a link to your code. I'll try to have a closer look at it and post some comments in the next message. I guess implementing a complete set of arm neon optimizations (or at least the most important part of them) and having this code in the next libjpeg-turbo formal release as soon as possible is in the best interests of both of us :)

> On to some questions. On arm has anyone done any recent profiling work?
> I noticed that the dev mailing list isn't archived so sorry to ask
> what might be a question with a common answer.

There is no such thing as separately done profiling work. Profiling is an important part of implementing any optimizations. The ARM NEON patches attached in the patch tracker had profiling information too:
http://sourceforge.net/tracker/?func=detail&aid=3291291&group_id=303195&atid=1278160

Also some oprofile profiling data is available at the end of the benchmark logs attached there (the dev mailing list did not exist at that time):
http://sourceforge.net/mailarchive/message.php?msg_id=27613210

These logs just show which functions are generally 'hot' when running the libjpeg-turbo benchmark and are good candidates for optimization, but they do not tell anything about the relative importance of these functions (better optimized code is simply run for more rounds during the benchmark, and this skews the statistics).
Profiling djpeg/cjpeg programs when decoding/encoding individual jpeg files can provide real information about the relative contribution of different functions to cpu usage in each case.

> For the validation of the library, besides what is contained in the
> source for testcases, is there a general set of rules when it comes to
> correctness that's actually documented somewhere where a mere mortal
> can reference?

There is a section named "TESTING THE SOFTWARE" in:
http://libjpeg-turbo.svn.sourceforge.net/viewvc/libjpeg-turbo/trunk/install.txt?revision=252

--
Best regards,
Siarhei Siamashka |
|
From: Tom G. <tom...@li...> - 2011-07-20 04:39:20
|
Hi,

At Linaro I'll be more involved with libjpeg-turbo and seeing that the arm implementation continues to improve.

My git tree which has the 1.1.1 code base + the past linaro work is located here:

http://git.linaro.org/gitweb?p=people/tomgall/libjpeg-turbo/libjpeg-turbo.git;a=summary

Or if you're partial to launchpad.net there is http://launchpad.net/libjpeg-turbo.

On to some questions. On arm has anyone done any recent profiling work? I noticed that the dev mailing list isn't archived so sorry to ask what might be a question with a common answer.

For the validation of the library, besides what is contained in the source for testcases, is there a general set of rules when it comes to correctness that's actually documented somewhere where a mere mortal can reference?

Thanks!

--
Regards,
Tom

"We want great men who, when fortune frowns will not be discouraged." - Colonel Henry Knox
Linaro.org │ Open source software for ARM SoCs
w) tom.gall att linaro.org
w) tom_gall att vnet.ibm.com
h) tom_gall att mac.com |