Faster ZIP Decompression

/g' | grep 'avx|sse' | sort

avx
avx2
sse
sse2
sse4_1
sse4_2
ssse3
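For anyone who wants to reproduce that listing without the shell pipeline, the same information can be pulled out of /proc/cpuinfo with a few lines of Python. This is a minimal sketch and assumes a Linux machine; the substring filter is deliberately crude and mirrors the grep above.

with open("/proc/cpuinfo") as handle:
    flags = set()
    for line in handle:
        if line.startswith("flags"):
            # the flags line looks like "flags : fpu vme ... sse sse2 avx avx2 ..."
            flags.update(line.split(":", 1)[1].split())

for flag in sorted(flags):
    if "sse" in flag or "avx" in flag:
        print(flag)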

The first tool I'll benchmark is the Debian-patched Info-ZIP DEFLATE decompression implementation.

$ vi run_1

for i in {1..25}; do
    unzip -qq -o hadoop.zip -d working/
done

$ time bash run_1

The above completed in 54.55 seconds.

For the record, the above binary doesn't appear to have been built with zlib.

I examined the source code and it is possible to exclude zlib while retaining all the DEFLATE functionality.

$ objdump --dynamic-syms /usr/bin/unzip | grep -c ZLIB
0

The above doesn't show any mention of zlib.

I grepped the binary for the ZLIB_1 string that appears when zlib has been statically linked and nothing turned up.

$ strings /usr/bin/unzip | grep -ic ZLIB_
0

Next I'll try CPython's unzip functionality.

$ vi run_2

for i in {1..25}; do
    python -m zipfile -e hadoop.zip working/
done

$ time bash run_2

The above completed in 56.84 seconds.
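The zipfile module can also be driven from a script rather than through the -m zipfile command line interface. The following is a minimal sketch of what the extraction step above amounts to; the archive and target directory match the ones used throughout this post.

import zipfile

# extract every member of the archive into working/, overwriting any existing files
with zipfile.ZipFile("hadoop.zip") as archive:
    archive.extractall("working/")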

Next I'll use Archiver version 3.2.0, which was built by Matt Holt in Golang.

Golang is interesting in that it was co-designed by Ken Thompson, the inventor of UNIX and co-inventor of UTF-8, at Google, where he's employed as a Distinguished Engineer.

Go can take high-level code and compile it into binaries that can, among other things, take advantage of CPU vectorisation.

$ wget -c -O arc https://github.com/mholt/archiver/releases/download/v3.2.0/arc_linux_amd64
$ chmod +x arc
$ vi run_3

for i in {1..25}; do
    ./arc -overwrite unarchive hadoop.zip
done

$ time bash run_3

The above completed in 100.975 seconds.

Next I'll benchmark version 9.20.1 of the Debian-patched 7-Zip utility.

$ vi run_4

for i in {1..25}; do
    7z x -bd -y -oworking/ hadoop.zip 1>/dev/null
done

$ time bash run_4

The above completed in 72.2 seconds.

Version 9.20.1 of 7-Zip above was released in 2010. I'll fetch a more recent Debian release of 7-Zip. The following will install version 16.02, which was released in mid-2016.

$ wget -c http://http.us.debian.org/debian/pool/main/p/p7zip/p7zip_16.02+dfsg-7_amd64.deb
$ sudo dpkg -i p7zip_16.02+dfsg-7_amd64.deb

I'll then run the 7-Zip benchmark again.

$ time bash run_4

The above completed in 58.05 seconds.

Next I'll compile zlib "next generation" (zlib-ng). This zlib fork is also being worked on by Mark Adler and has several performance-related patches produced by firms such as Cloudflare and Intel.

The past two years have seen over 70 commits from Sebastian Pop, who has a long CV with roles at firms such as AWS, Samsung R&D, Qualcomm, AMD and IBM, as well as the FSF and the LLVM project.

The following was run using GCC version 5.4.0.

$ git clone https://github.com/zlib-ng/zlib-ng.git ~/zlib-ng
$ cd ~/zlib-ng
$ mkdir -p out
$ CFLAGS="-O4 -m64 -DZLIB_COMPAT -DWITH_GZFILEOP -DHAVE_HIDDEN -DHAVE_BUILTIN_CTZL -DMEDIUM_STRATEGY -DX86_64 -DX86_NOCHECK_SSE2 -DUNALIGNED_OK -DUNROLL_LESS -DX86_CPUID -DX86_SSE2_FILL_WINDOW -DX86_SSE4_2_CRC_HASH -DX86_PCLMULQDQ_CRC -DX86_QUICK_STRATEGY" ./configure --prefix=./out --zlib-compat
$ make
$ make install

Python is dynamically linked to zlib so I can use the above as a drop-in replacement for its unzip functionality.

$ objdump --dynamic-syms /usr/bin/python2.7 | grep ZLIB

0000000000000000      DF *UND*  0000000000000000  ZLIB_1.2.0  inflateCopy

$ cd ~
$ vi run_5

for i in {1..25}; do
    LD_PRELOAD=/home/mark/zlib-ng/out/lib/libz.so.1.2.11.zlib-ng /usr/bin/python2.7 -m zipfile -e hadoop.zip working/
done

$ time bash run_5

The above completed in 48.9 seconds.
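To double-check that the LD_PRELOAD above is actually taking effect, I can ask zlib for its version string from inside the interpreter. This is a small sketch using ctypes and it assumes the preloaded library exports the standard zlibVersion() call, which zlib-ng does when built in zlib-compat mode.

import ctypes

# dlopen(NULL) exposes the symbols already loaded into the process, so a
# preloaded libz is found ahead of the system copy
libz = ctypes.CDLL(None)
libz.zlibVersion.restype = ctypes.c_char_p

# expect a zlib-ng version string when run under the LD_PRELOAD prefix above
print(libz.zlibVersion())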

Finally, I'll compile my own version of Info-ZIP's unzip utility and see if I can find any performance improvements over the Debian-patched release.

$ git clone https://github.com/LuaDist/unzip.git
$ cd unzip
$ mkdir -p out/bin
$ export CFLAGS="-O4 -m64 -DZLIB_COMPAT -DWITH_GZFILEOP -DHAVE_HIDDEN -DHAVE_BUILTIN_CTZL -DMEDIUM_STRATEGY -DX86_64 -DX86_NOCHECK_SSE2 -DUNALIGNED_OK -DUNROLL_LESS -DX86_CPUID -DX86_SSE2_FILL_WINDOW -DX86_SSE4_2_CRC_HASH -DX86_PCLMULQDQ_CRC -DX86_QUICK_STRATEGY"
$ make -f unix/Makefile generic
$ make prefix=./out -f unix/Makefile install

The unzip binary I've produced is about 10% larger than the Debian one and wasn't built with bzip2's library.

$ ls -alht ~/unzip/out/bin/unzip /usr/bin/unzip

-rwxr-xr-x 2 mark mark 175K Oct 27 05:01 /home/mark/unzip/out/bin/unzip
-rwxr-xr-x 2 root root 159K Nov 20  2015 /usr/bin/unzip

$ objdump --dynamic-syms ~/unzip/out/bin/unzip | grep -v GLIBC

/home/mark/unzip/out/bin/unzip:     file format elf64-x86-64

DYNAMIC SYMBOL TABLE:
0000000000000000  w   D  *UND*  0000000000000000              __gmon_start__

$ objdump --dynamic-syms /usr/bin/unzip | grep -v GLIBC

/usr/bin/unzip:     file format elf64-x86-64

DYNAMIC SYMBOL TABLE:
0000000000000000  w   D  *UND*  0000000000000000              _ITM_deregisterTMCloneTable
0000000000000000      DF *UND*  0000000000000000              BZ2_bzlibVersion
0000000000000000      DF *UND*  0000000000000000              BZ2_bzDecompressInit
0000000000000000      DF *UND*  0000000000000000              BZ2_bzDecompressEnd
0000000000000000  w   D  *UND*  0000000000000000              __gmon_start__
0000000000000000      DF *UND*  0000000000000000              BZ2_bzDecompress
0000000000000000  w   D  *UND*  0000000000000000              _Jv_RegisterClasses
0000000000000000  w   D  *UND*  0000000000000000              _ITM_registerTMCloneTable
0000000000627304 g    D  .data  0000000000000000  Base        _edata
00000000007198b8 g    D  .bss   0000000000000000  Base        _end
0000000000627304 g    D  .bss   0000000000000000  Base        __bss_start
00000000004019c8 g    DF .init  0000000000000000  Base        _init
000000000041acd4 g    DF .fini  0000000000000000  Base        _fini

$ cd ~
$ vi run_6

for i in {1..25}; do
    /home/mark/unzip/out/bin/unzip -qq -o hadoop.zip -d ~/working/
done

$ time bash run_6

The above completed in 52.9 seconds.

Below is a summary of the benchmark times.

The fastest time is given in seconds; every other entry is expressed as a multiple of that fastest time.

Seconds    Method
48.9s      CPython 2.7.12 & zlib-ng 1.2.11
1.08x      Info-ZIP's unzip 6.0 (self-compiled w/o zlib)
1.12x      Info-ZIP's unzip 6.0 (Debian distribution w/o zlib)
1.16x      CPython & zlib 1.2.0
1.19x      7-Zip 16.02
1.48x      7-Zip 9.20.1
2.04x      Archiver 3.2.0

Vectorisation Advantage

Imagine having a toolbox at one side of a room and a workbench at the other end.

Instead of taking tools out of the toolbox one at a time and bringing them over to the workbench, you bring 4 or 8 tools over at a time. That's vectorisation.

Intel have been including op codes in their CPUs for vectorisation since the mid-1990s.

In 2011, Intel and AMD began shipping CPUs that supported AVX.

AVX2 chips followed in 2013 and AVX-512 chips from Intel followed in 2016.

The performance improvements require that vectorised op codes actually appear in the code being executed.
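To make the next two listings easier to follow, the loop they implement looks roughly like the following Python sketch; the names lhs and rhs are taken from the comments in the assembly.

# each element of rhs is incremented, halved and stored into lhs
rhs = [7, 8, 9, 10]
lhs = [0] * len(rhs)

for i in range(len(rhs)):
    lhs[i] = (rhs[i] + 1) >> 1

print(lhs)   # [4, 4, 5, 5]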

The following is Assembly code written by Denis Bakhvalov in a scalar (non-vectorised) fashion.

mov  edx, dword ptr [r9 + 4*rsi]    # loading rhs[i]
add  edx, 1                         # rhs[i] + 1
shr  edx                            # (rhs[i] + 1) >> 1
mov  dword ptr [rax + 4*rsi], edx   # store result to lhs
mov  esi, edi
add  edi, 1                         # incrementing i by 1
cmp  rcx, rsi
ja   <next iteration>

The following performs the same workload as above but in a vectorised fashion. Note how 256 bits of data, eight 32-bit values, are handled in each iteration.

vmovdqu ymm1, ymmword ptr [r9 + 4*rdi]    # loading 256 bits from rhs
vpsubd  ymm1, ymm1, ymm0                  # ymm0 has all bits set, like +1
vpsrld  ymm1, ymm1, 1                     # vector shift right
vmovdqu ymmword ptr [rax + 4*rdi], ymm1   # storing result
add     rdi, 8                            # incrementing i by 8
cmp     rsi, rdi
jne     <next iteration>

I've built the following regular expression pattern to help match any AVX or AVX2 instructions being used.

I printed the pattern out across multiple lines for easier desktop skimming.

$ vi avx_patterns.tmp

vaddpd|vaddps|vaddsubpd|vaddsubps|vandnpd|vandnps|vandpd|
vandps|vblendpd|vblendps|vblendvpd|vblendvps|vbroadcastf128|
vbroadcasti128|vbroadcastsd|vbroadcastss|vcmppd|vcmpps|vcmpsd|
vcmpss|vcvtdq2pd|vcvtdq2ps|vcvtpd2dq|vcvtpd2ps|vcvtps2dq|
vcvtps2pd|vcvttpd2dq|vcvttps2dq|vdivpd|vdivps|vdpps|
vextractf128|vextracti128|vgatherdpd|vgatherdps|vgatherqpd|
vgatherqps|vhaddpd|vhaddps|vhsubpd|vhsubps|vinsertf128|
vinserti128|vlddqu|vmaskmovpd|vmaskmovps|vmaxpd|vmaxps|vminpd|
vminps|vmovapd|vmovaps|vmovddup|vmovddup|vmovdqa|vmovdqu|
vmovmskpd|vmovmskps|vmovntdq|vmovntdqa|vmovntpd|vmovntps|
vmovshdup|vmovsldup|vmovupd|vmovups|vmpsadbw|vmulpd|vmulps|
vorpd|vorps|vpabsb|vpabsd|vpabsw|vpackssdw|vpacksswb|
vpackusdw|vpackuswb|vpaddb|vpaddd|vpaddq|vpaddsb|vpaddsw|
vpaddusb|vpaddusw|vpaddw|vpalignr|vpand|vpandn|vpavgb|
vpavgw|vpblendd|vpblendvb|vpblendw|vpbroadcastb|vpbroadcastd|
vpbroadcastq|vpbroadcastw|vpcmpeqb|vpcmpeqd|vpcmpeqq|vpcmpeqw|
vpcmpgtb|vpcmpgtd|vpcmpgtq|vpcmpgtw|vperm2f128|vperm2i128|
vpermd|vpermilpd|vpermilps|vpermpd|vpermps|vpermq|vpgatherdd|
vpgatherdq|vpgatherqd|vpgatherqq|vphaddd|vphaddsw|vphaddw|
vphsubd|vphsubsw|vphsubw|vpmaddubsw|vpmaddwd|vpmaskmovd|
vpmaskmovq|vpmaxsb|vpmaxsd|vpmaxsw|vpmaxub|vpmaxud|vpmaxuw|
vpminsb|vpminsd|vpminsw|vpminub|vpminud|vpminuw|vpmovmskb|
vpmovsxbd|vpmovsxbq|vpmovsxbw|vpmovsxdq|vpmovsxwd|vpmovsxwq|
vpmovzxbd|vpmovzxbq|vpmovzxbw|vpmovzxdq|vpmovzxwd|vpmovzxwq|
vpmuldq|vpmulhrsw|vpmulhuw|vpmulhw|vpmulld|vpmullw|vpmuludq|
vpor|vpsadbw|vpshufb|vpshufd|vpshufhw|vpshuflw|vpsignb|vpsignd|
vpsignw|vpslld|vpslldq|vpsllq|vpsllvd|vpsllvq|vpsllw|vpsrad|
vpsravd|vpsraw|vpsrld|vpsrldq|vpsrlq|vpsrlvd|vpsrlvq|vpsrlw|
vpsubb|vpsubd|vpsubq|vpsubsb|vpsubsw|vpsubusb|vpsubusw|vpsubw|
vptest|vpunpckhbw|vpunpckhdq|vpunpckhqdq|vpunpckhwd|vpunpcklbw|
vpunpckldq|vpunpcklqdq|vpunpcklwd|vpxor|vpxor|vrcpps|vroundpd|
vroundps|vrsqrtps|vshufpd|vshufps|vsqrtpd|vsqrtps|vsubpd|
vsubps|vtestpd|vtestps|vunpckhpd|vunpckhps|vunpcklpd|vunpcklps|
vxorpd|vxorps|vzeroall|vzeroupper

I'll remove the new lines for grep compatibility.

$ cat avx_patterns.tmp | tr -d '\n' > avx_patterns

Only the Golang-built Archiver utility appears to be using any AVX op codes.

$ objdump -d ~/arc | grep -cf avx_patterns # 3539
$ objdump -d /usr/bin/unzip | grep -cf avx_patterns # 0
$ objdump -d /usr/lib/p7zip/7z | grep -cf avx_patterns # 0
$ objdump -d /bin/gzip | grep -cf avx_patterns # 0
$ objdump -d /home/mark/unzip/out/bin/unzip | grep -cf avx_patterns # 0
$ objdump -d /usr/bin/python2.7 | grep -cf avx_patterns # 0
$ objdump -d /home/mark/zlib-ng/out/lib/libz.so.1.2.11.zlib-ng | grep -cf avx_patterns # 0

I'll then look for signs of SSE 4.2 op codes in any of the binaries benchmarked in this post.

$ vi sse42_patterns

crc32[qwlb]|pcmpestri|pcmpestrm|pcmpistri|pcmpistrm|pcmpgtq

$ objdump -d ~/arc | grep -cf sse42_patterns # 24
$ objdump -d /usr/bin/unzip | grep -cf sse42_patterns # 0
$ objdump -d /usr/lib/p7zip/7z | grep -cf sse42_patterns # 0
$ objdump -d /bin/gzip | grep -cf sse42_patterns # 0
$ objdump -d /home/mark/unzip/out/bin/unzip | grep -cf sse42_patterns # 0
$ objdump -d /usr/bin/python2.7 | grep -cf sse42_patterns # 0
$ objdump -d /home/mark/zlib-ng/out/lib/libz.so.1.2.11.zlib-ng | grep -cf sse42_patterns # 5

Both Archiver and zlib-ng contain instances of SSE 4.2 op codes.

The above isn't proof that those op codes were executed in the code paths followed during the ZIP extraction process, but an absence of them does prove there is no support at all for these optimisations.
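Grep only gives a count, so when I want to see which vector instructions a binary actually contains I can tally the mnemonics in objdump's output instead. The following is a rough sketch; it assumes objdump is on the PATH and uses a deliberately crude filter (v-prefixed mnemonics for AVX plus the crc32 and pcmp* instructions from SSE 4.2).

import subprocess
from collections import Counter

def vector_mnemonics(path):
    # disassemble the binary and tally anything that looks like an AVX or SSE 4.2 op
    listing = subprocess.check_output(["objdump", "-d", path])
    counts = Counter()
    for line in listing.decode("utf-8", "replace").splitlines():
        parts = line.split("\t")
        if len(parts) < 3 or not parts[2].strip():
            continue
        mnemonic = parts[2].split()[0]
        if mnemonic.startswith(("v", "crc32", "pcmp")):
            counts[mnemonic] += 1
    return counts

print(vector_mnemonics("/home/mark/arc").most_common(10))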

It's also worth noting that CRC32 checksums, which have a dedicated op code in SSE 4.2, are used with GZIP compression in zlib, whereas Adler-32 is used for DEFLATE compression. The instances of CRC32-related op codes above prove they exist in the binary but they can be presumed to go unused in the code path the unzip process took.

Adler-32 was generally able to outperform CRC32 at the expense of error-detection strength, but that changed when Intel released a CRC32-specific op code in the SSE 4.2 instruction set additions a few years ago.
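Both checksums are exposed by Python's zlib module, so a quick, unscientific comparison can be run from the interpreter. Note that this exercises whichever zlib build the interpreter is linked against (or has had preloaded), so the numbers will vary from machine to machine.

import timeit
import zlib

payload = b"x" * (16 * 1024 * 1024)   # 16 MB of dummy data

for name, checksum in (("adler32", zlib.adler32), ("crc32", zlib.crc32)):
    seconds = timeit.timeit(lambda: checksum(payload), number=20)
    print("%s: %.3f seconds for 20 runs" % (name, seconds))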

Intel published research two years ago where they worked to improve the performance of an Adler-32 implementation using AVX2. Though not measured in that research, the firm believes AVX-512 could improve on their findings further. There have been similar efforts on ARM chips in recent years as well.

Conclusion

Despite Archiver's support for AVX and SSE optimisations, it achieved the slowest time among the tools benchmarked. I'm surprised by this and it will be a point of further research.

CPython was able to show healthy improvements when zlib 1.2.0 was swapped out for zlib-ng. Prefixing the Python command with the LD_PRELOAD setting is a pretty non-invasive way of taking advantage of these speed-ups.

I have a habit of accepting whichever software version apt install delivers, but in the case of 7-Zip there was a significant performance increase to be had simply by looking for a more recent .deb file.

The fastest time seen in this benchmark represents 5.46 MB/s of ZIP content being decompressed. Given this is two orders of magnitude off of bottlenecking mechanical storage, there is still a long way to go before compute optimisations are no longer a concern.
