Upgrade your OpenSSL on Scaleway ARM: 5x performance gain
In case anyone is using those machines, here's what I just found.
Debian Jessie comes with OpenSSL 1.0.1t:
# openssl speed -evp aes-256-cbc
Doing aes-256-cbc for 3s on 16 size blocks: 9437607 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 64 size blocks: 2704353 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 256 size blocks: 713491 aes-256-cbc's in 2.99s
Doing aes-256-cbc for 3s on 1024 size blocks: 181168 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 8192 size blocks: 22743 aes-256-cbc's in 3.00s
OpenSSL 1.0.1t 3 May 2016
built on: Fri Jan 27 00:08:40 2017
options:bn(64,64) rc4(ptr,char) des(idx,cisc,16,int) aes(partial) blowfish(ptr)
compiler: gcc -I. -I.. -I../include -fPIC -DOPENSSL_PIC -DOPENSSL_THREADS -D_REENTRANT -DDSO_DLFCN -DHAVE_DLFCN_H -DL_ENDIAN -DTERMIO -g -O2 -fstack-protector-strong -Wformat -Werror=format-security -D_FORTIFY_SOURCE=2 -Wl,-z,relro -Wa,--noexecstack -Wall
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-256-cbc      50333.90k    57692.86k    61088.19k    61838.68k    62103.55k
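For anyone wondering how the summary row relates to the "Doing ... for 3s" lines: each figure is just blocks processed times block size, divided by the elapsed time, in units of 1000 bytes per second. A minimal Python sketch that recomputes the row from the counts in the 1.0.1t run above (the function name `throughput_k` is made up for illustration):

```python
def throughput_k(blocks, block_size, seconds):
    """Return throughput in 1000s of bytes per second, as openssl speed reports it."""
    return blocks * block_size / seconds / 1000.0


# (block size in bytes, blocks processed, elapsed seconds), copied from the 1.0.1t run
runs_101t = [
    (16, 9437607, 3.00),
    (64, 2704353, 3.00),
    (256, 713491, 2.99),
    (1024, 181168, 3.00),
    (8192, 22743, 3.00),
]

for size, blocks, secs in runs_101t:
    print(f"{size:>5} bytes: {throughput_k(blocks, size, secs):.2f}k")
```

The recomputed values should line up with the aes-256-cbc row of the output, which is a quick sanity check that no line of the benchmark got mangled when copying.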
After upgrading that to Debian Stretch's version 1.1.0f:
# openssl speed -evp aes-256-cbc
Doing aes-256-cbc for 3s on 16 size blocks: 18554406 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 64 size blocks: 9779929 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 256 size blocks: 3375690 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 1024 size blocks: 938046 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 8192 size blocks: 121052 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 16384 size blocks: 60624 aes-256-cbc's in 3.00s
OpenSSL 1.1.0f 25 May 2017
built on: reproducible build, date unspecified
options:bn(64,64) rc4(char) des(int) aes(partial) blowfish(ptr)
compiler: gcc -DDSO_DLFCN -DHAVE_DLFCN_H -DNDEBUG -DOPENSSL_THREADS -DOPENSSL_NO_STATIC_ENGINE -DOPENSSL_PIC -DOPENSSL_BN_ASM_MONT -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DVPAES_ASM -DECP_NISTZ256_ASM -DPOLY1305_ASM -DOPENSSLDIR="\"/usr/lib/ssl\"" -DENGINESDIR="\"/usr/lib/aarch64-linux-gnu/engines-1.1\""
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
aes-256-cbc      98956.83k   208638.49k   288058.88k   320186.37k   330552.66k   331087.87k
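To put a number on the gain per block size, here is a small Python sketch that divides the 1.1.0f figures by the 1.0.1t ones. The values are copied from the two summary rows above; the dictionary and function names are made up for illustration, and the 16384-byte column is skipped because the 1.0.1t run has no such measurement.

```python
# Throughput in 1000s of bytes per second, keyed by block size in bytes
old_k = {16: 50333.90, 64: 57692.86, 256: 61088.19, 1024: 61838.68, 8192: 62103.55}
new_k = {16: 98956.83, 64: 208638.49, 256: 288058.88, 1024: 320186.37,
         8192: 330552.66, 16384: 331087.87}


def speedups(old, new):
    """Return new/old ratio for every block size present in both runs."""
    return {size: new[size] / old[size] for size in sorted(old) if size in new}


for size, ratio in speedups(old_k, new_k).items():
    print(f"{size:>5} bytes: {ratio:.2f}x")
```

The ratio grows with block size: a bit under 2x at 16 bytes and a little over 5x from 1024 bytes up, so the "5x" in the title describes the large-block case.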
It seems the Cavium ThunderX contains hardware acceleration for AES (similar to AES-NI on x86):
processor : 0
BogoMIPS : 200.00
Features : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics cpuid
CPU implementer : 0x43
CPU architecture: 8
CPU variant : 0x1
CPU part : 0x0a1
CPU revision : 1

processor : 1
BogoMIPS : 200.00
Features : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics cpuid
CPU implementer : 0x43
CPU architecture: 8
CPU variant : 0x1
CPU part : 0x0a1
CPU revision : 1

processor : 2
BogoMIPS : 200.00
Features : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics cpuid
CPU implementer : 0x43
CPU architecture: 8
CPU variant : 0x1
CPU part : 0x0a1
CPU revision : 1

processor : 3
BogoMIPS : 200.00
Features : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics cpuid
CPU implementer : 0x43
CPU architecture: 8
CPU variant : 0x1
CPU part : 0x0a1
CPU revision : 1
But it's not supported yet in the OpenSSL 1.0.1 that Jessie ships. With that enabled, however, these should make for decent VPN machines, or handle HTTPS load better.
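If you want to check another box for the same CPU feature, a rough Python sketch is below. It only looks for the aes flag in /proc/cpuinfo-style text (the "Features" line on ARM, "flags" on x86); it says nothing about whether the installed OpenSSL build actually uses the instructions. The function names are made up for illustration.

```python
def cpu_features(cpuinfo_text):
    """Collect feature flags from /proc/cpuinfo-style text.

    ARM kernels list them under "Features", x86 kernels under "flags".
    """
    feats = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() in ("features", "flags"):
            feats.update(value.split())
    return feats


def has_aes(cpuinfo_text):
    """True if the CPU advertises AES instructions."""
    return "aes" in cpu_features(cpuinfo_text)


# Sample taken from the ThunderX output in the post
sample = "processor : 0\nFeatures : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics cpuid\n"
print(has_aes(sample))
```

On a live Linux system you would pass in the contents of /proc/cpuinfo instead of the sample string.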
Comments
You’re still using Scaleway?
Well I never used it in any serious capacity (i.e. other than for Tor), and with all the Kimsufis that I have, don't need one at the moment either.
Still, it's a solid offer, even more so with this OpenSSL improvement. Now that OVH's VPS SSD has had a price hike, this one might remain the best bang-for-the-buck KVM out there (whether the x86 or ARM variant). If I decide to cancel my KS dedis, I will most likely use a couple of these, either for hosting stuff directly or as reverse proxies.
You still shilling for Vultr?
He gives up a KS4C for a Scaleway.
lol where was this?
Which instruction set? Does this also apply to SBC ARMs like the Raspberry Pi and FriendlyARM NanoPi?
https://en.wikipedia.org/wiki/AES_instruction_set
Oh it's the actual aes instruction set not an arm one. I need glasses.
Looks like the NanoPi's H3 also supports it, but not the Raspberry Pi.
Why am I now envisioning VPN-in-a-cigarette box?
Yes. BGP FTW.
What's the clock speed of those Cavium cores? Obviously, 200 BogoMIPS isn't correct. I'd have thought they'd be faster than a 2.2 GHz Cortex-A15, which is quite old, so either the clock speed is a bit slower or the Caviums aren't that impressive.
You can see some benchmarks here: https://wiki.neoon.pw/doku.php?id=dedicated_benchmarks
Also IIRC they got 212 MB/sec in my MD5 CPU benchmark, which is more than e.g. some 1.7 GHz Xeon VPS I had at another provider.
The ThunderX cores are relatively weak as the focus is on having a lot of them. The Cortex-A15, on the other hand, is designed as a performance core, so I wouldn't be surprised if the ThunderX is slightly less powerful than an A15 core.
Performance-wise, one weirdness I noticed is in its mbw results. Basically, memory access is "not very fast", except if you use the specialized "copy block" function. A modern x86 system does not have such a distinction; it will show 3-5 GB/sec in this test across the board, no matter which access method.
What version of mbw are you using? I found this in a search.
If this code is accurate, I can't see why MCBLOCK would be any different to a memcpy. If it was covering the same block, then you're really just comparing cache speed vs memory speed (and cache is much faster); if the pointers move across the data, then you're essentially doing exactly the same thing as a single memcpy, and I can't see it being any faster (it doesn't use any "specialized function"). And in fact, I wouldn't be surprised if the "DUMB" method just gets rewritten to a memcpy by an optimizing compiler.
I used version 1.2.2 from Debian Jessie.
From the code:
Yeah the code's broken. Essentially the MCBLOCK method is just testing memory write bandwidth, since the a is likely served from cache, whilst MEMCPY tests read+write (copy) bandwidth.
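To make that concrete, here is a toy Python model of the two access patterns as they are described in this thread. It is an assumption-laden sketch, not mbw's actual source: it assumes the block method advances the destination while always reading from the start of the source buffer, and all function names are made up for illustration. It only counts which source bytes get touched; it does not measure bandwidth.

```python
def memcpy_pattern(array_bytes, block_size):
    """One big copy: (src_offset, dst_offset, length) covering the whole array."""
    return [(0, 0, array_bytes)]


def mcblock_pattern(array_bytes, block_size):
    """Block copy as described in the thread (assumed, not taken from mbw):
    the destination advances, the source stays at offset 0."""
    ops = []
    dst = 0
    while dst < array_bytes:
        n = min(block_size, array_bytes - dst)
        ops.append((0, dst, n))
        dst += n
    return ops


def distinct_source_bytes(ops):
    """Count how many different source bytes the copy operations read."""
    touched = set()
    for src, _dst, n in ops:
        touched.update(range(src, src + n))
    return len(touched)


array_bytes, block_size = 4096, 256  # small illustrative sizes
print(distinct_source_bytes(memcpy_pattern(array_bytes, block_size)))
print(distinct_source_bytes(mcblock_pattern(array_bytes, block_size)))
```

Under that assumption both methods write the full destination, but the block method only ever reads one block's worth of source data, which would easily stay in cache. That fits the explanation that it measures write bandwidth rather than copy bandwidth.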