Force kernel AES-NI usage on a VPS without the aes CPU flag

First of all, thanks to @rm_ for his brilliant blog post on forcing OpenSSL to use the AES-NI instruction set when a VPS's CPU does not advertise it even though the host actually supports it. This is the counterpart that forces the Linux kernel to use AES-NI when QEMU does not pass through that flag, which is useful for IPSec, disk encryption, etc.

It turns out to be fairly simple with a kernel module. Just shove the following into any hello world boilerplate that you can find in a "how to write Linux kernel modules" tutorial (the set_bit() call belongs inside the module's init function).

#include <linux/bitops.h>
#include <asm/processor.h>   /* boot_cpu_data */
/* 153 == X86_FEATURE_AES: pretend the CPU advertises AES-NI */
set_bit(153, (unsigned long *)(boot_cpu_data.x86_capability));

The magic number 153 is X86_FEATURE_AES, taken from arch/x86/include/asm/cpufeatures.h, where it is defined as (4*32+25): word 4 of x86_capability mirrors CPUID leaf 1 ECX, and bit 25 is AES. It is trivial to enforce the usage of another CPU feature (e.g., AVX) with another magic number.

After inserting your own module, a manual modprobe aesni_intel should do the trick.
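
For completeness, here is roughly what the whole thing looks like as a standalone module (a minimal, untested sketch; the file and function names are arbitrary):

/* force_aes.c - minimal module that fakes the AES-NI capability bit */
#include <linux/init.h>
#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/bitops.h>
#include <asm/processor.h>              /* boot_cpu_data */

static int __init force_aes_init(void)
{
        /* 153 == X86_FEATURE_AES (4*32+25) from cpufeatures.h */
        set_bit(153, (unsigned long *)(boot_cpu_data.x86_capability));
        pr_info("force_aes: AES-NI capability bit set\n");
        return 0;
}

static void __exit force_aes_exit(void)
{
        /* nothing to undo; the capability bit is left set */
}

module_init(force_aes_init);
module_exit(force_aes_exit);
MODULE_LICENSE("GPL");

Build it against your running kernel headers with the usual obj-m Makefile, insmod it, and then load aesni_intel as above.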

On one of my KVM servers, the result of cryptsetup benchmark increased from

#     Algorithm | Key |  Encryption |  Decryption
        aes-cbc   128b   169.8 MiB/s   167.3 MiB/s

... to ...

#     Algorithm | Key |  Encryption |  Decryption
        aes-cbc   128b   678.2 MiB/s  2201.4 MiB/s

Comments

  • MasonR Community Contributor

    Just curious -- would your VPS get nuked if you did this on a node that doesn't have AES-NI support?

  • rm_ IPv6 Advocate, Veteran

    @psb777 Cheers! Added a link to this post to my original one.

    MasonR said: would your VPS get nuked if you did this on a node that doesn't have AES-NI support?

    Your VPS kernel would just crash with an "Invalid opcode" exception.

    Thanked by: MasonR
  • @rm_ said:
    @psb777 Cheers! Added a link to this post to my original one.

    MasonR said: would your VPS get nuked if you did this on a node that doesn't have AES-NI support?

    Your VPS kernel would just crash with an "Invalid opcode" exception.

    ... and then you’d have to go into the Rescue Disk of Shame.

    Thanked by: pike, brueggus, netomx
  • The fact that the vCPU passthrough exposes these at the KVM level makes me a bit unhappy about the state of KVM. I had an f00f flashback for a moment; there has to be some magical feature set you can hit to crash the upstream kvm kernel module. Someone will find it soon enough.

    Thanked by: vimalware, netomx
  • pike Veteran
    edited December 2017

    @doghouch said:
    ... and then you’d have to go into the Rescue Disk of Shame.

    I entered rescue disk of shame into google image search, see what I found

  • mksh Member
    edited December 2017

    @MasonR said:
    Just curious -- would your VPS get nuked if you did this on a node that doesn't have AES-NI support?

    You can expect anything with optional code paths depending on this flag to crash with illegal-instruction faults, and if said stuff is in the kernel it might spell reboot time (not that a constantly crashing ssh daemon wouldn't cause the same).

  • psb777 said: It is trivial to enforce the usage of another CPU feature (e.g., AVX) with another magic number.

    VMX too? @WSS going to try and make a VPS a rootserver by adding AES and VMX? Would be fun to see if this works...

    Thanked by: netomx
  • @Falzo You're the German - go for it.

  • @pike said:

    @doghouch said:
    ... and then you’d have to go into the Rescue Disk of Shame.

    I entered rescue disk of shame into google image search, see what I found

    Sums it up pretty well :I

  • WSS Member
    edited December 2017

    I'm a little curious what you get with your custom module doing this after your assertion:

    unsigned int eax, ebx, ecx, edx;
    ...
    eax = 1;   /* CPUID leaf 1: feature flags */
    ecx = 0;
    native_cpuid(&eax, &ebx, &ecx, &edx);
    /* AES-NI is reported in bit 25 of ECX for CPUID leaf 1 */
    printk(KERN_INFO "AES-NI post-insert response: %u\n", (ecx >> 25) & 1);
    

    I'd play around a bit, myself, but I'm too damn lazy to crash my own shit. :D

  • WSS said: I'm a little curious what you get with your custom module doing this after your assertion

    It won't change the output of the cpuid instruction, and it won't modify /proc/cpuinfo either...

    By the way, as OpenSSL directly calls the cpuid instruction to check the availability of AES-NI, you still need @rm_'s OpenSSL trick to force your userspace program to use it.

  • @psb777 said:
    It won't change the output of the cpuid instruction, and it won't modify /proc/cpuinfo either...

    So, it just sets it as an active bit which may or may not actually do anything for most software, then?

    By the way, as OpenSSL directly calls the cpuid instruction to check the availability of AES-NI, you still need @rm_'s OpenSSL trick to force your userspace program to use it.

    So, anything in userspace still has to be forced to run code that may or may not execute properly, and there's no change actually set in the subkernel, other than it works? So, how does flipping the bit do anything at all, other than set up a path for at least this instance to follow any code which may work with the AES-NI subset, and how the hell does it do that when you can't test for it?

    I get the override for OpenSSL; I'm just wondering how/where this might actually be useful in common utilization, like speeding up ffmpeg, et al.

  • WSS said: So, it just sets it as an active bit which may or may not actually do anything for most software, then?

    Right, it won't do anything for most software. It just enables the kernel to load the aesni_intel module, which provides accelerated AES functions that the kernel alone uses. This is possibly only useful for IPSec and dm-crypt where en-/decryption is done in the kernel space.
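
    If you want to double-check from inside the kernel, a snippet along these lines (hypothetical, assuming the 4.x skcipher API) should report a driver name like cbc-aes-aesni once aesni_intel is loaded, rather than the generic C implementation:

    #include <linux/err.h>
    #include <linux/printk.h>
    #include <crypto/skcipher.h>

    static void report_aes_driver(void)
    {
            /* Ask the crypto API for AES-CBC; with aesni_intel loaded it
             * should resolve to the higher-priority AES-NI implementation. */
            struct crypto_skcipher *tfm = crypto_alloc_skcipher("cbc(aes)", 0, 0);

            if (IS_ERR(tfm)) {
                    pr_err("cbc(aes) unavailable: %ld\n", PTR_ERR(tfm));
                    return;
            }
            pr_info("cbc(aes) backed by: %s\n",
                    crypto_tfm_alg_driver_name(crypto_skcipher_tfm(tfm)));
            crypto_free_skcipher(tfm);
    }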

    Thanked by: WSS, Falzo
  • @psb777 said:

    WSS said: So, it just sets it as an active bit which may or may not actually do anything for most software, then?

    Right, it won't do anything for most software. It just enables the kernel to load the aesni_intel module, which provides accelerated AES functions that the kernel alone uses. This is possibly only useful for IPSec and dm-crypt where en-/decryption is done in the kernel space.

    That makes sense. I'm assuming there's something in the modulespace that sets that bit for modules only, so my snippet above would likely assert true, whereas anywhere outside of the module-level ring it'd just be ignored, as you suggested.

    Interesting find. You're more bored than I am! :D

  • WSS said: eax = 1; native_cpuid(&eax, &ebx, &ecx, &edx);

    There is an interesting set of patches on lkml about emulating CPUID through a flag in an MSR. Although these patches also added support for the ENABLES_CPUID_FAULT bit in the KVM-emulated MSR, they were only merged into mainline in 4.12. Ideally, with this patch we can make most userspace software automatically detect and use AES-NI or other fancy features.
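
    For the curious, the userspace half of that interface looks roughly like this (a sketch assuming x86-64 and a >= 4.12 kernel; a real user such as a record/replay tool still has to catch the resulting fault and emulate cpuid itself):

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/syscall.h>
    #include <asm/prctl.h>          /* ARCH_GET_CPUID / ARCH_SET_CPUID */

    int main(void)
    {
            /* 1 = cpuid executes normally, 0 = cpuid faults (SIGSEGV) */
            long enabled = syscall(SYS_arch_prctl, ARCH_GET_CPUID, 0);
            if (enabled < 0) {
                    perror("ARCH_GET_CPUID (no CPUID faulting support?)");
                    return 1;
            }
            printf("cpuid is currently %s for this thread\n",
                   enabled ? "enabled" : "faulting");

            /* Turn cpuid off; from here on a supervisor would have to
             * emulate the instruction with whatever feature bits it likes. */
            if (syscall(SYS_arch_prctl, ARCH_SET_CPUID, 0) != 0)
                    perror("ARCH_SET_CPUID");
            return 0;
    }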

    Thanked by: WSS
  • Rhys Member, Host Rep
    edited December 2017

    Well, this is good to know; going to have a nose around at this for other CPU instructions as well. Maybe I can finally get some good performance out of some of my VPSes at hosts that refuse to pass these flags through.

  • any help for crypto mining?

  • Rhys Member, Host Rep

    @allnetstore said:
    any help for crypto mining?

    shoo.

  • psb777 said: It is trivial to enforce the usage of another CPU feature (e.g., AVX)

    Does that actually work though? AES-NI doesn't introduce new registers, so it should be safe as long as the underlying CPU supports it, but AVX does change register definitions.

    WSS said: like speeding up ffmpeg

    ffmpeg does have a -cpuflags flag which you could use.

  • xyz said: Does that actually work though? AES-NI doesn't introduce new registers, so it should be safe as long as the underlying CPU supports it, but AVX does change register definitions.

    Good point. I suppose the host machine can handle YMM registers or whatnot during context switches (regardless of AVX support in its KVM guests), but I'm not sure if the guest kernel can handle them. Probably not, as the xsave flag is also missing. But you control the guest kernel, so you can always do extra hacks to make it work.

    Theoretically, if the KVM guest has only one CPU and no userspace program is using AVX (which may or may not be the case), I think it is safe to assume nobody is going to touch your YMM registers, and thus one kernel module can hold exclusive usage of those exotic registers...
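
    For context, the usual way kernel code touches SIMD state is to bracket it like this (a generic sketch, nothing specific to this hack); without xsave the guest kernel won't even try to save or restore the YMM halves across switches:

    #include <asm/fpu/api.h>        /* kernel_fpu_begin / kernel_fpu_end */

    static void simd_section(void)
    {
            kernel_fpu_begin();     /* save FPU/SIMD state, disable preemption */
            /* ... SSE/AVX work would go here ... */
            kernel_fpu_end();
    }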

  • cryptsetup benchmark improved like 6x, but openvpn is still slow af (aes-128-cbc).
    Why?

  • raindog308 Administrator, Veteran

    psb777 said: First of all, thanks to @rm_ for his brilliant blog post

    Didn't know @rm_ had a blog - thanks for that link.

    WSS said: I had an f00f flashback for a moment

    I don't think a virtualized kernel will protect you from something like that. IANAVE (I am not a virtualization engineer) but ultimately the CPU is going to execute an x86 opcode, no? So if there's a bug due to opcodes, you're going to see it regardless.

    KVM doesn't evaluate and wrap every opcode, does it? Things like bochs are "virtualized processors" while other more performant virtualization methods are more like "virtualized environments" but ultimately it's still opcodes on the metal.

  • @raindog308 said:
    KVM doesn't evaluate and wrap every opcode, does it? Things like bochs are "virtualized processors" while other more performant virtualization methods are more like "virtualized environments" but ultimately it's still opcodes on the metal.

    Not as such, which is what speeds up the emulation when it's a passthrough, like using kvm_amd/kvm_intel. That said, I haven't audited (or understood) all of the code, so I'm sure not running in Ring 0 will do quite a bit to help keep this from being an issue, but the fact that you can arbitrarily set this in the CPU from the module level (I assumed there'd be more [e.g. preload for microcode]) makes me question just how difficult the next nasty crash issue might be, but you still need to load modules to get it to that sublayer, so far.

  • @lemon said:
    cryptsetup benchmark improved like 6x, but openvpn is still slow af (aes-128-cbc).
    Why?

    This method only forces the Linux kernel to use AES-NI. It applies to VPNs that use the kernel IPSec stack, but not OpenVPN.

    OpenVPN does encryption in userspace, and it uses OpenSSL by default. So you should check out this blog post to get a boost.

    Thanked by: raindog308
  • lemon Member
    edited December 2017

    @psb777 said:

    @lemon said:
    cryptsetup benchmark improved like 6x, but openvpn is still slow af (aes-128-cbc).
    Why?

    This method only forces the Linux kernel to use AES-NI. It applies to VPNs that use the kernel IPSec stack, but not OpenVPN.

    OpenVPN does encryption in userspace, and it uses OpenSSL by default. So you should check out this blog post to get a boost.

    Yes, I already read that blog entry and added
    export OPENSSL_ia32cap="+0x200000200000000"
    to /etc/init.d/openvpn and rebooted, but it didn't work.

    # openssl speed -elapsed -evp aes-128-cbc
    type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
    aes-128-cbc      95350.26k   103357.74k   105868.37k   233262.08k   257701.21k
    
    # OPENSSL_ia32cap="+0x200000200000000" openssl speed -elapsed -evp aes-128-cbc
    type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
    aes-128-cbc     582909.68k   638098.05k   654170.54k   658908.84k   660277.93k
    
  • psb777 Member
    edited December 2017

    lemon said: but it didn't work

    I have just run a series of tests with openvpn and iperf

    [ ID] Interval       Transfer     Bandwidth
     (through loopback interface)
    [  4]  0.0-10.0 sec  42.6 GBytes  36.6 Gbits/sec
    [  4]  0.0-10.0 sec  41.9 GBytes  35.9 Gbits/sec
     (openvpn without encryption)
    [  4]  0.0-10.0 sec   702 MBytes   588 Mbits/sec
    [  4]  0.0-10.0 sec   703 MBytes   588 Mbits/sec
     (openvpn with aes-128-cbc + sha1, without the AES-NI hack)
    [  4]  0.0-10.1 sec   244 MBytes   203 Mbits/sec
    [  4]  0.0-10.1 sec   249 MBytes   208 Mbits/sec
     (openvpn with aes-128-cbc + sha1, with OPENSSL_ia32cap environ)
    [  4]  0.0-10.0 sec   315 MBytes   263 Mbits/sec
    [  4]  0.0-10.0 sec   314 MBytes   262 Mbits/sec
    

    See, there is a performance boost, albeit a puny one.

    Edit: here's the result for IPSec

     (veth without xfrm)
    [  4]  0.0-10.0 sec  39.9 GBytes  34.3 Gbits/sec
    [  4]  0.0-10.0 sec  39.6 GBytes  34.0 Gbits/sec
     (xfrm with aes-128-cbc + sha1, without aes-ni)
    [  4]  0.0-10.0 sec   467 MBytes   391 Mbits/sec
    [  4]  0.0-10.0 sec   472 MBytes   396 Mbits/sec
     (xfrm with aes-128-cbc + sha1, with the aesni-intel module)
    [  4]  0.0-10.0 sec   752 MBytes   630 Mbits/sec
    [  4]  0.0-10.0 sec   751 MBytes   629 Mbits/sec
    
  • @psb777 said:

    lemon said: but it didn't work

    I have just run a series of tests with openvpn and iperf

     (openvpn with aes-128-cbc + sha1, without the AES-NI hack)
    [  4]  0.0-10.1 sec   244 MBytes   203 Mbits/sec
    [  4]  0.0-10.1 sec   249 MBytes   208 Mbits/sec
     (openvpn with aes-128-cbc + sha1, with OPENSSL_ia32cap environ)
    [  4]  0.0-10.0 sec   315 MBytes   263 Mbits/sec
    [  4]  0.0-10.0 sec   314 MBytes   262 Mbits/sec
    

    See, there is a performance boost, albeit a puny one.

    I'd say this is more of a measurement inaccuracy than a real boost.

  • It turns out I'm not too sharp at adapting the linked tutorial...

    /usr/lib/aesni_intel/aesni_intel.c:2:9: error: expected declaration specifiers or ‘...’ before numeric constant
     set_bit(153, (unsigned long *)(boot_cpu_data.x86_capability));
             ^
    /usr/lib/aesni_intel/aesni_intel.c:2:14: error: expected declaration specifiers or ‘...’ before ‘(’ token
     set_bit(153, (unsigned long *)(boot_cpu_data.x86_capability));
    

    I'd appreciate a pointer, and once I've got it working I'll write it up as a full walkthrough that even someone like myself could follow :)
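
    (For the record, those errors are just C scoping: the set_bit() call has to live inside a function, e.g. the module's init routine, rather than at the top level of the file. A rough sketch, reusing the includes from the snippet at the top of the thread and an arbitrary function name:)

    static int __init aesni_force_init(void)
    {
            set_bit(153, (unsigned long *)(boot_cpu_data.x86_capability));
            return 0;
    }
    module_init(aesni_force_init);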

  • @adamluk said:
    I'd appreciate a pointer, and once I've got it working I'll write it up as a full walkthrough that even someone like myself could follow :)

    There's a reason why it wasn't.

  • I knew you'd be along shortly, @WSS, but don't be so discouraging :P
