c++ - Near constant time rotate that does not violate the standards -


i'm having heck of time trying come constant time rotate not violate c/c++ standards.

the problem edge/corner cases, operations called out in algorithms , algorithms cannot changed. example, following crypto++ , executes test harness under gcc ubsan (i.e., g++ fsanitize=undefined):

$ ./cryptest.exe v | grep runtime misc.h:637:22: runtime error: shift exponent 32 large 32-bit type 'unsigned int' misc.h:643:22: runtime error: shift exponent 32 large 32-bit type 'unsigned int' misc.h:625:22: runtime error: shift exponent 32 large 32-bit type 'unsigned int' misc.h:637:22: runtime error: shift exponent 32 large 32-bit type 'unsigned int' misc.h:643:22: runtime error: shift exponent 32 large 32-bit type 'unsigned int' misc.h:637:22: runtime error: shift exponent 32 large 32-bit type 'unsigned int' 

and code @ misc.h:637:

template <class t> inline t rotlmod(t x, unsigned int y) {     y %= sizeof(t)*8;     return t((x<<y) | (x>>(sizeof(t)*8-y))); } 

intel's icc particularly ruthless, , removed entire function call out y %= sizeof(t)*8. fixed few years back, left other errata in-place due lack of constant time solution.

there's 1 pain point remaining. when y = 0, condition 32 - y = 32, , sets undefined behavior. if add check if(y == 0) ..., code fails meet constant time requirement.

i've looked @ number of other implementations, linux kernel other cryptographic libraries. contain same undefined behavior, appears dead end.

how can perform rotate in constant time minimum number of instructions?

edit: near constant time, mean avoid branch same instructions executed. i'm not worried cpu microcode timings. while branch prediction may great on x86/x64, may not perform on other platforms, embedded.


none of these tricks required if gcc or clang provided intrinsic perform rotate in near constant time. i'd settle "perform rotate" since don't have that.

i've linked answer full details several other "rotate" questions, including this community wiki question, should kept date best-practices.

i found blog post issue, , looks it's solved problem (with new-enough compiler versions).

john regehr @ university of utah recommends version "c" of attempts @ making rotate function. replaced assert bitwise and, , found still compiles single rotate insn.

typedef uint32_t rotwidth_t;  // parameterize comparing compiler output various sizes  rotwidth_t rotl (rotwidth_t x, unsigned int n) {   const unsigned int mask = (char_bit*sizeof(x)-1);  // e.g. 31    assert ( (n<=mask)  &&"rotate type width or more");   n &= mask;  // avoid undef behaviour ndebug.  0 overhead types / compilers   return (x<<n) | (x>>( (-n)&mask )); }  rotwidth_t rot_const(rotwidth_t x) {   return rotl(x, 7); } 

this templated on x's type, makes more sense real use, have width in function name (like rotl32). when you're rotating, know width want, , matters more size variable you're storing value in.

also make sure use unsigned types. right-shift of signed types arithmetic shift, shifting in sign-bits. (it's technically implementation-dependent behaviour, uses 2's complement now.)

pabigot independently came same idea before did, and posted @ gibhub. version has c++ static_assert checking make compile-time error use rotate count outside range type.

i tested mine gcc.godbolt.org, ndebug defined, variable , compile-time-const rotate counts:

  • gcc: optimal code gcc >= 4.9.0, non-branching neg+shifts+or earlier.
    (compile-time const count: gcc 4.4.7 fine)
  • clang: optimal code clang >= 3.5.0, non-branching neg+shifts+or earlier.
    (compile-time const rotate count: clang 3.0 fine)
  • icc 13: optimal code.
    (compile-time const count -march=native: generates slower shld $7, %edi, %edi. fine without -march=native)

even newer compiler versions can handle commonly-given code wikipedia (included in godbolt sample) without generating branch or cmov. john regehr's version has advantage of avoiding undefined behaviour when rotate count 0.

there caveats 8 , 16 bit rotates, compilers seem fine 32 or 64, when n uint32_t. see comments in code on godbolt link notes testing various widths of uint*_t. idiom better-recognized compilers more combinations of type widths in future. gcc uselessly emit and insn on rotate count, though x86 isa defines rotate insns exact , first step.

"optimal" means efficient as:

# gcc 4.9.2 rotl(unsigned int, unsigned int):     movl    %edi, %eax     movl    %esi, %ecx     roll    %cl, %eax     ret # rot_const(unsigned int):     movl    %edi, %eax     roll    $7, %eax     ret 

when inlined, compiler should able arrange values in right registers in first place, resulting in single rotate.

with older compilers, you'll still ideal code when rotate count compile-time constant. godbolt lets test arm target, , used rotate there, too. variable counts on older compilers, bit of code bloat, no branches or major performance problems, idiom should safe use in general.

btw, modified john regehr's original use char_bit*sizeof(x), , gcc / clang / icc emit optimal code uint64_t well. however, did notice changing x uint64_t while function return type still uint32_t makes gcc compile shifts/or. careful cast result 32bits in separate sequence point, if want low 32b of 64b rotate. i.e. assign result 64bit variable, cast/return it. icc still generates rotate insn, gcc , clang don't, for

// generates slow code: cast separately. uint32_t r = (uint32_t)( (x<<n) | (x>>( -n&(char_bit*sizeof(x)-1) )) ); 

if can test msvc, useful know happens there.


Comments

Popular posts from this blog

searchKeyword not working in AngularJS filter -

sequelize.js - Sequelize: sort by enum cases -

user interface - how to replace an ongoing process of image capture from another process call over the same ImageLabel in python's GUI TKinter -