{"id":642512,"date":"2023-04-28T10:06:10","date_gmt":"2023-04-28T15:06:10","guid":{"rendered":"https:\/\/news.sellorbuyhomefast.com\/index.php\/2023\/04\/28\/beautiful-branchless-binary-search\/"},"modified":"2023-04-28T10:06:10","modified_gmt":"2023-04-28T15:06:10","slug":"beautiful-branchless-binary-search","status":"publish","type":"post","link":"https:\/\/newsycanuse.com\/index.php\/2023\/04\/28\/beautiful-branchless-binary-search\/","title":{"rendered":"Beautiful branchless binary search"},"content":{"rendered":"<div>\n<p>I read a blog post by Alex Muscar, \u201c<a href=\"https:\/\/muscar.eu\/shar-binary-search-meta.html\">Beautiful Binary Search in D<\/a>\u201c. It describes a binary search called \u201cShar\u2019s algorithm\u201d. I\u2019d never heard of it and it\u2019s impossible to google, but looking at the algorithm I couldn\u2019t help but think \u201cthis is branchless.\u201d And who knew that there could be a branchless binary search? So I did the work to translate it into a algorithm for C++ iterators, no longer requiring one-based indexing or fixed-size arrays.<\/p>\n<p>In GCC it is more than twice as fast as std::lower_bound, which is already a very high quality binary search. The search loop is simple and the generated assembly is beautiful. I\u2019m astonished that this exists and nobody seems to be using it\u2026<\/p>\n<p>Lets start with the <a href=\"https:\/\/github.com\/skarupke\/branchless_binary_search\">code<\/a>:<\/p>\n<div>\n<pre title>\ntemplate<typename It, typename T, typename Cmp>\nIt branchless_lower_bound(It begin, It end, const T & value, Cmp && compare)\n{\n    size_t length = end - begin;\n    if (length == 0)\n        return end;\n    size_t step = bit_floor(length);\n    if (step != length && compare(begin[step], value))\n    {\n        length -= step + 1;\n        if (length == 0)\n            return end;\n        step = bit_ceil(length);\n        begin = end - step;\n    }\n    for (step \/= 2; step != 0; step \/= 2)\n    {\n        if (compare(begin[step], value))\n            begin += step;\n    }\n    return begin + compare(*begin, value);\n}\ntemplate<typename It, typename T>\nIt branchless_lower_bound(It begin, It end, const T & value)\n{\n    return branchless_lower_bound(begin, end, value, std::less<>{});\n}\n<\/pre>\n<\/div>\n<p>I said the search loop is simple, but unfortunately the setup in lines 4 to 15 is not. Lets skip it for now. Most of the work happens in the loop in lines 16 to 20.<\/p>\n<p>The loop may not look branchless because I clearly have a loop conditional and an if-statement in the loop body. Let me defend both of these:<\/p>\n<ul>\n<li>The if-statement will be compiled to a CMOV (conditional move) instruction, meaning there is no branch. At least GCC does this. I could not get Clang to make this one branchless, no matter how clever I tried to be. So I decided to not be clever, since that works for GCC. I wish C++ just allowed me to use CMOV directly\u2026<\/li>\n<li>The loop condition is a branch, but it only depends on the length of the array. So it can be predicted very well and we don\u2019t have to worry about it. The linked blog post fully unrolls the loop, which makes this branch go away, but in my benchmarks unrolling was actually slower because the function body became too big to be inlined. So I kept it as is.<\/li>\n<\/ul>\n<h2>Algorithm<\/h2>\n<p>So now that I\u2019ve explained that the title refers to the fact that one branch is gone and the other is nearly free and could be removed if we wanted to, how does this actually work?<\/p>\n<p>The important variable is the \u201cstep\u201d variable, line 7. We\u2019re going to jump in powers of two. If the array is 64 elements long, it will have the values 64, 32, 16, 8, 4, 2, 1. It gets initialized to the nearest smaller power-of-two of the input length. So if the input is 22 elements long, this will be 16. My compiler doesn\u2019t have the new std::bit_floor function, so I wrote my own to round down to the nearest power of two. This should just be replaced with a call to <a href=\"https:\/\/en.cppreference.com\/w\/cpp\/numeric\/bit_floor\">std::bit_floor<\/a> once C++20 is more widely supported.<\/p>\n<p>We\u2019re always going to do steps that are power-of-two sized, but that\u2019s going to be a problem if the input length is not a power of two. So in lines 8 to 15 we check if the middle is less than the search value. If it is, we\u2019re going to search the last elements. Or to make it concrete: If the input is length 22, and that boolean is false, we\u2019ll search the first 16 elements, from index 0 to 15. If that conditional is true, we\u2019ll search the last 8 elements, from index 14 to 21.<\/p>\n<pre><code>input          0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21\nline 8 compare                                       16\nwhen false     0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15\nwhen true                                      14 15 16 17 18 19 20 21<\/code><\/pre>\n<p>Yes, that means that indices 14, 15 and 16 get included in the second half even though we already ruled them out with the comparison in line 8, but that\u2019s the price we pay for having a nice loop. We have to round up to a power of two.<\/p>\n<h2>Performance<\/h2>\n<p>How does it perform? It\u2019s incredibly fast in GCC:<\/p>\n<figure><a href=\"https:\/\/probablydance.files.wordpress.com\/2023\/04\/gcc.png\"><img decoding=\"async\" data-attachment-id=\"11421\" data-permalink=\"https:\/\/probablydance.com\/2023\/04\/27\/beautiful-branchless-binary-search\/gcc\/\" data-orig-file=\"https:\/\/probablydance.files.wordpress.com\/2023\/04\/gcc.png\" data-orig-size=\"993,503\" data-comments-opened=\"1\" data-image-meta=\"{\"aperture\":\"0\",\"credit\":\"\",\"camera\":\"\",\"caption\":\"\",\"created_timestamp\":\"0\",\"copyright\":\"\",\"focal_length\":\"0\",\"iso\":\"0\",\"shutter_speed\":\"0\",\"title\":\"\",\"orientation\":\"0\"}\" data-image-title=\"gcc\" data-image-description data-image-caption data-medium-file=\"https:\/\/probablydance.files.wordpress.com\/2023\/04\/gcc.png?w=300\" data-large-file=\"https:\/\/probablydance.files.wordpress.com\/2023\/04\/gcc.png?w=650\" src=\"https:\/\/probablydance.files.wordpress.com\/2023\/04\/gcc.png?w=650\" alt  ><\/a><\/figure>\n<p>Somewhere around 16k elements in the array, it\u2019s actually 3x as fast as std::lower_bound. Later, cache effects start to dominate so the reduced branch misses matter less.<\/p>\n<p>Those spikes for std::lower_bound are on powers of two, where it is somehow much slower. I looked into it a little bit but can\u2019t come up with an easy explanation. The Clang version has the same spikes even though it compiles to very different assembly.<\/p>\n<p>In fact in Clang branchless_lower_bound is slower than std::lower_bound because I couldn\u2019t get it to actually be branchless:<\/p>\n<figure><a href=\"https:\/\/probablydance.files.wordpress.com\/2023\/04\/clang-1.png\"><img decoding=\"async\" data-attachment-id=\"11425\" data-permalink=\"https:\/\/probablydance.com\/2023\/04\/27\/beautiful-branchless-binary-search\/clang-1\/\" data-orig-file=\"https:\/\/probablydance.files.wordpress.com\/2023\/04\/clang-1.png\" data-orig-size=\"943,530\" data-comments-opened=\"1\" data-image-meta=\"{\"aperture\":\"0\",\"credit\":\"\",\"camera\":\"\",\"caption\":\"\",\"created_timestamp\":\"0\",\"copyright\":\"\",\"focal_length\":\"0\",\"iso\":\"0\",\"shutter_speed\":\"0\",\"title\":\"\",\"orientation\":\"0\"}\" data-image-title=\"clang-1\" data-image-description data-image-caption data-medium-file=\"https:\/\/probablydance.files.wordpress.com\/2023\/04\/clang-1.png?w=300\" data-large-file=\"https:\/\/probablydance.files.wordpress.com\/2023\/04\/clang-1.png?w=650\" src=\"https:\/\/probablydance.files.wordpress.com\/2023\/04\/clang-1.png?w=650\" alt  ><\/a><\/figure>\n<p>The funny thing is that Clang compiles std::lower_bound to be branchless. So std::lower_bound is faster in Clang than in GCC, and my branchless_lower_bound is not branchless. Not only did the red line move up, the blue line also moved down.<\/p>\n<p>But that means if we compare the Clang version of std::lower_bound against the GCC version of branchless_lower_bound, we can compare two branchless algorithms. Lets do that:<\/p>\n<figure><a href=\"https:\/\/probablydance.files.wordpress.com\/2023\/04\/branchless.png\"><img decoding=\"async\" data-attachment-id=\"11426\" data-permalink=\"https:\/\/probablydance.com\/2023\/04\/27\/beautiful-branchless-binary-search\/branchless\/\" data-orig-file=\"https:\/\/probablydance.files.wordpress.com\/2023\/04\/branchless.png\" data-orig-size=\"943,530\" data-comments-opened=\"1\" data-image-meta=\"{\"aperture\":\"0\",\"credit\":\"\",\"camera\":\"\",\"caption\":\"\",\"created_timestamp\":\"0\",\"copyright\":\"\",\"focal_length\":\"0\",\"iso\":\"0\",\"shutter_speed\":\"0\",\"title\":\"\",\"orientation\":\"0\"}\" data-image-title=\"branchless\" data-image-description data-image-caption data-medium-file=\"https:\/\/probablydance.files.wordpress.com\/2023\/04\/branchless.png?w=300\" data-large-file=\"https:\/\/probablydance.files.wordpress.com\/2023\/04\/branchless.png?w=650\" src=\"https:\/\/probablydance.files.wordpress.com\/2023\/04\/branchless.png?w=650\" alt  ><\/a><\/figure>\n<p>The branchless version of branchless_lower_bound is faster than the branchless version of std::lower_bound. On the left half of the graph, where the arrays are smaller, it\u2019s 1.5x as fast on average. Why? Mainly because the inner loop is so tight. Here is the assembly for the two:<\/p>\n<figure>\n<table readabilityDataTable=\"1\">\n<thead>\n<tr>\n<th>inner loop of std::lower_bound<\/th>\n<th>inner loop of branchless_lower_bound<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>loop: mov %rcx,%rsi<\/td>\n<td>loop: lea (%rdx,%rax,4),%rcx<\/td>\n<\/tr>\n<tr>\n<td>mov %rbx,%rdx<\/td>\n<td><strong>cmp<\/strong> (%rcx),%esi<\/td>\n<\/tr>\n<tr>\n<td>shr %rdx<\/td>\n<td><strong>cmovg<\/strong> %rcx,%rdx<\/td>\n<\/tr>\n<tr>\n<td>mov %rdx,%rdi<\/td>\n<td>shr %rax<\/td>\n<\/tr>\n<tr>\n<td>not %rdi<\/td>\n<td>jne loop<\/td>\n<\/tr>\n<tr>\n<td>add %rbx,%rdi<\/td>\n<td><\/td>\n<\/tr>\n<tr>\n<td><strong>cmp<\/strong> %eax,(%rcx,%rdx,4)<\/td>\n<td><\/td>\n<\/tr>\n<tr>\n<td>lea 0x4(%rcx,%rdx,4),%rcx<\/td>\n<td><\/td>\n<\/tr>\n<tr>\n<td><strong>cmovge<\/strong> %rsi,%rcx<\/td>\n<td><\/td>\n<\/tr>\n<tr>\n<td><strong>cmovge<\/strong> %rdx,%rdi<\/td>\n<td><\/td>\n<\/tr>\n<tr>\n<td>mov %rdi,%rbx<\/td>\n<td><\/td>\n<\/tr>\n<tr>\n<td>test %rdi,%rdi<\/td>\n<td><\/td>\n<\/tr>\n<tr>\n<td>jg loop<\/td>\n<td><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/figure>\n<p>These are all pretty cheap operations with only a little bit of instruction-level-parallelism, (each loop iteration depends on the previous, so instructions-per-clock are low for both of these) so we can estimate their cost just by counting them. 13 vs 5 is a big decrease. Specifically two differences matter:<\/p>\n<ol>\n<li>branchless_lower_bound only has to keep track of one pointer instead of two pointers<\/li>\n<li>std::lower_bound has to recompute the size after each iteration. In branchless_lower_bound the size of the next iteration does not depend on the previous iteration<\/li>\n<\/ol>\n<p>So this is great, except that the comparison function is provided by the user and, if it is much bigger, it can take many more cycles than we do. In that case branchless_lower_bound will be slower than std::lower_bound. Here is binary searching of strings, which gets more expensive once the container gets large:<\/p>\n<figure><a href=\"https:\/\/probablydance.files.wordpress.com\/2023\/04\/strings.png\"><img decoding=\"async\" data-attachment-id=\"11433\" data-permalink=\"https:\/\/probablydance.com\/2023\/04\/27\/beautiful-branchless-binary-search\/strings\/\" data-orig-file=\"https:\/\/probablydance.files.wordpress.com\/2023\/04\/strings.png\" data-orig-size=\"942,531\" data-comments-opened=\"1\" data-image-meta=\"{\"aperture\":\"0\",\"credit\":\"\",\"camera\":\"\",\"caption\":\"\",\"created_timestamp\":\"0\",\"copyright\":\"\",\"focal_length\":\"0\",\"iso\":\"0\",\"shutter_speed\":\"0\",\"title\":\"\",\"orientation\":\"0\"}\" data-image-title=\"strings\" data-image-description data-image-caption data-medium-file=\"https:\/\/probablydance.files.wordpress.com\/2023\/04\/strings.png?w=300\" data-large-file=\"https:\/\/probablydance.files.wordpress.com\/2023\/04\/strings.png?w=650\" src=\"https:\/\/probablydance.files.wordpress.com\/2023\/04\/strings.png?w=650\" alt  ><\/a><\/figure>\n<h2>More Comparisons<\/h2>\n<p>Why is it slower for strings? Because this does more comparisons than std::lower_bound. Splitting into powers of two is actually not ideal. For example if the input is the array [0, 1, 2, 3, 4] and we\u2019re looking for the middle, element 2, this behaves pretty badly:<\/p>\n<figure>\n<table readabilityDataTable=\"1\">\n<thead>\n<tr>\n<th>std::lower_bound<\/th>\n<th>branchless_lower_bound<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>compare at index 2, not less<\/td>\n<td>compare at index 4, not less<\/td>\n<\/tr>\n<tr>\n<td>compare at index 1, less<\/td>\n<td>compare at index 2, not less<\/td>\n<\/tr>\n<tr>\n<td>done, found at index 2<\/td>\n<td>compare at index 1, less<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td>compare at index 1, less<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td>done, found at index 2<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/figure>\n<p>So we\u2019re doing four comparisons here where std::lower_bound only needs two. I picked an example where it\u2019s particularly clumsy, starting far from the middle and comparing the same index twice. It seems like you should be able to clean this up, but when I tried I always ended up making it slower.<\/p>\n<p>But it won\u2019t be too much worse than an ideal binary search. For an array that\u2019s less than <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=2%5En&#038;bg=fff&#038;fg=444444&#038;s=0&#038;c=20201002\"  alt=\"2^n\"> elements big<br \/>\u2013 an ideal binary search will use <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=n&#038;bg=fff&#038;fg=444444&#038;s=0&#038;c=20201002\"  alt=\"n\"> or fewer comparisons<br \/>\u2013 branchless_lower_bound will use <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%28n%2B1%29&#038;bg=fff&#038;fg=444444&#038;s=0&#038;c=20201002\"  alt=\"(n+1)\"> or fewer comparisons.<\/p>\n<p>Overall it\u2019s worth it: We\u2019re doing more iterations, but we\u2019re doing those extra iterations so much more quickly that it comes out significantly faster in the end. You just need to keep in mind that if your comparison function is expensive, std::lower_bound might be a better choice.<\/p>\n<h2>Tracking Down the Source<\/h2>\n<p>I said at the beginning that \u201cShar\u2019s algorithm\u201d is impossible to google. Alex Muscar said he read it in a book written in 1982 by John L Bentley. Luckily that book is available to borrow online from the Internet Archive. Bentley provides the source code and says that it\u2019s got the idea from Knuth\u2019s \u201cSorting and Searching\u201d. Knuth did not provide source code. He only sketched out the idea in his book, and says that it came from Leonard E Shar in 1971. I don\u2019t know where Shar wrote up the idea. Maybe he just told it to Knuth.<\/p>\n<p>This is the <a href=\"https:\/\/probablydance.com\/2018\/06\/16\/fibonacci-hashing-the-optimization-that-the-world-forgot-or-a-better-alternative-to-integer-modulo\/\">second time<\/a> that I came across an algorithm in Knuth\u2019s books that is brilliant and should be used more widely but somehow was forgotten. Maybe I should actually read the book\u2026 It\u2019s just really hard to see which ideas are good and which ones aren\u2019t. For example immediately after sketching out Shar\u2019s algorithm, Knuth spends far more time going over a binary search based on the Fibonacci sequence. It\u2019s faster if you can\u2019t quickly divide integers by 2, and instead only have addition and subtraction. So it\u2019s probably useless, but who knows? When reading Knuth\u2019s book, you have to assume that most algorithms are useless, and that the good things have been highlighted by someone already. Luckily for people like me, there seem to still be a few hidden gems.<\/p>\n<h2>Code<\/h2>\n<p>The code for this is available <a href=\"https:\/\/github.com\/skarupke\/branchless_binary_search\">here<\/a>. It\u2019s released under the boost license.<\/p>\n<p>Also I\u2019m trying out a donation button. If open source work like this is valuable for you, consider paying for it. The recommended donation is $20 (or your local cost for an item on a restaurant menu) for individuals, or $1000 for organizations. (or your local cost of hiring a contractor for a day) But any amount is appreciated:<\/p>\n<div>\n<h4>Make a one-time donation<\/h4>\n<p>Choose an amount<\/p>\n<p>Or enter a custom amount<\/p>\n<hr>\n<p>Thanks! I have no idea how much this is worth to people. Feedback appreciated.<\/p>\n<p><a href=\"https:\/\/subscribe.wordpress.com\/memberships\/?blog=22872755&#038;plan=11336&#038;lang=en&#038;pid=11376&#038;redirect=https%3A%2F%2Fprobablydance.com%2F2023%2F04%2F27%2Fbeautiful-branchless-binary-search\">Donate<\/a>\n\t\t\t<\/p>\n<\/div><\/div>\n<p><a href=\"https:\/\/probablydance.com\/2023\/04\/27\/beautiful-branchless-binary-search\/\" class=\"button purchase\" rel=\"nofollow noopener\" target=\"_blank\">Read More<\/a><br \/>\n Thomas Grumbles<\/p>\n","protected":false},"excerpt":{"rendered":"<p>I read a blog post by Alex Muscar, \u201cBeautiful Binary Search in D\u201c. It describes a binary search called \u201cShar\u2019s algorithm\u201d. I\u2019d never heard of it and it\u2019s impossible to google, but looking at the algorithm I couldn\u2019t help but think \u201cthis is branchless.\u201d And who knew that there could be a branchless binary search?<\/p>\n","protected":false},"author":1,"featured_media":642513,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[800,122095,46],"tags":[],"class_list":["post-642512","post","type-post","status-publish","format-standard","has-post-thumbnail","category-beautiful","category-branchless","category-technology"],"aioseo_notices":[],"_links":{"self":[{"href":"https:\/\/newsycanuse.com\/index.php\/wp-json\/wp\/v2\/posts\/642512","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/newsycanuse.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/newsycanuse.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/newsycanuse.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/newsycanuse.com\/index.php\/wp-json\/wp\/v2\/comments?post=642512"}],"version-history":[{"count":0,"href":"https:\/\/newsycanuse.com\/index.php\/wp-json\/wp\/v2\/posts\/642512\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/newsycanuse.com\/index.php\/wp-json\/wp\/v2\/media\/642513"}],"wp:attachment":[{"href":"https:\/\/newsycanuse.com\/index.php\/wp-json\/wp\/v2\/media?parent=642512"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/newsycanuse.com\/index.php\/wp-json\/wp\/v2\/categories?post=642512"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/newsycanuse.com\/index.php\/wp-json\/wp\/v2\/tags?post=642512"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}