A fast binary search intended for B-tree key array searches has no branches at all [1]. A B-tree interior node's key array is usually sized to fit in one cache line, so there isn't a difference in cache access pattern.
With 8-byte keys and the usual 64-byte cache line size, you end up with 8 keys per interior node's key array. That's a pretty bad size for the "simple linear search" most people would write, whose cost will be dominated by the expected branch mispredict. The idea of using a hint to guess a starting position for either linear or binary search has the same issue: you need branches to capitalize on the best-case vs worst-case spread, and the mispredict cost dominates for small arrays like this.
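For concreteness, this is the kind of early-exit linear search meant here (a hypothetical sketch, not taken from any particular B-tree implementation):

```c
#include <stddef.h>
#include <stdint.h>

// The "simple linear search" most people would write: return the index of
// the first key >= probe. The early-exit branch depends on the data, so
// with random probe keys it mispredicts roughly once per search, and on
// an 8-key array that one mispredict dominates the total cost.
static size_t naive_lower_bound(const int64_t *keys, size_t n, int64_t probe)
{
    for (size_t i = 0; i < n; i++)
        if (keys[i] >= probe)
            return i;
    return n;
}
```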
If you care about speed for this case, use a fast branchless binary search or a vectorized linear search (the latter will have an edge). An additional minor consideration for B-tree traversal is the latency of vector vs scalar loads, since you don't know the load address (except for the root node) until you've finished searching the parent node's key array. That is not necessarily true for vectorized searches of small arrays in general: the vector load addresses may be key-independent (as the B-tree root node is here), in which case the few extra cycles of L1 load-to-use latency for vector loads (and higher load latencies in general) can often be hidden.
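A sketch of both options, assuming 8-byte signed keys (function names are mine): the binary search replaces the data-dependent branch with a conditional move, and the linear variant counts keys below the probe with no data-dependent branch at all, which compilers can auto-vectorize.

```c
#include <stddef.h>
#include <stdint.h>

// Branchless lower_bound: each halving step updates the base pointer via
// a conditional move, so there is no data-dependent branch to mispredict.
static size_t branchless_lower_bound(const int64_t *keys, size_t n,
                                     int64_t probe)
{
    if (n == 0)
        return 0;
    const int64_t *base = keys;
    while (n > 1) {
        size_t half = n / 2;
        base = (base[half] < probe) ? base + half : base; // compiles to cmov
        n -= half;
    }
    return (size_t)(base - keys) + (*base < probe);
}

// Branchless linear search: count the keys below the probe. The loop body
// is branch-free, so compilers can turn it into a few SIMD compares plus
// a horizontal sum.
static size_t linear_lower_bound(const int64_t *keys, size_t n, int64_t probe)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++)
        count += keys[i] < probe;
    return count;
}
```

Both return the same lower-bound index; which wins on an 8-key node depends on the compiler's vectorization and the target's cmov/SIMD latencies.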
[1] By B-tree I am referring to in-memory B+ trees. A disk-based B+ tree would usually be built around page-granularity nodes. In that regime a branchless binary search for integer/float keys (with software prefetch to populate the cache ahead of the data dependency chains) that finishes off with a vectorized linear search is likely going to be the winner. The B-tree slack interval (M/2, M] helps since it guarantees a fully unrolled binary search with lg(M) steps won't waste any steps. You can still support "underflowing" B-tree nodes with the same branchless code path.
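To illustrate the footnote with a toy M = 8 (the name and the sentinel-padding scheme are my assumptions): pad unused trailing slots with the maximum key value, and one fixed sequence of lg(8) = 3 halving steps serves any occupancy in (4, 8] with no wasted steps; the same padded code path also tolerates underflowed nodes.

```c
#include <stdint.h>

// Fully unrolled branchless lower_bound for an 8-slot key array.
// Unused trailing slots must hold INT64_MAX sentinels, so the same three
// halving steps (lg 8) work regardless of how many keys are occupied.
static unsigned search8(const int64_t k[8], int64_t probe)
{
    unsigned i = 0;
    i += (k[i + 4] < probe) ? 4 : 0; // halve 8 -> 4
    i += (k[i + 2] < probe) ? 2 : 0; // halve 4 -> 2
    i += (k[i + 1] < probe) ? 1 : 0; // halve 2 -> 1
    return i + (k[i] < probe);       // resolve the final slot
}
```

Each step is a compare feeding a conditional move, so a full node and a half-full node take exactly the same instruction sequence.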