The longer code is typically generated because the compiler will generate vectorized code that provides enormous speedups in case of longer data sets. Take, for example, this code: https://godbolt.org/z/WEx3Gb5jr
At -O2 the assembly it generates is straightforward, and in line with what a human programmer would write. At -O3 it generates vector code that needs a lot more instructions (vector pipeline setup, code to deal with the remaining elements that don't entirely fill up a vector register, etc.) but the main loop takes 4 integers at a time instead of one, so that provides a nice 4x speedup. In order to achieve that it needs 25 instructions to set up the loop / finish the remaining elements, compared to 5 instructions for the -O2 code.
For very short loops the -O2 version will have superior performance, but for runs of data from around 8 integers (wild guess) the -O3 version will begin with pull ahead. So it really depends on the type of data your program is handling, whether it is better to optimize for speed or size.
At -O2 the assembly it generates is straightforward, and in line with what a human programmer would write. At -O3 it generates vector code that needs a lot more instructions (vector pipeline setup, code to deal with the remaining elements that don't entirely fill up a vector register, etc.) but the main loop takes 4 integers at a time instead of one, so that provides a nice 4x speedup. In order to achieve that it needs 25 instructions to set up the loop / finish the remaining elements, compared to 5 instructions for the -O2 code.
For very short loops the -O2 version will have superior performance, but for runs of data from around 8 integers (wild guess) the -O3 version will begin with pull ahead. So it really depends on the type of data your program is handling, whether it is better to optimize for speed or size.