The longer code is typically generated because the compiler emits vectorized code, which provides enormous speedups on longer data sets. Take, for example, this code: https://godbolt.org/z/WEx3Gb5jr

At -O2 the assembly it generates is straightforward, and in line with what a human programmer would write. At -O3 it generates vector code that needs a lot more instructions (vector pipeline setup, code to deal with the remaining elements that don't entirely fill up a vector register, etc.), but the main loop processes 4 integers at a time instead of one, which gives roughly a 4x speedup on that loop. To achieve that, it needs 25 instructions to set up the loop and finish the remaining elements, compared to 5 instructions for the -O2 code.
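To make the shape of the two versions concrete, here is a rough sketch in plain C. The linked code is not reproduced here, so this assumes a simple loop that sums an array of ints; the function names `sum_scalar` and `sum_blocked` are made up for illustration. The second function only imitates by hand what a vectorizer tends to produce (a main loop handling 4 elements per iteration plus a tail loop for the leftovers); it is not the compiler's actual output.

```c
#include <stddef.h>

/* Scalar version: one element per iteration.
 * Roughly the shape of the short -O2 loop. */
int sum_scalar(const int *data, size_t n) {
    int total = 0;
    for (size_t i = 0; i < n; ++i)
        total += data[i];
    return total;
}

/* Hand-written imitation of the vectorized shape (hypothetical,
 * for illustration only): four partial sums stand in for the four
 * lanes of a 128-bit vector register. */
int sum_blocked(const int *data, size_t n) {
    int lane[4] = {0, 0, 0, 0};   /* setup: zero the "vector" accumulator */
    size_t i = 0;

    /* Main loop: 4 integers per iteration. */
    for (; i + 4 <= n; i += 4) {
        lane[0] += data[i];
        lane[1] += data[i + 1];
        lane[2] += data[i + 2];
        lane[3] += data[i + 3];
    }

    /* Horizontal add: combine the four lanes into one result. */
    int total = lane[0] + lane[1] + lane[2] + lane[3];

    /* Tail loop: the 0 to 3 elements that did not fill a whole block. */
    for (; i < n; ++i)
        total += data[i];

    return total;
}
```

Everything outside the main loop in `sum_blocked` is overhead that is paid once per call no matter how short the input is, which is why the scalar version can win on very small arrays.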

For very short loops the -O2 version will have superior performance, but for runs of data of around 8 integers or more (wild guess) the -O3 version will begin to pull ahead. So whether it is better to optimize for speed or for size really depends on the kind of data your program is handling.