Horizontal minimum and maximum using SSE

Question 1

I suggest two changes:

Replace ((int8_t*) ((void *) &buffer))[0] with _mm_cvtsi128_si32.

Replace _mm_shuffle_epi8 with _mm_shuffle_epi32/_mm_shufflelo_epi16 which have lower latency on recent AMD processors and Intel Atom, and will save you memory load operations:

static inline int16_t hMin(__m128i buffer)
{
    buffer = _mm_min_epi8(buffer, _mm_shuffle_epi32(buffer, _MM_SHUFFLE(3, 2, 3, 2)));
    buffer = _mm_min_epi8(buffer, _mm_shuffle_epi32(buffer, _MM_SHUFFLE(1, 1, 1, 1)));
    buffer = _mm_min_epi8(buffer, _mm_shufflelo_epi16(buffer, _MM_SHUFFLE(1, 1, 1, 1)));
    buffer = _mm_min_epi8(buffer, _mm_srli_epi16(buffer, 8));
    return (int8_t)_mm_cvtsi128_si32(buffer);
}

Question 2

SSE 4.1 has an instruction that does almost what you want. Its name is PHMINPOSUW, C/C++ intrinsic is _mm_minpos_epu16. It is limited to 16-bit unsigned values and cannot give maximum, but these problems could be easily solved.

If you need to find minimum of non-negative bytes, do nothing. If bytes may be negative, add 128 to each. If you need maximum, subtract each from 127.
Use either _mm_srli_pi16 or _mm_shuffle_epi8, and then _mm_min_epu8 to get 8 pairwise minimum values in even bytes and zeros in odd bytes of some XMM register. (These zeros are produced by shift/shuffle instruction and should remain at their places after _mm_min_epu8).
Use _mm_minpos_epu16 to find minimum among these values.
Extract the resulting minimum value with _mm_cvtsi128_si32.
Undo effect of step 1 to get the original byte value.

Here is an example that returns maximum of 16 signed bytes:

static inline int16_t hMax(__m128i buffer)
{
    __m128i tmp1 = _mm_sub_epi8(_mm_set1_epi8(127), buffer);
    __m128i tmp2 = _mm_min_epu8(tmp1, _mm_srli_epi16(tmp1, 8));
    __m128i tmp3 = _mm_minpos_epu16(tmp2);
    return (int8_t)(127 - _mm_cvtsi128_si32(tmp3));
}

Question 3

here's an implementation without shuffle, shuffle is slow on AMD 5000 Ryzen 7 for some reason

    float max_elem3() const {
        __m128 a = _mm_unpacklo_ps(mm, mm); // x x y y
        __m128 b = _mm_unpackhi_ps(mm, mm); // z z w w
        __m128 c = _mm_max_ps(a, b); // ..., max(x, z), ..., ...
        Vector4 res = _mm_max_ps(mm, c); // ..., max(y, max(x, z)), ..., ...
        return res.y;
    }

    float min_elem3() const {
        __m128 a = _mm_unpacklo_ps(mm, mm); // x x y y
        __m128 b = _mm_unpackhi_ps(mm, mm); // z z w w
        __m128 c = _mm_min_ps(a, b); // ..., min(x, z), ..., ...
        Vector4 res = _mm_min_ps(mm, c); // ..., min(y, min(x, z)), ..., ...
        return res.y;
    }