1) Make your struct size a multiple of 4 bytes. For example, if it is 125 bytes long, add three bytes of padding (e.g. three char members; note that a char3 occupies 4 bytes in OpenCL, so it would overshoot) to get a 128-byte struct.
2) Reorder the struct: put the biggest members, or those whose size is a multiple of 4 bytes, at the "head" and the smaller or oddly sized ones at the "tail". This cuts padding, so reading the struct needs fewer memory access operations.
3) As DarkZeros mentioned, you can try fetching the struct into local memory with async_work_group_copy (casting to long16 or similar if the struct is too big) and then carrying the values to private memory element-wise. Local memory can serve many work-items at once, so the copy from local to private should be fast enough. (Don't forget to synchronize before/after the transitions, e.g. with wait_group_events after the copy.)
4) Pack small variables into bigger ones until they fill a cache line, so that the line's bandwidth is not wasted under heavy cache usage.
But if you need to copy "a single" struct to "all cores" of a work group, you can copy it element-wise, because some newer GPUs have a broadcast mechanism that can be the fastest option in such scenarios. If you measure any speedup from these changes, please report it later as a multiplier.
5) Sometimes heavy branching can hurt performance and mask those memory latencies when benchmarking (in a bad way, of course).
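
To make points 1) and 2) concrete, here is a minimal sketch in plain host-side C. The struct names and members (Unordered, Reordered, flag, kind, level) are made up for illustration and are not from the question. It assumes a typical ABI where float is 4 bytes and 4-byte aligned; OpenCL C applies similar rules to scalar members, but vector types such as float4 have their own alignment, so check sizes on your device too.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical struct: 1-byte members interleaved with 4-byte ones,
 * so the compiler has to insert padding before each float. */
typedef struct {
    uint8_t flag;
    float   x;
    uint8_t kind;
    float   y;
    uint8_t level;
} Unordered;

/* Same members, biggest first, small ones at the tail, plus one
 * explicit padding byte so the size is a multiple of 4 bytes. */
typedef struct {
    float   x;
    float   y;
    uint8_t flag;
    uint8_t kind;
    uint8_t level;
    uint8_t pad;   /* explicit padding, never read */
} Reordered;

/* Useful data carried by either struct: two floats and three bytes. */
#define PAYLOAD_BYTES (2 * sizeof(float) + 3 * sizeof(uint8_t))

/* Bytes in a struct of the given total size that are not payload. */
size_t padding_bytes(size_t total_size) {
    return total_size - PAYLOAD_BYTES;
}

/* Compile-time check of point 1): size is a multiple of 4 bytes. */
static_assert(sizeof(Reordered) % 4 == 0,
              "Reordered should be a multiple of 4 bytes");
```

Calling padding_bytes(sizeof(Unordered)) and padding_bytes(sizeof(Reordered)) from your own main shows how much of each struct is padding; on the ABI assumed above the reordered version should be the smaller of the two.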