A performance hit may come from using a local work-group size that is not optimal, or from synchronizing work-items (WI) within a work-group (WG).
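As a rough illustration of picking a work-group size, here is a minimal sketch in plain C. The helper names `pick_local_size` and `round_up_global` are hypothetical, not part of OpenCL. It assumes you have already queried the preferred multiple (`CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE`) and the maximum size (`CL_KERNEL_WORK_GROUP_SIZE`) with `clGetKernelWorkGroupInfo`. Treat the result as a starting point for benchmarking, not as a guaranteed optimum.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: round a requested work-group size down to a multiple
   of preferred_multiple, clamped to the range [preferred_multiple, max_wg_size].
   Both limits are assumed to come from clGetKernelWorkGroupInfo. */
size_t pick_local_size(size_t requested, size_t preferred_multiple, size_t max_wg_size)
{
    assert(preferred_multiple > 0 && max_wg_size >= preferred_multiple);
    if (requested > max_wg_size)
        requested = max_wg_size;
    size_t rounded = (requested / preferred_multiple) * preferred_multiple;
    return rounded ? rounded : preferred_multiple;
}

/* Hypothetical helper: round the global size up to a multiple of the local
   size, which OpenCL 1.x requires; the kernel must then skip the padding
   work-items (for example with an `if (gid < n)` guard). */
size_t round_up_global(size_t n_items, size_t local_size)
{
    assert(local_size > 0);
    return ((n_items + local_size - 1) / local_size) * local_size;
}
```

For example, a request of 100 work-items with a preferred multiple of 32 is rounded down to 96, and 1000 items are then padded up to a global size of 1056.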
Reading into local memory does not by itself introduce a performance hit: on GPUs it has the same order of speed as reading into private memory, since both are placed on chip.
Also, check that your data fits into local memory, as it is usually small.
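The size check can be sketched in plain C as follows. The function names are hypothetical. The `local_mem_bytes` argument is assumed to be the value reported by `clGetDeviceInfo` for `CL_DEVICE_LOCAL_MEM_SIZE`, and the 32 KiB figure in the usage note is only an example value, not a property of any particular device.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if n_elems elements of elem_size bytes each
   fit into local_mem_bytes of local memory, 0 otherwise. */
int fits_in_local(size_t n_elems, size_t elem_size, size_t local_mem_bytes)
{
    assert(elem_size > 0);
    /* Divide instead of multiplying so that n_elems * elem_size cannot overflow. */
    return n_elems <= local_mem_bytes / elem_size;
}

/* Hypothetical helper: largest number of elements of elem_size bytes that fit
   into local memory, e.g. to choose a tile size per work-group. */
size_t max_local_elems(size_t elem_size, size_t local_mem_bytes)
{
    assert(elem_size > 0);
    return local_mem_bytes / elem_size;
}
```

With an assumed 32 KiB of local memory, a tile of 4096 floats (16 KiB) fits, while 16384 floats (64 KiB) does not. Keep in mind that the kernel's own `__local` declarations also count against the same budget; `clGetKernelWorkGroupInfo` with `CL_KERNEL_LOCAL_MEM_SIZE` reports how much a kernel already uses.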