Thanks for the pointer, Niels.
So it seems that the approach is perfectly reasonable. However, each individual MP4 file has a marginal difference between its audio length and its video length, due to the differing sampling frequencies of the two tracks. Each MP4 includes an EDTS.ELST (edit box / edit list) combination which corrects this for that file. I was failing to take the EDTS into account when I merged the files. Merging the EDTS as well has fixed the issue.
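
For anyone hitting the same thing, here is a rough Python sketch of how the entries of an ELST box can be read, so you can see what each file's edit list contains before merging. It is illustrative only and is not the actual merge code. The field layout follows my reading of the ISO base media file format (ISO/IEC 14496-12); the names `EditEntry` and `parse_elst` are made up for this example, and the sample bytes at the end are synthetic rather than taken from a real file.

```python
import struct
from typing import List, NamedTuple


class EditEntry(NamedTuple):
    segment_duration: int     # length of the edit, in movie timescale units
    media_time: int           # start point in the media; -1 marks an empty edit
    media_rate_integer: int
    media_rate_fraction: int


def parse_elst(payload: bytes) -> List[EditEntry]:
    """Parse the body of an 'elst' box (the bytes after the 8-byte size/type header)."""
    version = payload[0]  # byte 0 is the version, bytes 1-3 are the flags
    (entry_count,) = struct.unpack_from(">I", payload, 4)
    # Version 1 stores 64-bit duration/time fields, version 0 stores 32-bit ones.
    fmt = ">Qqhh" if version == 1 else ">Iihh"
    entry_size = struct.calcsize(fmt)
    entries = []
    offset = 8  # skip version/flags (4 bytes) and entry_count (4 bytes)
    for _ in range(entry_count):
        entries.append(EditEntry(*struct.unpack_from(fmt, payload, offset)))
        offset += entry_size
    return entries


# Synthetic version-0 payload with a single entry, built only to exercise the parser.
sample = struct.pack(">B3sI", 0, b"\x00\x00\x00", 1) + struct.pack(">Iihh", 10000, 1024, 1, 0)
print(parse_elst(sample))
```

Locating the EDTS box inside each TRAK and writing the merged list back out is left out here, since that depends on the tooling you are using.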