The ultimate answer to any performance question is to profile your whole app on hardware that is the same as (or at least similar to) what it will run on in production, but here are my three cents of theorycrafting:
Passing by value in option 3) doesn't make any sense.
vector dydx_tmp(x.size()); <- this value-initializes (i.e. zeroes) every element of your vector. Use vector dydx; dydx.reserve(x.size()); and then emplace_back() in the loop. (Adding _tmp to the name is useless: everyone can see it's local.)
Option 2) involves an output parameter, which is considered bad style, and there will be no copy in option 1) anyway (as explained in the article you linked), so 1) is the best option.
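To make the points above concrete, here is a rough sketch of what option 1) could look like with reserve() and emplace_back(). The function name compute_dydx, the element type double, and the placeholder formula are my assumptions, not taken from your code, so substitute your own.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical example: name, element type and formula are assumptions.
std::vector<double> compute_dydx(const std::vector<double>& x) {
    std::vector<double> dydx;
    dydx.reserve(x.size());  // one allocation up front, elements are not zeroed

    for (std::size_t i = 0; i < x.size(); ++i) {
        // Placeholder formula (derivative of x^2); replace with the real one.
        dydx.emplace_back(2.0 * x[i]);
    }

    // Returning a local by name: the compiler may elide the copy (NRVO),
    // and otherwise the vector is moved, so the elements are never copied.
    return dydx;
}
```

The caller just writes auto dydx = compute_dydx(x); and gets the result without an element-wise copy. Whether skipping the zeroing is actually measurable in your case is something only the profiler can tell you.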