While processing, your ApplicationMaster would negoriate with the NodeManager for containers and NodeManager in turn would try to obtain the nearest datanode resource. Since your replication factor is 3, HDFS would try to place 1 whole copy on a single datanode and distribute the rest across all the datanodes.
1) Change the replication factor to 1 (Since you are only trying to benchmark, reducing replication should not be a big issue).
2) Make sure your client(machine from where you would give your -copyFromLocal command) does not have a datanode running on it. If not, HDFS will tend to place most of the data in this node since it would have reduced latency.
3) Control the file distribution using dfs.blocksize
property.
4) Check the status of your datanodes using hdfs dfsadmin -report
.