Usually you want each thread working on nearby memory, because of caching. So one thread accessing 0, 1, 2, 3 is faster than 0, 2, 1, 3 (unless the data is small enough that it all sits in the same cache line or page anyway).
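To make the two layouts concrete, here's a rough sketch of which indices a given thread would touch under each scheme. The function names are made up for illustration, and it only computes the index lists, it doesn't measure anything.

```python
def contiguous_indices(n, num_threads, t):
    """Indices for thread t when each thread gets one contiguous block."""
    base, extra = divmod(n, num_threads)
    # The first `extra` threads take one additional element each.
    start = t * base + min(t, extra)
    stop = start + base + (1 if t < extra else 0)
    return list(range(start, stop))


def interleaved_indices(n, num_threads, t):
    """Indices for thread t when elements are dealt out round-robin."""
    return list(range(t, n, num_threads))


print(contiguous_indices(8, 2, 0))   # [0, 1, 2, 3]
print(interleaved_indices(8, 2, 0))  # [0, 2, 4, 6]
```

The contiguous version is the cache-friendly one: each thread walks through neighbouring elements instead of skipping over the ones that belong to other threads.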
>if the array length is unable to be divided evenly by the number of threads, it can be more of a pain to handle.
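You can just use smaller chunks and let each thread grab the next one when it finishes, so the leftover simply becomes a shorter last chunk. Don't make them too small, though, or the per-chunk overhead will kill any gains. There are ways to be clever about it of course, but you probably don't need to.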
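Something like this is what I mean by smaller chunks; treat it as a sketch, not a tuned implementation. `chunk_ranges` and `parallel_sum` are names I made up, the chunk size of 1024 is an arbitrary guess, and summing is just a stand-in for whatever the real work is. (In CPython the GIL means this particular example won't actually run faster; the point is only how the array gets split.)

```python
from concurrent.futures import ThreadPoolExecutor


def chunk_ranges(n, chunk_size):
    """Yield (start, stop) pairs covering range(n); the last chunk may be shorter."""
    for start in range(0, n, chunk_size):
        yield start, min(start + chunk_size, n)


def parallel_sum(data, num_threads=4, chunk_size=1024):
    """Sum `data` by handing fixed-size contiguous chunks to a thread pool."""
    def work(bounds):
        start, stop = bounds
        # Each task touches one contiguous slice of the array.
        return sum(data[start:stop])

    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # Idle workers pick up the next chunk, so the array length does not
        # need to be a multiple of the thread count.
        return sum(pool.map(work, chunk_ranges(len(data), chunk_size)))


data = list(range(10_007))  # length is not divisible by 4
print(parallel_sum(data) == sum(data))  # True
```

Each chunk is still contiguous, so you keep the cache-friendly access pattern, and the uneven remainder never needs special-casing.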