N-back: Video or images?

I’m building an N-back task using images displayed in a continuous stream. The images will not cut directly from one to the next, but will vary in coherence due to fluctuating noise. Because of this, there is no single “right time” or “right image” for a response, but rather a window of time within which a response should be made. I would like to know at what point in the presentation a participant responded and which frame/image was on screen at that moment. Would this be easiest to do with a video file or with a large number (hundreds) of image files shown in a continuous stream? Does either approach take too much of a toll on the computer or introduce a timing offset? Has anyone experimented with many image files or with videos? Any input would be greatly appreciated.
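
To make the logging question concrete, here is a rough sketch of the bookkeeping I have in mind. It assumes a fixed frame rate and a response time measured from stream onset; the names `frame_at` and `in_window` and all the numbers are hypothetical placeholders, not part of any library.

```python
import math


def frame_at(response_time_s, frame_rate_hz, n_frames):
    """Return the 0-based index of the frame on screen at response_time_s.

    Assumes every frame is shown for exactly 1 / frame_rate_hz seconds,
    which real hardware will only approximate.
    """
    if response_time_s < 0:
        raise ValueError("response time must be non-negative")
    index = math.floor(response_time_s * frame_rate_hz)
    # Clamp so a late key press maps onto the last frame of the stream.
    return min(index, n_frames - 1)


def in_window(frame_index, window_start, window_end):
    """True if the frame falls inside the inclusive response window."""
    return window_start <= frame_index <= window_end


# Hypothetical trial: 600 frames at 60 Hz, key press 2.5 s after onset,
# valid response window spanning frames 120 to 180.
frame = frame_at(2.5, 60.0, 600)
print(frame, in_window(frame, 120, 180))
```

Whether the stream comes from a video or from separate image files, I would want the software to give me something equivalent to this frame index, ideally from its own frame counter rather than computed from the clock as above.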

