Visual stimulus duration is longer when presented at the same time as a sound

OS (e.g. Win10): Win10
PsychoPy version (e.g. 1.84.x): v2020.1.3
Standard Standalone? (y/n) If not then what?: y
What are you trying to achieve?: A basic task in which participants view two flash-beep pairs, one after the other, and judge which pair was simultaneous. One pair is always synchronous, while in the other pair the auditory cue is offset by a small amount (±266 ms in 66 ms increments). The order of the two pairs is random. I have a 144 Hz monitor, so the visual stimulus should flash for 3 frames (roughly 20 ms), and the auditory cue lasts 20 ms (although I can’t confirm this, because my output doesn’t give me a sound.stopped value). The visual stimulus is a polygon with 1000 vertices (a white circle).
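
Because Builder counts these durations in frames, it may help to check how the millisecond targets quantize at each refresh rate. A small arithmetic sketch, assuming the SOA steps are multiples of about 66.67 ms (my reading of "66 ms increments"); `ms_to_frames` and `frames_to_ms` are illustrative helpers, not PsychoPy functions.

```python
# Round target durations (ms) to whole frames and show the time those frames
# actually span. Assumption: SOA steps are multiples of ~66.67 ms.

def ms_to_frames(ms, refresh_hz):
    """Nearest whole number of frames for a target duration in milliseconds."""
    return round(ms * refresh_hz / 1000.0)


def frames_to_ms(frames, refresh_hz):
    """Duration in milliseconds covered by a whole number of frames."""
    return frames * 1000.0 / refresh_hz


for hz in (60, 144):
    flash = ms_to_frames(20, hz)
    print(f"{hz} Hz: 20 ms flash -> {flash} frame(s) = {frames_to_ms(flash, hz):.1f} ms")
    for step in range(1, 5):
        soa = step * 66.67
        n = ms_to_frames(soa, hz)
        print(f"  SOA {soa:.0f} ms -> {n} frames = {frames_to_ms(n, hz):.1f} ms")
```

At 60 Hz the steps land exactly on multiples of 4 frames, whereas at 144 Hz they round to 10, 19, 29 and 38 frames, so consecutive steps alternate between 10 and 9 frames (about 69 and 62 ms). The 20 ms flash becomes 3 frames (20.8 ms) at 144 Hz but a single frame (16.7 ms) at 60 Hz.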

The problem: Whenever the visual stimulus belongs to the standard pair (i.e. whenever its auditory cue is presented at exactly the same time), it is shown for 1 or 2 frames too long, which makes the task easy. Meanwhile, the duration of the other (comparison) flash is just about perfect. It’s almost as if presenting the auditory cue at the same moment delays the visual stimulus, since the issue does not occur when I disable the auditory components.
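
If the extra 1 or 2 frames come from flips that miss a refresh around the moment the sound starts (a hypothesis, not something established above), they should show up as long frame intervals. PsychoPy can record these: setting `win.recordFrameIntervals = True` makes the window store flip-to-flip durations in `win.frameIntervals`. The sketch below only assumes a plain list of intervals in seconds; `find_long_frames` and the 1.2x cut-off are illustrative choices, and the example intervals are synthetic.

```python
# Flag flips that lasted noticeably longer than one refresh period.
# Assumption: `intervals` holds flip-to-flip durations in seconds (for example
# the contents of PsychoPy's win.frameIntervals); the 1.2x tolerance is an
# assumed cut-off.

def find_long_frames(intervals, refresh_hz, tolerance=1.2):
    """Return (index, refreshes_spanned) for every interval above the cut-off."""
    period = 1.0 / refresh_hz
    long_frames = []
    for i, dt in enumerate(intervals):
        if dt > tolerance * period:
            # an interval spanning 2 refreshes kept the previous image up one frame too long
            long_frames.append((i, round(dt / period)))
    return long_frames


# Synthetic 144 Hz intervals (not measured data): two flips run long.
example = [0.00694, 0.00695, 0.01389, 0.00694, 0.02083]
print(find_long_frames(example, 144))  # [(2, 2), (4, 3)]
```

Logging intervals only during the trial (switching `recordFrameIntervals` on just before and off just after) would make it easier to line up any long frames with the synchronous flash-beep onset.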

What did you try to make it work?: I tried reordering the components in Builder, setting the duration in seconds instead of frames, and specifying a stop frame instead of a duration in frames. I tried a different, 60 Hz monitor. I enabled and disabled G-Sync and vertical sync on my monitor. I deleted the visual stimulus components and remade them. Disabling the auditory components makes the issue go away (obviously not a solution, as the task is audiovisual). I also tried other audio libraries and latency priorities, but nothing works (my defaults are the PTB library with latency set to "critical").

Here is the part of my code that may be relevant, although I suspect this is just some latency issue that arises when a visual and an auditory stimulus are presented at the same time.

```python
# -------Run Routine "IFC_SJ"-------

while continueRoutine:
    # get current time
    t = IFC_SJClock.getTime()
    tThisFlip = win.getFutureFlipTime(clock=IFC_SJClock)
    tThisFlipGlobal = win.getFutureFlipTime(clock=None)
    frameN = frameN + 1  # number of completed frames (so 0 is the first frame)
    # update/draw components on each frame
    
    # *polygon* updates
    if polygon.status == NOT_STARTED and frameN >= 288:
        # keep track of start time/frame for later
        polygon.frameNStart = frameN  # exact frame index
        polygon.tStart = t  # local t and not account for scr refresh
        polygon.tStartRefresh = tThisFlipGlobal  # on global time
        win.timeOnFlip(polygon, 'tStartRefresh')  # time at next scr refresh
        polygon.setAutoDraw(True)
    if polygon.status == STARTED:
        if frameN >= (polygon.frameNStart + 3):
            # keep track of stop time/frame for later
            polygon.tStop = t  # not accounting for scr refresh
            polygon.frameNStop = frameN  # exact frame index
            win.timeOnFlip(polygon, 'tStopRefresh')  # time at next scr refresh
            polygon.setAutoDraw(False)
    
    # *polygon_2* updates
    if polygon_2.status == NOT_STARTED and frameN >= 576:
        # keep track of start time/frame for later
        polygon_2.frameNStart = frameN  # exact frame index
        polygon_2.tStart = t  # local t and not account for scr refresh
        polygon_2.tStartRefresh = tThisFlipGlobal  # on global time
        win.timeOnFlip(polygon_2, 'tStartRefresh')  # time at next scr refresh
        polygon_2.setAutoDraw(True)
    if polygon_2.status == STARTED:
        if frameN >= (polygon_2.frameNStart + 3):
            # keep track of stop time/frame for later
            polygon_2.tStop = t  # not accounting for scr refresh
            polygon_2.frameNStop = frameN  # exact frame index
            win.timeOnFlip(polygon_2, 'tStopRefresh')  # time at next scr refresh
            polygon_2.setAutoDraw(False)
    # start/stop sound_1
    if sound_1.status == NOT_STARTED and frameN >= standardStim:
        # keep track of start time/frame for later
        sound_1.frameNStart = frameN  # exact frame index
        sound_1.tStart = t  # local t and not account for scr refresh
        sound_1.tStartRefresh = tThisFlipGlobal  # on global time
        sound_1.play(when=win)  # sync with win flip
    if sound_1.status == STARTED:
        # is it time to stop? (based on global clock, using actual start)
        if tThisFlipGlobal > sound_1.tStartRefresh + 0.02-frameTolerance:
            # keep track of stop time/frame for later
            sound_1.tStop = t  # not accounting for scr refresh
            sound_1.frameNStop = frameN  # exact frame index
            win.timeOnFlip(sound_1, 'tStopRefresh')  # time at next scr refresh
            sound_1.stop()
    # start/stop sound_2
    if sound_2.status == NOT_STARTED and frameN >= delay:
        # keep track of start time/frame for later
        sound_2.frameNStart = frameN  # exact frame index
        sound_2.tStart = t  # local t and not account for scr refresh
        sound_2.tStartRefresh = tThisFlipGlobal  # on global time
        sound_2.play(when=win)  # sync with win flip
    if sound_2.status == STARTED:
        # is it time to stop? (based on global clock, using actual start)
        if tThisFlipGlobal > sound_2.tStartRefresh + 0.02-frameTolerance:
            # keep track of stop time/frame for later
            sound_2.tStop = t  # not accounting for scr refresh
            sound_2.frameNStop = frameN  # exact frame index
            win.timeOnFlip(sound_2, 'tStopRefresh')  # time at next scr refresh
            sound_2.stop()
```
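
Tracing the frame logic above suggests the 3-frame rule itself is fine: `polygon` is switched on at frame 288 and off once `frameN` reaches 291, so it is drawn on exactly three flips. Because that duration is counted in flips rather than seconds, though, any flip inside the window that misses a refresh would keep the circle up for an extra refresh while the frame count still reads 3. The sketch below reproduces that reasoning in plain Python with simulated flip timestamps (`late` is a hypothetical set of flips that each miss one refresh), so it illustrates one possible explanation of the symptom, not a diagnosis.

```python
# Mirrors the polygon start/stop rule from the Builder loop
# (start when frameN >= start, stop when frameN >= frameNStart + n_frames).
# Flip timestamps are simulated; `late` marks flips that miss one refresh.

def drawn_frames(start, n_frames, total_frames):
    """Frame indices on which the stimulus is drawn by the Builder logic."""
    drawn = []
    auto_draw = False
    frame_n_start = None
    for frame_n in range(total_frames):
        if frame_n_start is None and frame_n >= start:
            frame_n_start = frame_n      # like polygon.frameNStart
            auto_draw = True             # like polygon.setAutoDraw(True)
        elif auto_draw and frame_n >= frame_n_start + n_frames:
            auto_draw = False            # like polygon.setAutoDraw(False)
        if auto_draw:
            drawn.append(frame_n)        # the flip ending this frame shows it
    return drawn


def flip_times(n_flips, refresh_hz, late=()):
    """Simulated flip timestamps (s); each index in `late` misses one refresh."""
    period = 1.0 / refresh_hz
    t, times = 0.0, []
    for i in range(n_flips):
        if i in late:
            t += period                  # this flip lands one refresh late
        times.append(t)
        t += period
    return times


def on_screen_frames(times, frames, refresh_hz):
    """On-screen time, in refresh periods, from the first flip showing the
    stimulus to the flip that removes it."""
    return (times[frames[-1] + 1] - times[frames[0]]) * refresh_hz


frames = drawn_frames(start=288, n_frames=3, total_frames=300)
print(frames)  # [288, 289, 290]
print(round(on_screen_frames(flip_times(300, 144), frames, 144), 2))              # 3.0
print(round(on_screen_frames(flip_times(300, 144, late={289}), frames, 144), 2))  # 4.0
```

With no late flips the circle spans 3 refreshes (about 20.8 ms at 144 Hz); a single missed refresh inside the window stretches it to 4 with no change in `frameN`.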

Update: I tested this with an even simpler task (rebuilt from scratch) with only one flash-beep pair per trial, and yep, every stimulus is presented for the correct duration except when the auditory and visual stimuli coincide exactly. So it appears that PsychoPy has latency issues when it has to present two stimuli at exactly the same time.
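
To put numbers on this comparison across many trials, the flash durations could be summarised from the data file. This sketch assumes Builder-style `polygon.started`/`polygon.stopped` columns (flip times in seconds) plus a hypothetical conditions column `soa_frames` that is 0 on synchronous trials; rename these to match the actual file. The demo rows are synthetic and only show the input format.

```python
import csv
import io
from statistics import mean

# Assumed column names: Builder-style start/stop flip times plus a
# hypothetical conditions column "soa_frames" (0 = synchronous trial).
SOA, VIS_ON, VIS_OFF = "soa_frames", "polygon.started", "polygon.stopped"


def durations_by_sync(rows, refresh_hz):
    """Mean flash duration (in refresh periods) for synchronous vs.
    asynchronous trials."""
    groups = {"synchronous": [], "asynchronous": []}
    for row in rows:
        if not all(row.get(k) for k in (SOA, VIS_ON, VIS_OFF)):
            continue  # skip rows without complete timing data
        key = "synchronous" if int(float(row[SOA])) == 0 else "asynchronous"
        duration_s = float(row[VIS_OFF]) - float(row[VIS_ON])
        groups[key].append(duration_s * refresh_hz)
    return {key: mean(vals) for key, vals in groups.items() if vals}


# Synthetic rows (not real results) just to show the expected input format.
demo_csv = """soa_frames,polygon.started,polygon.stopped
0,1.000,1.0208333
10,2.000,2.0208333
,,
"""
summary = durations_by_sync(csv.DictReader(io.StringIO(demo_csv)), 144)
print({key: round(val, 2) for key, val in summary.items()})
```

For a real file, pass `csv.DictReader(open(path, newline=""))` instead of the in-memory demo; a synchronous mean near 4 or 5 with an asynchronous mean near 3 would reproduce the pattern described above.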

Note that on a 60 Hz monitor (using a different PC) I managed to get the discrepancy down to an extra 1-4 ms, which seems acceptable for the task. Not sure why the discrepancy is so much larger on my beefier PC with the 144 Hz monitor, but there ya go.

An explanation/fix for this would still be much appreciated.

TL;DR: PsychoPy seems to have latency issues when presenting an auditory and a visual stimulus at the same time.