IEC 62503 pdf – Multimedia quality – Method of assessment of synchronization of audio and video
The leftmost part of Figure 1 represents the real world and the rightmost part represents the reproduced world.
Lip sync at the section 0-0′, i.e. the delay of the scene in reference to the accompanying sound, is normally zero; in other words, no video delay against the accompanying audio is expected: $\Delta t_0 = 0$. Where $\Delta t_0 \neq 0$ is foreseen, it shall also be taken into account.
Lip sync at the section 1-1′ is supposed to be introduced by the separate acquisition of physical phenomena by microphones and video cameras, followed by further separate digital processing of the audio and video data. It will cause lip sync of $\Delta t_1 \neq 0$.
NOTE In the case of MPEG-2 encoding, there is a synchronization scheme using the Decoding Time Stamp (DTS) as well as the Presentation Time Stamp (PTS) embedded in the header of the Packetized Elementary Stream (PES). See ISO/IEC 13818-1.
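As an illustration of how such time stamps relate to lip sync, the sketch below converts the difference between a video PTS and an audio PTS into milliseconds. It relies only on PTS being a 33-bit count of a 90 kHz clock; the function name `pts_offset_ms` and the sign convention (video minus audio) are assumptions made for this example and are not part of this standard.

```python
PTS_CLOCK_HZ = 90_000   # PTS/DTS tick rate in MPEG-2 Systems
PTS_MODULUS = 1 << 33   # PTS/DTS are 33-bit counters that wrap around


def pts_offset_ms(video_pts: int, audio_pts: int) -> float:
    """Return the video-minus-audio presentation offset in milliseconds.

    Hypothetical helper: a positive result means the video frame is
    scheduled for presentation later than the audio access unit.
    """
    diff = (video_pts - audio_pts) % PTS_MODULUS
    # Map into the signed range so that a small negative offset is not
    # misread as a very large positive one after counter wrap-around.
    if diff >= PTS_MODULUS // 2:
        diff -= PTS_MODULUS
    return diff * 1000.0 / PTS_CLOCK_HZ
```

For example, a difference of 3600 ticks corresponds to 40 ms, i.e. one frame period at 25 frames per second.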
Lip sync at the section 2-2′ is supposed to be introduced by the reproduction processes applied separately to the audio and video channels, such as decompression, rendering and reproduction. It will cause lip sync of $\Delta t_2 \neq 0$, which can be measured using a reference test multimedia material with $\Delta t_1 = 0$.
Lip sync at the section 3-3′ is in the reproduced multimedia world and is assessed by human subjects. Subjective opinion scores on lip sync are statistically analyzed to find an estimated value for $\Delta t_3 \neq 0$, corresponding to the amount of compensation needed for just-synchronized reproduction.
5 Subjective assessment of lip sync
5.1 Items to be assessed
Subjective grading level of mis-synchronization between video and audio.
5.2 Preparation of test video clips and test video sequence
5.2.1 Selection of content of a test video clip
Since lip sync is a matter of human perception, it may depend on the content of the video and the accompanying audio. Especially when it is related to the movement of the lips of a human speaker, whether the spoken language matches the subject's mother tongue may affect the result.
NOTE In this International Standard, in order to provide worked examples, speech in the Japanese language uttered by a well-trained professional news reader is watched and listened to by subjects with the same mother tongue.
A bust shot of a news reader shall be extracted, the duration of which should be around 10 s to 20 s. The data of the audio channel of the video clip shall be taken as the timing reference.
The possible amount of time caused by mis-synchronization in this original video clip, $\Delta t_1$ at the section 1-1′, is unknown. However, this International Standard provides the method to estimate the overall lip sync $\Delta t_3$, including $\Delta t_0$ and $\Delta t_1$. Namely, $\Delta t_3 = \Delta t_0 + \Delta t_1 + \Delta t_2$.
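The additive relation $\Delta t_3 = \Delta t_0 + \Delta t_1 + \Delta t_2$ can be rearranged to isolate the unknown contribution of the original clip once the other terms are available. The following is a minimal sketch of that arithmetic only; the function name and the numeric values in the usage line are hypothetical and are not taken from this standard.

```python
def original_clip_lip_sync_ms(dt3_ms: float, dt2_ms: float, dt0_ms: float = 0.0) -> float:
    """Solve dt3 = dt0 + dt1 + dt2 for dt1 (all values in milliseconds).

    dt3_ms: overall lip sync estimated from the subjective assessment
    dt2_ms: lip sync of the reproduction system, measured separately
    dt0_ms: lip sync already present in the real scene (normally zero)
    """
    return dt3_ms - dt2_ms - dt0_ms


# Illustrative values only: an overall estimate of 50 ms and a
# reproduction-system contribution of 30 ms leave 20 ms for the clip.
dt1_ms = original_clip_lip_sync_ms(50.0, 30.0)
```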
5.2.2 Creation of a test video sequence
The test video sequence shall be a randomised series of the video clip selected in 5.2.1, in which each of the audio channels shall be replaced by time-shifted audio data with the necessary duration of padding as a leader or a trailer, depending on the direction of the time shift. Preparation of such video clips is shown in Figure 2 as the image frames with delayed audio and with leading audio. The amounts of the time shifts, $T_l$ and $T_d$, are subject to adjustment.
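The time shift with padding described above can be sketched as follows for one audio channel held as a list of samples. This is an illustration under stated assumptions, not the procedure of Figure 2: the function name `shift_audio` is hypothetical, silence (zero samples) is used as padding, and the opposite end of the audio is trimmed so that the channel keeps the length of the video clip.

```python
def shift_audio(samples, shift_ms, sample_rate_hz):
    """Return a time-shifted copy of one audio channel.

    shift_ms > 0 delays the audio (silent leader, tail trimmed);
    shift_ms < 0 advances the audio (head trimmed, silent trailer).
    The sign convention and the trimming are assumptions of this sketch.
    """
    n = round(abs(shift_ms) * sample_rate_hz / 1000)
    n = min(n, len(samples))          # never pad more than the clip length
    silence = [0] * n
    if shift_ms >= 0:
        # Delayed audio: padding as a leader.
        return silence + samples[:len(samples) - n]
    # Leading audio: padding as a trailer.
    return samples[n:] + silence
```

Keeping the length constant means the video frames need no change; an alternative that pads the video as well would avoid trimming any audio.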
To allow each of the video clips with time-shifted audio composing the test video sequence to be visually identified by a subject, each video clip prepared in accordance with Figure 2 should be preceded by the necessary number of title frames, which include a sequence number. The amounts of the time shifts for the audio data, $T_l$ and $T_d$, shall be determined taking into account the sum of the lip sync in the reproduction system, $\Delta t_2$, and the possible lip sync in the original video clip, $\Delta t_1$. The amount of increment and decrement of the time shift, $\Delta T$, for $T_l$ or $T_d$ shall be decided in accordance with the precision of assessment. In this standard, $\Delta T = 10$ ms is recommended.
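A randomised presentation order with sequence numbers for the title frames could be generated along the following lines. This is a sketch under assumptions: the function name `build_test_sequence` is hypothetical, $T_l$ is taken to be the largest audio lead and $T_d$ the largest audio delay, negative values denote leading audio, and both limits are whole multiples of the step $\Delta T$.

```python
import random


def build_test_sequence(max_lead_ms, max_delay_ms, step_ms=10, seed=None):
    """Return [(sequence_number, audio_shift_ms), ...] in randomised order.

    Shifts run from -max_lead_ms to +max_delay_ms in steps of step_ms;
    the sequence number is what the title frames of each clip would show.
    """
    if max_lead_ms % step_ms or max_delay_ms % step_ms:
        raise ValueError("limits must be whole multiples of the step")
    shifts = list(range(-max_lead_ms, max_delay_ms + 1, step_ms))
    rng = random.Random(seed)   # a fixed seed gives a reproducible order
    rng.shuffle(shifts)
    return list(enumerate(shifts, start=1))
```

Calling `build_test_sequence(30, 20, seed=1)` yields six numbered clips covering shifts of -30 ms to +20 ms; the assessor keeps the mapping from sequence number to shift, while subjects see only the number.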
The test video sequence should be stored on a medium such as a CD-ROM for use in 5.3 without losing audio-video synchronization.