YouTube's loudness normalization algorithm

To upload video with the best possible sound quality to YouTube, you need to know YouTube's loudness normalization specification.

However, YouTube does not publish its loudness normalization specification. Several people have already investigated it, but the specific calculation formula is not known.

I tried to estimate the formula for loudness normalization on YouTube.

Contents

1 YouTube loudness normalization specification
2 Research policy
3 Overall framework of YouTube's loudness normalization
4 Loudness calculation formula on YouTube
5 Test video used for parameter estimation
6 Equalizer parameter estimation
7 Parameter estimation other than equalizer
- 7.1 Parameter list
- 7.2 Result list
8 Appendix
9 References
10 Change history
11 Summary
- 11.1 Related article

YouTube loudness normalization specification

The following is a summary of the survey results.

Loudness normalization adjusts the loudness of the sound source toward the loudness target value as far as possible without letting the peak clip.

YouTube calculates the loudness of the sound source with its own specification. It can be approximated to within about 1 dB by replacing the weighting curve of the EBU TECH 3341 Short-term loudness with the following curve and taking the maximum Short-term loudness value.

Research policy

I first investigate the overall framework of YouTube's loudness normalization and then the details of the loudness calculation.

Overall framework of YouTube's loudness normalization

Judging from the information here , it probably works as follows.

Loudness normalization on YouTube adjusts the loudness of the sound source toward the loudness target value as far as possible without letting the peak clip. Expressed as a formula:

Compensation (dB) = Min (- Peak, Target - Loudness)

Peak is the peak level of the sound source, Loudness is the loudness of the sound source, Target is a constant (the loudness target value), and Compensation is the correction gain. The volume of the whole video is changed uniformly by Compensation.

The "content loudness" shown when you right-click a YouTube video and open the detailed statistics ("Stats for nerds") corresponds to Loudness - Target.
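The formula above can be written as a short sketch. The function names and all numbers here are illustrative assumptions (including the target of -10.0), and peak levels are assumed to be given in dB relative to full scale.

```python
def compensation_db(peak_db: float, loudness: float, target: float) -> float:
    """Gain in dB: move toward the target, but never past the clipping headroom."""
    headroom = -peak_db            # room left before the peak reaches full scale
    wanted = target - loudness     # gain that would hit the target exactly
    return min(headroom, wanted)


def content_loudness(loudness: float, target: float) -> float:
    """Value shown in the statistics overlay: Loudness - Target."""
    return loudness - target


# Hypothetical numbers, for illustration only.
TARGET = -10.0
loud = compensation_db(peak_db=-0.5, loudness=-6.0, target=TARGET)
quiet = compensation_db(peak_db=-3.0, loudness=-18.0, target=TARGET)
print(loud, quiet, content_loudness(-6.0, TARGET))
```

In this sketch the loud source is turned down by 4 dB, while the quiet source would need +8 dB to reach the target but is limited to +3 dB by its peak.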

Loudness calculation formula on YouTube

YouTube appears to use its own loudness calculation formula, so it has to be estimated.

Consider the following model with reference to ITU-R BS.1770-3.

Equalizer -> Cut by window -> Convert to LUFS -> Gating -> Aggregation

Equalizer

Weight each frequency with an equalizer.

In previous experiments, neither the K-weighting adopted in ITU-R BS.1770-3 nor other popular weighting curves fit the measurements, so I estimate the frequency response directly.

Cut by window

Cut the waveform into blocks with a rectangular window.

Window length and overlap ratio are parameters.

For reference, the momentary and integrated loudness of ITU-R BS.1770-3 and EBU TECH 3341 use a window length of 400 ms with a step of 100 ms between blocks (an overlap of 300 ms, so the overlap ratio is 75%). The short-term loudness of EBU TECH 3341 uses a window length of 3 seconds and an overlap length of 2.9 seconds or more (an overlap ratio of 96.7% or more).
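A minimal sketch of the windowing step, assuming mono samples in a plain Python list. The function name `split_blocks` and its parameters are illustrative and not taken from any standard.

```python
def split_blocks(samples, sample_rate, window_s, overlap_ratio):
    """Cut `samples` into rectangular-window blocks with the given overlap."""
    size = round(window_s * sample_rate)                  # samples per block
    hop = max(1, round(size * (1.0 - overlap_ratio)))     # step between block starts
    return [samples[start:start + size]
            for start in range(0, len(samples) - size + 1, hop)]


rate = 48000
signal = [0.0] * (rate * 2)                  # two seconds of silence as dummy input
blocks = split_blocks(signal, rate, 0.4, 0.75)
print(len(blocks), len(blocks[0]))           # number of blocks, samples per block
```

With a 400 ms window and 75% overlap, consecutive blocks start 100 ms apart, so two seconds of audio yield 17 complete blocks.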

Convert to LUFS

Calculate the RMS of each block and convert it to LUFS on a logarithmic scale, 20 log10 (RMS).

An offset is also applied so that a full-scale stereo 1000 Hz sine wave reads 0. For ITU-R BS.1770-3 this offset is -0.691 dB.
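A sketch of the conversion for one stereo block, assuming the channel mean squares are summed as in ITU-R BS.1770. The name `block_loudness` and the `offset_db` parameter are illustrative. No weighting filter is applied here, so the offset that makes a full-scale stereo 1000 Hz sine read 0 is simply 0; with K-weighting it would be the -0.691 dB mentioned above.

```python
import math


def block_loudness(left, right, offset_db=0.0):
    """Loudness of one stereo block: 10*log10 of the summed channel mean squares."""
    ms_left = sum(x * x for x in left) / len(left)
    ms_right = sum(x * x for x in right) / len(right)
    power = ms_left + ms_right
    if power <= 0.0:
        return float("-inf")       # digital silence
    return 10.0 * math.log10(power) + offset_db


rate = 48000
n = rate * 4 // 10                 # one 400 ms block
sine = [math.sin(2 * math.pi * 1000 * i / rate) for i in range(n)]
print(round(block_loudness(sine, sine), 3))
```

Each channel of the sine has a mean square of 0.5, so the summed power is 1 and the result is 0 dB up to rounding error.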

Gating

To remove the influence of silent passages on the loudness, quiet blocks are discarded from the per-block values obtained by windowing.

Following ITU-R BS.1770-3 and EBU TECH 3342, absolute-threshold gating and relative-threshold gating are applied.

The parameters are the two threshold values. I also try variants without gating.

For reference, the parameters of ITU-R BS.1770-3 and EBU TECH 3341 are Absolute Threshold -70 LKFS and Relative Threshold -10 dB. Parameters for calculating the Loudness Range of EBU TECH 3342 are Absolute Threshold -70 LKFS and Relative Threshold -20 dB.
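The two gating stages can be sketched as follows, operating on a list of per-block loudness values. The function names are illustrative, and computing the relative threshold from the power-domain mean of the blocks that pass the absolute gate follows my reading of ITU-R BS.1770; passing `None` disables a stage, which corresponds to the "no gating" variants.

```python
import math


def energy_mean(loudness_values):
    """Average loudness values in the power domain and return the result in dB."""
    mean_power = sum(10.0 ** (v / 10.0) for v in loudness_values) / len(loudness_values)
    return 10.0 * math.log10(mean_power) if mean_power > 0.0 else float("-inf")


def gate(blocks, absolute=-70.0, relative=-10.0):
    """Drop quiet blocks; pass None to skip either gating stage."""
    kept = list(blocks)
    if absolute is not None:
        kept = [v for v in kept if v > absolute]
    if relative is not None and kept:
        threshold = energy_mean(kept) + relative   # relative to the remaining blocks
        kept = [v for v in kept if v > threshold]
    return kept


blocks = [-80.0, -45.0, -20.0, -18.0, -19.0]   # hypothetical block loudness values
print(gate(blocks))
```

In this example -80 fails the absolute gate, and -45 falls more than 10 dB below the mean of the remaining blocks, so only the three loud blocks survive.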

Aggregation

Take the mean or the maximum of the block values that remain after gating.

ITU-R BS.1770-3 takes the mean, but according to this , YouTube may be using the maximum Short-term value.
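A sketch of the two aggregation modes. Averaging in the power domain rather than averaging the dB values directly is an assumption that follows ITU-R BS.1770; the function name `aggregate` is illustrative.

```python
import math


def aggregate(values, mode="mean"):
    """Collapse gated block loudness values into a single number."""
    if not values:
        return float("-inf")                     # nothing survived gating
    if mode == "max":
        return max(values)
    if mode == "mean":
        mean_power = sum(10.0 ** (v / 10.0) for v in values) / len(values)
        return 10.0 * math.log10(mean_power) if mean_power > 0.0 else float("-inf")
    raise ValueError(f"unknown mode: {mode}")


blocks = [-30.0, -20.0, -10.0]                   # hypothetical values
print(aggregate(blocks, "mean"), aggregate(blocks, "max"))
```

The maximum reacts only to the loudest passage, while the mean is pulled down by quieter blocks, which is why the two modes can differ by several dB on dynamic material.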

Test video used for parameter estimation

Prepare test videos to estimate the parameters of the loudness calculation model.

According to here , loudness normalization may not be applied until a video has reached a certain number of views, or until some time has passed since it was posted. Instead of preparing test videos myself, I therefore selected existing videos that have enough views and were posted long enough ago, and used them as test videos.

A list of test videos is described in the Appendix.

Equalizer parameter estimation

A test video containing a sine wave of constant volume removes every influence on the loudness other than the equalizer. I use this first to estimate the frequency response of the equalizer.

For sine wave sources of various frequencies, measure the content loudness on YouTube and estimate the frequency response by taking the difference from the RMS of the source. The estimation result is below. For detailed data please see the Appendix.

The results were unstable above 16 kHz, where different videos gave different values even at the same frequency, so the following discussion only uses data up to 15 kHz. Below 44 Hz and above 15 kHz, the curve is linearly extrapolated.
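The estimation can be sketched as below. The measurement values are hypothetical, not the article's data; normalizing the curve to 0 dB at 1000 Hz and interpolating on a log-frequency axis are my assumptions, as are the function names.

```python
import math


def estimate_response(measurements, reference_hz=1000.0):
    """measurements: {freq_hz: (content_loudness_db, source_rms_db)} -> {freq_hz: gain_db}."""
    raw = {f: cl - rms for f, (cl, rms) in measurements.items()}
    ref = raw[reference_hz]                    # removes the constant Target term
    return {f: g - ref for f, g in sorted(raw.items())}


def gain_at(freq_hz, response):
    """Linear interpolation on a log-frequency axis; end segments are extended outward."""
    pts = sorted(response.items())
    i = 0
    # pick the segment containing freq_hz, or the nearest end segment
    while i < len(pts) - 2 and freq_hz > pts[i + 1][0]:
        i += 1
    (f0, g0), (f1, g1) = pts[i], pts[i + 1]
    t = (math.log10(freq_hz) - math.log10(f0)) / (math.log10(f1) - math.log10(f0))
    return g0 + t * (g1 - g0)


# Hypothetical measurements, NOT the article's data.
measured = {100.0: (-6.0, -3.0), 1000.0: (2.0, -3.0), 10000.0: (0.0, -3.0)}
response = estimate_response(measured)
print(response, gain_at(10.0, response))
```

Because content loudness is Loudness - Target, the difference from the source RMS contains a constant offset; subtracting the value at the reference frequency leaves only the shape of the curve.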

Parameter estimation other than equalizer

Next, fix the frequency response of the equalizer and estimate the remaining parameters.

Calculate the loudness of various videos with various parameter combinations, compare the results with the loudness calculated by YouTube (Content Loudness), and look for the combination with the smallest error. The test video list is given in the Appendix.
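One way to score a parameter combination is sketched below. Since content loudness is Loudness - Target, each video gives one estimate of Target, and the spread of those estimates is the error. The function name, the use of the population standard deviation, and the sample values are assumptions for illustration.

```python
import statistics


def fit_target(model_loudness, content_loudness):
    """Return (estimated target, error stddev, max abs error) for one parameter set."""
    # Content loudness = Loudness - Target, so each video votes for a target.
    targets = [m - c for m, c in zip(model_loudness, content_loudness)]
    target = statistics.mean(targets)
    errors = [t - target for t in targets]
    return target, statistics.pstdev(errors), max(abs(e) for e in errors)


# Hypothetical values for three videos, for illustration only.
model = [-8.0, -12.5, -5.0]          # loudness from the model (LUFS)
youtube = [2.0, -2.0, 5.5]           # content loudness read from YouTube (dB)
target, stddev, worst = fit_target(model, youtube)
print(round(target, 2), round(stddev, 2), round(worst, 2))
```

Running this for every parameter combination and sorting by the error columns would produce a table of the same shape as the result list.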

Parameter list

Parameter	Values
Window length	400 ms, 3 sec
Overlap ratio	75%, 96.7%
Absolute threshold	None, -70 LKFS
Relative threshold	None, -10 dB, -20 dB
Aggregation	mean, max

Result list

Parameters	Estimated Target (LUFS)	Error Stddev (dB)	Error Max (dB)
abs threshold none, rel threshold none, window 0.4 sec, overlap 75%, mean	-16.15449408	5.51255362	10.73290254
abs threshold none, rel threshold none, window 3 sec, overlap 96.7%, mean	-14.97681484	4.908278646	11.91484089
abs threshold none, rel threshold -10 dB, window 0.4 sec, overlap 75%, mean	-13.94987923	3.954370989	7.389401665
abs threshold none, rel threshold -10 dB, window 3 sec, overlap 96.7%, mean	-13.68684721	3.684007274	7.647167492
abs threshold none, rel threshold -20 dB, window 0.4 sec, overlap 75%, mean	-14.49831437	4.531255406	9.145055115
abs threshold none, rel threshold -20 dB, window 3 sec, overlap 96.7%, mean	-14.01660691	4.048723057	9.667181199
abs threshold -70 LUFS, rel threshold none, window 0.4 sec, overlap 75%, mean	-16.15449408	5.51255362	10.73290254
abs threshold -70 LUFS, rel threshold none, window 3 sec, overlap 96.7%, mean	-14.97681484	4.908278646	11.91484089
abs threshold -70 LUFS, rel threshold -10 dB, window 0.4 sec, overlap 75%, mean	-13.89217514	3.911543318	7.447105751
abs threshold -70 LUFS, rel threshold -10 dB, window 3 sec, overlap 96.7%, mean	-13.66565863	3.666025972	7.668356069
abs threshold -70 LUFS, rel threshold -20 dB, window 0.4 sec, overlap 75%, mean	-14.47170654	4.52391958	9.171662946
abs threshold -70 LUFS, rel threshold -20 dB, window 3 sec, overlap 96.7%, mean	-14.00512426	4.038389533	9.678663846
abs threshold none, rel threshold none, window 0.4 sec, overlap 75%, max	-8.993721502	1.106961021	2.968119771
abs threshold none, rel threshold none, window 3 sec, overlap 96.7%, max	-10.31246414	0.90143559	1.746039964
ITU-R BS.1770-3	-10.39317645	11.03141212	33.14216451
RMS	-13.03007896	10.1756184	29.41685531

The parameter combination with the smallest error was a window length of 3 seconds, an overlap ratio of 96.7%, and max aggregation. The standard deviation of the error was 0.9 dB and the maximum error was 1.7 dB. This corresponds to the maximum Short-term loudness of EBU TECH 3341, calculated with the estimated equalizer instead of K-weighting. The estimated loudness target value is -10.3 LUFS.

This gives an estimate of YouTube's loudness calculation method.