Detecting Deepfake Audio by Modeling the Human Acoustic Tract
This is interesting research:
In this paper, we develop a new mechanism for detecting audio deepfakes using techniques from the field of articulatory phonetics. Specifically, we apply fluid dynamics to estimate the arrangement of the human vocal tract during speech generation and show that deepfakes often model impossible or highly-unlikely anatomical arrangements. When parameterized to achieve 99.9% precision, our detection mechanism achieves a recall of 99.5%, correctly identifying all but one deepfake sample in our dataset.
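To make the headline numbers concrete: precision is the fraction of flagged samples that really are deepfakes, and recall is the fraction of deepfakes that get flagged. A minimal sketch, using hypothetical counts chosen only to reproduce the quoted percentages (the paper's actual dataset sizes are not given here):

```python
def precision_recall(tp, fp, fn):
    """tp: deepfakes correctly flagged; fp: real audio wrongly flagged;
    fn: deepfakes missed."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall

# Hypothetical counts, not from the paper: 995 deepfakes caught,
# 1 false alarm, 5 missed.
p, r = precision_recall(tp=995, fp=1, fn=5)
# p is about 0.999, r is exactly 0.995
```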
From an article by two of the researchers:
The first step in differentiating speech produced by humans from speech generated by deepfakes is understanding how to acoustically model the vocal tract. Luckily scientists have techniques to estimate what someone—or some being such as a dinosaur—would sound like based on anatomical measurements of its vocal tract.
We did the reverse. By inverting many of these same techniques, we were able to extract an approximation of a speaker’s vocal tract during a segment of speech. This allowed us to effectively peer into the anatomy of the speaker who created the audio sample.
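The paper's inversion is built on fluid-dynamics modeling, which is not reproduced here. A much older and simpler stand-in for the same idea is the lossless acoustic-tube model from classical speech processing: linear-prediction reflection coefficients computed from a speech frame map to cross-sectional area ratios along a concatenation of tubes (Wakita's method). The sketch below is purely illustrative of that classical approximation, not the paper's technique:

```python
import numpy as np

def reflection_coefficients(frame, order=12):
    """Levinson-Durbin recursion on the frame's autocorrelation;
    returns the PARCOR/reflection coefficients k_1..k_order."""
    n = len(frame)
    r = np.correlate(frame, frame, mode="full")[n - 1 : n + order]
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    k = np.zeros(order)
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        ki = -acc / err
        k[i - 1] = ki
        prev = a.copy()
        for j in range(1, i):
            a[j] = prev[j] + ki * prev[i - j]
        a[i] = ki
        err *= 1.0 - ki * ki
    return k

def tube_areas(k, end_area=1.0):
    """Area function of the lossless-tube model implied by the
    reflection coefficients: A_{m+1} = A_m * (1 - k_m) / (1 + k_m).
    Sign conventions differ between texts; this is one common choice."""
    areas = [end_area]
    for km in k:
        areas.append(areas[-1] * (1.0 - km) / (1.0 + km))
    return np.array(areas)

# Toy "speech" frame: noisy sum of vowel-like formant frequencies.
rng = np.random.default_rng(0)
sr = 8000
t = np.arange(400) / sr
frame = (np.sin(2 * np.pi * 700 * t) + 0.6 * np.sin(2 * np.pi * 1220 * t)
         + 0.3 * np.sin(2 * np.pi * 2600 * t)
         + 0.05 * rng.standard_normal(400))
k = reflection_coefficients(frame)
areas = tube_areas(k)
```

Because Levinson-Durbin keeps every |k| below 1 for a valid autocorrelation, all the derived areas come out positive, i.e. physically interpretable as tube cross-sections.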
From here, we hypothesized that deepfake audio samples would fail to be constrained by the same anatomical limitations humans have. In other words, we expected that analyzing deepfaked audio samples would yield simulated vocal tract shapes that do not exist in people.

Our testing results not only confirmed our hypothesis but revealed something interesting. When extracting vocal tract estimations from deepfake audio, we found that the estimations were often comically incorrect. For instance, it was common for deepfake audio to result in vocal tracts with the same relative diameter and consistency as a drinking straw, in contrast to human vocal tracts, which are much wider and more variable in shape.
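The "drinking straw" observation suggests the kind of crude sanity check one could layer on top of any tract estimate: flag samples whose implied diameters are uniformly narrow or nearly constant along the tract. The thresholds below are illustrative guesses, not values from the paper:

```python
import numpy as np

def looks_human(areas_cm2, min_d=0.8, max_d=4.0, min_spread=0.3):
    """Crude plausibility check on an estimated vocal-tract area
    function (in cm^2). Thresholds (cm) are illustrative only:
    human tracts are a few cm across and vary in shape, whereas
    the paper reports deepfakes often implying straw-like,
    near-constant narrow tubes."""
    d = 2.0 * np.sqrt(np.asarray(areas_cm2) / np.pi)  # equivalent diameters
    within_range = (d.min() >= min_d) and (d.max() <= max_d)
    varied_shape = (d.max() - d.min()) >= min_spread
    return within_range and varied_shape

human_like = [2.5, 4.1, 5.8, 7.0, 5.2, 3.0, 1.9]    # cm^2, varied shape
straw_like = [0.3, 0.3, 0.31, 0.3, 0.29, 0.3, 0.3]  # narrow, near-constant
```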
This is, of course, not the last word. Deepfake generators will figure out how to use these techniques to create harder-to-detect fake voices. And the deepfake detectors will figure out another, better, detection technique. And the arms race will continue.
Slashdot thread.
Clive Robinson • October 3, 2022 8:07 AM
@ Bruce, ALL,
Re : Arms race on fakes
And it’s not that hard to see how. From the article,
Note that the “approximation” is just from the acoustic information.
Although quite different in detail, it can be seen that if you have front and profile images of the speaker, you can come up with an "approximation" of the alleged speaker's vocal tract.
So even if the deep fake "audio approximation" is within "human constraints", how well will it align with the alleged individual's "image approximation"?
Probably not that well without a lot of work.
However it will get more fun with "video": in men at least, the movement of the Adam's apple is sufficiently clear to make a dynamic model…
As back in the days of the ECM/ECCM arms race, the question is not really how far it can go technically but the resource costs involved.
But also consider that a similar analysis will enable deep fake videos to be unmasked. That is, if you have known genuine audio of a speaker you can build a model, then articulate it with the video's audio track and compare against the video image of the vocal tract.
It is certain that as the bit resolution and scan rates of audio, image and video recordings increase, things like "blood flow response to emotion" and much else will fall under scrutiny for deep-fakes. One that should be easy to do is examine the "eye response to artificial lighting", another "head movements in response to background noise".
If people remember back, in the UK quite a few years back, an audio recording was shown to be fake because the very low level “mains hum” did not align with the time claimed and the “National Grid” records.
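The mains-hum technique is known in audio forensics as electrical network frequency (ENF) analysis: grid frequency wanders slightly around its nominal 50 Hz (or 60 Hz), and the wander captured in a recording can be matched against logged grid data. A minimal sketch of the tracking step, assuming a synthetic signal (a real forensic pipeline does considerably more: bandpass filtering, harmonic analysis, and matching against grid records):

```python
import numpy as np

def enf_track(signal, sr, nominal=50.0, win_s=1.0, pad=16):
    """Track the mains-hum (ENF) frequency over time: take the
    spectral peak within +/-1 Hz of the nominal grid frequency in
    each window, zero-padding the FFT for a finer frequency grid."""
    n = int(sr * win_s)
    nfft = pad * n
    freqs = np.fft.rfftfreq(nfft, 1.0 / sr)
    band = (freqs >= nominal - 1.0) & (freqs <= nominal + 1.0)
    win = np.hanning(n)
    track = []
    for start in range(0, len(signal) - n + 1, n):
        spec = np.abs(np.fft.rfft(signal[start:start + n] * win, nfft))
        track.append(freqs[band][np.argmax(spec[band])])
    return np.array(track)

# Synthetic "recording": 8 s of faint 50.07 Hz hum buried in noise.
sr = 1000
t = np.arange(8 * sr) / sr
rng = np.random.default_rng(1)
audio = (0.2 * np.sin(2 * np.pi * 50.07 * t)
         + 0.02 * rng.standard_normal(t.size))
track = enf_track(audio, sr)
```

A recovered track that is flat, or that disagrees with the grid operator's logs for the claimed time, is evidence the recording is not what it purports to be.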
I think people would be surprised at the amount of research in this area that is going to appear over the next few years.
At the moment it's fairly easy to "make your name" as there is minimal academic competition, but that will change with the publication of just one or two papers.