Comments

Papa DIF Kid January 15, 2024 8:51 AM

Makes me wonder if 100 or 200 years from now, there might be scientific advancements that make it possible to clone not just voices but the whole human being, provided that a DNA sample is available. Imagine bringing dead people back, and all that’s required is their DNA sample. Stuff for the SF movies.

Clive Robinson January 15, 2024 9:06 AM

@ Bruce, ALL,

“Voice Cloning with Very Short Samples”

I guess the question is not how short the researchers have so far achieved, but if there is some kind of limit.

In essence what they are trying to do is to “build or select a model” of the physical vocal tract that is sufficiently close to that of the target.

In part the closeness of the model depends on how long a false representation is needed. If it is just “yes / no” then the model need not be that close, so a very quick sample to select from a small number of models may be all that is required.

However the longer the false representation, the closer the model needs to be to the target. So if they have to quote the entire US Constitution then you actually need more than just a model of the physical vocal tract.

Because things like timing due to emotion would then need to be modeled.

But what of free input speech?

Here you need a lot more, such as word usage, tense and other idiosyncratic speech / language constructs that, under analysis and with sufficient ordered classifiers, become like a fingerprint.

Importantly, as these classifiers work with written speech as well as spoken, they are independent of the vocal tract model.

You know “it’s Shakespeare or Betjeman” irrespective of if it’s Old Hamish in the pub, or young Lucy at her school assembly.

Clive Robinson January 15, 2024 9:59 AM

@ ALL,

It’s rather important that you read section 4 “Discussion” of the paper, as it explains why their definition of “clone” might not be yours,

Specifically,

“… It is relatively easy to train a base speaker TTS model to control the voice styles and languages, as long as we do not require the model to have the ability to clone the tone color of the reference speaker.”

Thus it is only a “partial clone” of any given “reference speaker”.

Hence my Hamish and Lucy comment above.

I’ll let others amplify on what this actually means with respect to impersonation and potential crime.

Bob January 15, 2024 11:55 AM

@Clive

I don’t feel it’s necessarily impossible to model tone color either. My cousins and I used to run around with a parabolic mic back in the day. Covert listening has really only gotten easier since then, especially given that we’re now all carrying listening devices in our pockets that need only a sophisticated enough attacker. We even talk right into them all the time.

I feel like you’re missing a “…for now.” at the end of your first couple posts here. And it seems when we’re talking about this sort of modeling, the amount of time that constitutes “for now” keeps getting shorter. Old as you might think yourself, I think you’ll live to see increasingly more trust problems springing up in this area.

vas pup January 15, 2024 5:59 PM

@Eric and may use it to train their AI for other purposes – e.g. to give or not to give a loan. They must clearly specify all possible usage of your voice. Unfortunately, for privacy it looks like a one-way street for the customer.
It may substantially differ by country: e.g. Canadian Bank versus US Bank.
For me the main concern with banks is outsourcing customer service to other countries far away from the US mainland, where they have the same access to your PII and financial information but there is less possibility of making them liable for its misuse.

vas pup January 15, 2024 6:13 PM

Small addition: and practically unlimited access/sharing to your voice print stored by bank with all alphabetic LEAs without your knowledge and consent. Orwell rest.

Clive Robinson January 15, 2024 6:31 PM

@ Eric, ALL,

“And multiple financial institutions use voice ID for a second factor…”

If what the researchers say in §4 “Discussion” of their paper is correct then it is unlikely this method could be used, hence my reason for mentioning it.

@ Bob, ALL,

“I don’t feel it’s necessarily impossible to model tone color either.”

It’s not impossible, but not with this method and short samples.

“Old as you might think yourself, I think you’ll live to see increasingly more trust problems springing up in this area.”

Of that I have no doubt, but not with such short samples.

I won’t trot out the math; it has been done by others in the past and it’s easy enough to look up (if you know where to look).

Nyquist worked out why there were issues with limited bandwidth and sampling.

Hartley and Shannon developed it further.

Shannon worked out the information bandwidth limit in a channel.

Shannon gave just over 19kb/s for random data in the audio bandwidth of a phone channel. And his foundations gave us several “apparent possibilities” by removing redundancy that Hamming used to give us both error correction and information compression. Which is why 1990s modems gave the illusion of up to 56kb/s in the phone bandwidth.

From this one can work backwards and show that a 1sec sample can not contain sufficient information to make a sufficiently accurate clone.
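
As a rough illustration of that “work backwards” step, here is a minimal sketch of the Shannon-Hartley limit, C = B log2(1 + S/N), applied to a short sample. The 3.1 kHz passband and the 20 dB signal-to-noise ratio are assumptions chosen for illustration, not figures from the paper or from this thread, and the function names are hypothetical.

```python
import math


def shannon_capacity_bps(bandwidth_hz: float, snr_db: float) -> float:
    """Shannon-Hartley limit C = B * log2(1 + S/N), in bits per second."""
    snr_linear = 10 ** (snr_db / 10)  # convert dB to a plain power ratio
    return bandwidth_hz * math.log2(1 + snr_linear)


def max_bits_in_sample(bandwidth_hz: float, snr_db: float, seconds: float) -> float:
    """Upper bound on the bits a sample of the given duration can carry."""
    return shannon_capacity_bps(bandwidth_hz, snr_db) * seconds


# Assumed, illustrative channel figures (not taken from the thread).
PHONE_BANDWIDTH_HZ = 3100.0  # roughly the 300-3400 Hz telephone passband
ASSUMED_SNR_DB = 20.0

if __name__ == "__main__":
    for seconds in (1, 12):
        bits = max_bits_in_sample(PHONE_BANDWIDTH_HZ, ASSUMED_SNR_DB, seconds)
        print(f"{seconds:>2} s sample: at most about {bits / 1000:.1f} kbit")
```

Note that this is only a ceiling on the raw bits the channel can deliver in that time. How many of those bits are speaker-specific, and how many a model needs, is a separate question the sketch does not settle.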

Further, as you strip the speech down into its various layers you will find that even 12secs will not give sufficient bandwidth. Especially as free user input means that the process is effectively stochastic, not deterministic, and any model of a user would probably be incomplete.

Thus, for an attacker to have a sufficiently high probability of success, the verifying party would need to be very limited in what it takes as input and measures. Which would be tantamount to negligence by the verifying party.

Which brings us onto,

“And it seems when we’re talking about this sort of modeling, the amount of time that constitutes “for now” keeps getting shorter.”

It does but only as a fairly rapidly decreasing percentage.

Worse, the cost of gaining each percentage shortening goes up as a power law.

So the potential return on investment has, assuming a sensible upgrade policy by the verifier, passed the point of normal “home banking” theft average returns.

Thus I would expect normal criminals to seek other methods that are both “lower hanging fruit” and “target rich”.

lurker January 15, 2024 6:36 PM

@Clive Robinson, Bob

You quoted correctly from the paper, that they do not require the TTS engine to “clone” the reference speaker’s tone colour. But the sentence following your quote, and Fig. 1 show that they extract the tone colour before the mechanical processing, then put it back again after. Rather than try to build a spaghetti coded voice synthesiser with a huge sample database, they have gone back to old-school Unix programming of piping the signal flow through small modules which perform one function efficiently.

A key factor is the use of IPA as a “cross-lingual unified phoneme dictionary”. This must greatly simplify their phoneme analysis and synthesis.

lurker January 15, 2024 7:40 PM

@Clive Robinson

I understand your reference to Nyquist and Shannon. The bandwidth required for a human speaker to be recognised and understood by a human listener is also language dependent. Early papers on articulation and intelligibility focussed on west European languages. I have seen(1) the results of a study that claimed for 90% intelligibility a bandwidth of 4.4 kHz was needed for the Welsh language, while only 900 Hz was sufficient for Chinese. This makes the use of IPA more interesting in the engine described in the paper.

1) I thought it was in Atkinson, Telephony Ch.1, but a quick skim there reveals not; and the search engines seem to be still on holiday …

Winter January 16, 2024 1:28 AM

@Bob

“I don’t feel it’s necessarily impossible to model tone color either.”

Indeed, the tone color is determined by the length of the vocal tract, the distance between the vocal folds and the lips. A few vowels are enough to estimate the length of the vocal tract. The details of the spectrum of these vowels allow one to estimate the range of tongue movements.

In 1 second of speech, you can expect to find around 4 vowels. With only a few vowels, these estimates will not be precise, but probably acceptable.
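
To make the vowel-to-length step concrete, here is a minimal sketch using the textbook quarter-wave tube approximation, F_n = (2n - 1) c / (4L), which treats the vocal tract as a uniform tube closed at the vocal folds and open at the lips. That model only holds roughly for a neutral, schwa-like vowel, and it is an assumption made here for illustration, not the method used in the paper. The speed of sound value and the function names are likewise assumptions.

```python
from statistics import mean

SPEED_OF_SOUND_CM_PER_S = 35_000.0  # assumed value for warm, humid air


def tract_length_from_formant(formant_hz: float, n: int) -> float:
    """Length (cm) of a uniform tube whose n-th resonance is formant_hz.

    Quarter-wave model: F_n = (2n - 1) * c / (4 * L), solved for L.
    """
    if formant_hz <= 0 or n < 1:
        raise ValueError("need a positive frequency and n >= 1")
    return (2 * n - 1) * SPEED_OF_SOUND_CM_PER_S / (4.0 * formant_hz)


def estimate_tract_length_cm(formants_hz):
    """Average the per-formant estimates for F1, F2, F3, ... of one vowel."""
    estimates = [
        tract_length_from_formant(f, n)
        for n, f in enumerate(formants_hz, start=1)
    ]
    return mean(estimates)


if __name__ == "__main__":
    # Idealised neutral-vowel formants, used here purely as example input.
    print(estimate_tract_length_cm([500.0, 1500.0, 2500.0]))
```

With only a handful of vowels the per-formant estimates will scatter, which is one way to see why the result is imprecise but possibly acceptable.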

Systems for automatic speaker verification can easily be trained to spot artefacts from voice conversion to weed out imposters. That is an arms race, but every new attack will be met with counter measures.

vas pup January 16, 2024 6:02 PM

A secret phone surveillance program is spying on millions of Americans
https://cyberguy.com/security/secret-phone-surveillance-program-spying-millions-americans/

“Imagine that every time you make a phone call, someone is keeping a record of extraordinary details of your calls. They are tracking who you are talking to, when, where, and for how long. And they don’t stop there. They also track the calls of people you talk to, the people they talk to, and so on*. This is a reality for millions of Americans who use AT&T’s phone network.”

*Does “so on” include voice sampling and cloning of both parties?

@Bruce: please ask your contacts in EFF to file a FOIA request with the government entities involved on the subject to clarify this. Thank you.

Clive Robinson January 17, 2024 2:59 AM

@ lurker,

Re : Duck Down and out.

Yup, the Microsoft backed search engines are about as close to useless as it can get on technical subjects.

And I’m quite sure it’s deliberate.

With regards,

“I have seen(1) the results of a study that claimed for 90% intelligibility a bandwidth of 4.4 kHz was needed for the Welsh language, while only 900 Hz was sufficient for Chinese.”

Can you remember if it gave a technical reason?

It’s known that there are correlations between landscape type and very basic underlying language formations.

The argument is it’s a learned effect based on redundancy to correct errors at distance or in noise.

One such is dealing with “Inter Symbol Interference” (ISI) from near reflections through to echoes.

As a young lad over half a century ago, I was surprised to find that whistling when walking past certain types of “vertically ship lapped fencing” produced a strange low buzz / drone effect[1].

I now know it was a form of multipath direct conversion envelope demodulation. Importantly, our environment, especially areas with bare rock, does this. Whilst vegetation dampens part of the audio spectrum. As for desert scrub… Let’s just say “Don’t go there” as it’s not just your hearing that gets affected 😉

[1] In effect my moving head was acting like one of those CW microwave doppler radars and I was hearing the amplitude change of my whistle summed, in amplitude and phase, with its reflection, with the very close ship lap fence acting as a sufficient reflector with regular sawtooth changes,

https://en.m.wikipedia.org/wiki/Doppler_radar

Clive Robinson January 17, 2024 3:52 AM

@ Winter, Bob, ALL,

Re : Is a technology race different?

“That is an arms race, but every new attack will be met with counter measures.”

And each step gets exponentially more expensive.

But the same is true for any technology race as well.

Thus we can see things driven by both a profit curve and a technology cost reduction curve and even a cost of delivery curve.

But all research can be seen as comprising four basic gains,

1, Personal learning.
2, Academic standing.
3, Commerce advancement.
4, Technology advancement.

With perhaps the exclusion of the first and last, these are competitive areas much akin to any battle, skirmish, or war. It’s just a consideration of “gain v harm” and how efficient you can make the process.

Winter January 17, 2024 4:28 AM

@Clive

“And each step gets exponentially more expensive.”

So do not use Speaker Verification. The idea that Voice Access Verification is secure is an illusion anyway.

Winter January 17, 2024 4:52 AM

@lurker

“I have seen(1) the results of a study that claimed for 90% intelligibility a bandwidth of 4.4 kHz was needed for the Welsh language, while only 900 Hz was sufficient for Chinese.”

The stuff of urban myths. With an upper cut-off of 900Hz, you will be challenged to discriminate the 21 consonants of Mandarin even in clearly read speech. In noise and with spontaneous speech, intelligibility will collapse. But that is no different from English.

What they could have meant is that you can distinguish all tones in (eg, Mandarin, Cantonese) Chinese. But that cutoff, again, is probably lower.

Winter January 17, 2024 4:57 AM

@Clive

“And each step gets exponentially more expensive.”

PS
I was told that biometrics are not a password, but a user-name. Biometrics is your Social Security Number, and that is not safe to use as your access code either.

Voice can be your user name, but it should not be your password.

Clive Robinson January 17, 2024 6:32 AM

@ Winter,

Re : Biometrics are the death of both security and the user.

“Voice can be your user name, but it should not be your password.”

It should be neither an Identifier nor an Authenticator, as it can not be changed.

Whilst it should not be done, many use the identifier as a weak authenticator; in theory they think they get some increased measure of security…

As many readers here know that is “security by obscurity” thinking, but you would be surprised at just how often it’s done (think hard coded in embedded systems like IoT devices).

But there is another reason that we should make more obvious. Which is what is served by the ID,

1, The system.
2, The User.

As far as authentication is concerned it’s actually the system and the identifier should be unique in that context only.

But as far as authentication is concerned, the user has many roles in life and thus has many systems they authenticate to. For the safety of the user there should be no “unique to the user” identifier. Because it’s a major security failing, as many people use only a single password.

Thus the user should be able to change their identifier as easily as they change their password.
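
One way to picture identifiers that are unique only within a system, and changeable at will, is a minimal sketch like the following. It is not something proposed in this thread: the keyed-hash derivation, the `rotation` counter, and all names are hypothetical choices made for illustration, using Python's standard `hmac` and `hashlib` modules.

```python
import hashlib
import hmac


def system_identifier(user_secret: bytes, system_name: str, rotation: int = 0) -> str:
    """Derive an identifier that is specific to one system.

    The identifier depends on a secret only the user holds, the system it is
    for, and a rotation counter the user can bump to get a fresh identifier.
    """
    message = f"{system_name}:{rotation}".encode("utf-8")
    digest = hmac.new(user_secret, message, hashlib.sha256).hexdigest()
    return digest[:16]  # shortened purely for readability


if __name__ == "__main__":
    secret = b"example-secret-held-only-by-the-user"  # placeholder value
    print(system_identifier(secret, "bank.example"))
    print(system_identifier(secret, "forum.example"))
    print(system_identifier(secret, "bank.example", rotation=1))  # changed identifier
```

Because each system sees a different value, an identifier leaked from one system does not directly point at the same user elsewhere, and bumping the counter replaces it much as one would replace a password.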

Which means using authentication identifiers as Email names etc is really a very bad idea.

The problem is, no matter how often you explain “User Role implications” to authoritarians and developers, you get the same response: they ignore it for their own benefit…

And we are all less secure because of that “block headed” mentality…

WAVint January 18, 2024 7:13 PM

Excellent timing for this topic! Thanks so much.
This audio topic is right on time! There’s sooooo much for amateurs like me to think about within this!

I hope you and other researchers will open a discussion about tragic “FLAK” (“rendlieee eyre”) within security, such as security officers accidentally harassing security officers.

Dodging the bs is what it seems to be all about, even within security topics.

sincerely, WAVint

P.S.-I am an amateur. Please no more hazing, (aside), I can’t survive the joke(?).

vas pup January 19, 2024 5:09 PM

@Clive said “But as far as authentication is concerned, the user has many roles in life and thus has many systems they authenticate to. For the safety of the user there should be no “unique to the user” identifier. Because it’s a major security failing, as many people use only a single password.

Thus the user should be able to change their identifier as easily as they change their password.”

Excellent point as usual!

