Hiding Vulnerabilities in Source Code

Really interesting research demonstrating how to hide vulnerabilities in source code by manipulating how Unicode text is displayed. It’s really clever, and not the sort of attack one would normally think about.

From Ross Anderson’s blog:

We have discovered ways of manipulating the encoding of source code files so that human viewers and compilers see different logic. One particularly pernicious method uses Unicode directionality override characters to display code as an anagram of its true logic. We’ve verified that this attack works against C, C++, C#, JavaScript, Java, Rust, Go, and Python, and suspect that it will work against most other modern languages.

This potentially devastating attack is tracked as CVE-2021-42574, while a related attack that uses homoglyphs – visually similar characters – is tracked as CVE-2021-42694. This work has been under embargo for a 99-day period, giving time for a major coordinated disclosure effort in which many compilers, interpreters, code editors, and repositories have implemented defenses.
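
As a minimal illustration (not code from the paper), a single Bidi override character can sit inside a string literal or comment, invisible in most renderings of the text but still present in the data the compiler sees:

# Illustrative Python only: U+202E (RIGHT-TO-LEFT OVERRIDE) is stored in the string
# even though most editors and renderers will not show it.
import unicodedata

s = "user\u202E "          # the override character before the space is easy to miss on screen
print(len(s))              # 6 -- the override still counts as a character
print([unicodedata.name(c, "?") for c in s])   # ends with 'RIGHT-TO-LEFT OVERRIDE', 'SPACE'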

Website for the attack. Rust security advisory.

Brian Krebs has a blog post.

EDITED TO ADD (11/12): An older paper on similar issues.

Posted on November 1, 2021 at 10:58 AM • 101 Comments

Comments

echo November 1, 2021 11:31 AM

Anyone familiar with COBOL and dot matrix printers when trying to find the source of a ten page error report knows this problem.

Compiler vendors are fundamentally lazy. You see this through entire “compiler like” toolchains. You have designers of languages ignoring rationality and piling in function after gee whiz function while ignoring legal mandates such as equality law governing accessibility. The lack of regulatory oversight and remedy in law perpetuates this.

File under avoidable problem.

Reality is a concrete mattress disguised by avoidant language much as “beef” when people really mean dead cow. Hence “Not normally” – a horrible phrase often used by lawyers along with “that depends” and “with kind regards”. You can add it to the same list as politician-speak such as “endeavour” and “pledge”.

See you same time next year for the same problem wearing different clothes.

Bear November 1, 2021 11:38 AM

I used to have a long, angry Unicode Rant.

I would bring it out at the slightest provocation, expounding on the evils of lookalike characters and the possibility of deceptive source code with ‘fake-equality’ of identifiers, of ‘meta’ or control characters doing unexpected things to display of characters they’d never been meant to apply to, about a multiplicity of different code sequences to denote the same character, about there being more than one ‘normalization form’, about its chaotic non-ordering, about its unused codepoints, about the way it was accumulated rather than designed and continued to mutate year after year after year when people were already using it, and about the way we had somehow gotten from the important business of actually communicating language somehow to the idiotic situation of worrying about whether users identified with the supposed ethnicity of abstract symbols that should never have had any kind of color associated with them in the first place and cataloging poop emoji.

We have ADULTS, spending their days cataloging poop emoji. Anybody who cared about communicating language is now beating their heads against a wall and weeping.

But I’ll spare you the long version of the screed. The above is like a few reminders of things I can go on for pages about.

It was a good and important idea. It got sidetracked into something else and became a juddering abomination.

So meta characters that can do unexpected things turn out to be a bad idea? So deceptive source code that displays different semantics to humans and compilers has been enabled? Color me surprised. I’ve been telling people so for decades!

lurker November 1, 2021 12:15 PM

We’ve verified that this attack works against C, C++, C#, JavaScript, Java, Rust, Go, and Python,

Is this really a problem with these languages? Aren’t they just behaving as described on the box, allowing code writers to think and comment in their own human language? Surely the problem is in the compilers/interpreters that allow nonfunctional inclusions in the code to perform naughty functions.

As @Clive suggests in
#comment-391195
in the squid thread, the solution is to use sane filtering ahead of compilers/interpreters that are insane.

Yabba Dabba Don't November 1, 2021 12:21 PM

Wasn’t there a famous security essay from decades ago whose summary was “never trust the compiler”? I’m sure I read about it on this blog.

Clive Robinson November 1, 2021 12:46 PM

@ ALL,

I posted on this earlier over on the current squid page.

Something similar came up last century when *nix was internationalized. It was the *nix internationalization effort that gave rise to much that is now in Unicode…

Unfortunately it looks like the bidirectional printing behaviour has remained.

For some reason, of nearly all technically based domains, the IT industry really does not learn from its history…

It is worth thinking back to when *nix internationalisation was causing issues, and what people did as workarounds back then.

First off, back then few “code editors” supported anything other than ASCII, and as far as I’m aware that is still true for the basic *nix editors of the time that are still around (ed / vi). Thus simply opening the file in read-only mode with one should throw up a visual clue at code review.

There may also be a simple workaround[1] for some source-code-to-executable compiler “pipelines”:

1, Run all source code through a filter that converts Unicode to ASCII (many editors will have a “Save as ASCII” option anyway).

2, Run the filter output through the compiler pre-processor only, so you get code expansion etc.

Take that expanded pre-processor output and then do your code review on it…

[1] Not all compilers are going to have this issue, because they don’t accept “unicode source” as input. I have quite a number of “old” compilers and I know they only take “7bit clean” ASCII.

Why modern compilers take Unicode by default is a bit of a mystery as well… They really should not do so unless specifically told to with a “switch”, on the “least unexpected behaviour” principle and likewise the “no hidden behaviour” principle (Unicode input did cause older source code control systems to barf).
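
As a rough sketch of step 1 above (Python just for illustration, and the exact pipeline will vary by compiler), a filter that strips anything outside 7-bit ASCII and warns where it did so:

# Sketch only: read a source file on stdin, strip non-ASCII bytes, warn on stderr.
import sys

for lineno, raw in enumerate(sys.stdin.buffer, start=1):
    cleaned = bytes(b for b in raw if b < 0x80)   # keep only "7bit clean" bytes
    if cleaned != raw:
        sys.stderr.write("warning: non-ASCII bytes removed on line %d\n" % lineno)
    sys.stdout.buffer.write(cleaned)

Something like “python3 ascii_filter.py < prog.c | cc -E -x c -” would then hand the pre-processor only what a plain ASCII editor would show (the command line is illustrative, flags differ between compilers).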

Z.Lozinski November 1, 2021 12:52 PM

@Yabba Dabba Don’t ..

Yes, it’s Ken Thompson’s 1984 Turing Award paper “Reflections on Trusting Trust”, and it is the very first reference cited in Nicholas and Ross’s paper.

On reading the paper, my immediate thought is “Oh, bother” .. well, that’s the cleaned up version.

When I worked on the display of interesting text (Japanese, Hebrew) on mainframe terminals (the IBM 5550 family of terminals designed in Japan), the control characters were always displayed, so you could see the [SO/SI] character sequences that changed how you interpreted the text. Very handy for debugging that feature.

Sumadelet November 1, 2021 1:33 PM

In the comments on Brian Krebs’ article, someone links back to a very similar-sounding issue discussed in Golang back in May 2017:

h++ps://github.com/golang/go/issues/20209

There’s a long history of discoveries publicised by and/or attributed to people other than the first discoverer (h++ps://en.wikipedia.org/wiki/List_of_multiple_discoveries): this might be another example.

echo November 1, 2021 1:35 PM

@lurker

Validate your data. This was one of the first things I was taught, and I was known for writing bulletproof code.

I hadn’t even read Clive’s comment before posting my own. Clive isn’t the only pebble on the beach, and he’s begun circling back to putting engineers on a pedestal and beating up coders again.

Physicists do physics. Engineering is applied physics. Engineering doesn’t teach creativity or what should be done. It’s mostly a long list of rote-learned formulas and sanity checks. This is not to put the formal and informal aspects of the discipline down. Just don’t put it on a pedestal.

This whole topic pivots around psychology. There is the linear “here are the rules” approach which heads off “you shouldn’t/cannot do that”. Psychology is not rules and it is not laws. The brain stalls and misses the superset. A person coming at it from the point of creativity looks at the superset and dodges all the rules to get the outcome they desired.

If you rewind life back to when you were in school there is only so much school teaches. Some of it is formal and some of it is implied and some of it is sourced externally from the surrounding culture and parents and friends and other teachers and influencers. Snap back to the present and take this insight and apply it to where you are now.

Unlike engineers and engineers in QA departments, coders don’t have the same protections in law. They cannot just say “no” and halt the production line. That is a quick way to be fired, and yes I have been fired more than once for doing it. On one occasion the MD even admitted that I had been wrongly fired by a middle manager but the “management line had to be upheld”. The other time, after I opened my mouth during a board-level meeting, I was unceremoniously fired and the company later went bust for massive cost overruns on vanity projects I had warned them not to do. This was a company which spent £10,000 on desks and from outside the meeting room you could hear managers shouting and banging the desk, and where not a single woman in the whole building would say boo to a goose even though they knew it was nonsense. Except me, of course, but I’m stupid like that. It was a company which effectively printed money and it would have taken an imbecile for it to go bankrupt, which is exactly what happened. There are more stories of management corruption and strongarming I could write about, and other organisations going bust and inadequate management being fired and fraud, but things would ramble on.

BCS November 1, 2021 1:50 PM

The simplest mitigation for a tool chain is likely to provide an --ascii-only flag. At a guess >99% of code has no need for anything outside 7-bit ASCII, so just enforce that.

I suspect the majority of the exceptions are in string literals, which suggests a useful middle ground option. Even then, I’d consider a style rule that non-ASCII string literals should go in isolated translation units.
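
For Python sources, a sketch of that middle ground can lean on the standard tokenize module (my illustration, not an existing compiler flag): allow non-ASCII only inside string literals and comments, and flag it anywhere else.

# Sketch: report non-ASCII characters that appear outside string literals and comments.
import sys
import tokenize

def check(path):
    clean = True
    with open(path, "rb") as f:
        for tok in tokenize.tokenize(f.readline):
            if tok.type in (tokenize.STRING, tokenize.COMMENT):
                continue  # literals and comments may carry non-ASCII text
            if any(ord(ch) > 127 for ch in tok.string):
                print("%s:%d: non-ASCII outside string/comment: %r" % (path, tok.start[0], tok.string))
                clean = False
    return clean

if __name__ == "__main__":
    results = [check(p) for p in sys.argv[1:]]
    sys.exit(0 if all(results) else 1)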

Andrew November 1, 2021 2:35 PM

Why am I reminded of an old xkcd?

https://xkcd.com/1137/

I’m kinda surprised at how big a wave this is making today in all my feeds.

Also surprised I haven’t come across an “obligatory xkcd” comment yet.

Grahame Grieve November 1, 2021 2:50 PM

Why modern compilers do take Unicode by default is a bit of a mystery as well…

It’s only a mystery if you speak English and only write software for other English speakers.

Clive Robinson November 1, 2021 2:57 PM

@ BCS,

The simplest mitigation for a tool chain is likely to provide an --ascii-only flag.

Err, no, and you know it 😉

At a guess >99% of code has no need for anything outside 7-bit ASCII, so just enforce that.

The default for the compilers should be 7bit ASCII; the exception switch or flag needs to be --notascii.

Zaphod November 1, 2021 3:00 PM

Does it work with vi? I think not.

Also – Bruce you need some advice re. sartorial elegance.

Alexey T. November 1, 2021 3:00 PM

“uses Unicode directionality override characters” — I read about this 5 years ago on the Russian Habr.com, and added special handling of these chars in CudaText (a free code editor) – it displays these chars specially.

lurker November 1, 2021 3:07 PM

Let’s all point fingers at Unicode, which should be a pure character set, and let text/word processors decide if and how they want to handle LTR/RTL. Sure, we should be able to write Arabic backwards if we want, after all some people (not just xkcd) think backwards English has some use. Even with the control characters there are still some text/word processors that cannot cleanly handle Arabic quotations embedded in English text. And while we’re at it, what about Unicode control characters for vertical scripts? W3C in its wisdom or otherwise has recommended html devices for displaying traditional Chinese, but there are still enough browsers out there that throw a hissy fit at it.

I had a little smile at Ken Thompson’s rotated character set for vertical graph axes, but PostScript could do it more cleanly by drawing the string in normal characters then rotating the whole string.

No, this problem is a compiler/interpreter problem. Most of them are pretty good at parsing white space, then extend that to invalid characters in valid places.

JW November 1, 2021 3:52 PM

Seems like something a syntax highlighter would immediately make obvious if used in your code review. Most IDEs and web utilities have this for major languages at least?

Ted November 1, 2021 4:18 PM

@All

What do you all think about the Defense strategies listed for this vulnerability in the paper (on page 8)? Are any of these more or less effective?

One of them is:

Therefore, a better defense might be to ban the use of unterminated Bidi override characters within string literals and comments. By ensuring that each override is terminated – that is, for example, that every LRI has a matching PDI – it becomes impossible to distort legitimate source code outside of string literals and comments.
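
For what it’s worth, here is a rough sketch (mine, in Python, not the paper’s reference code) of what “every override is terminated” could mean for the text of a single string literal or comment:

# Sketch: return any Bidi override/embedding/isolate characters left unterminated in a span of text.
ISOLATES   = {"\u2066", "\u2067", "\u2068"}              # LRI, RLI, FSI -- closed by PDI
EMBEDDINGS = {"\u202A", "\u202B", "\u202D", "\u202E"}    # LRE, RLE, LRO, RLO -- closed by PDF
PDI, PDF = "\u2069", "\u202C"

def unterminated(span):
    stack = []
    for ch in span:
        if ch in ISOLATES or ch in EMBEDDINGS:
            stack.append(ch)
        elif ch == PDI and stack and stack[-1] in ISOLATES:
            stack.pop()
        elif ch == PDF and stack and stack[-1] in EMBEDDINGS:
            stack.pop()
    return stack   # anything left open can distort text beyond the end of the span

print(unterminated("abc\u202Edef"))   # ['\u202e'] -- an RLO with no terminator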

Clive Robinson November 1, 2021 4:27 PM

@ lurker,

Let’s all point fingers at Unicode, which should be a pure character set, and let text/word processors decide if and how they want to handle LTR/RTL.

The issue is not just Unicode. The problem is “in band signaling” that is silent to the user, which is stupid.

Even ASCII has its control chars, with which you can print backwards and forwards as much as you want. It was after all how *nix “man pages” got their bold titles on printers that lacked any kind of intensity control (nearly all serial printers and many parallel port printers going into the 1990s).

No, this problem is a compiler/interpreter problem.

The issue is not really the compilers either: they are acting on what is actually in the file, not what appears on the screen as seen by the human eye. It is the application that displays the file that lies to the user’s eye.

So the real problem is stupidity, for not thinking such behaviour would have undesirable side effects.

But the problem goes deeper, there has never yet been invented a “universal tool” that “can do every job”… The nearest we have to that when you think about it is a broken claw hammer…

And “broken” is the real issue. To try to make a tool that “can do every job” you have to break things, the only real question is,

“Where do you want the breaks to be?”

Then make them “hard breaks” where they can not hide or be used to hide other things.

That is “overt” not “covert” behaviour with “in band signaling”, anything else is asking for trouble…

Z.Lozinski November 1, 2021 4:53 PM

@Ted

ban the use of unterminated Bidi override characters within string literals and comments.

That’s harder than it looks.

You are going to have to implement that override in the lowest level of the interpreter/compiler. The lexical analyzer is usually a finite state machine that tokenizes the sequences of characters in the input program into logical items. Adding recognition of the Unicode Bi-Di state in addition to the language’s lexical state is complex. I can see interactions when attackers start using homoglyphs for end-of-string-literal or end-of-comment. The lexical analyzer will now have a different view of the input state and may skip ahead. An attacker can use that to remove an input validation check.

You can’t even say no bi-di text interleaved with non-bidi text, because it is perfectly reasonable to comment in Arabic or Hebrew while most of the program text is plain text.

The immediate response is that a program source file with any Unicode control code should be flagged as “needs careful review”. Then someone gets to write a scanner that looks for anomalies – the challenge is this has to be programming-language dependent.
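
A first cut of that scanner need not be language-aware at all. As a sketch (Python, purely illustrative), flagging any character in the Unicode “format” (Cf) category, which is where the Bidi controls live, is enough to route a file to careful review:

# Sketch: flag any line of a source file containing Unicode "format" (Cf) characters.
import sys
import unicodedata

def suspicious_lines(path):
    with open(path, encoding="utf-8", errors="replace") as f:
        for lineno, line in enumerate(f, start=1):
            hits = [unicodedata.name(c, "UNKNOWN") for c in line
                    if unicodedata.category(c) == "Cf"]
            if hits:
                yield lineno, hits

for path in sys.argv[1:]:
    for lineno, names in suspicious_lines(path):
        print("%s:%d: needs careful review: %s" % (path, lineno, ", ".join(names)))

It will also flag harmless format characters (zero-width joiners in emoji, soft hyphens), which is acceptable for a “needs review” gate, though not for an outright ban.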

SpaceLifeForm November 1, 2021 5:04 PM

@ Clive, ALL

The default for the compilers should be 7bit ASCII; the exception switch or flag needs to be --notascii.

This is correct.

This ‘feature’ of accepting UTF-8 in source code at the parser level was a mistake that added unneeded complexity. And with unneeded complexity, the attack space increases.

All programmers, worldwide, even if English was not their native language, knew enough English to read and write source code. More importantly, how to read documentation.

No one needs UTF-8 to get machine code generated.

The fact that no instances of this problem have ever been found in the wild, completely confirms this point.

Clive Robinson November 1, 2021 5:14 PM

@ Ted, ALL,

Are any of these more or less effective?

They are all good or bad depending on your viewpoint under any of many different circumstances…

An editor for writing source code in should not be hiding such things; some don’t (ed / vi / vim) but most other modern editors do, which under most circumstances is bad.

Why modern code editors do this I have no idea; after all it adds needless code by the bucketful, and as we should all know the number of vulnerabilities in code is approximately related to the number of lines, and is made worse by needless complexity…

The design principle should not be to hide things from the user in a coder’s editor, because it’s asking for trouble…

The principle of “least surprise” is now well established, and hiding things by the use of control characters and other silent in-band signalling is most definitely very high on the surprise list…

As a temporary fix, making all source files “7bit clean” ASCII via a simple filter is probably the best thing to do, because it will fix the problem with minimal surprises.

In the long run “yes” we can fix the editors and other tools that hide what is actually in the source file.

But programmers, in by far the greatest number of cases, have no need for anything other than 7bit clean ASCII by default. If they need to add “funnies” then they can put them in with escaped numeric or hex codes. But really, they really, really should not do even that… You should never embed presentation in the program logic. It’s the very reason the internationalization effort started in the first place. Programmers should lift such stuff out and put it in tables in separate files, not the source code. Back in the 80’s and 90’s some lessons were learnt the hard way, yet now all that appears to have been forgotten.

So,

“Therefore, a better defense might be to ban the use of unterminated Bidi override characters within string literals and comments.”

Is a bad idea for two simple reasons,

1, It still gives a hiding place “for no good reason”.
2, It adds unnecessary complexity “for no good reason”.

Either is bad, but combined, they form the base of another disaster waiting to happen.

It would be better to say “no control characters” other than the basics… But as I’ve pointed out, even ASCII control chars allow you to write backwards and thus “over-strike”. Which with modern displays in reality means “hiding information”, and thus does not follow the “least surprise” principle.
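
On the “escaped numeric or hex codes” point above, a rough rewrite of that kind is short (Python purely as an illustration):

# Sketch: rewrite anything outside 7-bit ASCII as a \uXXXX or \UXXXXXXXX escape,
# so the source file itself stays "7bit clean".
def escape_non_ascii(text):
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 128:
            out.append(ch)
        elif cp <= 0xFFFF:
            out.append("\\u%04x" % cp)
        else:
            out.append("\\U%08x" % cp)
    return "".join(out)

print(escape_non_ascii("memória"))   # mem\u00f3ria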

echo November 1, 2021 6:53 PM

Everyone bikeshedding and making this more complicated than need be with three page essays nobody will ever read… You could spec this out in five minutes on one page of A4 instead of gold plating with finger pointing and jargon. We’ll be reinventing PRINCE project management before this topic is done…

  • Keep it simple.
  • Validate your inputs.
  • Put a warning or reveal code toggle in there if you have to. Even word processors have this functionality.
  • Compile.
  • Done.

Still not a single peep off anyone about accessibility which is actually a real problem leaving around 99% of OS and toolchain vendors and websites and other media producers and hardware suppliers legally liable.

Apparently, Google penalises websites which use aggressive SEO, similar in concept to this attack, to present content with an agenda in front of the indexed content. Yet the selfsame Google does not penalise websites with poor accessibility.

Moving on – the biggest talk around subjects like super clever “deep learning” artificial voicing of text is all about steel jawed and gimlet eyed duck and roll security. Nothing about making life easier for those with an accessibility problem.

Focus, Grasshopper. Focus.

MK November 1, 2021 8:41 PM

“No one needs UTF-8 to get machine code generated.”
I’m going to disagree here. That’s because different OSes have different default mappings, especially when they support multiple natural languages: ASCII-7 vs. CodePage-1252 vs. EBCDIC vs. ISO 8859-1. Moving code written with one character set to another character set and then compiling leads to problems. People don’t change their native code page when writing or editing code. Maybe they should, but I have to import whatever is out on GitHub and can’t enforce an authoring choice. Just shifting between Mac and Windows gives problems. So I convert to Unicode, THEN filter the characters. UTF-8 in the source file to get around 16-bit byte order problems.
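
As a sketch of that convert-then-filter step (the candidate encodings here are just a guess, not a recommendation):

# Sketch: normalise a legacy source file to UTF-8 before any filtering step.
# latin-1 never fails to decode, so it acts as the fallback of last resort.
def to_utf8(path):
    raw = open(path, "rb").read()
    for enc in ("utf-8", "cp1252", "latin-1"):
        try:
            text = raw.decode(enc)
            break
        except UnicodeDecodeError:
            continue
    with open(path, "w", encoding="utf-8", newline="") as f:
        f.write(text)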

Peter A. November 1, 2021 9:54 PM

This is very old news. Back in the 80’s I used a few control characters in a particular ZX Spectrum BASIC program to hide a conditional I sneaked into it, so it wasn’t visible when the program was displayed on the screen. The program, written by our teacher, was used to proctor physics tests. You can probably imagine what the conditional did. I told my classmates to use the feature sparingly, so as not to raise suspicions. As far as I know, it persisted until the very end of the poor Speccy’s life. I am sure such tricks were used by many people before, on much older systems.

Introducing non-ASCII characters to a programming language parser, so you could call your temporary variable хуй or use Sanskrit numerals, is just plain stupid. Every programmer has to know some English – the keywords are in English anyway (in the languages I know, at least, unless you specifically alias them somehow). Moreover, using sensible English names not only makes your code readable for everyone – it also, for a non-native speaker at least, makes you stop and think more abstractly about what the object is for and whether it is needed at all.

Unicode text strings are a different thing. However, avoiding text literals in the code should be common practice in internationalized software – they should be placed somewhere else. Alas, there’s also debug code, which virtually nobody cares about, etc., which can be misused to hide something from plain view – but only if your editor is stupid enough.

Ted November 1, 2021 10:05 PM

@Clive, Z.Lozinski

Thank you for replying to my question about defense strategies. I am doing my best to keep up with your impressive knowledge.

Based on your responses, I am wondering if I went the wrong direction with my question. I saw someone tweet an excerpt from the conclusion of the paper.

The fact that the Trojan Source vulnerability affects almost all computer languages makes it a rare opportunity for a system-wide and ecologically valid cross-platform and cross-vendor comparison of responses. As far as we are aware, it is an unprecedented test of the coordinated disclosure ecosystem.

Do you think this is more about a system wide assessment than a particular vulnerability?

MarkH November 1, 2021 11:42 PM

I’ve been startled to see what (by my lights) look like airy dismissals of the needs of non-Anglophone cultures and application domains.

Let the fuzzies sort it out amongst themselves?

The w*gs start at Calais?

White man’s burden and all that?

I’m hopeful that broader perspectives will guide the real-world mitigations.

Clive Robinson November 2, 2021 1:06 AM

@ echo,

You could spec this out in five minutes on one page of A4 instead of gold plating with finger pointing and jargon.

Yet your bullet list, misses the point entirely.

The problem is not the source code file, or the way the compiler or interpreter process it.

The problem is actually the editors and other programs that display the source code file to humans hide information from view.

And because of that failing you can hide nasties and vulnerabilities in the source code file that are “valid code”.

Making an arm-waving “Validate your inputs” comment suggests strongly that you really do not understand the issue. The source code is “valid”.

Likewise, “Keep it simple” is at best a “motherhood and apple pie” platitude, and completely irrelevant, because this is not an issue of “complexity” but of “malicious behaviour”.

Clive Robinson November 2, 2021 2:35 AM

@ MarkH,

I’ve been startled to see what (by my lights) look like airy dismissals of the needs of non-Anglophone cultures and application domains.

Because you are looking at things the wrong way.

In effect you are seeing the swelling in someone’s leg and saying it needs a compression stocking. Whereas the swelling is a side effect of a broken bone that needs setting and splinting, and then the swelling will go away anyway.

Language is a communications tool, designed by humans for humans.

However we do not all speak the same language or write our words in the same way.

Does not speaking Finnish make you any less of a human? How about Cantonese? Or how about one of the ~30 African languages that use click consonants?

So why apply the same bad reasoning to a computer program that has a formal language of its own?

Your,

airy dismissals of the needs of non-Anglophone cultures

Is nothing whatsoever to do with the computer language, but everything to do with the “presentation” of the source code to humans, which is totally irrelevant to the compiler or interpreter.

Whilst I disagree –for other reasons[1]– with those who say code should not have comments, they do have a valid point about learn to “understand the code the way the computer does”. That is treat it as a new language and learn to speak it correctly.

So why should there be any comments being sent to the compiler or interpreter any way?

The simple answer is that they should not…

However, there is a reason why comments do go into the compiler, and that is “tracing” in “debugging” from a low level (something embedded systems programmers used to do one heck of a lot of last century).

The problem with comments is character sets and the way they are presented to humans. For arcane reasons to do with teleprinters – which predate computers by half a century or so – being able to “over strike” was desirable, and it went back to the earliest manual typewriters and the ability to “underline” text for titles and headings. It also allowed “overtype” to get accents or make text bold.

It is the reason why we have in-band control characters for,

1, Carriage return (no line feed).
2, Line feed (no carriage return).
3, Backspace.
4, Vertical feed (opposite of line feed).

The problem is what works nicely on paper mostly does not work at all on screens.

That is, “overtype” on paper allows you to build up a character by adding additional print. But on a screen it is nothing of the sort; it is most often simply a replacement.

And “backspace” and “overtype” on screens at the user presentation level, when combined, allow information to be “hidden from view”. So a malicious user can hide from the user’s view what a compiler or interpreter will see and act upon.

There is no real change you can make to a compiler or interpreter that will stop this because that is not where the “information hiding” occurs.

To solve it, the tool chain would have to pre-process the file so it is an exact match for what the editor presents to the user’s view…

So to do that the compiler would have to know all about your editor… which we do not really have a mechanism for currently, and the level of complexity it would add would in all probability open up other vulnerabilities.

But this “hiding” trick is not limited to just abusing comments; you can do it to the actual code as well, though it is more difficult.

If you just “flag up” the backspace or other presentation-level control characters, you will have to restrict comments etc. as well.

So you are “Caught twix t’Devil ‘n t’deep blue sea”.

Importantly, it applies just as much to WASP English as it does to any other human-to-human language, so it does not involve any,

airy dismissals of the needs of non-Anglophone cultures

[1] My view, coming from a machine code and lower-level background, is that comments are part of the specification translation. That is, they are an abstraction of the high-level specification into a description of function, so form a “functional language” at an intermediate level. If you have written your comments correctly, the assembler code for one processor could be stripped out and replaced with that for another, and the comments would be no different.

Dave November 2, 2021 2:36 AM

You can do this even without Unicode tricks. Years ago I messed with a login-check routine on an 8-bit character set system, updating it to use two variables named allow_login, one spelled with a Cyrillic ‘o’; I carefully checked and set one while, at one point, allowing login based on the other. If you jumped through a few other hoops to pass some checks (couldn’t just allow anyone in) you could get in and bypass the password check.
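
A cheap way to catch that class of trick today (a rough sketch, not a standard tool) is to flag identifiers that mix scripts:

# Sketch: report which scripts an identifier draws on, using Unicode character names as a rough proxy.
import unicodedata

def scripts(identifier):
    found = set()
    for ch in identifier:
        name = unicodedata.name(ch, "")
        for script in ("LATIN", "CYRILLIC", "GREEK"):
            if name.startswith(script):
                found.add(script)
    return found

print(scripts("allow_login"))          # {'LATIN'}
print(scripts("all\u043ew_login"))     # {'CYRILLIC', 'LATIN'} (order may vary) -- Cyrillic small o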

ResearcherZero November 2, 2021 4:28 AM

“It seems having a process of handling security reports is one of the better solutions to this problem. So making it easier for security researchers to submit such reports seems vital.”

hxxps://www.rtcsec.com/article/killing-bugs-one-vulnerability-report-at-a-time/

Ted November 2, 2021 5:22 AM

@Andrew, Winter

Thank you for the XKCD comics. I very literally learn things from those. 🤓

Also did you see that the Unicode v14.0 update included changes to Vertical Text?

Section 6.2, Vertical Text was clarified to indicate how the Bidirectional Algorithm is (or is not) used when text is laid out in vertical orientation.

Most importantly, I’m guessing everyone saw the 37 new Unicode emoji? I can’t get them to display yet, but maybe next year?

http://www.unicode.org/versions/Unicode14.0.0/

Sut Vachz November 2, 2021 6:23 AM

To steal from a Pet Shop Boys song, “code” is a bourgeois construct. Real programmers use machine language. (I know, that’s code too.) At least, they debug using machine language dumps. I knew of one who worked that way. His byword was “use a stick shift, not an automatic transmission”.

Winter November 2, 2021 6:56 AM

@Sut Vachz
“To steal from a Pet Shop Boys song, “code” is a bourgeois construct. Real programmers use machine language.”

Nah, real programmers use a magnetized needle and a steady hand:

https://xkcd.com/378/

[Randall goes one step further]

echo November 2, 2021 7:12 AM

@Clive

I’m not missing anything Clive. You’re not reading what I said the right way. Go back and read both comments again.

If this was a problem which landed on my desk I’d just get on with it because in all honesty I’d rather carve code than sit around listening to old men smoking clay pipes yammering on about it.

@MarkH

I’ve been startled to see what (by my lights) look like airy dismissals of the needs of non-Anglophone cultures and application domains.

In the real world a fair chunk of non-native English-speaking coders code in English and, if need be, use their local non-Unicode character sets. As for Unicode in code files, I’ve never used it myself. I don’t personally see Unicode as an issue if people (either manually by converting from Unicode or automagically via tools) validate their inputs and handle them correctly.

As for structural racism issues (or sexism and ageism etcetera), that’s a big topic hardly anyone on either side handles well. It can become polarised and toxic very quickly, which is why the last time political issues came up relating to a certain country which peddles spyware, I skipped the topic.

You need to work through the policy stack very carefully. It’s a waste of time pointing out the legal issues as nobody reads them. Unlike most people I do my background research, and yes I have read treatises on Iranian law and Saudi Arabian law, as well as reading up on Kenyan law, and South African law, and Hong Kong law, and EU jurisprudence and case law, as well as the Russian Constitution among other documents. I also read academic papers and commentary from affected communities. Unfortunately we have a low-information media and reactive politicians, and neither is helping.

Loads of applications today come with translation baked into the application and the documentation. Yes, it actually does cost money to get it done properly and in a timely fashion, and yes, some vendors do community-outsource this, but it is a thing. Likewise for non-English vendors. As we know, some translation has been done horribly, but things are improving. China and Japan employ more native English-speaking translators now, and possibly for other languages too, but like I said it costs money.

As for non-English code? What’s sauce for the goose is sauce for the gander. I’m not learning [random foreign language of choice] just to use a piece of code.

Rome wasn’t built in a day.

For safety reasons, in some contexts there are times when I will insist on a native English speaker or someone who is predominantly of the same culture. That’s not racism; that is simply not wanting to be pranged on a lack of grasp of subtle issues, or where there may be an ideological difference of opinion. I’ve had clients who are non-native English speakers and their grasp of English is very good, and they are more pleasant to work with than some 100% white English. The occasional one had bad English, but they make an effort and they are agreeable to work with, so I can work past these differences. There are others who I have had to decline, especially if I think we’re not on the same page.

Life is short.

Winter November 2, 2021 8:09 AM

@Dave C
“It’s almost impressive how many commentators on here don’t “get it” and propose expensive complicated “solutions” ignoring the reality…”

Indeed, the problem is not that coders would use Unicode, but that the tools will render Unicode if they find it. The coders can be diligent with their use of Unicode; the attackers will be very diligent too.

That said, if your default user or coder does not understand English, then there are many, many situations where it is better to put a text or comment in a language they understand in the file with the source code.

Anyhow, what law of nature prescribes that all coding should be done in pseudo-English? For most humans, English is a difficult language to understand.

There already is a language that codes in Classical Chinese:
https://spectrum.ieee.org/classical-chinese

Note that the Classical Chinese script is made in hell (a higher level than Japanese, I must say). There were very good reasons it was “reformed” after the 1949 revolution.

Clive Robinson November 2, 2021 8:23 AM

@ Ted,

With regards,

“The fact that the Trojan Source vulnerability affects almost all computer languages makes it a rare opportunity for a system-wide and ecologically valid cross-platform and cross- vendor comparison of responses”

Is a statement that paraphrases the issue somewhat…

The reality is,

1, Only relatively modern compiler etc. tool chains support Unicode.
2, Other programs that support Unicode are vulnerable in some way (even if it’s a DoS).
3, It is not just Unicode that has in-band control codes that allow this hiding of information in files and serialised streams.

Thus the actual affected code base is a lot, lot larger than the “computer languages” of compilers and interpreters…

But look at it in terms of a generalised computing stack model. It is a valid attack at the presentation layer, one of the uppermost layers of the ISO OSI seven-layer model. But it works “up” to the user layer to produce vulnerabilities further down…

Take a moment to think about that.

Yup, it is an attack that works up the stack to attack the “human”. And by successfully attacking the human, it gets a vulnerability into the code at a lower level.

Does that make it some form of “social engineering”[1] attack?

Some will no doubt say that it is a “steganographic”[2] attack and I guess linguistically they would be correct.

Though both categories feel wrong 😉

It is however an attack using valid “in-band signaling” of a “serialized stream” which when people realise the actual extent, makes it a very big “class” of vulnerabilities in which many many “instances” are very probably going to appear.

Which is why, yes to a limited extent did make it,

“a rare opportunity for a system-wide and ecologically valid cross-platform and cross- vendor comparison of responses”

If and only if (iff) the vendors chosen were providing programs that were actually affected by it.

This is where I am probably going to get shot at by a few people 😉

Were the compilers and interpreters vulnerable?

Arguably no. That is, they received a valid file within their language specifications and processed it correctly. Does this mean their specifications were incomplete?

Arguably no. No program can be all things to all situations; it is logically and practically impossible, because things will break. Thus it is highly undesirable to even try to make a program like a compiler or interpreter do this. Look at it this way: the file is valid, the syntax of the code is valid, and so on up. It is only when you get up into a much higher level, in the functional business logic of the application being compiled or interpreted, that the choice can be made, and the compiler or interpreter has no way to know if the code is intended or not, nor should it be able to decide. The best it can do is warn about the use of “in-band signalling”, which is to be expected under many use cases…

In short, as the compiler or interpreter is part of a tool chain, preceding parts of the tool chain should be responsible for picking it up.

Thus logically you need to move such functionality where it belongs, which is in or immediately following the human user interface. That is the editor etc.

So the question arises as to which vendors were contacted.

Having had the opportunity to skim-read the main body of the paper, there is not a list, though reference is given to using MITRE as a “last resort”.

Without this list, we have no way to independently judge if the responses given by the vendors were appropriate or not…

One of the major reasons for scientific papers being withdrawn from journals is that the published results may not reflect the actual data set. This is not to say that there is in any way an attempt by the authors to deceive, either consciously or subconsciously; they may use inappropriate analysis or be unaware of other pertinent information.

To demonstrate this, there is the line about “Physics is a series of lies, each more accurate than the preceding ones”. Sir Isaac Newton came up with a theory of gravitation; it’s elegant, it’s simple, and it’s more than good enough to get you around the solar system. As our ability to measure improved, certain inaccuracies were found that Newton could not have known about. Just over a century ago Albert Einstein came up with a new idea; it was found that whilst this could account for the anomalies, it’s not suitable for navigation as there is a problem. Newton’s work inherently assumes that the force of gravity is not constrained by the speed of light, and it produces stable results. Einstein’s work does take the speed-of-light constraint into account, but it does not produce stable results, so without extreme caution it won’t get you around the Earth, let alone the solar system…

Now getting back to the paper, why has the list of vendors and their responses not been published in the paper so others can analyze them independently?

The answer is “human nature” at one of its basest levels. If such information was publicly available, there would be consequences.

Unfortunately the likes of certain vendors, Oracle being one, have repeatedly made false claims about the function[3] and security of their products for fiduciary gain (i.e. technically fraud and extortion). Then senior Oracle staff, including chief security officer Mary Ann Davidson[4], have effectively threatened people with “breach of contract” and other legal sanctions, such as implying “false light reporting”, for investigating the security of Oracle’s products. The customers had little choice, because they had become aware by a process of elimination that the products were vulnerable and had been exploited to the customers’ loss…

So some vendors are “trigger happy” and have reached for “legal sharks” as a first solution and some still will.

Which kind of makes it tough as a researcher…

[1] “In the context of information security, social engineering is the psychological manipulation of people into performing actions or divulging confidential information.“,

https://en.wikipedia.su/wiki/Social_engineering_(security)

[2] “Steganography is the practice of concealing a message within another message or a physical object. In computing/electronic contexts, a computer file, message, image, or video is concealed within another file, message, image, or video.“,

https://en.wikipedia.org/wiki/Steganography

[3] https://www.pcworld.com/article/472852/university_accuses_oracle_of_extortion_lies_rigged_demo_in_lawsuit.html

[4] https://www.digitaltrends.com/computing/oracle-cso-blog-security-testing/

Peter A. November 2, 2021 8:47 AM

@MarkH: “I’ve been startled to see what (by my lights) look like airy dismissals of the needs of non-Anglophone cultures and application domains.”

There’s nothing said here about non-Anglophone user interfaces. They are completely fine, welcome, and needed. There are multiple methods to develop software in a way that makes it relatively easy to “translate” by systematically replacing or plugging in text strings and other elements to be presented to the user.

Also, everybody is free to use their own language (even if transliteration is needed) in code comments, text strings not intended for user consumption (debug, logging, diagnostics), or the names of programming constructs (variables, functions etc.). It is just unwise from a maintainability point of view. The next person to take over the code base may not speak that particular language. I remember I had to learn a bit of Portuguese while hacking the Timex FDD 3000 CP/M BIOS. I could have done without it, by just analyzing the machine code, but the attached listing with comments was quite helpful; it would have been much more helpful if the comments and labels had been in English. Some labels were too cryptic for me to decipher, though, due to a 6-character limit IIRC. English abbreviations would have been more readable for me, even as a non-native speaker.

The situation described above was only a mild issue, as the text was ASCII, so I could at least recognize the glyphs and look up the words in a paper dictionary. The problem is amplified towards infinity if a language uses a script that the next person has no familiarity with and simply cannot tell the difference between characters. I’ve seen a Python script that contained Sanskrit numerals (as a practical joke, apparently). It worked fine, but was unreadable until I converted all of them into “normal” numerals. If all comments and identifiers were written in a script other than Latin, it would be totally unreadable garbage for anyone other than the original authors and their kin. This is why I despise including non-ASCII characters in the language itself (string literals are a bit of a different issue[1]).

There’s a reason doctors still learn Latin and science papers are often published in English, even if all authors are non-native speakers, if at all.

[1]
For example, in Python 3.9.x I’m quite fine with this:

>>> penis = 'хуй'

but this, while syntactically correct, is unacceptable for me:

>>> хуй = 'penis'

Winter November 2, 2021 8:58 AM

@Peter A
“There are multiple methods to develop software in a way that it is relatively easy to “translate” by systematically replacing or plugging in text strings and other elements to be presented to the user.”

The problem is that the same tools are used for the computer code and the non-code text files. If the IDE sees Unicode, it must render it, because it could be a valid text file.

Clive Robinson November 2, 2021 9:19 AM

@ echo,

If this was a problem which landed on my desk I’d just get on with it because in all honesty I’d rather carve code than sit around listening to old men smoking clay pipes yammering on about it.

And you would fail, plain and simple.

I’ve already indicated why with valid argument the problem is not solvable at the compiler or interpreter.

The fact you still don’t grok that says volumes.

But you have a happy life just clicking on those keys,

https://xkcd.com/722/

P.S. I don’t smoke, clay pipe or otherwise, something I suspect can not be said of you, perhaps you are overdoing the “herbal tobacco” again.

Clive Robinson November 2, 2021 9:23 AM

@ Dave C,

Thankfully I read On “Trojan Source” Attacks before these comments.

I’ve just read it, and yup, he sees it the way I do…

Freezing_in_Brazil November 2, 2021 9:33 AM

Non-native English speaker here

I have lived all my life suspicious of diacritics and special characters, like those used in the Portuguese language. I never use them in code and I know many programmers who avoid them too.

Portuguese words like identificação and memória can be used – both in structure nomenclature[1] and in comments – without diacritics or special characters [identificacao, memoria]. They remain perfectly readable for the Portuguese programmer while keeping the program safer.

[1] Variables, constants, function, etc names

Ted November 2, 2021 10:15 AM

@Clive

Now getting back to the paper, why has the list of vendors and their responses not been published in the paper so others can analyze them independently?

Yes, interesting thought. Adding to that (and as @SpaceLifeForm said) were there even examples of malicious code in the wild?

To elicit a CVE (CVE-2021-42574) there must have been some demonstrable problem. How were they able to trump up excitement?

From the paper:

Neutral disclosures like those found in academic papers are less likely to evoke a response than disclosures stating that named products are immediately at risk.

I guess it was determined that something was at risk? Also, another point that I thought was interesting was:

Novel Vulnerability Patterns…We observed a tendency to close issues immediately as representing no threat when they did not align to something well-known and easily evidenced, such as SQL injection.

I’ve greatly enjoyed reading the excellent discussion here on some of the ins and outs of programming – most of it honestly over my head. Maybe people fear there is enough complexity in software that it would be a little ducky to avoid planning for the unexpected. 🤷

MarkH November 2, 2021 11:06 AM

@Ted:

To elicit a CVE … there must have been some demonstrable problem. How were they able to trump up excitement?

“Vulnerability” refers to the potential for harm. It’s neither necessary nor desirable to limit disclosure databases to vulnerabilities already under malicious exploitation!

The general idea is to close the barn door before the cows escape, when that’s an option.

As for trumping up excitement, Anderson is one of the best in the world in research and education concerning security engineering. I haven’t known him to be sensationalist.

It’s a real vulnerability; it might well be practical to exploit in a variety of circumstances; it could be used to do great harm. What makes it worthy of a blog post (such as Bruce’s here) is its extremely broad scope.

Bear November 2, 2021 11:11 AM

A viable fix, IMO, is to handle bidi-override control characters exactly the way we handle backspace control characters.

There are no literal backspace characters in any source code for any non-esoteric language anywhere. They are intercepted by the editor and treated as editing commands, not as characters that pass through the editor into the source file. When we mean for a literal backspace to appear in a string, we write \b – or whatever that language uses to represent it. In fact a literal backspace character in any source code file, no matter what, should cause the compiler to barf.

We’d also need to ban the use of RTL punctuation outside of string constants and comments.

Winter November 2, 2021 11:21 AM

@Ted (MarkH)
“As for trumping up excitement, Anderson is one of the best in the world in research and education concerning security engineering. I haven’t known him to be sensationalist…”

Ross Anderson did write the book on “Security Engineering” [1].

If Anderson writes
We have discovered ways of manipulating the encoding of source code files so that human viewers and compilers see different logic.
You already know it is time to take action now.

[1] Security Engineering 3rd edition is available on paper, an older edition is online (scroll down):
ht-tps://www.cl.cam.ac.uk/~rja14/book.html

Sut Vachz November 2, 2021 12:25 PM

What I want is a language that makes it clear what I am trying to do, so that the compiler can hammer straight my bent code nails.

Z.Lozinski November 2, 2021 3:12 PM

@Bear

We’d also need to ban the use of RTL punctuation outside of string constants and comments.

My reading of the “Trojan Source” paper is that this is not sufficient.

The problem is that creative use of bidi control codes breaks the string-constant or comment lexical elements of the programming language. You cannot guarantee that a bidi control is wholly contained within a string constant or wholly contained within a comment. The programmer sees one thing, the language processor sees something different.

From my own experience, I’m all in favour of development tools making bidi and DBCS control codes visible. A programmer’s editor should not be hiding these.

In the meantime we need to scan source for any bidi control codes.

SpaceLifeForm November 2, 2021 3:43 PM

BNF is your friend

This is seriously not a huge problem.

You just make sure that problematic symbols can never reach the lexer.

You can allow UTF-8 in String Literals or Comments, but force pure 7-bit ASCII to be used for everything else. And that everything else is what the compiler or linker deals with in the main, in terms of logic, machine code generation, and symbol resolution.

Keep the core as simple as possible.

Like it always has been.

hxtps://en.m.wikipedia.org/wiki/Backus%E2%80%93Naur_form

Then read up on Yacc and Bison, and then on Lex and Flex.

The entire process is NOT simple.

So, why make it more complex?

Ted November 2, 2021 5:12 PM

@MarkH

It’s neither necessary nor desirable to limit disclosure databases to vulnerabilities already under malicious exploitation!

Yes, reporting facts is more clear cut than analyzing them. I am trying to make sense of this with extremely limited experience, so I very much appreciate you giving me more information to consider. Heap it on I say.

I will try to do some more reading on the vulnerability and coordination process. I don’t know if I’m the only one who thinks this, but I’m pretty darn sure I’m not the world-renowned expert here 🙂 I may actually have more confidence in that than just about anything ever.

But about sensationalism, what parts of this process do you think do better with more attention?

@Winter

You already know it is time to take action now.

The CVE severity base score is listed as 9.8. That is high isn’t it?

MarkH November 2, 2021 5:50 PM

@Ted:

Don’t be in awe of the expertise in the commentariat! Altogether, there’s an impressive volume of experience and knowledge; however, this does not preclude a percentage of foolish and incorrect comments.

what parts of this process do you think do better with more attention?

The paper reports that some of the programming resource organizations the authors contacted either didn’t respond, or indicated that they don’t plan to patch against the vulnerability.

More attention to the problem might encourage more action toward fixes.

P.S. If you’re new to security, you might not be aware that attacks on foundations are more useful to malefactors — and more dangerous to victims — than attacks on the “upper works.”

Software toolchain attacks are especially insidious, and InfoSec security people are inclined to take them seriously.

Ted November 2, 2021 5:56 PM

@MarkH

however, this does not preclude a percentage of foolish and incorrect comments.

Actually comforting. thank you

MarkH November 2, 2021 6:01 PM

@SpaceLifeForm:

The essence of the problem isn’t that Unicode symbols reach the compiler or interpreter. Outside of string literals, almost no language tools can “digest” them as semantic inputs anyway and would just throw error messages.

The essence of the problem is that by one or more views of source files, the visual presentation implies semantics distinct from the compiler’s interpretation.

Filtering what the lexical analyzer sees can not remedy that disparity.

Clive Robinson November 2, 2021 6:23 PM

@ SpaceLifeForm,

BNF is your friend

Yes it removes ambiguity in LEXing but that is not relevant here.

The source code file is serialised. The vulnerability is put into the file; it is syntactically and lexically correct code. The compiler or interpreter reads it in and treats it as valid code, as it should do.

At some later point in the file there is a bunch of control characters serialised in a perfectly valid comment or data declaration of some form, so the compiler or interpreter reads it in and again treats it as valid, as it should do.

Take a comment that contains a couple of hundred backspaces: as far as the compiler or interpreter is concerned they could be the same number of ‘A’ or ‘z’ characters; it treats them all the same, and apart from allocating space in the heap and putting them there it does nothing with them.

That is, the backspace or any other character in a comment or data declaration does not get interpreted in any way (unless the data gets used in the program).

As I’ve said a couple of times now, the problem is not the compiler or interpreter, which is why you cannot fix this vulnerability in them. Trying to would be a monumental exercise in futility and would in all probability create new vulnerabilities.

The problem is entirely in the editor or other application that pushes the file out at the presentation level to the users screen.

In effect what happens is,

The file is read in character by character. Each is either written into the display memory or, if it is a control character, interpreted.

So,

1, The vulnerability gets written to display memory.
2, The string of backspaces moves the display memory write pointer back before the memory that holds the vulnerability.
3, The characters following the string of backspaces get written into the display memory, overwriting the vulnerability.
4, The display memory gets output to the screen.

That is, the editor, browser, or other app that displays the file contents on the user’s screen never displays the vulnerability code, because it correctly interpreted the control characters in the file.

The solution is in the presentation layer application that displays the file, it should not interpret control characters.

Old-style *nix CLI apps that output files to Standard Out and Standard Error as a general rule do not interpret the control characters; they just replace them with a ‘.’ or similar.

So much for the ASCII backspace method; Unicode is on steroids in comparison. It is capable of oh so much more, so much in fact that it is distinctly problematical to resolve.

To do so would require the equivalent of a syntax parser to isolate where Unicode control characters can be safely interpreted and where they cannot.

Oddly perhaps, a few editors already have much of the required code for this, as they do syntax highlighting already…
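
The ASCII half of this is trivial to demonstrate (a terminal-dependent toy, nothing more):

# Toy: the stream contains "evil();", backspaces, then "good();".
# A display that interprets BS shows only "good();", but the data holds both.
import sys

payload = "evil();" + "\b" * 7 + "good();\n"
sys.stdout.write(payload)                # most terminals display just: good();
sys.stdout.write(repr(payload) + "\n")   # what was actually written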

MarkH November 2, 2021 6:29 PM

An example of a more benign form of the appearance vs meaning danger is the FORTRAN programmer intending to code

DO 10 J = 1, 20

(meaning iterate the following statements up to line 10 with variable J initially assigned as 1, and incrementing in each iteration, terminating after iterating with J equal to 20)

but inadvertently coding

DO 10 J = 1. 20

FORTRAN actually deleted all white space prior to analysis, so in

DO10J=1.20

the substring “1.20” was interpreted as a floating point constant … the statement was interpreted as an assignment … and DO10J was interpreted both as an implicit variable declaration and an assignment lvalue.

In this example the mistaken statement was visually distinguishable from the intended one, though it’s easy to imagine “seeing” the dot as a comma because DO statements always have a comma there.

Supposedly, a U.S. interplanetary spacecraft was lost to such a mistake.

SpaceLifeForm November 2, 2021 6:36 PM

@ MarkH

BNF is your friend

The essence of the problem isn’t that Unicode symbols reach the compiler or interpreter. Outside of string literals, almost no language tools can “digest” them as semantic inputs anyway and would just throw error messages.

That is a Feature, not a Bug.

MarkH November 2, 2021 7:03 PM

@SpaceLifeForm:

It seems to me that you’re not understanding this problem.

An attacker exploiting this vulnerability will, of course, make sure that the malicious source compiles OK, and will do so taking account of the compiler configuration under attack.

But code reviewers may see different code from what the compiler sees. That’s the problem, and keeping Unicode away from the lexical analyzer doesn’t fix it.
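
To make the divergence concrete, here is a toy illustration in Python (a hedged sketch, not the paper’s actual payload; how the override line is painted depends entirely on the viewer’s bidi handling):

    # Two identifiers most renderers draw identically, but which every
    # compiler treats as distinct: Latin "scope" vs. one containing a
    # Cyrillic letter (the homoglyph attack, CVE-2021-42694).
    a = "scope"
    b = "sc\u043epe"          # U+043E CYRILLIC SMALL LETTER O
    print(a == b, ascii(b))   # False 'sc\u043epe'

    # A right-to-left override hiding in a "comment" string: the logical
    # order below is what diff tools and compilers parse; what a reviewer
    # sees depends on their editor's bidi rendering.
    note = "# reviewed \u202edeweiver ton"
    print(note)               # painted per the viewer's bidi rules
    print(ascii(note))        # the logical order the toolchain actually sees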

SpaceLifeForm November 2, 2021 7:23 PM

@ MarkH

Fortran has ambiguous grammar

Interesting old link from 1977

hxtps://apps.dtic.mil/sti/citations/ADA039969

Elsewise, just search on (Fortran grammar)

echo November 2, 2021 8:28 PM

@Clive

Pretty much everyone is repeating in one form or another what I said from the start and expanded on later. They’re adding their own focus and interpretation or a bit more gold plating but it’s basically the same thing.

I just can’t be assed writing up pseudo logic or drawing a flowchart, because if this problem landed on my desk I’d rather be getting on with that than talking about it. Why? Because when my coder mind switches on I go into the zone. It’s quicker to do stuff than deal with cognitive chatter or anxieties, because that’s a good way to get coder’s block.

Back when I did COBOL I made massive use of templates. When I did C/C++ I used a severe subset, because that got 99% of the problem solved, plus the code was simpler and more maintainable. Depending on the particular problem being solved by the code, anywhere from 10% to 50% of the line count would be in-line documentation.

1. Do we filter on ASCII or Unicode, depending on how lazy we are?
2. Validate input.
3. Do we have a problem with character codes changing presentation? If so, issue warnings on load and save, and outside of the valid set provide warning highlights and a toggle to a human-readable format, then delete/modify as appropriate. There’s quite a bit of logic you need to do here, but you should be able to come up with a solid reference implementation usable across a wide range of text types, from filenames to data fields to long-form text. (A rough sketch of this step follows the list.)
4. Whatever we get here will be clean so we can compile.
5. Done.
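
A minimal sketch of step 3 (illustrative only; the set of “suspicious” code points below is a small hand-picked sample, not an official list):

    # Flag code points that change presentation, so an editor can warn on
    # load/save and offer a human-readable toggle. The set is a small
    # illustrative sample, not an exhaustive or official list.
    SUSPECT = {
        "\u202a", "\u202b", "\u202c", "\u202d", "\u202e",  # LRE RLE PDF LRO RLO
        "\u2066", "\u2067", "\u2068", "\u2069",            # LRI RLI FSI PDI
        "\u200b", "\u200e", "\u200f",                      # ZWSP LRM RLM
    }

    def warnings_for(text):
        return [(i, "U+%04X" % ord(ch)) for i, ch in enumerate(text) if ch in SUSPECT]

    print(warnings_for("total = a + b \u202e// looks fine"))   # [(14, 'U+202E')]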

I’ve probably missed something but like I said I can’t be assed with it. The fix is already in and a zillion people will be looking at it afterwards and I’ve got better things to do.

Always, always, always validate your data. Never, never assume. Yes, it can add code in bigger systems that you could in theory delete and everything would carry on regardless, but when some smartass meddles with another part of the system, don’t blame me if it turns to poo. If you’re a speed freak you can add conditional compiles or use more obscure or dangerous code functions to optimise code, but that’s another thing. For people who want to live dangerously by default: if you change one line of code you’re going to have to validate the lot. Good luck justifying the cost of that code audit, or picking up the pieces of your career if it goes wrong.

I’m not citing myself as a representative sample, but that’s what a coder is going to focus on, Clive. They’re not going to want to hear anyone’s life story or jargon flying around the room. It’s kind of a given.

There’s nothing magic here. You’ve got variables and logic and the basic building blocks of a language, then validation. Validation was hammered into my skull because the people who taught me had a background in data processing (and maths and engineering, funnily enough). Most worked for, or consulted for, what was then called British Aerospace (now BAE), or had half a dozen scrambled letters after their name. It stuck. My validation and documentation was so bullet proof it got their positive attention. Yes, I got into trouble in lots of other ways, because I like pretty pictures and making systems do what they’re not necessarily designed to do, but on validation I was pretty faultless. Put it this way: no wiseguy hacker is getting past me, because I probably thought of it myself, hence my validation putting input on a very short leash.

So validate, validate, validate.

name.withheld.for.obvious.reasons November 2, 2021 9:36 PM

@SpaceLifeForm

As with any language, adaptations are made that serve one particular feature/customer/developer/marketing campaign.

Luckily for COBOL (unless you include SNOBOL), Fortran, Ada, Prolog, Forth, and maybe Smalltalk, the emphasis has focused largely on the language and its functionality. A few are the esoteric large-iron development platforms; OCCAM comes to mind. C/C++ have probably found the widest adoption, thanks primarily to Linux. Visual Basic, though somewhat useful, is more problematic across platforms/OSes. To my mind, MASM is probably the way back from the forest of confusion that exists today. I have worked on production compiler platforms and it is not pretty, and I cannot say it has gotten a lot better. The feature race always wins out over the robustness characteristics. So choose your language wisely, Luke. And may the Fortran force be with you. (NOTE: I am a member of the First Assembly Language Programmers of God, not to be confused with the Peoples Front of Judea.)

MarkH November 2, 2021 10:31 PM

@name.:

Python (Monty) references are welcomed.

I seem to recall that long ago, Bruce linked to audio from the cat licence sketch.

SpaceLifeForm November 3, 2021 2:42 AM

@ name.withheld.for.obvious.reasons, MarkH, Clive, ALL

Code review scheduled for Friday.

The code is written in WhiteSpace. Please bring a sheet of blank paper, so we can review.

Seriously though, this instant problem is actually multiple problems.

As Clive mentioned, it depends upon the medium. Monitor vs paper.

But, why does anyone really believe that code reviews really work?

In my experience, unless the reviewer is already intimately familiar with the code being reviewed, they will likely miss things. Even if they are looking at true actual valid source code.

If an attacker can control the build environment, then the attacker can make the reviewer(s) see different code than what the toolchain sees. Which is the essence of the instant problem. Except the attacker does not need to control the build environment; they just need to get a malicious file accepted by a developer who does not see a problem.

It only takes one commonly used source file.

If the attacker can do that and can subvert a header file that is NOT reviewed, then the review is worthless.

Which is why Clive mentioned running it all thru the preprocessor, and then reviewing that.

But, if the attacker can subvert the build environment, that will not work either, because the attacker can switch files temporarily at build time. (See SolarWinds)

Review Time is NOT Compile Time.

We want to detect a subverted build environment, and FAIL FAST.

And FAIL HARD. We want to crash the machine (with dump) in order to catch the attack, collect the artifacts.

First line of defense is KISS.

So, if a source file looks ok on paper or monitor, but is actually malicious, then the defense has to occur in the build tools.

So, scan all of the input files, and disallow anything that is not 7-bit ASCII.
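
Something along these lines is enough to fail fast (a sketch; the build-system wiring and file list are left to the reader):

    import sys

    # Build-time gate: reject any source file containing a byte outside
    # 7-bit ASCII and exit non-zero so the build fails fast and hard.
    def check_ascii(paths):
        findings = []
        for path in paths:
            data = open(path, "rb").read()
            for offset, byte in enumerate(data):
                if byte > 0x7F:
                    findings.append((path, offset, byte))
                    break                    # one finding per file is enough
        return findings

    if __name__ == "__main__":
        bad = check_ascii(sys.argv[1:])
        for path, offset, byte in bad:
            print("NON-ASCII byte 0x%02X at offset %d in %s" % (byte, offset, path))
        sys.exit(1 if bad else 0)

Wire it into the build so that a non-zero exit stops everything dead.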

If you want to allow UTF-8 in your source code for string literals or comments, then you have created your own additional headache: now you will need a parser so that legitimate uses are not flagged.

If you want to allow UTF-8 anywhere in your source code, you are invited to my Friday code review of the WhiteSpace program.

I just want to say one word to you. Just one word. Macros.

s/Macros/Plastic/

Macros are Plastic.

Some Plastics are very flexible, if you get my drift.

Clive Robinson November 3, 2021 4:55 AM

@ SpaceLifeForm, ALL,

But, why does anyone really believe that code reviews really work?

I don’t think they really do, but they are a check box on a “management approved” list, which came about from some assumed “Best Practice” list etc.

As I’ve mentioned before, I worked for a while at an organisation where management put what they viewed as the “most productive” programmers on producing production code features, and let’s say those they viewed as the “least productive” on code review[1]. What does that say?

Even a semi-smart programmer could get stuff past that code review process and into production, “Code signed sealed delivered it’s yours”[2].

But as you correctly note, code review, like “scrums”, is way, way up the chain, well away from code signing, and as Shakespeare’s predecessors observed, “There’s many a slip twixt the cup and the lip”.

Way, way back on this blog[3] you will find conversations about the failings of code signing and code reviews that covered this issue exactly. Yet here we are, still pointing out the same failing more than a decade later…

But let’s be honest, it’s now kind of widely acknowledged that “scrums” are in reality an abusive process that is ineffective, often wielded by certain types of people as a weapon against those they dislike or want to penalize. The same failings are also apparent with code review processes. What we need is effectively a “blind process” that cannot be used for “office politics”, and we don’t currently have one.

So time for my lament 😉

“Why is the IT industry not learning from even its living history?”

Any bets on how many more decades it will be true?

[1] The management appeared to think that fast production of marketing features that were, shall we say, forever “high in maintenance” whilst “low in use” was a winning strategy.

In theory a good code review process would have sent much of it straight to the “scrap yard”, because it had more holes than a pair of second-hand string underpants. But management would have instantly pulled the teeth. So all that came out of it were comments about “comments and style”, and I do not remember them ever finding actual faults, vulnerabilities, or backdoors…

[2] With apologies to Stevie Wonder and fans for mutilating “Signed, Sealed, Delivered I’m Yours”.

[3] This, from over 11 years ago, states exactly the conversation we are now having,

https://www.schneier.com/blog/archives/2010/03/back_door_in_ba.html/#comment-133824

But it was even at the time “old”; you can go back further and find discussions of the failings of both “code signing” and “code review”. I even outlined how malicious actors would behave,

https://www.schneier.com/blog/archives/2011/06/malware_in_goog.html/#comment-161737

The fact that the industry has not changed for the better in the slightest, as they say, “speaks volumes” about the industry, or more correctly those who manage it.

[4] Have a read of this very short article about the history and failings of assessing glucose levels in blood samples, to see that it’s not just the software industry’s QA processes where problems slip through,

https://academic.oup.com/clinchem/article/60/7/1025/5621704

just me November 3, 2021 5:42 AM

Can you remember the case of emoji SSIDs crashing smartphones? I hope there won’t be many CVEs like “three banana emojis can cause poop variable to overflow” in the future.

But seriously, checking the code for non-printable or non-standard characters (except in strings, names of variables, and so on) should be industry standard.

Clive Robinson November 3, 2021 7:34 AM

@ Matthias Wiesmann,

Interesting link, especially the small line that says,

“Published 8 years ago by Thias”

I’ve noted on several occasions that vulnerabilities discussed on this blog have in the past taken around eight years to become known in the industry…

But even back then, Unicode was just one of the more current of “code sets with control characters”.

ASCII has “backspace” and vertical tab, and goes back over half a century to around 1960, with its standard being published in 1963[1]. But before that, most “teletype codes” a century ago had a “backspace” or equivalent to enable the correction of mistakes, inherited from manual typewriters.

The first manual typewriter was made by Pellegrino Turri back two centuries ago, around 1801[2]… Though whether it had a “backspace” equivalent I have no idea. But he clearly had ideas about overstriking and similar, because he also invented carbon paper in 1808.

The thing about all these codes and the systems that “present on paper” is that the “overtype” is usually obvious and often quite useful. The move to initially Nixie tube, then “Cathode Ray Tube”(CRT), and later other electronic display systems changed the “backspace” and other control character effects from “overtype” to “overwrite”.

To most, the difference between “overtype” and “overwrite” is too subtle to notice, let alone think about… When they have, it’s because of the likes of umlauts (double dots over vowels). Often the double dots existed as a print head in their own right, so rather than have all the vowels repeated with double dots over them, you typed {vowel}{backspace}{umlaut}.

So whilst it is there as clear as day, the potential for backspace and other print-head control characters to be used for mayhem has passed people by, for probably three to five decades, that is since the use of paper-based user terminals declined in the 1970s and early 80s. At first the likes of early editors would have shown it up, but the first editor I can find that does not show it up is Microsoft’s DOS editor[3]…

[1] Of interest to cryptography is the all-important ASCII “NUL” control character, which is 0x00. The NUL could have had any value as a control character. But importantly, if you XOR or, more importantly, ADD 0x00 with any other value, it does not change. So if you use a One Time Tape(OTT) super-encryption system which has an alphabet whose size is not a power of two, as all teletypes do, then it provides an easy solution to the “overflow problem”. That is, if you receive a character that matches the one on your OTT you know a NUL has been sent and you just “drop it”. So as the sender, if the character you want to send, when added to the OTT character, “overflows” out of the teletype printing alphabet, you just hold the input and send a NUL. You keep doing this until the addition does not produce an overflow, and so send the actual character.

[2] This was long before what many claim as the “first” typewriter by “US patent date”, made by Christopher Latham Sholes of Milwaukee and others in 1868. What might be of further interest is that the prototype was built by the clock-maker and precision machinist engineer Matthias Schwalbach the year before.

[3] As Microsoft’s editor actually came out of their implementation of BASIC, the chances are that the “screen editors” in the BASICs of a myriad of home computers that loaded files from tape or disk also did information hiding via print-head control characters. I guess those that still have working ones can “test and report back”.

Frank Wilhoit November 3, 2021 8:02 AM

@lurker:

“…let text/word processors decide if and how they want to handle…” whatever (directionality is only one facet). They can only do that if the text is self-describing. That is what Unicode has come to be about. It is not easy, at all. Taking a step back, this is a prime example of the fact that machines are not good at what humans are good at. In order to pass off a human task to a machine, it typically has to be simplified, sometimes very drastically.

FA November 3, 2021 3:01 PM

@Clive

So as the sender, if the character you want to send, when added to the OTT character, “overflows” out of the teletype printing alphabet, you just hold the input and send a NUL.

Ehh, no, you send the OTT character. Which combined with the same at the receiver will produce a NUL. That at least is how Rockex worked (except that it used XOR instead of ADD/SUB, which changes nothing essential).

Quite a clever trick. Both the one-time key and the ciphertext would be just A..Z, while the plaintext could use the full Baudot symbol set.

bear November 3, 2021 4:32 PM

After a thorough review of the unicode bidi algorithm, I think I know how to deal with this. But it’s damned annoying, requires a rewrite of parsers and compilers, and more than a little cooperation from a programming editor for unicode.

The bidi state must be congruent to the program syntax tree structure. And the compiler must enforce it.

We can’t get rid of Unicode’s disastrous visual ambiguity (there are still uncountable code sequences that display exactly the same), but we can enforce that any sequence whose visible display diverges from a specific form corresponding to its sequence and semantics is a syntax error.

After traversing any syntax subtree (even a trivial one such as a comment or a string constant), the bidi state must finish in exactly the same state as when that syntax subtree started.
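
The balance check alone is the easy half, and might look roughly like this (an illustrative Python sketch; it deliberately ignores the implicit bidi rules and neutral characters that make the full congruence requirement hard):

    # Check that every bidi embedding/override/isolate opened inside a token
    # (comment, string literal, identifier) is closed again before the token
    # ends, so the bidi state leaving the subtree equals the state entering it.
    OPEN_EMBED = {"\u202a", "\u202b", "\u202d", "\u202e"}    # LRE RLE LRO RLO
    CLOSE_EMBED = "\u202c"                                   # PDF
    OPEN_ISOLATE = {"\u2066", "\u2067", "\u2068"}            # LRI RLI FSI
    CLOSE_ISOLATE = "\u2069"                                 # PDI

    def bidi_balanced(token):
        embed = isolate = 0
        for ch in token:
            if ch in OPEN_EMBED:
                embed += 1
            elif ch == CLOSE_EMBED:
                if embed == 0:
                    return False      # closes something opened outside the token
                embed -= 1
            elif ch in OPEN_ISOLATE:
                isolate += 1
            elif ch == CLOSE_ISOLATE:
                if isolate == 0:
                    return False
                isolate -= 1
        return embed == 0 and isolate == 0    # nothing left dangling at token end

    print(bidi_balanced("/* \u202e sdrawkcab \u202c */"))   # True
    print(bidi_balanced("/* \u202e sdrawkcab */"))          # False

Whether such controls are simply banned outright or required to balance like this is a policy choice; the full congruence with the syntax tree is the stricter requirement.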

This annoys the hell out of me particularly when it affects infix operators, because while the infix operator itself is at the same syntactic level as its context its branches are subtrees and can have their own bidi states. The annoying part of that is that most of the characters we want as infix operators are marked by Unicode as bidi-neutral, which means we have to stick explicit bidi-override control codes in specifically to force the operator to conform to the requirement.

Let’s say an infix operation with a bidi-neutral operator appears in an LTR context but both its subtrees are RTL. For example, an assignment using a bidi-neutral ‘=’ from an RTL Hebrew-alphabet variable to an RTL Arabic-alphabet variable, in the consequent clause of an LTR Roman-alphabet if statement. That is, in sequence order we have

if (…) var1 = var2

Program logic says that ‘=’ sign is LTR because the operator itself is in the LTR context of the ‘if’ consequent, so the statement ought to display as

if (…) 1rav = 2rav

But it is lexically separated from the ‘if’ statement on both sides by its branches. Absent any hints, the Unicode Bidi Algorithm will conclude that it’s RTL because it’s neutral and appears between two RTL elements. So it will display as

if (…) 2rav = 1rav

and do a thing that corresponds with the sequence order but not with the display or programmer expectation. It looks EXACTLY like an assignment in the opposite direction.

In order to make sequence, display, and semantics congruent, our editor must insert appropriate control codes making the sequence

if … var1 (control)= var2(control)

which the unicode bidi algorithm will display as the programmer wants it:

if … 1rav = 2rav

and the compiler must treat those control codes as semantically significant required elements of the language to bring the bidi state of the operator back to congruence with the bidi state of the surrounding if clause. In this case the second sequence re-establishing the bidi state of the ‘if’ context can be skipped if reading the following character has the same result – and because most text editors won’t allow the control codes to be there in that case, the compiler must allow them to be skipped.

Clive Robinson November 3, 2021 4:33 PM

@ FA,

Ehh, no, you send the OTT character.

That is the result when,

“you just hold the input and send a NUL.”

The whole of what you quoted is talking about the “input to the adder”, not “the output sum to line”, because the “holding” is a function that precedes the adder.

With regard to the XOR function rather than ADD: yes, the result is very similar, only the complexity is different[1].

[1] With an XOR system each bit is in effect isolated from all the other bits, as XOR is a “half adder”. With ADD, whilst the least significant bit is an XOR function, the other bits from then on have the potential for a carry input. This causes a nonlinear effect in the higher bits, which some believe is desirable.

Personally I use both in series in some stream cipher systems, using two stream generators that are very different in their basic design.

In effect the XOR “whitens the input”, changing its statistics, and the ADD does the actual encryption of that.

Some think the other way around ie ADD then XOR is better. Further, some think “rotating” the input is as good as whitening it (but as it does not change the set-to-clear bit ratio, only the bit positions, I have my doubts).

RealFakeNews November 3, 2021 7:55 PM

Source code editors should display ALL characters of a source file.

For years, Visual Studio has presented alert boxes for source code not written on the local computer (well, projects, anyway). This is because it will not display certain characters, and the way it parses projects can itself be exploited to do nasty things.

This Unicode problem needs filing under the same heading as “don’t trust the compiler”.

Don’t trust the source code FILE. Use a hex editor to examine every byte first.

SpaceLifeForm November 3, 2021 8:25 PM

@ Clive, ALL

WTF dude?

LOL. Thank you for digging that up.

Classic. Spot on.

p.s. ISO 9000 is garbage.

Thomas Stone November 3, 2021 10:37 PM

I wonder, would applying Unicode Normalization before compiling address this? Seems that would be very straightforward (www.unicode.org/reports/tr15)

lurker November 4, 2021 12:09 AM

@Thomas Stone: Unicode Normalization might address the problem, or not. That report has some:
Where (…) behaviour must (…); and some
If (…) it is recommended (…)

Pay yer money and take yer choice…
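
For what it’s worth, a quick check with Python’s unicodedata suggests that none of the standard normalization forms touch either the directionality controls or the cross-script lookalikes, so normalization alone would not close the hole:

    import unicodedata

    rlo = "\u202e"      # RIGHT-TO-LEFT OVERRIDE, used in CVE-2021-42574
    cyr_o = "\u043e"    # CYRILLIC SMALL LETTER O, a homoglyph of Latin "o" (CVE-2021-42694)

    for form in ("NFC", "NFD", "NFKC", "NFKD"):
        survives = unicodedata.normalize(form, rlo) == rlo          # True: override kept
        still_cyrillic = unicodedata.normalize(form, cyr_o) != "o"  # True: not folded to Latin
        print(form, survives, still_cyrillic)

So on its own it addresses neither CVE.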

FA November 4, 2021 4:05 AM

@Clive

The whole of what you quoted is talking about the “input to the adder”, not “the output sum to line”.

OK, but would you put the switch at that input, selecting either the plaintext symbol or NUL ? That creates an unstable feedback loop since the action of the switch depends on the adder output. You’d have to latch the condition that drives the switch, or use a second adder to generate the final output.

Putting the switch at the output (selecting either adder output or key symbol) avoids this. That is the scheme I had in mind and why I understood ‘send’ to refer to the final output.

Some think the other way around ie ADD then XOR is better.

I don’t think it matters. It certainly doesn’t harm to have both.

Clive Robinson November 4, 2021 10:17 AM

@ FA,

would you put the switch at that input, selecting either the plaintext symbol or NUL ?

It’s a question that has no correct answer other than,

“It depends on implementation factors”.

It also brings up “Security-v-Efficiency” issues, which is why I prefer to pipeline things with latches and bi-phase clocks.

First though, one difference between a true OTT system and a Stream Cipher is that it matters not a jot how much OTT KeyMat you send over the wire instead of ciphertext (as long as it is never re-used). However, with a key stream generator you should aim to NEVER send any KeyMat over the wire, only ever ciphertext[1], preferably where the plaintext has had its statistics flattened by compression or encryption[2].

So one known issue of the early Rockex was that it was hopelessly insecure to what we would call a TEMPEST or “Passive EmSec” attack[3]. The reason was that the XOR function had a bad time-based side-channel issue that revealed the OTT on the wire. That is, by looking at the ciphertext on the wire with an oscilloscope or similar, you could tell by the pulse width if a bit had been flipped or not… So stripping off the XOR’d OTT, whilst not child’s play, could be done automatically. Something that apparently caused the Canadian Pat Bayly, Assistant Director of the “British Security Co-ordination”(BSC), who designed Rockex, more than a few headaches to sort out.

So in a practical implementation you have to consider “side channels”, and that takes priority over other apparently more logical reasons such as those you indicate.

It’s also one of the reasons for a TEMPEST design rule that is effectively,

“Clock the outputs”

As this is effectively a subset of “pipelining techniques” that can remove the “jitter” that can act as a side channel leaking KeyMat or plaintext via timing.

[1] This is because each character in an OTT is “truly independent” (or should be), but every character out of a key stream generator is 100% dependent on the generator’s internal state. So every output character from a stream generator is in no way independent of all the others. So allowing output from the stream generator on the wire provides information to an observer that can in theory help predict the next output character from the stream generator.

[2] It’s one of several reasons why the Rockex was used as a “super-encrypter”, not a message encrypter like the Typex. But those who used the Rockex were encouraged to think of it as a “link encrypter” of fast multiplexed teletype traffic that was being “routed through their switching node”. Note that, by that time, the Typex was actually considered insecure (because of Ultra). Something that neither the British nor the US wanted to get out, as the myth of rotor machine security meant other nations used it or similar rotor machines without super-encryption, so up into the late 1980s GCHQ and the NSA were happily reading their “routine” messages, which gave more valuable intel due to “traffic analysis” than traffic that the adversary considered so secret they hand-ciphered it with an OTP. Not that OTP was always secure in practice… in some cases, due to corruption, it was actually totally insecure due to “key reuse”, as Project VENONA demonstrated over the many years it ran,

https://en.wikipedia.org/wiki/Venona_project

[3] Whilst now “common knowledge”, this is still technically classified in both the UK and US, but… surprisingly, for different reasons.

In the UK it is both “unavailable” under the “hundred year rule” that “protects the guilty”, as well as the fact it is still “classified”, which is something we should not know, as the UK does not reveal such things, giving the “We can neither confirm nor deny” mealy-mouthed response to nearly all alleged “National Security” related questions. But we know through the US…

In the US, TEMPEST and all that is related to it is still technically classified, even though most of it is obvious from the laws of physics. So the technique of pulse-width measurement with an oscilloscope, chart recorder or similar is “classified”… But also, under the BRUSA/UKUSA arrangement, the US treats anything originating from the UK that the UK still regards as “classified” as “classified” as well. But as with measuring pulse widths, the way the US has handled information requests has enabled people to strip off the cause for the classification, so we know it’s because the UK still has it classified… Yup, there is some irony there.

vas pup November 4, 2021 4:30 PM

Tag – vulnerabilities

Who is creating the apocalypse

https://www.bbc.com/future/article/20211014-agents-of-doom-who-is-hastening-the-apocalypse-and-why

“Mass surveillance

There are concerns not just of global calamity, but the risk of dystopias, such as a long-lasting, AI-surveillance power totalitarian state. The social scholar and author Shoshana Zuboff at Harvard University has branded the modern era as the Age of Surveillance Capitalism. It is a situation in which human attention, experience and data are captured and commercialized en-masse.

It is dominated by a few, familiar titans: Alphabet (Google), Microsoft, Facebook, Amazon, Tencent, Baidu, Alibaba and Apple. All of which are in the top 10 largest companies in the world by market capitalization. These are accompanied by tech firms tailored to provide cyber weapons and surveillance, such as Palantir and the NSO Group. The latter has recently been exposed by investigative journalists as having provided malware that has been used to infect the devices of activists, politicians, union leaders and journalists around the world, including by repressive regimes. (NSO denies any wrongdoing).

Big Tech and surveillance firms are accompanied by intelligence communities to form what I call a =Stalker Complex= that constantly watches the world. The breadth and intrusiveness of this mass surveillance apparatus is carried out by a small number of intelligence agencies such as the National Security Agency (NSA) and its UK counterpart the Government Communications Headquarters (GCHQ). While the future of AI surveillance technologies such as facial recognition technology is dominated by US and Chinese firms.”

Read the whole article! Many interesting facts there.

Clive Robinson November 4, 2021 6:02 PM

@ vas pup, ALL,

With regards the “British Broadcasting Corporation”(BBC)

It, along with the UK “Civil Service”, has become “captured” by the current political incumbents. Thus the touchstone they operated on of,

“Speak the truth, without fear or favour”

Is very much a thing of the past.

I thus advise people to “sanity check” anything coming out of either of them against other independent sources.

As Samuel Langhorne Clemens (Mark Twain) once remarked,

“If you don’t read the newspaper, you’re uninformed. If you read the newspaper, you’re mis-informed.”

And in the case of the BBC and UK Civil Service, they have become “mouthpieces” for what you might call “The UK Alt-Right”, which the current political incumbents are, and their brand of “faux news”.

Sam Clemens also had wise words on politicians,

“Reader, suppose you were an idiot. And suppose you were a member of Congress. But I repeat myself.”

SpaceLifeForm November 4, 2021 10:03 PM

He is no dummy, but I disagree about the defense approach. Code review is not going to prevent all attacks. But, I do agree that dependency review is very important.

And, also, that this story is a distraction. Maybe that is the intent?

https://research.swtch.com/trojan

The authors of this paper have clearly done a good job promoting it. Kudos to them on that. But I am concerned that the attention and response this paper is getting is in general distracting from far more useful security efforts. We should redirect that attention and response at improving general code and dependency review instead.

Clive Robinson November 5, 2021 1:45 AM

@ SpaceLifeForm, ALL,

Code review is not going to prevent all attacks.

No disagreement there; the logic behind that, though, raises a problem.

The logic is the issue of the “Known, Knowns”, “Unknown, Knowns”, and “Unknown, Unknowns”.

Whilst all review processes should pick up “Known, Knowns”, you start running into problems with “Unknown, Knowns”, or as some call them “Black Swans”. Whilst I’m not that good at it, I can “join the dots” and envision a limited number of “black swans”, often half a decade to a decade before they become “Known, Knowns”. Any security or safety review needs people who can not just see black swans but enunciate them sufficiently clearly that all involved understand them.

But by definition you cannot predict or anticipate “Unknown, Unknowns” (just be an originator of them at some future point). Whilst that is true, you can “get lucky” and stop them by chance. That said, you can evolve design processes that help in that direction. One such aim is to “Mitigate Classes not Instances” of vulnerability[1].

Which brings us to the problem, whilst what you can do for design processes carries forward into review processes, the review process can never go beyond the design process it is bound by.

As design processes are bound so must be ALL review processes.

That is “All review processes are limited by design”.

Whilst it can be argued that some review processes are better than the design process in use, it is only true because the review process is based on a different and arguably better design process. Whilst this might suggest changing the design process in use, there may be good and proper reasons for not doing so.

But at the end of the day, we should realise that any review process cannot pick up anything that the design process it is based on would not itself have highlighted and stopped…

[1] To “Mitigate Classes not Instances” of vulnerability, you have to understand the notion of “coverage”. Whilst akin to scope, it works the other way around: coverage reduces from global to individual, whilst scope increases from individual to global. Most think in terms of scope, not coverage, as they assume it is simpler or better defined and thus more tractable. Actually that is often not the case. An instance of vulnerability falls in a class; the instance has many very specific attributes, from which you abstract out some key elements to form a class. Obviously this means that a specific class is part of a larger class within the global set of all vulnerabilities. Thus if you mitigate one key feature, all vulnerabilities that have it are also mitigated. It is possible to have entirely simple mitigations that have near global coverage. As an example, at the highest class levels of attacks we have “internal” and “external” attacks, under which all attacks fall. To mitigate against all “external” attacks you simply do not have a system with any communications, and you make it “tamper evident”, which is realisable under certain broad assumptions, whilst “tamper proof” is provably not possible. This is the basic design philosophy of “secure tokens”: whilst there might be “Unknown, Unknown” attacks the system is vulnerable to, it cannot be attacked by them if the attackers cannot get to the system at the level required. In essence it is the basis for both “segregation” within systems and “isolation” of whole systems, the latter being covered by “energy-gapping”. However, for a component to function within a system it needs to be part of it, and that requires the ability to transfer energy of some form, which is the essence of communication. Thus to segregate it you have to identify all communications paths and their characteristics, and not just mitigate those that are undesired but also monitor and control desired communications such that they do not behave outside the specification. Think of stripping the majority of control characters out of ASCII as a base level, then work upwards through the layers. As can be seen, “isolation” is far easier than “segregation”, but it has way more limitations on utility.

John Nada November 5, 2021 3:53 AM

Clive Robinson wrote on November 4, 2021 at 6:02 PM :
//
As Samual Langhorn Clemens (Mark Twain) once remarked,

“If you don’t read the newspaper, you’re uninformed. If you read the newspaper, you’re mis-informed.”
//
As Oscar Fingal O’Flahertie Wills Wilde (a.k.a. Oscar Wilde) could have stated :

“In media stat virus”.

😉

Quantry November 5, 2021 11:41 AM

@vas pup Re: “Age of Surveillance Capitalism… dominated by a few, familiar titans: Alphabet (Google)… Big Tech … =Stalker Complex= … intelligence agencies…”

Google evidently even vigorously condemns itself in its own “unwanted software policy”:

It doesn’t tell the user about all of its principal and significant functions.
It affects the user’s system in unexpected ways.
It is difficult to remove.
It collects or transmits private information without the user’s knowledge.
It is bundled with other software and its presence is not disclosed.

Amazing. [many more censored bits]

ht tps://www.google.com/about/unwanted-software-policy.html

John Nada November 5, 2021 12:43 PM

@Clive Robinson November 5, 2021 11:16 AM :
Seems you forgot your latin and did not catch the pun. You’re human, after all. 😉

John Nada November 5, 2021 12:52 PM

Second and last try, for comment did not appear and could possibly be lost as already happened two or three times in the past month.

@Clive Robinson, November 5, 2021 11:16 AM :
Seems you forgot your latin, for you did not catch the pun. You’re human, after all. 😉
In mediO stat virTus
In mediA stat virus

lurker November 5, 2021 3:36 PM

@SpaceLifeForm: … this story is a distraction.

Perhaps they knew that too, from the paper (bold added)

Many of the discrepancies between source code logic and compiler output logic stem from compiler optimizations, about which it can be difficult to reason.

SpaceLifeForm November 5, 2021 5:24 PM

@ Clive, lurker, Ted

“Unknown Unknowns”

Connect dots.

Silicon Turtles and Compiler Optimizations

name.withheld.for.obvious.reasons November 5, 2021 10:49 PM

@ Clive
Ha ha, you said Nixie tube–still have some 8 segments laying around. Do any of your devices require less than 3A DC to operate?

Clive Robinson November 5, 2021 10:52 PM

@ SpaceLifeForm, lurker, Ted,

Silicon Turtles and Compiler Optimizations

Both in practice and theory (note the reverse order), “compiler optimizations” are a security weakness for several reasons.

It falls under the “Security-v-Efficiency” issue. The most obvious effects being,

1, Opens up side channels.
2, Increases system transparency.

Neither of which is good.

But there is a further issue, which is one I keep meaning to dig more into, which is “System Signals” that can aid “Active EmSec Attacks” by providing synchronizing markers for the likes of “EM Fault Injection Attacks”.

For example, some of you might be aware of “Inter Symbol Interference”(ISI)[1], a process whereby, in communications, the energy of a bit gets spread across adjacent bits. Less well known is that by the use of matched filters you can pre-distort and gain better utilisation of the channel[2]. One such that is in almost every home is “pre-emphasis” in FM channels, where boosting signals at certain frequencies at the TX, when corrected at the RX, reduces the effective channel noise. By adjusting it correctly you can reduce the phase and time distortion that cause “jitter” etc and get a significantly increased range at a given data rate. Something that was very useful to exploit in the designs of the likes of “analogue cordless phones” connected to POTS land lines, which were very popular from the 1970s through the 1990s before the likes of DECT and then cellular systems replaced them.

Now imagine an optimizing compiler as a “pre-distortion” system… What can it get you?

If you monitor the channel, the pre-distortion makes the signal less obvious to instrumentation and makes synchronization points difficult to detect. However, to a receiver with the correct filter the signal appears out of the noise as if by magic. In effect the same principles as Spread Spectrum “Low Probability of Intercept”(LPI) communications that got used for “Digital Watermarking”(DWM) for use in “Digital Rights Management”(DRM) in the 1990s…

Thus there is potential for all sorts of side channels to be introduced that can be used for both passive –TEMPEST– and active –EM injection– EmSec attacks.

As I’ve mentioned before, the “square law” nature of semiconductors has advantages such as envelope detection. You can therefore take a microwave carrier that will get through case ventilation slots and side panel edges / seams that act as “slot antennas”[3] relatively easily, and have it “demodulated” inside the case. Thus the AM on the microwave signal radiates off of internal wiring and the like, to get envelope-detected itself in other parts of the system. This way you can piggyback a relatively low-frequency fault injection signal into a system where it would not otherwise be possible.

[1] ISI is present in all Shannon Channels due to frequency response[2], and also where information travels via multiple independent paths between two points. It’s nicely explained in,

https://m.youtube.com/watch?v=I087FUvW2ys

[2] Remember that if you know what the path properties are you can pre-distort with an inverse function to in effect cancel some of the ISI (not all due to the fact the system is finite and you want to keep delay time down). This technique is used with adaptive matched filters in the likes of DSL modems, which are one of the few reasons to still have POTS land lines any longer. Thus capable of 8 megabit or more data rates in wired channels that were only supposed to have a 3kHz audio bandwidth…

[3] Slot antennas are a form of “dielectric antenna”. Put way, way over-simply, a slot in a sheet of metal is the inverse of a wire antenna such as a dipole. It’s one of the fun things behind “Fractal antennas” and a curse in PCB design.

name.withheld.for.obvious.reasons November 5, 2021 11:34 PM

To go through a whole tree: say, rebuild from source via a filter. Use a print-processor stub as the filter for the source files, target a new source snapshot, and analyze the tree for deviations. So the first pass is rebuilding the source from the original repository, and afterwards analyzing the tree for differential results at varying altitudes, so to speak: first the snapshot size, then the number of differing files/sizes, and then the contents of those files. Easily scripted in an afternoon before nap time.
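
A stripped-down version of that afternoon script might look like this (illustrative Python; the “filter” here just drops non-ASCII bytes, and the snapshot paths are placeholders):

    import os

    SRC = "repo_checkout"    # placeholder: the original source snapshot
    DST = "repo_filtered"    # placeholder: the rebuilt, filtered snapshot

    def print_filter(data):
        # Stand-in "print processor" filter: keep tabs, newlines and 7-bit
        # printable bytes; everything else gets dropped.
        return bytes(b for b in data if b in (0x09, 0x0A, 0x0D) or 0x20 <= b < 0x7F)

    deviations = []
    for root, _, files in os.walk(SRC):
        for name in files:
            src_path = os.path.join(root, name)
            dst_path = os.path.join(DST, os.path.relpath(src_path, SRC))
            os.makedirs(os.path.dirname(dst_path), exist_ok=True)
            data = open(src_path, "rb").read()
            filtered = print_filter(data)
            open(dst_path, "wb").write(filtered)
            if filtered != data:             # file-level differential
                deviations.append((src_path, len(data) - len(filtered)))

    for path, dropped in deviations:
        print("%s: %d byte(s) removed by the filter" % (path, dropped))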

SpaceLifeForm November 6, 2021 12:33 AM

@ name.withheld.for.obvious.reasons

How about another step?

Build the snap, dump it to thumb drive, hand carry to an offline build machine.

Then, just to be safe, run the checks one more time on the offline build machine.

If you stick to 7-bit ASCII, this is not rocket science. It is KISS.

SpaceLifeForm November 6, 2021 1:15 AM

A brief comment regarding comments

simple bash or perl comment

/* difficult to parse c comment */
// easier to parse c++ comment

php attempts to deal with all of them.

In this instant case, the difficult-to-parse comment method was used in the attack on the C code.

Note that it is much easier, and much safer, for a parser to identify the start of a comment, and throw away everything remaining on that line.

It does not have to worry about finding a closing */
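
A toy illustration of the difference (a Python sketch; a real lexer also has to cope with comment markers inside string literals, which this ignores):

    import re

    # Line comments: find the marker and drop the rest of the line. There is
    # no closing delimiter to hunt for, so a hidden character cannot quietly
    # extend the comment over following code.
    def strip_line_comments(src, marker="//"):
        return "\n".join(line.split(marker, 1)[0] for line in src.splitlines())

    # Block comments: the lexer must carry state until it finds "*/", and
    # that whole span is exactly where the bidi trick likes to hide.
    def strip_block_comments(src):
        return re.sub(r"/\*.*?\*/", "", src, flags=re.S)

    code = "x = 1; // set up\ny = 2; /* multi\nline note */ z = 3;"
    print(strip_line_comments(code))
    print(strip_block_comments(code))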

hxtps://www.php.net/manual/en/language.basic-syntax.comments.php

SpaceLifeForm November 6, 2021 1:28 AM

Place a hash in front of ‘simple’ above. It did not make it thru.

I did not preview. Markdown treated it as a html header.

In theory, this should work (no space after hash)

#simple bash or perl comment

SpaceLifeForm November 6, 2021 2:06 AM

As to String Literals

If your application wants to support i18n, then to do it correctly, you want to support more than just English, right?

But, to do it correctly, you are not going to hardcode various messages in various languages.

Unless one is insane.

Your binary will bloat badly. You may end up with orders of magnitude more bytes in your binary than the actual machine code.

The way to deal with this (been there, done this, last century) is to use MsgCodes: in conjunction with the user’s language preference, access a database using the MsgCode and userlang, and extract the proper Unicode text that the user can understand. You can even sprintf() into it as needed before display to the user.

Serious errors (panic situation, going to crash), sure, keep them in plain 7-bit ASCII, because they should be very rare. But for user interaction, using a database for error messages, warnings, help text, and general documentation just makes sense. If something textual is not clear, you can update the database, no recompile required!
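
A stripped-down sketch of the idea (an in-memory dict standing in for the database; the message codes and catalogue contents are made up):

    # Message catalogue keyed by (MsgCode, language). In a real system this
    # lives in a database or resource bundle, not in the binary.
    CATALOGUE = {
        ("E1001", "en"): "Disk %s is %d%% full",
        ("E1001", "de"): "Datenträger %s ist zu %d %% belegt",
    }

    FALLBACK_LANG = "en"

    def user_message(code, lang, *args):
        template = CATALOGUE.get((code, lang)) or CATALOGUE[(code, FALLBACK_LANG)]
        return template % args               # the sprintf()-style step

    print(user_message("E1001", "de", "/dev/sda1", 93))
    print(user_message("E1001", "fr", "/dev/sda1", 93))   # falls back to English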

echo November 6, 2021 4:52 AM

As has been noted, a sensible specification helps create sensible code. One not very sensible specification is that for URLs. Tim Berners-Lee, to his credit, has acknowledged that the specification is a nightmare, and that if he did it today he would never do it that way.

Security Sam November 9, 2021 3:14 PM

Cleverly hiding vulnerabilities in the source code
Makes the sum of the parts greater than the whole
Creating a new paradigm of the morphed opcode
That resembles a frustrating game of whack-a-mole

Dave November 16, 2021 1:32 PM

This doesn’t address the bidirectional problem (the main problem in the article), but the secondary issue of similar-looking letters of different scripts….

When we started allowing non-Latin web addresses, I thought we should have instead invented a new kind of Unicode normalization to solve it. This is another case where that other kind of normalization would be helpful. Unicode has two general normalization forms, NFC/NFD (canonical) and NFKC/NFKD (compatible). The compatible one combines similar-meaning letters, like mapping a superscript 2 to a normal 2. However, we don’t have one to combine similar-looking letters of different scripts. It seems like we are getting more cases of security issues associated with these similar-looking letters from different scripts.

What if the Unicode Consortium made a third kind of normalization for this? For example, Latin lowercase o and Cyrillic lowercase o would map to the same character? This is not 100% perfect because there are similar looking letters within the same script (e.g. lowercase L vs uppercase I vs numeral 1). But those could be distinguished with fonts (users would typically become familiar pretty quickly with font variations within the script they are familiar with, but maybe not so with scripts of other languages). If we did this with DNS, all we would need is a rule that says domain names must be normalized to this form. Then, web browsers (hidden from the user interface) would map google.com to the same google.com when doing the DNS query, regardless of what kind of lower case o’s, e’s or c’s are in there. The web browser can let the user type it in whatever mix of scripts the user wants, and show it on the navigation bar in those scripts, but there would be only one google.com. I understand other methods have been developed to address this, but this seems cleaner and simpler.
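
In miniature, such a “confusable normalization” could look like this (a Python sketch; the mapping is a tiny hand-picked sample for illustration, not the Unicode Consortium’s confusables data):

    # Map a few cross-script lookalikes onto one representative before
    # comparison. A real scheme would use the full Unicode confusables data.
    CONFUSABLE = {
        "\u043e": "o",   # CYRILLIC SMALL LETTER O  -> Latin o
        "\u0430": "a",   # CYRILLIC SMALL LETTER A  -> Latin a
        "\u0435": "e",   # CYRILLIC SMALL LETTER IE -> Latin e
        "\u0441": "c",   # CYRILLIC SMALL LETTER ES -> Latin c
    }

    def skeleton(name):
        return "".join(CONFUSABLE.get(ch, ch) for ch in name.lower())

    print(skeleton("google.com") == skeleton("g\u043e\u043egle.com"))   # True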

If we had that kind of Unicode normalization, it could be used in this case also to warn of this issue, and various other situations where security flaws arise out of visually similar looking letters.
