Although AI (in this article mostly referring to LLMs) seems to be revolutionizing many areas, and many rules of conventional technology don’t seem to apply to it, I am surprised by how many concepts from traditional cybersecurity are appearing in a new light in the era of AI. The one that scares me the most is the concept of “Sources and Sinks”.
What are Sources and Sinks?
The concept of sources and sinks originally comes from security code reviews. It refers to the fact that data comes from somewhere, a.k.a. the source (user input, databases, website components, other systems etc.), and flows through the application and the logic that processes it into a so-called sink (a database, a webpage element, an email etc.). Security researchers commonly do something called “Taint Tracking” or “Taint Analysis” to identify what data goes where. This is already pretty hard to do in large-scale applications, but with enough effort it is achievable. At least those applications follow deterministic algorithms, which means: if A happens, the consequence is always B. You may already see where this is going.
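To make this concrete, here is a minimal sketch of the kind of thing taint analysis looks for. The function name and the toy database are made up for illustration; the pattern, untrusted source flowing into a sensitive sink, is the classic one:

```python
import sqlite3

def handle_login(form_data):
    # SOURCE: untrusted user input enters the application here
    username = form_data["username"]

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT)")

    # SINK: the tainted value reaches a SQL query unsanitized.
    # Taint analysis flags this source-to-sink path (SQL injection).
    conn.execute(f"SELECT * FROM users WHERE name = '{username}'")

    # A parameterized query breaks the taint path, so this line
    # would not be flagged:
    conn.execute("SELECT * FROM users WHERE name = ?", (username,))

handle_login({"username": "alice"})
```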
AI “sinks” this concept
You get it? Because of sources and sinks, and how hard it is to do taint analysis in AI. Well, the problem with AI is twofold.
One thing is that LLMs are intentionally non-deterministic in the conventional sense. Because they imitate human language and, to some extent, thinking, they have built-in measures, like temperature and sampling methods, that add randomness to their output. This means that, especially in cases where an LLM decides which tool to use or where to put some data, you cannot be entirely sure what exactly it will do. Of course, when you only have one agent that can use either a calculator or a browser, this is not a huge problem. It is, however, unlikely that this is what AI systems of the future will look like. Although, if you add email and PowerPoint to this agent, it can probably do 90% of a McKinsey consultant’s work ;).
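If you want to see where that randomness comes from, here is a toy sketch of temperature-scaled sampling, the mechanism most LLM APIs expose. The logits and tool names are invented for illustration:

```python
import numpy as np

def sample_next_token(logits, temperature, rng):
    """Pick a token index from temperature-scaled logits."""
    scaled = np.asarray(logits, dtype=float) / temperature
    probs = np.exp(scaled - scaled.max())  # numerically stable softmax
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)

# Toy logits for three tool choices an agent might make
tools = ["calculator", "browser", "send_email"]
logits = [2.0, 1.0, 0.5]
rng = np.random.default_rng(0)

for t in (0.1, 1.0, 2.0):
    picks = [sample_next_token(logits, t, rng) for _ in range(1000)]
    counts = np.bincount(picks, minlength=len(tools))
    print(f"temperature={t}: {dict(zip(tools, counts))}")
# Low temperature: almost always "calculator". High temperature: the
# exact same prompt sometimes ends in a completely different sink.
```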
The other thing is that AI systems will probably have a lot of sources and sinks, because one of the intended use cases for these systems is to replace human workers. And if you have an office job, you know what an insane amount of information you need access to and how much stuff you are allowed to do (at least from a typing-stuff-into-something perspective).
Now this becomes even scarier when you realize that, even though this concept is known to more and more developers, it is insanely often ignored. This may be because of a lot of microservice stuff and compartmentalized responsibility, where the individual developer just takes some data that some other developer sends them, and who knows where the user puts in what and where it ends up, or simply because of a lack of awareness. But now imagine these same developers building large-scale, “non-deterministic” applications that have access to literally the entire internet and confidential corporate data, can use a bunch of tools, send emails and Slack messages, surf the web, and someday control heavy machinery. And as if this were not enough, they have to do all this under immense time pressure, with technology that not even the people who created it fully understand, and with libraries that abstract it to a level where nobody knows what is going on under the hood.
All hope is lost?
I “hope” that this is not the case, and I think now is the perfect time to put a spotlight back on this “old” concept of “Sources and Sinks”. Currently, everyone is out there building some PoC implementation of some AI tooling, and once everyone sees the potential of this crazy new invention, it is very likely that the PoC just gets some more bells and whistles, a feature here and a new API there, and before you know it, it manages some crucial process in your company. So please, while it is still easy and you think you know where all your data is coming from, because it is just this tiny little PoC, start modeling your data flows. Because if you don’t do it now, somebody will have to reverse engineer them later. It won’t be you, because you have all this other stuff going on. It will be some dude in a basement, and he will do some involuntary pentesting for you.
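What does “modeling your data flows” look like in practice for a tiny PoC? Here is one minimal sketch, assuming a tool-using agent. All tool names, labels and the single policy rule are invented for illustration, not taken from any real framework:

```python
from dataclasses import dataclass

# Minimal data-flow model for an agent PoC: every tool declares its
# sources (where its data comes from) and sinks (where data ends up).
@dataclass(frozen=True)
class Tool:
    name: str
    sources: frozenset
    sinks: frozenset

TOOLS = [
    Tool("web_search", frozenset({"public_web"}),  frozenset({"llm_context"})),
    Tool("read_crm",   frozenset({"crm_db"}),      frozenset({"llm_context"})),
    Tool("send_email", frozenset({"llm_context"}), frozenset({"external_email"})),
]

# One rule you can already enforce mechanically: once untrusted or
# confidential data has reached the LLM context, any tool that writes
# to an external sink needs review before it runs.
SENSITIVE = {"public_web", "crm_db"}
EXTERNAL = {"external_email"}

def risky_tools(context_labels: set) -> list:
    if not (context_labels & SENSITIVE):
        return []
    return [t.name for t in TOOLS
            if "llm_context" in t.sources and t.sinks & EXTERNAL]

# After the agent has browsed the web and read the CRM:
print(risky_tools({"public_web", "crm_db"}))  # ['send_email']
```

Even a table like TOOLS in a README would do. The point is that the flows get written down while you still know what they are.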
Talking ‘bout AI.
This has probably already crossed your mind: “Can’t LLMs just do the data flow modeling and taint tracking?”, and I think it is very likely that they can. This will be an amazing field for AI security companies to work on, and you will probably be able to hack a small-scale solution together in a couple of hours. I might even do it one of these days. But just because AI COULD do it does not mean that it is not important to give this some thought. Because hackers won’t wait until the market is full of tools that help secure AI. They are, in fact, hacking AI right now, like crazy.
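For the curious, that couple-of-hours hack could start as small as this. The sketch assumes the official OpenAI Python client; the prompt, model name and file name are placeholders, and any LLM API would do:

```python
from openai import OpenAI  # assumes the official OpenAI client (>= 1.0)

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = """You are a security reviewer performing taint analysis.
For the code below, list every SOURCE (where external data enters),
every SINK (where data leaves or causes a side effect), and every
source-to-sink path that lacks validation or sanitization.

Code:
{code}
"""

def llm_taint_review(code: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder; use whatever model you have access to
        messages=[{"role": "user", "content": PROMPT.format(code=code)}],
    )
    return response.choices[0].message.content

# Point it at one of your PoC's files:
print(llm_taint_review(open("agent_tools.py").read()))
```

It won’t replace a proper review, but it costs a lot less than the incident.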