Cybersecurity & Tech

Lawfare Daily: Elliot Jones on the Importance and Current Limitations of AI Testing

Kevin Frazier, Elliot Jones, Jen Patja
Friday, August 30, 2024, 8:00 AM
What is the current state of efforts to test AI systems?

Published by The Lawfare Institute
in Cooperation With
Brookings

Elliot Jones, a Senior Researcher at the Ada Lovelace Institute, joins Kevin Frazier, Assistant Professor at St. Thomas University College of Law and a Tarbell Fellow at Lawfare, to discuss a report he co-authored on the current state of efforts to test AI systems. The pair break down why evaluations, audits, and related assessments have become a key part of AI regulation. They also analyze why it may take some time for those assessments to be as robust as hoped. 

To receive ad-free podcasts, become a Lawfare Material Supporter at www.patreon.com/lawfare. You can also support Lawfare by making a one-time donation at https://givebutter.com/c/trumptrials.

Please note that the transcript below was auto-generated and may contain errors.

 

Transcript

[Introduction]

Elliot Jones: On one level, there's just a generalization problem, or a kind of external validity problem. The tests can do what they need to do: they can tell you whether the system has stored that knowledge. But translating that, whether the system has stored knowledge or not, into whether someone can take that knowledge, apply it, and use it to create a mass casualty event? I just don't think we have that knowledge at all.

Kevin Frazier: It's the Lawfare Podcast. I'm Kevin Frazier, assistant professor at St. Thomas University College of Law and a Tarbell Fellow at Lawfare, joined by Elliot Jones, a senior researcher at the Ada Lovelace Institute.

Elliot Jones: One thing we actually did hear from companies, from academics, and from others is that they would love regulators to tell them, what evaluations do you need? I think a big problem is that there hasn't actually been that conversation about what kinds of tests you would need to do that regulators care about, that the public cares about, that are going to test the things people want to know.

Kevin Frazier: Today we're talking about AI testing in light of a lengthy and thorough report that Elliot coauthored.

[Main Podcast]

Before navigating the nitty gritty, let's start at a high level. Why are AI assessments so important? In other words, what spurred you and your coauthors to write this report in the first place?

Elliot Jones: Yeah, I think what really spurred us to think about this is that we've seen massive developments in the capabilities of AI in the last couple of years, with ChatGPT and everything that has followed. I think everyone's now aware how far some of this technology is moving, but I think we don't really understand how it works, how it's impacting society, what the risks are. And a few months ago, when we started talking about this project, there was the U.K. AI safety summit. There were lots of conversations and things in the air about how we go about testing how safe these things are. But we felt a bit unclear about where the actual state of play was. We looked around, and there was a lot of interesting work out there, but we couldn't find any kind of comprehensive guide to how useful these tools actually are, how much we can know about these systems.

Kevin Frazier: To level set for all the listeners out there, Elliot, can you just quickly define the difference between a benchmark, an evaluation, and an audit?

Elliot Jones: That is actually a slightly trickier question than it sounds. At a very high level, an evaluation is just trying to understand something about a model or the impact the model is having. And when we spoke to experts in this field, people working at foundation model developers and at independent assessors, some of them used the same definition for audits. Sometimes audits were a subset of evaluations; sometimes evaluations were a subset of audits. But in a very general sense for listeners, evaluations are just trying to understand what the model can do, what behaviors it exhibits, and maybe what broader impacts it has on, say, energy costs, jobs, the environment, and other things around the model.

For benchmarking in particular, benchmarking often uses a set of standardized questions that you give to the model. So you say, we have these hundreds of questions, maybe from something like an AP History exam, where you have the question, you have the answer, you ask the model the question, you see what answer you get back, and you compare the two. And that allows you to have a fairly standardized and comparable set of scores that you can compare across different models. So, a benchmark is a kind of subset of evaluation.
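
To make that pattern concrete, here is a minimal sketch of what such a benchmark harness might look like in code. The query_model callable and the toy question are illustrative assumptions, not part of any real benchmark or model API.

```python
# Minimal sketch of the benchmark pattern described above: standardized
# question/answer pairs, model responses, and a single comparable score.
# query_model is a hypothetical stand-in for whatever interface a model exposes.

from typing import Callable

def run_benchmark(qa_pairs: list[tuple[str, str]],
                  query_model: Callable[[str], str]) -> float:
    """Return the fraction of questions the model answers exactly right."""
    correct = 0
    for question, expected in qa_pairs:
        answer = query_model(question).strip().lower()
        if answer == expected.strip().lower():
            correct += 1
    return correct / len(qa_pairs) if qa_pairs else 0.0

# Toy example: a single AP-History-style item and a stand-in "model."
toy_items = [("In what year was the U.S. Constitution signed?", "1787")]
print(run_benchmark(toy_items, lambda q: "1787"))  # prints 1.0
```

Because every model is asked the same questions and scored the same way, the resulting number can be compared across models, which is the main appeal of benchmarks over more exploratory testing.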

Kevin Frazier: And when we focus on that difference between evaluations and audits, if you were to define audits distinctly, separating them from evaluations for the folks who conflate the two, what is the more nuanced, I guess, definition of audits?

Elliot Jones: The important thing when I think about audits is that they are well-structured and standardized. So if you're going into, say, a financial audit, there's a process the audit is expected to go through, to assess the books, to check out what's going on. Everyone knows exactly what they're going to be doing from the start. They know what end points they're trying to work out. So I think an audit would be something where there is a good set of standardized things: you're going in, you know exactly what you're going to do, you know exactly what you're testing against. Audits might also be more expansive than just the model. So an audit might be a governance audit, where you look at the practices of a company, or you look at how the staff are operating, not just at what the model is doing. Whereas evaluations sometimes can be very structured, as I discussed with benchmarks, but can also be very exploratory, where you just give an expert a system and see what they can do in 10 hours.

Kevin Frazier: We know that testing AI is profoundly important. We know that testing any emerging technology is profoundly important. Can you talk to the difficulties, again at a high level, of testing AI? Folks may know this from prior podcasts: I've studied car regulations extensively, and it's pretty easy to test a car, right? You can just drive it into a wall and get a sense of whether or not it's going to protect passengers. What is it about AI that makes it so difficult to evaluate and test? Why can't we just run it into a wall?

Elliot Jones: Yeah, I think it's important to distinguish which kind of AI we're talking about. If it's narrow AI, like a chest x-ray system, we can actually see what results we're getting. It has a very specific purpose. We can actually test it against those purposes. And I think in that area, we can do the equivalent of running it into a wall and seeing what happens. What we decided to focus on in this report was foundation models, these very general systems, these large language models, which can do hundreds, thousands of different things. And they can be applied downstream in finance, in education, in healthcare, and because there are so many different settings, these things could be applied in so many different ways: people can fine-tune them, they can build their own applications on top of them.

I think the developers don't really know how the system is going to be used when they put it out in the world. And that's part of what makes them so difficult to actually assess, because you don't have a clear goal. You don't know exactly who's gonna be using it or how they're gonna be using it. And so when you start to think about testing, you're like, oh God, where do we even start? I think the other difficulty with some of these AI systems is that we actually just don't understand how they work on the inside. With a car, I think we have a pretty good idea how the combustion engine works, how the wheel works. If you ask a foundation model developer why the model gives the output it gives, they can't really tell you.

Kevin Frazier: So it's as if Henry Ford invented all at the same time a car, a helicopter, and a submarine, put it out for commercial distribution, and said, figure out what the risks are. Let's see how you're going to test that. And now we're left with this open question of what are the right mechanisms and methods to really identify those risks. So obviously, this is top of mind for regulators. Can you tell us a little bit more about the specific regulatory treatment of AI evaluations? And I guess we can just run through the big three: the U.S., the U.K., and the EU.

Elliot Jones: Yeah, so I guess I'll start with the EU, because I think they're the furthest along on this track in some ways. The European Union passed the European AI Act earlier this year. And as part of that, there are obligations around trying to assess some of these general purpose systems for systemic risk, to actually go in and find out how these systems are working, what they are going to do. And they've set up this European AI Office, which right now is consulting on its codes of practice, which are going to set out requirements for these companies that say, maybe you do need to evaluate for certain kinds of risks. So is this a system that might enable more cyber warfare? Is this a system that might enable systemic discrimination? Is this a system that might actually lead to over-reliance or concerns about critical infrastructure? So the European AI Office is already consulting on whether evaluation should become a requirement for companies.

I think in the U.S. and the U.K., things are both much more on a voluntary footing right now. The U.K. back in, when would it have been, November, set up its AI Safety Institute, and that has gone a long way in terms of voluntary evaluations. So it has been developing different evaluations, often with a national security focus around, say, cyber, bio, and other kinds of concerns you might have. But that has been much more on a voluntary footing of companies choosing to share their models with this British government institute. And then, and I'm not even really sure exactly how this plays out, the institute does these tests and has been publishing some of the results. But that's all very much on this voluntary footing. And there have been reports in the news that actually that's caused a bit of tension on both sides, because the companies don't know how much they're supposed to share or how much they want to share. They don't know if they're supposed to make changes when the U.K. says, look at this result. They're like, cool. What does that mean for us?

And I think the U.S. is in a pretty similar boat, maybe one step back, because the United States AI Safety Institute is still being set up. And so it's working with the U.K. AI Safety Institute, and I think they're working a lot together on these evaluations. But that's still much more on a voluntary footing: the companies choose to work with these institutes, they choose what to share, and then the government works with what it's got.

Kevin Frazier: So there are a ton of follow-up questions there. I mean, again, just for folks who are thinking at my speed, if we go back to a car example, right? Let's say the car manufacturers get to choose the test, or choose which wall they're running into, at which speed, and who's driving. All of a sudden, these tests could be slightly manipulated, which is problematic. So that's one question I want to dive into in a second.

But another big concern that comes to mind immediately is the companies running the tests themselves. If you had a car company, for example, controlling the crash test, that might raise some red flags about, well, do we know that they're doing this to the full extent possible? So you all spend a lot of time in the report diving into this question of who's actually doing the testing. So under those three regulatory regimes, am I correct in summarizing that it's still all on the companies, even in the EU, the U.K., and the U.S.?

Elliot Jones: So on the EU side, I think it's still yet to be seen. They haven't drafted these codes of practice yet; this kind of stuff hasn't gotten going. I think some of this will remain with the companies. In the act, there are a lot of obligations for companies to demonstrate that they are doing certain things, that they are in fact carrying out certain tests. But I'm pretty sure that the way the EU is going, there is also going to be a requirement for some kind of third-party assessment. This might take the form of the European AI Office itself carrying out some evaluations, going into companies and saying, give us access to your models, we're going to run some tests.

But I suspect that, similarly to how financial audits work, it's likely to be outsourced to a third party, where the European AI Office says, look, we think that these are reputable people. These are companies or organizations that are good at testing, that have the capabilities. We're going to ask them to go in and have a look at these companies and then publish those results and get a sense from there. It's a bit unclear how that relationship is going to work. Maybe the companies will be the ones choosing the third-party evaluators, in which case you still have some of these concerns and questions, though maybe a bit more transparency.

In the U.K. and U.S. case, some of this has been the government already getting involved. As I said earlier, the U.K. AI Safety Institute has actually got a great technical team. They've managed to pull in people from OpenAI, from DeepMind, other people with great technical backgrounds, and they're starting to build some of their own evaluations themselves and run some of those themselves. I think that's a really promising direction because, as you were mentioning earlier about companies choosing their own tests, for a benchmark, for example, if you've got the benchmark in front of you, you can also see the answers. So you're not just choosing what test to take; you've also got the answer sheet right in front of you. Whereas if you've got, say, the U.K. AI Safety Institute or the U.S. AI Safety Institute building their own evaluations, suddenly the companies don't know exactly what they're being tested against either. And that makes it much more difficult to manipulate and game that kind of system.

Kevin Frazier: And let's go into that critical question of the right talent to conduct these AI evaluations. I think something we've talked about from the outset is that this is not easy. We're still trying to figure out exactly how these models work, which evaluations are the best, which ones are actually going to detect risks, and all these questions, but key to that is actually recruiting and retaining AI experts. So is there any fear that we may start to see a shortage of folks who can run these tests? I mean, we know the U.S. has an AISI, the U.K. has an AISI, again, that's AI Safety Institute. South Korea, I believe, is developing one. France, I believe, is developing one. Well, all of a sudden we've got 14, 16, who knows how many AISIs out there. Are there enough folks to conduct these tests to begin with, or are we going to see some sort of sharing regime, do you think, between these different testers?

Elliot Jones: I'll tackle the sharing regime question first. So we are already starting to see that. For some of the most recent tests on Claude 3.5, where Anthropic shared early access to their system, they shared it with the U.S. and the U.K. AISIs, and they worked together on those tests. I think it was the U.S. AISI primarily getting that access from Anthropic, using the heft of the U.S. government basically to get the company to share those things, but leaning on the technical skills within the U.K. AISI to actually conduct those tests. And there's been an announced international network of AI safety institutes that's hopefully going to bring all of these together. And I expect that maybe in future we'll see some degree of specialization and knowledge sharing between all of these organizations. In the U.K., they've already built up a lot of talent around national security evaluations. I suspect we might see the United States AI Safety Institute looking more into, say, questions of systemic discrimination or more societal impacts. Each government is going to want to have its own capabilities in house to do this stuff. But I suspect that we will see that sharing, precisely because, as you identify, there are only so many people who can do this.

I think that's only a short-term consideration though, and it's partly because we've been relying a lot on people coming from the companies to do a lot of this work. But I think the existence of these AI safety institutes themselves will be a good training ground for more junior people who are coming into this, who want to learn how to evaluate these systems, who want to get across these things, but don't necessarily want to join a company. Maybe they'll come from academia and go to these AISIs instead of joining a DeepMind or an OpenAI. And I think that might ease the bottleneck in future. And, as I was saying earlier about these third-party auditors and evaluators, I suspect we might see some staff from these AI safety institutes going off and founding them, and growing that ecosystem to provide those services over time.

Kevin Frazier: When folks go to buy a car, especially if they have kids or dogs or any other loved ones, for all the bunny owners out there, you pick your pet, they always wanna check the crash safety rating. But as things stand right now, it sounds as though some of these models are being released without any required testing. So you've mentioned a couple times these codes of practice that the EU is developing. Do we have any sort of estimate on when those are going to be released and when testing may come online?

Elliot Jones: Yeah. So I think we're already starting to see them being drafted right now. I think that over the course of the rest of the summer and the autumn, the EU is going to be starting to create working groups to work through each of the sections of the Code of Practice. I think we're expecting it to wrap up around next April, so by the spring of next year we'll be starting to see at least the first iteration of what these codes of practice look like. But that's only when the codes of practice are published. When we see them actually being implemented, when we see companies taking steps on these questions, is another matter. Maybe they'll get ahead of the game. Maybe they'll see this coming down the track and start to move in that direction. A lot of these companies are going to be involved in this consultation, in this process of deciding what's in the Code of Practice. But equally, the codes could get published and it could take a while before we actually see the consequences of that.

Kevin Frazier: April of next year. I'm by no means a technical AI expert, but I venture to guess the amount of progress that can be made in the next eight months can be pretty dang substantial. So that's quite the time horizon. Thankfully though, as you mentioned, we've already seen in some instances compliance with the U.K. AISI's testing, for example, but you mentioned that some labs maybe are a little hesitant to participate in that testing. So can you detail that a little bit further, about why labs may not be participating to the full extent, or may be a little hesitant to do so?

Elliot Jones: Yeah, it's not quite clear which labs have been sharing and not sharing. I know that Anthropic has, because they said so when they published Claude 3.5. As to the others, it's kind of unclear. There's a certain opaqueness on both sides about exactly who is involved. But as to why they might be a bit concerned, I think there are some legitimate reasons, questions around, say, commercial sensitivities. If you're actually evaluating these systems, then that means you probably need to get quite a lot of access to these systems. And if you're Meta and you're publishing Llama 300 billion just out on the web, maybe you're not so worried about that; you're putting all the weights out there and just seeing how things go. But if you're an OpenAI or a DeepMind or an Anthropic, that's a big part of your value. If someone leaked all of the GPT-4 weights onto the internet, that would be a real, real hit to OpenAI. So I think there are legitimate security concerns they have around this sharing.

I think there's also another issue where, because this is a voluntary regime, if you choose to share your model and the AI Safety Institute says it's got all these problems, but someone else doesn't share theirs, then that just makes you look bad, because you've exposed all the issues with your system, even though you probably know that the other providers have the same problems too. Because you're the one who stepped forward, actually given access, and let your system be evaluated, it's only your problems that get exposed. So I think that's another issue with the voluntary regime: if it's not everyone involved, then that kind of disincentivizes anyone getting involved.

Kevin Frazier: Oh, good old collective action problems. We see them yet again, and almost always in the most critical situations. So speaking of critical situations, I'll switch to critical harm. Critical harm is the focus of SB 1047. That is the leading AI proposal in the California state legislature that, as of now, this is August 12th, is still under consideration. And under that bill, labs would be responsible for identifying or making reasonable assurances that their models would not lead to critical harm, such as mass casualties or cybersecurity attacks that generate harms in excess of, I believe, 500 million dollars. So when you think about that kind of evaluation, is that possible? How do we know that these sorts of critical harms aren't going to manifest from some sort of open model, or even something that's closed, like Anthropic's models or OpenAI's models?

Elliot Jones: I think with the tests we currently have, we just don't know. I think the problem is that, I guess, there's a step one of trying to even create evaluations of some of these critical harms. There are some evaluations out there, like the Weapons of Mass Destruction Proxy benchmark, which tries to assess, using multiple choice questions, whether or not a system has knowledge of biosecurity concerns, cybersecurity concerns, chemical security concerns, things that maybe could lead down the track to some kind of harm. But that is, as it says, very much just a proxy. The system having knowledge of something doesn't tell you whether or not it's actually increasing the risk or chance of those events occurring.
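
For readers who want a mechanical sense of what a multiple-choice proxy benchmark like this measures, here is a small illustrative sketch. The item and the query_model callable are hypothetical placeholders, not questions from the actual WMDP dataset.

```python
# Score a model on multiple-choice "knowledge proxy" items by comparing the
# option letter it picks against an answer key. A high score only indicates
# stored knowledge, not that the model meaningfully helps someone cause harm.

from typing import Callable

ITEMS = [
    # (question, options, correct letter): toy placeholders only
    ("Which of these gases is inert?",
     {"A": "Oxygen", "B": "Argon", "C": "Chlorine"}, "B"),
]

def score_multiple_choice(query_model: Callable[[str], str]) -> float:
    correct = 0
    for question, options, key in ITEMS:
        prompt = question + "\n" + "\n".join(f"{k}. {v}" for k, v in options.items())
        reply = query_model(prompt).strip().upper()
        if reply.startswith(key):
            correct += 1
    return correct / len(ITEMS)

print(score_multiple_choice(lambda prompt: "B"))  # prints 1.0 for the toy key
```

The gap between scoring well on items like these and actually enabling real-world harm is exactly the external validity problem discussed next.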

So I think that on one level, there's just a generalization problem, or a kind of external validity problem. The tests can do what they need to do: they can tell you whether the system has stored that knowledge. But translating that, whether the system has stored knowledge or not, into whether someone can take that knowledge, apply it, and use it to create a mass casualty event, I just don't think we have that knowledge at all. And I think this is where, in the report, we talk about pairing evaluation with post-market monitoring and incident reporting. And I think that's a key step to be able to do this kind of assessment, of saying, okay, when we evaluated the system beforehand, we saw these kinds of properties. We saw that it had this kind of knowledge. We saw it had this kind of behavior. And at the other end, once it was released into the world, we saw these kinds of outcomes occur.

And hopefully that would come long before any kind of mass casualty event or really serious event. But you might be able to start matching up results on, say, this proxy benchmark with an increased chance of people using these systems to create these kinds of harm. So I think that's one kind of issue. But right now, I don't think we have that historical data of seeing how the tests before the system is released match up to behaviors and actions after the system is released.

Kevin Frazier: As you pointed out earlier, usually when we think about testing for safety and risks, again, let's just go to a car example. If you fail your driving test, then you don't get to drive. Or if you fail a specific aspect of that test, let's say parallel parking, which we all know is just way too hard when you're 15 or 16, then you go and you practice parallel parking. What does the report say on this question of follow-up aspects of testing? Because it's hard to say that there's necessarily a whole lot of benefit to testing for the sake of testing. What sort of add-ons or follow-up mechanisms should we see after testing is done?

Elliot Jones: Yeah, I guess there's a range of different things you might want to see a company do. I think for some tests where you see somewhat biased behavior or somewhat biased outputs from a system, maybe all that means is that you need to look back at your dataset, at your training data, and say, okay, it's underrepresenting these groups. It's not including, say, African Americans or African American perspectives as much. So we need to add some more of that data into the training. And maybe that can fix the problem that you've identified. That can go some way to actually resolving that issue. So there is some stuff you can do that's just, as you're training the model, as you're testing it, adjusting it and making sure those fixes get added in.

A second step you can take is that you might find it's actually very difficult to fine-tune out some of these problems, but that there are just certain kinds of prompts into a system, say someone asking how they would build a bomb in their basement, where you can just build a safety filter on top that says, if someone asks this kind of question of the system, let's just not answer it. Your evaluation tells you there is this harmful information inside the model, and you can't necessarily completely get rid of it, especially if doing so is going to really damage the performance, but you can put guardrails around the system that make that information inaccessible or make it very hard for a user to get at. And similarly, you might want to monitor what the outputs of the model are; if you start seeing it mention how to build a bomb, then you might just want to cut that off and either ban the user or prevent the model from completing its output. I think we get into slightly trickier ground, and areas where I think companies haven't been so willing to act, on delaying deployment of a model or even restricting access to the model completely and deciding not to publish it.
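
As a rough illustration of the guardrail idea described here, below is a minimal sketch of wrapping a model with input and output filters. The blocklist, the refusal message, and the generate callable are illustrative assumptions, not any lab's actual safety stack, which would typically rely on trained classifiers rather than keyword matching.

```python
# Wrap a hypothetical text-generation call with two checks: screen the prompt
# before it reaches the model, and screen the output before it reaches the user.

from typing import Callable

BLOCKED_PHRASES = ["build a bomb"]  # illustrative only; real systems use classifiers
REFUSAL = "Sorry, I can't help with that."

def is_disallowed(text: str) -> bool:
    lowered = text.lower()
    return any(phrase in lowered for phrase in BLOCKED_PHRASES)

def guarded_generate(prompt: str, generate: Callable[[str], str]) -> str:
    if is_disallowed(prompt):
        return REFUSAL               # refuse at the input stage
    output = generate(prompt)
    if is_disallowed(output):
        return REFUSAL               # cut off a harmful completion
    return output

# Toy usage with a stand-in "model" that just echoes or answers the prompt.
print(guarded_generate("How would I build a bomb?", lambda p: p))        # refusal
print(guarded_generate("What's the capital of France?", lambda p: "Paris"))
```

The point is the same one made above: the harmful knowledge may still sit inside the model, but guardrails around the system can make it much harder for a user to reach.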

I think one example of this is that OpenAI had a voice cloning model, a very, very powerful system that could generate very realistic sounding voice audio, and they decided not to release it. And I think that's actually quite admirable, to say, we did some evaluations, we discovered that this system could actually be used for, say, mass spear phishing. Think about getting a call from your grandparents saying, oh, I'm really in trouble, I really need your help, and it's just not them. Imagine that capability being everywhere. That's something really dangerous, and they decided not to release it. But equally, I suspect that as there are more and more commercial pressures, as these companies are competing with each other, there's going to be increasing pressure to say, this system is a bit dangerous, maybe there are some risks, maybe there are some problems, but we spent a billion dollars training the system, so we need to get that money back somehow, and so we're going to push ahead with deploying it. And so I think those are the kinds of steps a company might take that are going to get a bit more tricky: not just putting guardrails around it or tweaking it a bit, but actually saying, we've built something that we shouldn't release.

Kevin Frazier: I feel as though that pressure to release regardless of the outcomes is only going to increase as we hear more and more reports about these labs having questions around revenue and profitability. And as those questions maybe persist, that pressure is only going to grow. So that's quite concerning. And I guess I also want to dive a little bit deeper into the actual costs of testing. When we talk about crashing a car, you only have to take one car. Let's say that's between 20 grand and 70 grand, or for all those Ferrari drivers out there, we've got a half-a-million-dollar car or something that you're slamming into a wall. With respect to doing an evaluation of an AI model, what are the actual costs of doing that? Do we have a dollar range on what it takes to test these different models?

Elliot Jones: To be perfectly honest, I don't have that. I don't know the amounts. I think the closest I've seen is that Anthropic talks about when they were implementing one of these benchmarks. Even this off-the-shelf, publicly available, widely used benchmark still required a few engineers spending a couple months of time working on implementing it. And that's for something where they don't have to come up with a benchmark themselves. They don't have to come up with anything new. It's just taking something off the shelf and actually applying it to their system. And so I can imagine a few engineers and a couple months of time, and they pay their engineers a lot, so that's going to be in the hundreds of thousands of dollars range, let alone the cost of compute of running the model across all of these different prompts and outputs. And that was just for one benchmark. And many of these systems are tested on lots of different benchmarks. There's lots of red teaming involved. When, say, a company like OpenAI is doing red teaming, they're often hiring tens or hundreds of domain experts to try and really test the capabilities these systems have, and I can imagine they're not cheap either. So I don't have a good dollar amount, but I imagine it's pretty expensive.

Kevin Frazier: I think it's really important to have a robust conversation about those costs so that all stakeholders know, okay, maybe it does make sense. If you're an AI lab and now you have 14 different AI safety institutes demanding you adhere to 14 different evaluations, that's a lot of money. That's a lot of time. That's a lot of resources. Who should have to bear those costs is an interesting question that I feel merits quite a robust debate.

Elliot, we've gotten quite the overview of the difficulty of conducting evaluations, of the possibility of conducting audits, and then in some cases instituting benchmarks. One question I have is how concerned we should be about the possibility of audit washing. This is the phenomenon we've seen in other contexts where a standard is developed or a certification is created, and folks say, you know, we took this climate pledge or we signed this human rights agreement, and so now you don't need to worry about this product. Everything's good to go. Don't ask any questions. Keep using it. It'll be fine. Are you all concerned about that possibility in an AI context?

Elliot Jones: Yes, I'm definitely concerned about that. I think the one thing we'd really want to emphasize is that evaluations are necessary. You really have to go in and look at your system. But given the current state of play of this quite nascent field, these evaluations are only ever going to be indicative. They're only ever going to be, here are the kinds of things you should be thinking about or worrying about. With the current evaluations, you should not ever say, look, we did these four tests and it's fine. Partly because, as we discussed before, we haven't actually seen these in the real world long enough to know what those kinds of consequences are going to be. And without that kind of follow-up, without that post-market monitoring, without that incident reporting, I would really not want anyone to say this is a stamp of approval just because they passed a few evaluations.

Kevin Frazier: Thinking about the report itself, you all, like I said, did tremendous work. This is a thorough research document. Can you walk us through that process a little bit more? Who did you all consult? How long did this take?

Elliot Jones: Yeah, sure. This was quite a difficult topic to tackle in some ways, because a lot of this, as a quite nascent field, is held in the minds of people working directly on these topics. So we started off this process, between January and March this year, by talking to a bunch of experts: some people working at foundation model developers, some people working at third-party auditors and evaluators, people working in government, academics, all working in these fields, to just try and get a sense from them, people who have hands-on experience of running evaluations, of seeing how hard they are to do in practice, of repeating those things and seeing whether they actually play out in real life. So a lot of this work is based on just trying to talk to people who are at the coalface of evaluation and getting a sense of what they were doing. As to exactly who, that's a slightly difficult topic. I think because this is quite a sensitive area, a lot of people wanted to be off the record when talking about this, but we did try and cover a fairly broad range of developers, of assessors, of these kinds of organizations.

Alongside that, we did our own deep-dive literature review. There is some great survey work out there. Laura Weidinger at DeepMind has done some great work mapping out the space of sociotechnical risks and the evaluations there. And so we drew on some of these existing survey papers and did our own survey of different kinds of evaluation. We worked with William Agnew as our technical consultant, who has a bit more of a computer science background, so he could get into the nitty gritty of some of these more technical questions. So we tried to marry that on-the-ground knowledge from people with what was out there in the academic literature.

I would say this is just a snapshot. This took us about six months, and I think some of the things we wrote are essentially already out of date. Some of the work we did looked at where evaluations are at and what the coverage is, and people are publishing new evaluations every week. So this is definitely just a snapshot, but yeah, we tried to marry the academic literature with speaking to people on the ground.

Kevin Frazier: So we know that other countries, states, and regulatory authorities are going to lean more and more on these sorts of evaluations, and they already are to a pretty high extent. From this report, would you encourage a little more regulatory humility among current AI regulators, to maybe put less emphasis on testing, or at least put less weight on what testing necessarily means at this point in time?

Elliot Jones: To a degree, I think it depends what you want to use these for. I think in our report we try and break down kind of three different ways you might use evaluations as a tool. One is a kind of almost future scoping slash what is going to come down the road, just giving you a general sense of the risks, what to prioritize, what to look out for.

I think for that, evaluations are really useful. I think that they can give you a good sense of maybe the cybersecurity concerns a model might have, maybe some of the bio concerns. They can't tell you exactly what harm it's going to cause, but they can give you a directional sense of where to look. I think another way in which current evaluations can already be useful is if you're doing an investigation. If you're a regulator and you're looking at a very specific model, say you want to look at ChatGPT in May 2024, and you're concerned about how it's representing certain different groups, or how it's being used in recruitment, say you're thinking about how this system is going to view different CVs and what comments it's going to give about a CV depending on different names. You can do those tests really well if you want to test for that kind of bias. I think actually we're already there, and it can be a very useful tool for a regulator to assess these systems. But I think you have to have that degree of specificity, because the results of evaluations change so much just based on small changes in the system and based on small changes in context. Unless you have a really clear view of exactly what concern you have, they're not going to be the most useful.

The third way you might use them is this kind of safety sign-off, to say, this system is perfectly fine, here's our stamp of approval. We are definitely not there. And if I were a regulator right now: one thing we actually did hear from companies, from academics, and from others is that they would love regulators to tell them what evaluations they need for that. I think a big problem is that there hasn't actually been that conversation about what kinds of tests you would need to do that regulators care about, that the public cares about, that are going to test the things people want to know, and what is actually going to get built. And absent that guidance, industry and academia are just going to pursue what they find most interesting or what they care about the most. So I think right now it's incumbent on regulators, on policymakers, to say, here are the things we care about, here's what we want you to build tests for. And then maybe further down the line, once those tests have been developed, once we have a better sense of the science of evaluations, then we can start thinking about using them for that third category.

Kevin Frazier: And my hope, and please answer this in a favorable way, have you seen any regulators say, oh my gosh, thank you for this great report. We're going to respond to this and we will get back to you with an updated approach to evaluations. Has that occurred? What's been the response to this report so far?

Elliot Jones: I don't want to mention anyone by name, I feel like it'd be a bit unfair to do that here, but yeah, I think it's generally been pretty favorable. I think that actually a lot of what we're saying has been in the air already. As I said, we spoke to a lot of people working on this, already thinking about this. And part of our endeavor here was to try and bring together conversations people are already having, discussions already happening, but in a very comprehensible and public-facing format. And I think the regulators were already taking, and are taking, these kinds of questions seriously.

I think one difficulty is a question of regulatory capacity. Regulators are being asked to do a lot in these different fields. If I take the European AI Office, for example, they've got, I think, maybe less than a hundred people now for such a massive domain. And so one question is just that they have to prioritize; they have to try and cover so many different things. And so without more resources going into that area, and that is always going to be a political question of what you prioritize and where you choose to spend the money, it's just going to be difficult for regulators to have the time and mental space to deal with some of these issues.

Kevin Frazier: And that's a fascinating one too, because if we see this constraint on regulatory capacity, I'm left wondering, okay, let's imagine I'm a smaller lab or an upstart lab. Where do I get placed in the testing order, right? Is OpenAI going to jump to the top of the queue and get that evaluation done faster? Do I have the resources to pay for these evaluations if I'm a smaller lab? So really interesting questions when we bring in that big I word, as I call it, the innovation word, which seems to dominate a lot of AI conversations these days. So at the Institute, you all have quite an expansive agenda and a lot of smart folks. Should we expect a follow-up report in the coming months, or are you all moving on to a different topic, or what's the plan?

Elliot Jones: Yeah, I think partly we're wanting to see how this plays out, wanting to see how this field moves along. I think one question that we are thinking about quite a lot is this question of third-party auditing and third-party evaluation: how does this space grow? As we mentioned briefly in the report, there is currently a lack of access for these evaluators, a lack of ability for them to get access to these systems, especially on their own terms rather than on the terms of the companies. There's a lack of standardization. If you are someone shopping around as a smaller lab or a startup for evaluation services, it's a bit opaque to you on the outside who is going to be doing good evaluations, who does good work, and who is trying to sell you snake oil. And so one thing we're really thinking about is how you create this auditing market so that it works for people on both sides, so you as the lab know you're buying a good service that regulators will trust.

But also you as a consumer, when you're thinking about using an AI product, you can look at it and say, oh, it was evaluated by these people. I know that someone has certified them, that someone has said these people are up to snuff and they're going to do a good job. And so I think that's one thing we're really thinking about: how do you build up this market so that it's not just reliant on regulatory capacity? Because while that might be good in the short term for some of the biggest companies, it is just not going to be sustainable in the long term for government to be paying for and running all of these evaluations for everyone, if AI is as big as some people think it will be.

Kevin Frazier: And thinking about some of those prospective questions that you all may dig into, and just the scope and scale of this report: on the off chance that not all listeners go read every single page, is there anything we've missed that you want to make sure you highlight for our listeners?

Elliot Jones: I think one other thing I do want to bring up is the lack of involvement of affected communities in all of this. We asked almost everyone we spoke to, do you involve affected communities in your evaluations? And basically everyone said no. And I think this is a real problem. As I mentioned before about what regulators want and what the public wants in these questions, actually deciding what risks we need to evaluate for, and also what an acceptable level of risk is, is something we don't want left just to the developers, or even just to a few people in a government office. It's something we want to involve everyone in deciding. There are real benefits to these systems. These systems are actually enabling new and interesting ways of working, new and interesting ways of doing things, but they have real harms too.

And we need to actually engage people, especially those most marginalized in our society, in that question and say, what is the risk you're willing to take on? What is an acceptable evaluation mark for this kind of work? And that can happen at multiple stages. That can be in actually doing the evaluations themselves: if there is a very diverse group of people red teaming a model, trying to pick it apart, have you got them involved at the goal-setting stage? At the product stage, when you're about to launch something into the world, are you making sure that it actually does involve everyone who might be subject to it? If you're thinking about using a large language model in recruitment, have you got a diverse panel of people assessing that system and understanding whether it's going to hurt people from ethnic minority backgrounds, whether it's going to affect women in different ways? So I think that's a really important point that I just want everyone to take away. I would love to see much more work on how you bring people into the evaluation process, because that's something we just really didn't find at all.

Kevin Frazier: Okay, well Elliot, you've got a lot of work to do, so I'm gonna have to leave it there so you can get back to it. Thanks so much for joining.

Elliot Jones: Thanks so much.

Kevin Frazier: The Lawfare Podcast is produced in cooperation with the Brookings Institution. You can get ad-free versions of this and other Lawfare podcasts by becoming a Lawfare material supporter through our website, lawfaremedia.org/support. You'll also get access to special events and other content available only to our supporters.

Please rate and review us wherever you get your podcasts. Look out for our other podcasts, including Rational Security, Chatter, Allies, and the Aftermath, our latest Lawfare Presents podcast series on the government's response to January 6th. Check out our written work at lawfaremedia.org. The podcast is edited by Jen Patja. Our theme song is from Alibi Music. As always, thank you for listening.


Kevin Frazier is an Assistant Professor at St. Thomas University College of Law and Senior Research Fellow in the Constitutional Studies Program at the University of Texas at Austin. He is writing for Lawfare as a Tarbell Fellow.
Elliot Jones is a Senior Researcher at the Ada Lovelace Institute.
Jen Patja is the editor and producer of the Lawfare Podcast and Rational Security. She currently serves as the Co-Executive Director of Virginia Civics, a nonprofit organization that empowers the next generation of leaders in Virginia by promoting constitutional literacy, critical thinking, and civic engagement. She is the former Deputy Director of the Robert H. Smith Center for the Constitution at James Madison's Montpelier and has been a freelance editor for over 20 years.
