We talk a lot about the ethical duty of lawyers and legal professionals to understand the risks and benefits of relevant technology. But when it comes to using GenAI, that might not be enough. If we want to prevent the increasing number of hallucinations and inaccurate citations that are bedeviling lawyers and even judges, we need to understand how and why GenAI systems fail.
That was the point of a recent paper by a group of scientists and engineers: Dylan Restrepo, Nicholas Restrepo, Frank Huo, and Neil Johnson. The paper carries the lengthy title When AI Output Tips to Bad but Nobody Notices: Legal Implications of AI’s Mistakes. In addition to their own calculations and analysis, the group consulted a couple of lawyers: Daniela Restrepo and Jean Paul Roekaert. I can’t vouch for the mathematical calculations, but what the authors conclude squares with my own experience.
The Basic Premise
The group concludes at the outset that hallucination is not a random, unpredictable glitch: a physics-based analysis demonstrates that it is a “foreseeable engineering risk.” Meaning, of course, that the circumstances in which it occurs are at least somewhat predictable.
According to the paper, a GenAI system has “a deterministic mechanism at its core that can cause output to flip from reliable to fabricated at a calculable step.” And that step, unfortunately, comes when the lawyer’s need is greatest.
The group’s analysis starts from a proposition we should all know by now: GenAI is “a probabilistic text generator engineered to predict the next most plausible token in a sequence, without any internal concept of legal truth.” It is not, the group argues, a database of verified legal authorities. (The group focused on the publicly available systems, not on the closed systems that claim to rely on verified legal authorities.)
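To make the “probabilistic text generator” point concrete, here is a minimal sketch in Python. It is my own toy illustration, not the paper’s physics model and not how any real system is built: the prompts, probabilities, and case names are all invented, and the fabricated citations are labeled as such. It shows only that the generation step ranks continuations by plausibility and never checks them for truth.

    import random

    # Toy "model": a lookup table of next-token probabilities (all numbers invented).
    NEXT_TOKEN_PROBS = {
        # Well-covered prompt: the most plausible continuation is also the correct one.
        "The capital of France is": {"Paris": 0.97, "Lyon": 0.02, "Marseille": 0.01},
        # Sparse prompt: the model still ranks continuations by plausibility,
        # and the top candidates here are deliberately fabricated case names.
        "The leading case on this novel issue is": {
            "Smith v. Jones (fabricated)": 0.45,
            "Doe v. Acme Corp. (fabricated)": 0.35,
            "unclear; there may be no controlling authority": 0.20,
        },
    }

    def next_token(prompt: str) -> str:
        """Pick a continuation by plausibility alone; nothing here checks whether it is true."""
        options = NEXT_TOKEN_PROBS[prompt]
        return random.choices(list(options), weights=list(options.values()), k=1)[0]

    for prompt in NEXT_TOKEN_PROBS:
        print(prompt, "->", next_token(prompt))

Run it a few times and the sparse prompt will confidently return one made-up case or another. That, in miniature, is the failure mode the paper describes.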
What This Means
Because it’s predicting, not analyzing, GenAI does well when faced with inquiries about valid legal principles, logical-sounding arguments, undisputed case facts, procedural history, and the like. But when faced with something novel and complex, the tool is pushed “into a region where training data is sparse.” In an effort to please and respond, it is then prone to, well, make stuff up.
The paper puts it this way:
The tool is therefore most prone to failure exactly when the lawyer’s need is greatest: on a difficult point of law with sparse precedent. The act of researching an unsettled legal issue via an LLM becomes the principal trigger for the tipping instability.
These are important points, since lawyers live in a world where a hallucination, an error, can have devastating consequences. So, as we have discussed, given that risk, GenAI outputs must be checked over and over, which often erodes the cost savings of using the tools in the first place. But if we understand why the errors occur and, more importantly, when, we can use the tools better and more safely.
A Blessing…And a Curse
If true, the group’s findings are a blessing, since they suggest a sliding scale of verification: less where the output covers well-known information, much more when it strays into the novel. That saves time and energy.
But for those unaware of this predictability, the fact that failure tends to occur at a certain point can be a curse. Why? A lawyer with a legal project often starts with undisputed facts, then seeks information on what the law generally is with respect to the issues at hand, and only then moves into more complex, ambiguous areas, assuming the outputs will stay just as reliable.
The example given in the paper is a statute of limitations question. A lawyer starts by plugging undisputed facts into ChatGPT, then asks for the general law on the limitation period. All well and good: the lawyer gets correct responses and, in the words of the paper, “gains confidence in the tool.” So, the lawyer begins asking more ambiguous questions about how that law applies to the facts or can be used to develop arguments.
If the lawyer takes all the outputs and prepares a brief based on them, they (or their supervisor) might be tempted to spot-check the first few paragraphs, find nothing amiss, and, when pressed for time, conclude that the rest of the outputs are also fine when they are not.
So, the blessing becomes a curse: “AI’s period of correct output increases rather than decreases the risk of harm, because it builds the user’s trust just before the fabrication appears.”
What To Do
So, what do we make of all this? Again, I’m no scientist, but I do know from experience that the more general the information I seek from GenAI, the more likely it is to be correct. The further I stray into ambiguous areas where less is known about a subject, the more errors I tend to get.
For example, I once asked for information about a well-known painter. I got great information. But when I asked about another, relatively obscure painter in the same school, the tool just made up a name. Or when I asked which subway stop to use to catch the Q70 bus to LaGuardia Airport, it got it right. When I asked for the best route from my hotel (a more ambiguous question), it sent me to the wrong stop. It did say sorry when I pointed out the error (after some argument).
The point for lawyers and legal professionals is to understand that “AI possesses no independent legal agency: it is a computational tool.” Granted, it is a computational tool with which you can converse as if it were human. It reacts in human ways. It’s tempting to anthropomorphize it.
But that’s where we go wrong. We need to start thinking of it not as a person but as a product with a foreseeable engineering risk, like a sharp knife or an ATV. And according to the paper, that risk materializes precisely when the tool is faced with novelty and ambiguity.
For lawyers, that means if you are going to use this sharp knife, you had better know how, and in what circumstances, to use it safely.
The paper says it best: “The duty of technological competence, as expressed in ABA Model Rule 1.1 and its state-level counterparts, must evolve. It is no longer sufficient for a lawyer to know how to operate a piece of software. Competence now requires a practical understanding of how that software can fail.” About that, the paper is clearly right.
Want to use GenAI? Use it to access known information that would be time-consuming or difficult to get otherwise. Ask it to do things where accuracy isn’t that important. But don’t ask it novel or unsettled legal questions without checking and double-checking what you get back. Otherwise, you might get off at the wrong subway stop.
Or much worse.
Stephen Embry is a lawyer, speaker, blogger, and writer. He publishes TechLaw Crossroads, a blog devoted to the examination of the tension between technology, the law, and the practice of law.