Aug 15, 2025 | 30 min | 291 views

Joe Cheng - Keeping LLMs in Their Lane: Focused AI for Data Science and Research | SciPy 2025

LLMs are powerful, flexible, easy-to-use... and often wrong. This is a dangerous combination, especially for data analysis and scientific research, where correctness and reproducibility are core requirements. Fortunately, it turns out that by carefully applying LLMs to narrower use cases, we can turn them into surprisingly reliable assistants that accelerate and enhance, rather than undermine, scientific work.

This is not just theory—I’ll showcase working examples of seamlessly integrating LLMs into analytic workflows, helping data scientists build interactive, intelligent applications without needing to be web developers. You’ll see firsthand how keeping LLMs focused lets us leverage their "intelligence" in a way that’s practical, rigorous, and reproducible.

Transcript

This transcript was generated automatically and may contain errors.

Continuing on with our machine learning explainable AI data science track, we are keeping in the theme of kind of containing LLMs and so please give Joe Cheng a warm welcome.

Thanks. This has been a really great track so far. I actually feel like this will be a really nice complement to Hugo's talk, if you just got to see that. Although there may be fewer f-bombs in this one. Although, fuck, fuck, fuck. Okay, now we're even. And fuck, now I'm up by one. So when I talk about LLMs today, I'm not talking about ChatGPT. I'm not talking about Copilot. I think like Hugo, I'm here to talk about more DIY LLMs. So building custom things that are designed for a more specific purpose. And if you are in this room and you are curious about LLMs but you have not built these kinds of things yourself, we have a package called chatlas that makes it very easy to get started building things using LLMs.

We've tried to make this as simple as possible to get started. So this is all you need to ask your first question to, in this case, Claude 3.7 Sonnet. And if you want to proceed to building agents, you can do that too. There's all sorts of nice little bells and whistles that I'm not really going to get into today because the focus of today is going to be slightly different. But I do want to make it clear that if you are in this room, if you know Python, you are ready to harness LLMs to create advanced agents and you could do this today. But the question that I have is, should you? Like, is this actually a good idea?

The prime directive: trust and responsibility in data science

And to explain what this means to me, I work at Posit Software, PBC. I was the first employee 16 years ago. This company means a lot to me and it is a public benefit corp. And if you're not familiar with what that means, you can think of it as sort of blending elements of nonprofit and for-profit organizations. And in order to be a public benefit corp, like a nonprofit, you need to have, as part of your corporate charter, a mission, a public benefit. And for us, this is abridged, but it's to create open source software for data science, scientific research, and technical communication.

And for us, that's in the form of Python packages, tools that are for data science, for research, and for communication. And I want to focus in on data science and scientific research specifically. I'm going to say data science a bunch of times today. And whether you self-identify as a data scientist or not, if you are using code to understand data, then for the purposes of this talk, please consider yourself a data scientist.

And when we started this company, because of that focus on data science and scientific research, there are some principles that are core to who we are as a company. I don't expect you to read this, but in this book by John M. Chambers, he says, those who receive the results of modern data analysis have limited opportunity to verify the results by direct observation. Users of the analysis have no option but to trust the analysis, and by extension, the software that produced it. Meaning software that creates answers from data, it's not always obvious whether the results are correct. Both the data analyst, that's most of you presumably, and the software provider, that's me, therefore have a strong responsibility to produce a result that is trustworthy and, if possible, one that can be shown to be trustworthy. This obligation, I label the prime directive. That's how important he thought it was that he called it the prime directive, and that's a directive that we adopted as a company as well.

And because we know that people are using our tools, our open source tools, for healthcare, for drug development, to influence public policy, epidemiology, therefore we know that the stakes are high. And this is a comment made by a software engineer at Posit: "I'm aware that if I make a mistake, bad things happen, death and other things."

And, like, thinking concretely about what that means, fulfilling the prime directive, obviously we need our results to be correct. We want the methods of our analysis to be transparent, meaning they can be inspected, and the analysis needs to be reproducible. And this being SciPy, I don't need to explain why these are important, but I just want to make it clear this is where we are coming from as a company.

Why LLMs are a terrible fit — on paper

When you combine these things, like, this to us, not just... These aren't best practices or recommendations. These are moral and ethical obligations. Combining them with LLMs, like, what are we even talking about? Like, on paper, this is a terrible idea, right? When it comes to correctness, these LLMs are infamous for giving convincing but wrong answers. Nobody understands why these things do what they do. Even the people who come up with the architectures, even the people who train them, this is an empirical art, right? And they are inherently non-deterministic. Even the ones that let you set a random seed, the providers are all careful to say you are not guaranteed to get the same results because of changes in hardware over time and software configuration and things like that. So there is no reproducibility here.

So given this sort of clash of these important values and these sort of really terrible qualities of LLMs... And we've all seen this. Like, we've all seen... Here, draw an intricate piece of ASCII art. Here's an intricate piece of ASCII art of a wolf. Right? This is GPT-4o mini. And the interesting thing is, like, these LLMs, they're not bad. I mean, especially, like, the latest models, the latest and greatest models, they're not bad. They're just jagged, right? Like, you might expect that an LLM would be like this, where there are easy tasks that they really get right almost all the time. And as the tasks get harder and harder, they get worse and worse. And if you've used LLMs with any regularity, you know that this is not the world that we live in. LLMs are like this. Like, there are easy things that they are awesome at, and there are easy things that they are surprisingly terrible at. And then there are hard things that they excel at, and then hard things that are almost the same as the thing that they excel at that they are terrible at.

And, like, two points that I would highlight here is, like, they're really good at coding, and coding is super hard. I mean, we kind of take it for granted now because it's been, like, a year and a half. But especially the latest models, what they can do is jaw-dropping, if you stop and think about it. And yet, the very simplest of data tasks, they will fall on their face. And if you are not convinced by that, I will show you.

This is an example that I ran yesterday, just to be sure. So let's start with an array of random numbers using NumPy of some length n. We're going to vary this n over time. Okay? We're going to fire up chatlas and ask GPT-4.1, which is OpenAI's latest and greatest general purpose model, how long is this array? And then we give it a JSON representation. I mean, what could be easier than len, right? And if we give it a length 10 array, it says 10. That's actually correct. Okay, good job. At 100, it gets it correct. At 1,000, it gets it correct. At 10,000, it gets it wrong. It says it's 1,000. Why there? Who knows? And this is actually, as disappointing as this is, this is already cherry-picking to make the model look good because I picked pretty round numbers. If I change it to 103, the LLM says 100. So we're talking about software that cannot tell you the length of an array. Like, what business do we have using this technology in data science?
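The experiment described above can be sketched as a small harness. This is a minimal sketch, not the code shown in the talk: it uses the standard library's `random` in place of NumPy so it runs anywhere, and `ask_model` is a hypothetical callable standing in for whatever chat client you use. The reply in `canned_model` exists only to exercise the checker; it is not a recorded model response.

```python
import json
import random
import re


def build_prompt(n, seed=0):
    """Return (prompt, values) for an array of n random floats serialized as JSON."""
    rng = random.Random(seed)
    values = [round(rng.random(), 3) for _ in range(n)]
    prompt = "How long is this array?\n" + json.dumps(values)
    return prompt, values


def parse_length(reply):
    """Extract the first integer from a free-text reply, allowing '1,000' style."""
    match = re.search(r"\d[\d,]*", reply)
    if match is None:
        return None
    return int(match.group().replace(",", ""))


def run_trial(ask_model, n):
    """Ask the model for the array length and compare it with len()."""
    prompt, values = build_prompt(n)
    claimed = parse_length(ask_model(prompt))
    return {"n": n, "claimed": claimed, "correct": claimed == len(values)}


def canned_model(prompt):
    # Hypothetical stand-in for a real LLM call; always gives the same answer.
    return "The array has 1,000 elements."


if __name__ == "__main__":
    for n in (10, 100, 103, 1000, 10000):
        print(run_trial(canned_model, n))
```

Swapping `canned_model` for a function that calls a real model turns this into the length test from the talk. The ground truth always comes from `len()`, never from the model.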

Query chat: keeping the LLM in its lane

So are there ways that we can use this together? And about a year ago, me and my team, the Shiny team at Posit, decided to, like, let's just humor ourselves and give it a try and see what we could do if we made our best effort at making something useful with our technology and an LLM. And what we came up with is this. So this is a dashboard. This is a Shiny dashboard. If you haven't seen Shiny, it's a web framework that helps people who primarily just work with data create interactive web applications like this without being web developers. And, you know, if you hide this part, all of this is just standard Shiny. Shiny has done this for years and years. It's not an LLM. It's just deterministic interactive code.

And I can do things, like I can change options here to say, like, do I want to show the most common or least common birds? This is bird sighting data from the eBird dataset from Cornell. There's a map down here that I can pan and zoom that's folium. So all of that is pretty standard. And what we decided to do was to see if we could use an LLM to make this dashboard more useful. On a lot of these dashboards, you would expect to see a sidebar with sliders and dropdowns that you can use to filter the data, and we decided to replace that with a chatbot.

So here, I'm going to say, like, so this is bird sighting data. I'm going to say, like, show only observations between 6 a.m. and noon. Let's see, where the observation method is stationary. Okay, I spelled it wrong, but I think it'll probably figure it out. So it goes ahead and does that, including correcting my spelling there. And the important thing to note here is that although everything on here changed, this SQL here is the only thing that the LLM has direct control over. I'm not asking the LLM for answers. I'm asking it for SQL. Once it generates the SQL, everything else is handled by deterministic code. It's handled by Shiny. And very importantly, we display the SQL at all times. So on the off chance that it does make a mistake or something doesn't seem right to you, you only have to look up and see, like, do I think that that looks correct?

And the same thing in the chat history here, we make sure to leave the SQL there so that you can refer back to it later. And this is a simple query that you could have done with sliders and dropdowns, but because this is natural language, it actually gives us the flexibility to do things we couldn't otherwise do. Like, I could say, like, invert the filter. So everything that you're showing now, hide and vice versa. And this is no longer something that is easy to do with most dashboarding frameworks. And I can go further and say, also show everything in this category, and it'll OR criteria together. You can ask it to calculate standard deviations and outliers and things like that. It'll do all those things.
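The division of labor described above, where the model only writes SQL and ordinary code does everything else, can be sketched as follows. This is a minimal sketch under stated assumptions: it uses the standard library's `sqlite3` (the talk does not say which engine the demo uses), the `sightings` table and its columns are hypothetical, and `fake_llm_sql` is a stand-in that returns canned SQL instead of calling a model.

```python
import sqlite3

# Hypothetical SQL a model might return for each chat message.
CANNED_SQL = {
    "mornings, stationary only": (
        "SELECT * FROM sightings "
        "WHERE hour >= 6 AND hour < 12 AND method = 'Stationary'"
    ),
    "invert the filter": (
        "SELECT * FROM sightings "
        "WHERE NOT (hour >= 6 AND hour < 12 AND method = 'Stationary')"
    ),
}


def make_demo_db():
    """Build a tiny in-memory table with made-up sightings."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sightings "
        "(species TEXT, hour INTEGER, method TEXT, bird_count INTEGER)"
    )
    conn.executemany(
        "INSERT INTO sightings VALUES (?, ?, ?, ?)",
        [
            ("Mallard", 7, "Stationary", 4),
            ("Mallard", 14, "Stationary", 2),
            ("Osprey", 9, "Traveling", 1),
            ("Killdeer", 11, "Stationary", 3),
        ],
    )
    return conn


def fake_llm_sql(request):
    # Stand-in for the model call: the model's only output is a SQL string.
    return CANNED_SQL[request]


def apply_filter(conn, request, generate_sql=fake_llm_sql):
    """Run the generated SQL and return it with the rows, so the UI can show both."""
    sql = generate_sql(request)
    rows = conn.execute(sql).fetchall()
    return sql, rows
```

Returning the SQL next to the rows is the point of the design: whatever the dashboard renders can be traced to one visible query. A real deployment would also want a read-only connection, which this sketch leaves out.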

So besides filtering and sorting, you also have the ability to ask data questions. So you can say, like, we have the total sightings and total birds sighted here, but we don't have, like, what was the average and median counts of birds per sighting. And once again, this LLM, it actually doesn't even have access to the data. So it can't hallucinate based on the data. All it has access to is a data schema, and it will write SQL to answer this question.

And here we can see. And for this particular application, we also added these little buttons so that if it's not clear what you're seeing here in this visualization, you can press this button, and it'll send a screenshot of the output and ask the LLM what it sees. And interesting, like, in this case, it happens to notice that it seems to be following a river or waterway system, which I had not noticed despite looking at this many, many times.

So this works pretty well. I mean, it works pretty well in terms of correctness because we are only asking it to generate SQL, which it does incredibly well. The best models today do that incredibly well. It is transparent because every SQL query is displayed to the user, so you know how the answer is being derived. And it's reproducible because it's generating SQL, and SQL is reproducible. So it's almost magic how well SQL fits in at the intersection of what the LLM can write, what the code can execute, and what the human can review and interpret.
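The earlier point that the model sees only a schema, never the rows, can be illustrated with a short sketch. The assumptions are the same kind as before: `sqlite3` is a stand-in engine, the table is hypothetical, and the SQL string marked as the model's reply is canned.

```python
import sqlite3


def describe_schema(conn, table):
    """Return 'name TYPE' lines for a table, using only metadata, never rows."""
    # The table name is interpolated directly, so pass trusted names only.
    columns = conn.execute(f"PRAGMA table_info({table})").fetchall()
    # Each row is (cid, name, type, notnull, default, pk).
    return "\n".join(f"{name} {col_type}" for _, name, col_type, *_ in columns)


def build_system_prompt(conn, table):
    """Everything the model is told about the data: column names and types."""
    return (
        "You write SQLite SELECT statements. Reply with SQL only.\n"
        f"Table: {table}\nColumns:\n{describe_schema(conn, table)}"
    )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sightings (species TEXT, hour INTEGER, bird_count INTEGER)")
conn.executemany(
    "INSERT INTO sightings VALUES (?, ?, ?)",
    [("Mallard", 7, 4), ("Osprey", 9, 1), ("Killdeer", 11, 4)],
)

prompt = build_system_prompt(conn, "sightings")

# Stand-in for the SQL a model might return for "average birds per sighting".
sql = "SELECT AVG(bird_count) FROM sightings"
average = conn.execute(sql).fetchone()[0]
```

Because the prompt is built from `PRAGMA table_info` alone, no data value can leak into the model's context. The number shown to the user comes from the database executing the query.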

So in fact, we were so happy with this approach that we made it a package. So if you're building a Shiny dashboard, it's like a four-line operation to drop in an LLM that does this kind of sidebar-based filtering. That package is called querychat. It's not on PyPI yet, but if you Google for it, you can find it.

Databot: letting the LLM off the leash

Now, that worked, and when I submitted the talk proposal, this is where the talk was supposed to end. A few months ago, we decided, just out of curiosity, like, yes, yes, we're a very responsible company. What happens if we let the LLM off the leash, right? Instead of, like, focusing in on these things that it can't do wrong, why don't we start with the best coding LLM at the time, Claude 3.5 Sonnet, and we give it a tool to run Python unfettered and unsandboxed. And then we ask it data questions. So just this combination of things, what happens? Like, does it delete your hard drive or does it, like, you know, fulfill all of your wishes?

So that, I'm going to show you an early prototype of something we so far have been calling Databot. Everybody hates the name, so I'm sure we're going to change it before it sees the light of day. But I downloaded some data from 538, just some simple CSV data. So starting out in a directory, inside of this directory are a couple of CSV files and a readme, and then I launched Databot. And Databot is a simple chat bot that has the ability to execute Python. So it makes some suggestions, and I tell it, look in the current directory. I'm not even bothering it to tell it the name of these files. It sees that there's these three files, and then now it's making suggestions of what to do next.
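A "run Python" tool of the kind described above can be sketched in a few lines. This is a hypothetical illustration, not Databot's implementation: `PythonTool` is an invented name, and like the prototype in the talk it is deliberately unsandboxed, so it should only ever see code you would be willing to run yourself.

```python
import contextlib
import io
import traceback


class PythonTool:
    """Runs code strings in one persistent namespace, like a notebook kernel."""

    def __init__(self):
        self.namespace = {}

    def run(self, code):
        """Execute code and return printed output, or the traceback on failure."""
        buffer = io.StringIO()
        try:
            with contextlib.redirect_stdout(buffer):
                exec(code, self.namespace)  # unsandboxed on purpose
        except Exception:
            return {"ok": False, "output": buffer.getvalue() + traceback.format_exc()}
        return {"ok": True, "output": buffer.getvalue()}
```

Keeping one namespace across calls is what lets a later step reuse a data frame loaded by an earlier one. Returning the traceback as text, instead of raising, gives the model something to read when its code fails.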

And what we're going to do is read the contents of the readme, and it reads it with just Python. And now it sees, okay, I can see this is NBA, this is basketball, historical game data. And it makes a suggestion again, like, let's load this data using Polars and look at what kind of data we're looking at. All right, so it's figured some things out. All of that is correct. And create a visualization showing the distribution of game scores. That seems like a pretty good place to start.

So it's using the plotnine visualization package, and it's looking at the plot just like we are and coming to its own conclusions about what the data says. And we can look at the code. Everything is out in the open, so you can see all the code that generated the output.

So this is where things go a little wrong. So there's an error in the code that it wrote, but like a good little agent, it realizes there was an error, and it wants to try again, and it tries again. And it generated the box plot. But interesting, I'm going to pause it here. There appears to be a trend towards higher scoring in recent seasons. But look at that box plot. Like, what are you talking about?

So I was a little confused. I looked at the code. There was nothing obviously wrong. I mean, I looked at... This is real time. This is not sped up or slowed down. So that's how much time I looked at the code. And then finally, I was like... what are you talking about? That's just one box. Yeah, that doesn't look right at all. There's a single box. It says, oh, let me see. Sorry, this data set only contains a single year. The other CSV file has all the other years. So let me go ahead and do that. And this time, it gets it right.
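The retry behavior described above, where the agent sees its own error and tries again, amounts to a loop that feeds the traceback back to the model. Here is a minimal sketch, assuming a hypothetical `write_code(task, feedback)` callable in place of a real model; `scripted_model` fakes one buggy draft followed by a fix.

```python
import traceback


def try_code(code):
    """Run code; return None on success or the traceback text on failure."""
    try:
        exec(code, {})
    except Exception:
        return traceback.format_exc()
    return None


def run_with_retries(write_code, task, max_attempts=3):
    """Ask for code, run it, and feed any traceback back until it succeeds."""
    feedback = None
    for attempt in range(1, max_attempts + 1):
        code = write_code(task, feedback)
        feedback = try_code(code)
        if feedback is None:
            return {"attempts": attempt, "code": code}
    raise RuntimeError(f"still failing after {max_attempts} attempts:\n{feedback}")


def scripted_model(task, feedback):
    # Hypothetical stand-in for the LLM: a buggy first draft, fixed after an error.
    if feedback is None:
        return "scores = [98, 104]\nprint(sum(scores) / len(score))"  # NameError
    return "scores = [98, 104]\nprint(sum(scores) / len(scores))"
```

The loop only detects code that raises. The single-box plot in the demo ran without any exception and was still wrong, which is why the human in the loop matters.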

But my biggest problem with this is that the X axis is super hard to read, because there's so many years smushed together. That's the kind of thing where I'm usually like, fine, I'll just make my window bigger or something. I can't be bothered to figure out how to fix it. But in this case, I just tell it, hey, I can't read it. Please fix it. And within seconds, it has fixed that. Yeah, much better.

So the last thing here is... I mean, we have spent, what, a total of three minutes so far. But imagine that we were doing much more intense analysis, and we've learned all this stuff. And I don't want to forget all that just because my session ends, and I'm going to come back later. So what I'm telling it to do is basically take everything you've learned, summarize it, and put it in a file called llms.txt in the current directory. And then the next time I launch it, it immediately greets me with, hi, I have NBA data for you, and here are the things that I know about it. What questions do you want to ask me?
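The session memory described above can be as simple as a text file that is read back at startup. This is a sketch under assumptions: only the file name `llms.txt` comes from the talk; the function names and prompt wording are invented for illustration.

```python
import tempfile
from pathlib import Path

NOTES_FILE = "llms.txt"


def save_notes(directory, summary):
    """Write the model's summary of what it has learned into the data directory."""
    path = Path(directory) / NOTES_FILE
    path.write_text(summary, encoding="utf-8")
    return path


def load_notes(directory):
    """Return saved notes, or None when this is the first session."""
    path = Path(directory) / NOTES_FILE
    return path.read_text(encoding="utf-8") if path.exists() else None


def build_system_prompt(base_prompt, directory):
    """Append earlier notes to the system prompt when they exist."""
    notes = load_notes(directory)
    if notes is None:
        return base_prompt
    return f"{base_prompt}\n\nNotes from earlier sessions:\n{notes}"


def demo():
    """Round-trip the notes through a throwaway directory."""
    with tempfile.TemporaryDirectory() as directory:
        first_launch = build_system_prompt("You are a data assistant.", directory)
        save_notes(directory, "NBA game data; CSV files plus a README.")
        second_launch = build_system_prompt("You are a data assistant.", directory)
    return first_launch, second_launch
```

Because the notes are plain text sitting next to the data, a person can read and correct them before the next session relies on them.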

Is this responsible?

So I did have to record that demo, because that demo can go very wrong, depending on what kind of answers you get back. But I swear, that was not cherry-picked. I ran that demo exactly one time today, recorded it, and that's what it was. And that's very typical for the kind of thing that happens. There are always mistakes in the code. It usually figures them out and fixes them. There are sometimes hallucinations where the plot is clearly wrong, and it says something confidently that it sees in the plot that it doesn't. But when the plot is right, it usually makes correct observations.

So is this responsible? I mean, I feel really conflicted about this. I mean, on paper, this is violating so many best practices, both in terms of the prime directive, but also in terms of security. But in practice, it is so effective. I mean, I can object intellectually, but I'll never do exploratory data analysis without this thing now.

That, you know, even though on paper there is this inherent tension, we have done some things to try to mitigate the danger. We've limited it to exploratory data analysis only. We designed it for human intervention, so with tight feedback loops, so that humans are not overwhelmed by so many insights, they can't spot errors. We have UI tricks we use to try to make errors more visible to users. And in practice, not writing the code, it's actually really easy for me to see when there are mistakes that the bot is making. Just like if you've ever pair programmed with someone and just been astonished at the simple mistakes they're making while they're typing, because you have all these cognitive, you have all these free brain cycles to look at their mistakes.

So despite the conflict, we are going to ship a version of this in Positron, which is our next generation IDE. It is really advanced compared to what I just showed you. But we are going to call it a research preview, just because I still just feel really weird about this. And I really look forward to seeing what users do with this.

A framework for thinking about risk

The way I'm thinking about this now is not so much in terms of a black and white, are we doing things responsibly, or are we doing things not responsibly? Can this thing give correct answers, or can it not give correct answers? There are many dimensions here to this problem. And the two that I'll focus on in this unitless graph is likelihood of mistakes on the x-axis. So going towards the right means it's more likely to screw up. And then on the vertical axis, how likely are we to overlook those mistakes all the way to production or all the way to the conclusions of the analysis?

So something like querychat is very unlikely to make a mistake. But if it does, you're probably not really on the lookout for a lot of mistakes. So it's kind of middling in terms of how likely you are to find it. Something like Databot, you're encouraged to be much more vigilant, but it is more likely to make mistakes because it has much broader kind of tasks that it'll try to do. And then something like ChatGPT's deep research mode, where you encourage it to go off and think for a long time about hard problems and then come back with reams and reams of conclusions. To me, unless you already knew the answer of the question that you're asking, it is almost impossible to find the subtle mistakes that it'll make. And anecdotally, the people I've talked to who really know the topics that they ask deep research about say it's pretty common for it to come back with 90% correct and 10% incorrect answers, which is almost like the worst ratio, right? Like that's just high enough for you to be lulled into a false sense of security.

So what does this mean? Like where is your level of tolerance? And I think that really depends on who you are. It really depends on what you're doing. So if you are a data science consultant in marketing or something like that, maybe it's something like this. Maybe it's like everything below and to the left of that line is safe enough for you, where either it is unlikely to make a mistake or if it makes a mistake, you'll probably catch it. But if you are doing nuclear safety or something like that, maybe that line is down here. Like maybe almost nothing is good enough. And if you are a LinkedIn influencer, then maybe the line's over here.
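The two-axis picture above can be written down as a toy model. Everything numeric here is an assumption: the talk's graph is unitless, so the coordinates and tolerance values below are invented to respect only the relative ordering described, and a straight line `x + y = tolerance` is just one way to draw "below and to the left".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    name: str
    mistake: float   # x-axis: how likely the tool is to make a mistake (0 to 1)
    overlook: float  # y-axis: how likely a mistake is to go unnoticed (0 to 1)


def within_tolerance(tool, tolerance):
    """True when the tool sits below and to the left of the line x + y = tolerance."""
    return tool.mistake + tool.overlook <= tolerance


# Illustrative placements only; not measurements.
TOOLS = [
    Tool("querychat", mistake=0.1, overlook=0.5),
    Tool("Databot", mistake=0.5, overlook=0.3),
    Tool("deep research", mistake=0.6, overlook=0.9),
]

# Hypothetical tolerance lines for the three personas mentioned in the talk.
TOLERANCES = {
    "marketing consultant": 1.0,
    "nuclear safety": 0.2,
    "LinkedIn influencer": 2.0,
}


def acceptable_tools(persona):
    """List the tools that fall on the safe side of a persona's line."""
    return [t.name for t in TOOLS if within_tolerance(t, TOLERANCES[persona])]
```

With these made-up numbers, the consultant accepts querychat and Databot, the nuclear safety line accepts nothing, and the influencer accepts everything, which mirrors the three lines drawn in the talk.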

So that's it for my talk. One of the things that I regret, not having more time, is that if this has piqued your interest in maybe taking a stab at building these kinds of agents yourself, I have given you very little to go on. So this first link, this YouTube video, Harnessing LLMs for Data Analysis, is sort of the complement to this talk, where all I talk about is how to build a quick prototype using a minimum amount of Python code. chatlas, Shiny, and querychat are the three packages I've talked about. They're all well-documented. And we are around, we have a booth in the expo area. The company is Posit, and we are happy to talk to you about all these things. Thank you.

Q&A

So a reminder to ask questions in the Slack channel. One clarifying question that I'll ask first, this is from Jay. He didn't quite catch what happened when the LLM mentioned the trend when it was only the 2023 box plot. Did it look at the other file but not plot it? Or did it fabricate the comment about the other file and not plot it at all?

Or did it fabricate the comment about the trend entirely that happened to exist, but it didn't actually know? Or did it know about this data set from its training data and then it leaked into the response? Oh, that last one. That's the worst. That last one's the worst, like when you're working on a famous data set. And it's not looking, it doesn't need to look at the data to know the answer. That, oh, I hate that so much.

In this case, as far as I can tell, and to be clear, there's nothing happening, there's nothing that the model can see that we can't see. The whole point of this user experience is that it is completely transparent. So in this case, yeah, it just showed a single box and then the LLM fully hallucinated. It fully hallucinated that there was this trend. And I think there's a really good chance it's because it's true. Like it's because scoring has increased in recent years. So it just made that call. So yeah, I hate that, but that is reality. But then when it got the real data, then it made more specific observations, yeah.

And then another question related to this example specifically, did you notice that the SQL returned in your, or sorry, in the bird example was incorrect? You asked for it to be, it says 9 to 11. I think it was 6 to noon and it returned 6 to 11 instead. Are we sure it wasn't less than or equal to 11? Because that would be correct. Unclear. Okay, I'm sorry. I refreshed the window, so it's gone. Okay, so then we cannot verify. But then the real question is, how do we train humans to always verify the results of an LLM? Oh, good luck.

I mean, you know, if anything over the last, you know, if we've learned anything from the last three months, which in AI timescales is like a decade, we are racing in the other direction, right? I mean, vibe coding started as a joke, and I'm pretty sure there's going to be like college courses in it soon. I mean, I think the rate at which people have been willing to suspend their disbelief after having like five good experiences is incredible. I sound really judgy right now. But I really am trying not to be because like I really want to be humble about this because this is all like uncharted territory. But I will say that it feels like human nature to let our guard down.
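The earlier exchange about whether the bird filter used "less than or equal to 11" turns on how the time column is stored, which the transcript does not say. The sketch below uses a hypothetical table with both a whole-hour column and a clock-time column to show why the same comparison can be right for one and wrong for the other; `sqlite3` is again only a stand-in engine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sightings (obs_time TEXT, hour INTEGER)")
conn.executemany(
    "INSERT INTO sightings VALUES (?, ?)",
    [("05:45", 5), ("06:00", 6), ("11:30", 11), ("12:00", 12)],
)


def times(where):
    """Return the observation times matched by a WHERE clause, in order."""
    sql = f"SELECT obs_time FROM sightings WHERE {where} ORDER BY obs_time"
    return [row[0] for row in conn.execute(sql)]


# On a whole-hour column, "<= 11" and "< 12" select the same rows.
by_hour_inclusive = times("hour >= 6 AND hour <= 11")
by_hour_exclusive = times("hour >= 6 AND hour < 12")

# On a clock-time column, "<= '11:00'" silently drops the 11:30 sighting.
by_time_inclusive = times("obs_time >= '06:00' AND obs_time <= '11:00'")
by_time_exclusive = times("obs_time >= '06:00' AND obs_time < '12:00'")
```

This is the kind of check that displaying the SQL makes possible: the reviewer has to read the query against the schema, not just the result.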

A question from Andrea Urban: isn't one reason that we can't figure out what tasks LLMs are good or bad at because we don't know what the training data looks like? If we knew, wouldn't we be better able to test and validate the model outputs? I'm not an ML expert, but I feel pretty confident in saying, with like a small ML model, yes. With these giant LLMs, not a chance. Not a chance. I mean, the kinds of behavior that we see come out of them are so far divorced from what's in the training data that it makes no sense. I mean, I think that's part of the reason why so many people were so resistant to these LLMs being able to do anything but regurgitate what's already in its training set. I mean, how many people call these things stochastic parrots well into Claude 3.7 Sonnet, when we were so far past that? I think, like, it's because the behavior is so unintuitive coming from this training set that, yeah, I think we're in a very different place than that. I see a lot of really skeptical looks from this side. Happy to talk about it at the booth.

Another question from Carlos: what's your take on scientists that don't know SQL or Python using these types of systems? Oh my gosh, the product people in our company are dying to tell that story. I mean, they're so excited to put these capabilities in the hands of, you know, to say, like, this is opening the audience for data science, this is expanding the number of people with access to these tools. And I love that vision, and we are not there yet. Like, so I keep pushing back and saying, like, that is so dangerous. It is so dangerous to market this as, now you don't need to know Python. Like, that is just like handing a loaded gun with no safety to someone who doesn't know anything about gun safety. I mean, like, so I really try to take a stand against that, and it really feels like a lonely position. I mean, I think people in the company, once I explain that, they get on board. But go to any conference on AI, nobody's saying stuff like that. Everybody is just saying, like, look, this solves all the problems, look at our cherry-picked demo, like, deploy this and, you know, use it to operate your hospital.

And then I guess one last question that's very related. Evelyn asks: I've heard a lot of mixed information on how LLMs and AI will take over engineering jobs. When we kind of have automated out all of the prompting, commenting, data engineering, what jobs will no longer be needed versus which ones will stick around? You know, I have no idea. And this is such a personal question for me. I mean, not just because, like, I am a software engineer, but I have a son who is gonna be a junior at Northeastern studying computer science, and I have no idea what to tell him. So, you know, I think we all need to be, like, very skeptical, very curious, and very humble about our predictions right at this moment in time. But I will say we are gonna save a bedroom for him, just in case.