Transcript

This transcript was generated automatically and may contain errors.

Welcome to The Test Set. Here we talk with some of the brightest thinkers and tinkerers in statistical analysis, scientific computing, and machine learning, digging into what makes them tick, plus the insights, experiments, and OMG moments that shape the field.

In this episode, Hadley Wickham and I sit down with Tristan Handy, CEO of dbt Labs, to discuss why SQL, once seen as part of the old world of data, has become central to how modern data teams work.

So basically, roughly a decade ago, a tool in the tidyverse called dbplyr enabled our users to query databases without ever really touching SQL. But around the same time, another tool emerged, dbt, and it took almost the opposite approach, making it easier to go hands-on and write SQL directly, while bringing in critical ideas from software engineering, like version control and automated testing.

In this conversation, we explore the emergence of tools like dbplyr and dbt, who they're for, and why SQL has ended up at the center of it all.

All right, hey everybody, welcome to The Test Set. I'm Michael Chow, and I'm joined here by Tristan Handy, CEO of dbt Labs, and my co-host, Hadley Wickham, who's chief scientist at Posit.

Tristan, I'm so excited to have you on. I have to admit, I'm dbt-pilled myself, and as a Python and R user, I've really battled to explain why I'm obsessed with dbt. So I'm honored to have you here to really rep dbt and get people hyped about what dbt is cooking.

Hell yeah, I appreciate you having me.

Yeah, I was wondering, maybe to start out, I know we did a quick demo online of putting up a dbt project, and Hadley was there. I thought maybe to kick it off, we could just have, maybe Hadley, you could try to explain what your sense of dbt is?

I think the way I think about dbt is it's sort of trying to put SQL on an equal footing with other programming languages, i.e. giving you tools for version control, for writing reusable functions, or scripts, or macros, whatever you want to call them, and for doing documentation and testing. That's the thing that seems to resonate most with me, but I'd love to hear how you think of it, if that is at all correct.

Two worlds of data

Yeah, I think that that's correct. So this is fascinating because all of us here are data people, and yet we come from two very different trees inside the data community. You folks are in the R-Python world, and I very much am in the, I wouldn't even call it the SQL world, I would call it more of the enterprise data world, which certainly SQL is a common part of, but also tools for data pipelines and data transformation.

So Informatica is a part of this world, whereas it's not really a part of your world at all, I think.

Yeah, I don't even really know what Informatica is.

And so part of the world that we come from includes quote-unquote data modeling, which is like Kimball or Data Vault or, you know, choose your religious approach here. And it almost always makes sense in the context of a somewhat larger organization.

So certainly the largest organizations in the world use a ton of R and Python, but a lot of times R and Python are these generalist tool sets where you can do data ingestion, you can do data transformation, you can do analysis, you can do all this stuff. With dbt, you know, people torture it to use it for all sorts of different things, but really it's for data transformation.

So the goal is like somehow people in your organization loaded a bunch of data into a data lake, a data warehouse, whatever. And then you are kind of picking up the baton and saying like, well, now I need to turn that data into useful data sets. And I may not even be the consumer of that data set, but I know that there are, you know, a handful to dozens to maybe hundreds of consumers of this data set that I will create. But I care about very specific tooling to make that process extremely robust.

But it doesn't do other stuff. It doesn't build charts or graphs. It doesn't move data, this kind of stuff.

Yeah, I think that's a really interesting perspective because the data science model, like the alpha data science model, is always you as the data scientist trying to own as much of the stack as possible. You know, hopefully you can get some of your data from a nice SQL database somewhere, but often you might have to do some scraping or pull together 15 random Excel sheets. You're going to have to do the tidying. You're going to have to do the exploration, the visualization. And then finally, you're going to communicate the results to someone else.

So it sounds like one of the big distinctions is that in your world, that's not one person who's operating across that whole stack. Someone's doing ingest, and that's almost outside of the realm of dbt. And then you've got someone who's working with the data that's already been ingested. Maybe it's in 17 different tables and now you want to create a nice data set that someone else is going to use to create a dashboard, or that gets fed into some other kind of report.

Yeah. I remember, gosh, this is way back in the day. I don't even remember who wrote this, but maybe it was somebody at Stitch Fix who wrote a blog post about the full stack data scientist. And I remember reading that and being like, oh my gosh, I could not disagree more with this approach.

But it makes sense in the context that you operate in. So I think the data scientist is a role that kind of arose out of experimental design and validation: you wanted to try to run an experiment on the world. And if that experiment had some validity to it, the organization that you worked with could see significant upside from whatever the thing was that you were trying to prove.

In the world that I operate in, there are the internal operations of a company, and those operations just require some data to do the things that they do. You know, it could be a healthcare provider, it could be literally anything. You may be trying to learn something in the process, but you may just need to produce a customer 360 dashboard so sales reps can do their renewal calls.

And when you operate in this kind of environment, actually the quote-unquote full stack data science model is kind of terrible, because the organization may have to do this business process for multiple decades. And certainly individual people will come and go over that time period, and your process needs to be very standardized and, you know, transferable between humans.

It sounds, yeah, much more like business as usual, when ideally the data scientist is almost trying to disrupt that, right? The data scientist is trying to figure out something that you need to do to change your business, to make a material difference, not just to understand where you are right now and to do all that kind of reporting that people just need to know.

And a lot of times these two worlds are complementary. You know, it is, I think, very common that data scientists will use data sets that have been curated by dbt, but then kind of build from there. And I think it's also very common that features that get engineered in Python and R notebooks can kind of get ported upstream into data pipelines, whether they're built in dbt or Airflow or whatever.

The other thing that took me a surprisingly long time to figure out is that just getting everyone to agree on what a metric is turns out to be surprisingly complicated, because there are all these little wrinkles. When you look at it at a high level, it seems pretty obvious. But if you just want to know the number of customers, well, what is a customer? How do you count that? There are a lot of fine-grained decisions, and different parts of the organization have different priorities when they're thinking about what a customer is. So there's a kind of technological problem of, let's just calculate that number once in one place, but a much bigger kind of sociological problem of, how do we get people to actually agree?
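To make the "what is a customer?" problem concrete, here is a minimal Python sketch. The records, the three definitions, and the 365-day activity window are all hypothetical and not from the conversation. The only point is that reasonable definitions give different counts from the same data.

```python
from datetime import date

# Hypothetical records: (customer_id, signup_date, last_paid_order_date or None)
accounts = [
    ("a", date(2024, 1, 5), date(2025, 3, 1)),
    ("b", date(2024, 6, 1), None),               # signed up, never paid
    ("c", date(2023, 2, 9), date(2023, 9, 30)),  # paid once, long ago
    ("d", date(2025, 1, 20), date(2025, 4, 12)),
]

AS_OF = date(2025, 5, 1)  # assumed reporting date


def count_signed_up(rows):
    """Definition 1: anyone with an account counts as a customer."""
    return len(rows)


def count_ever_paid(rows):
    """Definition 2: only accounts that have paid at least once."""
    return sum(1 for _, _, last_paid in rows if last_paid is not None)


def count_active(rows, as_of=AS_OF, window_days=365):
    """Definition 3: accounts that paid within the last `window_days` days."""
    return sum(
        1
        for _, _, last_paid in rows
        if last_paid is not None and (as_of - last_paid).days <= window_days
    )


signed_up = count_signed_up(accounts)
ever_paid = count_ever_paid(accounts)
active = count_active(accounts)
print(signed_up, ever_paid, active)  # three different "number of customers"
```

Computing each number is trivial. Choosing which one the organization means by "customers" is the part that needs agreement.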

Yeah, and this is the thing that I think dbt has done a reasonably good job of over the past decade. The first commit to the open source repo was in early 2016, so we're almost 10 years in. By and large, there were localized versions of the truth that existed at organizations 10 years ago. Now some organizations were very good at this kind of thing; famously, companies like Facebook and Airbnb and Spotify invested in their kind of global data infrastructure. But at many companies, versions of the truth were much more localized. dbt introduced the idea that you should push these versions of the truth kind of centrally and govern them in source control, and that rather than taking your ball and going home if you didn't like the centralized version of the truth, you would kind of argue it out in public and try to come to some consensus that worked for everybody.

Now obviously that's imperfect, but I think that we've made, honestly, a reasonable amount of progress along that continuum.

It does, yeah, it does feel like that has really changed. I think when I was doing my PhD, like 15 years ago, you know, I was working with scientists, and their data comes in every form you can imagine. It's Excel spreadsheets, it's an API, it's a database, it's CSV files, whatever. And so absolutely a big part of the role of an applied statistician or data scientist was just to figure out how you get all that data into one nice clean representation.

And I remember kind of thinking at that time, this is just so tough in science, it must be so nice to work in an industry job where you can just query the single source of truth and get it. Now, that clearly was not the case 15 years ago, but it does feel like we've gotten much, much closer to that now. In many organizations there's a decent chance that the data you need is kind of nicely prepared somewhere.

The analytics engineer role

Back then we kind of invented this term analytics engineer, which I think is pretty descriptive, although I don't know that there's some great textbook description of it. Some people say that an analytics engineer is a pissed off data analyst. And, you know, as technology changes, the capabilities of these different roles change as well. As warehouses get more sophisticated, what analytics engineers can do increases.

The way that I like to think about an analytics engineer is somebody who fundamentally enjoys the process of curation as opposed to exploration. So like, I don't know, you can both tell me, but I think when I talk to data scientists, the thing that makes me different from them is that I want to make sure all the books on the shelf are in the right order so that when somebody comes up and tries to find one, they can find it. And the thing that a data scientist often wants to do is like read every page in a book and draw new conclusions from it. And like, sometimes I do that too, but that's actually not the thing that I take the most pleasure in.

Yeah, it's really interesting because I do think what a data scientist is has changed over the last 15 years, because I do believe it is also the job of the data scientist to make sure the books are all shelved correctly and you can find what you're looking for easily.

So, in the version of the world that I have in my head, there are really four main roles that are kind of on a continuum. And this is, again, like in my kind of enterprise data systems world and less in the like more experiment driven data science world.

But the spectrum kind of goes from at the left most, the most technical, the deepest in the stack is the data platform engineer. And then you move to the data engineer, and then you move to the analytics engineer, and then you move to the data analyst.

And so the data platform engineer kind of builds the infrastructure that everything else is built on, builds and maintains that infrastructure. And then the data engineer builds pipelines, but these pipelines might be the most technical pipelines. Maybe they require a high degree of focus on performance or uptime or something like that. But their primary focuses are technical and not business facing.

And then the analytics engineer is also building pipelines, but they're starting to build the pipelines that are like, that really contain a lot of semantic meaning. My favorite example of this is, I one time had to build pipelines for, I think they were essentially like an Instacart competitor, but they needed to calculate the cost of goods sold for shipments that they sent out. And it turns out that when you ask for a bundle of green onions, the way of calculating the cost of goods sold on top of a bundle of green onions is very challenging. And in order to build that data pipeline, you actually have to be close enough with your business counterparts that you understand all these rules in your brain. Data engineers, like honestly, if you make them think about cost of goods sold for green onions too much, they're going to quit and go get another job.

So analytics engineers are this hybrid role where you have a lot of business context. You also have enough technical context to be dangerous. Then you go to the data analyst, but they are often not centralized. They are often embedded in local teams and they go to standups with the marketing team or the finance team or whatever. They live and breathe every day inside of that business unit.

SQL's glow-up

One thing I'm also curious about from that time is I heard you mention, I think in one of your podcasts with Benn Stancil, something like, in 2015, nobody wanted to put SQL on their resume. Yeah. Like deeply unsexy. Could you say a little bit about that? Because I'm so curious. I feel like dbt has brought SQL into the mainstream.

So there were these two different worlds in data back in 2015. There was kind of the new stuff and there was the old stuff, and the old stuff felt really clunky. And this is like Teradata and Hadoop, or, no, no, I would put Hadoop, Hadoop was the new stuff. Yeah. Hadoop was the new exciting stuff.

But all of the enterprise stuff was pretty deeply unappealing because it hadn't honestly moved that much in the past decade. And then you had the Hadoop world, and, you know, in 2015 Spark was really pretty nascent and really just shooting up the charts in terms of number of stars on GitHub.

And so people with those skills were getting slurped up by, you know, FAANG companies. And so what you wanted on your resume as a data person was, you know, Pig and Hive and Spark and Impala and all of this. And if somebody told you that they spent most of their time in SQL, that probably meant they were a part of this old world, and they were probably living deep in the belly of some 50,000 person enterprise doing nothing that interesting or new.

But there were two things that I think changed there. One is that, like you were talking about before, cloud data warehouses became super powerful and prevalent. Then the second thing is that SQL itself became much more expressive. And so, whereas previously SQL was a language where, you know, basically you could do some simple aggregation, by the time that I was beginning to do this work in pure SQL, we actually couldn't find a use case for a data transformation that we couldn't express in SQL.

The most complex thing that I, data transformation that I needed to express in SQL was outlier detection on time series data, where I was detecting lift from TV ads. And that was a little hairy, but I was able to use window functions to do it. And it worked. It worked great.
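As a rough illustration of the kind of window-function query described here, the sketch below flags a spike against a trailing average. It is not the query Tristan wrote: the table, the data, the five-row window, and the 1.5x threshold are all assumptions. It runs on SQLite through Python's `sqlite3` module purely for convenience, which requires a SQLite build of 3.25 or later for window functions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (minute INTEGER PRIMARY KEY, sessions REAL)")

# Hypothetical per-minute session counts; minute 6 is a spike (say, after an ad airs).
data = [100, 102, 98, 101, 99, 100, 180, 103, 97, 100]
conn.executemany("INSERT INTO visits VALUES (?, ?)", list(enumerate(data)))

# For each row, compare the value with the average of the five preceding rows.
# The frame excludes the current row so a spike does not inflate its own baseline.
QUERY = """
WITH baseline AS (
    SELECT
        minute,
        sessions,
        AVG(sessions) OVER (
            ORDER BY minute ROWS BETWEEN 5 PRECEDING AND 1 PRECEDING
        ) AS trailing_avg,
        COUNT(*) OVER (
            ORDER BY minute ROWS BETWEEN 5 PRECEDING AND 1 PRECEDING
        ) AS n_prior
    FROM visits
)
SELECT minute, sessions, trailing_avg
FROM baseline
WHERE n_prior = 5                      -- only rows with a full baseline
  AND sessions > 1.5 * trailing_avg    -- assumed threshold for "lift"
ORDER BY minute
"""

outliers = conn.execute(QUERY).fetchall()
print(outliers)
```

A real lift analysis would likely need seasonality handling and a more careful baseline. This only shows that the comparison against neighboring rows can be expressed in plain SQL.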

dbt Fusion and cross-engine compilation

I think we need more delineation between the ergonomics of how you want to express your logic and the execution environment that they are executed in. And so one of the things that we are spending a lot of time on right now is dbt has a new engine powering it called Fusion. And Fusion starts from not just AST parsing but actual like full logical plan creation. And so what you can do when you can actually go all the way down to the logical plan emulation layer is that you can then reconstruct the actual symbols in the SQL in another engine.

And so all of a sudden, and we're not fully there yet, but like the technology is a part of Fusion, we will soon be able to say, okay, SQL that you've written for one engine, we can just like cross compile and provably correctly run it on another engine. But then that's not that far away from saying like, also, I want to have a Python front end or an R front end. And it just all is a part of the same like, you know, logical plan.
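The idea of separating the logic from the engine it runs on can be sketched with a toy example. Fusion's internals are not described in this conversation, so nothing below reflects its actual design or API. `Plan`, `render_postgres`, and `render_tsql` are hypothetical names, and the "plan" covers only a single-table projection with an optional row limit.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Plan:
    """Toy logical plan: project columns from one table, optionally limit rows."""
    table: str
    columns: Tuple[str, ...]
    limit: Optional[int] = None


def render_postgres(plan: Plan) -> str:
    """PostgreSQL-style output: double-quoted identifiers, trailing LIMIT."""
    cols = ", ".join(f'"{c}"' for c in plan.columns)
    sql = f'SELECT {cols} FROM "{plan.table}"'
    if plan.limit is not None:
        sql += f" LIMIT {plan.limit}"
    return sql


def render_tsql(plan: Plan) -> str:
    """T-SQL-style output: bracketed identifiers, TOP n after SELECT."""
    cols = ", ".join(f"[{c}]" for c in plan.columns)
    top = f"TOP {plan.limit} " if plan.limit is not None else ""
    return f"SELECT {top}{cols} FROM [{plan.table}]"


plan = Plan(table="orders", columns=("id", "total"), limit=10)
postgres_sql = render_postgres(plan)
tsql_sql = render_tsql(plan)
print(postgres_sql)
print(tsql_sql)
```

Once the logic lives in a structure like `Plan`, any front end (SQL, Python, R) that can build that structure and any back end that can render it become interchangeable, which is the property being described.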

Yeah, yeah. I think the thing that's really interesting too, coming back to AI again, is that yes, you can take an R script or a SQL script and give it to an LLM and say translate it, and, you know, 90% of the time it does a good job. But the thing that I've also found really interesting: I've been working on dbplyr lately, that's the dplyr backend that translates to SQL. That's also a really good application for LLMs, because it can generate so much of that translation code. But now it's deterministic. The LLM isn't doing the translation, it's generating that code, and then I can unit test it. My velocity there increases so much, and I still know it's correct, because there are all the unit tests. There's no stochasticity from the LLM anymore. And, I don't know, that just seems cool.
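The distinction here, an LLM writing translation code once versus an LLM translating each query, can be illustrated with a small sketch. This is not dbplyr's code. `TRANSLATIONS` and `translate_call` are hypothetical, and the three mappings are just examples of rules that could be generated once and then pinned down by tests.

```python
# Hypothetical translation table from R-style function names to SQL expressions.
# Once written (by hand or by an LLM), it behaves the same way on every call.
TRANSLATIONS = {
    "mean": lambda col: f"AVG({col})",
    "n_distinct": lambda col: f"COUNT(DISTINCT {col})",
    "toupper": lambda col: f"UPPER({col})",
}


def translate_call(func: str, column: str) -> str:
    """Translate a single-argument function call into a SQL expression."""
    try:
        return TRANSLATIONS[func](column)
    except KeyError:
        raise NotImplementedError(f"no SQL translation for {func!r}") from None


def test_translations():
    """Unit tests: this is where confidence in correctness comes from."""
    assert translate_call("mean", "price") == "AVG(price)"
    assert translate_call("n_distinct", "user_id") == "COUNT(DISTINCT user_id)"
    assert translate_call("toupper", "name") == "UPPER(name)"


test_translations()
```

Because the rules are ordinary code, a failing test points at a specific rule to fix, and rerunning the translator never produces a different answer for the same input.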

100%. I think that there are these weird fault lines within any kind of software community that divide on language. And that's just because of the way that our brains interface with language. It takes mental space to learn a language, to keep it in your brain, all this stuff. And so we've got this stupid thing where people are Python users, or R users, or SQL users, or whatever. It's all just different ergonomic ways to express the same ideas. And I love the idea that we're getting closer to a place where we've got a universal Babel fish.

It is. I will say too, I worked a little bit on porting dplyr and dbplyr to Python and all that translation stuff. But what's so interesting to me now, in like 2026, is that, similar to unnesting JSON, I've been amazed at how expressive SQL is in a lot of databases like DuckDB. I used to bring up the example six, seven years ago of, why dbplyr? And it was like, well, if I want to select every column except one, it's a huge nightmare in most databases. But now you have DuckDB with exclude or except, I always forget which. They have all these ways to select things and operate.

So it's interesting to your point about Fusion taking SQL and being able to translate it down at the AST level. It's funny, seven years ago I feel like I would have wanted a very data-frame-like R or Python tool translating to SQL. But now I could see something like you described sitting a lot closer to SQL, like an R wrapper around a more SQL-y dialect.

To me, dbt started out a little bit like Rails, where SQL is HTML, and nobody really wants to sit there and hand write HTML; that's not an efficient use of time. And so when you use something like Rails, you kind of go up a level of abstraction. And similarly, that same thing happens with dbt. You know, long before the engines themselves started doing things like exclude or except for the select list, you could implement that as a function in dbt, and then using the macro capabilities, you could just have it.

As we continue up the layers of abstraction, it just enables people to forget the implementation details. So here's a thing that goes on right now. There are companies that have spent literally millions of human hours writing Spark pipelines. And they also, in different teams, have spent millions of human hours writing stored procedures. And these things fundamentally do the same stuff. They are not different. They just, for whatever reason, have been built in different technologies. And so these companies maintain separate data infrastructures for these two things. And then at some kind of final stage, they make them all available to each other. But that's not that sensible. And if you could just say, well, you've got a translation layer that can read and operate on these things, regardless of how they were originally expressed, then all those walls kind of fall away.

It also sort of echoes the story of Hadoop to Spark to now. I think, yeah, people just write SQL, right? In the early days, you know, SQL databases didn't know how to split up jobs across hundreds of machines. And you were kind of forced, as a data scientist or data engineer, to do this yourself and explicitly manage all that computation. And I'm sure some people enjoy doing that. But I think for most people, it was just something in the way of doing your actual job. And over time, as the databases have become more capable, all of that just gets swept away into the background.
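The "select every column except one" helper mentioned here can be sketched in a few lines. This is a Python stand-in for the idea, similar in spirit to a dbt macro but not dbt code. `star_except` is a hypothetical helper, and it reads column names from SQLite's `PRAGMA table_info` only because that makes the example self-contained.

```python
import sqlite3


def star_except(conn, table, exclude):
    """Build a select list with every column of `table` except those in `exclude`.

    `table` is interpolated into the PRAGMA statement, so it must be a trusted
    name (this is a sketch, not hardened code).
    """
    excluded = {name.lower() for name in exclude}
    # PRAGMA table_info returns one row per column; index 1 is the column name.
    columns = [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]
    kept = [c for c in columns if c.lower() not in excluded]
    if not kept:
        raise ValueError("every column was excluded")
    return ", ".join(f'"{c}"' for c in kept)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, email TEXT, password_hash TEXT)")

select_list = star_except(conn, "users", ["password_hash"])
query = f"SELECT {select_list} FROM users"
print(query)
```

The generated SQL is ordinary and portable, which is the point of going up a level of abstraction: the convenience lives in the tooling, so it does not depend on the engine supporting a special syntax.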

Building the dbt community

dbt has this incredible community. I know we've talked a lot about dbt the tool and, you know, Fusion, but one thing that struck me is that the dbt Slack is so hopping. There's so much going on. I'm really curious to hear, what do you think went into creating such a nice community? I'm not trying to butter you up, but the dbt Slack is so happening and Coalesce is so bumping. I'm curious, is it just that analytics engineers are wild people, or what do you think makes the dbt community so nice?

So I actually don't know that much about the R community. Is there a place that people gather to talk about best practices and stuff like that? Or is it just so widely used that there's a million different separated communities? Is there a place where R users get together in person?

So I think there was online, which was Twitter; that's where people shared knowledge. The R community has largely abandoned Twitter for fairly obvious reasons, and there's less of a central online place. But apart from that, there are quite a few regional R conferences, and Posit has a conference, but these are all conferences on the order of, you know, hundreds to maybe 1,500 or 2,000 people, scattered all over the place. So there's not really one central R event in person.

Yeah. And I will say there's a big hex sticker crowd. In R, people love hex stickers for packages, and there's a real frenzy to pick them up at conferences when they get dropped. Yeah. Yeah. And so if you create a package in R or dbt and you want it to be a real package, it has to have a sticker associated with it. And so then at the conferences, you know, you trade stickers with people. And I think that that's one of the things that's pretty unique about the R community.

It's hard, you know. Probably for anybody who has been the seed crystal for a reasonably large community, it's kind of an emergent phenomenon, and you never know exactly what the things are that made it happen. I would say that the biggest trait, which we were talking about before with analytics engineers, is that they often were previously data analysts and they were leveling up in their careers. And oftentimes the most common emotion as they went through that was a sense of overwhelm and a sense of imposter syndrome.

And in so many technical communities, the people who run the communities are highly technical, and there's a sense of, RTFM, like, don't ask this question until you've researched to the ends of the universe, and only then bother me. And we just acknowledged the fact that this stuff, for many of the people that were starting to use dbt in 2016 through 2020, just kind of felt overwhelming, and they just needed some support along the way. And so we kind of seeded the community to be helpful and supportive and friendly. And we were very serious about moderating out any behaviors that conflicted with that. And so that kind of creates a virtuous cycle: when people feel like they've been supported in their journey, they then go on and support other people in their journeys.

Yeah. I think there are interesting parallels to the R community, because 10 years ago, 15 years ago, R was like, for people with PhDs in statistics, by people with PhDs in statistics. And you go on the R-help mailing list and you ask a stupid question, and someone will tell you what a fucking idiot you are, basically. And so when I thought about the R community, that is something I wanted to do the opposite of, basically. And, you know, with these technology transitions, from mailing list to Stack Overflow, from Stack Overflow to Twitter, at every point there's an opportunity to reinvent the community a little bit and move towards a more friendly and welcoming environment.

And I think that was just a tremendous net benefit to the R community as a whole. And I think we're also lucky because the R community tends to be more diverse, because there are people coming from all branches of science, you know, diverse both in their backgrounds and their applications. And cultivating that feeling of being welcomed, and, as you said, that virtuous cycle, like, I felt really welcomed by the community and now I'm going to pay that forward and welcome the next generation of people, really led to a pretty remarkable transformation in the R community.

AI's impact on community and open source

Yeah, I was saying very positive things about AI and how I'm hopeful for its impact on analytics engineers. I am a little up in the air as far as AI's impact on community formation.

So the funny thing about communities is that they not only help people get things done, but they also build social capital. There were really meaningful social relationships built in the early days of the dbt community, when I was a super, super active member. But they happened as a part of asking and answering kind of boring technical questions. And now I would never ask those types of technical questions to a community in Slack. I would go to Claude or ChatGPT or whatever, and it would give me immediate answers that were probably of as high or higher quality.

And the other thing, and we still have to see how it plays out, is that even open source feels just a little bit less now. Obviously open source on the order of R or dbt, or, you know, something obvious like Linux, that stuff's not going away. But there's also this entire package ecosystem. You know, I spent a lot of time curating a package called dbt-utils, which was macros to do useful utilities. And now you could just ask Claude, hey, make me a macro that does this thing. And maybe, yeah. So I don't know. I worry about that stuff a little bit, but then I feel like a grumpy old man.

But I do. Yeah. I feel similar. The other thing I worry about is, well, we're the incumbents, right? We're the people who created the software that, if you ask Claude, it knows how to do. That's all in the training data. And if you're a young person... the way I promoted ggplot2 and dplyr was, someone would ask a question on the internet, and I'd be there like, hey, I'm going to both answer your question and I'm going to be friendly. So there's that: you're learning something new, and you're not going to get that from a chatbot. And there's that interpersonal relationship, which is also gone now. So yeah, it's pretty clear that's going to have profound implications for how these communities form and what people get out of them.

I mean, maybe like it, it frees us to, you know, focus more on communities, like for the sake of community, not just to get, you know, to solve your R or SQL programming problems. But I don't know, like it is. Yeah. I worry. I worry about that, about that loss of connection.

Yeah. I wonder how much, like, I remember like searching a lot too, and really appreciating like finding a blog post or like finding out someone kind of did a dive into what I was looking for. And I, I feel like I remember some of those to this day, which is like, I think Hillary Parker. So one person, the R committee way back wrote about like making an R package and somehow, even though it was like over a decade ago, it's like burned in my mind.

Um, I do, I do wonder how much, if like people will still feel as like encouraged to, I mean, maybe they'll write up just as much, but I, I know like dbt also like really, I feel like distinguish itself through just so many great blog posts and like deep dives. Um, yeah, there, there are posts that we, you know, there was this post we wrote in like 2017 or 2018, which was like how we configure snowflake for our clients with dbt. And that was used to configure so many snowflake instances, you know.

Certainly you can still write that type of content as much as you ever could. But, um, I think the economic incentives for it are changing very quickly. Which in some ways is not entirely, like, that's also what led to all of this content farming around just creating tons of pretty useless content. So whenever you search for something, you know, you get someone selling ads to you. I'm not sad to see that go, but the blogs that people poured so much heart and soul into.

And then I think the other thing I'm nostalgic for is, you know, you read someone's really cool technical blog. You're like, oh, that's awesome. Then you go and follow them on Twitter and, as well as the technical side, you also get some snapshot of their personality and the other stuff. Which is why the R ecosystem, I could, you know, we don't have to get into it, but I could tell you what their collective thoughts were on elections, US elections, because all of this stuff bled together once you followed somebody on Twitter.

Yeah, for better or worse.

Yeah, right. Did everyone go to Bluesky? Is that where the community is now? Or is it somewhere else?

I mean, not everyone, but I think it feels like there's enough of a community, there's a strong enough nucleus there that you can go and interact and people interact back. And, you know, it's enjoyable in the same ways as early Twitter. Like, you know, I tried Mastodon and I never got the same. And I'm on LinkedIn, which I kind of hate everything about, but people use it. And I've gotten better, you know, feedback there than other places.

LinkedIn was not on my bingo card as the number one social site that I use. And I'm still troubled that that's the case. But I do feel like somehow it's so tame that I'm like, this is fine as a social place. But I do miss reading posts by, like, an account that claims to be a raccoon digging through garbage or like, but actually, yeah, I'm getting that from Bluesky now, like just weird personas where you're like, this is just clearly such a totally different person from me. And I get to experience a little bit of that.

Personal projects and coding agents

I'm curious if either of you have any fun personal projects going on right now in data. I feel like with coding agents, I have been doing more personal projects than ever. My current thing is that I'm trying to create my own iOS app that pulls data out of HealthKit using the, like, highly hard to access SDK so that I can get my health data into a Postgres database and screw around with it.

I love this. I just have to say, this sentence four years ago would have sounded insane as a project, but somehow this project makes so much sense to me. Like, even if you've never done an iOS app, that you could, in a reasonable amount of time, make a usable iOS app.

Yeah. I mean, I also did an iOS app, but it's just a talk timer for, like, if you're giving a talk at a conference where you're chairing a stage. I've always had in my head this kind of platonic ideal of what I want from a timer. And I've tried so many different apps and they're always either bloated or ugly or full of ads. I'm like, oh wow, actually, I can just create this now. And I did. And it was so fulfilling to create this thing in Swift, which I'd never used before. And it works and I like it. And yeah.

I think I mentioned this maybe the last episode we had, but I've been doing a lot more Cantonese study. So my dad speaks Cantonese and it's a tough language because it's not written traditionally. So it's rare to have transcribed Cantonese. It's like only spoken. And then people learn to write and read Mandarin. But it's so easy today to get Whisper to transcribe Cantonese videos that it's kind of mind-blowing to have Whisper transcribing material. And then agents are really good at writing out the Cantonese.

And so it's been really nice to be able to generate sentences and have like a tutor that can take what vocab I've studied and kind of remix it. But I will say the nicest thing about this activity for me has been, like, right now I have so much focus on using things like Claude Code to generate. I find language study has been really nice for almost getting back in touch with picking up a skill and kind of fluency. That's felt good in the way that coding fast felt good before too. Like just being able to produce words and read and understand somehow feels nice.

Wait, how far along is this iOS app? Did you run into any hitches or did it get off the ground pretty easily?

It's gone okay so far.

Is it Claude Code, or what's the...?

Yeah, I'm using Claude Code. It's not complete yet. I have a late night tonight. Yeah. Actually, as we're recording, I have Claude working in the background.

Right. Nice. That's, I feel that constant talk. You're like, I might as well have you working while I'm doing other stuff.

Last week I gave a talk, an internal talk about using Claude Code, and someone during it, in the Zoom, polled the audience on how many people were watching the talk while Claude Code was doing something in the background. And it was like 20% of the people watching. So yeah, geez.

I actually realized, like, this is definitely toxic, but I realized that for a while I was thinking of meetings as not being real work. So whenever I was in a meeting, it didn't count as work. And one of the reasons that I found Claude Code so appealing was that suddenly the meeting felt like it was work, because Claude Code was working for me in the background.

Tristan, have you fired Claude Code at dbt projects? Like have you had any reckoning moments where you just turn it loose on dbt? Like what's that been like?

It's, I mean, it's shockingly good. As you mentioned before, Hadley, there's enough dbt code in the training data that it just knows how to do that. And now, so we built an MCP server, shipped that in, I think, April of last year, and that has seen very rapid adoption. And that now allows Claude to pretty straightforwardly execute stuff and also validate its own code.

So yeah, it's good. We did, I think it was maybe two months ago or something, where we were able to go from zero to a pretty sophisticated project in the space of one hour. And so it felt like kind of a moment.

We have an internal thing we call demo bot, which our sales team uses. So when they go to talk to a customer, instead of just showing them something generic, like, oh, here's New York bicycle share data, we have demo bot, which is a very simple Claude script that's just like: create a sample data set for this industry, make a dashboard, make an API, make a report. And even though the data is completely made up, it's so much more compelling to see something related to your industry. People really like stuff that's customized just for them. And it's easier now than ever to do that.

Predictions for 2026

Do you want to make some famously bad predictions for us, so in a year's time we can be like — Tristan only thought we needed one LLM for the entire United States.

I think that this is the year that Iceberg goes from a topic of conversation at CIO dinners to actually implemented in the wild. I think that AI is going to be layered on top of things that we already do. It is not going to be the death of dashboards or any such catastrophic things. Those are the two things that I mostly expect from this year.

I do think that agents, even saying this is the year of agents, I see the same thing. I think that the reason that agents have come for coding first and best is that, one, they're developed by software engineers and so it's easy to automate your own work. But two, a lot of times software is the least stateful thing and so it's actually easier to dummy up data and still do real work and this kind of stuff. It always takes a little bit longer in the world of data because state is just harder. But I think we're starting to resolve some of the kind of permission or all of these types of things that allow agents to safely get at the massive repositories of structured data that companies have.

Yeah, I think that there's a lot of companies that make tools, whether you're talking about Salesforce or Workday or whatever, these kind of purpose-built applications. And most of these companies want you to build agents within the context of that piece of software. And so you can do that. And certainly if you do that, there's certain advantages. Probably your agents can have a lot more context around what it's operating and maybe it can also take action as well. But there's real advantages to building agents in kind of a more horizontal, generic way on top of your data lake. Because then they can access any data, not just the data in that one application, and they're a lot more flexible, etc.

But when you're gonna put an agent on top of a data lake, you have to think about an agent just like a human. You're just gonna say, go have at it. You know how to read Parquet, do whatever. So you need to make sure that you give it access to data in the same way that you would give a human employee access to certain specific data and not other data. And I think we're just starting to get there in terms of how to think about doing that.

People have a lot of expectations of agents. They think about them as an automated version of a human. Whereas previously, we would have a service token, and that service token could control things within one certain application. But an agent workflow, now you expect it to interact with four to 10 different tools. And so all of a sudden, it's got an auth profile that looks kind of like mine as an employee. You've got to map it to a bunch of different applications. So it's not trivial.
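As a rough illustration of the point about agent access, here is a minimal Python sketch of a deny-by-default auth profile for an agent that spans several applications. Everything in it is hypothetical: the `AgentProfile` class, the `readable` helper, and the application and dataset names are made up for illustration and do not correspond to any dbt or vendor API.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class AgentProfile:
    """Hypothetical auth profile: one agent, explicit grants per application."""

    name: str
    # Maps an application name to the set of datasets the agent may read there.
    grants: dict[str, frozenset[str]] = field(default_factory=dict)

    def can_read(self, app: str, dataset: str) -> bool:
        # Deny by default: unknown applications and unlisted datasets are refused.
        return dataset in self.grants.get(app, frozenset())


def readable(profile: AgentProfile, requests: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Keep only the (app, dataset) requests that the profile explicitly allows."""
    return [(app, ds) for app, ds in requests if profile.can_read(app, ds)]


if __name__ == "__main__":
    # Made-up grants for a made-up agent.
    profile = AgentProfile(
        name="support-agent",
        grants={
            "crm": frozenset({"tickets"}),
            "warehouse": frozenset({"orders", "products"}),
        },
    )
    wanted = [("warehouse", "orders"), ("warehouse", "payroll"), ("hr", "employees")]
    print(readable(profile, wanted))
```

The design choice here is the same one you would make for a human employee: access is listed explicitly per application, and anything not listed is refused.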

I don't know. It just feels so hard to make predictions right now. I think we're going to see a lot of change. Some of it we can anticipate, some of it it's just going to be surprising, like second-order effects. To me, when I think about AI, it's all about being nimble, continuing to experiment and try stuff out and accepting whatever I believe today might be wrong in six months' time. But compared to six months ago, I don't know, I feel more optimistic, I guess. I still feel like software engineering is valuable and useful and that there's so many of the skills we know still continue to be useful. And now it's starting to think, well, what does this mean for data scientists? What are the skills that you need to apply, even though you're maybe no longer handwriting all of the code? So I have no spicy predictions.

Yes, I now have six fruit trees planted. I planted them mid-season last year. This coming year will be the first growing season. I have a big deer fence around everything. I built a bunch of raised beds. So in a year, if we talk again, I will tell you if I was successful or not.

Is this also kind of like your backup plan? Like if AI does take your job, at least you can still eat. Bushels of fruit.

I know that you're partially kidding, but the more that I go down the road of AI and everything that's happening right now, there's like a digital dysphoria that makes me more and more want to get my fingernails dirty. And so this scratches that itch.

Yeah, I'm like partially kidding and partially not. I think me and my husband are going to take a welding course later this year. Oh, cool. Because that's like a fun... I've looked at doing some metalworking stuff, but you need a lot of equipment and access to somebody else's workshop. I couldn't figure out how to make that happen. Yeah, we just discovered there's a cool maker space that's pretty close to us that does a bunch of stuff like that.

No, I have no need to weld. I don't even know what to do with these skills, but it seems like a fun thing to learn.

Tristan, thanks. Thanks so much for coming on. I mean, honestly, I think it's a dream to be able to talk about dbt in this space. And like you mentioned, it is kind of like two worlds. And I think it's been so helpful to hear about the similarities and differences between these worlds. And I'm just such a big fan of dbt and all the work y'all are doing. So really appreciate you coming on. And thanks so much. It's been a lot of fun.

The Test Set is a production of Posit PBC, an open source and enterprise tooling data science software company. This episode was produced in collaboration with creative studio Adji. For more episodes, visit thetestset.co or find us on your favorite podcast platform.