Episode Transcript
[00:00:01] Speaker A: Hello. Welcome to the Rubber Duck Dev Show. I'm Creston.
[00:00:05] Speaker B: And I'm Michael.
[00:00:06] Speaker A: And today we're going to talk about Postgres features, specifically EXPLAIN and EXPLAIN ANALYZE, and probably a little bit of pg_stat_statements as well. So developers who are interested in Postgres, this is an episode for you.
But before we get into that, typically we talk about our week in review. So I'll go ahead and start first. I've been working diligently on my Scaling Postgres course, my performance optimization course basically, and also doing a lot of plate spinning, because I have a lot on my plate with a couple of different shows and consulting and my product and all these types of things. So trying to find the time to fit everything in is really quite hard. A lot of spinning plates this week, but still doing the best I can to do the work and get the course across the finish line in an acceptable period of time.
But that's me. What about you, Michael?
[00:01:09] Speaker B: Nice. Sounds very familiar. Also working a little bit on my product, adding to and improving a tip we have around slow planning time, so I've been in the weeds of that, and also recording. I also do a Postgres-related podcast, and I just started doing guests on that this month and have four of those lined up. Two just recorded, one edited and published, and two to record. So lots more editing work ahead, sadly. But I'm looking forward to those conversations. I need to prep for those as well. So yeah, plenty of plates spinning, a couple smashed in the background.
[00:01:51] Speaker A: Try not to show it to others.
[00:01:54] Speaker B: Right?
[00:01:56] Speaker A: Yeah, I need to get on and book a lot more guests as well in the future. I kind of do just-in-time booking, which is not great. So I can definitely sympathize with you on that. All right, so let's go ahead and get started talking about Postgres. Before we get into that, do you mind giving us a little bit of your background about how you got into Postgres, like where you kind of came from to start learning about it?
[00:02:27] Speaker B: Yeah, sure. And I'd love to hear yours as well in a moment.
So my background, I actually studied maths at university and ended up working at a company called Redgate Software, who make a lot of developer tools and database tools, but mostly at the time it was for Microsoft SQL Server. They used to do these week-long events they called down tools weeks, where the whole development team got to work on whatever they wanted for a week as long as it was somewhat helpful to the company. And in one of these weeks, one of the developers worked on an equivalent to one of our more popular tools, but instead of for SQL Server, they did it for MySQL. Then later down the line, a team built a product for Oracle, and I ended up leading the team that ran those tools, so the Oracle tools and the MySQL tools, and we kept seeing Postgres come up in conversations. I was there for five years, but I think if I had been there a bit longer, hopefully we could have persuaded management to give us a bit of time and a few people to build the tools for Postgres as well.
[00:03:49] Speaker A: I'm sorry, what did some of these tools do?
[00:03:52] Speaker B: So the more popular ones that we built for the other databases as well were around comparing databases and source controlling schema. So you could compare schema and deploy between, let's say, production and staging when there was an issue in one that you hadn't seen in the other, see what differences there are, and version control those changes. If you wanted to compare the data, sometimes you'd want to do a deployment of data, or there might be data differences between two databases that should be the same, so there were tools for comparing and deploying those, and version controlling the schema. There were ones that we did, like.
[00:04:35] Speaker A: A diff on steroids, but for databases.
[00:04:38] Speaker B: Yeah, exactly. And kind of migration aware as well, not just diffing.
You might not want to do a diff in a certain way. Let's say you notice something about a table in one of them; you might not want to do a certain type of deployment. You might want to make sure it's not going to take production down by running that script. And the order matters as well. Right.
You can't just run migrations in any order. So there's some logic behind those as well.
[00:05:09] Speaker A: Yeah.
Okay. So you were doing that at Redgate and you heard postgres was a desired additional platform to add for the tooling. So where did it go from there?
[00:05:24] Speaker B: Yeah. So that was like 2010 to the end of 2014, up in Cambridge in the UK. Then I moved to London and worked at a company called GoCardless, who do payments. They were big Postgres users and loved it. And the team there, their enthusiasm was infectious, and it was a really impressive team.
I was only there just under two years, but I learned so much from the people there. And them being such big Postgres advocates made me realize it wasn't just large companies trying to save money moving to Postgres; it was also startups picking it as their database to scale on. It was being picked not just by old, kind of cost-conscious companies, but by new, growth-focused companies too.
Okay, so yeah, that was my first kind of experience seeing it being run in production, but I wasn't on that team.
[00:06:32] Speaker A: What were you doing at GoCardless, for example?
[00:06:35] Speaker B: Yeah, all sorts in the end, actually. I was supposed to be going to do product management, which was the main thing I did at the previous company, but, well, startups are a fun world. There were only 35 or so people when I joined, and what they ended up really needing was somebody to run their customer success, of all things. So I ended up running customer success there for about a year and a bit before going back into product management somewhere else. So yeah, it's a funny world, but basically I then had it in my head that Postgres was a really cool platform. I loved that it was open source, and it felt like a good platform to build for.
I did want to someday build software that I could run myself, or build software for developers, and Postgres seemed like a great platform for it. So in spare time, in evenings and weekends, I started interviewing people and finding out what we should build, what we could build that would be useful. And with a friend I built pgMustard in evenings, weekends, one day a week, that kind of thing. But yeah, what about you? Have you answered this before on the show, or how did you get involved with Postgres?
[00:07:53] Speaker A: I think I've probably answered it somewhere, I can't remember where, and I've mentioned this multiple times on different shows in different ways. But my earliest kind of technical thing related to databases was a conversion project. I was actually a project manager, and they needed someone who actually knew something about the databases. A little bit before I was a project manager, I had a lot of systems administration experience. So not so much a software developer, but setting up computers, setting up servers, configuring them, optimizing them, things of that nature. And we kind of needed a DBA of sorts, so I took on those responsibilities. So I did that in that capacity: part-time project management, quarter-time, or whatever you would call it, database administrator. I was working for a line-of-business unit, basically the fundraising arm of a university, so I started needing to do all sorts of reports. That's where I really got to understand SQL and all the data, running reports for people who were asking questions, which gave me more exposure in that area. I eventually moved to another institution and started using a product that ran on Oracle, and I got the Oracle DBA responsibilities as well, because again, it was a more technical role. So I kind of learned Oracle and whatnot. And then it was in 2009-ish that I started my company, and we were going to be doing a software-as-a-service application that does basically online giving forms and online event registration. So we had to pick a database, and knowing the price of Oracle around that time, we weren't going to be using the Microsoft platform and didn't have money for Oracle, so it was only going to be MySQL or PostgreSQL. Basically I gave the opinion, even though at the time we were using Rails and I think Basecamp was using MySQL, that I liked the fact that PostgreSQL at least had functions, or something semi-approximating stored procedures in Oracle and Microsoft SQL Server. So it just felt more mature in that area.
And I think that was probably the main reason, but it was still a little bit of fate, or probability, I don't know, that said, all right, let's use PostgreSQL. So basically I took my DBA knowledge and applied it to PostgreSQL. That's kind of how I got started with it. I never used it at any company before the one I founded in 2009-ish. So that's kind of my background.
[00:11:07] Speaker B: Wow, that's cool.
[00:11:09] Speaker A: Yeah.
So do you use any other databases, or is it pretty much pure PostgreSQL?
[00:11:22] Speaker B: No, I think. Well, first, yeah, I also run a software as a service business, but we try to keep the kind of minimum amount of data we can. So as much as I sometimes am tempted by the allure of having lots of data and therefore giving ourselves performance problems just to be able to use the product a little bit more for ourselves, no, mostly we just don't store much, therefore we don't have much data. We do use postgres because it'd feel mad not to.
But there's one database and one kind of tool that I'm keeping an eye on. The database is SQLite. I'm fascinated by how people are pushing that database at the moment.
It feels like a lot of energy around that and a lot of impressive engineering behind it as well.
Normally for different use cases than Postgres, but increasingly maybe not, which is interesting. And then the tool, which has DB in the name but I'm not sure you could call it a database, is DuckDB. It looks really impressive.
[00:12:34] Speaker A: Oh yeah. So these are probably used for more local or embedded solutions, I guess? Not 100%, but yeah, that's.
[00:12:45] Speaker B: What they were designed for initially, but increasingly you can host them server side. There's no reason you couldn't.
[00:12:57] Speaker A: Thinking about that, with any open source project, it's kind of wherever the core committers want to take it, so it can go anywhere. I mean, that's what's happened to Postgres. The reason it has this huge feature set is because people said, I want this, I need this, I'm willing to commit to the code base to do it. And that's where it comes from. My only exposure to SQLite is basically Ruby on Rails; that's kind of the database you don't have to configure, and when you're doing a development project it's the default database. So I've used it insofar as that, but that's my only exposure to it.
Are you using them for any particular purposes, or do you just find them fascinating?
[00:13:48] Speaker B: Database, yeah. And as a tool builder, I'd love to run my own company for decades, to live out my whole career, well, this part of it at least, after working for other people, working for myself, ideally for the rest of it. And I'd love Postgres to last that long, and I personally am betting on it. But you do need to keep your finger on the pulse a little bit and ask: if something were to replace it in 20 years' time, what would it be? What would it look like? We're probably still going to have similar data patterns, and if the past is anything to go by, we're probably going to end up with bigger volumes in the future.
But what we use to manage it might be different, and if so, what does that look like? What would people be using for businesses like ours in the future? What would a simpler thing look like? What would a cheaper thing look like? And what tools do those people need is normally the next question. So that's why I try and keep my finger on the pulse a little bit. But yeah, talking of the Basecamp guys, I think I saw DHH tweeting about SQLite performance. I think they're using it for their Once products, the new non-software-as-a-service products that they're letting people install themselves. And it's a nice thing that you can package alongside your code; you have the database there, sitting right next to it, and people don't have to manage it separately. So it's an interesting use case.
[00:15:22] Speaker A: Yeah, I'll have to look at that. I haven't looked at that recently.
Yeah. But this is probably a good segue to talk about explain, because this is of course related to your product, pgMustard, so feel free to talk about the product as well. But you said, in terms of a topic, to talk about explain.
So why is that important? Why does someone need to know about it, at least insofar as using PostgreSQL?
[00:15:52] Speaker B: Yeah, and I think this is part of your course, right?
You had a few lessons and one of them was on the very basics of explain as well.
[00:16:00] Speaker A: Yeah.
[00:16:01] Speaker B: So it comes up normally for performance reasons. If you put the word EXPLAIN before your query, normally a SELECT query, but it works for UPDATE and DELETE too, and execute that, then instead of getting the results of the query, or even instead of executing the query, what it will do is show you the execution plan.
So how Postgres would get you that data or do that operation. And that's really helpful, mostly for performance reasons, for knowing what choices it would make. Because SQL is a declarative language, we say what result we want; we don't specify the algorithms to use. Postgres uses a cost-based optimizer to choose between the options. Maybe it doesn't have any options, by the way; maybe a query can only be executed one way, because you don't have any indexes and you're querying all of the rows from a small table or something like that. If it only has one choice, it won't spend long planning that query, because it only has one route: here's the plan, we're going to scan the table sequentially, and maybe we'll stop early if you put a limit there.
But often if you're joining multiple tables, it can choose which order to join those in. It can choose which strategy to use to join those.
It could choose which scan type if you have indexes. So there's a bunch of choices that it can make that are really important for performance reasons. Joining tables in a certain order can make a huge difference to the performance of a query. Choosing an index scan over a sequential scan can be a massive difference.
So knowing what choices it's making can help you then work out why it's making those choices. Or things you could do to make it faster. So yeah, that's why it's normally performance related that you're looking at this. But I guess it could be for other reasons as well.
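As a purely illustrative sketch of what that looks like (a hypothetical orders table, invented numbers), prepending EXPLAIN returns the plan instead of the rows:

    EXPLAIN SELECT * FROM orders WHERE customer_id = 42;

                              QUERY PLAN
    ------------------------------------------------------------
     Seq Scan on orders  (cost=0.00..1693.00 rows=10 width=98)
       Filter: (customer_id = 42)

The costs shown are the planner's estimates in its own abstract units, not milliseconds, since nothing has actually been executed yet.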
[00:18:04] Speaker A: Yeah, I think I probably only use it for performance reasons.
I'll have to think if there's any other reason I would.
Yeah.
In terms of explain. So you also mentioned explain analyze, so how is that a little different?
[00:18:26] Speaker B: Yeah, so explain analyze will also execute the query behind the scenes. Now you won't get the results back yourself. You'll get a similar looking query execution plan, but with additional performance data. So the one that stands out to people, the one that's super valuable normally is the timing data.
So you get not only the time that the query took to execute, but also, for each operation that it did, for example each scan or each join, how long did it take? And that lets you look into where the bottleneck was, whether there was one bottleneck, or whether it was lots of things that each took a little bit of time. So that can be extremely useful for diagnosing a performance issue, and that's why explain analyze comes up a lot. ANALYZE is just a parameter of explain. And it's a bit confusing, because some people might be familiar with the separate ANALYZE command in Postgres, which is performance related as well: it's used to gather statistics to help the query planner. So if you've ever heard of VACUUM ANALYZE, that is the same ANALYZE as the regular ANALYZE command, but neither of those is really related to EXPLAIN ANALYZE. So yeah, it's a difficult medium to explain that over, but maybe that's a blog post that needs to be written.
But that's only one parameter of explain. It's probably the most useful, I'd say it's definitely the most useful parameter for explain, but there are loads of other useful ones as well. BUFFERS is one that I'd love to see on by default with ANALYZE. That gives you information about the amount of work done by the query: not just how fast it was, but the data read and where it came from. Was it in memory, or did we potentially have to go to disk for that?
Did any big operations spill to disk? Information like that you get via BUFFERS. Then there's also VERBOSE, and there are loads more settings as well.
And yeah, each of those is just a parameter for explain to get more information back.
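For illustration, this is roughly how those parameters combine and what the extra output looks like (hypothetical table, made-up numbers; exact lines vary by Postgres version):

    EXPLAIN (ANALYZE, BUFFERS, VERBOSE)
    SELECT * FROM orders WHERE customer_id = 42;

     Index Scan using orders_customer_id_idx on public.orders
         (cost=0.43..8.45 rows=10 width=98)
         (actual time=0.031..0.042 rows=10 loops=1)
       Output: id, customer_id, total, created_at
       Index Cond: (orders.customer_id = 42)
       Buffers: shared hit=4
     Planning Time: 0.110 ms
     Execution Time: 0.071 ms

ANALYZE adds the actual time and row counts, BUFFERS adds the Buffers line, and VERBOSE adds details like the Output column list.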
[00:20:50] Speaker A: Yeah, I hate to admit to you. Oh, sorry, go ahead.
[00:20:58] Speaker B: Well, when you use explain, do you tend to use explain analyze, or do you use more parameters? What's your kind of default?
[00:21:10] Speaker A: I was going there. I said, I hate to say it, but I pretty much just use explain analyze. I rarely use buffers because, I don't know, my perspective is, okay, where's the slowest node? What's the biggest bottleneck? And I just typically use the timing.
I guess it's only if I'm having trouble understanding why something is happening a certain way that I might throw on buffers. And I know I've read multiple blog posts and heard people talk about how buffers should be, as you say, a default. I was like, okay. But in terms of where I look on the resulting query plan, I basically look at the node timings for the most part. And it's kind of funny, reading explain output.
When I look at it, I think back to, have you ever seen the movie The Matrix?
[00:22:11] Speaker B: Yeah.
[00:22:12] Speaker A: And there's a point at which a guy is looking at the code scrolling down the screen, and someone asks, how do you read the code? And he says, oh, I see red dress, blonde hair, or whatever, in the code. And I have a feeling, for people just getting started, and you can correct me if I'm wrong, that a lot of people, even new developers, look at the explain plan output and it looks like the Matrix code. They have no idea what's going on. But once you take a little bit of time to actually understand what each part is, then you look at it and go, oh yeah, this is slow here because of this, and there's too many loops or whatever it is.
I may be taking this off on a tangent, but do you actually think some of the explain output is a bit, well, challenging for new people? I mean, I can read it now quite easily, but do you think it's challenging for new developers to kind of get a hold of?
[00:23:10] Speaker B: Yeah. And that's kind of why we built the product.
[00:23:15] Speaker A: Yeah. pgMustard. Yeah.
To solve that pain point.
[00:23:20] Speaker B: It really is. And even experts, we noticed. So I spoke to a bunch of people and also watched a lot of online forums where people were asking performance questions, and there was back and forth and follow-up questions. Even when an expert had what looked like all the information, sometimes they had to ask follow-up questions or give choices, like it might be this, it might be that. Or sometimes the experts would get things wrong: they'd misread it slightly or jump to a conclusion.
I have to catch myself sometimes, because the arithmetic can be tricky. Even if you look closely at the output, there are no thousands separators in any of the timings, for example. Sometimes you don't spot that there are a number of loops, and you don't quite work out that once you multiply that tiny number by that.
[00:24:13] Speaker A: Huge number, this one millisecond with 10,000 loops means 10,000 milliseconds or 10 seconds.
[00:24:23] Speaker B: Right? Or even, like, if it's 1/700th of a millisecond, you just think, fast, and then you see, oh, it's 117,000 loops. What? So these things add up quickly and it's taxing.
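A small sketch of that arithmetic, with invented numbers: in the text format the actual time is per loop, so it has to be multiplied by the loops count.

    ->  Index Scan using items_pkey on items
          (actual time=0.001..0.001 rows=1 loops=117000)

    -- roughly 0.001 ms per loop x 117,000 loops is about 117 ms in total,
    -- which is easy to underestimate at a glance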
Even the experts, we see, can kind of load it into memory, and then if they get distracted ten minutes into looking at a slow query plan, they might have to start again. So there's some of this stuff that humans aren't as good as computers at. And we're not the only tool; we were partly inspired by a couple of really good tools. There's a tool by depesz, sorry, explain.depesz.com.
[00:25:10] Speaker A: Great tool, which I mispronounced for years. If you watched Scaling Postgres, my pronunciation of the site name was awful; I just spelled it out letter by letter. And then I think it was, oh gosh, who was it?
I think Lukas Fittl, of pganalyze, on one of his videos, says depesz, and I'm like, I'm such an idiot.
[00:25:38] Speaker B: Online I heard some good advice to never think somebody's less smart because they pronounce something wrongly. It probably means they learnt it by reading it rather than hear. And people that read often people that read a lot.
Some of the smartest people you'll ever meet, yes, but I think they're from Poland, so the pronunciation should be Depeche. And I have said it wrong in a talk before and been politely corrected afterwards by, I think it was Vic fearing in all of his might saying, by the way, I think it's Depeche.
[00:26:19] Speaker A: Yeah, true story.
On my first or second or third episode of Scaling Postgres, someone sent me a link and said, you're saying PostgreSQL wrong, and the link had the audio file of how the community pronounces it. I can't remember what I used, but it wasn't PostgreSQL.
Thank you, if you were one of those people who kindly said, I think you're saying it wrong.
[00:26:52] Speaker B: Yeah, but also I'm kind of in favor of not correcting people; you know what they meant. However they say it, there's no confusion about what you're talking about.
I'm all for being kinder, and if one of your first interactions with the community is "you're wrong", it doesn't feel super welcoming. So yeah, I take your point. It is nice to know how not to sound like you don't know what you're talking about in front of a room of people or something.
But the rest of us should, I think, remember we didn't know once either. And if you want to cheat, the simple version of it is to just call it Postgres.
[00:27:35] Speaker A: That's what I tend to do. I can't help it. Any time it's come up recording Scaling Postgres episodes, I just mumble over the word. But yes, Postgres is much easier.
[00:27:47] Speaker B: Yeah, I've forgotten where we were. Explain. So yeah, do I see it as difficult to read?
[00:27:57] Speaker A: Yeah.
And you said that's exactly kind of why you're building the pgMustard tool.
[00:28:05] Speaker B: Well, yes. So there's explain.depesz.com, and a newer one, the one that was around at the time that we started, by Tatiyants, called PEV, P-E-V, Postgres Explain Visualizer, a much more visual version. The depesz one preserves the text format, whereas the PEV one makes it more visual. That's been kind of taken over, and a version two of it has been built by the team at Dalibo. Both of those tools do a good job, in my opinion, of at least the basic timing calculations: which operation took the most time, trying to assign an appropriate amount of time to each one. Now, you can't always break it into neat numbers that add up to 100%, but they at least take the simplest approach of subtracting each child operation from its parent operation and multiplying by the number of loops. So all the things we're trying to do in our head when we're reading a text format explain without one of these tools, that's the bare minimum. So first, work out where the time is going. That was a problem, but those tools mostly solve it, though there are some edge cases that I've spent far too long looking into.
But then there's always the next step: even with those tools, quite often it's, okay, that bit's slow, what can I do to speed it up? And even regular developers, I'm not talking about Postgres DBAs, but regular developers, even ones with, say, 15 years of experience, a really exceptional Rails developer, for example, might only be looking at a slow explain plan a couple of times a year, or maybe they only have a performance issue a couple of times a year. And I don't know about you, but things I only do a couple of times a year, I don't remember the details of.
So putting that context back in front of somebody when they need it felt like an opportunity, and so did helping them with that next step of, okay, this part's slow, what can you do about it? What are your options? Do you have multiple options? Might it be this kind of issue, might it be that kind of issue? Sometimes, even with the information we have, it's not obvious whether, if you improved, for example, the row estimates to help the planner make better decisions, that would change the plan or not. Did it have a choice? Would it reach that threshold or not? You literally don't know if you're only looking at one execution plan. So it's about giving advice in a way that treats people as the expert of their own domain and just says, well, in Postgres, this might be the issue, but it might not. Try this, come back, see if it's improved.
So it felt like there was an opportunity there that could maybe help these good developers that just don't have to use postgres all day, every day.
[00:31:01] Speaker A: Okay.
[00:31:03] Speaker B: If that makes sense.
[00:31:09] Speaker A: So that's pretty much what pgMustard tries to do with regard to that.
[00:31:16] Speaker B: Yeah, good point.
So we are a visualization tool, similar to the Dalibo tool. Now, it doesn't preserve the text format. With depesz, one of the nice things for experts, and probably for yourself, you might prefer that tool, is that even though you get the timing calculations, it still shows you the text format on the right hand side. So if you need to fall back to that, or want to fall back to that, it's still there for you. We don't show it, and Dalibo doesn't. But what we do instead is add tips. So on each operation, you'll see a series of issues that might be the case, and they'll be scored depending on how likely they are to speed up your query. A bunch of them will be scored zero: this is unlikely to have any impact. We even have what we call anti-tips, like, cache performance here was 100%, zero out of five, so nothing you can do here will help.
But then there are the top three issues we spot across the whole query plan. So we do a bunch of calculations, it's just an algorithm, basically, and if a tip could drastically improve the performance of your query, we'll give it five stars. If it's moderate, it will be somewhere around three or four stars. If it could make a small difference, it'd be one or two. So the tips are scored, and then we'll only highlight the top three. Here's the first thing to look into, here's the second thing to look into, and if neither of those helped, here's a last chance, I guess, baseball style, one last go at the problem.
[00:33:06] Speaker A: So out of curiosity, do you keep any metrics about the plans that are input? And do you have an example of the most insane one, like how many lines long a submitted query plan was?
I ask because I know at times, in my consulting, someone comes to me and says, hey, this query is slow, and I look at the explain plan and it's like 200 lines of explain output. I'm like, all right, time to get to work.
[00:33:37] Speaker B: Yeah. So two answers quickly. First answer: no. We do give people the option of storing their plans with us, but it's off by default, and I'm kind of glad. We're quite privacy conscious, and we'd rather not; you can't lose data you don't have. And query plans can contain sensitive information, because you're often pasting in the real query. If you want to do performance work, you kind of have to have realistic test data. Is it possible to create realistic test data? Of course it is, but most people don't bother. Most people use their production data. Still to this day, sadly, that seems to be true.
So it can contain, like, PII in your query.
For those reasons, we don't store them by default, and the opt-in rate is low; maybe 5% of plans submitted get stored.
So no, I don't have great stats on that. But for the second thing, because we're software as a service, we have had the odd case where people have pushed the upper size limit. Bear in mind I'm talking about JSON format plans; until a year or two ago we only supported JSON format plans, not text format plans. And the JSON is bigger: you have whitespace, but you also have every key and value, whereas in the text format it's pretty much just values, and multiple per line as well.
And on top of that, we suggest people request buffers, as we talked about before, which adds ten to twelve keys per node. So that 200 line one you're talking about, that's another 200 times ten, at least another 2,000 lines added to the plan. And we suggest verbose as well, which, despite the name, doesn't add as much as buffers. But it can if there are lots of columns. A query plan, and this is quite common in things like Rails or other frameworks, will often by default return all the columns, and some people have very wide tables. If you're returning all the columns at every node of a 200 line query plan, that's a lot of data as well. But to give you the short version, the largest query plan we've had submitted so far was over 50 megabytes.
Okay, right.
At first we couldn't handle that. We now do a bunch of compression and stuff to be able to handle it nicely, but it still thinks for a couple of seconds before processing one of those. And again, to your point, processing a 200 line query plan, even as an expert, is, okay, I'm going to have to ramp up to this. Whereas we wanted to build something that maybe gives you a shortcut. If the entire issue is on one of those 200 lines, which it can be, how do we get you there really quickly? And one of the tips I have for people is to check the bottom of the query plan first. You'll probably do this naturally anyway, but firstly the execution time is there, and planning time, but so is trigger related information, which can be the bottleneck. Planning time can be the bottleneck, and so can just-in-time compilation, especially on complex queries. Those are all reported right at the bottom, and if the issue is there, you can avoid looking through that 200 line plan; the issue is already pointed out right there. So yeah, there are a couple of shortcuts, but we built the tool partly so people can avoid having to look through those too, or at least have the computer helping you: maybe look at this one first.
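An illustrative sketch of the kind of footer being described, with invented numbers (the exact set and ordering of these lines varies by Postgres version and by query): the plan tree above might look innocent while the real cost shows up down here.

    Planning Time: 2450.123 ms
    JIT:
      Functions: 185
      Timing: Generation 12.3 ms, Inlining 210.4 ms, Optimization 890.2 ms,
              Emission 640.9 ms, Total 1753.8 ms
    Trigger audit_orders_trg: time=980.456 calls=5000
    Execution Time: 2890.034 ms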
[00:37:39] Speaker A: Yeah, in that particular case, that really long one, it was actually a case where I had to step back and say, all right, I think the whole premise is wrong and we need to rethink how this whole query works. Unfortunately that's what happens sometimes, and you have to rethink the whole thing. And in the worst case scenarios, some of that goes to, okay, the schema is set up this way, but you're wanting to ask questions this way. There's a bit of a disconnect, and if you want faster output, you're probably going to have to make some schema changes. So there are points where you need to step back. I mean, it's great when it's like, hey, it's this node, you put the index on, voila. But in this particular, really long case, it wasn't about changing the data, we didn't have to change the schema, but the grouping and summarization and merging things together actually had to be done in a reverse order to get the best performance. I can't really explain it more clearly than that, but I basically had to rethink how the query was being processed.
[00:39:07] Speaker B: Yeah, and that's part of it as well. Right. How do you point out when something's pretty much optimal already, even if it's taking 3 seconds? It's reading entirely from cache, it's pretty compact, it's efficient read-wise, this is where the buffers come in, you've not got a bloated table or a bloated index, but you're returning 30,000 rows, or you're doing a count over them, or you're doing some big aggregation, and you have to look at all of that data. How do you point out to people that that's quite efficient already, and maybe they need a rethink? That's a tricky problem for tools and for humans. Tricky problem, full stop. Things we've tried to do are to not provide tips when there aren't any good things to do. It's okay to have no tips if something's fast already, or if something's efficient already, which doesn't necessarily mean it's fast. If something's pretty optimal for what Postgres can currently handle, don't force tips; give only zeros and say, look, this might be all right already, or you have to have a rethink. Maybe you're going to need to pre-aggregate, or maybe you're going to have to estimate, or maybe you're going to have to do some other thing like you mentioned.
[00:40:23] Speaker A: Yeah, I think. Are you familiar with the concept of local maxima?
[00:40:30] Speaker B: Yeah.
[00:40:30] Speaker A: So it's like when you have some of these complex queries, you can optimize little bits and pieces here and there, but what you've optimized for is a local maximum.
But if you totally rethink how you're doing it, you find, far away, another point that gives you ten times greater performance. I'm just spitballing here, but that's kind of what I'm talking about: if we totally change how we're thinking about it, we can achieve this type of performance. So a lot of times, if the solution is not simple, like, oh, this index or partial index or whatever it may be, because I'm speaking from a consulting perspective, I have to step back and say, all right, let me rethink how we could potentially do this to get the best performance. It's basically the what-if scenario. What if we didn't have to use this schema? What if we could do things this way?
[00:41:39] Speaker B: Yeah, what if we didn't need this query? The best way of speeding something up is not doing it at all. But as a consultant, you have to ask questions to be able to know whether that's even an option, right? You have to think, what's the context that this query is being run in? How often does that need to be done? Does it change? Does the result change? How often does it change?
There's a bunch of context you need for that as well.
And also, by the way, we mentioned indexes as this separate case, but sometimes you don't want to add one, even if the index in this case would speed things up 1000x. Say this query is a reporting query, run once a month, on a really high-ingest table that needs to be able to handle super high peaks.
Adding that index might not be a good idea in your specific case.
It's rarer, but I think we give it as an example for a reason: almost nothing in performance is free. There's almost always a trade-off at some level.
And that's why I find it personally very interesting. But it also makes advice difficult.
And you can't just always say it depends, right? People want some opinion.
[00:43:03] Speaker A: Yeah, I was just doing a performance optimization, or looking into something, where they were wanting something in real time, but there was a huge amount of data to process to get the answer. They were using triggers to try to get real-time answers stored somewhere. But the problem is, on an insert or an update, they had to go through all of these accounts to update something, and they started experiencing problems with updates and locking. And it's like, okay, we're at the point where something's got to give. So I gave them the four options, and from a business case, ways to achieve it.
This one will require the most effort to implement, it's the most complex, but it can give you real-time data without causing blocking. This one is easier to implement, but you have eventual consistency: eventually the data will be updated. So it's just giving them these options and asking, all right, which option do you want?
[00:44:25] Speaker B: Yeah, exactly. And that's where we get the hardest trade offs right at the limit of things.
And it's great. That's partly why I love it. There's not just like an obvious answer, there's very much which of these do you want over the other one? And you can't have both. No matter how much money you throw at this, you can't have both.
[00:44:48] Speaker A: Yeah. It's like when running projects. Was it time, cost, scope, pick two or whatever?
Yeah, definitely trade offs.
[00:44:58] Speaker B: Yeah.
[00:44:59] Speaker A: One thing you had mentioned about explain is also the importance of testing on realistic size data sets. Did you want to talk about that a little bit?
[00:45:09] Speaker B: Yeah. So this is one of those things that I get is tricky, but it also trips people up a bunch.
A lot of the time when people are using explain, it's around looking at, let's say, a production workload: the server is busier than we want it to be, or we would like to reduce our instance size, or something else that's system wide. It's often very useful for looking at those most expensive queries.
But another time it can be useful is when you're building a new feature, and for your new feature you might not have data yet, so there are test scenarios you might want to build. On the flip side, if you are trying to speed up one of the existing queries, but you're working locally and you don't have access to production, you might struggle to reproduce the performance issues you see on production. And that's a multifaceted problem; it's not just data size, there might be other factors, like how busy the server is, all sorts of other reasons as well. But the primary thing I see trip people up is data volume. So those two cases: one is trying to reproduce issues locally, and the other is trying to build a new feature and thinking performance is fine, but realizing quite quickly once it hits production, or at least once some larger customers have a bit of data, that performance is not fine. The reason for that is that the execution plan, as we mentioned briefly, changes depending on the amount of data. That's the big factor: the number of rows is normally the biggest factor in the costs for that cost-based optimizer. So the cost of doing certain things is cheaper at small sizes than at large ones. How to explain this?
Sometimes you have two big costs. One is the cost of getting started, and then there's the incremental cost of each additional row. So if we're scanning a table sequentially, the cost of getting started is nearly zero.
We just start scanning, like reading a book: open the first page and keep going to the second, third, fourth. The cost of doing each one after that is about the same as the first one.
It adds up over time. Whereas with an index scan, the cost of doing that first lookup is a bit more expensive: you have to go to the back of the book, find it in the index, and then go to the page. So if we just want any one page, we're better off scanning sequentially, because we don't have to go to the index first and look it up. But very quickly, once we want more than a few, if we want a specific page, or one that contains some piece of information, our expected cost of looking it up in the index overtakes the cost of looking at things sequentially. And the same is true for join algorithms. If we're joining a very small thing with another thing, a nested loop might be our most efficient option, because there's very little cost to getting started. Whereas once we're joining more medium sized things, hashing those in advance, even though it costs a bit more up front, saves us on the incremental cost of further rows. So it's the size of these relations, the size of the tables, that is the primary driver in these cost-based decisions on join order, join algorithm, and scan choice. And it really trips people up in two ways. One is: this was fast on my dev machine, why is it slow now that I have some data? That's the first one. But it also trips people up in the opposite direction: I've got 100 rows and I've definitely added the index that I know will make this fast, but when I run explain, Postgres won't use my index. Why won't it use my index? And that's because even though you've got 100 rows, they're all still on one eight kilobyte page in Postgres, and it's easier for it to just scan the only page in that table to give the rows back to you instead of looking them up in an index. It's like if someone gave you a one page PDF and said, could you write an index for this? I'd happily do it for you, but it's not going to save you much time versus just scanning that PDF, especially because the minimum amount Postgres will read is one page.
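A minimal sketch of that effect (hypothetical table, illustrative plan output): with a handful of rows on a single 8 kB page the planner ignores the index, and once the table grows the same query flips to an index scan.

    CREATE TABLE tiny (id int PRIMARY KEY, val text);
    INSERT INTO tiny SELECT g, 'x' FROM generate_series(1, 100) g;
    ANALYZE tiny;

    EXPLAIN SELECT * FROM tiny WHERE id = 42;
     Seq Scan on tiny  (cost=0.00..2.25 rows=1 width=6)
       Filter: (id = 42)

    -- add a few hundred thousand rows and the plan changes
    INSERT INTO tiny SELECT g, 'x' FROM generate_series(101, 500000) g;
    ANALYZE tiny;

    EXPLAIN SELECT * FROM tiny WHERE id = 42;
     Index Scan using tiny_pkey on tiny  (cost=0.42..8.44 rows=1 width=6)
       Index Cond: (id = 42)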
[00:49:50] Speaker A: Yeah.
[00:49:52] Speaker B: What would you add to that?
[00:49:54] Speaker A: No, that's it, because as you said, it's a cost-based optimizer, and it looks at these statistics, the amount of data, the distribution of data, the cardinality of values within columns, to make a determination on the plan it's going to use. So if you have a local dev environment, that's going to be entirely different from production. Or even if you have a staging area where the data is a little bit larger, it's still not going to give you the same results as production, because maybe that's fabricated data, and it'll give you entirely different plans. So absolutely. When I do a consulting engagement, one of my requests is: can I have a replicated copy of production to do my performance testing? Because I'd rather not touch production, but having a copy of the data as it exists, at least from a point in time, removes so many variables from the equation of getting different plans.
Absolutely.
So one other area you wanted to talk about is pg_stat_statements, which is my favorite extension, given I like to do performance optimization. What's your experience with using it?
[00:51:23] Speaker B: Yeah, I would love to hear more about yours as well.
My experience is mostly with people whose system is overloaded or who have performance issues but don't know what's causing them. I guess it is most useful for that, but there are also a bunch of specific things you can look for. So it's a system view, and it's off by default as well, I think; I'd love to see it on by default. Increasingly, the people I come across, customers and friends, are using cloud providers, and it tends to be on by default in most of those these days, which is good. It's a system view that will track your statements in a normalized fashion. So, let's say you have two queries, one is select id from customers where id equals x, and the other where id equals y; those two queries would be normalized to the same query. So you can see which query groups are responsible for the most time spent in execution, or the most buffers, that's in there as well, the most I/O. So if you're on Aurora or some other database that charges you based on your I/O instead of a standard fixed cost, you might be looking at which queries are doing the most I/O and order by that. But probably you normally order by total execution time. I tend to add planning time to that as well and order by the sum, but it doesn't normally change much.
It's normally for those. Sorry, go on.
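A sketch of the kind of query being described, ordering by total time including planning (column names as in pg_stat_statements on Postgres 13 and later; total_plan_time is only populated when pg_stat_statements.track_planning is enabled):

    SELECT query,
           calls,
           total_exec_time + total_plan_time AS total_time_ms,
           shared_blks_hit + shared_blks_read AS shared_blocks_accessed
    FROM pg_stat_statements
    ORDER BY total_time_ms DESC
    LIMIT 10;

Swapping the ORDER BY to the blocks column is the I/O-focused variant mentioned above.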
[00:53:19] Speaker A: No, you bring up a very good point at the beginning, where you were talking about it being on by default, and I was like, yeah, why isn't this in core? I would really love to have it in core. I'm sure the concern is, well, we don't want to track too many queries, and I guess it does hurt performance a little, like the Heisenberg uncertainty principle: basically you are putting a cost on the system by having it on, but it really is quite minimal. And couldn't they just have very minimal defaults, like, oh, it just tracks the last 100 queries, something very small in the configuration parameters, and then you can choose to bump it up? But yeah, I would definitely endorse having it as part of core as opposed to an extension.
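A hedged sketch of what that minimal configuration already looks like today; pg_stat_statements has to be preloaded, its size is bounded by a setting (the permitted minimum for pg_stat_statements.max is on the order of 100 statements), and both settings only take effect after a restart:

    ALTER SYSTEM SET shared_preload_libraries = 'pg_stat_statements';
    ALTER SYSTEM SET pg_stat_statements.max = 1000;  -- distinct normalized statements kept
    -- then, after restarting, in each database you want to query it from:
    CREATE EXTENSION IF NOT EXISTS pg_stat_statements;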
[00:54:09] Speaker B: But yeah, I think you're right. I think people are scared of the overhead. And to their credit, or in their defense, if somebody's benchmarking, let's say, MySQL versus PostgreSQL, it'd be kind of nice if we weren't putting loads of features on by default that were costing us performance-wise on those stupid benchmarks that don't configure anything.
But on the flip side, I think for real users, not people benchmarking, it's so valuable, and it's valuable in hindsight. If my system is loaded now, it's really useful to know what's been run in the past, what work has been happening. If you or I get contacted by a friend or a customer saying, my Postgres is on fire, help me, and they don't have pg_stat_statements installed, we're somewhat limited.
If they don't have any logging on, which is mostly off by default as well, it can be difficult to help them. You're definitely at a disadvantage, or it takes you longer to help them than it would if this was on by default.
But yeah, so I get that it's a trade-off, like we were talking about; performance is a trade-off. But I think normally the performance overhead is worth it. I'm stealing this from, I think I heard Jeremy Schneider quote it, and I think it was from an Oracle performance expert, maybe Tom Kyte, maybe somebody else. They were asked to estimate the overhead of all of Oracle's instrumentation on performance, and they thought about it and then said: negative 10%. If we had none of this, it would be at least 10% slower than otherwise, because of all the improvements we've been able to make thanks to this instrumentation. So I thought that was a neat way of looking at it as well. Sure, there might be a single digit overhead; I've seen some estimates that it could be as high as 2% on some systems, but I've never been able to measure it.
Not even close to 2% for sure for the systems I've worked on.
But the queries we've sped up because of having that on have definitely improved performance overall by more than ten or 20%. So I like that answer.
[00:56:24] Speaker A: Yeah, because it's got to be minimal, and at a small instance size the incremental performance cost doesn't matter that much anyway.
I'm sure others in the community, or at least on the core team, would definitely have their opinions on it.
Yeah. So as we start to wrap up, are there any other things you wanted to cover, or any common issues that you see developers running into where they could maybe do something differently to improve their Postgres experience?
[00:57:07] Speaker B: Yeah, let me think.
It would be great to check if you do have pg_stat_statements on already, just in case you ever need it in the future.
There's a pretty simple check you can do. I think anybody can check it just using SHOW.
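For example, a couple of quick checks along those lines (the first shows whether the module is preloaded, the second whether the extension is created in the current database):

    SHOW shared_preload_libraries;
    SELECT extname FROM pg_extension WHERE extname = 'pg_stat_statements';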
But yeah, on the explain side, if you're reading plans by hand or using a depesz-style tool, I would highly encourage people to look at filtered rows, not just on sequential scans. When they're looking for index potential, I see a lot of people see an index scan and then not look any further. But if an index scan is still filtering tens of thousands or hundreds of thousands of rows, and bear in mind that loops number, that's still really inefficient; you're still doing loads of work that you don't need to do. So that's a big one that I see a lot of people miss, even people who are a bit familiar with this already. And the only other one, which is a whole topic in itself, is that it's worth learning about multicolumn indexes and how the order matters in those. I see a lot of even experienced developers not understand how much the order of those columns matters for serving different queries. If you get the right multicolumn index, you can make a single query extremely fast, and you can probably avoid having to add two or three indexes if you can get the right ordering and serve multiple queries with that same multicolumn index.
Well worth learning about those, definitely.
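Two illustrative sketches of those last points, with invented table and column names. First, an index scan that is still filtering a large number of rows, which the Rows Removed by Filter line reveals; second, a multicolumn index whose column order is chosen to serve an equality-plus-sort query:

    ->  Index Scan using events_account_id_idx on events
          (actual time=0.040..312.200 rows=18 loops=1)
          Index Cond: (account_id = 7)
          Filter: (status = 'failed')
          Rows Removed by Filter: 240000

    -- an index leading with the equality columns can serve this query far better
    CREATE INDEX events_account_status_created_idx
        ON events (account_id, status, created_at);

    SELECT * FROM events
    WHERE account_id = 7 AND status = 'failed'
    ORDER BY created_at DESC
    LIMIT 20;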
[00:58:55] Speaker A: All right, so thanks so much for coming on and sharing your knowledge. I greatly appreciate it.
I hope you guys enjoyed that as well. Please go ahead and like and subscribe if you want to get more insight into what's going on with the Rubber Duck Dev Show and get notifications of the episodes and the different content that's being released in terms of our videos. Be sure to visit rubberduckdevshow.com and go ahead and sign up for our mailing list. Apart from that, again, Michael, thank you so much for coming on, and for everyone else, I'll see you next week. And until then, happy coding.
[00:59:34] Speaker B: Thanks, Creston, thanks for having me. Take care, everyone.