Building a 2 Terabyte Database For My Postgres Course | Rubber Duck Dev Show 116

Episode 116 | February 18, 2024 | 00:30:17

Hosted By

Creston Jamison

Show Notes

In this episode of the Rubber Duck Dev Show, we discuss how Creston built a 2 terabyte database for his new Postgres performance optimization course.

To get the show notes for this episode, visit:

https://www.rubberduckdevshow.com/episodes/116-building-2-terabyte-database-for-postgres-course/

 


Episode Transcript

[00:00:00] Speaker A: Hello, and welcome to the Rubber Duck Dev Show. I'm Creston.

[00:00:04] Speaker B: And I'm Coda.

[00:00:05] Speaker A: And today we're going to talk about building a two terabyte example database for the course I'm building. But before we get into that, we usually do a week in review, so I'll go ahead and start with that. My week in review is I've been super busy working on the course, but particularly the marketing side of things, because I launched a webinar to basically educate people about part of the course as well as a vehicle to sell the course. So I've been incredibly busy creating the webinars, creating the sales page, getting the course ready for release, and everything related to that. That's pretty much how my time's been spent. Coda, how about you?

[00:00:55] Speaker B: Well, so I was traveling for work for around three weeks, something like that. And then I hopped around East Asia, and then I got Covid.

[00:01:09] Speaker A: And then the party ended.

[00:01:13] Speaker B: I was lying in bed for a while and had some very unproductive work days as well. And the worst part about having both jet lag and Covid at the same time is both of those have a lot of drowsiness involved, so I literally have no idea how much of how off I am is Covid and how much is jet lag. But I do know I perk up quite a bit in the middle of the night, so it might be more jet lag than I really have been realizing here. Yeah. So, Creston, I think in your update you mentioned a course, and I know that's what we're talking about a lot today, but what is your course really about?

[00:02:04] Speaker A: So it's called Ludicrous Speed Postgres. It's basically a performance optimization course for Postgres. So basically, having people go through the course to really find how to squeeze the most performance out of Postgres, in terms of laying out your data in terms of the schema, and applying indexes and other techniques and tools within Postgres to make it as performant as possible. I'm actually announcing the course and having a webinar talking about Postgres as a part of the launch of the course. So I set up this Postgres Performance Demystified webinar: learn a simple framework to achieve thousandfold improvements. I just had a webinar, actually, the morning of the day we're recording this, and I have another one next week. So there's another opportunity for people to attend, to jump in on the webinar and see what it's all about. And this is meant for developers or sysadmins or database administrators, if they don't have as much knowledge quite yet, basically imparting a way for them to really become a Postgres performance master and get really large wins for certain queries that they're working on.

[00:03:22] Speaker B: And we can find this on... looks like Scaling Postgres?

[00:03:27] Speaker A: Yeah. So if you go to scalingpostgres.com, every page has this at the top, so you can just click in the header. So just scalingpostgres.com and click the link to get to the webinar page. But what I thought would be a good topic to talk about with regard to this is, I'm really trying to make this as close as possible to a classroom experience, and not necessarily like we're all going to be live doing the same thing. I'm pre-recording all the video education content, all the videos that are part of the course, but I am also providing a community so that the students can come and talk and ask questions of the other students that are in the course and myself.
And we are also walking through the course over the course of eight weeks. So it's eight modules of study that we're going to be covering over eight weeks, and then there's going to be a live Q&A where I'm going to be present and we're just going to talk about the things in the course. People can ask questions where they got stuck and things of that nature. So teacher-student interaction, if you will. So I'm really trying to make it something more than just, hey, I made a video course and I'm throwing it out there for anyone to buy and they can do it or not do it. This is really trying to reinforce that if you want to make a commitment to this, I'm making a commitment myself, trying to be there and present to help you learn this material. Then another part of it, which I really haven't seen other places do, is building this really large database. I have experience working with databases that are tens of terabytes, so I wanted to have that as part of the course, because usually things start to slow down around hundreds of gigabytes, but things get a little bit more interesting when there are terabytes involved. So that's kind of why I wanted to do that. And the first thing I'm going to show here is just an example of one of the states of the database at the time I took the screenshot. This is just a psql prompt, so it's the database prompt, and it's showing you the size of the database. It's just over two terabytes. Well, it's going to be a little bit more than that, but at the time that I took this, it was two terabytes. And then taking a look at some of the relations, or the tables, in terms of size: there's an events table that captures different events happening in the system, and that's about 5 billion rows. An email transmissions table is a billion-row table. Then 100 million payments, 20 million users, about 4,000 accounts. And this is built as a multitenant-type application, so each account kind of owns things. I'll talk a little bit more about that. So basically it was built around these accounts, or these tenants.

[00:06:38] Speaker B: And so when you were creating this, you mentioned that you've worked with a lot of data, or a lot of databases, that are five or ten or more terabytes, right? There are all sorts of different structures that you see and all sorts of kinds of data that you see as well. Why did you select this set of data?

[00:07:00] Speaker A: Well, I was actually targeting two terabytes because, hey, you can see two terabyte hard drives. So, okay, that was kind of why I set it there. I wanted it to be a little over a terabyte, but two terabytes... now, I was thinking, should I go all the way up to two terabytes? But I actually had some early adopters that signed up for the course for an early discount, and I was getting advice from them; basically, in exchange for a discount, I get some advice from them. And they said, we really want it to be two terabytes or more. I'm like, okay, there you go.

[00:07:34] Speaker B: So you actually have a hard drive somewhere with this data? Okay.

[00:07:41] Speaker A: Yeah, I purposely upgraded my machine. I added another NVMe drive to be able to handle it, to create it.

[00:07:50] Speaker B: Yeah. Wow, interesting. So you added, like, a four terabyte or something like that?

[00:07:54] Speaker A: Exactly. That's what I got. So that's what I did to be able to go ahead and prepare for this.
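For anyone who wants to reproduce the kind of size checks Creston describes at the psql prompt, here is a minimal sketch using the same pg gem his generator is built on. The database name course_data and the connection settings are placeholders, not the actual course database, and the two queries can be pasted straight into psql instead.

    # size_check.rb -- a sketch; point the connection at your own database.
    require "pg"

    conn = PG.connect(dbname: "course_data")  # placeholder connection settings

    # Overall database size, formatted by Postgres itself.
    db_size = conn.exec(
      "SELECT pg_size_pretty(pg_database_size(current_database()))"
    ).getvalue(0, 0)
    puts "database: #{db_size}"

    # The ten largest tables, counting indexes and TOAST, largest first.
    conn.exec(<<~SQL).each { |row| puts "#{row['relname']}: #{row['size']}" }
      SELECT relname, pg_size_pretty(pg_total_relation_size(oid)) AS size
      FROM pg_class
      WHERE relkind = 'r'
      ORDER BY pg_total_relation_size(oid) DESC
      LIMIT 10
    SQL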
In case anyone's curious, I went kind of meta with it, meaning that what the database is, is basically a course site. You know those SaaS apps where you can purchase courses? Well, okay, that's what I built with this. So there's a courses table and there are different editions for it. And there are different modules, and those modules have lessons. And so that's basically what the database is. It's the course environment. And you can send emails, of course, and make payments as people purchase courses and things of that nature.

[00:08:33] Speaker B: But this is of course less than the total number of people that you'll have for your course, right? You're going to have something...

[00:08:40] Speaker A: Oh, in my dreams I would have...

[00:08:43] Speaker B: ...like ten times this, right? Exactly, yeah. So this is really just preparing yourself for the real wave that your course is.

[00:08:53] Speaker A: So that's kind of the scope of it. So I was like, all right, how am I going to tackle building this thing? There are some methods that are frequently used to generate fake data. When you're talking about Postgres, Postgres has a function called generate_series, and it's a very quick and easy way to make a million rows of something, or you could even make a billion rows of something. It's very fast to do that. It's all happening within the database, and the Postgres database is pretty much written in C, so all of that is going to be very fast. But it would require a lot of custom SQL for good fake data. Like, there are libraries that say, hey, give me a random email address, give me a random name, give me a random state or city, all this sort of complex data. Those libraries have been built. I wasn't aware of one for Postgres; I may have missed it, but I didn't necessarily want to go this route even though I knew it would be really fast. The other option is that Postgres also comes with a pgbench program, which is really for benchmarking, but you can use custom scripts to be able to insert data. So it's definitely fast, but it still has the same kind of problem. It's probably not as fast as generate_series, although maybe it depends how you do it, but it still requires a lot of custom SQL for good fake data. So again, I didn't really pursue these types of Postgres tools. But there are Faker libraries for different languages, so I said, all right, I'll go ahead and do that. Now, their speed is going to be dependent upon the language. I am most familiar with Ruby, but of course that language is relatively slow compared to so many others that exist. But it does give you good, or good enough, fake data for what I was looking for. So I went with my language familiarity as opposed to trying to do it in a whole new language; that would just take me more time, like Rust, or, I mean, I'm familiar with Elixir, but again, that would take more time to do. The Ruby library is actually based on Python's Faker library, which I think has been around longer, but still, the Ruby one has a lot in it, you know: you say, hey, give me an IP address, it gives you an IP address. Give me whatever you need, it can do that. So I kind of went that route, but I wanted to do it as fast as possible.
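As a point of comparison, here is a rough sketch of the generate_series route that gets mentioned but not taken: everything runs inside Postgres, so it is very fast, but every "fake" value has to be spelled out as a SQL expression. The users table and its columns are hypothetical, not the course schema.

    require "pg"

    conn = PG.connect(dbname: "course_data")  # placeholder connection settings

    # A million rows generated entirely inside Postgres. Quick, but the fake
    # values are only as realistic as the SQL expressions you write for them.
    conn.exec(<<~SQL)
      INSERT INTO users (account_id, email, created_at)
      SELECT (floor(random() * 4000) + 1)::int,
             'user' || n || '@example.com',
             now() - (random() * interval '365 days')
      FROM generate_series(1, 1000000) AS n
    SQL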
This was the plan I had. I was going to use Ruby as the language and use the Faker gem. Now, I decided not to use the ActiveRecord ORM, the object-relational mapper for Ruby; it would probably just slow things down, instantiating objects and all that kind of stuff. So it was just a pure Ruby program, and it basically had the following gems: it had the pg gem, which is the Postgres client library, to connect to a Postgres database from Ruby; it had the Faker gem; and for some of the stuff I was doing, I wanted date and JSON libraries as well. That's pretty much the only thing that was in it. The slowest part of the whole process was running the Faker gem, or asking the Faker gem to do stuff. The Ruby part was the slowest part. So basically I was only going to use Faker where it was beneficial. But apart from that, I definitely used random functions generously, meaning that when I create an account, it randomizes how many courses there are, how many modules in each course, how many lessons in each course. So everything is random and pretty much non-deterministic in what it's going to create. And I also wanted to spend as much time in SQL as possible, because what that means is that we're going to be leveraging the C language within the database itself for as much of the process as possible, avoiding Ruby as much as possible, just to give it as much speed as possible. And I alluded to this: basically, the intent was to build a multitenant database. So nearly every table is owned by an account, or a tenant, if you will. There is an accounts table, and pretty much all of the tables have an account id defined, so that this particular piece of data, this course, this lesson, this module, is owned by the account that owns it. And because of that, there was just a small subset of global tables, but every other table and all the data associated with it, I could build an account at a time. So basically, build one account, build all the courses for it, modules, lessons, users, emails, events. And with that, I would say the average time to build an account, it varied given the randomness, but it was maybe ten minutes, 15 minutes per account, and there were 4,000 accounts. So I guess you could project how long it would take to build that.

[00:13:51] Speaker B: But this was probably single threaded, right?

[00:13:55] Speaker A: Well, yes, that's another excellent point. I have a six-core machine, so I ran six instances of the Ruby program in parallel. So that was my way of getting around that.

[00:14:10] Speaker B: Right.

[00:14:12] Speaker A: I didn't try to multithread anything in Ruby, but I was like, all right, I've got six cores, I'm going to run six instances of this at a time. So you can take 4,000 times maybe ten minutes, divided by six, and that was probably the rough calendar time or so.

[00:14:28] Speaker B: I see. Okay.

[00:14:29] Speaker A: To build the whole database.

[00:14:30] Speaker B: Wait, so you said 4,000 accounts?

[00:14:35] Speaker A: Yeah, approximately. Almost 4,000.

[00:14:38] Speaker B: Is the script actually something that you're making available to the people in your course as well?

[00:14:45] Speaker A: Not in this first version of the course. I might release it as a script that you can run and define how many accounts you want. And frankly, I'm using Ruby; I think I would much prefer having a compiled language to do this, make it a program, like Elixir or Rust, because then it can just be faster for everyone. I mean, I just did this because, me being mostly a Ruby developer, it was the fastest way for me to do it programming-wise, but not the fastest to run it.

[00:15:15] Speaker B: Yeah. So do you actually have, like, a dedicated machine that you just were running this on, or was it...

[00:15:25] Speaker A: No, I was just running it overnight. In some early iterations where it wasn't as efficient, it was warming up my room fairly well.

[00:15:31] Speaker B: Yeah. So I guess in Florida that's not usually that necessary, but here up in Boston that's a regular perk of being a heavy computer user.
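Here is a heavily simplified sketch of the account-at-a-time structure described above: a plain Ruby script with the pg and faker gems, one tenant built per loop pass, randomness deciding how much data hangs off it, and several copies of the script run side by side to keep all the cores busy. The table names, columns, and counts are placeholders rather than the real generator.

    require "pg"
    require "faker"

    conn = PG.connect(dbname: "course_data")  # placeholder connection settings

    ACCOUNTS_PER_PROCESS = 10  # launch one copy of this script per CPU core

    ACCOUNTS_PER_PROCESS.times do
      # Create the tenant first and get its generated id back.
      account_id = conn.exec_params(
        "INSERT INTO accounts (name, created_at) VALUES ($1, now()) RETURNING id",
        [Faker::Company.name]
      ).getvalue(0, 0).to_i

      # Everything else hangs off the account; randomize how much of it exists.
      rand(1..5).times do
        conn.exec_params(
          "INSERT INTO courses (account_id, name) VALUES ($1, $2)",
          [account_id, Faker::Educator.course_name]
        )
        # ... modules, lessons, users, emails, events built the same way ...
      end
    end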
[00:15:41] Speaker A: I'd rather, again, write it in something like Rust or Elixir, and have it be just a delivered program: you say how many accounts you want and it will just run. I would want to get it to that state before I would share it with the students of the course. All right. Now, in terms of implementation, there were actually two things I considered. The first was creating each account using Ruby, so basically creating it in memory: create all the objects needed, all the modules, lessons, et cetera, users, and then write that to a file and load it using Postgres COPY, because this might be the fastest way to do it. Everything just works in memory, as long as you have sufficient memory to do it; you write it to a file very quickly; and then Postgres COPY is super fast for loading data into Postgres. But I said, this is going to be really hard for handling referential integrity, since the database uses integer primary keys. So basically, when you create a record, it's going to create an integer for you and give it back to you. Now, I could have disabled the referential integrity, I could have generated my own primary keys and overridden what it's going to try to generate, and said, no, this integer is going to be 7, 72, or whatever. Also, it could be using a whole lot of memory, because I'm generating a fair number of paragraphs of data for each lesson. So basically there's really large text being generated, so these would be really huge objects. So I elected not to go this particular way. The next choice was using multi-row inserts. When you do an insert, you can insert a row at a time, or you can insert ten rows, 1,000 rows, 10,000 rows at a time. So this is the actual choice that I went ahead and went with. And what I did was basically just create an array of rows in Ruby. So I still was building the objects, but I wasn't building the whole structure of the account; it's just, all right, let's build out what the courses are going to be, and then, for that, create an array of all the data of each row, and then say, all right, insert all these rows into the database. And then, because I wanted to spend as much time in SQL as possible, I said, all right, once about half the data is loaded, basically the part where I was using Faker the most, I could then drop down to using direct insert statements. So once I have the courses in, the lessons in, the modules in, and some of the users, there are all sorts of connecting tables: there's an edition users table, to show which editions users have purchased, or there are lesson users tables, meaning which lessons users have completed or not. There, I don't need to actually generate that using the Faker gem. I can just say, hey, some percentage of users have completed these lessons, and just do it in an SQL statement, hitting all the rows at once. That was far faster. So basically, as soon as I could work only in SQL, I did that.

[00:18:58] Speaker B: So really, you basically just used Ruby for the parts where you're interfacing with Faker to generate data? And basically everything else you just used raw SQL calls?

[00:19:13] Speaker A: Yeah, pretty much. Whatever I could do in SQL, I did, because that gave me speed. Again, if someone was doing this using Rust or Elixir or Go, I'm sure they wouldn't have to worry as much about that. But that's what I did for this.
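Two of the techniques from this stretch of the conversation, sketched in the same plain-Ruby style: batching many generated rows into a single multi-row INSERT, and switching to a set-based INSERT ... SELECT for the connecting tables once the parent rows exist. Again, the table names, columns, and percentages are placeholders.

    require "pg"
    require "faker"

    conn = PG.connect(dbname: "course_data")  # placeholder connection settings
    account_id = 1                            # assume this tenant already exists

    # 1) Build an array of rows in Ruby, then send one multi-row INSERT.
    values = Array.new(1000) do
      "(#{account_id}, #{conn.escape_literal(Faker::Internet.email)}, now())"
    end
    conn.exec("INSERT INTO users (account_id, email, created_at) VALUES #{values.join(',')}")

    # 2) Once the parent data is in, connecting tables can be filled with one
    #    SQL statement -- e.g. mark roughly 40% of this account's users as
    #    having completed lesson 1, with no Faker involved.
    conn.exec(<<~SQL)
      INSERT INTO lesson_users (lesson_id, user_id, completed_at)
      SELECT 1, id, now()
      FROM users
      WHERE account_id = #{account_id} AND random() < 0.4
    SQL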
[00:19:28] Speaker B: I'm curious, did you think at all about using different Ruby implementations, or were you just using, what is it, C Ruby?

[00:19:39] Speaker A: Just the mainline Ruby, yeah. I also used the RETURNING capability of SQL to be able to build arrays of ids, or other data, for insertion. So once I inserted a bunch of data, give me the ids back, and then use them or join to them as necessary to build additional parts of the database. And it was really interesting building this. I felt like I was just building one half of an application, because I was constantly thinking about, all right, what does this application need to do? What kind of tables should be here? What actions need to be done, to define the columns that go in there? But there's no UI for it. And I should add, with regard to number two, the part about using direct insert statements as much as possible once half the data is loaded: I still needed to be able to join to other tables to do that, so I only added sufficient indexes for good performance. Otherwise, I tried to keep the number of indexes to a minimum, because that, of course, helps speed up inserts when you don't have as many indexes. All right. And that's pretty much how I did it. In terms of other advice, if you're doing something like this: increasing max_wal_size to minimize the frequency of checkpoints of the database can be advantageous to do. And then, of course, the Postgres docs have a particular page that talks about loading data in, and things you can do to speed that up. Some of their recommendations are more for loading in pre-existing databases, as opposed to creating something new like I was doing here, so I only did a subset of the recommendations that were listed there. But I just thought this might be a topic of interest to people who want to know how I did this.

[00:21:23] Speaker B: Yeah. So if people were to do this on their own, it sounds like the main thing you'd recommend, if it's something that would speed up their time, is to do it in some sort of language that might be a little bit faster; that's the biggest thing. Is there anything else that you found as kind of a major stumbling block or anything like that, I guess? Are you happy with the way you did it, ultimately?

[00:21:54] Speaker A: Well, I'm happy that it got done. I didn't quite know how long it would take. I did run into an issue where Ruby was really slow. I don't know if it was cleaning up memory efficiently. I never really looked at that, but if I tried to create multiple accounts with the same Ruby process, the memory kept increasing. I wouldn't want to just give this to someone, like, hey, look at this great program, because it's still kind of rough around the edges. It did the job; it got it done in an acceptable period of time.

[00:22:26] Speaker B: But it's a one-time utility, right?

[00:22:30] Speaker A: Yeah. I would never do what I did in this program, for example, and put it in a web app, because of SQL injection. I was just embedding certain things within the SQL itself, because I didn't have ActiveRecord's ease of being able to avoid SQL injection. So there's all sorts of stuff in there, because I was just interested in getting it fast and getting it done.

[00:22:53] Speaker B: Right, and doing it locally.

[00:22:55] Speaker A: Yeah, and it's all fake data, so there's no concern there. So it was really just about getting it done as fast as possible. The only thing that I think I would do differently, if I had more time, and something I'll think about doing in future iterations of the course and providing as a part of it, is doing it in a program that could be delivered, like Elixir, Go, Rust, something like that, that is just faster. That's pretty much it.
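A short sketch of the RETURNING technique mentioned above: insert a batch, collect the generated ids in the same round trip, and reuse them for the child rows. The max_wal_size lines at the end reflect the checkpoint advice from this part of the episode; the table names and the 16GB value are only illustrative.

    require "pg"

    conn = PG.connect(dbname: "course_data")  # placeholder connection settings

    # Insert a few modules and capture their ids for building lessons next.
    module_ids = conn.exec(<<~SQL).map { |row| row["id"].to_i }
      INSERT INTO modules (course_id, name)
      VALUES (1, 'Module 1'), (1, 'Module 2'), (1, 'Module 3')
      RETURNING id
    SQL

    module_ids.each do |module_id|
      conn.exec_params(
        "INSERT INTO lessons (module_id, name) VALUES ($1, $2)",
        [module_id, "Lesson for module #{module_id}"]
      )
    end

    # One-time setting for a bulk load (needs appropriate privileges;
    # the value shown is just an example):
    #   ALTER SYSTEM SET max_wal_size = '16GB';
    #   SELECT pg_reload_conf();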
[00:23:22] Speaker B: Do you foresee yourself using this data set for any other future courses or blog posts or anything like that? Is this something where you can see yourself doing that?

[00:23:36] Speaker A: I could potentially use it for other courses I'm thinking about. For example, this is strictly a performance optimization course that I'm doing, but I also have it in mind to do an SQL course, so it could be reused for that if I do it, like just learning SQL, and then window functions and different things like that. So if I do that, I definitely could reuse this. That's pretty much it. It was built for this one purpose.

[00:24:02] Speaker B: So I guess one question for you with regards to your course: I know you're focused a little bit more on these larger data sets, but at what kind of scale would you start recommending people take a look at your course?

[00:24:17] Speaker A: Well, really, if you want to learn how to optimize databases more in general. So it's really, are you at the stage where you're thinking, hey, I really want to do this, or you're seeing a need within the organization you're working with, maybe for potential opportunities, then maybe you'd like to try to join the course for that. I usually get called in for consulting engagements at 100 GB or higher; that is where it tends to happen. That's where the standard advice, the status quo type of thing of, hey, set this up, set up a few indexes on the table, starts breaking down, right? So if you have just a basic knowledge of Postgres and how to optimize it, that's where it may start falling down a little bit, at the 100-plus gigabyte level.

[00:25:08] Speaker B: And I guess there's a big kind of trade-off as well in terms of how many rows you have versus how much data each row actually takes up and things like that, right? I think we have some databases where we store point cloud data for robots, where we basically have stored positions and what we're expected to see there. Those are kind of like XYZ position data, and we just have this as, like, a binary blob that we just kind of throw in there as well. So something like that will end up using a huge amount of space, even with a pretty small number of rows.

[00:25:54] Speaker A: So you're storing an immense amount of point data within a blob in the database?

[00:26:00] Speaker B: Yes.

[00:26:01] Speaker A: Okay. Just out of curiosity, what's the structure of the data? Are you storing it as JSONB or in arrays or anything like that?

[00:26:10] Speaker B: Yeah, actually, no, maybe we use an array. It's been a little while since I've looked at that. And that's one of the main reasons, I think, that we weren't using SQLite, since these are actually on the robot.

[00:26:26] Speaker A: But the events table, for example, is more on the narrow side in the amount of data, but many, many rows, in this example database I set up. Whereas the lessons table, which contains all the text for each lesson, is very wide: each row has a ton of data that goes into TOAST, which, once data exceeds about 2 kB, goes into a separate storage area of the table. So there's going to be a balance of being able to look at both of those.
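To see the narrow-versus-wide contrast Creston is describing, you can compare a table's main heap size with its total size, which also counts TOAST and indexes. A small sketch in the same style; events and lessons are the table names from his example database, and the sizing functions are standard Postgres.

    require "pg"

    conn = PG.connect(dbname: "course_data")  # placeholder connection settings

    %w[events lessons].each do |table|
      row = conn.exec_params(<<~SQL, [table]).first
        SELECT pg_size_pretty(pg_relation_size($1::regclass))       AS heap_only,
               pg_size_pretty(pg_total_relation_size($1::regclass)) AS with_toast_and_indexes
      SQL
      puts "#{table}: heap #{row['heap_only']}, total #{row['with_toast_and_indexes']}"
    end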
[00:27:05] Speaker B: I know our case is always a little bit different, right? It's one of those unique, strange use cases. I think we're one of the only companies out there using Elixir for robotics. And then, I guess, as far as this data set goes, you mentioned that it was pretty tricky thinking about how you would... that it almost felt like you were building out an application, right? So what really were the biggest considerations you made there when you were designing the tables and things like that? When you were thinking about how that would be used, what was kind of your vision in terms of that?

[00:27:47] Speaker A: Well, really, a number of years ago, I released a course called Discover Ansible. So it was about Ansible, and when I did it, I looked at course platforms and I said, this is not that hard to actually do. So I actually just built my own course platform on a server I had. I had the courses and the modules and the lessons all in there, and people could go through and do the lessons. So that was kind of why I went this route. I was like, well, I already have this poor man's course site here, so I'll just take that, and I'll just keep adding more features. So the challenge was thinking of, okay, what are all the different features I could add to this to make it large enough to hit that milestone that I was looking for? It's roughly two terabytes, and some tables have a billion rows. Also, the tables are a little bit thin in terms of number of columns; in my consulting engagements, and even my own databases, there are two to three times as many columns as are in this database. But the more time I spent saying, okay, what other columns can I add to this to make it bigger... I didn't want to just put in fake stuff; it was trying to think, okay, well, I could add this, but it had to make sense, or reasonable sense, to do it. I didn't just want to throw random data in there. So that was a challenge, and knowing where to draw the line, because I know at any point I could say, all right, just create more accounts, and it gets bigger. But the challenge is, I think it's missing some additional columns that would be in an actual production database. But it is what it is. All right, so I think we'll go ahead and stop the show there.

[00:29:31] Speaker B: Creston, that was very interesting.

[00:29:33] Speaker A: Thank you. So I hope you guys enjoyed that. Be sure to like and subscribe if you did. Also be sure to visit rubberduckdevshow.com, where you can find links to all the topics discussed. And when you're there, go ahead and sign up for the email list so you can get an email every time we send out new content. Now, if you want to learn more about the course, there is one more webinar that's available. The one on the 15th was this morning; there's one webinar available on the 21st. And you can find out more about the course there, if that's something you're interested in. But apart from that, I hope everyone has a great week, and we'll see you next time. Happy coding.

[00:30:15] Speaker B: Happy coding. Bye.

Other Episodes

Episode 6

July 29, 2021 00:52:39

Handling Exceptions | Rubber Duck Dev Show 6

In this episode, we discuss how to handle exceptions in your programs. Honeybadger Airbrake Bugsnag Raygun AppSignal Datadog New Relic Dynatrace


Episode 2

July 01, 2021 01:07:36

Software Development Technical Project Management | Rubber Duck Dev Show 2

In this episode, we discuss technical project management for software development. Trello Github Project Management Jira Project Management


Episode 102

September 29, 2023 00:58:03

Our Favorite Development Tools with Drew Bragg | Rubber Duck Dev Show 102

In this episode of the Rubber Duck Dev Show, we discuss the hardware and software each of us use in our software development work...
