Big O

2025/8/23

Reservoir Sampling

2025/5/7

Turing Machines

2024/12/20

A Commitment to Art and Dogs

Back in Memory Allocation, I introduced Haskie. The idea behind Haskie was to create a character that could ask questions the reader might have, and to "soften" the posts to make them feel less intimidating. I got some feedback from people that Haskie was a bit too childish, and didn't feel like he belonged in posts about serious topics. This feedback was in the minority, though, and most people liked him. So I kept him and used him again in Hashing.

Having a proxy to the reader was useful. I could anticipate areas of confusion and clear them up without creating enormous walls of text. I don't like it when the entire screen is filled with text; I like to break it up with images and interactive elements. And now dogs.

Then in Bloom Filters, I found myself needing a character to represent the "adult in the room." If Haskie was my proxy to the reader, this new character would serve as a proxy to all of the material I learned from in the writing of the post. This is Sage.

I liked the idea of having a cast of characters, each with their own personality and purpose. But I had a few problems.

# Problems

Both Haskie and Sage, because I have no artistic ability, were generated by AI. Back when I made them I was making no money from this blog, and I had no idea if I was going to keep them around. I didn't want to invest money in an idea that could flop, so I didn't feel bad about using AI to try it out. Since then, however, I have been paid twice to write posts for companies, and I know that I'm keeping the dogs. It wasn't ethical to continue piggybacking on AI.

While ethics were the primary motivation, there were some other, smaller problems with the dogs:

- The visual style, while I did like it, never felt like it fit with the rest of my personal brand.
- It was difficult to get AI to generate consistent dogs. You'll notice differences in coat colouration and features between variants of the same dog.
- The AI-generated images look bad at small sizes.

So I worked with the wonderful Andy Carolan to create a new design for my dogs. A design that would be consistent, fit with my brand, and look good at any size.

# Haskie, Sage, and Doe

The redesigned dogs are consistent, use simple colours and shapes, and use the SVG file format to look good at any size. Each variant clocks in at around 20kb, which is slightly larger than the small AI-generated images, but I'll be able to use them at any size.

Together the dogs represent a family unit: Sage as the dad, Haskie as the youngest child, and Doe as his older sister. They also come in a variety of poses, so I can use them to represent different emotions or actions.

We were careful to make the dogs easy to tell apart. They differ in colour, ear shape, tail shape, and collar tag. Sage and Doe have further distinguishing features: Sage with his glasses, and Doe with her bandana. Doe's bandana uses the same colours as the transgender flag, to show my support for the trans community and as a nod to her identity.

# Going forward

I'm so happy with the new dogs, and plan to use them in my posts going forward. I suspect I will, at some point, replace the dogs in my old posts as well.
I don't plan to add any more characters, and I want to be careful to avoid overusing them. I don't want them to become a crutch, or to distract from the content of the posts. I also haven't forgotten the many people that pointed out to me that you can't pet the dogs. I'm working on it.

2024/6/1

Bloom Filters

2024/2/19

Hashing

2023/5/24

Memory Allocation

One thing that all programs on your computer have in common is a need for memory. Programs need to be loaded from your hard drive into memory before they can be run. While running, the majority of what programs do is load values from memory, do some computation on them, and then store the result back in memory.

In this post I'm going to introduce you to the basics of memory allocation. Allocators exist because it's not enough to have memory available; you need to use it effectively. We will visually explore how simple allocators work. We'll see some of the problems that they try to solve, and some of the techniques used to solve them. At the end of this post, you should know everything you need to know to write your own allocator.

# malloc and free

To understand the job of a memory allocator, it's essential to understand how programs request and return memory. malloc and free are functions that were first introduced in a recognisable form in UNIX v7 in 1979(!). Let's take a look at a short C program demonstrating their use. If you have beginner-level familiarity with another language, e.g. JavaScript, Python, or C#, you should have no problem following along. You don't need to understand every word, as long as you get the overall idea. This is the only C code in the article, I promise.

```c
#include <stdlib.h>

int main() {
  void *ptr = malloc(4);
  free(ptr);
  return 0;
}
```

In the above program we ask for 4 bytes of memory by calling malloc(4), we store the value returned in a variable called ptr, then we indicate that we're done with the memory by calling free(ptr).

These two functions are how almost all programs manage the memory they use. Even when you're not writing C, the code executing your Java, Python, Ruby, JavaScript, and so on makes use of malloc and free.

# What is memory?

The smallest unit of memory that allocators work with is called a "byte." A byte can store any number between 0 and 255. You can think of memory as being a long sequence of bytes. We're going to represent this sequence as a grid of squares, with each square representing a byte of memory.

In the C code from before, malloc(4) allocates 4 bytes of memory. We're going to represent memory that has been allocated as darker squares. Then free(ptr) tells the allocator we're done with that memory. It is returned back to the pool of available memory.

Here's what 4 malloc calls followed by 4 free calls looks like. You'll notice there's now a slider. Dragging the slider to the right advances time forward, and dragging it left rewinds. You can also click anywhere on the grid and then use the arrow keys on your keyboard, or you can use the left and right buttons. The ticks along the slider represent calls to malloc and free.

Wait a sec... What is malloc actually returning as a value? What does it mean to "give" memory to a program?

What malloc returns is called a "pointer" or a "memory address." It's a number that identifies a byte in memory. We typically write addresses in a form called "hexadecimal." Hexadecimal numbers are written with a 0x prefix to distinguish them from decimal numbers.
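If you haven't worked with hexadecimal before, it may help to see that it's just a different notation for the same numbers. A tiny sketch you could run in a browser console (illustrative, not part of the visualisations):

```javascript
console.log(0x10);               // 16 — "0x10" means 1 sixteen and 0 ones
console.log((255).toString(16)); // "ff" — the largest value a byte can hold
```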
Move the slider below to see a comparison between decimal numbers and hexadecimal numbers.

0 == 0x0

Here's our familiar grid of memory. Each byte is annotated with its address in hexadecimal form. For space reasons, I've omitted the 0x prefix. The examples we use in this article pretend that your computer only has a very small amount of memory, but in real life you have billions of bytes to work with. Real addresses are much larger than what we're using here, but the idea is exactly the same. Memory addresses are numbers that refer to a specific byte in memory.

# The simplest malloc

The "hello world" of malloc implementations would hand out blocks of memory by keeping track of where the previous block ended and starting the next block right after. Below we represent where the next block should start with a grey square.

You'll notice no memory is freed. If we're only keeping track of where the next block should start, and we don't know where previous blocks start or end, free doesn't have enough information to do anything. So it doesn't. This is called a "memory leak" because, once allocated, the memory can never be used again.

Believe it or not, this isn't a completely useless implementation. For programs that use a known amount of memory, this can be a very efficient strategy. It's extremely fast and extremely simple. As a general-purpose memory allocator, though, we can't get away with having no free implementation.

# The simplest general-purpose malloc

In order to free memory, we need to keep better track of it. We can do this by saving the address and size of all allocations, and the address and size of blocks of free memory. We'll call these an "allocation list" and a "free list" respectively.

We're representing free list entries as 2 grey squares linked together with a line. You can imagine this entry being represented in code as address=0 and size=32. When our program starts, all of memory is marked as free. When malloc is called, we loop through our free list until we find a block large enough to accommodate the request. When we find one, we save the address and size of the allocation in our allocation list, and shrink the free list entry accordingly.

Where do we save allocations and free list entries? Aren't we pretending our computer only has 32 bytes of memory?

You caught me. One of the benefits of being a memory allocator is that you're in charge of memory. You could store your allocation/free list in a reserved area that's just for you. Or you could store it inline, in a few bytes immediately preceding each allocation. For now, assume we have reserved some unseen memory for ourselves and we're using it to store our allocation and free lists.

So what about free? Because we've saved the address and size of the allocation in our allocation list, we can search that list and move the allocation back into the free list. Without the size information, we wouldn't be able to do this.

Our free list now has 2 entries. This might look harmless, but actually represents a significant problem. Let's see that problem in action. We allocated 8 blocks of memory, each 4 bytes in size. Then we freed them all, resulting in 8 free list entries. The problem we have now is that if we tried to do a malloc(8), there are no items in our free list that can hold 8 bytes and the malloc(8) will fail.

To solve this, we need to do a bit more work. When we free memory, we should make sure that if the block we return to the free list is next to any other free blocks, we combine them together. This is called "coalescing." Much better.
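To make the bookkeeping concrete, here's a minimal sketch of a free-list allocator in JavaScript, the same language the allocator playground at the end of this post uses. Everything here is illustrative: the names are made up, there's no error handling, and the "memory" itself is implicit.

```javascript
const freeList = [{ address: 0, size: 32 }]; // all 32 bytes start out free
const allocations = new Map();               // address -> size

function malloc(size) {
  for (const block of freeList) {
    if (block.size >= size) {
      const address = block.address;
      block.address += size; // shrink the free entry from the front
      block.size -= size;
      allocations.set(address, size);
      return address;
    }
  }
  return null; // no block large enough: the allocation fails
}

function free(address) {
  const size = allocations.get(address);
  allocations.delete(address);
  freeList.push({ address, size });
  freeList.sort((a, b) => a.address - b.address);
  // Coalescing: merge any free blocks that sit next to each other.
  for (let i = 0; i < freeList.length - 1; ) {
    const current = freeList[i];
    const next = freeList[i + 1];
    if (current.address + current.size === next.address) {
      current.size += next.size;
      freeList.splice(i + 1, 1);
    } else {
      i++;
    }
  }
}
```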
# Fragmentation

A perfectly coalesced free list doesn't solve all of our problems. The following example shows a longer sequence of allocations. Have a look at the state memory is in at the end. We end this sequence with 6 of our 32 bytes free, but they're split into 2 blocks of 3 bytes. If we had to service a malloc(6), while we have enough free memory in theory, we wouldn't be able to. This is called "fragmentation."

Couldn't we just move the allocations around to make more room?

Sadly not. Remember earlier we talked about how the return value of malloc is the address of a byte in memory? Moving allocations won't change the pointers we have already returned from malloc. We would change the value those pointers are pointed at, effectively breaking them. This is one of the downsides of the malloc/free API.

If we can't move allocations after creating them, we need to be more careful about where we put them to begin with. One way to combat fragmentation is, confusingly, to overallocate. If we always allocate a minimum of 4 bytes, even when the request is for 1 byte, watch what happens. This is the exact same sequence of allocations as above. Now we can service a malloc(6). It's worth keeping in mind that this is just one example. Programs will call malloc and free in very different patterns depending on what they do, which makes it challenging to design an allocator that always performs well.

After the first malloc, the start of the free list seems to fall out of sync with allocated memory. Is that a bug in the visualisation?

No, that's a side-effect of overallocating. The visualisation shows "true" memory use, whereas the free list is updated from the allocator's perspective. So when the first malloc happens, 1 byte of memory is allocated but the free list entry is moved forward 4 bytes. We trade some wasted space in return for less fragmentation.

It's worth noting that this unused space that results from overallocation is another form of fragmentation. It's memory that cannot be used until the allocation that created it is freed. As a result, we wouldn't want to go too wild with overallocation. If our program only ever allocated 1 byte at a time, for example, we'd be wasting 75% of all memory.

Another way to combat fragmentation is to segment memory into a space for small allocations and a space for big ones. In this next visualisation we start with two free lists. The lighter grey one is for allocations 3 bytes or smaller, and the darker grey one is for allocations 4 bytes or larger. Again, this is the exact same sequence of allocations as before. Nice! This also reduces fragmentation. If we're strictly only allowing allocations of 3 bytes or less in the first segment, though, then we can't service that malloc(6). The trade-off here is that reserving a segment of memory for smaller allocations gives you less memory to work with for bigger ones.

Wait, the first allocation in the dark grey free list is 3 bytes! You said this was for allocations 4 bytes and up. What gives?

Got me again. This implementation I've written will put small allocations in the dark grey space when the light grey space is full. It will overallocate when it does this, otherwise we'd end up with avoidable fragmentation in the dark grey space thanks to small allocations.

Allocators that split memory up based on the size of allocation are called "slab allocators." In practice they have many more size classes than the 2 in our example.
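Both mitigations amount to a small amount of logic at the top of malloc. A hedged sketch, using the same illustrative numbers as the visualisations above:

```javascript
const smallFreeList = []; // allocations of 3 bytes or fewer
const largeFreeList = []; // allocations of 4 bytes or more
const MIN_ALLOCATION = 4;

// Overallocation: round every request up to a minimum size.
function roundedSize(size) {
  return Math.max(size, MIN_ALLOCATION);
}

// Segmentation: pick a free list based on the request size, the way a
// slab allocator would (real ones have many more size classes).
function freeListFor(size) {
  return size <= 3 ? smallFreeList : largeFreeList;
}
```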
# A quick malloc puzzle

What happens if you malloc(0)? Have a think about this before playing with the slider below.

This is using our free list implementation that mandates a minimum size of 4 bytes for allocations. All memory gets allocated, but none is actually used. Do you think this is correct behaviour?

It turns out that what happens when you malloc(0) differs between implementations. Some of them behave as above, allocating space they probably didn't have to. Others will return what's called a "null pointer," a special pointer that will crash your program if you try to read or write the memory it points to. Others pick one specific location in memory and return that same location for all calls to malloc(0), regardless of how many times it is called.

Moral of the story? Don't malloc(0).

# Inline bookkeeping

Remember earlier on when you asked about where allocation list and free list information gets stored, and I gave an unsatisfying answer about how it's stored in some other area of memory we've reserved for ourselves? This isn't the only way to do it. Lots of allocators store information right next to the blocks of memory they relate to. Have a look at this.

What we have here is memory with no allocations, but free list information stored inline in that memory. Each block of memory, free or used, gets 3 additional bytes of bookkeeping information. If address is the address of the first byte of the allocation, here's the layout of a block:

- address + 0 is the size of the block
- address + 1 is whether the block is free (1) or used (2)
- address + 2 is where the usable memory starts
- address + 2 + size is the size of the block again

So in this above example, the byte at 0x0 is storing the value 29. This means it's a block containing 29 bytes of memory. The value 1 at 0x1 indicates that the block is free memory.

Why store the size twice? Isn't that wasteful?

It seems wasteful at first, but it is necessary if we want to do any form of coalescing. Let's take a look at an example.

Here we've allocated 4 bytes of memory. To do this, our malloc implementation starts at the beginning of memory and checks to see if the block there is used. It knows that at address + 1 it will find either a 1 or a 2. If it finds a 1, it can check the value at address for how big the block is. If it is big enough, it can allocate into it. If it's not big enough, it can add the value it finds at address to address (plus the 3 bytes of bookkeeping) to get to the start of the next block of memory.

This has resulted in the creation of a used block (notice the 2 stored in the 2nd byte), and it has pushed the start of the free block forward by 7 bytes.

Let's do the same again and allocate another 4 bytes. Next, let's free our first malloc(4). The implementation of free is where storing information inline starts to shine. In our previous allocators, we had to search the allocation list to know the size of the block being freed. Now we know we'll find it at address. What's better than that is that for this free, we don't even need to know how big the allocation is. We can just set address + 1 to 1! How great is that? Simple, fast.

What if we wanted to free the 2nd block of used memory? We know that we want to coalesce to avoid fragmentation, but how do we do that? This is where the seemingly wasteful bookkeeping comes into play. When we coalesce, we check to see the state of the blocks immediately before and immediately after the block we're freeing. We know that we can get to the next block by adding the value at address to address, but how do we get to the previous block? We take the value at address - 1 and subtract that from address (again accounting for the bookkeeping bytes). Without this duplicated size information at the end of the block, it would be impossible to find the previous block and impossible to coalesce properly. Allocators that store bookkeeping information like this alongside allocations are called "boundary tag allocators."
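A sketch of navigating that layout may help. It treats memory as a flat array of bytes and follows the rules described above; the function names are mine:

```javascript
const memory = new Uint8Array(32);

function blockSize(address) { return memory[address]; }
function blockIsFree(address) { return memory[address + 1] === 1; }

// A block occupies: 1 size byte + 1 free/used byte + size data bytes
// + 1 trailing size byte, so the next block starts 3 + size later.
function nextBlock(address) {
  return address + 3 + blockSize(address);
}

// The trailing size byte of the previous block sits immediately before
// this block, which is what makes backwards coalescing possible.
function previousBlock(address) {
  const previousSize = memory[address - 1];
  return address - 3 - previousSize;
}
```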
What stops a program from just writing past the end of its allocation and trampling all over this bookkeeping?

Surprisingly, nothing truly prevents this. We rely heavily, as an industry, on the correctness of code. You might have heard of "buffer overrun" or "use after free" bugs before. These are when a program modifies memory past the end of an allocated block, or accidentally uses a block of memory after freeing it. These are indeed catastrophic. They can result in your program immediately crashing, or crashing in several minutes', hours', or days' time. They can even result in hackers using the bug to gain access to systems they shouldn't have access to.

We're seeing a rise in popularity of "memory safe" languages, for example Rust. These languages invest a lot in making sure it's not possible to make these types of mistakes in the first place. Exactly how they do that is outside of the scope of this article, but if this interests you I highly recommend giving Rust a try.

You might have also realised that calling free on a pointer that's in the middle of a block of memory could also have disastrous consequences. Depending on what values are in memory, the allocator could be tricked into thinking it's freeing something, but what it's really doing is modifying memory it shouldn't be.

To get around this, some allocators inject "magic" values as part of the bookkeeping information. They store, say, 0x55 at address + 2. This would waste an extra byte of memory per allocation, but would allow them to know when a mistake has been made. To reduce the impact of this, allocators often disable this behaviour by default and allow you to enable it only when you're debugging.

# Playground

If you're keen to take your newfound knowledge and try your hand at writing your own allocators, you can click here to go to my allocator playground. You'll be able to write JavaScript code that implements the malloc/free API and visualise how it works!

# Conclusion

We've covered a lot in this post, and if it has left you yearning for more you won't be disappointed. I've specifically avoided the topics of virtual memory, brk vs mmap, the role of CPU caches, and the endless tricks real malloc implementations pull out of their sleeves. There's no shortage of information about memory allocators on the Internet, and if you've read this far you should be well-placed to dive into it.

Join the discussion on Hacker News!

# Acknowledgments

Special thanks to the following people:

- Chris Down for lending me his extensive knowledge of real-world memory allocators.
- Anton Verinov for lending me his extensive knowledge of the web, browser developer tools, and user experience.
- Blake Becker, Matt Kaspar, Krista Horn, Jason Peddle, and Josh W. Comeau for their insight and constructive reviews.

2023/4/13

Load Balancing

Past a certain point, web applications outgrow a single server deployment. Companies want to increase their availability, scalability, or both! To do this, they deploy their application across multiple servers with a load balancer in front to distribute incoming requests. Big companies may need thousands of servers running their web application to handle the load.

In this post we're going to focus on the ways that a single load balancer might distribute HTTP requests to a set of servers. We'll start from the bottom and work our way up to modern load balancing algorithms.

# Visualising the problem

Let's start at the beginning: a single load balancer sending requests to a single server. Requests are being sent at a rate of 1 request per second (RPS), and each request reduces in size as the server processes it.

For a lot of websites, this setup works just fine. Modern servers are powerful and can handle a lot of requests. But what happens when they can't keep up?

Here we see that a rate of 3 RPS causes some requests to get dropped. If a request arrives at the server while another request is being processed, the server will drop it. This will result in an error being shown to the user and is something we want to avoid. We can add another server to our load balancer to fix this.

No more dropped requests! The way our load balancer is behaving here, sending a request to each server in turn, is called "round robin" load balancing. It's one of the simplest forms of load balancing, and works well when your servers are all equally powerful and your requests are all equally expensive.

# When round robin doesn't cut it

In the real world, it's rare for servers to be equally powerful and requests to be equally expensive. Even if you use the exact same server hardware, performance may differ. Applications may have to service many different types of requests, and these will likely have different performance characteristics.

Let's see what happens when we vary request cost. In the following simulation, requests aren't equally expensive. You'll be able to see this by some requests taking longer to shrink than others.

While most requests get served successfully, we do drop some. One of the ways we can mitigate this is to have a "request queue." Request queues help us deal with uncertainty, but it's a trade-off. We will drop fewer requests, but at the cost of some requests having a higher latency.

If you watch the above simulation long enough, you might notice the requests subtly changing colour. The longer they go without being served, the more their colour will change. You'll also notice that thanks to the request cost variance, servers start to exhibit an imbalance. Queues will get backed up on servers that get unlucky and have to serve multiple expensive requests in a row. Queues aren't infinite, either: if a queue is full, we will drop the request.
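To make the mechanics concrete, here's a minimal sketch of round robin dispatch with a bounded queue per server. It's JavaScript to match the post's simulations; the server count, capacity, and names are all illustrative:

```javascript
const servers = [
  { queue: [], capacity: 3 },
  { queue: [], capacity: 3 },
];
let nextServer = 0;

function dispatch(request) {
  const server = servers[nextServer];
  nextServer = (nextServer + 1) % servers.length; // each server takes a turn
  if (server.queue.length >= server.capacity) {
    return false; // queue full: the request is dropped
  }
  server.queue.push(request);
  return true;
}
```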
Everything said above applies equally to servers that vary in power. In the next simulation we also vary the power of each server, which is represented visually with a darker shade of grey. The servers are given a random power value, but odds are some are less powerful than others and quickly start to drop requests. At the same time, the more powerful servers sit idle most of the time. This scenario shows the key weakness of round robin: variance.

Despite its flaws, however, round robin is still the default HTTP load balancing method for nginx.

# Improving on round robin

It's possible to tweak round robin to perform better with variance. There's an algorithm called "weighted round robin" which involves getting humans to tag each server with a weight that dictates how many requests to send to it. In this simulation, we use each server's known power value as its weight, and we give more powerful servers more requests as we loop through them.

While this handles the variance of server power better than vanilla round robin, we still have request variance to contend with. In practice, getting humans to set the weight by hand falls apart quickly. Boiling server performance down to a single number is hard, and would require careful load testing with real workloads. This is rarely done, so another variant of weighted round robin calculates weights dynamically by using a proxy metric: latency.

It stands to reason that if one server serves requests 3 times faster than another, it should receive 3 times more requests. I've added text to each server this time that shows the average latency of the last 3 requests served. We then decide whether to send 1, 2, or 3 requests to each server based on the relative differences in the latencies. The result is very similar to the initial weighted round robin simulation, but there's no need to specify the weight of each server up front. This algorithm will also be able to adapt to changes in server performance over time. This is called "dynamic weighted round robin."

Let's see how it handles a complex situation, with high variance in both server power and request cost. The following simulation uses randomised values, so feel free to refresh the page a few times to see it adapt to new variants.

# Moving away from round robin

Dynamic weighted round robin seems to account well for variance in both server power and request cost. But what if I told you we could do even better, and with a simpler algorithm? This is called "least connections" load balancing.

Because the load balancer sits between the server and the user, it can accurately keep track of how many outstanding requests each server has. Then when a new request comes in and it's time to determine where to send it, it knows which servers have the least work to do and prioritises those.

This algorithm performs extremely well regardless of how much variance exists. It cuts through uncertainty by maintaining an accurate understanding of what each server is doing. It also has the benefit of being very simple to implement. Let's see this in action in a similarly complex simulation, with the same parameters we gave the dynamic weighted round robin algorithm above. Again, these parameters are randomised within given ranges, so refresh the page to see new variants.

While this algorithm is a great balance between simplicity and performance, it's not immune to dropping requests. However, what you'll notice is that the only time this algorithm drops requests is when there is literally no more queue space available. It will make sure all available resources are in use, and that makes it a great default choice for most workloads.
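The core of least connections is small enough to sketch in a few lines. Here I'm assuming each server object tracks an inFlight counter and exposes a promise-returning handle function; both are inventions for the sake of the example:

```javascript
function pickServer(servers) {
  let best = servers[0];
  for (const server of servers) {
    if (server.inFlight < best.inFlight) best = server;
  }
  return best;
}

function dispatch(request, servers) {
  const server = pickServer(servers);
  server.inFlight += 1; // counted when the request is sent...
  server.handle(request).finally(() => {
    server.inFlight -= 1; // ...and released when it completes
  });
}
```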
# Optimizing for latency

Up until now I've been avoiding a crucial part of the discussion: what we're optimising for. Implicitly, I've been considering dropped requests to be really bad and seeking to avoid them. This is a nice goal, but it's not the metric we most want to optimise for in an HTTP load balancer.

What we're often more concerned about is latency. This is measured in milliseconds from the moment a request is created to the moment it has been served. When we're discussing latency in this context, it is common to talk about different "percentiles." For example, the 50th percentile (also called the "median") is defined as the millisecond value below which 50% of requests fall, and above which the other 50% fall.

I ran 3 simulations with identical parameters for 60 seconds and took a variety of measurements every second. Each simulation varied only by the load balancing algorithm used. Let's compare the medians for each of the 3 simulations.

You might not have expected it, but round robin has the best median latency. If we weren't looking at any other data points, we'd miss the full story. Let's take a look at the 95th and 99th percentiles. Note: there's no colour difference between the different percentiles for each load balancing algorithm. Higher percentiles will always be higher on the graph.

We see that round robin doesn't perform well in the higher percentiles. How can it be that round robin has a great median, but bad 95th and 99th percentiles? In round robin, the state of each server isn't considered, so you'll get quite a lot of requests going to servers that are idle. This is how we get the low 50th percentile. On the flip side, we'll also happily send requests to servers that are overloaded, hence the bad 95th and 99th percentiles. We can take a look at the full data in histogram form.

I chose the parameters for these simulations to avoid dropping any requests. This guarantees we compare the same number of data points for all 3 algorithms. Let's run the simulations again but with an increased RPS value, designed to push all of the algorithms past what they can handle. The following is a graph of cumulative requests dropped over time.

Least connections handles overload much better, but the cost of doing that is slightly higher 95th and 99th percentile latencies. Depending on your use-case, this might be a worthwhile trade-off.
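If you want to reproduce these kinds of measurements yourself, percentiles are easy to compute from raw latency samples. A nearest-rank sketch (one of several common definitions):

```javascript
function percentile(samples, p) {
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}

const latencies = [12, 8, 30, 22, 95, 14, 11, 250, 18, 16];
console.log(percentile(latencies, 50)); // the median
console.log(percentile(latencies, 95)); // tail latency
console.log(percentile(latencies, 99)); // worst-case-ish latency
```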
# One last algorithm

If we really want to optimise for latency, we need an algorithm that takes latency into account. Wouldn't it be great if we could combine the dynamic weighted round robin algorithm with the least connections algorithm? The latency of weighted round robin and the resilience of least connections.

Turns out we're not the first people to have this thought. Below is a simulation using an algorithm called "peak exponentially weighted moving average" (or PEWMA). It's a long and complex name but hang in there, I'll break down how it works in a moment.

I've set specific parameters for this simulation that are guaranteed to exhibit an expected behaviour. If you watch closely, you'll notice that the algorithm just stops sending requests to the leftmost server after a while. It does this because it figures out that all of the other servers are faster, and there's no need to send requests to the slowest one. That will just result in requests with a higher latency.

So how does it do this? It combines techniques from dynamic weighted round robin with techniques from least connections, and sprinkles a little bit of its own magic on top.

For each server, the algorithm keeps track of the latency of the last N requests. Instead of using this to calculate an average, it sums the values, but with an exponentially decreasing scale factor. This results in a value where the older a latency is, the less it contributes to the sum. Recent requests influence the calculation more than old ones. That value is then multiplied by the number of open connections to the server, and the result is the value we use to choose which server to send the next request to. Lower is better.
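Put together, the scoring might look something like this sketch. It's a simplification of real PEWMA implementations, and the decay factor is a number I've picked purely for illustration:

```javascript
const DECAY = 0.5; // how quickly older latencies stop mattering

function score(server) {
  // server.recentLatencies holds the last N latencies, newest first.
  let weightedLatency = 0;
  let factor = 1;
  for (const latency of server.recentLatencies) {
    weightedLatency += latency * factor;
    factor *= DECAY; // each older sample contributes half as much
  }
  return weightedLatency * server.openConnections; // lower is better
}

function pickServer(servers) {
  return servers.reduce((a, b) => (score(a) <= score(b) ? a : b));
}
```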
So how does it compare? First let's take a look at the 50th, 95th, and 99th percentiles when compared against the least connections data from earlier. We see a marked improvement across the board! It's far more pronounced at the higher percentiles, but consistently present for the median as well. Here we can see the same data in histogram form.

How about dropped requests? It starts out performing better, but over time performs worse than least connections. This makes sense. PEWMA is opportunistic in that it tries to get the best latency, and this means it may sometimes leave a server less than fully loaded.

I want to add here that PEWMA has a lot of parameters that can be tweaked. The implementation I wrote for this post uses a configuration that seemed to work well for the situations I tested it in, but further tweaking could get you better results vs least connections. This is one of the downsides of PEWMA vs least connections: extra complexity.

# Conclusion

I spent a long time on this post. It was difficult to balance realism against ease of understanding, but I feel good about where I landed. I'm hopeful that being able to see how these complex systems behave in practice, in ideal and less-than-ideal scenarios, helps you grow an intuitive understanding of when they would best apply to your workloads.

Obligatory disclaimer: you must always benchmark your own workloads over taking advice from the Internet as gospel. My simulations here ignore some real life constraints (server slow start, network latency), and are set up to display specific properties of each algorithm. They aren't realistic benchmarks to be taken at face value.

To round this out, I leave you with a version of the simulation that lets you tweak most of the parameters in real time. Have fun!

EDIT: Thanks to everyone who participated in the discussions on Hacker News, Twitter and Lobste.rs! You all had a tonne of great questions and I tried to answer all of them. Some of the common themes were about missing things, either algorithms (like "power of 2 choices") or downsides of algorithms covered (like how "least connections" handles errors from servers). I tried to strike a balance between post length and complexity of the simulations. I'm quite happy with where I landed, but like you I also wish I could have covered more. I'd love to see people taking inspiration from this and covering more topics in this space in a visual way. Please ping me if you do!

The other common theme was "how did you make this?" I used PixiJS and I'm really happy with how it turned out. It's my first time using this library and it was quite easy to get to grips with. If writing visual explanations like this is something you're interested in, I recommend it!

# Playground

2023/4/10

Practical Problems with Auto-Increment

In this post I'm going to demonstrate 2 reasons I will be avoiding auto-increment fields in Postgres and MySQL in future. I'm going to prefer using UUID fields unless I have a very good reason not to.

# MySQL <8.0 auto-increment ID re-use

If you're running an older version of MySQL, it's possible for auto-incrementing IDs to get re-used. Let's see this in action.

```
$ docker volume create mysql-data
$ docker run --platform linux/amd64 -e MYSQL_ROOT_PASSWORD=my-secret-pw -p 3306:3306 -v mysql-data:/var/lib/mysql mysql:5.7
```

This gets us a Docker container of MySQL 5.7 running, attached to a volume that will persist the data between runs of this container. Next let's get a simple schema we can work with:

```
$ docker run -it --rm --network host --platform linux/amd64 mysql:5.7 mysql -h 127.0.0.1 -P 3306 -u root -p

mysql> CREATE DATABASE my_database;
Query OK, 1 row affected (0.01 sec)

mysql> USE my_database;
Database changed

mysql> CREATE TABLE my_table (
    ->   ID INT AUTO_INCREMENT PRIMARY KEY
    -> );
Query OK, 0 rows affected (0.02 sec)
```

Now let's insert a couple of rows.

```
mysql> INSERT INTO my_table () VALUES ();
Query OK, 1 row affected (0.03 sec)

mysql> INSERT INTO my_table () VALUES ();
Query OK, 1 row affected (0.01 sec)

mysql> INSERT INTO my_table () VALUES ();
Query OK, 1 row affected (0.01 sec)

mysql> SELECT * FROM my_table;
+----+
| ID |
+----+
|  1 |
|  2 |
|  3 |
+----+
3 rows in set (0.01 sec)
```

So far so good. We can restart the MySQL server and run the same SELECT statement again and get the same result. Let's delete a row.

```
mysql> DELETE FROM my_table WHERE ID=3;
Query OK, 1 row affected (0.03 sec)

mysql> SELECT * FROM my_table;
+----+
| ID |
+----+
|  1 |
|  2 |
+----+
2 rows in set (0.00 sec)
```

Let's insert a new row to make sure the ID 3 doesn't get reused.

```
mysql> INSERT INTO my_table () VALUES ();
Query OK, 1 row affected (0.02 sec)

mysql> SELECT * FROM my_table;
+----+
| ID |
+----+
|  1 |
|  2 |
|  4 |
+----+
3 rows in set (0.00 sec)
```

Perfect. Let's delete that latest row, restart the server, and then insert a new row.

```
mysql> DELETE FROM my_table WHERE ID=4;
Query OK, 1 row affected (0.01 sec)

mysql> SELECT * FROM my_table;
ERROR 2013 (HY000): Lost connection to MySQL server during query

$ docker run -it --rm --network host --platform linux/amd64 mysql:5.7 mysql -h 127.0.0.1 -P 3306 -u root -p

mysql> USE my_database;
Database changed

mysql> SELECT * FROM my_table;
+----+
| ID |
+----+
|  1 |
|  2 |
+----+
2 rows in set (0.00 sec)

mysql> INSERT INTO my_table () VALUES ();
Query OK, 1 row affected (0.03 sec)

mysql> SELECT * FROM my_table;
+----+
| ID |
+----+
|  1 |
|  2 |
|  3 |
+----+
3 rows in set (0.00 sec)
```

Eep. MySQL has re-used the ID 3. This is because of the way auto-increment works in InnoDB: on server restart, it figures out what the next ID to use is by effectively running this query:

```
SELECT MAX(ID) FROM my_table;
```

If you had deleted the most recent records from the table just before restart, IDs that had been used will be re-used when the server comes back up.

In theory, this shouldn't cause you trouble. Best practice dictates that you shouldn't be using IDs from database tables outside of that table unless it's some foreign key field, and you certainly wouldn't leak that ID out of your system, right? In practice, this stuff happens and can cause devastatingly subtle bugs. MySQL 8.0 changed this behaviour by storing the auto-increment value on disk in a way that persists across restarts.
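As an aside, you can see the value the server intends to hand out next by asking information_schema. On 5.7 this counter is held in memory and rebuilt with the MAX(ID) logic above whenever the server restarts:

```
mysql> SELECT AUTO_INCREMENT
    ->   FROM information_schema.TABLES
    ->  WHERE TABLE_SCHEMA = 'my_database' AND TABLE_NAME = 'my_table';
```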
# Postgres sequence values don't get replicated

Like MySQL 8.0, Postgres stores auto-increment values on disk. It does this in a schema object called a "sequence." When you create an auto-incrementing field in Postgres, behind the scenes a sequence will be created to back that field and durably keep track of what the next value should be. Let's take a look at that in practice.

```
$ docker volume create postgres-14-data
$ docker run --network host -e POSTGRES_PASSWORD=my-secret-pw -v postgres-14-data:/var/lib/postgresql/data postgres:14
```

With Postgres up and running, let's go ahead and create our table:

```
$ docker run -it --rm --network host postgres:14 psql -h 127.0.0.1 -U postgres

postgres=# CREATE TABLE my_table (id SERIAL PRIMARY KEY);
CREATE TABLE
```

And insert a few rows:

```
postgres=# INSERT INTO my_table DEFAULT VALUES;
INSERT 0 1
postgres=# INSERT INTO my_table DEFAULT VALUES;
INSERT 0 1
postgres=# INSERT INTO my_table DEFAULT VALUES;
INSERT 0 1
postgres=# SELECT * FROM my_table;
 id
----
  1
  2
  3
(3 rows)
```

So far so good. Let's take a look at the table:

```
postgres=# \d my_table
                            Table "public.my_table"
 Column |  Type   | Collation | Nullable |               Default
--------+---------+-----------+----------+--------------------------------------
 id     | integer |           | not null | nextval('my_table_id_seq'::regclass)
Indexes:
    "my_table_pkey" PRIMARY KEY, btree (id)
```

This output tells us that the default value for our id field is the nextval of my_table_id_seq. Let's take a look at my_table_id_seq:

```
postgres=# \d my_table_id_seq
                 Sequence "public.my_table_id_seq"
  Type   | Start | Minimum |  Maximum   | Increment | Cycles? | Cache
---------+-------+---------+------------+-----------+---------+-------
 integer |     1 |       1 | 2147483647 |         1 | no      |     1
Owned by: public.my_table.id

postgres=# SELECT currval('my_table_id_seq');
 currval
---------
       3
(1 row)
```

Neat, we have a bonafide object in Postgres that's keeping track of the auto-incrementing ID value. If we were to repeat what we did in MySQL, delete some rows and restart, we wouldn't have the same problem here. my_table_id_seq is saved to disk and doesn't lose its place. Or does it?

If you want to update Postgres to a new major version, the way you typically accomplish that is by creating a new Postgres instance on the version you want to upgrade to, logically replicating from the old instance to the new one, and then switching your application to talk to the new one.
First we need to restart our Postgres 14 with some new configuration to allow logical replication:

```
$ docker run --network host -e POSTGRES_PASSWORD=my-secret-pw -v postgres-14-data:/var/lib/postgresql/data postgres:14 -c wal_level=logical
```

Now let's get Postgres 15 up and running on a different port:

```
$ docker volume create postgres-15-data
$ docker run --network host -e POSTGRES_PASSWORD=my-secret-pw -v postgres-15-data:/var/lib/postgresql/data postgres:15 -c wal_level=logical -p 5431
```

Next up, we create a "publication" on our Postgres 14 instance:

```
postgres=# CREATE PUBLICATION my_publication FOR ALL TABLES;
CREATE PUBLICATION
```

Then we create our "my_table" table and a "subscription" on our Postgres 15 instance:

```
postgres=# CREATE TABLE my_table (id SERIAL PRIMARY KEY);
CREATE TABLE
postgres=# CREATE SUBSCRIPTION my_subscription CONNECTION 'host=127.0.0.1 port=5432 dbname=postgres user=postgres password=my-secret-pw' PUBLICATION my_publication;
NOTICE:  created replication slot "my_subscription" on publisher
CREATE SUBSCRIPTION
```

After doing this, we should see data syncing between the old and new instances:

```
$ docker run -it --rm --network host postgres:15 psql -h 127.0.0.1 -U postgres -p 5432 -c "SELECT * FROM my_table"
 id
----
  1
  2
  3
(3 rows)

$ docker run -it --rm --network host postgres:15 psql -h 127.0.0.1 -U postgres -p 5431 -c "SELECT * FROM my_table"
 id
----
  1
  2
  3
(3 rows)

$ docker run -it --rm --network host postgres:15 psql -h 127.0.0.1 -U postgres -p 5432 -c "INSERT INTO my_table DEFAULT VALUES"
INSERT 0 1

$ docker run -it --rm --network host postgres:15 psql -h 127.0.0.1 -U postgres -p 5431 -c "SELECT * FROM my_table"
 id
----
  1
  2
  3
  4
(4 rows)
```

So what's the problem? Well...

```
$ docker run -it --rm --network host postgres:15 psql -h 127.0.0.1 -U postgres -p 5432 -c "SELECT nextval('my_table_id_seq')"
 nextval
---------
       5
(1 row)

$ docker run -it --rm --network host postgres:15 psql -h 127.0.0.1 -U postgres -p 5431 -c "SELECT nextval('my_table_id_seq')"
 nextval
---------
       1
(1 row)
```

The sequence value is not replicated. If we try to insert a row into Postgres 15 we get this:

```
$ docker run -it --rm --network host postgres:15 psql -h 127.0.0.1 -U postgres -p 5431 -c "INSERT INTO my_table DEFAULT VALUES"
ERROR:  duplicate key value violates unique constraint "my_table_pkey"
DETAIL:  Key (id)=(2) already exists.
```

Note: it tried to insert id=2 here because when we called nextval earlier, it modified the sequence.

This can make major Postgres version updates very tricky if you rely heavily on auto-incrementing ID fields. You need to modify the sequence values manually (for example, with Postgres's setval function) to values you know for a fact won't be reached during the process of the upgrade, and then you likely need to disable writes during the upgrade, depending on your workload.

# Conclusion

You can avoid all of the above pain by using UUID fields instead of auto-incrementing integers. These have the benefit of being unpredictable and of not leaking information about the cardinality of the underlying table if you do end up using them outside of the table (which you shouldn't).

Thanks to this article from the wonderful folks at Incident.io, I am now aware of the German tank problem. Well worth reading both the linked article and the Wikipedia page for more reasons not to use auto-increment ID fields.

2023/3/25

Getting an Autism Diagnosis

On the 3rd of March 2022, we received a letter informing us that our eldest son, Max, has Autism Spectrum Disorder. The letter was the end result of a long process. I'm going to talk about that process from start to finish, in as much detail as I can.

This post would not have been possible without my wife's dedication to our 2 children, her persistence in the face of long odds, and her diligent note taking. Sophie, I love you.

# Prologue

This post spiritually follows on from this one. I ended that post by saying I'd like to write another post about the first months of parenthood. That never happened, because the first months of parenthood are an extreme test of patience and resolve.

Being new parents, we didn't know what we were doing. Max was irritable, difficult to get to sleep, and dropped down to the 9th percentile for weight. This last piece of information was a shock to us, and to our health visitor, and led to the revelation that Sophie wasn't producing enough breastmilk. Nobody's fault, just the way it was. To tell the truth, this was a relief. After we switched to formula feeding, Max's temperament changed almost overnight. He was content, he slept better, I was able to help with feeding, and Sophie's nipples were able to heal.

Max was 9 months old when the UK went into its first full lockdown of the COVID-19 pandemic on March 26th 2020. The week before the lockdown was announced, we had visited a local nursery and been given forms to fill in to confirm Max's attendance. We were excited to get him spending time with kids his own age, and we were looking forward to getting some time off from parenting. Instead, we all had to stay home 24/7 by law. I was in the fortunate position to already be working from home, and working in an industry that was relatively unaffected by the pandemic. But none of this helped Max's social development.

I tell you all of this because it's relevant to Max's diagnosis. While we got his diagnosis earlier than most, a fact we are ever grateful for, it did take longer than it would have had there been no pandemic. We attributed a lot of his behaviour to having lived most of his life in lockdown.

# How we realised Max was different

Max had his first "settling in" session at nursery on February 8th 2021. These are short sessions, often only a few hours long, designed to ease your child into the nursery setting. Max's first settling in session didn't go very well, a fact that, taken in isolation, isn't unusual. He spent the majority of the time screaming, and we ended up cutting it short.

Despite this, Sophie and I settled into a good rhythm with nursery. Max went there twice a week. It was a good balance between cost, us getting time off, and Max getting time with his peers. But Max wasn't taking to it, and one evening a member of staff took Sophie to one side and said Max "isn't where we expect him to be developmentally." It's hard to describe how this made me feel. Is this our fault? Are we not parenting well enough?

Nursery's concerns were that Max screamed a lot, couldn't follow instructions, and that his speech was less developed than that of his peers. They suggested it could be Max's hearing, and they recommended we talk to our GP to set up a hearing test. We called our GP and had a frustrating conversation in which he asked us if we think Max has problems hearing. We explained that we had been recommended to get a hearing test by his nursery. But did we think he had problems hearing?
I didn't think Max had any problems hearing, but I didn't want to say that because I am not a medical professional.

Nursery also recommended we contact our health visitor to ask for a developmental review. In our area, kids used to go through a developmental review at 2 years old. This changed in recent years, and it's done at 3 years old now. The review is made up of questions that gauge things your child can and cannot do. At the end, you get a score. If you want to see what's on it, you can search "ASQ:3 24 months" and you'll find loads of PDFs, all very similar. Max didn't do well at this; we had to say that he wasn't able to do most of the things they asked about. Because of his low score, we did another questionnaire called the ASQ:SE:2. This one focuses on social and emotional development. Again, Max scored low.

It was the result of these two tests that led our health visitor to refer us to our local council's special educational needs and disability (SEND) team. This happened on the 7th of May 2021. This referral included appointments with a paediatrician (which we were told would take a few months, but actually ended up taking almost a full year), a speech and language therapist, and the SEND team themselves. The 7th of May was the first point at which it felt "serious," and we started to suspect he may be autistic.

# Jumping through all the hoops

On the 12th of August we had our appointment with the SEND team. We'll call her Janet. It took place at nursery, and the day before it Janet had been at nursery to spend time with Max and observe his behaviour. I remember messaging my boss saying I need to be away from work for "an hour or so." The meeting lasted 4 hours. It was obvious that Janet had spent a lot of time with Max, and had taken detailed notes.

It was in this meeting that we asked "do you think Max is autistic?" Janet said yes, he probably is. We had asked other people this same question, because it had been on our mind since the ASQ, but everybody had been cagey about it. "Oh I couldn't say", "I'm not qualified to make that diagnosis", etc. We appreciated how forthcoming Janet was with us, and the risk she took personally in being open about it. She confided in us later that a lot of parents don't like hearing that their child might be autistic, so she wasn't surprised that most people didn't want to say.

After this meeting, Janet referred us to the Early Years SEND panel, and they decided that Max had special needs. It's important to note at this point that we didn't have an autism diagnosis. Help is based on need, not diagnosis. Autism or not, Max needed help with his development and our local council would give us that help regardless. Janet even went as far as to say that the diagnosis is irrelevant to the SEND team; he'll get the help he needs based on what they observe.

On the 26th of August 2021 we had our first speech and language therapy (SLT) appointment. This was an introductory session; Max played with some toy cars while we spoke about his behaviours. One of the things we took away from this appointment was to put Max's toys into clear plastic containers that he would need to ask us to open for him. This helps to cement the need for communication. We still do that to this day, and we do believe it has helped.

On the 13th of September 2021, our special needs practitioner got in touch with us for the first time. We'll call her Fay. She arranged "play sessions" with Max; these happened every few weeks from the 22nd of September 2021 to the 23rd of September 2022.
In these sessions, Fay presents Max with toys designed to test him in different ways. Some of them require him to match colours, some of them shapes, some of them are things you play with with another person, and some are toys you aren't meant to touch at all. All of these test how he reacts to situations, how focused he is, and how well he appreciates and accepts playing with others.

Something those play sessions taught me is that playing with children is a skill. Fay was able to get Max to play in ways we would have said were impossible without seeing it for ourselves. She was able to get him to do things when she said so, and more crucially to not do things he obviously wanted to do. He responded to her extremely well, and it was a joy to watch her work with him. We had heard it previously from Janet, but Fay confirmed it: Max has no difficulty learning. The way he learns is different to most other kids, though, and he will benefit from a more tailored approach. Looking back, it is probably around this time we started forming our opinion that Max should attend a special needs school.

On the 30th of September 2021 we had Max's first hearing test. The way they wanted to test Max's hearing was with a set of stacking cups. The doctor would demonstrate making a noise, then putting a cup onto the stack. Then she would make the noise again and add another cup. Then she would try and get the child to do the same. Max, however, wasn't at the level of understanding required to complete this test. The backup test involved a shelf of toys. The shelf was a grid, like a set of IKEA Kallax shelves, and in each square was a toy that could make a noise. The doctor would trigger each toy to make a noise and the idea was for Max to look at the toy that made the noise. Unfortunately, Max was terrified of the toys and screamed uncontrollably upon seeing them.

After this, the test was rescheduled for January 10th 2022. This time they tried to have him listen to a cartoon on the television while they put a sensor in his ear to take some measurements. He refused to let them put the device in his ear long enough to get any readings. They tried the first set of tests again, but he reacted the same way he did before.

A third test was scheduled for February 3rd 2022. This time they wanted to try and do the test while Max was asleep, but we weren't able to get Max to sleep at the time of the appointment. However, to everyone's surprise, he went into the doctor's office and did the stacked cup test immediately without prompting. He passed with flying colours, ruling out hearing as the cause of his language difficulties. To this day we don't know what changed.

On the 17th of February 2022, 6 months after the first appointment, we had our second SLT appointment. We had to chase them up to get this to happen, as we had not heard back from them. In this appointment we learned about "transition objects," objects that help children go from one activity to another. We used bubbles to help get Max into the bath, and a toy game controller to help get him into the car. He still uses the toy controller today, though he has grown to enjoy bath time enough to not need the bubbles.

On the same day, in the afternoon, we got a phone call telling us that there had been a short-notice cancellation with the paediatrician and would we like to do our appointment on the 20th of February? You're damn right we would!
We had been waiting for this appointment since May 2021, and had heard from friends that waiting over a year was common. Some people wait more than 2 years. To get seen in less than a year is rare.

Janet, our SEND coordinator, had compiled all of the paperwork from our other appointments and sent it to the paediatrician. She had also let the paediatrician know that we were receptive to a diagnosis (not all parents are). Unfortunately, for whatever reason, this documentation didn't reach the paediatrician in time for the appointment. The paediatrician had moved from another area and wasn't fully set up with her email yet. However, with what Sophie was able to find on her phone, and after observing Max for about an hour, the paediatrician told us she felt comfortable giving him a diagnosis of autism there and then, with an official letter to follow.

# Wrapping up

Getting an autism diagnosis before the age of 3 is uncommon, and I have to express appreciation for everyone involved in the process. While the help given to children should be based on needs alone, we have found it helpful to have a recognised diagnosis.

What I plan to write about next is the process we went through to get Max into a special needs school, a process which came to an end just a few weeks ago. We are ecstatic.

2023/3/2

I finally figured out how to take notes!

I've never been good at taking notes. I've tried. Oh boy, have I tried. Name a piece of note taking software, odds are I've tried it. I've even tried going old school with pen and paper. Nothing sticks. Until recently.

Some time ago, I learned about Apple's Shortcuts app. It's an app on iOS, iPadOS, and macOS that allows you to automate actions between apps. It's a little like IFTTT. I played with it and made a few fun things. I created a keyboard shortcut that could turn my lights on and off, for example. I didn't take it much further than that.

Since the start of the new year, I've been taking on more responsibility at work. This has meant an increase in meetings, and an increase in me being responsible for making sure things are moving forward. This means I often have to follow up on things after a meeting, and I would sometimes forget to do this. This would not do, I thought, and decided it was time to start taking meeting notes.

I had some requirements in mind:

- I want to be able to tag notes. I'd like to track things like date, who was there, what the key topics were, and be able to search based on these tags.
- I need the ability to create action items, and be able to ask "what action items have I not yet done?"
- It has to be super easy. I want to be able to jump into a meeting and have my meeting notes ready to go.

Turns out, combining Apple Shortcuts with Bear hits all of these requirements.

# Shortcuts

I have two Shortcuts I use to make my note taking life much easier:

- A shortcut that creates a meeting note.
- A shortcut that opens or creates a daily "scratch" note, for note taking outside of meetings.

The meeting note shortcut looks in my work calendar for the most recent meeting that started in the last 30 minutes. It then creates a note with the meeting title as the note title, and it adds tags for each person who accepted the calendar invite. It also adds a tag for the current date, my current location, and the current temperature outside. Just a bit of fun.

I trigger this shortcut by typing cmd+ctrl+m. Any meeting I go into, the first thing I do while I'm waiting for people to arrive is hit that shortcut, the note pops up a few seconds later, and I'm ready to take notes.

The daily scratch note shortcut is much simpler. It creates a note with the current date as the title, and all of the same non-meeting-specific tags as the meeting note: date, location, temperature. The only difference is it first searches for a note with the current date as the title, and if it finds one it opens that instead of creating a new one. I trigger this shortcut with cmd+ctrl+s. After a second or two, a note that looks like this opens up on my screen:

# Bear

Other than being a beautiful demonstration of not implementing every single feature your user base asks for, the primary thing Bear excels at in my workflow is TODO management. At any point in any note, you can create a TODO. This manifests as a list item with a checkbox, much like GitHub's TODOs. You can have as many TODOs as you want in a note, and Bear has a section of its navigation menu that will show you all notes with outstanding TODOs.

# Conclusion

I've been using this new system for about a week now, which is longer than I've been able to stick with any other note taking system. Nothing else has ever felt as natural to me as this does. The key outcome, though, is that I feel more on top of things now. I'm not dropping the ball on things people ask me to do in meetings.
# Bear

Other than being a beautiful demonstration of not implementing every single feature your user base asks for, the thing Bear excels at in my workflow is TODO management. At any point in any note you can create a TODO, which manifests as a list item with a checkbox, much like GitHub's TODOs. You can have as many TODOs as you want in a note, and Bear has a section in its navigation menu that shows you all notes with outstanding TODOs.

# Conclusion

I've been using this new system for about a week now, which is longer than I've been able to stick with any other note taking system. Nothing else has ever felt as natural to me as this does. The key outcome, though, is that I feel more on top of things. I'm not dropping the ball on things people ask me to do in meetings, and people don't have to chase me for things as much, which makes me feel good and I'm sure makes them feel good as well.

2022/2/14

Adventures in Homelab: Part 1

If you work in tech, and you use the cloud in any way, you've probably heard of Kubernetes. It's inescapable now, and there's no shortage of takes on it. I've worked in a few companies that have used Kubernetes, but I've never been close to it. I've used in-house tools that communicate with it, or CI/CD systems that deploy my code into it automatically. This has left me not really knowing what Kubernetes is or how it works. That changes now. I'm embarking on a journey to create a production-ready Kubernetes cluster in my own home.

# What's in this post

By the end of this post I will have shown you:

- How I installed Arch Linux on 3 Raspberry Pi 4Bs and got them ready to be kubelets
- How I bootstrapped a bare metal Kubernetes cluster on those Raspberry Pis
- How I set up pod networking in the cluster

# Things I bought

I've also wanted to slide down the /r/homelab rabbit hole for a while, so here's what I bought to get started:

- 3x Raspberry Pi 4 model B with power supplies and SD cards
- A rack mount for the Raspberry Pis
- A 12u 19" cabinet
- A 16 PoE port gigabit ethernet switch
- A rack mountable power strip
- A rack mountable shelf
- Some teeny weeny network cables

Here's what it all looks like when put together: [photo of the assembled rack]. Also pictured is the shelf, which is holding a UPS on the left and a NAS on the right. I had those already, so didn't list them as part of what I bought for this project. While the UPS is optional, the NAS is quite critical to my setup: it will eventually host all of the persistent data for my cluster. More about this in the 2nd part of this series.

# Preparing the Pis

The first step is to get an OS running on the Raspberry Pis. While the official documentation on creating a bare metal Kubernetes cluster recommends using a deb/rpm-compatible Linux distribution, I'm a long-time fan of Arch Linux. Surely I can't be the first person to want to do this on Arch? Fortunately, I'm not. Morten Linderud, part of the Arch Linux security team, has written a great blog post on getting a bare metal Kubernetes cluster working using Arch Linux. There's only one small gotcha: he didn't do it on Raspberry Pis.

Before running through the steps in his blog post, we need to get Arch running on the Pis. I followed the installation guide from the official Arch Linux ARM website, which worked perfectly. I went with the ARMv7 installation guide because the disclaimer on the AArch64 one put me off a little. This decision hasn't hurt me so far, though I have occasionally had to look harder for Docker images built for ARM and not ARM64 (thanks, Apple).

I'm going to use kubeadm to bootstrap my cluster, and while kubeadm is an officially supported package in the Arch Linux repos, there's no ARM build of it. There is, however, an ARM build in the AUR. I installed yay as my preferred AUR tool. To save you some time, here's everything I needed to install on each Pi:

```shell
yay -S kubeadm kubelet crictl conntrack-tools ethtool ebtables cni-plugins containerd socat
```

A lot of these came up during the kubeadm init process. It runs a set of "preflight checks" that require you to install the necessary binaries. It also checks that your system has various capabilities, and one of these was missing for me: memory cgroups. I had to add the following onto the end of /boot/cmdline.txt and reboot:

```
cgroup_enable=memory
```

It also warned me that the hugetlb cgroup wasn't enabled, but that was an optional dependency and I decided to ignore it. This hasn't bitten me so far.
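To double-check that the memory cgroup change took effect after the reboot, something like this works (a quick sanity check of my own, not a kubeadm step):

```shell
# After editing /boot/cmdline.txt and rebooting, the memory controller
# should be listed in /proc/cgroups with a 1 in the "enabled" column.
grep memory /proc/cgroups
```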
The last thing I did was set the hostname of each of the nodes. Modify /etc/hostname and name the nodes as you see fit. I used kubernetes-master, kubernetes-worker-1, and kubernetes-worker-2. I also gave them static IPs on my local network, and DNS names to make communicating with them easier.

# Bootstrapping the cluster

Step 1 of bootstrapping a cluster is to set up your master node. The Kubernetes project ships a tool called kubeadm (Kubernetes admin) that makes this very easy. I ran the following:

```shell
kubeadm init --pod-network-cidr 10.244.0.0/16 --upload-certs
```

The --pod-network-cidr flag is the desired subnet you want pods to live in. I chose something very different from my home network so I'd be able to tell them apart. The --upload-certs flag I'm not really sure about. Morten Linderud uses it in his blog post, so I did as well. From reading the documentation on the flag it looks like I didn't need it, so try without it if you're feeling adventurous.

kubeadm init runs its preflight checks first, and it's possible you will fail some of them. In that case, make sure you do some searching to figure out what's wrong and fix it before continuing. When kubeadm init finishes, you'll see output that looks like this:

```
Your Kubernetes control-plane has initialized successfully!

To start using your cluster, you need to run the following as a regular user:

  mkdir -p $HOME/.kube
  sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
  sudo chown $(id -u):$(id -g) $HOME/.kube/config

You should now deploy a Pod network to the cluster.
Run "kubectl apply -f [podnetwork].yaml" with one of the options listed at:
  /docs/concepts/cluster-administration/addons/

You can now join any number of machines by running the following on each node as root:

kubeadm join <control-plane-host>:<control-plane-port> --token <token> --discovery-token-ca-cert-hash sha256:<hash>
```

Save that kubeadm join command; you'll need it in a few minutes. It was at this point that I also copied the ~/.kube/config file to my main development machine and closed the SSH connection to my master node.

# Pod networking

Pod networking has come up a couple of times now, but what is it? A Kubernetes cluster consists of 0-n nodes. A node is a physical machine running the kubelet daemon, configured to be part of your cluster. On a node, 0-n pods can be running, and a pod is a collection of 1-n containers that share a local network. They're called pods as a reference to a pod of dolphins, according to the book Kubernetes: Up and Running.

Because the networks that Kubernetes clusters are deployed in are extremely varied (from cloud providers to datacenters to home networks), and needs differ dramatically, Kubernetes doesn't ship clusters with a way for pods to communicate with other pods by default. You need to select a third-party solution that fits your needs. Deciding which pod networking solution is best for you is outside the scope of this article; I'll just say that I went with flannel. It sounded simple, and it just sorts out networking between pods without any extra fancy features. Its main limitation, that nodes must be on the same physical network as each other, was not a concern for me.

Normally, you would install flannel like this:

```shell
kubectl apply -f https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml
```
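However the install goes, it's worth watching the flannel pods come up on each node before moving on. A quick way to do that (the namespace and label below match the manifest as I found it; newer releases of the manifest may differ):

```shell
# Watch the flannel DaemonSet pods start on each node. In the manifest
# I used they run in kube-system with the label app=flannel; newer
# versions deploy into a dedicated kube-flannel namespace instead.
kubectl -n kube-system get pods -l app=flannel -o wide --watch

# When a pod gets stuck, the useful error messages tend to be in its events:
kubectl -n kube-system describe pod <flannel-pod-name>
```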
But I found that this didn't work for me, for two reasons:

1. I was missing the cni-plugins package.
2. The default backend flannel uses, vxlan, didn't work for some reason.

While 1 took some time to figure out, largely by running lots of kubectl describe pod commands, it was a simple fix once I saw the error message. 2, however, was tricky. Pod-to-pod communication by pure pod IP address worked fine, but any communication through a cluster IP address hung indefinitely. After a lot of searching, I found someone suggesting switching from flannel's default vxlan backend to the host-gw backend. What does all of this mean? Fuck if I know. All I know is that it fixed the problem I was having. If you download the flannel manifest from the command above and find the ConfigMap called kube-flannel-cfg, modify the bit called net-conf.json so that it looks like this:

```json
{
  "Network": "10.244.0.0/16",
  "Backend": {
    "Type": "host-gw"
  }
}
```

Last but not least, I found that I had to restart my master node after all of these changes. It took a minute or two to boot back up, but when it did I was greeted with this:

```
$ kubectl get nodes
NAME                      STATUS   ROLES                  AGE   VERSION
kubernetes-master.local   Ready    control-plane,master   14m   v1.21.0
```

# Adding the worker nodes

Remember that kubeadm join command I said to save for later? Now is later. Adding nodes to your cluster is as simple as running that join command on each node. One bit of weirdness I experienced is that a newly joined node would get stuck in the NotReady state. This resolved itself after rebooting each node. Not sure what that's all about; I'm assuming network voodoo with flannel.

```
$ kubectl get nodes
NAME                        STATUS   ROLES                  AGE   VERSION
kubernetes-master.local     Ready    control-plane,master   20m   v1.21.0
kubernetes-worker-1.local   Ready    <none>                 2m    v1.21.0
kubernetes-worker-2.local   Ready    <none>                 1m    v1.21.0
```

# Conclusion

Now that we have a working bare metal Kubernetes cluster, we're ready to start running things on it. We still have a long way to go until it can run any kind of workload we want: we'll need to handle load balancing, persistent storage, and ingress resources. All of that is coming in part 2. And we have an even longer way to go until we could call this a production-ready cluster. The main thing missing there is that we'd need to run 3 master nodes, which is something I want to explore in a future post.
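If you're curious where that journey starts: kubeadm grows a multi-master cluster from a stable API endpoint, which has to be set at init time. A rough sketch, with a made-up DNS name, might look like this:

```shell
# Sketch only: an HA control plane needs a stable endpoint sitting in
# front of every apiserver. The DNS name below is hypothetical.
kubeadm init \
  --control-plane-endpoint "k8s-api.local:6443" \
  --pod-network-cidr 10.244.0.0/16 \
  --upload-certs
```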

2021/5/2

Simple Complex Easy Hard

You might have noticed, the last time you were doing chores or tackling a tricky problem at work, that when something is hard it's not always hard in the same way. The hard you experience when doing chores, that mind-numbing, I-can't-be-bothered hard, is different to the hard you might experience when debugging an elusive bug in a distributed system. Why?

# The 2 axes of difficulty

There are many things that determine whether a task is difficult or not, but you can make a start on getting more granular by splitting difficulty into two axes: simple-complex and easy-hard. What's the difference?

A sudoku puzzle is complex. How complex depends on your skill at sudoku puzzles, and you're able to do more complex puzzles the more you practice and hone that skill. A task is complex when the number of people who could do it tends toward zero. If a lot of people could do it, that would make it simple.

The other scale measures how much effort must be expended to complete the task. If you're put off by how much effort will be involved, it's likely hard. Something is hard when the number of people who are willing to put the effort in tends toward zero.

# Why does this matter?

This distinction can help you with time estimations. Complexity is one of the key things that introduces uncertainty into estimating how long a task will take. It's not unusual to get halfway through a sudoku puzzle, realise you've made a mistake somewhere, and need to backtrack. On the other hand, you know quite accurately how long it's going to take you to mow the lawn.

Adding this vocabulary to your work tasks can help people understand what to expect. A simple-easy task is likely to be predictably quick, and a simple-hard task predictably long. A complex-hard task is anyone's guess, and may even be worth breaking into several smaller tasks.

# Other examples

[grid of example tasks plotted along both axes]

I'm sure you might disagree with some of these; I didn't spend an enormous amount of time thinking about them. If you have other examples, I'd love to hear about them.

2021/4/18