Garbage collection is something that a self-taught programmer like me tended to ignore, since, at least in the beginning of our careers, we don’t really have to deal with it. But just because you don’t have to deal with it doesn’t mean it’s not important.
Going further with your career you may notice that you are missing key knowledge on doing it the right way. Garbage collection is important — in trading, airplane or autonomous cars control or medical equipment control types of software where wrong choices can have a huge impact (airplanes crash, car goes the wrong way, and patients get misdiagnosed or receive improper treatment).
It might very possibly be that at some point you’ll eventually have to deal with garbage collectors (GCs) and to do that you’ll need to have some knowledge about how they work.
In this article we are going to focus on automatic garbage collection rather than manual. Specifically the one in the JVM.
So let’s start our journey into the GCs’ world where everything’s nice and clean. Well, at least to the point where your parents don’t yell at you anymore because it’s all a mess…
Java Object Lifecycle
Let’s start from the very beginning — an object.
The image above represents simplified lifecycle that an object in Java and most programming languages goes through (actually the whole lifecycle can be divided into 7 stages, but for the sake of simplicity we will discuss only the three main steps).
Creation — in Java, an object is created using the new keyword. At first, JVM needs to allocate some memory for the object. Once it’s done, the reference to that object is created in order to keep track of the object— in our case it’s `obj`.
In use — that’s the life of the object. As long as object has some references to it, we’ll refer to it as being in use.
Destruction —the destruction of an object happens when it’s no longer referenced by any other object (e.g. if we do `obj=null` or object goes out of scope). Once this happens, finalize() method is called for the object and object is garbage collected.
In Java the first two stages (creation and in use) are controlled by the programmer, while destruction is the responsibility of the GC. Although there are some methods that can be called manually, like `System.gc()` or overriding finalize (deprecated starting JDK 9), you should never trust them.
OK, so now that we know when object is eligible for garbage collection, the question should be why? Why not let the programmers decide when to free memory? After all they are the ones who should know better.
Not really.
We are people, we are prone to making mistakes and being inattentive (well at least me). But that’s not just that. Main reasons why some programming languages have GCs built-in include:
Focusing on higher level problems — instead of worrying about memory, developers can focus more on the program itself, logic, code structure, reusability etc.
Facilitating memory safety — pointer arithmetic can cause some serious bugs or security violations (e.g. possibility to access memory you’re not supposed to), but in Java and some other garbage-collector languages notion of pointers is abstracted and there are no raw-pointers and pointer arithmetic.
Helping prevent memory leaks — memory leaks happen when objects are no longer accessible by the program, but still reside in the heap and GC has no way to deallocate the memory they are using. The word helping is highlighted, since it can still happen.
Dangling pointer bugs—this means having a pointer to the deallocated memory chunk. Since GC destroys only the objects that have no references to it (i.e. no “pointer” exists), there’s no way for the Java reference to be a dangling point (except when using WeakReferences and SoftReferences).
On the other hand, there are some trade-offs for all this awesomeness, the main ones being:
It pauses the application — while these pauses may seem really small (hundreds of milliseconds) they might be noticeable or event fatal (previously mentioned applications in airplane and autonomous cars control or medical equipment control types of software).
Most of the time it’s slower than explicit memory management. Well, this one is tricky and I am a bit afraid to go any deeper into this rabbit hole, but in my research on the speed of explicit memory management vs using GCs, I found only one paper proving the opposite (but please correct me if I’m wrong), and only in a particular case: Comparing runtime, space consumption, and virtual memory footprints over a range of benchmarks, we show that the runtime performance of the best-performing garbage collector is competitive with explicit memory management when given enough memory. In particular, when garbage collection has five times as much memory as required, its runtime performance matches or slightly exceeds that of explicit memory management.
But of course it all depends on the implementation, program, type of GC and lots of other things, so again — this one is tricky and you might choose not to trust me on this one.
Nevertheless most of the time we are good with these cons, as long as we are safer and there’s less work to do.
OK, so now, since we know when and why we need GC, we can start getting into how.
Strategies for identifying garbage
There’s no ultimate GC. Actually, there are plenty of them, each with different implementation, pros and cons. When you start reading about them, it seems like they are all separate trees in a wild forest, but if you look a bit closer you may realized that you are standing in front of the nicely planted garden.
In order to see this structure we have to look at the main GC strategies used for identifying garbage.
Tracing — this is the most popular strategy used in GCs .In fact it’s so common that “garbage collection” sometimes refers to “tracing garbage collection”. It works by tracing each object from the so called GC roots by its references. GC roots include thread stacks, static variables, special references from JNI (Java Native Interface) code and other areas where live objects are likely to be found. Objects that are traced (so have something referencing them) are considered reachable, while others become eligible for garbage collection.
Reference counting — here every object has a count of something referencing it. Object becomes garbage when reference count is 0.
GC algorithms
Once we decide on how to identify which objects can be garbage-collected (most probably using the tracing strategy), we need to decide on how to do it.
There are three main algorithms used for garbage collection:
Mark-sweep — marking is basically deciding which objects are no longer needed and which are still in use. Work done in this step is linear to the amount of live objects in the heap. After all the objects are marked, sweep part comes in, “sweeping” all unreferenced objects, and thus deallocating space for future usage. Work done in the sweeping step is proportional to the heap size.
Mark-sweep-compact — the issue with mark-sweep is that initially we can reach a point where there’s enough available space in the heap for a new object, but it is useless, because it’s fragmented (i.e. we might have 200KiB of free space, but not be able to fit an 8KiB object anywhere, because all we have is 200 free slots, each at 1KiB). Another issue is that writing to the memory becomes more time-consuming as finding available slots of required size is no longer trivial. To avoid this “Swiss cheese” problem there’s one additional step after sweeping — we compact the objects. Work done in the marking and compacting steps is proportional to the size of live data, while in the sweeping — to the heap size.
Mark-copy — this algorithm also deals with a fragmentation problem, but in a different manner. It copies all the live objects into a new memory area. So the result of sweeping and compacting is achieved in one step. The drawback of this method is that GC requires space twice the size of the maximum live data for a program. Work done in a copying step is linear to the size of the live set and copying almost always (few exceptions will be introduced later) stops-the-world (more on it later, but it basically means that your application will stop working while this phase of garbage collection is running).
As you can see, all these algorithms have a common phase - marking. Starting from GC roots, GC tracks all of the objects and “constructs” an object graph with all of the references between them.
So far so good, right? Well, almost. In order to run any of these algorithms, at some point GC must stop-the-world and that’s not so great, since you probably wouldn’t like your game to stop for a while when player is in the middle of the final battle. And the bigger this pause time is, the worse.
One of the things that might help here is Generational Hypothesis.
Generations
To enhance the garbage collection process, JVM divides the heap memory into two generations: Young Generation and Old Generation (also called Tenured). There used to also be a Permanent Generation, which is now changed to MetaSpace, but we won’t cover that.
This division of heap into generations was done based on an observation that you either die as an Object part of the Young generation, or live long enough to see yourself end up in the Old generation.
Young generation is further divided into:
Eden space — every new object, once created, goes into this area (allusion to the Garden of Eden).
Survivor space 0 and 1—these reside next to the Eden space and are much smaller (in most of the default configurations of GCs the ratio between the size of Eden and each Survivor space is 8:1). It is important to notice that one of the two Survivor spaces is always empty. The reason for having two survivor spaces is to avoid memory fragmentation (sometimes called a “Swiss cheese” problem). If there were only one Survivor space, then once GC would run through it, it would leave some memory holes there and thus compaction would be needed.
Garbage collection is triggered once the Eden space is full (objects are marked using the tracing strategy) and live objects are moved into one of the survivor spaces (S0 or S1) — this garbage collection is called minor garbage-collection and it always stops-the-world. Moving live objects to the other space indicates that the algorithm used in Young Generation garbage collection is mark-copy.
After the first time GC is triggered (i.e. the first time Eden space gets full) one of the survivor spaces is populated with compacted live objects, while Eden and the other survivor spaces are left empty.
With the application running, at some point Eden space gets full again and GC is triggered again. But this time all live objects from both the Eden space and the previously occupied Survivor space are moved into the other Survivor space (remember — one of the two Survivor spaces is always empty).
Each time GC is triggered in Young Generation (so-called minor GC) and live objects are moved, JVM keeps track of how many garbage collection cycles every object “survived”. When this count reaches some predefined number (the default is 15 for most JVMs but can be changed using `XX:+MaxTenuringThreshold` flag), an object is “promoted” into Old Generation (promotion can also happen when Survivor space is no longer big enough to hold all the live objects inside it).
Combinations of different strategies and algorithms covered above, together with some other tweaks and principles, is what allows us to have lots of different garbage collectors, each with it’s own pros and cons. But let’s talk about it some other time.
My point here, in the first part covering JVM GGs, is to make it clear that garbage collection matters — although no one might be talking about it, everyone’s still doing it because they have to. And also provide you, and myself, with some ground knowledge and concepts in order to be able to analyze them, think about, and maybe one day, with some additional reading, tune them or be able to decide which to use for your project.
Stay tuned for my next part…
This post was written by Brygida Ivona Pliška
Photo by Paweł Czerwiński on Unsplash
For more engineering updates and insights:
Visit us on GitHub
Subscribe to our YouTube channel