
tschatzl @ github | Thomas Schatzl

Thomas Schatzl, member of the Oracle Java HotSpot GC team, with extensive experience in Java garbage collection.

JDK 25 G1/Parallel/Serial GC changes

JDK 25 RC 1 is out, and as is custom by now, here is a brief overview of the changes to the stop-the-world collectors in OpenJDK for that release. The full list of changes for the entire Hotspot GC subcomponent for JDK 25 is here, listing around 200 changes in total that were resolved or closed. This matches recent trends. Let’s start with the interesting change breakdown for JDK 25:

Parallel GC

The most important change for Parallel (and Serial) GC has probably been JDK-8192647. If your application used JNI extensively, you might have encountered premature VM shutdowns after Retried Waiting for GCLocker Too Often Allocating XXX words warnings. These cannot occur any more with either collector beginning with this release. The fix makes sure that a thread that is blocked waiting to perform a garbage collection due to JNI is guaranteed to have triggered its own garbage collection before giving up.

Serial GC

Besides fixing the same GCLocker issue as described for Parallel GC, Serial GC fixes an issue to avoid full GCs in some circumstances in JDK-8346920.

G1 GC

JDK-8343782 enables G1 to merge any old generation region’s remembered set with others, for additional memory savings compared to the last release. Some graphs can be seen in the PR for this change. The following graph taken from there shows a large reduction in remembered set memory usage for that particular, heavily remembered-set-using, benchmark. The red line shows remembered set native memory usage for the mentioned benchmark on a 64 GB Java heap, while the blue line shows the same after this patch. Native memory consumption for that part of the VM drops from a 2 GB peak to around a 0.75 GB peak. G1 now uses an additional measure to avoid pause time spikes at the end of an old generation Space-Reclamation phase. They were caused by suboptimal selection of candidate regions due to lack of information.
The current process is roughly like this: in the Remark pause, after whole-heap liveness analysis, G1 determines a set of candidate regions that it is going to collect later. It tries to choose the most efficient regions, i.e. where the gain in memory is large and the effort to reclaim is low. Then, for these candidate regions, from the Remark to the Cleanup pause, G1 recalculates the remembered sets, the data structure containing all references to the live objects in these regions. Remembered sets are required to evacuate objects in the following garbage collection pauses because they contain the locations of references to the evacuated objects. After all, these references must be updated to point to the new object locations after moving. Determining the candidate regions is where the cause of the high pause times lies: evacuation efficiency is a weighted sum of the amount of live data in the region and the size of the remembered set for that region; G1 prefers to evacuate regions with low liveness and small remembered sets as they are fastest to evacuate. However, at the time the candidate regions are selected, G1 does not have information about the remembered sets; in fact, G1 plans to recreate this information just after this selection. Until now, G1 solely relied on liveness, assuming that the remembered set size is proportional to the amount of live data in a region. Unfortunately, this is not true in many cases. In the majority of cases this does not matter: evacuation is fast enough anyway and the pause time goal generous enough, but you may have noticed that later mixed collections are typically slower than earlier ones. In the worst case, this behavior causes large pause time spikes, typically in the last mixed collection of the Space-Reclamation phase, as only the most expensive-to-evacuate regions are left to evacuate.
One option for this problem would be to just skip evacuating these regions, but the initial candidate selection was based on freeing enough space during the Space-Reclamation phase to continue without a disruptive full heap compaction. So continuously leaving them out of Space-Reclamation can lead to full heap compaction. The change in JDK-8351405 instead improves the selection mechanism: we do not have the exact metric for the remembered set size at that point, but is there something cheap and fast to calculate that could be a substitute? Actually, yes, there is: during marking, for every region G1 now counts the number of incoming references (just counting, not storing them anywhere) it comes across. This is actually a very good substitute, as the remembered set stores these locations, just in a more coarse-grained way. Additionally, the amount of work to be done in the garbage collection pause is proportional to the number of references after all. G1 uses this much more accurate efficiency information during candidate selection in the Remark pause. This avoids these pause time spikes. The following figure from the PR shows the impact of this change on pause time: it shows mixed GC pause times for JDK 25 (brown), JDK 24 (purple), and JDK 25 with the change (green) with a pause time goal of 200 ms - the default. The former two graphs show that the last mixed GC in the Space-Reclamation phase is always very long (that the spike is the last one can be deduced from the large amount of time between these mixed garbage collections - there are some young-only garbage collection pauses not shown in between). As one can see, this technique completely eliminates these spikes of up to 400 ms. The write-barrier improvements explained in this post, originally scheduled for JDK 25, did not make it in time due to process issues. Hopefully they can make it into JDK 26. There were some (here and here) fixes to avoid superfluous garbage collection pauses in some edge cases.
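The difference between the two candidate cost models can be sketched as follows. This is a minimal illustration in Java; the field names, the per-reference cost weight, and the linear cost model are assumptions for illustration, not HotSpot's actual code:

```java
import java.util.Comparator;
import java.util.List;

// Sketch of efficiency-based candidate ranking (illustrative only).
public class CandidateRanking {
    public record Region(String name, long liveBytes, long incomingRefs) {}

    // Liveness-only model: assumes evacuation cost is proportional to live data.
    public static List<Region> rankByLiveness(List<Region> candidates) {
        return candidates.stream()
                .sorted(Comparator.comparingLong(Region::liveBytes))
                .toList();
    }

    // Reference-aware model: additionally charges for the incoming references
    // counted during marking, a cheap substitute for the remembered set size.
    public static List<Region> rankByLivenessAndRefs(List<Region> candidates,
                                                     long costPerRef) {
        return candidates.stream()
                .sorted(Comparator.comparingLong(
                        (Region r) -> r.liveBytes() + costPerRef * r.incomingRefs()))
                .toList();
    }
}
```

A region with little live data but a huge number of incoming references ranks as cheap under the first model and gets selected early, only to blow the pause time later; the second model defers it.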
Another change improves performance of certain garbage collections.

Update 2025-09-03: There is a regression, noticed too late for the JDK 25 release, where G1 fails to obtain large pages with Transparent Huge Pages (THP). This may cause performance regressions in all applications that use THP with the madvise setting. This will be fixed in one of the next update releases.

What’s next

Currently a lot of effort in the G1 area is spent on implementing Automatic Heap Sizing for G1: an effort where G1 automatically determines a maximum Java heap size that is efficient and does not blow the available memory budget, by tracking free memory in the environment and adjusting heap size accordingly. A few changes necessary for automatic heap sizing (like the above fixes that avoid superfluous garbage collection pauses, but also here or there) already landed in JDK 26, and a few stepping stones are out for review (e.g. this and that). Another JEP we would like to complete in this release time frame is to make ergonomics always select G1 as the default collector, to avoid any more confusion about this particular topic :)

Thanks go to…

… as usual to everyone who contributed to another great JDK release. Looking forward to seeing you next (even greater) release.

Thomas

2025/8/12

JDK 24 G1/Parallel/Serial GC changes

JDK 24 was released a few weeks ago. This post provides a brief overview of the changes to the stop-the-world collectors in OpenJDK in that release. Originally I thought there was not much to talk about, hence the delay, but in hindsight I was wrong. Similar to the previous release, JDK 24 is a fairly muted one in the GC area, but there are good things on the horizon, particularly for JDK 25, that I will touch on at the end of this post :) The full list of changes for the entire Hotspot GC subcomponent for JDK 24 is here, showing around 190 changes in total having been resolved or closed. Nothing particularly unusual here. Besides, there has been a JavaOne conference again.

Parallel GC

Some unnecessary synchronization in the (typically very hot) evacuation loop has been removed with JDK-8269870. This may reduce pause times in some situations.

Serial GC

Cleanup and refactoring of Serial GC code continued.

G1 GC

G1 uses predictions of various aspects of garbage collection, dependent on both the application and the environment the VM runs in, to meet pause time goals. Examples of these predictions are how much memory copying actually costs, or how many references typically need to be updated. G1 retrains the related predictors every time the VM is run. At startup, since there are no samples for the predictors yet, G1 uses some pre-baked values for these predictions. Ideally, these would be updated and maintained regularly, and be appropriate for all environments. As you can imagine, this is an impossible task: these pre-baked values were actually determined a long time ago on some “large” SPARC machine, and then… forgotten. The inadequacy of these values has actual drawbacks: nowadays almost any machine the JVM runs on is faster than that reference machine, so the values are very conservative, meaning that e.g.
expected costs are much higher than actual, so G1 GC pauses tend to be much shorter than necessary until newly measured values eventually overwrite these pre-baked values. In some measurements, it takes around 30 or so meaningful garbage collections until G1 is primed for the current environment. The change in JDK-8343189 modifies the way new, actual values are used in the predictors: instead of being added to the existing pre-baked predictor values, the first actual values directly overwrite them. The effect is that apart from the first garbage collection, actual samples for these predictions are used. This speeds up adaptation of G1 to the application and environment tremendously. There is one small snag here: the original, very conservative estimates did have the advantage that you were almost guaranteed that G1 would not exceed the pause time goal in the first few garbage collections. They dampened the predictor outputs quite a bit. So with the new system, there may be more initial overshoots in pause times in the first few garbage collections. Additionally, we have been working on decreasing native memory usage, targeting the remembered sets once again. As written in the respective section of the garbage collection tuning guide, G1 now manages remembered sets for the young generation as a whole, single unit, removing duplicates and so saving memory. Fewer remembered set entries for the young generation also decrease garbage collection pauses (very) slightly. The JDK 23 blog post describes the idea with some additional graphs at the end.

What’s next

Last time this section contained a lot of information about how the idea of using single remembered sets for multiple regions applies to any set of regions, and its effects. Extending merging of young generation remembered sets, JDK-8343782, which has already been integrated in JDK 25, implements this idea for old generation regions for even larger native memory savings. I.e.
with that change, G1 merges remembered sets of old generation regions by default. Scheduled for JDK 25 is probably one of the largest changes ever to the interaction between the application and the G1 garbage collector: the write barriers - small pieces of code executed in the VM for every reference write in an application to synchronize with the garbage collector, which have a large impact on application throughput, especially so in G1 - were completely overhauled. By reducing this write barrier significantly, large improvements are possible. We are in the process of preparing a JEP to get this change integrated in the next months. For more information please also look at the implementation CR, the respective PR and the extensive blog post about this topic on this blog. Overall, one can expect free throughput improvements of up to 10%, shorter pauses, better code generation (and code size savings) from this change. There is some minimal regression in native memory usage, but we think it is tolerable; let me explain with the following graph: the graph shows native memory usage of the G1 collector for the BigRAMTester benchmark, on a 20GB Java heap, for recent’ish JDKs (fwiw, JDK 8 uses 8GB native memory). The drop of the base memory usage at the start, from 1.2GB in JDK 17 to around 800MB in JDK 21, is because of the change described here. The JDK 24 line (orange) shows the impact of the mentioned use of a single remembered set for the entire young generation (and JDK-8225409). Finally, JDK 25 (shown as the dashed red “latest” line) adds JDK-8343782, remembered set merging for the old generation, and the changes in the write barrier that enabled further memory reductions. So where is the regression - that line for JDK 25 certainly looks best? The truth is, these changes need one more, larger static memory block than before (of fixed ~0.2% of Java heap size). Its impact is shown at the very beginning of the graph.
While in this benchmark the savings from the remembered sets outweigh the addition of this new data structure, in general this is not the case. This milestone is complete now, and apart from bringing this change along the JEP process to ultimately integrate it into the release, more work in this area is ongoing, as we were somewhat conservative in our changes. There is reason to believe that throughput will improve even more in the future. Other than that, there are some potentially interesting topics currently being discussed on the hotspot-gc-dev mailing list: Longer term, there is a project underway to bring Automatic Heap Sizing similar to ZGC’s to the stop-the-world collectors too. First contributions are coming in; thanks to Microsoft and Google for contributing to this effort! There is some public discussion on making G1 the “true” default collector: right now VM ergonomics may select the Serial garbage collector in environments with small heaps and little CPU resources instead of G1 if you do not specify any garbage collector on the command line. More and more, measurements show that G1 is very close to or surpasses the Serial collector in most or all metrics when run out-of-box, i.e. without additional tuning. If you happen to have some interesting data points for the latter, feel free to chime in.

Thanks go to…

… as usual to everyone who contributed to another great JDK release. Looking forward to seeing you next (even greater) release.

Thomas

2025/4/1

New Write Barriers for G1

The Garbage First (G1) collector’s throughput sometimes trails that of other HotSpot VM collectors - by up to 20% (e.g. JDK-8253230 or JDK-8132937). The difference is caused by G1’s principle to be a garbage collector that balances latency and throughput and tries to meet a pause time goal. A large part of that can be attributed to the synchronization of the garbage collector with the application necessary for correct operation. With JDK-8340827 we substantially redesign how this synchronization works, for much less impact on throughput. This post explains these fairly fundamental changes. Other sources with more information are the corresponding draft JEP and the implementation PR.

Background

G1 is an incremental garbage collector. During garbage collection, a large part of the time can be spent on finding references to live objects in the areas of the application heap (just “heap” in the following) in order to evacuate them. The simplest and probably slowest way would be to look through (scan) the entire heap that is not going to be evacuated for such references. The stop-the-world Hotspot collectors Serial, Parallel and G1 in principle employ a fairly old technique called Card Marking [Hölzle93] to limit the area to scan for references during the garbage collection pause. In an attempt to better keep pause time goals, G1 extended this card marking mechanism: concurrent to the application, G1 re-examines (refines) the cards marked by the application and classifies them. This classification makes it possible to only scan cards important for a particular garbage collection during its pause. Additionally, extra code compiled into the application (write barriers) removes unnecessary card marks, reducing the number of cards to scan further. This comes at additional cost, as the next sections will show.

Card Marking in G1

Card Marking divides the heap into fixed-size chunks called cards. Every card is represented as a byte in a separate fixed-size array called a card table.
Each entry corresponds to a small area, typically 512 bytes, in the heap. A card may either be marked or unmarked, a mark indicating the potential presence of interesting references in the heap corresponding to the card. An interesting reference for garbage collection is a reference that refers from an area in the heap that is not going to be garbage collected to one that is. When the application modifies an object reference, additional code compiled into the application, the write barrier, intercepts the modification and marks the corresponding card in the card table. Figure 1 above shows an example execution of a hypothetical assignment of the field a in an object x of type X with a value of y. After writing the value into the field, the write barrier code (to be exact, the post-write barrier code, i.e. code added after setting the value) marks the card. This is where the Serial and Parallel garbage collectors stop: they let the application accumulate card marks until garbage collection occurs. At that time, all of the heap corresponding to the marked cards is scanned for references into the evacuated area. In most applications this is okay effort-wise: the number of unique cards that need to be scanned during garbage collection is very limited. However, in other applications, scanning the heap corresponding to cards (scanning the cards) can take a very significant amount of total garbage collection time. G1 tries to reduce this amount of card scanning in the garbage collection pause by several means. The first is using extra garbage collection threads running concurrent to the application, clearing, re-examining and classifying card marks, for two reasons: first, references are often written over and over again between garbage collections. A card mark caused by a reference write may, by the time the next garbage collection occurs, not contain any interesting reference anymore.
Second, by classifying card marks according to where they originate from, it is possible to only scan marked cards that are relevant for this particular garbage collection during the garbage collection. Figures 2, 3 and 4 give details about this re-examination (refinement) process. First, Figure 2 shows that in addition to the actual card mark mentioned above, the write barrier stores (enqueues) the card location (in this case 0xabc) in an internal buffer (refinement buffer), shown in green, so that the re-examining garbage collector threads (refinement threads) can later easily find them again. Between card mark and refinement there is an artificial delay, based on the time available in the pause for scanning cards and the rate at which the application generates new marked cards. This delay helps decrease the overhead of an application repeatedly marking the same cards, avoiding that the same cards are repeatedly enqueued for refinement (as they are already marked). The delay also increases the probability that the references in the card itself are not interesting anymore. In the next step, shown in Figure 3, refinement threads pick up previously enqueued cards for re-examination. In this example, the card at location 0xdef will be refined. The figure also shows the remembered sets in light blue: for every area to evacuate, G1 stores the set of interesting card locations for this area. In this figure every area has such a remembered set attached to it, but some may currently not have one. Areas may also be discontiguous in the heap (JDK-8343782 and JDK-8336086). The refinement thread also unmarks the card before looking at the corresponding contents of the heap. Finally, Figure 4 shows the step where the refinement threads put the examined card (0xdef) into the remembered sets of the areas for which that card contains an interesting reference at this point in time.
Since the heap covered by a card may contain multiple interesting references, multiple remembered sets may receive that particular card location. The end result of this fairly complicated process is that, compared to the other throughput collectors in the Hotspot VM, in the garbage collection pause G1 only needs to scan cards from two sources: cards just recently marked dirty by the application and not yet refined, and cards from the remembered sets of areas that are about to be collected. G1 merges these two sources, marking cards from the remembered sets on the card table, before scanning the card table for card marks [JDK-8213108]. Depending on the application, this can result in a significant reduction of time spent in card scanning during garbage collection compared to regular card marking.

Write Barrier in G1

Barriers are small pieces of code that are executed to coordinate between the application and the VM. Garbage collection extensively uses them to intercept changes in memory. The Serial, Parallel and G1 garbage collectors use write barriers on writes to references, i.e. every time the application writes a value into a reference, the VM executes additional code in the vicinity of that write. The figure above visualizes this: for a write of the value y into a field x.a, next to the actual write, code unrelated to the write, used for synchronization with the garbage collector, is executed; Serial and Parallel GC only have a post-write barrier (after the write), while G1 has both pre- and post-write barriers (before and after the write). G1 uses the pre-write barrier for matters unrelated to this discussion, so the following text will just use “write barrier” or just “barrier”.
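For contrast, the Serial/Parallel-style post-write barrier described above is essentially just the card mark. A minimal sketch in Java, with simulated addresses and the 512-byte card size from the text (everything else, including the names, is illustrative):

```java
// Minimal simulation of Serial/Parallel-style card marking: the post-write
// barrier marks the single byte covering the written field.
public class SimpleCardTable {
    public static final int CARD_SHIFT = 9;   // 2^9 = 512-byte cards
    public static final byte CLEAN = 0, DIRTY = 1;

    private final byte[] table;

    public SimpleCardTable(long heapBytes) {
        // One card table byte per 512 bytes of heap.
        table = new byte[(int) (heapBytes >>> CARD_SHIFT)];
    }

    // Post-write barrier for "x.a = y": mark the card covering field x.a.
    public void postWriteBarrier(long fieldAddress) {
        table[(int) (fieldAddress >>> CARD_SHIFT)] = DIRTY;
    }

    public byte cardAt(int index) {
        return table[index];
    }
}
```

A write at heap offset 0xabc dirties card 0xabc >>> 9, i.e. card 5; at the next pause the collector scans the 512-byte heap area behind every dirty card.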
The previous section hinted at the responsibilities of the write barrier in G1: mark the card as dirty, and store the card location in refinement buffers if the card has not been marked yet. This sounds straightforward, but unfortunately there is a complication that Figure 5 shows: both the application and the refinement threads might write to the same card at the same time, the former marking it, the latter clearing it. Without additional precautions this might result in lost updates to the remembered sets, where the refinement thread would observe the card mark but not observe the write of the value. In fact, this requires some fairly costly memory synchronization in the write barrier. Additionally, concurrent refinement of cards can be expensive, particularly if there are no extra processing resources available. So the G1 barrier contains extra filtering code to avoid card marks that are generated by reference writes that make no difference for garbage collection. The G1 write barrier will not mark a card if: the reference assignment writes a reference that is not interesting, i.e. not crossing areas; the code assigns a null value, as these do not generate links between objects, so the corresponding marked cards are unnecessary; or the card is already marked, which means that it is already scheduled for refinement. Figure 6 compares the sizes of the resulting write barriers: the directly inlined part of the G1 write barrier (there is an additional part not shown here which is executed somewhat rarely) on the left, with the whole Serial and Parallel GC write barrier on the right ([Protopopovs23]). Without going into detail, instead of three (x86-64) instructions, the G1 write barrier takes around 50 instructions.
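The control flow of this old barrier can be sketched by simulating its logic in plain Java. The region/card arithmetic, sizes and names below are simplifications; the real barrier is emitted machine code with per-thread buffers, and it also issues a costly StoreLoad fence, which is marked as a comment here:

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Sketch of the old G1 post-write barrier: three filters, then card mark
// plus enqueueing the card location for the refinement threads.
public class OldG1WriteBarrier {
    public static final int CARD_SHIFT = 9;     // 512-byte cards
    public static final int REGION_SHIFT = 21;  // 2 MB regions (illustrative)
    public static final byte CLEAN = 0, DIRTY = 1;

    public final byte[] cardTable = new byte[1 << 16];  // covers a 32 MB "heap"
    public final Queue<Integer> refinementBuffer = new ArrayDeque<>();

    // Post-write barrier for "x.a = y", with simulated addresses.
    public void postWriteBarrier(long fieldAddr, long newValueAddr) {
        if ((fieldAddr >>> REGION_SHIFT) == (newValueAddr >>> REGION_SHIFT))
            return;                         // filter 1: same-region reference
        if (newValueAddr == 0)
            return;                         // filter 2: null write
        int card = (int) (fieldAddr >>> CARD_SHIFT);
        if (cardTable[card] == DIRTY)
            return;                         // filter 3: already marked/enqueued
        // (real barrier: StoreLoad memory fence here, the costly part)
        cardTable[card] = DIRTY;            // mark the card...
        refinementBuffer.add(card);         // ...and remember it for refinement
    }
}
```

The filters come first precisely so that the expensive fence and the enqueue are only reached for the rare interesting, cross-region, not-yet-marked writes.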
The bars to the left of the G1 write barrier roughly correspond to the above tasks: blue for the filters, green for the memory synchronization and the actual card mark, and orange for storing the card in the refinement buffers for the refinement threads.

Impact

G1 uses a large write barrier with many mechanisms to minimize the performance impact of the memory synchronization necessary for correctness and the overhead of concurrent work. All the effort to reduce memory synchronization is wasted if the application’s memory access pattern does not fit these mechanisms and actually performs lots of fairly random reference assignments across the whole heap, which to a large part result in many card locations being enqueued for later refinement, like BigRAMTester [JDK-8152438]. The other poor fit for this write barrier are applications that execute the write barrier in tight loops and at the same time extremely rarely generate a card mark with interesting references, so that no concurrent refinement is necessary or actually ever performed (e.g. JDK-8253230). These are hampered by the large code footprint of the barrier, inhibiting compiler optimizations, or just execute slowly due to unnecessary allocation of CPU resources.

A New Approach

Previous sections showed that a large part of the G1 write barrier is required by the per-card-mark memory synchronization to guarantee correctness in case of concurrent writes to the card table, and by storing the card locations for subsequent refinement. To remove the memory synchronization, similar to ZGC’s double-buffered remembered sets [JEP439], this new approach uses two card tables. Each set of threads, the application threads and the refinement threads, gets assigned its own card table, each initially completely unmarked. Each set of threads only ever writes different values to its own card table (just “card table” and “refinement table” respectively in the following), obviating the need for fine-grained memory synchronization during the actual card mark.
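The two-table arrangement can be sketched like this (sizes and names are illustrative, and real swapping uses handshakes rather than a plain field exchange): the write barrier only ever marks the application table, a swap exchanges the roles, and refinement sweeps its own table without racing with the application.

```java
// Sketch of the double card table scheme: application threads mark one
// table, refinement sweeps the other, and swapping exchanges the roles.
public class DoubleCardTable {
    private byte[] appTable = new byte[1024];         // marked by write barrier
    private byte[] refinementTable = new byte[1024];  // swept by refinement

    // Write-barrier side: always marks the current application table.
    public void mark(int card) { appTable[card] = 1; }

    // Refinement control thread: exchange the roles of the two tables.
    public void swap() {
        byte[] t = appTable;
        appTable = refinementTable;
        refinementTable = t;
    }

    // Refinement worker side: examine and clear all marks on its own table.
    public int sweep() {
        int examined = 0;
        for (int card = 0; card < refinementTable.length; card++) {
            if (refinementTable[card] != 0) {
                refinementTable[card] = 0;   // re-examine and clean the card
                examined++;
            }
        }
        return examined;
    }

    public int pendingAppMarks() {
        int n = 0;
        for (byte b : appTable) if (b != 0) n++;
        return n;
    }
}
```

After a swap, new application marks accumulate on the former refinement table while the sweep drains the former application table; the two sets of threads never write the same array.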
Similar to before, there is a heuristic that tracks the card mark rate on the application card table, and if the heuristic predicts that there will be too many card marks on that card table to meet the pause time goal in the next garbage collection, G1 atomically swaps the card tables for the application. The application then continues to mark its card table (previously the refinement table), while garbage collection refinement threads work to re-examine all marks on the refinement table (the previous application card table). This removes the need for memory synchronization in the G1 write barrier. Additionally, the refinement table is directly used as storage for the to-be-refined marked cards, given that finding the marked cards is not prohibitively expensive. This completely removes the need for the write barrier code to store the card locations in refinement buffers.

New Card Marking in G1

Figure 7 shows this new arrangement of data structures: there are now two card tables, and the write barrier as before marks the card on the (application) card table. When the card table has accrued enough marks, G1 switches the card tables atomically. Figure 8 shows an example where the application card table and the refinement table have just been switched, and the application has already continued marking cards on the former refinement table. Refinement threads start re-examining cards from the refinement table (previously the application card table) as described before. Card marks are cleared until all marked cards have been processed. Since the application threads continue to mark cards on their own card table, no synchronization between application threads and refinement threads is required.

New Write Barrier in G1

At minimum, the new G1 write barrier requires the card mark, similar to Parallel GC. The difference is that unlike in the Parallel GC write barrier, the card table base address is not constant, as it changes with every card table switch.
So while Parallel GC can inline the card table address into the code stream, G1 needs to reload it every time from thread-local storage. The final current post-write barrier for G1 reduces to the filters and the actual card mark. For a given assignment x.a = y, the VM now adds the following pseudo-code after the assignment:

(1) if (region(x.a) == region(y)) goto done; // Ignore references within the same region/area
(2) if (y == nullptr) goto done;             // Ignore null writes
(3) if (card(x.a) != Clean) goto done;       // Ignore if the card is non-clean
(4)
(5) *card(x.a) = Dirty;                      // Mark the card
(6) done:

Lines (1) to (3) implement the filters. They are almost the same as before, with a slightly different condition for the last filter due to an optimization. Without the filters, there were some regressions compared to the original write barrier with filters; the filters also decrease the number of cards that have to be scanned during the garbage collection pause and the number of cards to be re-examined, so they were kept for now. Finally, line (5) actually marks the card with a “Dirty” color value. The original card marking paper uses two different values for card table entries, i.e. colors: “Marked” and “Unmarked”, typically named “Dirty” and “Clean” in the Hotspot VM. In total, G1 uses five colors, which store some additional information about the corresponding heap area: clean - the card does not contain an interesting reference. dirty - the card may contain an interesting reference. already-scanned - used during garbage collection to indicate that this card has already been scanned; necessary for the potentially staged approach to garbage collection to better keep pause times. to-collection-set - the card may contain an interesting reference to the areas of the heap that are going to be collected in the next garbage collection (the collection set, hence the name).
The collection set always contains the young generation, as it is collected at every garbage collection. Refinement can skip scanning these cards because they will always be scanned later during garbage collection, G1 always collecting the young generation. Adding these cards to remembered sets, even if they contained references to regions not in the collection set, is not needed either; they would actually represent duplicate information. Effectively, the set of to-collection-set cards also stores the entire remembered set for the next collection set on the card table, avoiding extra memory. To keep the write barrier simple, it only colors cards as “Dirty”; the additional overhead of determining whether a reference refers to an object in the collection set is too expensive there. from-remset - used during garbage collection to indicate that the origin of this card is a remembered set and not a recently marked card. This helps distinguish cards from remembered sets from not yet examined cards, to more accurately model the application’s card marking rate used in the heuristics. Only the last two card colors are new. The use of the to-collection-set color explains the condition used in line (3) of the write barrier above: it avoids the application unnecessarily overwriting this value, excluding benign races.

Switching the Card Tables and Refinement

The goal of the card table switching process is to make sure that all threads in the system agree that the refinement table is now the application card table and the other way around, to avoid the problematic situation described earlier in Figure 5. The process is initiated by a special background thread, the refinement control thread. It regularly estimates whether the number of cards at the start of the next garbage collection would exceed the allowed number of not-yet-re-examined cards, given the card examination rate.
If refinement work is necessary, it also calculates the number of refinement worker threads, which do the actual work, needed to complete before the garbage collection. This refinement round consists of the following phases: (1) Swap references to the card tables. This includes global card table pointers that newly created VM threads and VM runtime calls use, and every internal thread’s local copy of the current reference to the application card table. This step uses thread-local handshakes (JEP 312) and a similar technique for VM-internal, garbage collector related, threads. This ensures correct memory visibility: after this step, the entire VM uses the previous refinement table to mark new cards, and all card marks and reference writes by the application at that point are synchronized. (2) Snapshot the heap to gather internal data about the refinement work. For every region of the heap, the snapshot stores the current examination progress. This allows resuming the refinement at any time and facilitates parallel work. (3) Re-examine (sweep) the refinement table containing the marked cards. Refinement worker threads walk the card table linearly (hence sweeping), claiming parts of it to look for marked cards using the heap snapshot. As Figure 9 shows, the refinement threads re-examine marked cards, with the following potential results: Cards with references to the collection set are not added to any remembered set; the refinement thread marks these cards as to-collection-set in the application card table and skips any further refinement. If the card is already marked as to-collection-set, the refinement thread re-marks this card on the card table as such and does not examine the corresponding heap contents at all. Dirty cards which correspond to heap that cannot be examined right now are forwarded to the application card table as dirty. Interesting references in heap areas corresponding to dirty cards cause that card to be added to the remembered sets.
During card examination, the card is always set to clean on the refinement table. Parts of the refinement table corresponding to the collection set are simply cleaned without re-examination, as these cards are not necessary to keep. During evacuation of these regions, the references of live objects in this area will be found implicitly by the evacuation. Writing directly to the application card table by the refinement threads is safe: memory races with the application are benign; the result will be at worst another dirty card mark without the additional to-collection-set color information. Finally, calculate statistics about the recent refinement and update predictors. Any of these steps may be interrupted by a safepoint, which may be a garbage collection pause that evacuates memory. If not, the refinement worker threads will continue their refinement work using the heap snapshot where they left off. Garbage Collection and Refinement Refinement heuristics try to avoid having garbage collection interrupt refinement. In this case, the refinement table is completely unmarked at the start of the garbage collection, and all the not-yet examined marked cards are on the application card table, just where the following card table scan phase expects them. No further action must be taken to be able to search for marked cards on the card table efficiently, except putting the remembered sets of areas to be collected on the application card table. Previously, G1 had information about the location of all marked cards: these locations were either recorded in the remembered sets, or in the refinement buffers to refine cards. Based on this, G1 could create a more detailed map of where marked cards were located, and only search those areas for marked cards instead of searching the whole card table. However, searching for marked cards is linear access to a relatively small area of memory, so very fast.
The absence of more precise location information for marked cards is also offset by not needing to calculate this information. In the common case of garbage collections with only young generation heap areas to evacuate, there is nothing to do in this step as the young generation area’s remembered set is effectively tracked on the card table. If a young garbage collection pause occurs at any point during the refinement process, the garbage collection needs to perform some compensating work for the not yet swept parts of the refinement table. In this case the G1 garbage collector will execute a new Merge Refinement Table phase that performs a subset of the refinement process: (optionally) snapshot the heap as above, if the refinement had been interrupted in phase 1 of the process; merge the refinement table into the card table, which combines the card marks from both card tables into the application card table as a logical or of both and removes all marks from the refinement table; and calculate statistics as above. The reason why the refinement table needs to be completely unmarked at the start of the garbage collection is that G1 uses it to collect card marks for interesting references created while evacuating objects during the garbage collection, in the heap areas the objects are evacuated to. This is similar to the extra refinement buffers previously used to store those. At the end of the young garbage collection, the two card tables are swapped so that all newly generated cards are on the application card table, and the refinement table is completely unmarked. A full collection clears both card tables as this type of garbage collection does not need this information. Performance Impact This section contains some information about throughput, (native) memory footprint and latency compared to before these changes.
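The merge step can be sketched roughly like this, assuming mark encodings for which a logical or is the correct combination, as the text describes. HotSpot performs this chunked, in parallel, and word-at-a-time rather than byte-by-byte:

```cpp
#include <cstdint>
#include <cstddef>
#include <cstring>

// Merge Refinement Table phase: combine the card marks of both card tables
// into the application card table with a logical or, then leave the
// refinement table completely unmarked.
void merge_refinement_table(uint8_t* card_table, uint8_t* refinement_table, size_t num_cards) {
  for (size_t i = 0; i < num_cards; ++i) {
    card_table[i] |= refinement_table[i];
  }
  std::memset(refinement_table, 0, num_cards); // refinement table ends up unmarked
}
```

This is the "linear scan of some memory instead of many random writes" mentioned later: two sequential passes over byte arrays that trivially split across worker threads.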
Throughput In summary, throughput improves significantly for applications that often execute the write barrier in full, because of the removal of the fine-grained synchronization in the barrier. Another aspect is that the new write barrier also significantly reduces refinement overhead because finding marked cards is faster (linear search vs. random access). There are also savings here due to reduced refinement work from keeping the young generation remembered set on the card table, as these cards are re-examined at most once. The reduced CPU resource cost (size) of the write barrier also enables better performance, either due to better optimization in the compiler or due to being easier to execute. In G1, throughput improvements may manifest as reductions in used heap size. The extent of the improvements can differ by microarchitecture. Parallel GC is still slightly ahead, but more work is going on to reduce this difference further. Native Memory Footprint The additional card table takes 0.2% of Java heap size compared to JDK 21 and later. In JDK 21 one card table sized data structure was removed, so JDKs before that do not show that difference. Some of that additional memory is offset by the removal of the refinement buffers for card locations. Additionally, memory usage is reduced by keeping the collection set’s remembered set on the card table, taking no extra space. The optimization to color these remembered set entries specially keeps duplicates from appearing in the other remembered sets, also reducing native memory. In some applications these memory reductions completely offset the additional card table memory usage, but this is fairly rare. Particularly applications that did not have large remembered sets for the young generation, which are mostly very throughput-oriented applications, show the above-mentioned additional memory usage. The refinement table is only required if the application needs to do any refinement.
So the refinement table could be allocated lazily, i.e. only if there is some refinement; there is a large overlap between such applications and the very throughput-oriented applications mentioned above. This is not implemented in the current version. Latency, Pause Times Pause times are not affected, if anything they tend to be slightly shorter. Pause times typically decrease due to a shorter “Merge remembered sets” phase because there is no work to do for the remembered sets of the young generation in the common case - they are always already on the card table. Even if the refinement table needs to be merged into the card table, this is extremely fast and, in my measurements, always faster than merging remembered sets for the young gen: the work is a linear scan of some memory instead of many random writes, and embarrassingly parallel. The cards created during garbage collection do not need to be redirtied, so another garbage collection phase could be removed. Code Size During review a colleague mentioned that the change reduces total code size by around 5%. Summary This change removes the need for a large part of G1’s write barrier by using a dual card table approach to avoid fine-grained synchronization, increasing throughput significantly for some applications. Overall, I’m quite satisfied with the change - after many years of thinking about and prototyping solutions to the problem without introducing some “G1 throughput mode” that would have huge implications on maintainability (basically another garbage collector) or making G1 unnecessarily complex, this seems a very good solution, taking the advantages of these throughput barriers without too many drawbacks. A lot of people helped with this change, my thanks. Hth, Thomas

2025/2/21

JDK 23 G1/Parallel/Serial GC changes

Given that JDK 23 is in rampdown phase 2, this post is going to present the usual brief look at changes to the stop-the-world collectors in OpenJDK. Compared to the previous release JDK 23 is a fairly muted one in the GC area, but there are good things on the horizon that I will touch on at the end of this post :) The full list of changes for the entire Hotspot GC subcomponent for JDK 23 is here, showing around 230 changes in total being resolved or closed at the time of writing. Nothing particularly unusual here. Parallel GC Probably the largest change to Parallel GC in some time has been the replacement of the existing Parallel GC Full GC algorithm with a more traditional parallel Mark-Sweep-Compact algorithm. The original algorithm overwrites object headers during compaction, i.e. object movement, but still needs object sizes to (re-)calculate the final position of moved objects to update references. To enable that even in the presence of overwritten object headers, Parallel GC Full GC stored the location where live objects end in a bitmap of the same size as the one that records live object starts during its first phase. Then, when needing an object size for a given live object, the algorithm scans forward from that object’s start in the end bitmap, looking for the next set bit, and subtracts the former from the latter. This can be slow, and needs to be done every time as part of looking up the final position of moved objects. That is actually quadratic in the number of live objects, as rediscovered in JDK-8320165, and although there were several related (JDK-8145318) and less related improvements targeted at this issue, they could not completely mitigate the problem as JDK-8320165 shows. With JDK-8329203 we replaced the somewhat unique algorithm of Parallel GC with (basically) G1’s parallel Full GC, which does not suffer from these hiccups.
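To illustrate why the old scheme was slow, here is a sketch of the size lookup it had to perform; the bitmap layout is simplified (one bit per heap word, an assumption for illustration):

```cpp
#include <cstddef>
#include <vector>

// Old Parallel full-GC size lookup: given the word index of a live object's
// start, scan the "end" bitmap forward for the next set bit. This linear
// scan ran on every final-position lookup, which is what made the overall
// cost quadratic in the number of live objects.
size_t object_size_words(const std::vector<bool>& end_bits, size_t start_word) {
  size_t i = start_word;
  while (!end_bits[i]) {
    ++i;
  }
  return i - start_word + 1; // assuming the set bit marks the object's last word
}
```

With the replacement algorithm this lookup disappears entirely, since object headers are no longer overwritten during compaction.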
At the same time that second end bitmap (taking 1.5% of Java heap size) could be removed as well, while our measurements showed that overall the performance stayed the same (and improved a lot in these problematic cases). Another performance bump in Parallel Full GC could be achieved by decreasing contention on the counters recording the number of live bytes per region: instead of every thread updating global counters immediately, with JDK-8325553 Parallel GC has every thread locally record per-region liveness as long as that thread only encounters live objects within the same (small set of) regions. Serial GC Cleanup and refactoring of Serial GC code continued. G1 GC One of the more long-standing issues that got resolved in JDK 23 has been JDK-8280087, where G1 did not expand some internal buffer used during reference processing as it should. Instead it prematurely exited with an unhelpful error message. Performance improvements like JDK-8327452 and JDK-8331048 improve pause times and reduce native memory overhead of the G1 collector. All (STW) GCs Some time ago we introduced dedicated (internal) filler objects which I wrote about here. The community pointed out that jdk.vm.internal.FillerArray is not a valid array class name, and some tools somewhat rightfully choke on such “variable sized regular” objects. With JDK-8319548 the class name of filler arrays has been changed to [Ljdk/internal/vm/FillerElement; to conform to that, and backported to JDK 21 and above. What’s next Reduction of G1 remembered set storage requirements is still a very important topic to us: one old idea is to use a single remembered set for multiple regions, removing the need to store remembered set entries for references within the group of regions sharing the same remembered set.
The figures below show the principle: Currently regions (the big rounded boxes at the top, with “Y” for “Young region” and “O” for “Old region”) may have remembered sets associated with them (the vertical small boxes below the regions). The entries refer to approximate locations (areas) outside of the respective region where there may be a reference into that region. Young regions always have remembered sets associated with them, as they are required to find references into these regions that need to be fixed up when moving a live object within them, and they are evacuated/moved at every garbage collection. You might notice that remembered set entries of different regions refer to the same locations in the old generation regions. Now, as mentioned, particularly young generation regions are always evacuated at the same time, together, which means that all these remembered set entries that refer to the same locations are redundant. This is what the initial change to implement the above idea does: use a single remembered set for all young generation regions, as depicted below, automatically removing the redundant remembered set entries. This not only saves memory, but also time during garbage collection to filter out these entries. (Technically, combining remembered sets also removes the need for storing remembered set entries that refer to locations between regions in a group; however, for young generation regions no such remembered set entries will ever be generated, so they are not shown and do not add to savings). The gain from applying this technique to the young generation remembered sets only can already be substantial, as shown below in measurements of a larger application. The graph depicts remembered set memory usage before (blue line) and after (pink line) applying the change. Of particular note is the halving of memory usage for the remembered set in the first 40 seconds, where only young generation regions have remembered sets.
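The deduplication effect can be sketched with a toy data structure (hypothetical, not the actual G1 remembered set implementation, which stores coarse card-level locations in specialized containers):

```cpp
#include <cstddef>
#include <set>

// One remembered set shared by a whole group of regions (e.g. all young
// regions): a location holding references into several regions of the group
// is recorded only once, while per-region remembered sets would each record
// it separately.
struct GroupRemSet {
  std::set<size_t> locations; // approximate locations (cards) outside the group

  void record(size_t location) { locations.insert(location); }
  size_t size() const { return locations.size(); }
};
```

With per-region sets, a card referencing three different young regions costs three entries; with the shared set it costs one, which is exactly the redundancy the graph shows being removed.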
As old generation regions get remembered sets assigned to them, the improvement decreases in that prototype due to lack of support for merging old generation regions, but the impact of just the young generation region remembered set savings is still noticeable. This initial change for combining young generation region remembered sets is actually out for review, but a more generic variant to also merge old generation regions into groups using a single remembered set is in preparation. In this release cycle another fairly large focus for improvements to the G1 collector has been changes to the write barriers: we spent a lot of effort investigating how to best close the throughput gap of the G1 collector to Parallel GC in some cases (e.g. here and the differences reported here), and we believe we found very good solutions without compromising the latency aspect of the G1 collector. One step in that direction is JEP 475: Late Barrier Expansion for G1, which seems to be on track for JDK 24 inclusion. It enables some of the optimizations we are planning. There are still some details and kinks to work out, and it will take some time to do so, and even more time to have everything in place, but more to come in the next months… Thanks go to… … as usual to everyone that contributed to another great JDK release. Looking forward to seeing you next release. Thomas

2024/7/22

JDK 22 G1/Parallel/Serial GC changes

Another JDK release, another version of a post in this series: this time JDK 22 GA is almost here and so I am going to entertain you with the latest changes to the stop-the-world garbage collectors for OpenJDK for that release. ;) Overall I think this release provides fairly significant changes in the stop-the-world collector area like JEP 423: Region Pinning for G1. Next to the actual change in functionality it required some at least technically interesting changes under the hood. Also both Serial and Parallel GC young collection got performance improvements. The full list of changes for the entire Hotspot GC subcomponent for JDK 22 is here, showing around 230 changes in total being resolved or closed at the time of writing. This does not completely reflect the changes to the garbage collectors: some of the significant ones were attributed to other subsystems but had significant impact on garbage collection pause times. Following is the usual run-down of interesting changes for the Hotspot stop-the-world collectors in JDK 22: Parallel GC Finding old-to-young references is a significant task during generational garbage collection. Parallel (and Serial) GC use a card table for this purpose. First, mutators mark card table entries that potentially contain references as “dirty”. Then, during the pause the algorithm scans the card table for these dirty marks, and looks through the objects indicated by these marks given they potentially have old-to-young references. Parallel GC (as the name implies) uses multiple parallel threads when looking through the typically large card table. The work distribution mechanism for card table scanning partitions the heap into 64kB chunks. Any card-marked objects starting in a chunk are owned by the thread claiming that chunk for processing.
JDK-8310031 implements two optimizations that lead to better work distribution and better performance: First, contents of large object arrays are split across partitions, so now not only the thread that owns a large object array looks for old-to-young references in that object. This could previously lead to a single thread walking through a multi-GB object by itself. Although there is some additional work distribution mechanism based on work stealing from queues following the card scanning, this stealing is relatively expensive compared to just having multiple threads looking at parts of that object in the first place. Second, that single owner thread always looked through all elements of the array, although the dirty cards indicated the interesting locations already. So the processing thread often looked through many references which were known to not contain any old-to-young references. This changed to a thread only processing the parts of the large object array that were marked dirty. The old behavior resulted in inverse thread scaling in some cases, and very long pauses compared to for example the G1 collector. Another performance issue with Parallel GC related to large array objects in the Java heap has also been fixed. Parallel GC now uses the same exponential backskip in the block offset table for finding object starts as the other collectors, speeding up this process and overall pause time. The block offset table solves the problem of finding the object start preceding a card in the card table. One application is during the above-mentioned card scanning, where the garbage collector needs to quickly find the start of the Java object either starting at or reaching into that particular card to start decoding that object (looking for references) properly. For every card (that typically represents 512 bytes of the Java heap) there is a block offset table (BOT) entry.
That entry stores either how many words back from the heap address corresponding to this card the object reaching into this card starts, or that in the preceding card there is no object start and the algorithm needs to look at the previous BOT entry for more information. JDK-8321013 changes this backskip value from just “look at the previous card” to the number of cards to go back, used as an exponent of base two. The figure above tries to depict that: the Java Heap Objects at the top gives an example for how some Java objects could be arranged in the Java heap; the middle figure shows the BOT with its entries and what they each refer to. Some of them refer to object starts in the heap; others, that do not contain an object start, refer to the previous BOT entry (the backskip value). The bottom figure shows the backskip value with the new BOT encoding (only showing backskip references) where backskip values do not necessarily refer to the previous BOT entry, but some entry far before the current one within the object. This is fairly obvious for the large Java object on the far right, where walking from an address near the end of that object to the start could take many steps, while with the new encoding the garbage collection algorithm only needs a few. This dramatically reduces the number of memory accesses when trying to find object starts for larger objects and improves performance. Serial GC JDK-8319373 optimizes the card-scanning code (finding dirty cards) in Serial GC based on the new Parallel GC code added in JDK-8310031. This also significantly reduces young collection time if the dirty cards are rare. A lot of effort has been spent to clean up Serial GC code, removing dead code and abstractions left over from when the same code was shared with the Concurrent Mark Sweep collector removed in JDK 14/JEP 363.
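The BOT walk with the new encoding can be sketched as follows. The constant and the encoding details are made up for illustration (the real HotSpot encoding differs); the point is that each backskip entry jumps a power-of-two number of entries backwards:

```cpp
#include <cstdint>
#include <cstddef>

// Hypothetical encoding: entries below kBackskipBase are direct word
// offsets to the object start; entries >= kBackskipBase mean "go back
// 2^(entry - kBackskipBase) BOT entries".
constexpr uint8_t kBackskipBase = 65;

// Follow backskip entries until an entry directly describing the object
// start is reached. With power-of-two skips this takes O(log n) steps for
// an object spanning n cards, instead of n single-card steps before.
size_t find_start_entry(const uint8_t* bot, size_t card_index) {
  while (bot[card_index] >= kBackskipBase) {
    card_index -= size_t(1) << (bot[card_index] - kBackskipBase);
  }
  return card_index;
}
```

For a large object this turns a walk that previously touched every BOT entry between the query card and the object start into a handful of exponentially growing jumps.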
G1 GC Here are the JDK 22 user facing changes for G1: G1 now (JDK-8140326) reclaims regions that failed evacuation in the next garbage collection of any type. This improves the resilience of the G1 collector against having its old generation swamped with evacuation failed regions. The main use case is region pinning: trying to evacuate a pinned region causes an evacuation failure that moves the affected regions into the old generation. Without any measure to reclaim these typically very quickly reclaimable regions as soon as they are unpinned, they can cause a significant buildup of old generation regions, resulting in more garbage collections and an unnecessary Full GC in the worst case. Obviously this also helps with reclaiming space in evacuation failed regions caused by being unable to copy an object due to out-of-memory. On another note, this change removes the previous (self-imposed) limitation in G1 that only certain types of young collections could reclaim space in old generation regions. Now any young collection may evacuate old generation regions if they meet some requirements. The long journey to remove the use of the GCLocker in G1 is over, as JDK-8318706 has been integrated and JEP 423: Region Pinning for G1 completed. In short, previously if an application accessed an array via the Get/ReleasePrimitiveArrayCritical methods when interfacing with JNI, no garbage collection could occur. This change modifies the garbage collection algorithm to keep these objects in place, “pinning” them and marking the corresponding region as such, but allows evacuation of any other regions or non-primitive arrays within pinned regions. The latter optimization is possible because Get/ReleasePrimitiveArrayCritical can only lock primitive array objects. Now Java threads will never stall due to JNI code with G1. JDK-8314573 introduces a minor change to heap resizing during the Remark pause to make resizing a bit more consistent.
Heap resizing now calculates the heap size change based on -XX:Min/MaxHeapFreeRatio without taking Eden regions into account. As the Remark pause can happen at any time during the mutator phase, the previous behavior made the heap size changes very dependent on current Eden occupancy: depending on how far into the mutator phase the application was when the Remark pause occurred, the number of free regions used for the calculation could differ a lot, resizing the heap differently. The new behavior results in more deterministic and generally less aggressive heap sizing. This list is rounded out with an actual, direct performance improvement: a region’s code root set, i.e. the roots from compiled code, was previously handled by a single thread per region during garbage collection. In cases where the code root set is very unbalanced (lots of code having embedded references into a single region or a few regions) this work could stall garbage collection. JDK-8315503 makes G1 distribute code root scan work across multiple threads even within regions, removing this potential bottleneck. All STW GCs Loom necessitated the removal of the code cache sweeper in JDK-8290025. Its work has been, in the case of the STW collectors, moved into the appropriate pause. Unfortunately, part of the code cache sweeper’s job had a component with a runtime of O(n^2), where n is the number of unloaded methods. This was not that big of an issue as long as the code cache sweeper did its work concurrently with the application, but after the removal it caused significant pause time regressions when unloading a lot of compiled code. With JDK-8317809, JDK-8317007, JDK-8317677 and a few others, class unloading in the pause is now actually faster than it was even before the code cache sweeper had been removed, while still doing all the work. What’s next Work for JDK 23 has already started. Probably the most interesting upcoming change is the JEP Draft: Late G1 Barrier Expansion.
From a GC point of view this makes barrier generation much more accessible to developers who are not C2 experts, allowing much easier tinkering. Another interesting topic that may be tackled in the JDK 23 timeframe is further reducing class unloading time, both by improving the code and, for G1, by moving parts of it out to the concurrent phase. More to come in the next months :) Thanks go to… Everyone that contributed to another great JDK release. See you next release :) Thomas

2024/2/6

What is… a Concurrent Undo Cycle

Recently I got questions about what Concurrent Undo Cycle messages in the logs mean - in this post I would like to explain this optimization and what it means for your application. Introduction The G1 garbage collector performs whole heap liveness analysis (marking) concurrently with the application. This process starts when G1 notices that old generation Java heap occupancy reaches the Initiating Heap Occupancy Percent (IHOP). The IHOP value is dynamically calculated based on the typical duration of the concurrent marking and the allocation rate into the old generation. When the garbage collector notices that old generation heap occupancy reaches this IHOP value, the next garbage collection will be a Concurrent Start garbage collection pause, followed by concurrent marking. After completion, G1 would start a mixed phase, collecting old generation heap regions. Problem A concurrent marking cycle can take a while and consumes CPU resources. For example, the log snippet below shows that G1 required around 4,2 seconds to finish the concurrent marking cycle. This process includes traversing the old generation object graph and some setup for the following old generation evacuation (described in detail in this post).
[15,431s][info ][gc ] GC(12) Pause Young (Concurrent Start) (G1 Evacuation Pause) 11263M->10895M(20480M) 153,495ms
[15,431s][info ][gc ] GC(13) Concurrent Mark Cycle
[15,431s][info ][gc,marking] GC(13) Concurrent Scan Root Regions
[15,431s][debug ][gc,ergo ] GC(13) Running G1 Root Region Scan using 6 workers for 41 work units.
[15,507s][info ][gc,marking] GC(13) Concurrent Scan Root Regions 75,589ms
[15,507s][info ][gc,marking] GC(13) Concurrent Mark
[15,507s][info ][gc,marking] GC(13) Concurrent Mark From Roots
[18,731s][info ][gc,marking] GC(13) Concurrent Mark From Roots 3224,174ms
[18,731s][info ][gc,marking] GC(13) Concurrent Preclean
[18,731s][info ][gc,marking] GC(13) Concurrent Preclean 0,038ms
[18,734s][info ][gc ] GC(13) Pause Remark 11287M->11287M(20480M) 2,292ms
[18,734s][info ][gc,marking] GC(13) Concurrent Mark 3226,866ms
[18,734s][info ][gc,marking] GC(13) Concurrent Rebuild Remembered Sets and Scrub Regions
[19,687s][info ][gc,marking] GC(13) Concurrent Rebuild Remembered Sets and Scrub Regions 953,419ms
[19,688s][info ][gc ] GC(13) Pause Cleanup 11447M->11447M(20480M) 0,523ms
[19,688s][info ][gc,marking] GC(13) Concurrent Clear Claimed Marks
[19,688s][info ][gc,marking] GC(13) Concurrent Clear Claimed Marks 0,044ms
[19,688s][info ][gc,marking] GC(13) Concurrent Cleanup for Next Mark
[19,695s][info ][gc,marking] GC(13) Concurrent Cleanup for Next Mark 6,145ms
[19,695s][info ][gc ] GC(13) Concurrent Mark Cycle 4263,187ms
So what if, after the Concurrent Start pause, the conditions for starting the concurrent mark cycle do not apply any more, i.e. the old generation heap occupancy went below the start threshold? G1 does not start the whole concurrent marking cycle, because memory occupancy indicates that it is not necessary. The problem is that the Concurrent Start garbage collection pause already made some global changes; undoing them is the purpose of the Concurrent Undo Cycle. Background The Concurrent Start garbage collection pause is just like a regular young garbage collection pause. In a regular young collection pause, G1 selects the regions to collect (the whole young generation and parts of the collection set candidates), and scans root locations for references into these regions. The referenced objects and their followers are evacuated.
At the same time, G1 also performs a limited and conservative liveness analysis for humongous objects to try to (eagerly) reclaim dead humongous objects. After some cleanup the application continues. The differences of the Concurrent Start pause from a regular young collection pause are minor: While scanning the VM root locations, G1 marks referenced objects in the old generation on the mark bitmap. For example, G1 marks an object in the old generation that a variable on a thread stack points to. G1 determines root regions. These are memory areas (typically whole regions) that potentially contain references into the old generation (which will not be marked through). Root regions are similar to the above root locations, but their referents are not marked in the bitmap during the garbage collection pause to save time. The following Concurrent Root Region Scan part of the marking cycle scans the live objects in the root regions for such references into the old generation and marks their referents in the bitmap. Examples of such regions are Survivor regions, but also collection set candidate regions that are going to be collected in the next garbage collection (potentially during marking). Examples for the latter are regions where evacuation failed. Both evacuation of collection set candidate regions and the eager reclaim above mean that in some cases, the old generation occupancy can be significantly smaller after the garbage collection pause. The Change JDK-8240556 implements short-cutting the concurrent marking cycle. If old generation heap occupancy falls below the IHOP value during the Concurrent Start pause due to large object eager reclamation, instead of executing the whole marking cycle as shown in the state diagram below, G1 will take the shortcut indicated as a blue arrow to the right. The log messages above show that the skipped states can result in a substantial saving of CPU usage (and time).
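The decision after the Concurrent Start pause boils down to a simple occupancy check; a sketch with hypothetical names (the real heuristic lives inside G1's policy code):

```cpp
#include <cstddef>

enum class NextConcurrentCycle { kFullMarkCycle, kUndoCycle };

// After the Concurrent Start pause: if eager reclamation (and candidate
// region evacuation) dropped old generation occupancy back below the IHOP
// threshold, the expensive marking phases are skipped and only an undo
// cycle runs.
NextConcurrentCycle select_cycle(size_t old_gen_occupancy, size_t ihop_threshold) {
  return old_gen_occupancy < ihop_threshold ? NextConcurrentCycle::kUndoCycle
                                            : NextConcurrentCycle::kFullMarkCycle;
}
```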
The remaining states comprise a Concurrent Undo Cycle as shown below: These are reflected in the corresponding log output. The following example shows such a snippet for a Concurrent Undo Cycle:
[6278.870s][info][gc ] GC(54) Concurrent Undo Cycle
[6278.870s][info][gc,marking] GC(54) Concurrent Clear Claimed Marks
[6278.886s][info][gc,marking] GC(54) Concurrent Clear Claimed Marks 0.016ms
[6278.886s][info][gc,marking] GC(54) Concurrent Cleanup for Next Mark
[6278.948s][info][gc,marking] GC(54) Concurrent Cleanup for Next Mark 77.814ms
[6278.948s][info][gc ] GC(54) Concurrent Undo Cycle 78.038ms
G1 needs to execute only the Concurrent Clear Claimed Marks and Concurrent Cleanup for Next Mark phases: the first one resets internal information about in-use class loaders, and the second resets the mark bitmap that has been used for marking root references from the VM internal data structures. Everything else is skipped. Note that the Concurrent Cleanup for Next Mark phase can still take a substantial amount of time, but compared to a regular concurrent marking cycle it should finish much more quickly. Some alternatives that could be even faster are recorded in JDK-8316355, so if that scratches your itch, get in contact. :) Summary In G1 a Concurrent Undo Cycle is the result of an optimization to avoid the most CPU-intensive parts of the concurrent marking cycle, and nothing to be worried about. If they occur, G1 simply found that it is unnecessary to do a complete whole-heap concurrent marking. Concurrent Undo Cycles were introduced with JDK 16, so if you are running a recent version of the JDK, you might have already come across one or the other. See you next time, Thomas

2023/9/28

JDK 21 G1/Parallel/Serial GC changes

JDK 21 GA is on track - as well as this update for changes of the stop-the-world garbage collectors for OpenJDK for that release. ;) This release is going to be designated as an LTS release by most if not all OpenJDK distributions. Please also have a look at previous blog posts for JDK 18, JDK 19, and JDK 20 to get more details about what the upgrade from the last LTS, JDK 17, offers you in the garbage collection area. Back to this release: before looking at the stop-the-world garbage collectors, probably the most impactful change to garbage collection in this release is the introduction of Generational ZGC. It is an improvement over the Z Garbage Collector that maintains generations for young and old objects like the stop-the-world collectors, focusing garbage collection work on where most garbage is typically created. It significantly decreases the Java heap size and CPU overhead required to keep up with applications. To enable Generational ZGC, pass the flag -XX:+ZGenerational along with the -XX:+UseZGC flag. Other than that, the full list of changes for the entire Hotspot GC subcomponent for JDK 21 is here, showing around 250 changes in total being resolved or closed at the time of writing. This is a bit more than in the previous releases. Here is a run-down of the Hotspot stop-the-world collectors and the most important changes to them in JDK 21: Parallel GC There were no notable changes for Parallel GC. Serial GC There were no notable changes for Serial GC. G1 GC Here are the JDK 21 user facing changes for G1: During Full GC, G1 will now move humongous objects to decrease the possibility of OOME due to region level fragmentation. While JDK 20 and below strictly never moved humongous (large) objects, this restriction has been lifted for last-ditch full collections. A last-ditch full collection occurs after a regular full collection did not yield enough (contiguous) space to satisfy the current allocation (JDK-8191565).
This improved last-ditch full collection can avoid some previously occurring out-of-memory situations. Another somewhat related improvement for full collections (JDK-8302215) makes sure that the Java heap is more compact after a full collection, reducing fragmentation a little.

The so-called Hot Card Cache has been found to have no impact (at least) after the concurrent refinement changes in JDK 20, and has been removed. The intent of this data structure was to collect locations where the application very frequently modifies references. These locations were put into the Hot Card Cache to be processed once in the next garbage collection, instead of being constantly refined concurrently. The reason this removal is interesting is that its data structures used a fairly large amount of native memory, namely 0.2% of the Java heap size, to operate. This native memory is now available for other purposes. The corresponding option -XX:+/-UseHotCardCache to toggle its use is now obsolete.

On applications with thousands of threads, in previous releases a significant amount of pause time could be spent tearing down and setting up the (per-thread) TLABs. These two phases have been parallelized with JDK-8302122 and JDK-8301116 respectively.

The preventive garbage collection feature has been removed completely with JDK-8297639 for the reasons stated previously.

What’s next
Work for JDK 22 has started, continuing right away with features that did not make the cut. Here is a short list of interesting changes that are already integrated or currently in development. As usual, there are no guarantees that they will make the next release ;). The change to improve the resilience of G1 against regions that failed evacuation swamping the old generation has been out for review for some time now (JDK-8140326, PR) and is about to be pushed.
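For context on why TLAB setup and teardown show up in pauses at all: each thread allocates from its own buffer with a simple bump-the-pointer scheme, so only handing out and retiring buffers involves coordination with the VM. A simplified, illustrative model (not Hotspot code):

```java
class TlabSketch {
    long top;       // next free byte in the buffer
    final long end; // one past the last usable byte

    TlabSketch(long start, long size) {
        this.top = start;
        this.end = start + size;
    }

    // Bump-the-pointer allocation: no locking needed because the buffer
    // belongs to exactly one thread. Returns the object address, or -1
    // if the TLAB is exhausted and must be retired.
    long allocate(long bytes) {
        if (top + bytes > end) {
            return -1; // caller retires this TLAB and requests a new one
        }
        long obj = top;
        top += bytes;
        return obj;
    }

    public static void main(String[] args) {
        TlabSketch tlab = new TlabSketch(0, 64);
        System.out.println(tlab.allocate(48)); // 0
        System.out.println(tlab.allocate(16)); // 48
        System.out.println(tlab.allocate(8));  // -1: buffer exhausted
    }
}
```

Retiring a TLAB means formatting its unused remainder as a filler object so the heap stays walkable; with thousands of threads, doing that serially for every buffer adds up, which is what the parallelization above addresses.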
This change removes the previous limitation in G1 that only certain types of young collections could reclaim space in old generation regions. With this change, any young collection may evacuate old generation regions if they meet the usual requirements. Currently this helps with quickly reclaiming space after an evacuation failure, but, among other interesting ideas, this change also provides the necessary infrastructure for efficient and timely evacuation of pinned regions right after they become un-pinned. With the above change, region pinning is only one PR away.

The fix to the long-standing problem of the GCLocker starving threads of progress and causing unnecessary out-of-memory errors is in development (JDK-8308507) and actually out for review.

More to come in the next months :)

Thanks go to…
Everyone that contributed to another great JDK release. See you in the next release :) Thomas

2023/8/4

JDK 20 G1/Parallel/Serial GC changes

Yet another JDK release that is on track - JDK 20 GA is almost here. Another opportunity to summarize changes and improvements in Hotspot's stop-the-world garbage collectors for the JDK 20 release. This release does not contain any JEP for the garbage collection area, but the JEP for Generational ZGC reached Candidate status just recently, so maybe it will be ready for JDK 21. :) Other than that, the full list of changes for the entire Hotspot GC subcomponent for JDK 20 is here, showing around 220 changes in total being resolved or closed.

Parallel GC
The only notable improvement to Parallel GC is the parallelization of the handling of objects that cross compaction regions during Full GC (JDK-8292296). Instead of having a single-threaded phase at the end that iterates over and fixes up these objects, worker threads collect objects crossing their local compaction regions and handle them on their own. M. Gasson, the contributor, reports a 20% reduction of full GC pause times in select cases.

Serial GC
There were no notable changes for Serial GC. There have been quite a few changes cleaning up the Serial GC code.

G1 GC
Broadly speaking, JDK 20 provides all items on that long “What’s next” list in the previous report - and more :)

JDK-8210708 reduces G1 native memory footprint by ~1.5% of the Java heap size by removing one of the mark bitmaps spanning the entire Java heap. The Concurrent Marking in G1 blog post includes a thorough discussion of the change. That post also obsoletes the information about concurrent marking in the original G1 paper; there is not much left in the paper that is accurate about current G1. Actually, in some way that blog post is out-of-date already too: JDK-8295118 moves a sometimes lengthy part of the preparation for concurrent marking, called Clear Claimed Marks, out of the concurrent start garbage collection pause. A new phase called Concurrent Clear Claimed Marks will show up in the logs with gc+marking=debug logging.
Preparing for future region pinning support in G1, JDK-8256265 decreases the parallelization granularity when handling regions that were pinned by the user (or could not be evacuated because there is no space left in the Java heap). Instead of handing out entire regions on a per-thread basis, the task granularity is now parts of regions. This makes threads share work better, significantly reducing unproductive waiting for completion.

JDK-8137022 makes refinement thread control more adaptive. Previously, during a garbage collection pause G1 calculated discrete thresholds at which a particular refinement thread got activated (and deactivated) at mutator time to help with refinement. This calculation was based on the time the user allowed for refining cards in the garbage collection pause (the option -XX:G1RSetUpdatingPauseTimePercent), the most recent interval to the next pause, and lots of magic. Because it did not observe the application directly, for example by accounting for the recent incoming rate of cards, the refinement thread processing rate, and the expected time left until the next garbage collection, refinement thread control was strongly biased to wake up and do more work than necessary, in a bursty fashion, to avoid leaving too much work for the collection pause. This behavior not only wastes CPU cycles in the refinement threads, but has another drawback: few cards were left unrefined on the Java heap. That sounds like a good thing, but below a certain level it can be somewhat counterproductive. The write barrier needs to perform more work whenever there is a new, unvisited card, as described here, than if the card remains in the queue for later processing. Not only does the additional activity of refinement threads take away CPU resources, but G1 also repeatedly refines the same cards without reducing the work left for the pause much.
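Refinement works on cards: the card table divides the heap into 512-byte chunks, and the write barrier dirties the card covering a modified reference field. A small sketch of the address-to-card mapping (the shift of 9 corresponds to Hotspot's 512-byte cards; the class and method names are illustrative):

```java
class CardTableSketch {
    static final int CARD_SHIFT = 9; // 2^9 = 512-byte cards, as in Hotspot

    // Index of the card covering a given heap offset.
    static long cardIndex(long heapOffset) {
        return heapOffset >>> CARD_SHIFT;
    }

    // Two writes into the same card dirty it only once, so they produce
    // only one refinement task - the reason leaving cards in the queue
    // for longer reduces the rate at which new dirty cards appear.
    static boolean sameCard(long offsetA, long offsetB) {
        return cardIndex(offsetA) == cardIndex(offsetB);
    }

    public static void main(String[] args) {
        System.out.println(cardIndex(0));       // 0
        System.out.println(cardIndex(512));     // 1
        System.out.println(sameCard(100, 500)); // true: both in card 0
    }
}
```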
Taking mutator activity into account with this change, G1 better distributes refinement thread activity between pauses and leaves more cards on the refinement task queues for longer, which reduces the rate of generation of new cards and achieves the user's intent (i.e. -XX:G1RSetUpdatingPauseTimePercent) more deterministically. In the end this often takes fewer CPU cycles away from the application, providing better throughput. Several G1 garbage collector options related to the old refinement control were obsoleted, causing the VM to exit at startup when they are used. The release note details them.

G1 uses promotion local allocation buffers (PLABs) to reduce synchronization overhead during the garbage collection pause. PLABs are resized based on recent allocation patterns to reduce unused space in these buffers at the end of the garbage collection pause: if there is little need for allocation they shrink, otherwise they grow. This per-collection-pause adaptation works well for applications that have fairly constant allocation between garbage collections; however, it does not work well if allocation is extremely bursty. The PLABs will be too large for a few garbage collections after a period of little allocation during garbage collection, wasting space, or too small when allocation spikes, wasting CPU cycles reloading PLABs. On some platforms we noticed very long pause spikes in the range of seconds due to that. The change in JDK-8288966 adds some reasonably aggressive resizing of PLABs during garbage collection to counter these situations.

A significant amount of effort has been spent in JDK 20 on improving the predictions that are used to size the young generation, which is ultimately responsible for how long garbage collection takes (this bug tracker query gives an overview).
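The adaptive PLAB sizing described above can be pictured as steering the buffer size with a weighted average of recent per-collection allocation, so that spikes are tracked quickly. The weighting below is purely illustrative and not G1's actual heuristic:

```java
class PlabResizeSketch {
    // Exponentially decaying average of per-GC PLAB allocation: recent
    // collections count more, so bursty behavior moves the estimate
    // faster than a plain average would. The weight is an assumption.
    static double nextPlabSize(double currentEstimate, double observedAllocation) {
        double weight = 0.7; // assumption: favor recent observations
        return weight * observedAllocation + (1 - weight) * currentEstimate;
    }

    public static void main(String[] args) {
        double estimate = 1024.0;
        // an allocation spike: the estimate moves most of the way up
        estimate = nextPlabSize(estimate, 8192.0);
        System.out.println(estimate);
    }
}
```

The trade-off is visible in the formula: a large weight adapts quickly to bursts (fewer wasted PLAB refills) but also forgets quickly, which is why resizing during the collection, as JDK-8288966 does, helps when a single pause behaves very differently from its predecessors.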
Better prediction makes G1 observe the pause time goal (as specified by -XX:MaxGCPauseMillis) better, which reduces pause time overshoots and increases usage of the available pause time goal by using more young generation regions per garbage collection. This increases pause times within the allowed goal, but can decrease the number of garbage collections significantly. We have measured applications that spend 10-15% less time in garbage collection pauses with these changes.

JDK 20 disables preventive garbage collections by default with JDK-8293861. Preventive garbage collections were introduced in JDK 19 to avoid G1 not having enough Java heap memory for evacuating objects (also called “evacuation failure”) during garbage collection. Handling of regions that experienced an evacuation failure has traditionally been fairly slow, so the argument for that feature was that it is better to preemptively do a garbage collection that does not have such an evacuation failure, in the hope that it frees up enough memory to avoid these evacuation failures completely. The problem is how to anticipate this situation correctly. The predictions G1 used to determine whether to start a preventive collection proved to be suboptimal, and often started preventive collections unnecessarily and too early. This wastes time. There were also many cases where the preventive collection would not be triggered, and the application experienced evacuation failures anyway. Finally, these kinds of garbage collections made garbage collections more irregular, which generally made predictions harder.

There is a new GarbageCollectorMXBean called G1 Concurrent GC, introduced with JDK-8297247, that reports the occurrence and durations of G1 Remark and Cleanup pauses. These pauses now also update the G1 Old Gen MemoryManagerMXBean memory pool information.

In summary, I believe there are significant additions to garbage collection worth upgrading for, even if only later with JDK 21.
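The new bean is reachable through the standard java.lang.management API, for example like this (the collector names depend on the JDK version and the selected collector; on a JDK 20+ JVM running G1 the list should include G1 Concurrent GC):

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

class GcBeans {
    // Names of all garbage collector MXBeans the running JVM exposes.
    static List<String> collectorNames() {
        List<String> names = new ArrayList<>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            names.add(gc.getName());
        }
        return names;
    }

    public static void main(String[] args) {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.println(gc.getName() + ": " + gc.getCollectionCount()
                    + " collections, " + gc.getCollectionTime() + " ms total");
        }
    }
}
```

Monitoring tools that sum pause time over all beans automatically pick up the Remark and Cleanup pauses once this bean is present.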
What’s next
Work for JDK 21 is already full steam ahead. Here is a short list of interesting changes that are already integrated or currently in development. As usual, there are no guarantees that they will make the next release - although it is highly likely that the ones already integrated will stay ;).

After improving refinement thread control, JDK-8225409 removes the Hot Card Cache. This data structure has in some way been a workaround for the problem described above that refinement was too aggressive, and that it is advantageous to keep cards unrefined for longer. In this mechanism, G1 kept a counter for every card tracking how often it had been refined in this mutator cycle; if that count exceeded a threshold, the card was not refined but kept in a small fixed-size ring buffer until it was either evicted from there because of overflow or a garbage collection occurred. After the JDK 20 refinement thread control changes above, we could not measure any impact of the Hot Card Cache any more. The current refinement control seems more than lazy enough with refinement to subsume the Hot Card Cache's functionality. This frees up 0.2% of the Java heap size of native memory for other use.

There is ongoing work on improving G1 behavior in the presence of high region-level fragmentation. Currently, if G1 can not find a contiguous range of memory for a humongous object allocation even after a Full GC, the VM exits with an OutOfMemoryError although, in aggregate, there would be enough memory available. This PR changes the behavior of the “last-ditch” G1 Full collection to also move humongous objects to create more contiguous memory. This should in many cases avoid the OutOfMemoryError at the cost of a longer full collection. Since “last-ditch” G1 Full collections occur only right after there has been a regular G1 Full collection, the lengthening of that collection seems an acceptable trade-off to avoid VM failure.
To improve the resilience of G1 against regions that failed evacuation (or are pinned, in the future) swamping the old generation, there is work progressing on allowing G1 to evacuate these Old regions at any garbage collection, as soon as possible after they are generated. The current policy for regions that failed evacuation is to make them Old regions, which means G1 can not allocate into them any more, although most often they only contain a few live objects. These regions also do not have remembered sets, so the only way to reclaim space from them is to wait for the next concurrent marking to complete. If many regions fail evacuation, potentially across multiple garbage collections, the Java heap quickly fills up with these typically very lightly occupied regions. This can easily lead to a Full GC. This change will completely remove the previous assumption that young-only collections never collect old generation regions. Although technically, G1 has already been trying to collect some humongous objects, which are old generation regions, since JDK 8u65, so that assumption has not strictly held for a long time…

Some unsuccessful attempts to improve the predictions for preventive garbage collections led to the decision to remove this functionality completely with JDK-8297639, given that handling of regions that failed evacuation has little overhead, and will have even less with the work suggested above. The additional garbage collections cause many problems without providing benefits.

More to come in the next months :)

Thanks go to…
Everyone that contributed to another great JDK release. See you in the next (LTS) release :) Thomas

2023/3/14

The Case of Ljdk.vm.internal.FillerArray;

While writing the JDK 19 update blog post I strongly considered writing about the “new” objects (jdk.vm.internal.FillerObject and jdk.vm.internal.FillerArray) that may appear in heap dumps. I thought they were not that interesting, but it looks like people quickly noticed them. So this post describes their purpose after all.

Update 2024/07/22: The name of the latter class is [Ljdk/internal/vm/FillerElement; beginning with JDK 21.

Background
In Hotspot there is a fairly interesting property of Java heaps we call parsability. It means that one can parse (linearly walk) the parts of the heap (whether these are called regions, spaces, or generations depending on the collector) from their respective bottom to their top addresses. The garbage collector guarantees that after an object in the Java heap, another valid Java object of some kind always follows. Every object header contains information about the type of the object (i.e. its class), which allows inference of the size of that object.

This property is exploited in various ways. This is a non-exhaustive list:

- heap statistics (jmap -histo): just start at the bottom of all parts of the heap, walk to the top, and collect statistics.
- heap dumps
- some collectors require heap parsability for at least parts of the heap when they partially collect the heap in young collections: they use an inexact remembered set to record approximate locations where there are interesting references. These recorded areas (remembered set entries) need to be walked quickly during garbage collection. This post explains how this works in a bit more detail in the introduction.

So how could the Java heap become unparsable? After all, objects are allocated in a linear fashion already. There are two (maybe more) causes. The first is LAB allocation: threads do not synchronize with each other for every memory allocation, but carve out larger buffers that they use for local allocation without synchronizing with others.
These LABs are fixed size - if a thread can't fit an allocation into its current LAB any more, the remainder area needs to be formatted properly before the thread gets a new LAB. The second is class unloading: class unloading makes objects whose classes were unloaded unparsable, as their class information is discarded. Some collectors avoid this issue by compacting the heap after class unloading to remove all these invalid objects from the parsable Java heap area.

Before JDK 19 the Hotspot VM used instances of java.lang.Object and integer arrays ([I) to reformat (“fill”) these holes in the heap. The former are used only for the tiniest holes, while everything else uses integer arrays. I.e. in a complete heap histogram or heap dump that included non-live data, you may have noticed an abundance of java.lang.Object and [I instances that were not referenced from anywhere in your program. Here is an example jmap -histo <pid> run of some program with an older JVM:

num     #instances     #bytes  class name (module)
-------------------------------------------------------
   1:        16015  350913960  [B (java.base@11.0.16)
   2:       467918   33690096  java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask (java.base@11.0.16)
   3:         4559   18664488  [I (java.base@11.0.16)
   4:            1    2396432  [Ljava.util.concurrent.RunnableScheduledFuture; (java.base@11.0.16)
   5:        10715     342880  java.util.concurrent.SynchronousQueue$TransferStack$SNode (java.base@11.0.16)
   6:        10528     336896  java.util.concurrent.locks.AbstractQueuedSynchronizer$Node (java.base@11.0.16)
   7:        10149     243576  java.lang.String (java.base@11.0.16)
[...]

A large part of the [I instances are likely filler objects.

The Change
JDK-8284435 added special, explicit filler objects with the names jdk.vm.internal.FillerObject and Ljdk.vm.internal.FillerArray; in Java heap histograms. (Actually, the original change had a typo in the filler array type name; it was erroneously called Ljava.vm.internal.FillerArray;, fixed in JDK-8294000.)
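The filler mechanism boils down to covering each hole exactly: the smallest holes take a plain filler object, anything larger a filler array whose length is sized to the remainder. A sketch under assumed 64-bit sizes (the 16-byte header constants are assumptions; the real values depend on compressed-pointer settings):

```java
class FillerSketch {
    static final long MIN_OBJECT_BYTES = 16;   // assumed minimal instance size
    static final long ARRAY_HEADER_BYTES = 16; // assumed int-array header size

    // Describe how a hole of the given size would be filled: minimal-size
    // holes take a plain filler object, everything larger becomes a
    // filler array whose element count covers the rest of the hole.
    static String fillHole(long holeBytes) {
        if (holeBytes == MIN_OBJECT_BYTES) {
            return "filler object";
        }
        long elements = (holeBytes - ARRAY_HEADER_BYTES) / 4; // 4-byte elements
        return "filler array with " + elements + " elements";
    }

    public static void main(String[] args) {
        System.out.println(fillHole(16));  // filler object
        System.out.println(fillHole(256)); // filler array with 60 elements
    }
}
```

This also explains the observation below that plain filler objects are rare: a hole of exactly minimal size is the only case a filler array cannot cover.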
Update 2024/07/22: There has been another renaming of the “array” filler to [Ljdk/internal/vm/FillerElement; to conform to Java's array class name notation, as of JDK-8319548, backported to JDK 21.

These classes are created up front at VM startup, similar to most VM-internal classes. They serve the same purpose as java.lang.Object and [I instances for filling holes, but have the added advantage to us Hotspot VM developers that if we see them referenced in crash logs, or see references into these kinds of objects, there is a high likelihood of a bug related to dangling references to garbage objects. It's not fool-proof, but it makes crash investigation a bit easier, as we can now more easily distinguish between dangling references and legitimate references to instances of the previous kinds of filler objects. Heap verification can now also better distinguish between filler and live objects, independent of the garbage collector. Here is another run of jmap -histo of the same program as above with a current VM:

num     #instances     #bytes  class name (module)
-------------------------------------------------------
   1:        10740  152377136  [B (java.base@20-internal)
   2:       250078   18005616  java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask (java.base@20-internal)
   3:         1699   14286176  Ljdk.internal.vm.FillerArray; (java.base@20-internal)
   4:            1    1065104  [Ljava.util.concurrent.RunnableScheduledFuture; (java.base@20-internal)
[...]
  23:          361      11552  java.lang.module.ModuleDescriptor$Exports (java.base@20-internal)
  24:          127      11264  [I (java.base@20-internal)
  25:          274      10960  java.lang.invoke.MethodType (java.base@20-internal)
[...]
  76:           11       1056  java.lang.reflect.Method (java.base@20-internal)
  77:           66       1056  jdk.internal.vm.FillerObject (java.base@20-internal)
  78:            1       1048  [Ljava.lang.Integer; (java.base@20-internal)
[...]
Notice that in this histogram there are quite a few Ljdk.internal.vm.FillerArray; instances and far fewer [I instances (these histograms were taken at random points, so they are not directly comparable). jdk.internal.vm.FillerObject instances are rather rare because holes of minimal size are rare. (On 64-bit machines they only show up when not using compressed class pointers via -XX:-UseCompressedClassPointers, as otherwise filler arrays also fit the minimum instance size.) Obviously the names and everything related to them are Hotspot VM internal details and can change without notice.

Impact Discussion
For the end user there should be no difference apart from instances of these objects showing up in complete heap dumps. That's all for today, mystery solved :), Thomas

2022/9/26

JDK 19 G1/Parallel/Serial GC changes

JDK 19 GA is almost here. Let me summarize the changes and in particular the improvements in Hotspot's stop-the-world garbage collectors - G1 and Parallel GC - in that release, for yet another time. :)

Overall there has been no JEP specifically for the garbage collection area in this release. There were two other large changes that required assistance in the garbage collection area: first, JEP 425: Virtual Threads (Preview), and second, JDK 8 Maintenance Release 4. Virtual threads introduce on-Java-heap objects that contain thread stacks. These new kinds of objects need to be handled properly by all garbage collection algorithms. However, the main complication with these objects is that unloading stale compiled code referenced by them requires a fairly complex dance between the garbage collectors and the code cache sweeper (“GC for compiled code”), at least in JDK 19. Work on improvements has already started, for example by having the garbage collector take complete control over method unloading, or by addressing too aggressive sweeping with Loom. When Loom is not enabled via the preview option, these problems do not occur.

JDK 8 got a maintenance release. The changes detailed here remove undefined behavior in java.lang.ref.Reference processing that could previously cause crashes. A side effect of this change is that reference processing in the garbage collectors could be improved a bit.

Other than that, the full list of changes for the entire Hotspot GC subcomponent for JDK 19 is here, showing around 200 changes in total being resolved or closed. The above two projects left their impact :)

A brief look over at ZGC shows that the changes in this release were minor: JDK-8255495 added partial CDS support (class data is archived, but not heap objects), in addition to some bug fixes. For some time now the main focus has been on Generational ZGC - you can follow development here; please look forward to it.

Parallel GC
Parallel GC now also supports archived heap objects.
This feature has been added with JDK-8274788. JDK-8280705 improves work distribution in the first phase of full collection, sometimes yielding very good performance improvements.

Serial GC
There were no notable changes for Serial GC.

G1 GC
Except for JDK-8280396, which implements the same optimization mentioned above for Parallel GC for the G1 Full GC, there were no notable improvements. JDK 19 introduced a native memory usage regression very late in the JDK 19 release process, so the fix could not be integrated any more. The release note provides a workaround. The regression has already been addressed for JDK 19.0.1 and later.

What’s next
Of course we have long since begun working on JDK 20. Here is a short list of interesting changes that are currently in development and that you may want to look out for. Many of them are actually already integrated - Virtual Threads and the JDK 8 MR4 took away quite a few resources, which made changes slip past Feature Complete for JDK 19. Apart from those, there are no guarantees that they will make JDK 20. They are going to be integrated when they are done ;-)

G1 native memory footprint has been reduced significantly with the integration of JDK-8210708 into JDK 20, removing one of the mark bitmaps spanning the entire Java heap. The Concurrent Marking in G1 blog post includes a thorough discussion of the change and its impact. The change finally realizes the native memory footprint originally forecast for JDK 19 in the “Prototype (calculated)” line here.

Another important change that improves parallelism (and performance) during evacuation failure has already been integrated into JDK 20, as a further step towards future support for region pinning as described in JEP 423: Region Pinning in G1. Concurrent refinement control is currently being revised as part of JDK-8137022. Generally, the rate of processing cards to refine concurrently is now smoother.
As a side effect, G1 will then leave more dirty cards in the queue for longer, reducing the rate of generation of new dirty cards, and ultimately reducing the amount of concurrent refinement work overall. This can result in increased throughput in applications.

Applications that promote objects in bursts could exhibit very long pause time spikes in the range of a few seconds instead of 1-200 ms. Particularly large Aarch64 machines were susceptible to this. The cause is untrained (or mis-trained) sizes for promotion allocation buffers (PLABs). JDK-8288966 adds some reasonably aggressive resizing of PLABs during garbage collection to solve the problem.

There is ongoing investigation into improving G1's predictions to make it observe the pause time goal (-XX:MaxGCPauseMillis) better. The goal is to have fewer overshoots (fewer latency spikes) and fewer undershoots, so that fewer garbage collections are required while keeping the pause time goal. In some applications this can significantly lessen the total time spent in garbage collections, leaving more time for the application. Parts that work include e.g. JDK-8231731.

Just recently there has been renewed interest in implementing a few long-standing improvements to make Hotspot garbage collectors more runtime-configurable (e.g. for G1: JDK-8236073, JDK-8204088) and nicer to other tenants (e.g. JDK-8238687) so that they can be more container-friendly.

More to come :)

Thanks go to…
Everyone that contributed to another great JDK release. See you in the next release :) Thomas
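The prediction work mentioned above is about conservative estimates: G1 predicts pause components from recent history as an average plus a safety margin proportional to the observed variation, so the pause time goal is rarely exceeded. A simplified sketch of that idea (the method names and the confidence factor are illustrative, not G1's exact model):

```java
class PausePredictionSketch {
    // Conservative prediction: recent average plus some multiple of the
    // standard deviation. A larger confidence factor means fewer
    // overshoots but also more undershooting of the pause time goal.
    static double predictMs(double avgMs, double stddevMs, double confidence) {
        return avgMs + confidence * stddevMs;
    }

    // Would a collection with this predicted duration fit the goal
    // set via -XX:MaxGCPauseMillis?
    static boolean fitsPauseGoal(double predictedMs, double maxGcPauseMillis) {
        return predictedMs <= maxGcPauseMillis;
    }

    public static void main(String[] args) {
        double predicted = predictMs(120.0, 30.0, 1.0);
        System.out.println(predicted);                       // 150.0
        System.out.println(fitsPauseGoal(predicted, 200.0)); // true
    }
}
```

G1 uses such per-phase predictions to decide how many young (and candidate old) regions it can afford in the next collection, which is why better predictions directly translate into fewer overshoots and better use of the allowed pause time.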

2022/9/16